Transcriptions

Note: this content has been automatically generated.
00:00:01
it's good to be here um and i'm sure that many of you have seen some of these slides before
00:00:07
but that's not an apology 'cause you won't have seen them in the same order or the same combination and i know that nobody has seen them all
00:00:13
if you think of slides like spices and you think of a talk as quite a meal then i hope you can enjoy my impromptu cooking
00:00:22
and if we're going to talk about
00:00:25
social signals then in fact this is the seventy fifth anniversary
00:00:31
of the quantitative analysis of the interaction of individuals this is the copy of
00:00:36
the front page of the general downfall anthropology just thirteen nineteen thirty nine
00:00:43
chapple was um he was in the the bits of analysis and critique and
00:00:51
they're active listening and i think i started with his work and those co workers
00:00:56
took a manual typewriter which is the thing you used to use for typing letters and put an electric motor
00:01:02
on a spool of paper so that when something happened an observer could make a mark on the paper
00:01:09
and he actually used a small shaft so the
00:01:13
operator could measure changes in subject activity over time
00:01:19
uh he was able to observe the discourse actions of two individuals and to time the durations of their actions
00:01:26
to understand the way in which the sequence of durations is organised nice clean
00:01:31
is too small it's automatically going into some very grand way presentation that
00:01:35
shows you what's coming next but not what's yeah now um this
00:01:39
anyway this is the first recorded sequence analysis of human behaviour
00:01:45
it was by an anthropologist adam kendon is also an ethologist ethologists are people who typically
00:01:51
study animal behaviour and we are animals so they study behaviour from that point of view
00:01:56
i think kendon has written that the first task of a human ethologist just like
00:02:00
the ethologist who sets out to study a bird or a fish or a monkey must be systematic description
00:02:07
he must attempt to see what behaviour structures the human being has
00:02:12
in doing this with people he says it would seem best to begin with
00:02:15
those aspects of behaviour which are most likely to be shared with other animals
00:02:21
i i he kind of dismisses language as well detailed analysis of language must eventually find a place in human
00:02:27
ethology but these do not seem to be the best aspects of human behaviour to start with
00:02:33
electronically speaking you know we have very advanced
00:02:36
speech recognition sound recognition we can process text but
00:02:41
most of the time the devices that process the text don't know what to do with it
00:02:46
the text is almost irrelevant to the interaction so what i want to focus on is not what was said but what was done
00:02:54
okay
00:02:56
ah
00:02:57
conversation analysts discourse analysts they have also performed similar
00:03:02
studies but many people claim that the work comp generous
00:03:06
and my aim is to produce the technology a machine or device or a a module which will observe
00:03:15
the systematic socially organised procedures underlying the ways in which social actors move into a
00:03:21
mutually ratified participation in an encounter those are the technical terms kendon calls that frame attunement
00:03:29
in a in a simpler term um it's just speech processing but it's not speech based or text based speech processing
00:03:37
okay the goal is to produce a technology for tracking discourse flow in conversational speech
00:03:42
by focusing on the behaviour of participants
00:03:44
to make inferences about the discourse participation status
00:03:49
not the text in other words if i talk to you first are you there
00:03:54
can you hear me can you see me are you listening to me do you understand me do you agree with me
00:04:01
did you agree with me before i said it or do you agree with
00:04:04
me freshly now are you expressing surprise at that agreement many many levels of participation
00:04:11
this is a highly formal one where you don't have many rights to express feedback
00:04:16
everybody's doing their own thing and i can tell from the timing of your nods your smiling how i can pace my talk
00:04:23
ah anyway what you're doing is active listening
00:04:29
and i think that's a technology which is missing we have active speaking and we have transcribing machines
00:04:37
we do not have devices that do it for a machine
00:04:39
in an interaction where the machine monitors the participant status
00:04:46
traditional approaches to spoken dialogue interface design which is what i have tended
00:04:51
to work on tend to assume a ping pong or push to talk style of speech interaction
00:04:56
so the system talks you answer it responds you reply and so forth there is one ball thrown
00:05:02
from side to side and my talk is gonna show that the data do not support this view
00:05:07
and this is what i spend all my time looking at i love it okay i'll try and talk you through it you'll see lots of pictures like this today
00:05:14
it's a telephone conversation it's speech activity the blue
00:05:19
is j. i. m. a. pull from elle in
00:05:21
this case j. f. a. is a female speaker and it's the sixth conversation in a series of ten
00:05:28
we paid people to come into a room pick up the telephone and talk
00:05:32
that's it we'll give you this much money for thirty minutes each and that's all there is to it
00:05:37
and it was very frightening because they had no idea who was gonna be on the end of that phone
00:05:42
but over the series of ten conversations they actually got to know each other they became friends
00:05:48
they spoke about things of mutual interest um because they were balanced for male female uh et cetera
00:05:53
et cetera we have a very interesting corpus and this is my preferred way of looking at that speech
00:06:01
looking at activity patterns
00:06:05
two party telephone ah no constraints on the content they were paid to talk thirty
00:06:10
minutes each week for three months ten conversations all the conversations were manually transcribed
00:06:16
cost a lot of money took a lot of time but i think unless you have that human know how
00:06:21
in your initial data you lose a lot i think it's necessary to do that
00:06:28
resulting text is almost impossible to read it's very very fragmented very very broken
00:06:35
overlapping speech you may have noticed accounts for as much
00:06:39
as half of the individual solo talking time in these conversations
00:06:44
there are very few grammatical sentences and much interactive time syncing with the listener
00:06:49
often completing the utterances of the other speaker they they dance around the conversation
00:06:57
unlike a human machine interface real spoken dialogue has few constraints on turn taking
00:07:05
uh both partners typically interrupt each other very often for this mutual construction my mother
00:07:11
said don't talk when you're being spoken to she said you know if somebody speaks you listen
00:07:16
be polite this data does not show polite activity
00:07:22
the first minute bubble
00:07:26
i mean you didn't get
00:07:29
this is probably best referred to as the turn in dialogue
00:07:33
and you get very much backchannel or feedback
00:07:38
but the more you look at this the concept of the turn
00:07:43
is a very difficult one to explain and the concept even of an utterance
00:07:46
is very difficult to explain but we have data here which we can interpret
00:07:52
meaningfully to represent stages in the discourse and more
00:07:56
importantly relationships between people social relationships between discourse interactants
00:08:02
just very briefly let's look at some of the speech activity so i can back up my words with data with numbers
00:08:08
we calculate just that sort of activity silence overlap solo a solo b
00:08:14
so the way so we'll be and talk they want to be with the numbers and i'll show you a chart next
00:08:20
which shows median maximum minimum let's look at the medians we recorded
00:08:25
we pay them for thirty minutes they gave us ten percent extra
00:08:30
so they actually spoke they held the telephone for thirty three minutes but there was
00:08:35
silence for three minutes so they spoke exactly for thirty minutes which is what we paid for
00:08:40
he was silent for half the time he was silent for half the time overlapping speech of that
00:08:49
put look look eighteen minutes taking minutes more than half the time
00:08:54
and uh this is the solo talk eleven minutes ten minutes and the overlap
00:08:59
in most cases is more than half of the solo talking time so something very interesting is happening
00:09:05
now you can begin to look at these pictures with more understanding you can begin to interpret this
00:09:11
you can see flow you can see parts where one person is dominating and the other person
00:09:16
is actively participating receiving and then when they shift the balance and it goes across the other way
00:09:23
machines can do this as well because they don't have to do
00:09:25
any speech analysis or speech recognition it's basically noise detection on off uh_huh
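The on/off noise detection described here can be sketched in a few lines. This is only an illustration, assuming each speaker's channel has already been reduced to a per-frame 0/1 activity flag; the function name `activity_summary`, the frame length and the toy tracks are hypothetical and are not taken from the talk.

```python
def activity_summary(track_a, track_b, frame_sec=0.1):
    """Total time (seconds) of silence, solo talk and overlap for two speakers.

    track_a, track_b: equal-length sequences of 0/1 flags, one per frame,
    where 1 means "noise detected on this speaker's channel".
    """
    totals = {"silence": 0.0, "solo_a": 0.0, "solo_b": 0.0, "overlap": 0.0}
    for a, b in zip(track_a, track_b):
        if a and b:
            key = "overlap"      # both channels active
        elif a:
            key = "solo_a"       # only speaker A active
        elif b:
            key = "solo_b"       # only speaker B active
        else:
            key = "silence"      # neither channel active
        totals[key] += frame_sec
    return totals


# Toy example with one-second frames (invented data).
speaker_a = [1, 1, 1, 0, 0, 1, 1, 0, 0, 0]
speaker_b = [0, 0, 1, 1, 1, 1, 0, 0, 0, 1]
summary = activity_summary(speaker_a, speaker_b, frame_sec=1.0)
print(summary)
```

Comparing `overlap` against `solo_a` and `solo_b` gives the kind of overlap-to-solo ratio discussed above without any speech recognition.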
00:09:34
okay now let's look at some details j. f. a. j. m. a. first conversation
00:09:39
j. f. a. japanese female j. m. a. japanese male the female
00:09:45
again it's think but that's not a general rule it's go back one
00:09:49
yeah in this case we may list thank and the female that's a
00:09:52
that's a japanese male six conversation talk in english female female is blue here
00:09:59
that's a typically balanced conversation if we go back to this lady with her
00:10:04
partner just a little bit she's just thirty years old he's twenty five
00:10:10
she intimidates him in the first conversation he can't get a word
00:10:15
in edgewise he's politely responding and if we don't see the transcription yeah
00:10:23
first conversation as strangers conversation zero one they're absolutely terrified they don't know what they're getting into
00:10:31
and i think she's trying to calm him or well that's not join for thirty times at a time for discussion later
00:10:38
this is a little bit you can you can play it if if you click on any of this you can see the text
00:10:42
for those of you who read japanese it's good if you doubleclick you can hear the speech so you can understand the flow
00:10:48
j. f. a. j. m. a. second conversation same people second conversation so it may or may not be a human
00:10:55
characters she does tend to dominate quite if you if you count the time bins topic which is for against blue
00:11:04
and another one here is j. m. a. the same japanese guy
00:11:09
twenty five and this time it's an english speaking female she's also twenty five
00:11:14
um for those of you who like dirty pictures this is them having
00:11:20
fun with each other he's chatting her up and she's really cooperating
00:11:23
there's a lot of laughter a lot of very quick very brief um flirting behaviour
00:11:29
basically they couldn't see each other but they got to know each other quite well
00:11:34
okay so now we can think of a concept of measuring flow just from
00:11:39
this plus minus noise um the ratio of speech to nonspeech activity in any line
00:11:47
and in fact if you scale it by the length of the current utterance that's
00:11:50
a useful measure um if it's high it says that the speaker is dominating the discourse
00:11:56
and if it's low it says that the speaker may be listening may be thinking but anyhow not dominating
00:12:04
um if the conversation follows a ping pong pattern then you'd expect the two speakers to give you a negative correlation
00:12:12
on the other hand if you get a positive correlation i think it's quite interesting because it says
00:12:17
that both participants tend to speak at the same time and to be quiet at the same time
00:12:24
uh_huh
00:12:26
okay that's the formula basically you sum uh if you take three
00:12:29
speech utterances that gives you four speech silences so you sum these
00:12:33
take the average sum these take the average you do the ratio you scale it
00:12:36
by the uh length of the current utterance so basically you're comparing ratios of speech to nonspeech activity
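The slide with the formula is not in the transcript, so what follows is only one possible reading of the spoken description: average a window of three utterance durations, average the four silences around them, take the ratio, and scale by the length of the current utterance. The function name `flow`, the choice of the middle utterance as "current", and the toy durations are all assumptions.

```python
def flow(utterances, silences, window=3, eps=1e-6):
    """One possible reading of the flow measure for a single speaker.

    utterances: durations (seconds) of consecutive speech bursts
    silences:   durations (seconds) of the gaps around them, so that
                len(silences) == len(utterances) + 1
    """
    if len(silences) != len(utterances) + 1:
        raise ValueError("expected one more silence than utterances")
    values = []
    for start in range(len(utterances) - window + 1):
        speech = utterances[start:start + window]       # e.g. three utterances
        gaps = silences[start:start + window + 1]       # e.g. four silences
        mean_speech = sum(speech) / len(speech)
        mean_gap = max(sum(gaps) / len(gaps), eps)      # guard against division by zero
        current = utterances[start + window // 2]       # assumed: middle utterance of the window
        values.append((mean_speech / mean_gap) * current)
    return values


# Invented durations: a speaker who dominates versus one who mostly listens.
dominant = flow([4.0, 6.0, 5.0], [0.5, 0.5, 0.5, 0.5])
listener = flow([0.5, 0.5, 0.5], [4.0, 6.0, 5.0, 5.0])
print(dominant, listener)
```

Under this reading a high value marks a speaker who is dominating and a low value one who is mostly listening; correlating the two speakers' series over time would then separate ping-pong exchanges (negative correlation) from joint activity (positive correlation).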
00:12:44
here um this is an extreme case the
00:12:47
first thirteen minutes of the last conversation between j. f. b.
00:12:51
and a japanese female partner j. f. c. uh here the flow will be very high for blue very low okay
00:12:59
blue definitely dominates here's another one uh speech activity the first thirteen minutes of the
00:13:05
last conversation between japanese you know a and a partner j. m. a. beautifully balanced
00:13:11
they've found things that they can talk about they could be equal status with each other such such if we look at the overlap lips
00:13:18
and if you look at the overall average we find that
00:13:21
j. f. b. and j. m. c. this woman and the man who
00:13:25
did not get on well together have a very high negative correlation when she talks he listens when he talks she listens
00:13:33
but these two guys j. m. c. and j. m. b. two young guys talk about baseball
00:13:37
whatever when he says yeah yeah they won the other guy says yeah yeah they won
00:13:42
and then they go are you it's the dance like yeah
00:13:49
the the talk is talk but what we do with
00:13:52
talk is a much more complex much more interesting social ritual
00:13:58
okay let's just from preferences sound neither things like
00:14:01
digression here teaches some japanese um some nonverbal speech signals
00:14:07
this is half of my corpus one hundred utterances those one hundred
00:14:13
utterances account for more than half of the number of utterances in the corpus
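The coverage claim here (a small set of utterance types accounting for more than half of all utterance tokens) comes down to a frequency count. A sketch using invented toy tokens rather than the corpus itself; `coverage_of_top_types` is a hypothetical helper name.

```python
from collections import Counter


def coverage_of_top_types(tokens, top_n):
    """Fraction of all utterance tokens covered by the top_n most frequent types."""
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(tokens)


# Invented toy data: two very frequent short responses plus two rarer utterances.
toy_utterances = ["mm"] * 5 + ["yeah"] * 3 + ["well then", "as i said"]
top_two = coverage_of_top_types(toy_utterances, top_n=2)
print(top_two)
```

On a real transcription the same count with `top_n=100` would show how much of the corpus the hundred most frequent utterance types cover.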
00:14:19
one thousand uh no ten thousand times somebody said oh
00:14:23
and in that means yes um eight thousand six hundred times high high
00:14:30
means yes and then there's lots of laughter three thousand five hundred times huh
00:14:36
and means yes and then a a means yes and hi i'm is yes then one is yes
00:14:46
means no oh wait a minute i would think about this and
00:14:50
then um i know ah ah i i know huh that's it
00:14:56
if you do pattern matching bearing in mind that one of these dashes means elongation elongation of the vowel
00:15:03
you can see that these are very simple utterances very highly um
00:15:09
repetitive with very complex morphology that can be very long ah
00:15:14
yeah yeah yeah yeah yeah yeah yeah yeah yeah thirteen times
00:15:19
now whether yeah and yeah yeah and yeah yeah yeah yeah they may
00:15:22
mean different things but the point is that they're very very very frequent
00:15:28
they're very simple very common and they carry very complex information so
00:15:35
let's look at this is way off track but it's interesting numbers
00:15:39
the japanese female talking to these various people in c. is chinese female
00:15:42
chinese male and a female in this no japanese you not you know
00:15:46
okay the japanese people more often than a doctor foreigners because i was more
00:15:49
interested in that kind of data that you look at once is is ah
00:15:55
ah quite a lot with chinese people quite a lot with
00:15:58
japanese people but not so often with english people the story
00:16:05
demo then moments but it logically complex operator no no no no yes yes
00:16:12
when japanese talk to japanese these more complex structures it's
00:16:15
obvious this high yes very formal yes no and yes
00:16:24
chinese chinese english english what
00:16:32
yeah let's go on and take this example of yes i go
00:16:36
on so looking now down through the conversations for j. f.
00:16:40
b. you know these two people now first conversation second and so on to
00:16:43
the tenth conversation first time it comes up twenty six times she uses hai
00:16:47
second conversation thirteen thought the button seven four three one that's a beautiful curve if you plot it
00:16:58
when they don't know each other even japanese to japanese they're on this kind of very tentative
00:17:03
formal yes yes as they get to know each other it's yeah yeah yeah yeah yeah
00:17:10
that comes out of the data
00:17:13
what we see here is a bonding uh_huh anyway um this is another
00:17:20
form if the sound works i'm going to play some samples now of ah
00:17:26
one of these sounds the japanese word honma honma is local dialect in the area where i live and
00:17:33
it means really and it can be used as a modifier like really hot like this room is really hot
00:17:39
this room is really stuffy or it can be an interjection really i didn't think so that's
00:17:47
okay i know oh oh oh oh oh oh
00:17:56
just thing lately
00:17:59
same word oh
00:18:02
oh well
00:18:12
oh oh different
00:18:18
uh_huh of that one word from that one person we have three thousand five hundred tokens
00:18:23
if anybody wants to play with that data it's on the web you're looking for
00:18:26
maybe fifteen about fifteen different classes and they
00:18:31
vary along a dimension which i can't explain so
00:18:35
it really includes emotion she's laughing sometimes she sounds very sad she's serious a lot of times but
00:18:40
emotion per se is not the best terminology to explain that to match that's wanna stay here
00:18:45
i'd summarise that part of the talk by saying that common events facilitate simple comparisons
00:18:52
this noise is very very frequent even if you've never met her before you talk for five minutes and you've had ten of them
00:19:00
you hear the first two or three and they give you a baseline and then you can
00:19:05
you can make a comparison this one is louder longer softer harder than the previous one
00:19:11
so the very frequent simple sounds allow the listener to accurately estimate the affective states of the speaker
00:19:18
they're simple and they're efficient carriers of voice quality and prosodic information
00:19:22
that's all they can carry and they're interspersed very regularly throughout the speech
00:19:26
so i claim that many people say that speech spontaneous speech is ill formed
00:19:31
what we do is a kind of noisy representation of this beautiful abstract language jump
00:19:37
i don't think so i think it's important i think those
00:19:41
noises hesitations so called fillers uh they
00:19:45
carry very very useful prosodic social information
00:19:50
discuss controlled all that much but it's a true but what i wanna talk about here is multimodal data
00:19:57
processing large numbers of multimodal data because you know we need to look as well as listen
00:20:03
previous work was speech speech only i'm proud to say and that's the truth so this is
00:20:09
already on the s. s. p. net portal um so you can download it uh it's uh
00:20:16
this is the machine this is part of the machines we use to capture a lot of real interactive
00:20:21
speech we put this little bag on the table it in fact contains a three hundred and sixty degree camera
00:20:27
which captures everything around it thirty degrees down six degrees up so
00:20:31
in this angle around itself um a little very old now stereo
00:20:37
wave recorder very high quality mikes and then in the background somebody sits
00:20:42
with this like the guy at the back and just checks that you know
00:20:46
people haven't put a bag or something in front of the camera in which case you can't do anything
00:20:50
you can't remove the bag you live with the data missing you make a note of it somewhere
00:20:56
i'm in is new to fall it looks like this and i stuck an array of microphones on it at one time
00:21:03
um in fact i don't use any my took the lines up it's a tiny tiny thing you can script
00:21:09
the lens is beautiful the microphones are not necessary because as
00:21:13
i'll show you noise no noise or speech nonspeech is sufficient
00:21:18
so the microphone on this machine any cheap ten dollar one is enough i think for this type of work and it just
00:21:24
puts out an analogue signal and it's better to have a digital signal so if you can spend a lot of money
00:21:30
on the point briefly or something of that quality something that colour but this
00:21:33
is the digital firewire camera this is the same thing look at the size of
00:21:38
a visa to lenses that you can this number just plug into dust covered
00:21:42
table under a just like a decoration but it watches everything that goes on around
00:21:50
uh okay
00:21:53
yeah
00:21:54
me this later
00:22:01
she signed a contract with this
00:22:07
ooh
00:22:09
damien is the e. e. o. pain in other graphics guy he's done amazing face tracking for me
00:22:19
uh christina is the pragmatic linguist discourse analyst
00:22:24
um this lady is a french you can work on fallen free we have three days of filming contracted
00:22:31
one uh she's very fluent in english belgian finish
00:22:35
australian if expert brit japanese were all speaking in english
00:22:40
um we did ninety minutes each day recorded administrators to so that's the kind of data i'm working with you can see here one
00:22:47
two three of those cameras a couple of microphones and stuff we also had
00:22:51
a microphone hanging from the ceiling for very high quality audio
00:22:55
i'm not all that stuff is on the web you can see it the
00:22:59
a. c. r. feast is diffuse that died the lab i was working in crashed
00:23:06
um but this page is not be reconstructed with michelle soap on the s.
00:23:10
s. p. server and you can see day one day today three in various formats
00:23:15
from the the three hundred and sixty degree camera et cetera with flat cameras from
00:23:19
here from their itself was so there's a lot of video data is also label it
00:23:25
we do the topic list three the the things of the
00:23:28
composition of uh the emotion list the heat of the conversation
00:23:34
and we had two or three label is listening to the stop and annotating not we
00:23:38
didn't use your trust it was similar to filter as a high activation the activation hard but
00:23:43
and the various shots at such if you got into this link here you get the reforms of data
00:23:50
that you can download angle one angle to angle three three sixty degree et cetera with you or do you
00:23:56
um some bad sound but you have to live with that if you have enough capture devices you typically can recover a little
00:24:02
bit okay annotations this is what the transcription looks like but that's
00:24:07
a terrible way to access data you cannot reach discourse from text
00:24:14
uh this is better this is topic that's time aligned data
00:24:18
what happened the change of topic who was mainly talking and listening i
00:24:22
was reading and the mood is it heated quiet interested very funny a bit quiet
00:24:27
uh doing it was the topic to general understanding kind i think this is like the script of a movie
00:24:32
but it's done in retrospect we had people look at it and i basically said do me a scene analysis
00:24:38
describe each scene for me that's that we also have this software which is great so
00:24:44
um the four people are colour coded and we have automatic head tracking and automatic speech from the transcription
00:24:52
so uh as you scroll through this is the flash interface a flash movie as
00:24:57
you scroll through it you can see here four colours because four people are speaking
00:25:02
you see the discourse interactions you can see the activity you can also
00:25:06
see the output of each of the colour coded head trackers body trackers
00:25:12
if you can find heads in a video you know typically that unless
00:25:17
you're in space or something there is gonna be a body underneath it
00:25:21
so with head tracking you get a body shot for free you can measure the movement in this area and you can measure the movement
00:25:26
in this area separately and just by looking at movement i think you can infer a lot i will go on to that
00:25:33
um
00:25:34
okay if you could we could even ask what active listening is it
00:25:38
says something i personally don't really agree with but maybe that's the standard definition
00:25:42
it says that active listening is a structured way of listening and responding to others
00:25:47
which focuses attention on the speaker suspending one's own frame of reference
00:25:51
and suspending judgement are important in order to fully attend to the speaker
00:25:55
i think that's just eccentric what i mean by uh active listening is participation and i'll show you the
00:26:03
maybe this is true having the ability to interpret a person's body language allows
00:26:08
the listener to develop a more accurate understanding of the speaker's words i would support that
00:26:13
i tend to think of it as actions rather than words
00:26:17
participants actively engage in discourse in an overlapping and complementary manner
00:26:21
and i focus on the contributory and participatory discourse actions rather than only cognitive attention
00:26:28
states of the listener because these actions are physical observables and they can easily be measured
00:26:35
so ah
00:26:36
now we have a different view of interaction model
00:26:40
i'm not processing discourse for its content but i'm much more interested in the dynamics
00:26:46
this is a socially evolving event it's multifaceted multidimensional and it's integrated
00:26:52
synchronised it's loosely based around the frame locus synchrony which i didn't understand
00:26:58
i know the the temporal dynamics are essential
00:27:02
i can show you how closely aligned they are but i don't you have a model of
00:27:08
anyway words that often come up in this context are engagement
00:27:13
entrainment mutual cooperation we had a session yesterday on music
00:27:18
which i was able to overhear a little bit where these words also
00:27:22
came up musicians do it officially we're all musicians when it comes to social interaction
00:27:30
okay so this is what one aspect of my data looks like here you can see
00:27:39
unit one grey is dominating green is taking part in
00:27:44
another conversation the others may or may not be present
00:27:48
except here you get an explosion green comes in and then again another
00:27:54
explosion and here a very definite explosion and here an even bigger one and here well
00:28:00
something's happened between these four people where they gradually come together
00:28:08
something is being talked about um you know they join in and they take part in it you can see
00:28:13
waves or you can think of waves going through going around that table these are the things i want to quantify
00:28:22
um
00:28:24
use a joint it's very difficult to look at it in this format
00:28:27
but you can see the talking about some particular text as the scrolls through
00:28:33
i didn't want to do a live demonstration here because the things can go wrong they will go wrong but if you're interested i have all this on my machine
00:28:40
if you get to the s. s. p. that you can download and play with this you click
00:28:43
anywhere on that previous screen and you'll come up on the slot that it's one thing puts it
00:28:50
and then we get the head tracking we actually had a dummy head on day one to measure the drift in the head tracker
00:28:57
we know this thing doesn't move but you do get movement on that
00:29:01
and you can recalibrate my cameras also get knocked and it's nice to be able to
00:29:09
anyway looking at this you see the the the speech movement the
00:29:13
explosion you can see the body movement head movement at the same time
00:29:18
it's common sense when people laugh they move their heads ha ha ha
00:29:24
when people talk they nod
00:29:28
so not surprisingly when i talk my head moves my body moves the correlation is really high
00:29:36
any person's head correlates about nought point eight with that body
00:29:39
throughout that that that's common sense but the interesting thing here
00:29:44
is that my body and head synchronise with your speech too
00:29:48
significantly high nought point four nought point five or higher numbers
00:29:54
and these are having correlations between uh what does it show shows which is that people are present it
00:30:01
shows that they're attentive shows that the sharing it shows that they are forming it could be that you like
00:30:09
more importantly it can be measured him because we have these that the strikes and they find that this
00:30:15
remarkably okay here green is talking green is talking and there's a lot of
00:30:19
activity the um no activity on red and yellow and grey they could be dead
00:30:25
they might be sleeping but here we have a peak a peak a peak and a big a
00:30:30
big peak which massively coincide the fact that those peaks
00:30:34
coincide confirms to me that they were not dead
00:30:39
they weren't sleeping they were listening i can go back now and say from zero data zero movement i guess yeah yeah i got listening
00:30:46
because i get such a peak here and then you look at this and i it
00:30:51
if you look at the video it's almost like they're they're puppets being pulled by the same piece of string
00:30:56
one person goes forward the other person goes back or they all go forward together there's a tremendous synchrony of movement
00:31:05
okay so we can look at these traces of the speech activity and we can look at the traces of a
00:31:10
movement data you know if you look possible to them at ten frames per second so it's like this movie
00:31:18
we can see that they align really precisely i was expecting a cascade of movements as participants react so
00:31:25
jane says something and john laughs and then fred laughs and then mary
00:31:29
laughs you know a like a like a disease going around the room
00:31:33
it turns out that if you just low pass filter a
00:31:36
window of one frame is sufficient to capture many of those synchronous movements
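One way to tell a cascade from direct synchrony is to correlate two activity traces at several frame lags and see which lag fits best. A minimal sketch with invented traces; `pearson` and `best_lag` are hypothetical helper names, and the talk does not say how the alignment was actually tested.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def best_lag(x, y, max_lag=5):
    """Return (lag, r) for the lag in 0..max_lag at which y best follows x.

    A lag of k frames compares x[t] with y[t + k].
    """
    best = (0, pearson(x, y))
    for lag in range(1, max_lag + 1):
        r = pearson(x[:-lag], y[lag:])
        if r > best[1]:
            best = (lag, r)
    return best


# Invented movement traces, one value per video frame.
speaker = [0, 0, 5, 1, 0, 0, 4, 1, 0, 0]
delayed = [0, 0, 0, 0, 5, 1, 0, 0, 4, 1]   # same bursts, two frames later

print(best_lag(speaker, speaker))   # synchronous partner: best lag is 0
print(best_lag(speaker, delayed))   # cascading partner: best lag is 2
```

A best lag of zero across participants would correspond to the synchrony reported here, whereas a cascade around the table would show up as increasing lags from person to person.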
00:31:43
this activity peaks
00:31:46
activity peaks in the movement indicate bursts of high interaction in the conversation and
00:31:51
there are very clear sequences of propositional content
00:31:55
one person is speaking at length and the others are very static a language stuff and then bursts
00:32:01
of high activity the transitions engagement i think are key points in the interaction you can make inferences
00:32:09
if if everybody listens attentively and then laughs and then somebody else starts talking you have a topic change
00:32:17
oh so you can do topic detection
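The inference described here (bursts of shared activity marking candidate topic changes) can be sketched as summing the per-person movement traces, smoothing lightly, and picking local peaks. The window width, the threshold and the toy traces are assumptions chosen for illustration, and `moving_average` and `activity_peaks` are hypothetical names.

```python
def moving_average(signal, width=3):
    """Simple centred moving average, a crude low-pass filter."""
    half = width // 2
    smoothed = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed


def activity_peaks(tracks, width=3, threshold=2.0):
    """Frame indices where the smoothed group activity has a local peak.

    tracks: one movement trace per participant, all of equal length.
    """
    group = [sum(frame) for frame in zip(*tracks)]   # group activity per frame
    smooth = moving_average(group, width)
    return [
        i for i in range(1, len(smooth) - 1)
        if smooth[i] >= threshold
        and smooth[i] > smooth[i - 1]
        and smooth[i] >= smooth[i + 1]
    ]


# Invented traces for four participants: two shared bursts of movement.
tracks = [
    [0, 0, 0, 3, 0, 0, 0, 0, 4, 0, 0],
    [0, 0, 0, 2, 1, 0, 0, 0, 3, 0, 0],
    [0, 0, 1, 3, 0, 0, 0, 0, 3, 1, 0],
    [0, 0, 0, 2, 0, 0, 0, 1, 4, 0, 0],
]
peaks = activity_peaks(tracks)
print(peaks)
```

Each returned frame index is a candidate point of high engagement; checking whether the main speaker changes after such a peak would be one way to flag a possible topic change.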
00:32:20
that's right and uh we haven't even started thinking about transcription yeah this is just
00:32:24
sensory input from a camera and if you want speech input from a single microphone
00:32:30
those automatic measurements typically correlate at about nought point eight uh i claim
00:32:37
that they probably render manual transcription irrelevant that you know that way we can capture much
00:32:41
much more data we can do very simple analysis of of movement traces noise traces and
00:32:47
then we can go in and focus our manual labour on the points of high engagement high activity
00:32:54
here are the head traces for the same four people body movement traces for the same four people and
00:33:00
this is just the sum of the whole lot so this is the group activity and this is individual activities
00:33:06
it's remarkable i claim here absolute alignment here too
00:33:11
slight lead for the yellow but absolute for everybody
00:33:14
else there are so many positions with those please coworker that you can definitely get a a technology out
00:33:22
okay it's remarkable synchrony
00:33:27
as i said i expected a sliding window would be necessary but that's not the case um and
00:33:31
with low pass filtering and smoothing we find very very close activity peaks that reveal special moments in the discourse
00:33:38
so in summary i hope picking up your computer was assigned i have five minutes left of the of the um
00:33:45
okay okay five to get the slider on the web or they will be an awful lot of time um
00:33:53
with respect to sequence of moves in the social conversation interaction this is a quote from adam kendon
00:33:59
the personal communication each contribute to the images of
00:34:03
a joint use the same system of coordinated action patterns
00:34:06
and the emergent common understanding is well maybe the cognitive consequences
00:34:13
that he's gonna make plans about cognitive consequences i'm happy to produce the technology
00:34:18
but we have to share our knowledge and that's why i'm very excited
00:34:21
to be here anyway this particular talk a key point maybe this data
00:34:25
is that whereas previous work required lengthy and expensive manual transcription of this audio visual multimodal data
00:34:31
the proposed automatic procedures derived from very simple easily downloadable free
00:34:37
image processing that show a very high correlation with transcribed speech
00:34:43
basically if you don't have to see the you have this technology they needed some wonderful hacks to make it very robust
00:34:49
and in a constrained situation where the head is vertical and they tend to
00:34:54
look at the camera once in a while we can do better than i'd that sam head detection
00:35:01
range of like you've done outside than inside the flight is fine
00:35:06
so the technology is there and i'm very although the say that this
00:35:11
one little colouration that it's not really available yes it's the website um
00:35:16
they suggested i give it a name and free talk seemed to be an adequate
00:35:21
description um special thanks to michelle who actually up
00:35:25
little stuff kathleen who initiated this she off um
00:35:32
use a lot of people around the world are playing with this there is no duty except if
00:35:38
you derive anything from it put it back on the system so other people can benefit from that um okay
00:35:46
sure wrap up um active listening is something that really
00:35:52
excites me at the moment typical traditional technology says that
00:35:56
speech synthesis talks it's just a box it goes on but doesn't do any
00:36:01
sensing it doesn't even look around to see if there are any people in there
00:36:05
that is true if you could find the people and detect their participation then
00:36:10
it could rephrase it could restructure it could repeat simplify jump ahead et cetera
00:36:16
so we could have some very intelligent speech uh interaction lips what um
00:36:23
people talk interactively they overlap very often and overlap is not a performance error
00:36:30
i think overlap is a social act uh it's not just ping pong turn taking
00:36:36
with respect to discourse synchrony um
00:36:40
people interact together and the best
00:36:43
it's like a tango it's not that partner a is dominant and partner b follows if you watch
00:36:50
people dancing a tango i had the great pleasure to walk through a park in dublin at a tango festival
00:36:57
with about ten couples with an argentine violinist guitarist or and they
00:37:02
were dancing and it wasn't that the man led and the woman followed
00:37:07
she knew exactly where he was going and she was just there when it happened
00:37:12
i think that what happens in speech is not that you talk i process and then react
00:37:17
i proactively participate in our conversation
00:37:23
i finish your sentences because i know what you're gonna say it's not interrupting you it's complementing you with the um
00:37:31
general dynamics are something i do not yet understand but i think there's a very
00:37:36
magical element there if we could understand the temporal dynamics of this kind of interaction
00:37:42
oh okay a part of this talk was also presented at the
00:37:47
interspeech special session of the same name discourse synchrony active listening
00:37:51
on the ninth day of the ninth month of the ninth year of the third millennium
00:37:57
and with no terrorist activity unless this critical but um so okay um
00:38:07
a acknowledgements i want i was i was the only every three and the the first part is
00:38:13
that knowledge is the n. i. c. t. i. eighty and and who use the phone me um
00:38:18
do it because in dublin i i've give me money to carry on this work for the next five years and i still
00:38:24
have to use left the japanese government funded so i i'm looking
00:38:27
both sides or think about common and headed to here because she
00:38:32
have the confidence to believe which is a very strong program but
00:38:36
for programming is available if you have data which is a similar format
00:38:42
we have time aligned information of any phone you can also view your data
00:38:48
and finally we are recruiting so if you wanna contrivance interested in this city very welcome to uh
00:38:54
okay i have to stop there
00:39:02
i see uh_huh off
00:39:11
yeah
00:39:15
yeah
00:39:18
yes of course
00:39:21
yeah
00:39:23
sensing technology uh oh oh oh
00:39:32
if i if you
00:39:41
exactly before u. s. e. cadence circuit if the cadence posing
00:39:48
and amperage shopping and so forth if you look at the
00:39:50
ends of utterances nigel ward justin nice technology if the people katie agents that went and got a locking a list of some
00:39:58
i've made a very nice machine where they actually model much more complex if allowed to take what direction to provide feedback
00:40:06
but i think it's not just physical cues i think there's
00:40:10
something swell uh i can i i ended musical how impressed
00:40:16
musical rhythm is very clearly formalised and there's a framework and the
00:40:21
skill in music deviating from the framework in a a controlled way
00:40:28
that's what makes performance as opposed to rendition i i think
00:40:31
we are masters of that performance in this kind of discourse
00:40:36
i mean here is different but if you could also some people it single curved probably because
00:40:41
when there are no external constraints noticed this conversation that's
00:40:45
when the social aspects can emerge was strong i think and
00:40:50
yeah okay interrupt me
00:40:55
sure
00:40:57
oh no i haven't i'll check out that that ugliness
00:41:03
yeah this series every time
00:41:08
what got these three days they one day today three same people coming in so yes it is a series overlap
00:41:19
anyone that even in any one moment you get yeah or even one topic group
00:41:25
somebody will introduce something new then it sparks of memories or
00:41:30
key points talking points than the other three you know that mentions treating children some years is a yeah yeah i've been there
00:41:36
and then it's all they had this problem the corner kind of thing that people adding information making that way
00:41:43
i think he still do we get from this date or you i i'm i'm an an
00:41:53
office in the thrills but i can use tools like h. m. m.'s markov models et cetera
00:42:00
you can use it was
00:42:05
that's defined data that's part of the of the that 'cause it's a time aligned annotation
00:42:10
the front me yeah
00:42:17
come back next week we're going to police no facts
00:42:33
i've
00:42:39
the stream of the video that comes in from a camera is exactly like a doughnut because of the three hundred and sixty
00:42:45
degree thing it's very hard to observe the first thing we do is virtually straighten that and then we choose regions of interest
00:42:52
i'm from that will have the pointing up relatively speaking i detection we've got with anger
00:42:58
down below the head two point five times the width and that gives us a body
00:43:03
and we simply do um
00:43:06
i forget the term for it but there's
00:43:09
yeah optical flow of um within the box around the box within this box around this box so that no i guess
00:43:17
there is movement left or right x. coordinate y. coordinate and a z.
00:43:21
coordinate because we have um zoom if you like people come closer or further away
00:43:27
that's there and the same for the body so it's x. y. and z. for the two parts times number of people
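The feature layout described in this answer (x, y and z for a head box and a body box, for every person) can be sketched as below. The talk gives no implementation details, so everything here is an assumption for illustration: the flow samples are taken to be `(px, py, dx, dy)` tuples already produced by some optical flow routine, and the "zoom" value is approximated as the mean outward displacement from the box centre.

```python
from math import hypot


def motion_features(flow, centre):
    """Reduce the flow samples inside one box to an (x, y, z) triple.

    flow   : list of (px, py, dx, dy), a pixel position and its displacement
    centre : (cx, cy) centre of the box
    x, y   : mean horizontal and vertical displacement
    z      : mean outward (radial) displacement, an assumed proxy for zoom
             (positive when the region appears to expand)
    """
    cx, cy = centre
    n = len(flow)
    x = sum(dx for _, _, dx, _ in flow) / n
    y = sum(dy for _, _, _, dy in flow) / n
    radial = []
    for px, py, dx, dy in flow:
        rx, ry = px - cx, py - cy
        r = hypot(rx, ry)
        if r > 0:
            radial.append((rx * dx + ry * dy) / r)
    z = sum(radial) / len(radial) if radial else 0.0
    return (x, y, z)


def frame_vector(people):
    """Concatenate head and body features for every person in one frame."""
    vector = []
    for person in people:
        for part in ("head", "body"):
            flow, centre = person[part]
            vector.extend(motion_features(flow, centre))
    return vector


# Invented samples: a box whose contents expand, and one that shifts right.
expanding = [(1, 0, 1, 0), (-1, 0, -1, 0), (0, 1, 0, 1), (0, -1, 0, -1)]
shifting = [(1, 0, 2, 0), (-1, 0, 2, 0), (0, 1, 2, 0), (0, -1, 2, 0)]
person = {"head": (expanding, (0, 0)), "body": (shifting, (0, 0))}
features = frame_vector([person] * 4)
```

With four people this gives 3 values times 2 parts times 4 people, so 24 numbers per frame, which matches the "x. y. and z. for the two parts times number of people" count given in the answer.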
00:43:38
uh oh
00:43:58
uh_huh
00:44:02
yeah
00:44:07
uh_huh
00:44:12
oh
00:44:16
that also it sure is a gross exaggeration to say that we adopt all the time of crazy but
00:44:22
if if you can measure that then you can measure that the amount of
00:44:26
discordant the amount of delay feedback et cetera and that again becomes rich information source
00:44:33
um
00:44:36
yeah
00:44:43
if
00:44:48
oh i like to make that generalisation but i don't at the moment i can't say that
00:44:54
it's it's very complex
00:44:56
yeah um noise cancellation is a magic word in this context you know the headphones that you wear
00:45:01
on the aeroplane it takes the background noise and subtracts it so that it leaves the significant noise
00:45:07
we need to we how are you going to say we need to develop the technology like noise cancellation where we can look at
00:45:14
the the round it's catching on the the nonspeech related movement
00:45:19
and the speech related movements emerge because of the synchrony across them
00:45:24
so we can use the multi tracks to do this noise cancellation
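The idea of using synchrony across tracks to separate speech-related movement from other movement can be illustrated very crudely as follows. This is not real noise cancellation and not the speaker's method, which is described only as something still to be developed; it is a toy gate, with the hypothetical parameters `threshold` and `min_active`, that keeps movement only in frames where several people move at once.

```python
def keep_synchronous(tracks, threshold=0.5, min_active=2):
    """Zero out frames in which fewer than min_active tracks are moving.

    tracks     : list of equal-length activity traces, one per person
    threshold  : activity level at which a track counts as "moving"
    min_active : how many tracks must move together for a frame to be kept
    """
    active_counts = [
        sum(1 for value in frame if value >= threshold)
        for frame in zip(*tracks)
    ]
    return [
        [value if count >= min_active else 0.0
         for value, count in zip(track, active_counts)]
        for track in tracks
    ]


# Invented traces: person a fidgets alone in frame 0, everyone moves in frame 2.
a = [0.9, 0.0, 0.8, 0.1]
b = [0.0, 0.0, 0.9, 0.0]
c = [0.1, 0.0, 0.7, 0.0]
cleaned = keep_synchronous([a, b, c])
```

The solitary movement in frame 0 is suppressed while the shared burst in frame 2 is kept, which is the intuition behind letting synchronous movement "emerge" from the multiple tracks.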
00:45:29
talking of which i will do that
00:45:35
yeah
00:45:39
yes
00:45:42
uh_huh he
00:45:52
this is also the final yeah of course there are differences
00:45:55
there are cultural differences there are language related differences there are society related differences
00:46:01
bus drivers do it but they do it differently from office workers office workers
00:46:04
do it differently from university professors but they'll do the same kind of thing
00:46:09
i don't think this technology would not work for any particular language or
00:46:12
culture it's just that you couldn't generalise models trained in one region to
00:46:18
yeah that's going back to what i said earlier speech is such a wonderfully efficient mechanism that it
00:46:23
be contains docking points it contains repetitions to allow
00:46:27
you to form a basement maker a local compare
00:46:31
even if you've never met you before you don't know my voice you don't know my speaking characteristics like comparing
00:46:37
small changes in the the frequent simple actually relevant
00:46:41
uh sections then you can do that kind of processing
00:46:47
so i think we can get a robust technology
00:46:57
uh_huh right
00:47:00
kansas city with that
00:47:03
his mode of ways you can do that don't see saying like
00:47:09
so
00:47:12
i don't think it does work unless that was my once they know they have other things in there if if if
00:47:19
it's japanese uh when i first went to japan my japanese was very very poor but we managed to communicate very well um
00:47:27
so we
00:47:35
yeah

Conference Program

Tracking 'the 2nd channel' of information in speech
Nick Campbell, Trinity College Dublin
Sept. 13, 2009 · 2:30 p.m.
