Transcriptions

Note: this content has been automatically generated.
00:00:00
It's a privilege to be here. I'm sure that many of you have seen some of these slides before,
00:00:07
but that's not an apology, because you won't have seen them in the same order or the same combination, and I know that nobody has seen them all.
00:00:13
If you think of slides like spices and a talk like a meal, then I hope you can enjoy my impromptu cooking.
00:00:21
Um, and yeah, we're going to talk about
00:00:25
social signals. In fact, this is the seventy-fifth anniversary
00:00:31
of "Quantitative Analysis of the Interaction of Individuals". This is a copy of
00:00:36
the front page of the journal, from January 13th, 1939.
00:00:43
What happened was, he was doing the quantitative analysis of synchrony and
00:00:51
of active listening, and I think it started with his work and that of his co-workers.
00:00:55
He adapted a manual typewriter, the thing you used to use to type letters, fitting it with an electric motor
00:01:02
and a spool of paper, so that when something happened an observer could make a mark on the paper.
00:01:09
And he actually used a small chart recorder so that the
00:01:13
operator could measure changes in subject-to-subject activity over time.
00:01:19
He was able to observe the discourse actions of two individuals and note the timings and durations of their actions,
00:01:26
to understand the way in which sequences of interactions are organised. My screen
00:01:31
is too small; it's automatically gone into some very grand presenter mode that
00:01:35
shows you what's coming next but not what's here now. Um, this
00:01:39
was, anyway, the first recorded sequence analysis of human behaviour.
00:01:45
He wasn't an ethologist; Adam Kendon, though, is an ethologist. Ethologists typically
00:01:51
study animal behaviour, and we are animals, so they study our behaviour from that point of view.
00:01:56
Adam Kendon has written that the first task of a human ethologist, like that of any ethologist who sets out
00:02:02
to study, uh, bird or fish or monkey, must be systematic description: you
00:02:07
must set out to see what behavioural structures the human being has.
00:02:12
In doing this with people, he says, it would seem best to begin with
00:02:15
those aspects of behaviour which are most likely to be shared with other animals.
00:02:21
It kind of dismisses language as well: detailed analysis of language must eventually find a place
00:02:27
in human ethology, but these do not seem to be the best aspects of human behaviour to start with.
00:02:33
And my thesis is, you know, we have very advanced speech recognition; we can process
00:02:40
text, but most of the time the devices that process the text don't know what to do with it.
00:02:46
The text is almost irrelevant to the interaction. So what I want to focus on is not what was said, but what was done.
00:02:54
okay
00:02:56
uh
00:02:57
Conversation analysts and discourse analysts have also performed similar
00:03:02
studies, but many people claim that the work can't be generalised.
00:03:06
And my aim is to produce a technology, a machine or device or a module, which will observe
00:03:15
the systematic, socially organised procedures underlying the ways in which social actors move into
00:03:21
a mutually ratified participation in an encounter; those are the technical terms. Kendon calls that frame attunement.
00:03:29
In simpler terms, it's just speech processing, but it's not text-based speech processing.
00:03:37
Okay, the goal is to produce a technology for tracking discourse moves in conversational speech
00:03:42
by focusing on the behaviour of participants,
00:03:44
to make inferences about their discourse participation status,
00:03:49
not the text. In other words, if I talk to you: first, are you there?
00:03:54
Can you hear me? Can you see me? Are you listening? Do you understand me? Do you agree with me?
00:04:01
Did you agree with me before I said it, or do you agree with
00:04:04
me freshly now? Are you expressing surprise at that agreement? Many, many levels of participation.
00:04:11
This is a highly formal setting, where you don't have many rights to express feedback,
00:04:16
but everybody's doing it: you're all nodding, and I can tell from the timing of your nods and smiling how I can pace my talk.
00:04:23
Uh, anyway, what you're doing is active listening,
00:04:29
and I think that's a technology which is missing. We have active speaking and we have transcribing
00:04:36
machines; we do not have devices to bring the machine
00:04:39
into the interaction, where the machine monitors the participants.
00:04:46
Traditional approaches to spoken dialogue interface design, which is where I have tended
00:04:51
to work, tend to assume a ping-pong or push-to-talk style of speech interaction:
00:04:56
the system talks, you answer, it responds, you reply, and so forth. There is one ball being thrown
00:05:02
from side to side. And my talk is going to show that our data do not support this view.
00:05:07
And this is what I spend all my time looking at; I love it. Okay, I'll try and talk you through it; you'll see lots of pictures like this today.
00:05:14
It's a telephone conversation; it's speech activity. The blue
00:05:19
is JMA, the male, and in
00:05:21
this case JFA is the female speaker, and this is the sixth conversation in a series of ten.
00:05:28
We paid people to come into a room, pick up the telephone, and talk, and
00:05:32
we said, we'll give you this much money for thirty minutes each, and that's all we said.
00:05:37
And it was very frightening, because they had no idea who was going to be on the other end of the line.
00:05:41
Awful. But over a series of ten conversations they actually got to know each other; they became friends.
00:05:48
They spoke about things of mutual interest. Because we matched them male-female, et cetera,
00:05:53
et cetera, we have a very interesting corpus, and this is my preferred way of looking at that speech:
00:06:01
looking at activity patterns
00:06:05
Two-party telephone talk, no constraints on the content; people paid to talk thirty
00:06:10
minutes each week for three months, ten conversations. The conversations were manually transcribed.
00:06:16
It cost a lot of money and took a lot of time, but unless you have that human know-how
00:06:21
in your initial data and annotation, well, I think it's better to do that.
00:06:28
The resulting text is almost impossible to read; it's very, very fragmented, very, very broken.
00:06:35
Overlapping speech, you may have noticed, accounts for as much
00:06:39
as half of the individual solo talking time in these conversations.
00:06:44
There are very few grammatical sentences, but much interactive turn-taking, with the listener
00:06:49
often completing the utterances of the current speaker. They dance around the conversation.
00:06:57
Unlike the human-machine interface, real spoken dialogue has few constraints on turn-taking.
00:07:05
Both participants typically interrupt each other very often in this mutual construction. My mother said, don't talk when you're
00:07:12
being spoken to. She said, you know, if somebody speaks,
00:07:14
you listen; be polite. This data does not show that polite activity.
00:07:22
Here's the first minute blown up.
00:07:26
I mean, you do get
00:07:29
what is probably best referred to as a turn in dialogue,
00:07:33
and you get very much backchannel, or feedback.
00:07:38
But the more you look at this, the concept of the turn
00:07:43
is a very difficult one to explain, and even the concept of an utterance
00:07:46
is a very difficult one to explain. But we have data here which we can interpret
00:07:52
meaningfully, to represent stages in the discourse and, more
00:07:56
importantly, the relationship between people, the social relationships between discourse participants.
00:08:02
Just very briefly, we'll look at some of the speech activity, so I can back up my words with data, with numbers.
00:08:08
We calculate this sort of activity: silence, overlap, solo A, solo B,
00:08:14
silent A, silent B, and total talk A, total talk B. Those are the numbers, and I'll show you a chart next
00:08:20
which shows median, maximum, minimum. Just look at the medians. We recorded;
00:08:25
we paid them for thirty minutes; they gave us ten percent extra.
00:08:30
So they actually held the telephone for thirty-three minutes, but they were
00:08:35
silent for three minutes, so they spoke for exactly the thirty minutes we paid for.
00:08:40
A was silent for half the time; B was silent for half the time. Overlapping speech, out of that
00:08:49
total, took eighteen minutes; eighteen minutes is more than half the talking time.
00:08:54
And this is the solo talk in minutes, and the overlap
00:08:59
in most cases is more than half of the solo talking time. So something very interesting is happening.
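The silence/solo/overlap bookkeeping just described can be sketched from two per-frame on/off activity tracks. The function name, frame rate, and toy data below are illustrative assumptions, not the speaker's actual tooling:

```python
# Sketch of the activity statistics described above, assuming each speaker's
# channel has already been reduced to a per-frame on/off track.

def activity_stats(a, b, frame_sec=0.1):
    """a, b: lists of booleans, True = speech detected in that frame."""
    assert len(a) == len(b)
    n = len(a)
    overlap = sum(1 for x, y in zip(a, b) if x and y)       # both talking
    solo_a = sum(1 for x, y in zip(a, b) if x and not y)    # only A talking
    solo_b = sum(1 for x, y in zip(a, b) if y and not x)    # only B talking
    silence = n - overlap - solo_a - solo_b                  # neither talking
    to_sec = lambda frames: frames * frame_sec
    return {
        "talk_a": to_sec(solo_a + overlap),   # total talk per speaker
        "talk_b": to_sec(solo_b + overlap),
        "solo_a": to_sec(solo_a),
        "solo_b": to_sec(solo_b),
        "overlap": to_sec(overlap),
        "silence": to_sec(silence),
    }

# Tiny example: A talks frames 0-3, B talks frames 2-5, frames 6-7 silent.
a = [True, True, True, True, False, False, False, False]
b = [False, False, True, True, True, True, False, False]
stats = activity_stats(a, b, frame_sec=1.0)
```

With real data the same table directly yields the medians shown on the slide.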
00:09:05
Now you can begin to look at these pictures with more understanding; you can begin to interpret this.
00:09:11
You can see flow; you can see parts where one person is dominating and the other person is
00:09:16
actively participating but receiving, and then they shift the balance and it goes across the other way.
00:09:23
Machines can do this as well, because they don't have to do
00:09:25
any speech analysis, any speech recognition; it's basically noise detection, on/off.
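"Noise detection, on/off" can be as simple as an energy threshold per frame. The frame size and threshold below are assumptions; a real system would calibrate them against the channel's noise floor:

```python
# Minimal energy-based on/off detection, one boolean per frame.

def frame_energy(samples):
    return sum(s * s for s in samples) / max(len(samples), 1)

def speech_on_off(signal, frame_len=160, threshold=0.01):
    """Return True for each frame whose mean energy exceeds the threshold."""
    frames = [signal[i:i + frame_len] for i in range(0, len(signal), frame_len)]
    return [frame_energy(f) > threshold for f in frames]

# Example: one loud frame between two near-silent ones.
quiet = [0.001] * 160
loud = [0.5] * 160
track = speech_on_off(quiet + loud + quiet)
```

The resulting boolean track is exactly the input the activity statistics need; no recognition is involved.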
00:09:34
Now we can start looking at some details. JFA, JMA, first conversation:
00:09:39
JFA is a Japanese female, JMA a Japanese male. Actually, the female
00:09:45
again is pink, but that's not a general rule. If we go back one:
00:09:49
yeah, in this case the male is pink and the female is blue.
00:09:52
That's a Japanese male, sixth conversation, talking in English; the female is blue here.
00:09:59
That's a typically balanced conversation. If we go back to this lady with a
00:10:04
partner just a bit younger than she is: she's thirty years old, he's twenty-five.
00:10:10
She intimidates him; in the first conversation he can't get a word
00:10:15
in edgewise. He's politely responding, and we can see the transcription here.
00:10:23
First conversation of strangers, so conversation zero-one: they're absolutely terrified; they don't know what they're getting into.
00:10:31
And I think she's trying to calm him, or, well, let's not draw inferences; there'll be time for discussion later.
00:10:38
This is a little bit... you can play it: if you click on any of these you can see the text,
00:10:42
for those of you who read Japanese, and if you double-click you can hear the speech, so you can understand it as well.
00:10:48
JFA, JMA, second conversation. Same people, second
00:10:52
conversation, so it may or may not be her characteristic:
00:10:57
she does tend to dominate, if you count the time pink is talking, which is her, against blue.
00:11:04
And another one: JMA and EFA. The same Japanese guy,
00:11:09
twenty-five, and this time it's an English-speaking female; she's also twenty-five.
00:11:14
Um, for those of you who like dirty pictures, this is them having
00:11:20
fun with each other: he's chatting her up and she's really cooperating.
00:11:23
There's a lot of laughter, a lot of very clipped, very brief flirting behaviour.
00:11:29
They couldn't see each other, but basically they got to know each other quite well.
00:11:34
Okay, so then we can think of a concept of measuring flow just from
00:11:39
this plus/minus noise: the ratio of speech to non-speech activity at any time.
00:11:47
And in fact, if you scale it by the length of the current utterance, that's a
00:11:50
useful measure. If it's high, it says that the speaker is dominating the discourse,
00:11:56
and if it's low, it says that the speaker may be listening, maybe thinking; we don't know, but not dominating.
00:12:04
If the conversation follows a ping-pong pattern, then you'd expect these figures to give you a negative correlation.
00:12:12
On the other hand, if you get a positive correlation, I think it's quite interesting, because it says
00:12:17
that both participants tend to speak at the same time and to be quiet at the same time.
00:12:24
uh_huh
00:12:26
Okay, that's the formula. Basically, if you think of
00:12:29
three speech utterances, that gives you four speech silences, so you sum these
00:12:33
and take the average, sum those and take the average, you take the ratio, and you scale it
00:12:36
by the length of the current utterance. So basically you're comparing ratios of speech to non-speech activity.
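One reading of that description can be sketched as below. This is my reconstruction of the flow measure from the spoken description, not the speaker's exact formula; the durations are made up:

```python
# Flow as described: average speech-segment length over average silence
# length, scaled by the duration of the current utterance.

def flow(speech_durs, silence_durs, current_utt_dur):
    """speech_durs: recent speech segment lengths (s); silence_durs: the
    pauses between and around them. Returns a dominance-like score."""
    mean_speech = sum(speech_durs) / len(speech_durs)
    mean_silence = sum(silence_durs) / len(silence_durs)
    return (mean_speech / mean_silence) * current_utt_dur

# A speaker producing long stretches with short gaps scores high (dominating);
# short contributions separated by long silences score low (listening).
dominant = flow([4.0, 5.0, 6.0], [0.5, 0.4, 0.6, 0.5], current_utt_dur=5.0)
listener = flow([0.5, 0.3, 0.4], [4.0, 5.0, 6.0, 5.0], current_utt_dur=0.4)
```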
00:12:44
And this is an extreme case: the first
00:12:47
thirty minutes of the last conversation between JFB
00:12:51
and a Japanese female partner, JFC. Here the flow would be very high for blue and very low for pink;
00:12:59
blue definitely dominates. Here is another one: speech activity, the first thirty minutes
00:13:05
of the last conversation between a Japanese female and her partner JMA; beautifully balanced.
00:13:11
They've found things that they can talk about, they can be of equal status with each other, et cetera. If we look at the overlaps,
00:13:18
if you look at the overall average, we find that JFB and JMC, this woman and the man
00:13:25
who did not hit it off too well together, have a very high negative correlation: when she talks he listens, when he talks she listens.
00:13:33
But these two guys, JMC and JMB, two young guys, talk about baseball
00:13:37
or whatever: when one says "yeah, yeah, they won", the other guy says "yeah, yeah, they won",
00:13:42
and then they go "ah", you know; it's the dance again.
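The ping-pong versus dance distinction falls out of a plain Pearson correlation over the two speakers' per-window activity values. The series below are invented to show the two regimes:

```python
# Negative correlation -> alternation (ping-pong); positive -> joint
# talking and joint silence (the "dance").

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Alternating speakers: when A's activity is high, B's is low.
ping_pong = pearson([0.9, 0.1, 0.8, 0.2, 0.9, 0.1],
                    [0.1, 0.9, 0.2, 0.8, 0.1, 0.9])
# Speakers who talk, and stay quiet, together.
dance = pearson([0.9, 0.8, 0.1, 0.2, 0.9, 0.1],
                [0.8, 0.9, 0.2, 0.1, 0.8, 0.2])
```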
00:13:49
So talk is talk, but what we do as
00:13:52
talk is a much more complex, much more interesting social ritual.
00:13:58
Okay, let's take a short
00:14:01
digression here into some Japanese non-verbal speech signals.
00:14:07
These are, from my corpus, the top one hundred utterances. Those one hundred
00:14:13
utterances account for more than half of the number of utterances in the corpus.
00:14:19
Ten thousand times somebody said "un",
00:14:23
and that means yes. Eight thousand six hundred times "hai hai",
00:14:30
which means yes, and then there's lots of laughter, three thousand five hundred times. "Un"
00:14:36
means yes, and then "aa" means yes, and "hai" is yes, and "uun" is yes,
00:14:46
or means no, or "maybe later", or "let me think about this", and
00:14:50
then "un", "aa", "ee", "huh", et cetera.
00:14:55
If you do pattern matching, bearing in mind that one of these dashes means elongation, elongation of a vowel,
00:15:04
you see that these are very simple utterances; they're highly repetitive, with very complex morphology; they
00:15:11
can be very long: "yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah", thirteen times.
00:15:18
Now, "yeah", and "yeah yeah", and "yeah yeah yeah yeah" may
00:15:22
mean different things, but the point is that they are very, very, very frequent,
00:15:28
they're very simple, they're very common, and they carry very complex information.
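With the transcripts as lists of utterances, a frequency table like the one on the slide is a few lines. The tokens below are stand-ins, not the real corpus counts:

```python
# Counting the short non-verbal tokens in a transcript.
from collections import Counter

utterances = ["un", "hai hai", "un", "honma", "un", "aa", "hai hai", "un"]
counts = Counter(utterances)
top = counts.most_common(2)   # the most frequent backchannel forms
```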
00:15:34
So let's look at this; it's way off track, but they're interesting numbers:
00:15:39
the Japanese female talking to these various people. CF is a Chinese
00:15:42
female, CM a Chinese male, EF an English female, EM an English male, JF a Japanese female, JM a Japanese male.
00:15:46
Okay, they talked to Japanese people more often than they talked to foreigners, because I was
00:15:49
more interested in that kind of data. But you look at "un": it is used
00:15:55
quite a lot with Chinese people, quite a lot with
00:15:58
Japanese people, but not so often with English people. There's a story there.
00:16:05
"Demo": demo means "but"; it's a logically complex operator: no, no, no, no, yes, yes.
00:16:12
When Japanese talk to Japanese, they use these more complex structures; it's obvious.
00:16:17
What about this one, "hai", the very formal yes? I would not have expected that yes here:
00:16:24
Chinese, Chinese, English, English... what?
00:16:33
Let's go on and take this example of yes, and go on to
00:16:37
looking down through the conversations of JFA and JFB.
00:16:40
You know these two people now. First conversation, second, and so on through to the
00:16:43
tenth conversation. The first time, she uses "hai" twenty-six times;
00:16:47
second conversation, thirteen; spot the pattern: seven, four, three, one. That's a beautiful curve if you plot it.
00:16:58
When they don't know each other, even Japanese with Japanese, they're on this kind of very tentative,
00:17:03
social "hai, hai", and as they get to know each other it's "un, un, un",
00:17:10
and that comes out of the data. What we see here is bonding. Anyway,
00:17:18
this is another form. If the sound works, I'm going to play some samples now of
00:17:26
one of these sounds, the Japanese word "honma". "Honma" is from the local
00:17:31
dialect of the area where I live, and it means "really". It can be
00:17:35
used as a modifier, like "really hot", as in "this room is really hot", "this room is
00:17:39
really stuffy", or it can be an interjection: "oh, really? I didn't know that".
00:17:47
okay and i oh oh oh oh oh oh
00:17:56
the same lady
00:17:59
same word or oh
00:18:02
oh oh oh oh oh
00:18:12
oh oh this is different
00:18:18
Of that one word, from that one person, we have three thousand five hundred tokens. If
00:18:23
anybody wants to play with that data, it's on the web; you're welcome to it.
00:18:26
There are maybe about fifteen different classes, and they
00:18:31
vary along a dimension which I can't explain.
00:18:34
It includes emotion: she's laughing sometimes, she sounds very sad or serious other times, but
00:18:40
emotion per se is not the best terminology to explain that dimension; that's what I want to say here.
00:18:45
I can summarise that part of the talk by saying that common events facilitate simple comparisons.
00:18:52
This noise is very, very frequent; even if you've never met her before, you talk for five minutes and you've heard twenty of them.
00:19:00
You hear the first two or three and you get a baseline, and then you can,
00:19:05
you can make a comparison: this one is louder, longer, softer, harder than the previous one.
00:19:11
So from these very frequent simple sounds the listener can estimate the affective states of the speaker.
00:19:18
They're simple enough to act as carriers of voice quality and prosodic information, and that's all they can carry, and they're
00:19:24
interspersed very readily throughout the speech. So, I claim... many people say that spontaneous speech is ill-formed:
00:19:31
what we produce is a kind of noisy representation of this beautiful abstract language.
00:19:37
I don't think so. I think it's well-formed. I think those noises,
00:19:41
hesitations, quote, fillers, uh, they carry
00:19:45
very, very useful prosodic, social information.
00:19:50
I haven't discussed discourse control all that much, but what I want to talk about here is multimodal data,
00:19:57
processing large amounts of multimodal data, because, you know, we need to look as well as listen.
00:20:03
The previous work was with speech, speech only, I'm proud to say, and that's the truth. This is
00:20:09
already on the SSPNet portal, so you can download it; it's there.
00:20:16
This is the machine, part of the machinery we used to capture; we capture a lot of real
00:20:21
interactive speech. We put this little bag on the table; it in fact contains a 360-degree camera
00:20:27
which captures everything around it, thirty degrees down, six degrees up, so
00:20:31
in this angle around itself; um, a little, very old now, stereo
00:20:37
wave recorder with very high-quality mics. And then in the background somebody sits,
00:20:42
like the guy at the back, and just checks that, you know,
00:20:46
people haven't put a bag of crisps in front of it, for example, in which case you can't do anything: you
00:20:50
can't say "move the bag"; you live with half the data missing and you make a note of it somewhere.
00:20:56
In its newer form it looks like this, and I stuck an array of microphones on it at one time.
00:21:03
In fact I don't use any of the mics. I took the lens out; it's a tiny, tiny thing, you can unscrew
00:21:09
the lens, it's beautiful. The microphones are not necessary, because
00:21:13
actually just noise/no-noise, speech/no-speech, is sufficient.
00:21:18
So the microphone in this machine, any cheap one, is enough, I think, for this type of work. This one
00:21:24
puts out an analogue signal, and it's better to have a digital signal, so you could spend a lot of money
00:21:30
on something of professional quality, something like that. Or how about this: this is a
00:21:33
digital flat camera; this does the same thing. Look at the size of it,
00:21:38
and these two lenses. You can just plug it in on a dust-covered
00:21:42
table and it sits there like a decoration, but it watches everything that goes on around it.
00:21:50
uh okay
00:21:52
mm
00:21:54
This is me with this lady;
00:22:01
she signed a contract for this.
00:22:07
ooh
00:22:08
Damien is a graphics guy; he's done amazing face tracking for me.
00:22:19
Christine is the pragmatic linguist, the discourse analyst.
00:22:24
This lady is her friend. We had three days of filming contracted.
00:22:31
She's very fluent in English. Belgian, Finnish,
00:22:35
Australian, expat Brit, Japanese: we're all speaking in English.
00:22:40
And we did ninety minutes each day, recorded onto disk directly, so
00:22:43
that's the kind of data I'm working with. You can see here one, two,
00:22:47
three of those cameras, a couple of microphones and stuff; we also have a
00:22:51
microphone hanging from the ceiling for very high-quality sound.
00:22:55
All that stuff is on the web; you can see it. The
00:22:59
ATR site is the site that died when the lab I was working at closed,
00:23:06
but this page has now been reconstructed, with Michel's help, on the
00:23:10
SSP server, and you can see day one, day two, day three in various formats,
00:23:15
from the 360-degree camera, et cetera, with flat cameras from
00:23:19
here, from there, et cetera. So there's a lot of video data, and it's also labelled.
00:23:25
We did a topic list, the themes of
00:23:28
the conversation, and an emotion list, the heat of the conversation,
00:23:34
and we had two or three labellers listening to this stuff and annotating it. We didn't use
00:23:38
FEELTRACE, but it was similar: high activation, low activation, high positive,
00:23:43
and the various charts it produces. If you go into this link here you get the raw forms of data
00:23:50
that you can download: angle one, angle two, angle three, 360-degree, et cetera, with audio.
00:23:56
Um, some bad sound, but you have to live with that; if you have enough capture devices you can typically recover a little bit.
00:24:03
Okay, annotations. This is what the transcription looks like, but that's
00:24:07
a terrible way to access data: you cannot reach discourse from text.
00:24:14
This is better: this is topic data, time-aligned.
00:24:18
What happened at the change of topic, who is mainly talking and listening and
00:24:22
reacting, and the mood: is it heated or quiet, interested, very funny, a bit quiet.
00:24:27
"Damien" was the topic: "gentle, understanding, kind". I think this is like the script of a movie,
00:24:32
but it's done in retrospect: we had people look at it, and I basically said, do me a scene analysis
00:24:38
and describe each scene for me. And we also have this software, which is great.
00:24:44
The four people are colour-coded, and we have automatic head tracking and automatic speech activity from the transcription.
00:24:52
So, this is the Flash interface, a Flash movie; as
00:24:57
you scroll through it you can see here four colours, the four people speaking.
00:25:02
You see the discourse interactions, you can see the activity, and you can also
00:25:06
see the output of each of the colour-coded head trackers and body trackers.
00:25:12
If you can find heads in a video, you know that typically, unless
00:25:17
you're in space or a swimming pool or something, there is going to be a body underneath.
00:25:21
From head tracking you get a body tracker free: you can measure the movement in this area and you can measure the movement
00:25:26
in this area separately, and just by looking at movement I think you can infer a lot. I will go on to that.
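"Find the head and you get the body tracker for free" can be sketched by deriving a body region from a head bounding box, so the two areas can be measured separately. The proportions below are assumptions for illustration, not the tracker's actual geometry:

```python
# Derive a body bounding box from a detected head box.

def body_box_from_head(head):
    """head: (x, y, w, h), pixel coordinates with y increasing downwards."""
    x, y, w, h = head
    body_w = 3 * w                     # shoulders: roughly three head-widths
    body_h = 4 * h                     # visible torso: roughly four head-heights
    body_x = x - (body_w - w) // 2     # centred under the head
    body_y = y + h                     # starts just below the chin
    return (body_x, body_y, body_w, body_h)

head = (100, 50, 40, 40)
body = body_box_from_head(head)
```

Movement can then be measured independently inside each box, e.g. by frame differencing.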
00:25:33
um
00:25:34
Okay, if you look up what active listening is, the definition
00:25:38
says something I personally don't really agree with, but maybe that's a standard definition.
00:25:42
It says that active listening is a structured way of listening and responding to others
00:25:47
which focuses attention on the speaker: suspending one's own frame of reference
00:25:51
and suspending judgement are important in order to fully attend to the speaker.
00:25:55
I think that's just eccentric. What I mean by active
00:25:58
listening is participation, and I'll show you that. Maybe this is true:
00:26:04
having the ability to interpret a person's body language allows the listener
00:26:08
to develop a more accurate understanding of the speaker's words. I would support that,
00:26:13
though I tend to think that it's actions rather than words.
00:26:17
Participants actively engage in discourse in an overlapping and complementary manner.
00:26:21
My work now focuses on the contributory, on participatory discourse actions, rather than only on cognitive attention
00:26:28
states of the listener. These actions are physical observables, and they can easily be measured.
00:26:35
so
00:26:36
Now we have a different view of interaction to model.
00:26:40
I'm not processing discourse-focused content; I'm much more interested in the dance.
00:26:46
This is a socially evolving event; it's multifaceted, multidimensional, and it's integrated,
00:26:52
synchronised. It's loosely based around the frame, the locus of synchrony, which I don't yet understand,
00:26:58
but I know the temporal dynamics are essential.
00:27:02
I can show you how closely aligned they are, but I don't have a model of it yet.
00:27:08
Anyway, words that often come up in this context are engagement,
00:27:13
entrainment, mutual cooperation. We had a session yesterday on music,
00:27:18
which I was able to overhear a little bit, and these words also
00:27:22
came up. Musicians do it explicitly; we're all musicians when it comes to social interaction.
00:27:30
Okay, so this is what one aspect of my data looks like. Here you can see,
00:27:39
in unit one, grey is dominating, green is taking part in
00:27:44
another conversation, and the others may or may not be present,
00:27:48
except here, about here, you get an explosion; green comes in, and then again another
00:27:54
explosion, and here a very definite explosion, and here an even bigger one, and here, wow.
00:28:00
Something's happening between these four people, where they gradually come together.
00:28:08
Something is being talked about and the others come in; they join in and they take part in it. You can see waves,
00:28:14
or you can think of waves going around that table, if you like. These are the things I want to quantify.
00:28:22
um
00:28:24
It's very difficult to look at it in this format,
00:28:27
but you can see what they're talking about, some particular text, as you scroll through.
00:28:33
I didn't want to do a live demonstration here, because things can go wrong, and they will go wrong, but if you're interested I have all this on my machine.
00:28:40
If you go to the SSP site you can download and play with this: you click
00:28:43
anywhere on that previous screen and you'll come out at that spot in the data, and it plays.
00:28:50
And then we get the head tracking. We actually had a dummy head on day one to measure the drift in the head tracker:
00:28:57
we know this thing doesn't move, but you do get movement on that,
00:29:01
and you can recalibrate. The cameras also get knocked, and it's nice to be able to correct for that.
00:29:09
Anyway, looking at this you see the speech movement, the
00:29:13
explosions; you can see the body movement and head movement at the same time.
00:29:18
It's common sense: when people laugh, they move their hands, ha ha ha;
00:29:24
when people talk, they nod.
00:29:28
So, not surprisingly, when I talk, my head moves with my body; my own correlation is really high.
00:29:36
Any person's head correlates at about 0.8 with their body
00:29:39
throughout that data. That's common sense. But the interesting thing here
00:29:44
is that my body and head synchronise with your speech to a
00:29:48
significantly high degree, 0.4, 0.45 or higher.
00:29:54
And these are high correlations. Uh, what does it show? It shows that the people are present, it shows
00:30:02
that they're attentive, it shows that they're sharing, it shows that they are forming a bond, if you like.
00:30:08
And, more importantly, these things can be measured, because we have these data streams.
00:30:14
And I find this remarkable. Okay, here green is talking, green is talking,
00:30:19
and there's a lot of activity, um, no activity on red and yellow and grey; they could be dead,
00:30:25
they might be sleeping. But here we have a peak, and a peak, and a peak, and a kind of
00:30:30
peak, which massively coincide. The fact that those peaks coincide confirms to me that they were not dead:
00:30:39
they weren't sleeping, they were listening. I can go back now and say, from zero data, zero movement: yes, they were listening,
00:30:46
because I get such a big peak here.
00:30:49
And then you look at this, and if you look at the video, it's almost like they're puppets being pulled by the
00:30:55
same piece of string: one person goes forward, the other person goes back,
00:30:58
or they both go forward together. There's a tremendous synchrony in the movement.
00:31:05
Okay, so we can look at these traces of the speech activity, and we can look at the traces of
00:31:10
the movement data, if you low-pass filter them, at about ten frames per second, so it's like a slow movie.
00:31:18
We can see that they align really precisely. I was expecting a cascade of movements as participants react, so
00:31:25
Jane says something and John laughs, and then Fred laughs, and then Mary
00:31:29
laughs, you know, like a disease going around the room.
00:31:33
It turns out that if you just low-pass filter, then a window of
00:31:37
one frame is sufficient to capture many of those synchronous movements, these activity peaks.
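Low-pass filtering plus peak picking can be sketched with a moving-average smoother and a simple local-maximum detector. The window sizes and the two toy traces are assumptions, not the corpus values:

```python
# Smooth two movement traces, pick peaks, and find peaks that coincide
# within a small frame tolerance.

def smooth(trace, window=3):
    """Moving average; shrinks the window at the edges."""
    half = window // 2
    return [sum(trace[max(0, i - half):i + half + 1]) /
            len(trace[max(0, i - half):i + half + 1])
            for i in range(len(trace))]

def peaks(trace, min_height=0.5):
    """Indices of local maxima above min_height."""
    return [i for i in range(1, len(trace) - 1)
            if trace[i] > min_height
            and trace[i] >= trace[i - 1] and trace[i] > trace[i + 1]]

def coincident(p1, p2, tol=1):
    """Peaks in p1 that land within tol frames of a peak in p2."""
    return [i for i in p1 if any(abs(i - j) <= tol for j in p2)]

a = [0, 0, 1, 3, 1, 0, 0, 2, 5, 2, 0]   # person A's per-frame movement
b = [0, 0, 0, 3, 1, 0, 0, 0, 4, 1, 0]   # person B's per-frame movement
shared = coincident(peaks(smooth(a), 1), peaks(smooth(b), 1))
```

Coincident peaks across participants mark the bursts of joint activity described above.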
00:31:46
Activity peaks in the movement indicate bursts of high interaction in the conversation,
00:31:51
and there are very clear sequences: propositional units, propositional content,
00:31:55
where one person is speaking at length and the others are very static, listening, and then bursts of
00:32:02
high activity at the transitions, high engagement. Those, I think, are key points in the interaction; you can make inferences.
00:32:09
If everybody listens attentively and then laughs, and then somebody else starts talking, you have a topic change.
00:32:17
So you can do topic detection using this data, and we haven't even started thinking about transcription yet.
00:32:23
This is just sensory input from a camera and, if you want, sensory input from a single microphone.
00:32:30
Those automatic measurements typically correlate at rates above about 0.8, and I claim that they
00:32:37
probably render a manual transcription irrelevant, in that, you know, we can capture much, much more data.
00:32:43
We can do very simple analysis of movement traces and noise traces, and then we
00:32:47
can go in and focus the manual labour on the points of high engagement, high activity.
00:32:54
Here are the head traces for the same four people, and the body movement traces for the same four people, and
00:33:00
this is just the sum of the whole lot: so this is the group activity, and these are the individual activities.
00:33:06
It's remarkable, I claim: absolute alignment here, a
00:33:11
slight lead for the yellow, but absolute for everybody
00:33:14
else. There are so many positions where those peaks co-occur that you can definitely get a technology out of it.
00:33:22
Okay, it's remarkable synchrony.
00:33:27
As I said, I expected a sliding window would be necessary, but that's not the case; with
00:33:31
low-pass filtering we find very, very close activity peaks that reveal special moments in the discourse.
00:33:38
so in summary i hope picking up your computer was assigned i have five minutes left or something um
00:33:45
okay okay five to get the slider on the web or they will be an awful lot of time um
00:33:53
with respect to sequence of moves in the social conversation interaction this is a quote from adam can
00:33:59
the personal communication each contributes to the emergence of
00:34:03
a joint lisa same system of coordinated action patterns
00:34:06
and the emergent common understanding is well maybe the cognitive consequences
00:34:13
yeah he's gonna make claims about cognitive consequences i'm happy to produce the technology
00:34:18
but we have to share our knowledge and that's why i'm very excited
00:34:21
to be here anyway this particular talk a key point made in this paper
00:34:25
is that whereas previous work required lengthy and expensive manual transcription of this audio visual multimodal data
00:34:31
the proposed automatic procedures to derived from very simple easily downloadable
00:34:36
free image processing that show a very high correlation with transcribed speech
00:34:43
basically if you don't know the c. v. you have this technology they needed some wonderful hacks to make it very robust
00:34:49
and in a constraint situation where heads a vertical and they tend to
00:34:54
look at the camera once in a while we can do better than ninety five percent hit detection
00:35:01
range of like you've done outside than inside this light is fine
00:35:06
so the technology is there and i'm very all that to say that this multi
00:35:11
little conversation that it's not really available on the s. s. p. what side um
00:35:16
they suggested i give it a name and free talk seem to be a a an adequate description um
00:35:22
special thanks to michelle who actually up with all the stuff catherine who initiated this she asked for um
00:35:32
use it a lot of people around the world are playing with this there is no duty except that
00:35:38
if you derive anything from it but it back on the systems other people can benefit from that um okay
00:35:46
sure wrap up um active listening is something that really
00:35:52
excites me at the moment typical traditional technology says that
00:35:56
speech synthesis talks i just the box it goes on that doesn't do any
00:36:00
sensing it doesn't even look around to see if there are any people in there
00:36:05
better still if you could find the people and detect their participation
00:36:10
then it could rephrase it could restructures off it could repeat simplify jump ahead it's such
00:36:16
so we could have some very intelligent speech uh interaction lips what um
00:36:23
people talk interactively they overlap very often an overlap is not a performance or
00:36:30
i think overlap is a social act uh it's not just ping pong turn taking
00:36:36
with respect to discuss synchrony um
00:36:40
people interact together and the best
00:36:43
it's like a time go it's not partner a is dominant partner before those if you watch people
00:36:51
dancing the tango i had the great pleasure to walk through a park in dublin added time go first
00:36:57
was about ten couples with an argentine violinist guitars or and they
00:37:02
were dancing and it wasn't that the man led and the woman followed
00:37:07
she knew exactly where he was going and she was just there when it happened
00:37:12
i think that's what happens in speech it's not that you talk i process and then react
00:37:17
i pro actively participate in our conversation
00:37:23
and i finish your sentences because i know what you're gonna say is not interrupting you it's complimenting you
00:37:29
with me are um tempo dynamics are something i did not yet understand but i think
00:37:35
there's a very magical element there we could understand the temple dynamics of this type of interaction
00:37:42
oh okay uh parts of this talk were also presented at the
00:37:47
into speech special session of the same name discourse some green active listing
00:37:51
on the ninth day of the ninth month of the night you of the third millennium
00:37:57
and it was no terrorist activity unless this could be cool but um so okay um
00:38:07
a acknowledgements i want one says going through the and the the first part is that knowledge
00:38:13
is the n. i. c. t. in eighty are in japan who used to phone me um
00:38:18
due to coach in dublin i i've give me money to carry on this work for the next five
00:38:23
years and i still have to use left the japanese government funded so i i'm looking at both sides
00:38:28
i would think about time i'm i'm headed to here because she
00:38:32
doesn't have the confidence to believe that she is a very strong program
00:38:36
but uh programming is available if you have data which is a similar format
00:38:42
we have time aligned information of any form you can also your your data
00:38:48
and finally we are recruiting so if you wanna control haven't tasted gonna see the very welcome to come see
00:38:54
okay actually stopped there
00:39:04
uh_huh
00:39:11
yeah
00:39:18
yes of course
00:39:20
the
00:39:23
exactly sensing technology
00:39:32
if you
00:39:41
exactly yes you cadence circus the cadence posing and and page topic and so forth if you look at the ends
00:39:51
of utterances nigel ward just on my technology the people katie age inset and and got a locking a list of some
00:39:58
i've made a very nice machine where they actually model much more complexity for that to take what direction to provide feedback
00:40:06
but i think it's not just physical cues i think there's some
00:40:10
things well i i can i i i need musical help in this
00:40:16
music rhythm is very clearly formalised and there's a framework and the
00:40:21
skill in music is deviating from the framework in a a controlled away
00:40:28
that's what makes performance as opposed to rendition i i think
00:40:31
we are masters of that performance in this kind of discourse
00:40:36
i mean here is different but if you good assessment people eating all good probably because
00:40:41
when there are no external 'cause i just noticed this conversation
00:40:44
that's when the social aspects can emerge was strong i think and
00:40:50
yeah okay interrupt me gothic
00:40:55
actually
00:40:58
oh no i haven't out i got that that ugliness
00:41:03
yeah it's series every time
00:41:08
what got these three days they one day today three same people coming and so yes it is a series overlap
00:41:19
anyone they even in any one moment together or even one topic group
00:41:25
somebody will introduce something and then it sparks of memories or
00:41:30
key points docking points in the other three you know that mention straighten gotten some years as a yeah yeah i've been there
00:41:36
and then it's all they had this problem the corner kind of thing that people that hiding information the cumulative way
00:41:43
i think he still can't action is the signal we give them this data or you i i'm i'm
00:41:53
i'm a novice in these fields but i can use tools like s. b. m.'s markov models et cetera
00:42:00
i mean you can use it was
00:42:05
that's defined data that's part of the of the that 'cause it's a time aligned annotation
00:42:10
the ground me yeah
00:42:17
come back next week or beginning to believe no facts
00:42:30
oh
00:42:32
yeah
00:42:39
the stream of the student or the comes in from a camera exactly like a doughnut
00:42:43
because the the three hundred and sixty degree thing is very hard to reserve the
00:42:46
first thing we do is is virtually straight nights and then we choose reason of interest
00:42:52
i'm from their warheads appointing up relatively speaking i detection we've got within go
00:42:58
down below the had two point five times the width and that gives us about
00:43:03
and we simply do um
00:43:06
i forget the term for it but there's
00:43:09
yeah optical flow of um within the box around the box within this box around this box so there's no i guess
00:43:17
that there is movement left the right x. coordinate y. coordinate and today is it corner
00:43:21
because we have and um zoom if you like people come close room for the word
00:43:27
that's there and the same for the body so it's x. y. and said for the two parts times number of people
00:43:38
yeah
00:43:58
uh_huh
00:44:02
yeah
00:44:07
uh_huh
00:44:12
uh_huh
00:44:16
that also it sure is a gross exaggeration to say that we adopt all the time that the crazy but
00:44:22
if if you can measure that then you can measure that the amount of discord
00:44:26
in the amount of delay feedback et cetera and that again becomes rich information source
00:44:33
um
00:44:36
yeah
00:44:42
yeah of course
00:44:47
oh i would like to make that generalisation but i don't at the moment i can't say that
00:44:54
it's it's very complex
00:44:56
yeah i'm noise cancellation is a magic word in this context you know the headphones if you
00:45:01
were on the aeroplane it takes the background noise and subtract seven it leaves the the significant noise
00:45:07
we need to we how are you going to say we need to develop the technology like noise cancellation where we can look at
00:45:14
the the round the head scratching and the the nonspeech related movement
00:45:19
i'm that the speech related movements emerge because of the synchrony across them
00:45:24
so we can use the the multi tracks to do this noise cancellation
00:45:29
talking of which i like we'll do that okay
00:45:36
yeah
00:45:40
yes
00:45:42
uh_huh he
00:45:52
this is also the final yeah of course are differences
00:45:55
there are cultural differences there a language related differences there is socially related differences
00:46:01
bus drivers do it but they do it differently from office workers office
00:46:04
workers do differently from university purposes but they'll do the same kind of thing
00:46:09
i don't think this technology would not work for any particular language of
00:46:12
cultures just you'd you you couldn't generalised models trained in one region too
00:46:18
yeah but it's going back to what i said earlier speech is such a wonderfully efficient mechanism that it
00:46:24
contains docking points it contains repetitions to allow you to former based and to make a a local compare
00:46:31
even if you've never met me before you don't know my voice you don't know my speaking characteristics by comparing
00:46:37
small changes in the the frequent simple actually relevant
00:46:41
uh sections then you can do that kind of processing
00:46:47
so i think we can get a robust technology
00:46:57
yes this means right
00:47:01
with that
00:47:03
his mode of ways you can do that i just don't see saying why
00:47:09
oh
00:47:12
i don't think it does work unless that one's male and female and they have other things and then my wife
00:47:19
is japanese and when we first went out my japanese was very very pool but we managed to communicate very well um
00:47:27
so we
00:47:35
yeah

Share this talk: 


Conference Program

Tracking 'the 2nd channel' of information in speech
Nick Campbell, Trinity College Dublin
Sept. 13, 2009 · 2:30 p.m.
Tracking 'the 2nd channel' of information in speech [slightly higher video quality]
Nick Campbell, Trinity College Dublin
Sept. 13, 2009 · 2:35 p.m.

Recommended talks

Speech Graphics Presentation
Gregor Hofer
Feb. 24, 2015 · 10:10 a.m.
138 views