Transcriptions

Note: this content has been automatically generated.
00:00:00
It's a privilege to be here. I'm sure that many of you have seen some of these slides before,
00:00:07
but that's not an apology, because you won't have seen them in the same order or the same combination, and I know that nobody has seen them all.
00:00:13
If you think of slides like spices and a talk like a meal, then I hope you can enjoy my impromptu cooking.
00:00:21
Um, and yeah, we're going to talk about
00:00:25
social signals. In fact, this is the seventy-fifth anniversary
00:00:31
of "Quantitative Analysis of the Interaction of Individuals". This is a copy of
00:00:36
the front page of the journal, from January 13th, 1939.
00:00:43
What happened was, he was doing the quantitative analysis of synchrony and
00:00:51
of active listening, and I think it started with his work and that of his co-workers.
00:00:55
He adapted a manual typewriter, the thing you used to use to type letters, fitting it with an electric motor
00:01:02
and a spool of paper, so that when something happened an observer could make a mark on the paper.
00:01:09
And he actually used a small chart recorder so that the
00:01:13
operator could measure changes in subject-to-subject activity over time.
00:01:19
He was able to observe the discourse actions of two individuals and note the timings and durations of their actions,
00:01:26
to understand the way in which sequences of interactions are organised. My screen
00:01:31
is too small; it's automatically gone into some very grand presenter mode that
00:01:35
shows you what's coming next but not what's here now. Um, this
00:01:39
was, anyway, the first recorded sequence analysis of human behaviour.
00:01:45
He wasn't an ethologist; Adam Kendon, though, is an ethologist. Ethologists typically
00:01:51
study animal behaviour, and we are animals, so they study our behaviour from that point of view.
00:01:56
Adam Kendon has written that the first task of a human ethologist, like that of any ethologist who sets out
00:02:02
to study, uh, bird or fish or monkey, must be systematic description: you
00:02:07
must set out to see what behavioural structures the human being has.
00:02:12
In doing this with people, he says, it would seem best to begin with
00:02:15
those aspects of behaviour which are most likely to be shared with other animals.
00:02:21
It kind of dismisses language as well: detailed analysis of language must eventually find a place
00:02:27
in human ethology, but these do not seem to be the best aspects of human behaviour to start with.
00:02:33
And my thesis is, you know, we have very advanced speech recognition; we can process
00:02:40
text, but most of the time the devices that process the text don't know what to do with it.
00:02:46
The text is almost irrelevant to the interaction. So what I want to focus on is not what was said, but what was done.
00:02:54
okay
00:02:56
uh
00:02:57
Conversation analysts and discourse analysts have also performed similar
00:03:02
studies, but many people claim that the work can't be generalised.
00:03:06
And my aim is to produce a technology, a machine or device or a module, which will observe
00:03:15
the systematic, socially organised procedures underlying the ways in which social actors move into
00:03:21
a mutually ratified participation in an encounter; those are the technical terms. Kendon calls that frame attunement.
00:03:29
In simpler terms, it's just speech processing, but it's not text-based speech processing.
00:03:37
Okay, the goal is to produce a technology for tracking discourse moves in conversational speech
00:03:42
by focusing on the behaviour of participants,
00:03:44
to make inferences about their discourse participation status,
00:03:49
not the text. In other words, if I talk to you: first, are you there?
00:03:54
Can you hear me? Can you see me? Are you listening? Do you understand me? Do you agree with me?
00:04:01
Did you agree with me before I said it, or do you agree with
00:04:04
me freshly now? Are you expressing surprise at that agreement? Many, many levels of participation.
00:04:11
This is a highly formal setting, where you don't have many rights to express feedback,
00:04:16
but everybody's doing it: you're all nodding, and I can tell from the timing of your nods and smiling how I can pace my talk.
00:04:23
Uh, anyway, what you're doing is active listening,
00:04:29
and I think that's a technology which is missing. We have active speaking and we have transcribing
00:04:36
machines; we do not have devices to bring the machine
00:04:39
into the interaction, where the machine monitors the participants.
00:04:46
Traditional approaches to spoken dialogue interface design, which is where I have tended
00:04:51
to work, tend to assume a ping-pong or push-to-talk style of speech interaction:
00:04:56
the system talks, you answer, it responds, you reply, and so forth. There is one ball being thrown
00:05:02
from side to side. And my talk is going to show that our data do not support this view.
00:05:07
And this is what I spend all my time looking at; I love it. Okay, I'll try and talk you through it; you'll see lots of pictures like this today.
00:05:14
It's a telephone conversation; it's speech activity. The blue
00:05:19
is JMA, the male, and in
00:05:21
this case JFA is the female speaker, and this is the sixth conversation in a series of ten.
00:05:28
We paid people to come into a room, pick up the telephone, and talk, and
00:05:32
we said, we'll give you this much money for thirty minutes each, and that's all we said.
00:05:37
And it was very frightening, because they had no idea who was going to be on the other end of the line.
00:05:41
Awful. But over a series of ten conversations they actually got to know each other; they became friends.
00:05:48
They spoke about things of mutual interest. Because we matched them male-female, et cetera,
00:05:53
et cetera, we have a very interesting corpus, and this is my preferred way of looking at that speech:
00:06:01
looking at activity patterns
00:06:05
Two-party telephone talk, no constraints on the content; people paid to talk thirty
00:06:10
minutes each week for three months, ten conversations. The conversations were manually transcribed.
00:06:16
It cost a lot of money and took a lot of time, but unless you have that human know-how
00:06:21
in your initial data and annotation, well, I think it's better to do that.
00:06:28
The resulting text is almost impossible to read; it's very, very fragmented, very, very broken.
00:06:35
Overlapping speech, you may have noticed, accounts for as much
00:06:39
as half of the individual solo talking time in these conversations.
00:06:44
There are very few grammatical sentences, but much interactive turn-taking, with the listener
00:06:49
often completing the utterances of the current speaker. They dance around the conversation.
00:06:57
Unlike the human-machine interface, real spoken dialogue has few constraints on turn-taking.
00:07:05
Both participants typically interrupt each other very often in this mutual construction. My mother said, don't talk when you're
00:07:12
being spoken to. She said, you know, if somebody speaks,
00:07:14
you listen; be polite. This data does not show that polite activity.
00:07:22
Here's the first minute blown up.
00:07:26
I mean, you do get
00:07:29
what is probably best referred to as a turn in dialogue,
00:07:33
and you get very much backchannel, or feedback.
00:07:38
But the more you look at this, the concept of the turn
00:07:43
is a very difficult one to explain, and even the concept of an utterance
00:07:46
is a very difficult one to explain. But we have data here which we can interpret
00:07:52
meaningfully, to represent stages in the discourse and, more
00:07:56
importantly, the relationship between people, the social relationships between discourse participants.
00:08:02
Just very briefly, we'll look at some of the speech activity, so I can back up my words with data, with numbers.
00:08:08
We calculate this sort of activity: silence, overlap, solo A, solo B,
00:08:14
silent A, silent B, and total talk A, total talk B. Those are the numbers, and I'll show you a chart next
00:08:20
which shows median, maximum, minimum. Just look at the medians. We recorded;
00:08:25
we paid them for thirty minutes; they gave us ten percent extra.
00:08:30
So they actually held the telephone for thirty-three minutes, but they were
00:08:35
silent for three minutes, so they spoke for exactly the thirty minutes we paid for.
00:08:40
A was silent for half the time; B was silent for half the time. Overlapping speech, out of that
00:08:49
total, took eighteen minutes; eighteen minutes is more than half the talking time.
00:08:54
And this is the solo talk in minutes, and the overlap
00:08:59
in most cases is more than half of the solo talking time. So something very interesting is happening.
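The silence/solo/overlap bookkeeping just described can be sketched from two per-frame on/off activity tracks. The function name, frame rate, and toy data below are illustrative assumptions, not the speaker's actual tooling:

```python
# Sketch of the activity statistics described above, assuming each speaker's
# channel has already been reduced to a per-frame on/off track.

def activity_stats(a, b, frame_sec=0.1):
    """a, b: lists of booleans, True = speech detected in that frame."""
    assert len(a) == len(b)
    n = len(a)
    overlap = sum(1 for x, y in zip(a, b) if x and y)       # both talking
    solo_a = sum(1 for x, y in zip(a, b) if x and not y)    # only A talking
    solo_b = sum(1 for x, y in zip(a, b) if y and not x)    # only B talking
    silence = n - overlap - solo_a - solo_b                  # neither talking
    to_sec = lambda frames: frames * frame_sec
    return {
        "talk_a": to_sec(solo_a + overlap),   # total talk per speaker
        "talk_b": to_sec(solo_b + overlap),
        "solo_a": to_sec(solo_a),
        "solo_b": to_sec(solo_b),
        "overlap": to_sec(overlap),
        "silence": to_sec(silence),
    }

# Tiny example: A talks frames 0-3, B talks frames 2-5, frames 6-7 silent.
a = [True, True, True, True, False, False, False, False]
b = [False, False, True, True, True, True, False, False]
stats = activity_stats(a, b, frame_sec=1.0)
```

With real data the same table directly yields the medians shown on the slide.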
00:09:05
Now you can begin to look at these pictures with more understanding; you can begin to interpret this.
00:09:11
You can see flow; you can see parts where one person is dominating and the other person is
00:09:16
actively participating but receiving, and then they shift the balance and it goes across the other way.
00:09:23
Machines can do this as well, because they don't have to do
00:09:25
any speech analysis, any speech recognition; it's basically noise detection, on/off.
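"Noise detection, on/off" can be as simple as an energy threshold per frame. The frame size and threshold below are assumptions; a real system would calibrate them against the channel's noise floor:

```python
# Minimal energy-based on/off detection, one boolean per frame.

def frame_energy(samples):
    return sum(s * s for s in samples) / max(len(samples), 1)

def speech_on_off(signal, frame_len=160, threshold=0.01):
    """Return True for each frame whose mean energy exceeds the threshold."""
    frames = [signal[i:i + frame_len] for i in range(0, len(signal), frame_len)]
    return [frame_energy(f) > threshold for f in frames]

# Example: one loud frame between two near-silent ones.
quiet = [0.001] * 160
loud = [0.5] * 160
track = speech_on_off(quiet + loud + quiet)
```

The resulting boolean track is exactly the input the activity statistics need; no recognition is involved.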
00:09:34
Now we can start looking at some details. JFA, JMA, first conversation:
00:09:39
JFA is a Japanese female, JMA a Japanese male. Actually, the female
00:09:45
again is pink, but that's not a general rule. If we go back one:
00:09:49
yeah, in this case the male is pink and the female is blue.
00:09:52
That's a Japanese male, sixth conversation, talking in English; the female is blue here.
00:09:59
That's a typically balanced conversation. If we go back to this lady with a
00:10:04
partner just a bit younger than she is: she's thirty years old, he's twenty-five.
00:10:10
She intimidates him; in the first conversation he can't get a word
00:10:15
in edgewise. He's politely responding, and we can see the transcription here.
00:10:23
First conversation of strangers, so conversation zero-one: they're absolutely terrified; they don't know what they're getting into.
00:10:31
And I think she's trying to calm him, or, well, let's not draw inferences; there'll be time for discussion later.
00:10:38
This is a little bit... you can play it: if you click on any of these you can see the text,
00:10:42
for those of you who read Japanese, and if you double-click you can hear the speech, so you can understand it as well.
00:10:48
JFA, JMA, second conversation. Same people, second
00:10:52
conversation, so it may or may not be her characteristic:
00:10:57
she does tend to dominate, if you count the time pink is talking, which is her, against blue.
00:11:04
And another one: JMA and EFA. The same Japanese guy,
00:11:09
twenty-five, and this time it's an English-speaking female; she's also twenty-five.
00:11:14
Um, for those of you who like dirty pictures, this is them having
00:11:20
fun with each other: he's chatting her up and she's really cooperating.
00:11:23
There's a lot of laughter, a lot of very clipped, very brief flirting behaviour.
00:11:29
They couldn't see each other, but basically they got to know each other quite well.
00:11:34
Okay, so then we can think of a concept of measuring flow just from
00:11:39
this plus/minus noise: the ratio of speech to non-speech activity at any time.
00:11:47
And in fact, if you scale it by the length of the current utterance, that's a
00:11:50
useful measure. If it's high, it says that the speaker is dominating the discourse,
00:11:56
and if it's low, it says that the speaker may be listening, maybe thinking; we don't know, but not dominating.
00:12:04
If the conversation follows a ping-pong pattern, then you'd expect these figures to give you a negative correlation.
00:12:12
On the other hand, if you get a positive correlation, I think it's quite interesting, because it says
00:12:17
that both participants tend to speak at the same time and to be quiet at the same time.
00:12:24
uh_huh
00:12:26
Okay, that's the formula. Basically, if you think of
00:12:29
three speech utterances, that gives you four speech silences, so you sum these
00:12:33
and take the average, sum those and take the average, you take the ratio, and you scale it
00:12:36
by the length of the current utterance. So basically you're comparing ratios of speech to non-speech activity.
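One reading of that description can be sketched as below. This is my reconstruction of the flow measure from the spoken description, not the speaker's exact formula; the durations are made up:

```python
# Flow as described: average speech-segment length over average silence
# length, scaled by the duration of the current utterance.

def flow(speech_durs, silence_durs, current_utt_dur):
    """speech_durs: recent speech segment lengths (s); silence_durs: the
    pauses between and around them. Returns a dominance-like score."""
    mean_speech = sum(speech_durs) / len(speech_durs)
    mean_silence = sum(silence_durs) / len(silence_durs)
    return (mean_speech / mean_silence) * current_utt_dur

# A speaker producing long stretches with short gaps scores high (dominating);
# short contributions separated by long silences score low (listening).
dominant = flow([4.0, 5.0, 6.0], [0.5, 0.4, 0.6, 0.5], current_utt_dur=5.0)
listener = flow([0.5, 0.3, 0.4], [4.0, 5.0, 6.0, 5.0], current_utt_dur=0.4)
```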
00:12:44
And this is an extreme case: the first
00:12:47
thirty minutes of the last conversation between JFB
00:12:51
and a Japanese female partner, JFC. Here the flow would be very high for blue and very low for pink;
00:12:59
blue definitely dominates. Here is another one: speech activity, the first thirty minutes
00:13:05
of the last conversation between a Japanese female and her partner JMA; beautifully balanced.
00:13:11
They've found things that they can talk about, they can be of equal status with each other, et cetera. If we look at the overlaps,
00:13:18
if you look at the overall average, we find that JFB and JMC, this woman and the man
00:13:25
who did not hit it off too well together, have a very high negative correlation: when she talks he listens, when he talks she listens.
00:13:33
But these two guys, JMC and JMB, two young guys, talk about baseball
00:13:37
or whatever: when one says "yeah, yeah, they won", the other guy says "yeah, yeah, they won",
00:13:42
and then they go "ah", you know; it's the dance again.
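The ping-pong versus dance distinction falls out of a plain Pearson correlation over the two speakers' per-window activity values. The series below are invented to show the two regimes:

```python
# Negative correlation -> alternation (ping-pong); positive -> joint
# talking and joint silence (the "dance").

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Alternating speakers: when A's activity is high, B's is low.
ping_pong = pearson([0.9, 0.1, 0.8, 0.2, 0.9, 0.1],
                    [0.1, 0.9, 0.2, 0.8, 0.1, 0.9])
# Speakers who talk, and stay quiet, together.
dance = pearson([0.9, 0.8, 0.1, 0.2, 0.9, 0.1],
                [0.8, 0.9, 0.2, 0.1, 0.8, 0.2])
```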
00:13:49
So talk is talk, but what we do as
00:13:52
talk is a much more complex, much more interesting social ritual.
00:13:58
Okay, let's take a short
00:14:01
digression here into some Japanese non-verbal speech signals.
00:14:07
These are, from my corpus, the top one hundred utterances. Those one hundred
00:14:13
utterances account for more than half of the number of utterances in the corpus.
00:14:19
Ten thousand times somebody said "un",
00:14:23
and that means yes. Eight thousand six hundred times "hai hai",
00:14:30
which means yes, and then there's lots of laughter, three thousand five hundred times. "Un"
00:14:36
means yes, and then "aa" means yes, and "hai" is yes, and "uun" is yes,
00:14:46
or means no, or "maybe later", or "let me think about this", and
00:14:50
then "un", "aa", "ee", "huh", et cetera.
00:14:55
If you do pattern matching, bearing in mind that one of these dashes means elongation, elongation of a vowel,
00:15:04
you see that these are very simple utterances; they're highly repetitive, with very complex morphology; they
00:15:11
can be very long: "yeah yeah yeah yeah yeah yeah yeah yeah yeah yeah", thirteen times.
00:15:18
Now, "yeah", and "yeah yeah", and "yeah yeah yeah yeah" may
00:15:22
mean different things, but the point is that they are very, very, very frequent,
00:15:28
they're very simple, they're very common, and they carry very complex information.
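With the transcripts as lists of utterances, a frequency table like the one on the slide is a few lines. The tokens below are stand-ins, not the real corpus counts:

```python
# Counting the short non-verbal tokens in a transcript.
from collections import Counter

utterances = ["un", "hai hai", "un", "honma", "un", "aa", "hai hai", "un"]
counts = Counter(utterances)
top = counts.most_common(2)   # the most frequent backchannel forms
```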
00:15:34
So let's look at this; it's way off track, but they're interesting numbers:
00:15:39
the Japanese female talking to these various people. CF is a Chinese
00:15:42
female, CM a Chinese male, EF an English female, EM an English male, JF a Japanese female, JM a Japanese male.
00:15:46
Okay, they talked to Japanese people more often than they talked to foreigners, because I was
00:15:49
more interested in that kind of data. But you look at "un": it is used
00:15:55
quite a lot with Chinese people, quite a lot with
00:15:58
Japanese people, but not so often with English people. There's a story there.
00:16:05
"Demo": demo means "but"; it's a logically complex operator: no, no, no, no, yes, yes.
00:16:12
When Japanese talk to Japanese, they use these more complex structures; it's obvious.
00:16:17
What about this one, "hai", the very formal yes? I would not have expected that yes here:
00:16:24
Chinese, Chinese, English, English... what?
00:16:33
Let's go on and take this example of yes, and go on to
00:16:37
looking down through the conversations of JFA and JFB.
00:16:40
You know these two people now. First conversation, second, and so on through to the
00:16:43
tenth conversation. The first time, she uses "hai" twenty-six times;
00:16:47
second conversation, thirteen; spot the pattern: seven, four, three, one. That's a beautiful curve if you plot it.
00:16:58
When they don't know each other, even Japanese with Japanese, they're on this kind of very tentative,
00:17:03
social "hai, hai", and as they get to know each other it's "un, un, un",
00:17:10
and that comes out of the data. What we see here is bonding. Anyway,
00:17:18
this is another form. If the sound works, I'm going to play some samples now of
00:17:26
one of these sounds, the Japanese word "honma". "Honma" is from the local
00:17:31
dialect of the area where I live, and it means "really". It can be
00:17:35
used as a modifier, like "really hot", as in "this room is really hot", "this room is
00:17:39
really stuffy", or it can be an interjection: "oh, really? I didn't know that".
00:17:47
okay and i oh oh oh oh oh oh
00:17:56
the same lady
00:17:59
same word or oh
00:18:02
oh oh oh oh oh
00:18:12
oh oh this is different
00:18:18
Of that one word, from that one person, we have three thousand five hundred tokens. If
00:18:23
anybody wants to play with that data, it's on the web; you're welcome to it.
00:18:26
There are maybe about fifteen different classes, and they
00:18:31
vary along a dimension which I can't explain.
00:18:34
It includes emotion: she's laughing sometimes, she sounds very sad or serious other times, but
00:18:40
emotion per se is not the best terminology to explain that dimension; that's what I want to say here.
00:18:45
I can summarise that part of the talk by saying that common events facilitate simple comparisons.
00:18:52
This noise is very, very frequent; even if you've never met her before, you talk for five minutes and you've heard twenty of them.
00:19:00
You hear the first two or three and you get a baseline, and then you can,
00:19:05
you can make a comparison: this one is louder, longer, softer, harder than the previous one.
00:19:11
So from these very frequent simple sounds the listener can estimate the affective states of the speaker.
00:19:18
They're simple enough to act as carriers of voice quality and prosodic information, and that's all they can carry, and they're
00:19:24
interspersed very readily throughout the speech. So, I claim... many people say that spontaneous speech is ill-formed:
00:19:31
what we produce is a kind of noisy representation of this beautiful abstract language.
00:19:37
I don't think so. I think it's well-formed. I think those noises,
00:19:41
hesitations, quote, fillers, uh, they carry
00:19:45
very, very useful prosodic, social information.
00:19:50
I haven't discussed discourse control all that much, but what I want to talk about here is multimodal data,
00:19:57
processing large amounts of multimodal data, because, you know, we need to look as well as listen.
00:20:03
The previous work was with speech, speech only, I'm proud to say, and that's the truth. This is
00:20:09
already on the SSPNet portal, so you can download it; it's there.
00:20:16
This is the machine, part of the machinery we used to capture; we capture a lot of real
00:20:21
interactive speech. We put this little bag on the table; it in fact contains a 360-degree camera
00:20:27
which captures everything around it, thirty degrees down, six degrees up, so
00:20:31
in this angle around itself; um, a little, very old now, stereo
00:20:37
wave recorder with very high-quality mics. And then in the background somebody sits,
00:20:42
like the guy at the back, and just checks that, you know,
00:20:46
people haven't put a bag of crisps in front of it, for example, in which case you can't do anything: you
00:20:50
can't say "move the bag"; you live with half the data missing and you make a note of it somewhere.
00:20:56
In its newer form it looks like this, and I stuck an array of microphones on it at one time.
00:21:03
In fact I don't use any of the mics. I took the lens out; it's a tiny, tiny thing, you can unscrew
00:21:09
the lens, it's beautiful. The microphones are not necessary, because
00:21:13
actually just noise/no-noise, speech/no-speech, is sufficient.
00:21:18
So the microphone in this machine, any cheap one, is enough, I think, for this type of work. This one
00:21:24
puts out an analogue signal, and it's better to have a digital signal, so you could spend a lot of money
00:21:30
on something of professional quality, something like that. Or how about this: this is a
00:21:33
digital flat camera; this does the same thing. Look at the size of it,
00:21:38
and these two lenses. You can just plug it in on a dust-covered
00:21:42
table and it sits there like a decoration, but it watches everything that goes on around it.
00:21:50
uh okay
00:21:52
mm
00:21:54
This is me with this lady;
00:22:01
she signed a contract for this.
00:22:07
ooh
00:22:08
Damien is a graphics guy; he's done amazing face tracking for me.
00:22:19
Christine is the pragmatic linguist, the discourse analyst.
00:22:24
This lady is her friend. We had three days of filming contracted.
00:22:31
She's very fluent in English. Belgian, Finnish,
00:22:35
Australian, expat Brit, Japanese: we're all speaking in English.
00:22:40
And we did ninety minutes each day, recorded onto disk directly, so
00:22:43
that's the kind of data I'm working with. You can see here one, two,
00:22:47
three of those cameras, a couple of microphones and stuff; we also have a
00:22:51
microphone hanging from the ceiling for very high-quality sound.
00:22:55
All that stuff is on the web; you can see it. The
00:22:59
ATR site is the site that died when the lab I was working at closed,
00:23:06
but this page has now been reconstructed, with Michel's help, on the
00:23:10
SSP server, and you can see day one, day two, day three in various formats,
00:23:15
from the 360-degree camera, et cetera, with flat cameras from
00:23:19
here, from there, et cetera. So there's a lot of video data, and it's also labelled.
00:23:25
We did a topic list, the themes of
00:23:28
the conversation, and an emotion list, the heat of the conversation,
00:23:34
and we had two or three labellers listening to this stuff and annotating it. We didn't use
00:23:38
FEELTRACE, but it was similar: high activation, low activation, high positive,
00:23:43
and the various charts it produces. If you go into this link here you get the raw forms of data
00:23:50
that you can download: angle one, angle two, angle three, 360-degree, et cetera, with audio.
00:23:56
Um, some bad sound, but you have to live with that; if you have enough capture devices you can typically recover a little bit.
00:24:03
Okay, annotations. This is what the transcription looks like, but that's
00:24:07
a terrible way to access data: you cannot reach discourse from text.
00:24:14
This is better: this is topic data, time-aligned.
00:24:18
What happened at the change of topic, who is mainly talking and listening and
00:24:22
reacting, and the mood: is it heated or quiet, interested, very funny, a bit quiet.
00:24:27
"Damien" was the topic: "gentle, understanding, kind". I think this is like the script of a movie,
00:24:32
but it's done in retrospect: we had people look at it, and I basically said, do me a scene analysis
00:24:38
and describe each scene for me. And we also have this software, which is great.
00:24:44
The four people are colour-coded, and we have automatic head tracking and automatic speech activity from the transcription.
00:24:52
So, this is the Flash interface, a Flash movie; as
00:24:57
you scroll through it you can see here four colours, the four people speaking.
00:25:02
You see the discourse interactions, you can see the activity, and you can also
00:25:06
see the output of each of the colour-coded head trackers and body trackers.
00:25:12
If you can find heads in a video, you know that typically, unless
00:25:17
you're in space or a swimming pool or something, there is going to be a body underneath.
00:25:21
From head tracking you get a body tracker free: you can measure the movement in this area and you can measure the movement
00:25:26
in this area separately, and just by looking at movement I think you can infer a lot. I will go on to that.
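"Find the head and you get the body tracker for free" can be sketched by deriving a body region from a head bounding box, so the two areas can be measured separately. The proportions below are assumptions for illustration, not the tracker's actual geometry:

```python
# Derive a body bounding box from a detected head box.

def body_box_from_head(head):
    """head: (x, y, w, h), pixel coordinates with y increasing downwards."""
    x, y, w, h = head
    body_w = 3 * w                     # shoulders: roughly three head-widths
    body_h = 4 * h                     # visible torso: roughly four head-heights
    body_x = x - (body_w - w) // 2     # centred under the head
    body_y = y + h                     # starts just below the chin
    return (body_x, body_y, body_w, body_h)

head = (100, 50, 40, 40)
body = body_box_from_head(head)
```

Movement can then be measured independently inside each box, e.g. by frame differencing.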
00:25:33
um
00:25:34
Okay, if you look up what active listening is, the definition
00:25:38
says something I personally don't really agree with, but maybe that's a standard definition.
00:25:42
It says that active listening is a structured way of listening and responding to others
00:25:47
which focuses attention on the speaker: suspending one's own frame of reference
00:25:51
and suspending judgement are important in order to fully attend to the speaker.
00:25:55
I think that's just eccentric. What I mean by active
00:25:58
listening is participation, and I'll show you that. Maybe this is true:
00:26:04
having the ability to interpret a person's body language allows the listener
00:26:08
to develop a more accurate understanding of the speaker's words. I would support that,
00:26:13
though I tend to think that it's actions rather than words.
00:26:17
Participants actively engage in discourse in an overlapping and complementary manner.
00:26:21
My work now focuses on the contributory, on participatory discourse actions, rather than only on cognitive attention
00:26:28
states of the listener. These actions are physical observables, and they can easily be measured.
00:26:35
so
00:26:36
Now we have a different view of interaction to model.
00:26:40
I'm not processing discourse-focused content; I'm much more interested in the dance.
00:26:46
This is a socially evolving event; it's multifaceted, multidimensional, and it's integrated,
00:26:52
synchronised. It's loosely based around the frame, the locus of synchrony, which I don't yet understand,
00:26:58
but I know the temporal dynamics are essential.
00:27:02
I can show you how closely aligned they are, but I don't have a model of it yet.
00:27:08
Anyway, words that often come up in this context are engagement,
00:27:13
entrainment, mutual cooperation. We had a session yesterday on music,
00:27:18
which I was able to overhear a little bit, and these words also
00:27:22
came up. Musicians do it explicitly; we're all musicians when it comes to social interaction.
00:27:30
Okay, so this is what one aspect of my data looks like. Here you can see,
00:27:39
in unit one, grey is dominating, green is taking part in
00:27:44
another conversation, and the others may or may not be present,
00:27:48
except here, about here, you get an explosion; green comes in, and then again another
00:27:54
explosion, and here a very definite explosion, and here an even bigger one, and here, wow.
00:28:00
Something's happening between these four people, where they gradually come together.
00:28:08
Something is being talked about and the others come in; they join in and they take part in it. You can see waves,
00:28:14
or you can think of waves going around that table, if you like. These are the things I want to quantify.
00:28:22
um
00:28:24
It's very difficult to look at it in this format,
00:28:27
but you can see what they're talking about, some particular text, as you scroll through.
00:28:33
I didn't want to do a live demonstration here, because things can go wrong, and they will go wrong, but if you're interested I have all this on my machine.
00:28:40
If you go to the SSP site you can download and play with this: you click
00:28:43
anywhere on that previous screen and you'll come out at that spot in the data, and it plays.
00:28:50
And then we get the head tracking. We actually had a dummy head on day one to measure the drift in the head tracker:
00:28:57
we know this thing doesn't move, but you do get movement on that,
00:29:01
and you can recalibrate. The cameras also get knocked, and it's nice to be able to correct for that.
00:29:09
Anyway, looking at this you see the speech movement, the
00:29:13
explosions; you can see the body movement and head movement at the same time.
00:29:18
It's common sense: when people laugh, they move their hands, ha ha ha;
00:29:24
when people talk, they nod.
00:29:28
So, not surprisingly, when I talk, my head moves with my body; my own correlation is really high.
00:29:36
Any person's head correlates at about 0.8 with their body
00:29:39
throughout that data. That's common sense. But the interesting thing here
00:29:44
is that my body and head synchronise with your speech to a
00:29:48
significantly high degree, 0.4, 0.45 or higher.
00:29:54
And these are high correlations. Uh, what does it show? It shows that the people are present, it shows
00:30:02
that they're attentive, it shows that they're sharing, it shows that they are forming a bond, if you like.
00:30:08
And, more importantly, these things can be measured, because we have these data streams.
00:30:14
And I find this remarkable. Okay, here green is talking, green is talking,
00:30:19
and there's a lot of activity, um, no activity on red and yellow and grey; they could be dead,
00:30:25
they might be sleeping. But here we have a peak, and a peak, and a peak, and a kind of
00:30:30
peak, which massively coincide. The fact that those peaks coincide confirms to me that they were not dead:
00:30:39
they weren't sleeping, they were listening. I can go back now and say, from zero data, zero movement: yes, they were listening,
00:30:46
because I get such a big peak here.
00:30:49
And then you look at this, and if you look at the video, it's almost like they're puppets being pulled by the
00:30:55
same piece of string: one person goes forward, the other person goes back,
00:30:58
or they both go forward together. There's a tremendous synchrony in the movement.
00:31:05
Okay, so we can look at these traces of the speech activity, and we can look at the traces of
00:31:10
the movement data, if you low-pass filter them, at about ten frames per second, so it's like a slow movie.
00:31:18
We can see that they align really precisely. I was expecting a cascade of movements as participants react, so
00:31:25
Jane says something and John laughs, and then Fred laughs, and then Mary
00:31:29
laughs, you know, like a disease going around the room.
00:31:33
It turns out that if you just low-pass filter, then a window of
00:31:37
one frame is sufficient to capture many of those synchronous movements, these activity peaks.
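Low-pass filtering plus peak picking can be sketched with a moving-average smoother and a simple local-maximum detector. The window sizes and the two toy traces are assumptions, not the corpus values:

```python
# Smooth two movement traces, pick peaks, and find peaks that coincide
# within a small frame tolerance.

def smooth(trace, window=3):
    """Moving average; shrinks the window at the edges."""
    half = window // 2
    return [sum(trace[max(0, i - half):i + half + 1]) /
            len(trace[max(0, i - half):i + half + 1])
            for i in range(len(trace))]

def peaks(trace, min_height=0.5):
    """Indices of local maxima above min_height."""
    return [i for i in range(1, len(trace) - 1)
            if trace[i] > min_height
            and trace[i] >= trace[i - 1] and trace[i] > trace[i + 1]]

def coincident(p1, p2, tol=1):
    """Peaks in p1 that land within tol frames of a peak in p2."""
    return [i for i in p1 if any(abs(i - j) <= tol for j in p2)]

a = [0, 0, 1, 3, 1, 0, 0, 2, 5, 2, 0]   # person A's per-frame movement
b = [0, 0, 0, 3, 1, 0, 0, 0, 4, 1, 0]   # person B's per-frame movement
shared = coincident(peaks(smooth(a), 1), peaks(smooth(b), 1))
```

Coincident peaks across participants mark the bursts of joint activity described above.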
00:31:46
Activity peaks in the movement indicate bursts of high interaction in the conversation,
00:31:51
and there are very clear sequences: propositional units, propositional content,
00:31:55
where one person is speaking at length and the others are very static, listening, and then bursts of
00:32:02
high activity at the transitions, high engagement. Those, I think, are key points in the interaction; you can make inferences.
00:32:09
If everybody listens attentively and then laughs, and then somebody else starts talking, you have a topic change.
00:32:17
So you can do topic detection using this data, and we haven't even started thinking about transcription yet.
00:32:23
This is just sensory input from a camera and, if you want, sensory input from a single microphone.
00:32:30
Those automatic measurements typically correlate at rates above about 0.8, and I claim that they
00:32:37
probably render a manual transcription irrelevant, in that, you know, we can capture much, much more data.
00:32:43
We can do very simple analysis of movement traces and noise traces, and then we
00:32:47
can go in and focus the manual labour on the points of high engagement, high activity.
00:32:54
Here are the head traces for the same four people, and the body movement traces for the same four people, and
00:33:00
this is just the sum of the whole lot: so this is the group activity, and these are the individual activities.
00:33:06
It's remarkable, I claim: absolute alignment here, a
00:33:11
slight lead for the yellow, but absolute for everybody
00:33:14
else. There are so many positions where those peaks co-occur that you can definitely get a technology out of it.
00:33:22
Okay, it's remarkable synchrony.
00:33:27
As I said, I expected a sliding window would be necessary, but that's not the case; with
00:33:31
low-pass filtering we find very, very close activity peaks that reveal special moments in the discourse.
00:33:38
so in summary i hope picking up your computer was assigned i have five minutes left or something um
00:33:45
okay okay five to get the slider on the web or they will be an awful lot of time um
00:33:53
with respect to sequence of moves in the social conversation interaction this is a quote from adam can
00:33:59
the personal communication each contributes to the emergence of
00:34:03
a joint lisa same system of coordinated action patterns
00:34:06
and the emergent common understanding is well maybe the cognitive consequences
00:34:13
yeah he's gonna make claims about cognitive consequences i'm happy to produce the technology
00:34:18
but we have to share our knowledge and that's why i'm very excited
00:34:21
to be here anyway this particular talk a key point made in this paper
00:34:25
is that whereas previous work required lengthy and expensive manual transcription of this audio visual multimodal data
00:34:31
the proposed automatic procedures to derived from very simple easily downloadable
00:34:36
free image processing that show a very high correlation with transcribed speech
00:34:43
basically if you don't know the c. v. you have this technology they needed some wonderful hacks to make it very robust
00:34:49
and in a constraint situation where heads a vertical and they tend to
00:34:54
look at the camera once in a while we can do better than ninety five percent hit detection
00:35:01
range of like you've done outside than inside this light is fine
00:35:06
so the technology is there and i'm very all that to say that this multi
00:35:11
little conversation that it's not really available on the s. s. p. what side um
00:35:16
they suggested i give it a name and free talk seem to be a a an adequate description um
00:35:22
special thanks to michelle who actually up with all the stuff catherine who initiated this she asked for um
00:35:32
use it a lot of people around the world are playing with this there is no duty except that
00:35:38
if you derive anything from it but it back on the systems other people can benefit from that um okay
00:35:46
sure wrap up um active listening is something that really
00:35:52
excites me at the moment typical traditional technology says that
00:35:56
speech synthesis talks i just the box it goes on that doesn't do any
00:36:00
sensing it doesn't even look around to see if there are any people in there
00:36:05
better still if you could find the people and detect their participation
00:36:10
then it could rephrase it could restructures off it could repeat simplify jump ahead it's such
00:36:16
so we could have some very intelligent speech uh interaction lips what um
00:36:23
people talk interactively they overlap very often an overlap is not a performance or
00:36:30
i think overlap is a social act uh it's not just ping pong turn taking
00:36:36
with respect to discuss synchrony um
00:36:40
people interact together and the best
00:36:43
it's like a time go it's not partner a is dominant partner before those if you watch people
00:36:51
dancing the tango i had the great pleasure to walk through a park in dublin added time go first
00:36:57
was about ten couples with an argentine violinist guitars or and they
00:37:02
were dancing and it wasn't that the man led and the woman followed
00:37:07
she knew exactly where he was going and she was just there when it happened
00:37:12
i think that's what happens in speech it's not that you talk i process and then react
00:37:17
i pro actively participate in our conversation
00:37:23
and i finish your sentences because i know what you're gonna say is not interrupting you it's complimenting you
00:37:29
with me are um tempo dynamics are something i did not yet understand but i think
00:37:35
there's a very magical element there we could understand the temple dynamics of this type of interaction
00:37:42
oh okay uh parts of this talk were also presented at the
00:37:47
into speech special session of the same name discourse some green active listing
00:37:51
on the ninth day of the ninth month of the night you of the third millennium
00:37:57
and it was no terrorist activity unless this could be cool but um so okay um
00:38:07
a acknowledgements i want one says going through the and the the first part is that knowledge
00:38:13
is the n. i. c. t. in eighty are in japan who used to phone me um
00:38:18
due to coach in dublin i i've give me money to carry on this work for the next five
00:38:23
years and i still have to use left the japanese government funded so i i'm looking at both sides
00:38:28
i would think about time i'm i'm headed to here because she
00:38:32
doesn't have the confidence to believe that she is a very strong program
00:38:36
but uh programming is available if you have data which is a similar format
00:38:42
we have time aligned information of any form you can also your your data
00:38:48
and finally we are recruiting so if you wanna control haven't tasted gonna see the very welcome to come see
00:38:54
okay actually stopped there
00:39:04
uh_huh
00:39:11
yeah
00:39:18
yes of course
00:39:20
the
00:39:23
exactly sensing technology
00:39:32
if you
00:39:41
exactly yes you cadence circus the cadence posing and and page topic and so forth if you look at the ends
00:39:51
of utterances nigel ward just on my technology the people katie age inset and and got a locking a list of some
00:39:58
i've made a very nice machine where they actually model much more complexity for that to take what direction to provide feedback
00:40:06
but i think it's not just physical cues i think there's some
00:40:10
things well i i can i i i need musical help in this
00:40:16
music rhythm is very clearly formalised and there's a framework and the
00:40:21
skill in music is deviating from the framework in a a controlled away
00:40:28
that's what makes performance as opposed to rendition i i think
00:40:31
we are masters of that performance in this kind of discourse
00:40:36
i mean here is different but if you good assessment people eating all good probably because
00:40:41
when there are no external 'cause i just noticed this conversation
00:40:44
that's when the social aspects can emerge was strong i think and
00:40:50
yeah okay interrupt me gothic
00:40:55
actually
00:40:58
oh no i haven't out i got that that ugliness
00:41:03
yeah it's series every time
00:41:08
what got these three days they one day today three same people coming and so yes it is a series overlap
00:41:19
anyone they even in any one moment together or even one topic group
00:41:25
somebody will introduce something and then it sparks of memories or
00:41:30
key points docking points in the other three you know that mention straighten gotten some years as a yeah yeah i've been there
00:41:36
and then it's all they had this problem the corner kind of thing that people that hiding information the cumulative way
00:41:43
i think he still can't action is the signal we give them this data or you i i'm i'm
00:41:53
i'm a novice in these fields but i can use tools like s. b. m.'s markov models et cetera
00:42:00
i mean you can use it was
00:42:05
that's defined data that's part of the of the that 'cause it's a time aligned annotation
00:42:10
the ground me yeah
00:42:17
come back next week or beginning to believe no facts
00:42:30
oh
00:42:32
yeah
00:42:39
the stream of the student or the comes in from a camera exactly like a doughnut
00:42:43
because the the three hundred and sixty degree thing is very hard to reserve the
00:42:46
first thing we do is is virtually straight nights and then we choose reason of interest
00:42:52
i'm from their warheads appointing up relatively speaking i detection we've got within go
00:42:58
down below the had two point five times the width and that gives us about
00:43:03
and we simply do um
00:43:06
i forget the term for it but there's
00:43:09
yeah optical flow of um within the box around the box within this box around this box so there's no i guess
00:43:17
that there is movement left the right x. coordinate y. coordinate and today is it corner
00:43:21
because we have and um zoom if you like people come close room for the word
00:43:27
that's there and the same for the body so it's x. y. and said for the two parts times number of people
00:43:38
yeah
00:43:58
uh_huh
00:44:02
yeah
00:44:07
uh_huh
00:44:12
uh_huh
00:44:16
that also it sure is a gross exaggeration to say that we adopt all the time that the crazy but
00:44:22
if if you can measure that then you can measure that the amount of discord
00:44:26
in the amount of delay feedback et cetera and that again becomes rich information source
00:44:33
um
00:44:36
yeah
00:44:42
yeah of course
00:44:47
oh i would like to make that generalisation but i don't at the moment i can't say that
00:44:54
it's it's very complex
00:44:56
yeah i'm noise cancellation is a magic word in this context you know the headphones if you
00:45:01
were on the aeroplane it takes the background noise and subtract seven it leaves the the significant noise
00:45:07
we need to we how are you going to say we need to develop the technology like noise cancellation where we can look at
00:45:14
the the round the head scratching and the the nonspeech related movement
00:45:19
i'm that the speech related movements emerge because of the synchrony across them
00:45:24
so we can use the the multi tracks to do this noise cancellation
00:45:29
talking of which i like we'll do that okay
00:45:36
yeah
00:45:40
yes
00:45:42
uh_huh he
00:45:52
this is also the final yeah of course are differences
00:45:55
there are cultural differences there a language related differences there is socially related differences
00:46:01
bus drivers do it but they do it differently from office workers office
00:46:04
workers do differently from university purposes but they'll do the same kind of thing
00:46:09
i don't think this technology would not work for any particular language of
00:46:12
cultures just you'd you you couldn't generalised models trained in one region too
00:46:18
yeah but it's going back to what i said earlier speech is such a wonderfully efficient mechanism that it
00:46:24
contains docking points it contains repetitions to allow you to former based and to make a a local compare
00:46:31
even if you've never met me before you don't know my voice you don't know my speaking characteristics by comparing
00:46:37
small changes in the the frequent simple actually relevant
00:46:41
uh sections then you can do that kind of processing
00:46:47
so i think we can get a robust technology
00:46:57
yes this means right
00:47:01
with that
00:47:03
his mode of ways you can do that i just don't see saying why
00:47:09
oh
00:47:12
i don't think it does work unless that one's male and female and they have other things and then my wife
00:47:19
is japanese and when we first went out my japanese was very very pool but we managed to communicate very well um
00:47:27
so we
00:47:35
yeah

Share this talk: 


Conference Program

Tracking 'the 2nd channel' of information in speech
Nick Campbell, Trinity College Dublin
Sept. 13, 2009 · 2:30 p.m.
Tracking 'the 2nd channel' of information in speech [slightly higher video quality]
Nick Campbell, Trinity College Dublin
Sept. 13, 2009 · 2:35 p.m.

Recommended talks

Speech Graphics Presentation
Gregor Hofer
Feb. 24, 2015 · 10:10 a.m.
138 views