Transcriptions

Note: this content has been automatically generated.
00:00:00
Okay, let's start again.
00:00:03
yeah
00:00:06
Please — okay, let's try to make things pragmatic,
00:00:17
and I think talking about
00:00:22
sound event detection is interesting because, see, it's not speech first —
00:00:28
so it's something different from what you saw the whole week —
00:00:33
and I think it's also important for you to see how
00:00:37
you would use RNNs for something else than speech,
00:00:44
and also because I worked on this these last months, so I know the topic well.
00:00:51
So, more specifically, I will talk about weakly labelled sound event detection.
00:00:58
The idea of sound event detection is the following:
00:01:02
you have inputs, which could be
00:01:06
ten-second recordings,
00:01:11
and you extract a time-frequency representation, such
00:01:17
as filter bank coefficients,
00:01:21
to feed one or two networks: one for classification,
00:01:27
to detect the audio tags in the whole ten seconds — so
00:01:33
you want to answer the question: is there speech in these
00:01:38
ten seconds, is there dog barking in these ten seconds?
00:01:42
This would be done by a classifier, and then you also want to localise where the events are,
00:01:50
and you would train a model to try to localise the
00:01:56
events that you identified with the classifier.
00:02:02
It could be the same model, as we will see. Then you select
00:02:10
the predictions made by the localiser — the localiser makes predictions at each time frame —
00:02:16
and you can select which events you want based on the audio tags you just inferred, and you
00:02:22
end up with these probability curves, for instance,
00:02:26
for dog barking and speech in this example,
00:02:30
and then you do some post-processing, like thresholding, to say: okay,
00:02:36
above 0.3 I would say there is speech,
00:02:41
and above 0.7 I would say there is dog barking.
00:02:48
So it's a multi-label classification problem, where you can
00:02:52
have speech and dog barking at the same time
00:02:55
which is a situation different from speech recognition, where
00:03:00
most of the time people don't speak simultaneously.
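A minimal sketch of the thresholding and segment extraction just described, assuming sigmoid outputs of shape (frames, classes); the 431 frames and the 0.3/0.7 values echo the talk's example, everything else is illustrative:

```python
import numpy as np

def threshold_and_segment(frame_probs, thresholds):
    """Binarise frame-level probabilities with per-class thresholds and
    return, per class, the (start_frame, end_frame) event segments."""
    active = frame_probs > thresholds
    all_segments = []
    for c in range(active.shape[1]):
        segs, start = [], None
        for t, a in enumerate(active[:, c]):
            if a and start is None:
                start = t                      # event onset
            elif not a and start is not None:
                segs.append((start, t))        # event offset
                start = None
        if start is not None:
            segs.append((start, active.shape[0]))
        all_segments.append(segs)
    return all_segments

probs = np.random.rand(431, 10)                # stand-in prediction curves
thr = np.full(10, 0.5)
thr[0], thr[1] = 0.3, 0.7                      # e.g. speech, dog barking
segments = threshold_and_segment(probs, thr)
```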
00:03:07
So for the audio tags, it's a many-to-one RNN that you need —
00:03:14
many, because you have a sequence of
00:03:19
vectors, a sequence of vectors
00:03:23
in your time-frequency representation — and you want
00:03:27
to output just one label, which is
00:03:31
either zero or one for each of the C classes that you're interested in.
00:03:37
So for instance, if you have ten classes — dog barking, speech, frying, whatever —
00:03:42
you would have a vector of ten zeros and ones.
00:03:50
And then for localisation — we call them strong labels,
00:03:56
because you have one label for each time frame
00:03:59
whereas, on the contrary, the audio tags are sometimes called weak labels —
00:04:05
in the case of strong labelling you want a many-to-many model,
00:04:11
because you have a sequence of inputs and you want a sequence of predictions.
00:04:20
So, in more detail: here you have your
00:04:24
inputs; you would have, let's say, a convolutional neural network
00:04:30
to predict the tags, and here you have your y
00:04:36
for the whole ten seconds, and you have the first class,
00:04:41
the second class and the last one which are positive.
00:04:46
Then you have a second network for localisation, and you
00:04:51
train it at the frame-by-frame level.
00:04:58
You obtain prediction curves like this, and you select
00:05:03
the ones of the second class and the last class
00:05:08
in order to make your predictions; in the end you want
00:05:12
segments — so from zero to two seconds, and from four to five, you have speech, et cetera.
00:05:23
sorry
00:05:29
00:05:31
This is after thresholding, so you also have thresholds here that you need to set — exactly.
00:05:40
Exactly. So, since it's multi-label — yes, yeah.
00:05:53
Ah, yes.
00:06:00
Exactly, yeah. I mean, if this one says there is barking,
00:06:08
then you retain the whole
00:06:13
dog barking prediction curve here, and then you use a threshold to say: okay, I
00:06:18
found the segments — but you don't know how many of them in advance.
00:06:24
Oh, the rescaling: you just take this curve and you put
00:06:29
it between zero and one, so it's like a min-max rescaling.
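A one-line sketch of that rescaling, for a NumPy prediction curve:

```python
def min_max_rescale(curve, eps=1e-8):
    # squash a per-class prediction curve into [0, 1]
    return (curve - curve.min()) / (curve.max() - curve.min() + eps)
```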
00:06:37
Okay, so: we deal with weakly labelled sound event detection.
00:06:46
We want to perform both audio tagging and localisation,
00:06:52
so we want predictions for both weak and strong labels —
00:06:56
let's say the small y and the big Y —
00:07:01
and you have a spectrogram sequence here as input.
00:07:06
What "weakly labelled" means is that you
00:07:10
only have weak labels to train your model,
00:07:14
so it means that you only have audio tags and you don't have
00:07:19
strong annotations for your recordings,
00:07:25
and you want to predict strong labels, okay?
00:07:29
So this is a kind of weakly supervised machine learning problem.
00:07:36
In speech, you can think of this: you record
00:07:40
someone learning, let's say, French
00:07:44
as a second language, and instead of asking a linguist to annotate
00:07:51
where his potential mispronunciations are,
00:07:57
the linguist would just say: okay, in this sentence there is one mispronunciation.
00:08:03
This is the kind of problem you could
00:08:07
also tackle, where you want to localise
00:08:11
the mispronunciation
00:08:16
that was annotated by the linguist.
00:08:22
So I will very briefly speak about three options
00:08:28
for when you deal with these kinds
00:08:31
of problems. The first approach is the brute force approach, where —
00:08:38
it's called false strong labelling — it
00:08:43
means that you consider that the weak labels
00:08:46
are the same as the strong labels. So you would say:
00:08:50
the annotator says there is barking in
00:08:54
these ten seconds, and you consider that all the frames of
00:08:58
your input sequence are positive regarding barking.
00:09:03
So this is a very approximative approach, and of course it's completely suboptimal.
00:09:10
The second approach: people try models with attention,
00:09:18
and I will talk about gated linear units, which have been used a lot
00:09:24
these last two years. And the third option, which interests me
00:09:30
more, is multiple instance learning, a kind of learning setting.
00:09:38
Just to say a word about the models,
00:09:44
to give you an idea of what kind of models we use:
00:09:50
it's a convolutional recurrent neural network.
00:09:55
Basically, you have your input
00:09:58
here — so log-mel coefficients, typically.
00:10:02
So it's like a matrix: you have
00:10:07
431 time frames, for instance,
00:10:11
and 64 coefficients for each time frame.
00:10:16
You feed this to convolution blocks:
00:10:22
after a convolution, you do batch normalisation, you
00:10:27
apply rectified linear units, you do
00:10:30
some pooling — max pooling, for instance — and then you do some dropout, et cetera,
00:10:36
and you repeat this block several times. This block is often called
00:10:42
the feature extraction block, where you end up with another image here,
00:10:48
smaller than the original one, and you
00:10:52
flatten it
00:10:57
to feed a recurrent layer. So you define
00:11:02
a recurrent layer — for instance here we have a GRU, a bidirectional
00:11:09
GRU, because you have a forward and a backward GRU —
00:11:14
and then this is the decision-making part of the network:
00:11:19
after the recurrent layer, it outputs a sequence of scores;
00:11:25
you may use some time-distributed fully connected layers,
00:11:32
and the last one has ten units,
00:11:38
because you want to predict ten classes, for instance.
00:11:44
Then you use the sigmoid activation function,
00:11:49
and not the softmax function,
00:11:53
because it's multi-label classification: you want to be able to predict both
00:11:58
barking and speech at the same time, and not just barking, as you would with a softmax.
00:12:05
So the output of the sigmoid layer would be a
00:12:08
matrix of size the number of time frames
00:12:13
times the number of classes you want to predict, and this is the same as the input
00:12:18
dimension regarding time, because we don't
00:12:23
subsample time in the network.
00:12:27
If you use pooling
00:12:33
on the time dimension, you lose precision, so you
00:12:38
don't; you just use pooling on the frequency axis.
00:12:44
Then you obtain these frame-level predictions,
00:12:49
and if you also want the audio tags, then you want one prediction
00:12:54
for the ten classes for the whole file, and then
00:12:57
what people do is just average the frame-level predictions coming from
00:13:03
this layer in order to get just ten predictions.
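A minimal PyTorch sketch of such a CRNN — 431 frames × 64 log-mel bins and 10 classes as in the talk, pooling only along frequency; the layer sizes are illustrative, not the speaker's exact architecture:

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_mels=64, n_classes=10, hidden=64):
        super().__init__()
        def block(cin, cout):       # conv -> BN -> ReLU -> pool -> dropout
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, padding=1),
                nn.BatchNorm2d(cout),
                nn.ReLU(),
                nn.MaxPool2d((1, 2)),        # pool frequency only, keep time
                nn.Dropout(0.3))
        self.cnn = nn.Sequential(block(1, 32), block(32, 64), block(64, 64))
        self.gru = nn.GRU(64 * (n_mels // 8), hidden,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                        # x: (batch, 1, time, mels)
        z = self.cnn(x)                          # (batch, 64, time, mels/8)
        z = z.permute(0, 2, 1, 3).flatten(2)     # (batch, time, features)
        z, _ = self.gru(z)                       # bidirectional GRU over time
        frame_probs = torch.sigmoid(self.fc(z))  # (batch, time, classes)
        clip_probs = frame_probs.mean(dim=1)     # average over time -> audio tags
        return frame_probs, clip_probs

frame_probs, clip_probs = CRNN()(torch.randn(2, 1, 431, 64))
```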
00:13:09
Okay, so for false strong labelling,
00:13:15
as I said, we consider that the
00:13:18
frame-level labels are the same as the audio tags,
00:13:24
and the loss function that people use is the binary cross-entropy
00:13:29
between the time-frame predictions and the audio tag.
00:13:36
This is the formula of the binary cross-entropy, BCE(ŷ, y) = −[ y·log ŷ + (1 − y)·log(1 − ŷ) ], but I guess you already saw it.
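A sketch of this loss, where the clip-level tag is simply broadcast to every frame (shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def false_strong_loss(frame_probs, clip_tags):
    # frame_probs: (batch, time, classes); clip_tags: (batch, classes) in {0, 1}
    targets = clip_tags.unsqueeze(1).expand_as(frame_probs)  # same tag at every frame
    return F.binary_cross_entropy(frame_probs, targets)
```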
00:13:44
So this brute force approach achieves poor results for short-duration
00:13:50
events: if you consider dog barking, it's very short,
00:13:56
and you are telling the model: okay, these whole ten seconds are
00:14:02
dog barking — and of course they're not — so it would be quite poor on dog barking, for instance,
00:14:09
whereas for, let's say, running water,
00:14:13
it will work fine, because usually
00:14:16
running water lasts the whole ten seconds.
00:14:22
So option two is attention mechanisms,
00:14:28
and there is one which was proposed for NLP
00:14:35
but is now used for audio processing; it's called gated linear units,
00:14:42
GLUs. Nowadays,
00:14:47
in language modelling, they use this a lot
00:14:49
instead of recurrent layers — well, let's say:
00:14:55
language models are often based on RNNs,
00:15:01
and gated linear units are an alternative to RNNs; they are not
00:15:06
recurrent. So I will explain what it is.
00:15:11
This is a figure from the paper that proposed
00:15:15
gated linear units. You have words as input;
00:15:20
you have a lookup table where you look up
00:15:24
embeddings, word embeddings — let's say you already have them —
00:15:28
and the gated linear unit will be this part,
00:15:34
where you define two convolution layers that are in parallel:
00:15:41
one will be just a normal, standard
00:15:48
convolution layer, and on the other one you use a sigmoid function
00:15:54
at the output, and then you element-wise multiply them together
00:15:59
to get an output. Basically, what the sigmoid function
00:16:04
does is put all the elements of the matrix
00:16:08
between zero and one, so it will tell the model where to
00:16:12
look in the input to take a decision,
00:16:18
and this controls the flow of information through the network.
00:16:26
so yeah
00:16:28
In audio processing, some
00:16:34
people tried it, and it was successful.
00:16:38
Maybe it's easier to see an example like this: you have your spectrogram as input,
00:16:45
and you define two convolution layers in parallel; one has
00:16:50
no activation function, so you can say it's linear,
00:16:54
and one is sigmoid, and then you multiply them together.
00:16:59
This layer with the sigmoid is called the attention
00:17:04
layer, because it will tell the other one:
00:17:08
keep this part of the data, or don't
00:17:12
keep it — because the sigmoid gives you
00:17:16
numbers between zero and one. So if the sigmoid outputs something close to zero, it says: don't
00:17:22
keep this part of the information; if it's close to one, it keeps the information.
00:17:27
And you do this several times, for all the convolution layers in your network,
00:17:33
and in the end you can do the same with
00:17:37
recurrent layers: you can define two parallel recurrent layers,
00:17:43
and you get some sort of localisation through this attention mechanism.
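A sketch of one such gated convolutional block, assuming 2-D convolutions over the spectrogram:

```python
import torch
import torch.nn as nn

class GLUConv(nn.Module):
    """Gated linear unit: two parallel convolutions; the sigmoid branch
    acts as a soft gate on the linear branch (element-wise product)."""
    def __init__(self, cin, cout):
        super().__init__()
        self.linear = nn.Conv2d(cin, cout, 3, padding=1)   # no activation
        self.gate = nn.Conv2d(cin, cout, 3, padding=1)     # sigmoid gate

    def forward(self, x):
        return self.linear(x) * torch.sigmoid(self.gate(x))
```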
00:17:50
These guys won a competition,
00:17:54
the DCASE competition, in 2017.
00:18:00
This is another picture of the same thing,
00:18:07
so I will skip it; there's also a way to average the predictions
00:18:14
to get the final audio tagging prediction here —
00:18:19
it's not important. And finally, option three:
00:18:23
instead of introducing attention mechanisms,
00:18:28
you can think of this problem as a multiple instance learning problem,
00:18:34
where you consider that your sequence is a bag, or a set, of
00:18:42
instances. So each spectrogram is a bag of vectors, and then you say:
00:18:49
okay, I suppose there is no dependency or ordering among the vectors in
00:18:56
the input, and you want a single binary label y in the end.
00:19:01
You have unknown individual labels — the strong labels are unknown.
00:19:07
So the MIL assumption is: okay, the final weak label
00:19:14
will be zero if all the unknown labels are zero;
00:19:19
otherwise it will be one. So it means that if you have a single frame in your
00:19:24
sequence that is positive, you consider that the final label is positive.
00:19:32
You can compact this formulation
00:19:38
using a max, and this is what we
00:19:41
will use to train the model. So what you do is: your loss function
00:19:50
is the binary cross-entropy, as always, but between the
00:19:54
weak label — the audio tag — and the maximum
00:19:58
of your predictions over time.
00:20:03
So basically you have a prediction curve, and you take
00:20:09
the max over time, and you say: this should be one
00:20:15
if my weak label is one; this should be zero if my weak label is zero.
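A sketch of this MIL objective — binary cross-entropy between the weak label and the temporal max of each prediction curve (shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def mil_max_loss(frame_probs, weak_labels):
    # frame_probs: (batch, time, classes); weak_labels: (batch, classes)
    clip_probs, _ = frame_probs.max(dim=1)   # max over time, per class
    return F.binary_cross_entropy(clip_probs, weak_labels)
```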
00:20:23
If you do this, you obtain predictions like this:
00:20:30
in blue you have the prediction for speech, and in red for dog barking,
00:20:37
and as you can see, you obtain very sharp peaks.
00:20:41
So it works okay: if you listen to this file,
00:20:48
there is indeed speech here, here and here, and dog barking here, here and here.
00:20:55
It's a bit too peaky — you want something less peaky — so
00:21:00
you can smooth it if you want to get better results.
00:21:06
Briefly: if you do this MIL thing,
00:21:14
sometimes the model won't be able to distinguish between classes that occur
00:21:19
very often in the same files in the training set.
00:21:23
Let's say you have a dog barking and a cat
00:21:27
meowing a lot of times in the same files.
00:21:32
With the MIL approach, you optimise around the maximum —
00:21:38
the frame that has the maximum score — so
00:21:41
if you train with files that have exactly the same classes,
00:21:45
the model won't be able to distinguish between them.
00:21:50
This is an illustration here, where you have two
00:21:54
classes, two predictions — the green and the red — which are
00:22:00
one hundred percent correlated; it means
00:22:04
the model cannot distinguish between the green and
00:22:07
the red classes. So what can we do to solve
00:22:11
this? What we did — and we
00:22:18
recently published a paper on this — was to add a penalty in
00:22:23
the loss function, a penalty on the similarity
00:22:30
between the predictions of the different positive classes.
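The talk doesn't spell out the exact penalty term, so the following is only a plausible sketch: it penalises the cosine similarity between the frame-level prediction curves of classes that are jointly positive in a clip (the formulation itself is an assumption, not necessarily the speaker's):

```python
import torch

def similarity_penalty(frame_probs, weak_labels, eps=1e-8):
    # frame_probs: (batch, time, classes); weak_labels: (batch, classes), float 0/1
    curves = frame_probs / (frame_probs.norm(dim=1, keepdim=True) + eps)
    sim = torch.einsum('btc,btd->bcd', curves, curves)    # curve-vs-curve cosine
    pos_pairs = weak_labels.unsqueeze(2) * weak_labels.unsqueeze(1)
    pos_pairs = pos_pairs * (1 - torch.eye(frame_probs.size(2),
                                           device=frame_probs.device))
    return (sim * pos_pairs).sum() / (pos_pairs.sum() + eps)
```

This term would be added to the MIL loss with a weight tuned on validation data.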
00:22:38
And if you do this, it works better. Here
00:22:44
you have the same file as before,
00:22:47
and in transparency you have the ground truth.
00:22:53
Here you see the green and the red curves are no longer the same,
00:22:59
and you begin to detect the short-duration events correctly,
00:23:05
which is the case here, for your information.
00:23:10
Okay, so I have a small demo.
00:23:20
Here you will see the predictions here
00:23:25
and the spectrogram here. Okay — another example.
00:23:53
Okay.
00:24:00
00:24:11
Yeah, it's okay. So: we saw three options you can try —
00:24:20
I mean, two options: one with an attention mechanism,
00:24:25
one with a different machine learning framework called multiple instance learning.
00:24:31
Now we can go back to speech recognition and see
00:24:37
what people have been doing these last years.
00:24:42
So: what's the motivation for trying to design an end-to-end recognition system?
00:24:48
How do people build models from sounds to
00:24:53
characters — the so-called CTC model?
00:24:57
uh what attention mechanisms people use
00:25:01
the Listen, Attend and Spell method?
00:25:08
Okay. So, as you saw all week, the conventional
00:25:13
recognition pipeline — you recognise it — has several
00:25:17
modules: you have the acoustic model, you have the language model, you have the pronunciation lexicon model.
00:25:25
So the question is: can we build a single model instead
00:25:30
of all these submodels, which are complicated to optimise?
00:25:36
This is the purpose of end-to-end models,
00:25:43
and the definition would be: an end-to-end speech
00:25:47
recognition system is a system which directly maps the sequence of
00:25:53
acoustic features, or the raw signal, into a sequence of graphemes
00:25:59
or even words.
00:26:04
Since about 2014, the first attempts
00:26:10
have involved CTC-based models,
00:26:15
and, more recently, attention models.
00:26:23
So let's talk about CTC.
00:26:26
CTC stands for connectionist temporal classification,
00:26:31
and it was proposed by Alex
00:26:35
Graves and colleagues in 2006.
00:26:40
Basically, what is it? It's a
00:26:45
way to train acoustic models without frame-level alignments,
00:26:49
so you don't need to align the speech signals to
00:26:55
the transcriptions. Now, I think
00:27:01
the next thirty slides are borrowed,
00:27:06
because there is an awesome course on CTC on YouTube
00:27:12
that you can watch; I think it's the best explanation, and I could
00:27:17
hardly do better than that teacher at explaining CTC.
00:27:23
So this is an example of an input sequence,
00:27:30
and here you have all the predictions —
00:27:34
the probabilities of each potential symbol,
00:27:38
a character; here it's phonemes, but it could be characters.
00:27:44
So the problem is: you have your sequence, and your RNN model
00:27:51
outputs a vector of predictions at each time step,
00:27:55
and you want to predict the final, best sequence of characters or phonemes.
00:28:04
What you can do, as a first option, is to simply select
00:28:09
the most likely symbol at each time frame; so for instance it would be G, G,
00:28:17
F, F, F, et cetera. You don't care about the context; you just choose the most likely one.
00:28:24
This is called greedy decoding, and of
00:28:30
course, as you can imagine, it's not optimal.
00:28:35
What you would do: if you predict G twice here,
00:28:42
you just keep the last one, the last prediction,
00:28:47
and the preceding ones you don't care about — you just care about the last one
00:28:53
for your final output. So you want to output some characters;
00:28:58
you don't care that it happened twice, you care that it happened
00:29:03
once, and you put it once in your final transcription.
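A sketch of this greedy (max) decoding with the keep-the-last-of-a-run collapse — no blank symbol yet at this point:

```python
import numpy as np

def greedy_decode(scores):
    """scores: (time, n_symbols) array of per-frame probabilities.
    Pick the most likely symbol per frame, then collapse repeats."""
    best = scores.argmax(axis=-1)
    out, prev = [], None
    for s in best:
        if s != prev:               # only the last frame of a run survives
            out.append(int(s))
        prev = s
    return out
```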
00:29:11
Option two is to try to make
00:29:17
things better: you can impose some constraints.
00:29:21
These are external constraints; for instance, you
00:29:26
can say: I want sequences that correspond
00:29:32
to words that are in the dictionary, because if you
00:29:37
don't impose this constraint, with the first option you may obtain
00:29:41
sequences of characters that are not valid.
00:29:48
So how would you train such a model? As I said, you discard
00:29:55
the frames that are the same as your last prediction here,
00:30:02
and you just keep the last ones — so it would be, say, symbol two, symbol four, symbol six —
00:30:09
and then you compute your loss function only on these frames, and you don't
00:30:16
compute the loss on the preceding frames.
00:30:21
So finally the loss function would be the sum over the different events;
00:30:28
here you have four characters, or phonemes, in your loss function.
00:30:34
Something else that you can do
00:30:38
is this: here you take care of all the frames,
00:30:45
so for instance if you output y2 here, you also consider
00:30:49
that it's y2 here, and you compute the loss
00:30:52
for all the frames; so it gives a sum over time
00:30:58
of the cross-entropy between the true label and the output symbol.
00:31:08
The problem here is that you are not
00:31:12
provided with time information, so you
00:31:18
have to align, for instance, this sequence B E F E,
00:31:24
but you have no idea where to start and where to stop the B;
00:31:30
you have only the sequence of output symbols.
00:31:36
You know that you have to find B E F E, so how do you compute the derivatives
00:31:43
to train your model?
00:31:46
One way to do this: you know you have to find
00:31:49
B E F E, so you can restrict your outputs to be
00:31:54
the ones that correspond to B, E and
00:31:59
F, and you remove all the other possibilities.
00:32:03
00:32:06
Okay, so that's the idea: you make your predictions as usual, and then
00:32:11
you copy into another matrix only the rows that are of interest.
00:32:17
So here, for instance, you want the B, you want the E, you
00:32:22
want the F, so you copy these lines to a new matrix,
00:32:30
and then
00:32:34
you do your decoding only on the reduced
00:32:37
matrix, so maybe you find this alignment.
00:32:43
Now you are sure that you have the appropriate symbols,
00:32:49
the ones you are expected to find. The problem is that,
00:32:53
as we can see here, you find B E F and then you go back to B.
00:32:59
It's not satisfactory: you want
00:33:05
a sequence which is B E F E, and you don't want to go back to B; and if you don't constrain
00:33:12
this matrix, you will find non-valid sequences, so you have to constrain it.
00:33:20
And if you want to align B E F E, you are obliged
00:33:26
to copy the same line twice — the E line —
00:33:32
so in the end you have this reduced matrix
00:33:36
with four lines, where the E line is duplicated.
00:33:42
And then people constrain the decoding
00:33:50
by forcing the model to start at the first symbol, on the top left,
00:33:56
and then it's monotonic: you have to go right and down
00:34:02
until you reach the last state, the last symbol you have to
00:34:08
find, and this way you cannot go back to a previous symbol.
00:34:16
Okay, so this is nothing new so far. This is how it
00:34:23
looks when you apply these constraints: the red arrows
00:34:29
tell you where you can go, and where you cannot go there is no arrow.
00:34:38
Okay.
00:34:41
So, to train this, you can
00:34:45
do the decoding on all the
00:34:49
utterances and then update your weights,
00:34:53
and you iterate — this is like batch-mode training. Option two is to
00:34:59
decode one utterance, align it, and
00:35:06
update the weights; in this case it's online learning.
00:35:13
Both work, but the problem is that
00:35:18
you depend highly on the initial alignments;
00:35:22
you are prone to poor local convergence, to local optima.
00:35:30
So the ultimate solution is to try not to
00:35:34
commit to any alignment. This is the idea
00:35:38
of the forward-backward algorithm:
00:35:47
instead of selecting the most likely alignment, as you would
00:35:50
do with the Viterbi algorithm,
00:35:54
here you take the expectation over
00:36:00
all possible alignments you can make with your data,
00:36:07
and then you are no longer constrained to a single alignment.
00:36:14
Okay, so you have to compute the paths,
00:36:22
and to do this you use the forward-backward algorithm.
00:36:28
You still have a problem, though, which is the following:
00:36:34
let's say you have to decode a sequence like R, O, O, D —
00:36:41
but this alignment that you find: is it for 'rod' or for 'rood'?
00:36:49
You don't know, because you don't have any ending symbol here.
00:36:57
This problem does not always appear.
00:37:01
If you use two symbols for each symbol —
00:37:06
for instance, for O, you say it's always O1
00:37:12
a number of times and then O2 a number of times — then if you go
00:37:17
back to O1, you know there are two symbols you have to output.
00:37:25
So,
00:37:27
another possibility is to use an additional character,
00:37:32
which is called the blank — a blank, here,
00:37:36
drawn like this dash. So if you have this alignment,
00:37:42
you know it's 'rod', because you remove all the blanks;
00:37:48
if you have a blank between a repetition of symbols,
00:37:54
then you know you should output the two symbols,
00:38:00
so in this case you have two O's here.
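A sketch of the full collapse rule once the blank exists — merge consecutive repeats, then drop blanks, so R O _ O D gives 'rood' while R O O D gives 'rod' (reserving index 0 for the blank is an assumption):

```python
BLANK = 0   # reserved blank index (an assumption)

def collapse_with_blank(path):
    """path: frame-wise symbol indices. Merges repeats, then drops blanks:
    [R, O, BLANK, O, D] -> [R, O, O, D]; [R, O, O, D] -> [R, O, D]."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return out
```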
00:38:04
00:38:07
Okay, so this blank symbol has no meaning, no acoustic sense,
00:38:13
but it must be trained together with the other units.
00:38:18
So people add a
00:38:26
new line in the score matrix which corresponds to this blank symbol.
00:38:32
Here, in this example with the red boxes,
00:38:35
the blank is not used — the blank was not found, okay.
00:38:40
00:38:44
In this case there is no blank, and
00:38:51
we have the correct output; in this case you have a blank
00:39:00
between an H and an F, so
00:39:05
00:39:08
the alignment is still correct, even if you found a blank here;
00:39:14
but in this case it's not correct anymore, because here you have a blank between an
00:39:21
F and another F, so you would have two F's in the output, and it's not correct anymore.
00:39:29
So, to simplify
00:39:34
the computations, people add a blank line
00:39:40
between every symbol, and then you can still use your constraints
00:39:46
of starting at the top-left symbol
00:39:52
and ending at the bottom-right symbol, and you can have blanks wherever you want, okay.
00:40:00
Here the red arrows show you the possible paths.
00:40:08
okay
00:40:10
00:40:12
okay
00:40:15
Yes — sometimes you can also allow skips, because a
00:40:19
blank is a bit weak to
00:40:23
model, because there is no acoustic meaning, so
00:40:28
you can also allow skips over blanks
00:40:33
to the next symbol.
00:40:37
Mathematically, in CTC the probability of having a sequence of
00:40:42
characters Y, knowing the input sequence X, is the sum
00:40:48
over the possible alignments which are valid between X and Y,
00:40:55
and, as we saw in the RNN introduction,
00:41:00
the probability of a sequence is the product of the
00:41:03
probabilities at each time step: P(Y | X) = Σ over valid alignments A of Π over t = 1..T of p_t(a_t | X).
00:41:10
So you have this probability for a given utterance, and then
00:41:19
what we want is to maximise the
00:41:22
probability of the true labels Y —
00:41:28
this formula. So this is the standard
00:41:34
cross-entropy loss, but with the
00:41:39
ground-truth character sequence.
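In practice this sum over all valid alignments is available off the shelf; a minimal sketch using PyTorch's torch.nn.CTCLoss (the shapes, sizes and blank index here are placeholder assumptions, not values from the talk):

```python
import torch
import torch.nn as nn

T, B, C = 100, 4, 28                 # frames, batch size, symbols (blank = 0)
log_probs = torch.randn(T, B, C).log_softmax(2)      # per-frame log-probabilities
targets = torch.randint(1, C, (B, 10))               # label sequences, no blanks
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)            # implements the forward-backward sum
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```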
00:41:44
Okay, some results about CTC. So, how do we decode with CTC?
00:41:51
You can still do greedy search, also called max decoding, but it's suboptimal,
00:41:57
and what people do is prefix beam search; it was also proposed by Graves,
00:42:04
and it uses the forward-backward algorithm, as we saw on the previous slide.
00:42:10
So, some numbers: the standard HMM
00:42:18
is at about thirty-nine percent letter error rate
00:42:24
in Graves' paper on TIMIT, and with CTC,
00:42:30
even with greedy search, the CTC model is better.
00:42:35
So, in general, CTC models are
00:42:38
better at the character level but not at the word level.
00:42:43
00:42:45
And in another paper they do n-best rescoring
00:42:52
with a language model — because so far we don't have any language model —
00:42:57
and if you rescore the n-best
00:43:01
character sequences, you of course obtain some gains.
00:43:06
Here you can see it's drastic:
00:43:13
using a language model really helps a lot in this case.
00:43:21
00:43:24
00:43:28
00:43:31
Yeah — so, after this,
00:43:37
the work by Graves,
00:43:39
people incorporated the language model in the first-pass decoding, and not
00:43:46
as a rescoring, as a post-processing step two.
00:43:53
And there is also EESEN, which is an open-source, Kaldi-based toolkit, which
00:44:00
proposed prefix beam search to get character sequences,
00:44:07
plus transducers, just like standard systems; and in
00:44:15
2015 they showed that it's better
00:44:20
to still use the WFST transducers to decode.
00:44:27
00:44:30
so
00:44:36
one comment on CTC outputs:
00:44:41
if you look at the activations,
00:44:46
you see that they are very peaky and sparse.
00:44:53
Contrary to the GMM-HMM systems,
00:44:58
you will obtain, for instance, a single frame with a spike,
00:45:03
and very few frames for vowels and sounds in general.
00:45:08
So I wonder if it's a good tool for speech analysis.
00:45:13
Maybe it is, I don't know, but if you want
00:45:19
alignments that have some meaning acoustically speaking, I'm
00:45:24
not sure CTC is a good way to go,
00:45:28
because of these sparse and peaky activations.
00:45:35
So these are the conclusions on CTC: it's an alignment-agnostic
00:45:40
tool to train models; you make the assumption
00:45:48
that the outputs at different frames are independent, which is not the case;
00:45:54
the models need an external language model;
00:45:59
greedy decoding does not work well;
00:46:02
and it requires several tens of hours of data to work well.
00:46:08
I myself tried to build a model on a small amount of read speech,
00:46:15
and it didn't converge — I mean, it converged, but the results were not satisfactory.
00:46:22
An open question is: is it adaptable
00:46:26
to small-scale speech tasks and to speech analysis studies?
00:46:32
Okay, that's all for CTC. I have to hurry.
00:46:39
Now let's take a look at the encoder-decoder models.
00:46:45
The RNN-based encoder-decoder models were
00:46:49
proposed for machine translation first, by Cho and colleagues,
00:46:54
in 2014. Basically, you have your
00:46:59
audio input here, or your sequence here, and
00:47:03
here you have an RNN, and you take the last hidden state
00:47:10
of this layer — the one in the same colour — and then you take this hidden state
00:47:17
as an input to a decoder, which is also an RNN,
00:47:22
which will generate outputs at each time frame —
00:47:31
and, I mean, "at each time frame" is not correct:
00:47:36
it generates a number of outputs,
00:47:40
as many as needed; the length of the X sequence is
00:47:47
not supposed to be the same as the Y sequence —
00:47:51
if the decoder predicts end-of-sentence, it stops, so
00:47:56
there's no relation between the two lengths.
00:48:00
Recalling the introduction: the hidden state at time t is a function
00:48:06
of the hidden state at the previous time step and the input at time t;
00:48:13
here, in the RNN decoder, it becomes a function of the same things
00:48:19
but also of the previous output, so there is an arrow from here
00:48:25
to here, from here to here, and also
00:48:32
you use the context vector as a representation of your inputs.
00:48:37
This is a kind of model which works well
00:48:41
on short sentences; if you have long sentences,
00:48:48
it's very difficult to model the whole input sequence with a single context vector.
00:48:56
Then people came up with attention mechanisms
00:49:01
to help this context vector be —
00:49:07
well — better. The idea is the same: you have the encoder,
00:49:12
and then you multiply each of the
00:49:19
hidden vectors by an attention value, and these attention values
00:49:24
are estimated at each time step; they are different at each time step.
00:49:32
So the question is where the model should
00:49:39
focus its attention, given the sequence.
00:49:46
This was also for machine translation, initially.
00:49:52
So, how do we compute the context vector?
00:49:56
The context vector now depends on time:
00:50:00
it's the sum of the attention values times
00:50:05
the hidden vectors, c_t = Σ_j α_tj · h_j. How do we compute the alphas?
00:50:10
The alphas are the amount of attention the
00:50:13
output y_t should pay to the hidden vector
00:50:17
h_j, the j-th one. So how do we compute the alphas?
00:50:23
The alphas are a softmax over some
00:50:28
variable e. How do we compute e? It's never-ending, probably. So:
00:50:38
e is in fact the output of a kind of alignment model a,
00:50:44
which takes as input s. So what is s?
00:50:48
s is the hidden state of the decoder at time t minus one,
00:50:54
and h here is the hidden state at step j,
00:51:00
the one from the encoder. So it's quite natural to think:
00:51:04
we want to see how similar the hidden state of the
00:51:12
decoder is to the hidden state of the encoder, okay.
00:51:21
So what is a?
00:51:27
a is a small neural network; it's a fully connected
00:51:32
neural layer. Why do we use a fully connected
00:51:37
neural network? Because we don't know the function that maps
00:51:42
our vectors s and h, so by using a
00:51:48
network, you let the network learn a function that works.
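A sketch of this additive (Bahdanau-style) attention, where the alignment model a is the small fully connected network just described; all dimensions are placeholders:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Scores each encoder state h_j against the decoder state s_{t-1},
    turns the scores e into alphas with a softmax, and returns the
    weighted sum of encoder states as the context vector."""
    def __init__(self, dim_s, dim_h, dim_a=128):
        super().__init__()
        self.a = nn.Sequential(                  # the learned scorer 'a'
            nn.Linear(dim_s + dim_h, dim_a), nn.Tanh(), nn.Linear(dim_a, 1))

    def forward(self, s_prev, h):    # s_prev: (B, dim_s), h: (B, T, dim_h)
        s_rep = s_prev.unsqueeze(1).expand(-1, h.size(1), -1)
        e = self.a(torch.cat([s_rep, h], dim=-1)).squeeze(-1)   # (B, T)
        alpha = torch.softmax(e, dim=-1)                        # attention weights
        context = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)   # c_t: (B, dim_h)
        return context, alpha
```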
00:51:55
So this is the idea of how you include the
00:52:00
attention values in the final context vector.
00:52:05
Okay. In audio processing, in speech processing, this is a
00:52:10
nice, simplified view of this approach: you have an encoder which outputs
00:52:16
hidden vectors, you have attention, and then you have your decoder with
00:52:23
the recurrent connection here, and it outputs something at each time step.
00:52:31
So, quickly: the Listen, Attend and Spell
00:52:36
model, proposed in 2015.
00:52:41
Here you first use a decoder —
00:52:49
an encoder, sorry — to generate hidden vectors,
00:52:52
and then you have a second part, called Attend and Spell,
00:52:56
which takes as inputs the hidden vectors of the encoder
00:53:00
and the previous outputs. So LAS, as we call it — Listen, Attend and Spell —
00:53:06
is a character-level RNN encoder-decoder with attention.
00:53:13
So what does it look like? The Listen part is always the same: it's an RNN
00:53:20
which generates h — it's this part, called the listener —
00:53:27
and you have your hidden vectors here, so it's a matrix.
00:53:32
Here you have the attention part, where
00:53:37
you compute the context vector c
00:53:41
by a function called AttentionContext.
00:53:48
Then you have this context that is fed
00:53:52
into the decoder part, which is the speller
00:53:55
here, which is also an RNN — it could be an LSTM;
00:53:59
I don't know why I wrote RNN there — and then, here,
00:54:04
it will output the hidden states of the decoder and then
00:54:09
the final character transcription. So basically the LAS model
00:54:15
is exactly the same as the machine translation one I just explained.
00:54:23
Some results. On clean speech,
00:54:31
you have a baseline here;
00:54:35
you see that the model performs worse than the standard model
00:54:43
in terms of word error rate, also on noisy speech. Here
00:54:49
they used two thousand hours of training data,
00:54:53
which begins to be quite large.
00:54:59
But the thing is, even if
00:55:03
it's worse here, you don't use any information about pronunciation
00:55:09
nor any language modelling, so it's quite impressive.
00:55:14
If you use rescoring with a language model, then you go down a lot.
00:55:22
But still, you may think: okay, why bother with this kind of approach,
00:55:27
because it's worse than the
00:55:32
conventional approach? Well —
00:55:37
before answering this question: in the LAS paper
00:55:43
they show some alignments like this; it's really nice to see.
00:55:47
You have the spectrogram here, and here the transcription
00:55:51
of the first sentence, "how much wood would a woodchuck chuck",
00:55:56
and what is plotted here are the alphas.
00:55:59
You see the character transcription
00:56:06
and where in the audio input the model focused.
00:56:13
And you can see here that you have "chuck" twice,
00:56:18
and you see some attention here for the
00:56:22
first "chuck" part, which is repeated.
00:56:30
And in a paper that just came out for ICASSP this year,
00:56:36
they show
00:56:40
finally that this kind of model, the LAS models,
00:56:45
outperform a strong baseline.
00:56:50
Here, in terms of word error rate for voice search or dictation,
00:56:56
they have better results than a commercial, conventional strong baseline,
00:57:02
but if you have a look at the number of hours they used, it's like twelve thousand
00:57:10
hours of training speech, which corresponds to fifty million sentences.
00:57:18
Okay, some conclusions. The new state of
00:57:22
the art is sequence-to-sequence models,
00:57:27
more precisely attention-based encoder-decoder
00:57:31
architectures, such as Listen, Attend and Spell.
00:57:37
The really exciting thing about this is that a single neural network
00:57:42
replaces the acoustic, pronunciation and language model components,
00:57:49
but the downsides are that they require massive datasets
00:57:55
and massive compute resources to be able to train them.
00:58:00
So another question is: are they of any interest in our case —
00:58:05
you know, the low-resource case? For now, I'd say no.
00:58:11
There are very few attempts to do
00:58:18
transfer learning on these models. I found this paper
00:58:24
where they train a multilingual LAS model
00:58:29
and they try to adapt it to another language with only a small amount of data,
00:58:36
and what's interesting is that they find that
00:58:42
you need to fine-tune both the encoder and the
00:58:46
decoder: if you just fine-tune the decoder,
00:58:49
the last part of the model, the performance drops.
00:58:55
So, yeah — but still, the word error rates are pretty large
00:59:04
in this paper.
