Transcriptions

Note: this content has been automatically generated.
00:00:00
Okay, let's start again.
00:00:03
yeah
00:00:06
Please — okay, let's try to make things pragmatic,
00:00:17
and I think talking about
00:00:22
sound event detection is interesting because, see, it's not speech first —
00:00:28
so it's something different from what you saw the whole week —
00:00:33
and I think it's also important for you to see how
00:00:37
you would use RNNs for something else than speech,
00:00:44
and also because I worked on this these last months, so I know the topic well.
00:00:51
So, more specifically, I will talk about weakly labelled sound event detection.
00:00:58
The idea of sound event detection is the following:
00:01:02
you have inputs, which could be
00:01:06
ten-second recordings,
00:01:11
and you extract a time-frequency representation, such
00:01:17
as filter bank coefficients,
00:01:21
to feed one or two networks: one for classification,
00:01:27
to detect the audio tags in the whole ten seconds — so
00:01:33
you want to answer the question: is there speech in these
00:01:38
ten seconds, is there dog barking in these ten seconds?
00:01:42
This would be done by a classifier, and then you also want to localise where the events are,
00:01:50
and you would train a model to try to localise the
00:01:56
events that you identified with the classifier.
00:02:02
It could be the same model, as we will see. Then you select
00:02:10
the predictions made by the localiser — the localiser makes predictions at each time frame —
00:02:16
and you can select which events you want based on the audio tags you just inferred, and you
00:02:22
end up with these probability curves, for instance,
00:02:26
for dog barking and speech in this example,
00:02:30
and then you do some post-processing, like thresholding, to say: okay,
00:02:36
above 0.3 I would say there is speech,
00:02:41
and above 0.7 I would say there is dog barking.
00:02:48
So it's a multi-label classification problem, where you can
00:02:52
have speech and dog barking at the same time
00:02:55
which is a situation different from speech recognition, where
00:03:00
most of the time people don't speak simultaneously.
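A minimal sketch of the thresholding and segment extraction just described, assuming sigmoid outputs of shape (frames, classes); the 431 frames and the 0.3/0.7 values echo the talk's example, everything else is illustrative:

```python
import numpy as np

def threshold_and_segment(frame_probs, thresholds):
    """Binarise frame-level probabilities with per-class thresholds and
    return, per class, the (start_frame, end_frame) event segments."""
    active = frame_probs > thresholds
    all_segments = []
    for c in range(active.shape[1]):
        segs, start = [], None
        for t, a in enumerate(active[:, c]):
            if a and start is None:
                start = t                      # event onset
            elif not a and start is not None:
                segs.append((start, t))        # event offset
                start = None
        if start is not None:
            segs.append((start, active.shape[0]))
        all_segments.append(segs)
    return all_segments

probs = np.random.rand(431, 10)                # stand-in prediction curves
thr = np.full(10, 0.5)
thr[0], thr[1] = 0.3, 0.7                      # e.g. speech, dog barking
segments = threshold_and_segment(probs, thr)
```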
00:03:07
So for the audio tags, it's a many-to-one RNN that you need —
00:03:14
many, because you have a sequence of
00:03:19
vectors, a sequence of vectors
00:03:23
in your time-frequency representation — and you want
00:03:27
to output just one label, which is
00:03:31
either zero or one for each of the C classes that you're interested in.
00:03:37
So for instance, if you have ten classes — dog barking, speech, frying, whatever —
00:03:42
you would have a vector of ten zeros and ones.
00:03:50
And then for localisation — we call them strong labels,
00:03:56
because you have one label for each time frame
00:03:59
whereas, on the contrary, the audio tags are sometimes called weak labels —
00:04:05
in the case of strong labelling you want a many-to-many model,
00:04:11
because you have a sequence of inputs and you want a sequence of predictions.
00:04:20
So, in more detail: here you have your
00:04:24
inputs; you would have, let's say, a convolutional neural network
00:04:30
to predict the tags, and here you have your y
00:04:36
for the whole ten seconds, and you have the first class,
00:04:41
the second class and the last one which are positive.
00:04:46
Then you have a second network for localisation, and you
00:04:51
train it at the frame-by-frame level.
00:04:58
You obtain prediction curves like this, and you select
00:05:03
the ones of the second class and the last class
00:05:08
in order to make your predictions; in the end you want
00:05:12
segments — so from zero to two seconds, and from four to five, you have speech, et cetera.
00:05:23
sorry
00:05:29
00:05:31
This is after thresholding, so you also have thresholds here that you need to set — exactly.
00:05:40
Exactly. So, since it's multi-label — yes, yeah.
00:05:53
Ah, yes.
00:06:00
Exactly, yeah. I mean, if this one says there is barking,
00:06:08
then you retain the whole
00:06:13
dog barking prediction curve here, and then you use a threshold to say: okay, I
00:06:18
found the segments — but you don't know how many of them in advance.
00:06:24
Oh, the rescaling: you just take this curve and you put
00:06:29
it between zero and one, so it's like a min-max rescaling.
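A one-line sketch of that rescaling, for a NumPy prediction curve:

```python
def min_max_rescale(curve, eps=1e-8):
    # squash a per-class prediction curve into [0, 1]
    return (curve - curve.min()) / (curve.max() - curve.min() + eps)
```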
00:06:37
Okay, so: we deal with weakly labelled sound event detection.
00:06:46
We want to perform both audio tagging and localisation,
00:06:52
so we want predictions for both weak and strong labels —
00:06:56
let's say the small y and the big Y —
00:07:01
and you have a spectrogram sequence here as input.
00:07:06
What "weakly labelled" means is that you
00:07:10
only have weak labels to train your model,
00:07:14
so it means that you only have audio tags and you don't have
00:07:19
strong annotations for your recordings,
00:07:25
and you want to predict strong labels, okay?
00:07:29
So this is a kind of weakly supervised machine learning problem.
00:07:36
In speech, you can think of this: you record
00:07:40
someone learning, let's say, French
00:07:44
as a second language, and instead of asking a linguist to annotate
00:07:51
where his potential mispronunciations are,
00:07:57
the linguist would just say: okay, in this sentence there is one mispronunciation.
00:08:03
This is the kind of problem you could
00:08:07
also tackle, where you want to localise
00:08:11
the mispronunciation
00:08:16
that was annotated by the linguist.
00:08:22
So I will very briefly speak about three options
00:08:28
for when you deal with these kinds
00:08:31
of problems. The first approach is the brute force approach, where —
00:08:38
it's called false strong labelling — it
00:08:43
means that you consider that the weak labels
00:08:46
are the same as the strong labels. So you would say:
00:08:50
the annotator says there is barking in
00:08:54
these ten seconds, and you consider that all the frames of
00:08:58
your input sequence are positive regarding barking.
00:09:03
So this is a very approximative approach, and of course it's completely suboptimal.
00:09:10
The second approach: people try models with attention,
00:09:18
and I will talk about gated linear units, which have been used a lot
00:09:24
these last two years. And the third option, which interests me
00:09:30
more, is multiple instance learning, a kind of learning setting.
00:09:38
Just to say a word about the models,
00:09:44
to give you an idea of what kind of models we use:
00:09:50
it's a convolutional recurrent neural network.
00:09:55
Basically, you have your input
00:09:58
here — so log-mel coefficients, typically.
00:10:02
So it's like a matrix: you have
00:10:07
431 time frames, for instance,
00:10:11
and 64 coefficients for each time frame.
00:10:16
You feed this to convolution blocks:
00:10:22
after a convolution, you do batch normalisation, you
00:10:27
apply rectified linear units, you do
00:10:30
some pooling — max pooling, for instance — and then you do some dropout, et cetera,
00:10:36
and you repeat this block several times. This block is often called
00:10:42
the feature extraction block, where you end up with another image here,
00:10:48
smaller than the original one, and you
00:10:52
flatten it
00:10:57
to feed a recurrent layer. So you define
00:11:02
a recurrent layer — for instance here we have a GRU, a bidirectional
00:11:09
GRU, because you have a forward and a backward GRU —
00:11:14
and then this is the decision-making part of the network:
00:11:19
after the recurrent layer, it outputs a sequence of scores;
00:11:25
you may use some time-distributed fully connected layers,
00:11:32
and the last one has ten units,
00:11:38
because you want to predict ten classes, for instance.
00:11:44
Then you use the sigmoid activation function,
00:11:49
and not the softmax function,
00:11:53
because it's multi-label classification: you want to be able to predict both
00:11:58
barking and speech at the same time, and not just barking, as you would with a softmax.
00:12:05
So the output of the sigmoid layer would be a
00:12:08
matrix of size the number of time frames
00:12:13
times the number of classes you want to predict, and this is the same as the input
00:12:18
dimension regarding time, because we don't
00:12:23
subsample time in the network.
00:12:27
If you use pooling
00:12:33
on the time dimension, you lose precision, so you
00:12:38
don't; you just use pooling on the frequency axis.
00:12:44
Then you obtain these frame-level predictions,
00:12:49
and if you also want the audio tags, then you want one prediction
00:12:54
for the ten classes for the whole file, and then
00:12:57
what people do is just average the frame-level predictions coming from
00:13:03
this layer in order to get just ten predictions.
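A minimal PyTorch sketch of such a CRNN — 431 frames × 64 log-mel bins and 10 classes as in the talk, pooling only along frequency; the layer sizes are illustrative, not the speaker's exact architecture:

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_mels=64, n_classes=10, hidden=64):
        super().__init__()
        def block(cin, cout):       # conv -> BN -> ReLU -> pool -> dropout
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, padding=1),
                nn.BatchNorm2d(cout),
                nn.ReLU(),
                nn.MaxPool2d((1, 2)),        # pool frequency only, keep time
                nn.Dropout(0.3))
        self.cnn = nn.Sequential(block(1, 32), block(32, 64), block(64, 64))
        self.gru = nn.GRU(64 * (n_mels // 8), hidden,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                        # x: (batch, 1, time, mels)
        z = self.cnn(x)                          # (batch, 64, time, mels/8)
        z = z.permute(0, 2, 1, 3).flatten(2)     # (batch, time, features)
        z, _ = self.gru(z)                       # bidirectional GRU over time
        frame_probs = torch.sigmoid(self.fc(z))  # (batch, time, classes)
        clip_probs = frame_probs.mean(dim=1)     # average over time -> audio tags
        return frame_probs, clip_probs

frame_probs, clip_probs = CRNN()(torch.randn(2, 1, 431, 64))
```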
00:13:09
Okay, so for false strong labelling,
00:13:15
as I said, we consider that the
00:13:18
frame-level labels are the same as the audio tags,
00:13:24
and the loss function that people use is the binary cross-entropy
00:13:29
between the time-frame predictions and the audio tag.
00:13:36
This is the formula of the binary cross-entropy, BCE(ŷ, y) = −[ y·log ŷ + (1 − y)·log(1 − ŷ) ], but I guess you already saw it.
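A sketch of this loss, where the clip-level tag is simply broadcast to every frame (shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def false_strong_loss(frame_probs, clip_tags):
    # frame_probs: (batch, time, classes); clip_tags: (batch, classes) in {0, 1}
    targets = clip_tags.unsqueeze(1).expand_as(frame_probs)  # same tag at every frame
    return F.binary_cross_entropy(frame_probs, targets)
```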
00:13:44
So this brute force approach achieves poor results for short-duration
00:13:50
events: if you consider dog barking, it's very short,
00:13:56
and you are telling the model: okay, these whole ten seconds are
00:14:02
dog barking — and of course they're not — so it would be quite poor on dog barking, for instance,
00:14:09
whereas for, let's say, running water,
00:14:13
it will work fine, because usually
00:14:16
running water lasts the whole ten seconds.
00:14:22
So option two is attention mechanisms,
00:14:28
and there is one which was proposed for NLP
00:14:35
but is now used for audio processing; it's called gated linear units,
00:14:42
GLUs. Nowadays,
00:14:47
in language modelling, they use this a lot
00:14:49
instead of recurrent layers — well, let's say:
00:14:55
language models are often based on RNNs,
00:15:01
and gated linear units are an alternative to RNNs; they are not
00:15:06
recurrent. So I will explain what it is.
00:15:11
This is a figure from the paper that proposed
00:15:15
gated linear units. You have words as input;
00:15:20
you have a lookup table where you look up
00:15:24
embeddings, word embeddings — let's say you already have them —
00:15:28
and the gated linear unit will be this part,
00:15:34
where you define two convolution layers that are in parallel:
00:15:41
one will be just a normal, standard
00:15:48
convolution layer, and on the other one you use a sigmoid function
00:15:54
at the output, and then you element-wise multiply them together
00:15:59
to get an output. Basically, what the sigmoid function
00:16:04
does is put all the elements of the matrix
00:16:08
between zero and one, so it will tell the model where to
00:16:12
look in the input to take a decision,
00:16:18
and this controls the flow of information through the network.
00:16:26
so yeah
00:16:28
In audio processing, some
00:16:34
people tried it, and it was successful.
00:16:38
Maybe it's easier to see an example like this: you have your spectrogram as input,
00:16:45
and you define two convolution layers in parallel; one has
00:16:50
no activation function, so you can say it's linear,
00:16:54
and one is sigmoid, and then you multiply them together.
00:16:59
This layer with the sigmoid is called the attention
00:17:04
layer, because it will tell the other one:
00:17:08
keep this part of the data, or don't
00:17:12
keep it — because the sigmoid gives you
00:17:16
numbers between zero and one. So if the sigmoid outputs something close to zero, it says: don't
00:17:22
keep this part of the information; if it's close to one, it keeps the information.
00:17:27
And you do this several times, for all the convolution layers in your network,
00:17:33
and in the end you can do the same with
00:17:37
recurrent layers: you can define two parallel recurrent layers,
00:17:43
and you get some sort of localisation through this attention mechanism.
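A sketch of one such gated convolutional block, assuming 2-D convolutions over the spectrogram:

```python
import torch
import torch.nn as nn

class GLUConv(nn.Module):
    """Gated linear unit: two parallel convolutions; the sigmoid branch
    acts as a soft gate on the linear branch (element-wise product)."""
    def __init__(self, cin, cout):
        super().__init__()
        self.linear = nn.Conv2d(cin, cout, 3, padding=1)   # no activation
        self.gate = nn.Conv2d(cin, cout, 3, padding=1)     # sigmoid gate

    def forward(self, x):
        return self.linear(x) * torch.sigmoid(self.gate(x))
```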
00:17:50
These guys won a competition,
00:17:54
the DCASE competition, in 2017.
00:18:00
This is another picture of the same thing,
00:18:07
so I will skip it; there's also a way to average the predictions
00:18:14
to get the final audio tagging prediction here —
00:18:19
it's not important. And finally, option three:
00:18:23
instead of introducing attention mechanisms,
00:18:28
you can think of this problem as a multiple instance learning problem,
00:18:34
where you consider that your sequence is a bag, or a set, of
00:18:42
instances. So each spectrogram is a bag of vectors, and then you say:
00:18:49
okay, I suppose there is no dependency or ordering among the vectors in
00:18:56
the input, and you want a single binary label y in the end.
00:19:01
You have unknown individual labels — the strong labels are unknown.
00:19:07
So the MIL assumption is: okay, the final weak label
00:19:14
will be zero if all the unknown labels are zero;
00:19:19
otherwise it will be one. So it means that if you have a single frame in your
00:19:24
sequence that is positive, you consider that the final label is positive.
00:19:32
You can compact this formulation
00:19:38
using a max, and this is what we
00:19:41
will use to train the model. So what you do is: your loss function
00:19:50
is the binary cross-entropy, as always, but between the
00:19:54
weak label — the audio tag — and the maximum
00:19:58
of your predictions over time.
00:20:03
So basically you have a prediction curve, and you take
00:20:09
the max over time, and you say: this should be one
00:20:15
if my weak label is one; this should be zero if my weak label is zero.
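A sketch of this MIL objective — binary cross-entropy between the weak label and the temporal max of each prediction curve (shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def mil_max_loss(frame_probs, weak_labels):
    # frame_probs: (batch, time, classes); weak_labels: (batch, classes)
    clip_probs, _ = frame_probs.max(dim=1)   # max over time, per class
    return F.binary_cross_entropy(clip_probs, weak_labels)
```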
00:20:23
If you do this, you obtain predictions like this:
00:20:30
in blue you have the prediction for speech, and in red for dog barking,
00:20:37
and as you can see, you obtain very sharp peaks.
00:20:41
So it works okay: if you listen to this file,
00:20:48
there is indeed speech here, here and here, and dog barking here, here and here.
00:20:55
It's a bit too peaky — you want something less peaky — so
00:21:00
you can smooth it if you want to get better results.
00:21:06
Briefly: if you do this MIL thing,
00:21:14
sometimes the model won't be able to distinguish between classes that occur
00:21:19
very often in the same files in the training set.
00:21:23
Let's say you have a dog barking and a cat
00:21:27
meowing a lot of times in the same files.
00:21:32
With the MIL approach, you optimise around the maximum —
00:21:38
the frame that has the maximum score — so
00:21:41
if you train with files that have exactly the same classes,
00:21:45
the model won't be able to distinguish between them.
00:21:50
This is an illustration here, where you have two
00:21:54
classes, two predictions — the green and the red — which are
00:22:00
one hundred percent correlated; it means
00:22:04
the model cannot distinguish between the green and
00:22:07
the red classes. So what can we do to solve
00:22:11
this? What we did — and we
00:22:18
recently published a paper on this — was to add a penalty in
00:22:23
the loss function, a penalty on the similarity
00:22:30
between the predictions of the different positive classes.
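The talk doesn't spell out the exact penalty term, so the following is only a plausible sketch: it penalises the cosine similarity between the frame-level prediction curves of classes that are jointly positive in a clip (the formulation itself is an assumption, not necessarily the speaker's):

```python
import torch

def similarity_penalty(frame_probs, weak_labels, eps=1e-8):
    # frame_probs: (batch, time, classes); weak_labels: (batch, classes), float 0/1
    curves = frame_probs / (frame_probs.norm(dim=1, keepdim=True) + eps)
    sim = torch.einsum('btc,btd->bcd', curves, curves)    # curve-vs-curve cosine
    pos_pairs = weak_labels.unsqueeze(2) * weak_labels.unsqueeze(1)
    pos_pairs = pos_pairs * (1 - torch.eye(frame_probs.size(2),
                                           device=frame_probs.device))
    return (sim * pos_pairs).sum() / (pos_pairs.sum() + eps)
```

This term would be added to the MIL loss with a weight tuned on validation data.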
00:22:38
And if you do this, it works better. Here
00:22:44
you have the same file as before,
00:22:47
and in transparency you have the ground truth.
00:22:53
Here you see the green and the red curves are no longer the same,
00:22:59
and you begin to detect the short-duration events correctly,
00:23:05
which is the case here, for your information.
00:23:10
Okay, so I have a small demo.
00:23:20
Here you will see the predictions here
00:23:25
and the spectrogram here. Okay — another example.
00:23:53
Okay.
00:24:00
00:24:11
Yeah, it's okay. So: we saw three options you can try —
00:24:20
I mean, two options: one with an attention mechanism,
00:24:25
one with a different machine learning framework called multiple instance learning.
00:24:31
Now we can go back to speech recognition and see
00:24:37
what people have been doing these last years.
00:24:42
So: what's the motivation for trying to design an end-to-end recognition system?
00:24:48
How do people build models from sounds to
00:24:53
characters — the so-called CTC model?
00:24:57
uh what attention mechanisms people use
00:25:01
the Listen, Attend and Spell method?
00:25:08
Okay. So, as you saw all week, the conventional
00:25:13
recognition pipeline — you recognise it — has several
00:25:17
modules: you have the acoustic model, you have the language model, you have the pronunciation lexicon model.
00:25:25
So the question is: can we build a single model instead
00:25:30
of all these submodels, which are complicated to optimise?
00:25:36
This is the purpose of end-to-end models,
00:25:43
and the definition would be: an end-to-end speech
00:25:47
recognition system is a system which directly maps the sequence of
00:25:53
acoustic features, or the raw signal, into a sequence of graphemes
00:25:59
or even words.
00:26:04
Since about 2014, the first attempts
00:26:10
have involved CTC-based models,
00:26:15
and, more recently, attention models.
00:26:23
So let's talk about CTC.
00:26:26
CTC stands for connectionist temporal classification,
00:26:31
and it was proposed by Alex
00:26:35
Graves and colleagues in 2006.
00:26:40
Basically, what is it? It's a
00:26:45
way to train acoustic models without frame-level alignments,
00:26:49
so you don't need to align the speech signals to
00:26:55
the transcriptions. Now, I think
00:27:01
the next thirty slides are borrowed,
00:27:06
because there is an awesome course on CTC on YouTube
00:27:12
that you can watch; I think it's the best explanation, and I could
00:27:17
hardly do better than that teacher at explaining CTC.
00:27:23
So this is an example of an input sequence,
00:27:30
and here you have all the predictions —
00:27:34
the probabilities of each potential symbol,
00:27:38
a character; here it's phonemes, but it could be characters.
00:27:44
So the problem is: you have your sequence, and your RNN model
00:27:51
outputs a vector of predictions at each time step,
00:27:55
and you want to predict the final, best sequence of characters or phonemes.
00:28:04
What you can do, as a first option, is to simply select
00:28:09
the most likely symbol at each time frame; so for instance it would be G, G,
00:28:17
F, F, F, et cetera. You don't care about the context; you just choose the most likely one.
00:28:24
This is called greedy decoding, and of
00:28:30
course, as you can imagine, it's not optimal.
00:28:35
What you would do: if you predict G twice here,
00:28:42
you just keep the last one, the last prediction,
00:28:47
and the preceding ones you don't care about — you just care about the last one
00:28:53
for your final output. So you want to output some characters;
00:28:58
you don't care that it happened twice, you care that it happened
00:29:03
once, and you put it once in your final transcription.
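A sketch of this greedy (max) decoding with the keep-the-last-of-a-run collapse — no blank symbol yet at this point:

```python
import numpy as np

def greedy_decode(scores):
    """scores: (time, n_symbols) array of per-frame probabilities.
    Pick the most likely symbol per frame, then collapse repeats."""
    best = scores.argmax(axis=-1)
    out, prev = [], None
    for s in best:
        if s != prev:               # only the last frame of a run survives
            out.append(int(s))
        prev = s
    return out
```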
00:29:11
Option two is to try to make
00:29:17
things better: you can impose some constraints.
00:29:21
These are external constraints; for instance, you
00:29:26
can say: I want sequences that correspond
00:29:32
to words that are in the dictionary, because if you
00:29:37
don't impose this constraint, with the first option you may obtain
00:29:41
sequences of characters that are not valid.
00:29:48
So how would you train such a model? As I said, you discard
00:29:55
the frames that are the same as your last prediction here,
00:30:02
and you just keep the last ones — so it would be, say, symbol two, symbol four, symbol six —
00:30:09
and then you compute your loss function only on these frames, and you don't
00:30:16
compute the loss on the preceding frames.
00:30:21
So finally the loss function would be the sum over the different events;
00:30:28
here you have four characters, or phonemes, in your loss function.
00:30:34
Something else that you can do
00:30:38
is this: here you take care of all the frames,
00:30:45
so for instance if you output y2 here, you also consider
00:30:49
that it's y2 here, and you compute the loss
00:30:52
for all the frames; so it gives a sum over time
00:30:58
of the cross-entropy between the true label and the output symbol.
00:31:08
The problem here is that you are not
00:31:12
provided with time information, so you
00:31:18
have to align, for instance, this sequence B E F E,
00:31:24
but you have no idea where to start and where to stop the B;
00:31:30
you have only the sequence of output symbols.
00:31:36
You know that you have to find B E F E, so how do you compute the derivatives
00:31:43
to train your model?
00:31:46
One way to do this: you know you have to find
00:31:49
B E F E, so you can restrict your outputs to be
00:31:54
the ones that correspond to B, E and
00:31:59
F, and you remove all the other possibilities.
00:32:03
00:32:06
Okay, so that's the idea: you make your predictions as usual, and then
00:32:11
you copy into another matrix only the rows that are of interest.
00:32:17
So here, for instance, you want the B, you want the E, you
00:32:22
want the F, so you copy these lines to a new matrix,
00:32:30
and then
00:32:34
you do your decoding only on the reduced
00:32:37
matrix, so maybe you find this alignment.
00:32:43
Now you are sure that you have the appropriate symbols,
00:32:49
the ones you are expected to find. The problem is that,
00:32:53
as we can see here, you find B E F and then you go back to B.
00:32:59
It's not satisfactory: you want
00:33:05
a sequence which is B E F E, and you don't want to go back to B; and if you don't constrain
00:33:12
this matrix, you will find non-valid sequences, so you have to constrain it.
00:33:20
And if you want to align B E F E, you are obliged
00:33:26
to copy the same line twice — the E line —
00:33:32
so in the end you have this reduced matrix
00:33:36
with four lines, where the E line is duplicated.
00:33:42
And then people constrain the decoding
00:33:50
by forcing the model to start at the first symbol, on the top left,
00:33:56
and then it's monotonic: you have to go right and down
00:34:02
until you reach the last state, the last symbol you have to
00:34:08
find, and this way you cannot go back to a previous symbol.
00:34:16
Okay, so this is nothing new so far. This is how it
00:34:23
looks when you apply these constraints: the red arrows
00:34:29
tell you where you can go, and where you cannot go there is no arrow.
00:34:38
Okay.
00:34:41
So, to train this, you can
00:34:45
do the decoding on all the
00:34:49
utterances and then update your weights,
00:34:53
and you iterate — this is like batch-mode training. Option two is to
00:34:59
decode one utterance, align it, and
00:35:06
update the weights; in this case it's online learning.
00:35:13
Both work, but the problem is that
00:35:18
you depend highly on the initial alignments;
00:35:22
you are prone to poor local convergence, to local optima.
00:35:30
So the ultimate solution is to try not to
00:35:34
commit to any alignment. This is the idea
00:35:38
of the forward-backward algorithm:
00:35:47
instead of selecting the most likely alignment, as you would
00:35:50
do with the Viterbi algorithm,
00:35:54
here you take the expectation over
00:36:00
all possible alignments you can make with your data,
00:36:07
and then you are no longer constrained to a single alignment.
00:36:14
Okay, so you have to compute the paths,
00:36:22
and to do this you use the forward-backward algorithm.
00:36:28
You still have a problem, though, which is the following:
00:36:34
let's say you have to decode a sequence like R, O, O, D —
00:36:41
but this alignment that you find: is it for 'rod' or for 'rood'?
00:36:49
You don't know, because you don't have any ending symbol here.
00:36:57
This problem does not always appear.
00:37:01
If you use two symbols for each symbol —
00:37:06
for instance, for O, you say it's always O1
00:37:12
a number of times and then O2 a number of times — then if you go
00:37:17
back to O1, you know there are two symbols you have to output.
00:37:25
So,
00:37:27
another possibility is to use an additional character,
00:37:32
which is called the blank — a blank, here,
00:37:36
drawn like this dash. So if you have this alignment,
00:37:42
you know it's 'rod', because you remove all the blanks;
00:37:48
if you have a blank between a repetition of symbols,
00:37:54
then you know you should output the two symbols,
00:38:00
so in this case you have two O's here.
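A sketch of the full collapse rule once the blank exists — merge consecutive repeats, then drop blanks, so R O _ O D gives 'rood' while R O O D gives 'rod' (reserving index 0 for the blank is an assumption):

```python
BLANK = 0   # reserved blank index (an assumption)

def collapse_with_blank(path):
    """path: frame-wise symbol indices. Merges repeats, then drops blanks:
    [R, O, BLANK, O, D] -> [R, O, O, D]; [R, O, O, D] -> [R, O, D]."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return out
```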
00:38:04
00:38:07
Okay, so this blank symbol has no meaning, no acoustic sense,
00:38:13
but it must be trained together with the other units.
00:38:18
So people add a
00:38:26
new line in the score matrix which corresponds to this blank symbol.
00:38:32
Here, in this example with the red boxes,
00:38:35
the blank is not used — the blank was not found, okay.
00:38:40
00:38:44
In this case there is no blank, and
00:38:51
we have the correct output; in this case you have a blank
00:39:00
between an H and an F, so
00:39:05
00:39:08
the alignment is still correct, even if you found a blank here;
00:39:14
but in this case it's not correct anymore, because here you have a blank between an
00:39:21
F and another F, so you would have two F's in the output, and it's not correct anymore.
00:39:29
So, to simplify
00:39:34
the computations, people add a blank line
00:39:40
between every symbol, and then you can still use your constraints
00:39:46
of starting at the top-left symbol
00:39:52
and ending at the bottom-right symbol, and you can have blanks wherever you want, okay.
00:40:00
Here the red arrows show you the possible paths.
00:40:08
okay
00:40:10
00:40:12
okay
00:40:15
Yes — sometimes you can also allow skips, because a
00:40:19
blank is a bit weak to
00:40:23
model, because there is no acoustic meaning, so
00:40:28
you can also allow skips over blanks
00:40:33
to the next symbol.
00:40:37
Mathematically, in CTC the probability of having a sequence of
00:40:42
characters Y, knowing the input sequence X, is the sum
00:40:48
over the possible alignments which are valid between X and Y,
00:40:55
and, as we saw in the RNN introduction,
00:41:00
the probability of a sequence is the product of the
00:41:03
probabilities at each time step: P(Y | X) = Σ over valid alignments A of Π over t = 1..T of p_t(a_t | X).
00:41:10
So you have this probability for a given utterance, and then
00:41:19
what we want is to maximise the
00:41:22
probability of the true labels Y —
00:41:28
this formula. So this is the standard
00:41:34
cross-entropy loss, but with the
00:41:39
ground-truth character sequence.
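In practice this sum over all valid alignments is available off the shelf; a minimal sketch using PyTorch's torch.nn.CTCLoss (the shapes, sizes and blank index here are placeholder assumptions, not values from the talk):

```python
import torch
import torch.nn as nn

T, B, C = 100, 4, 28                 # frames, batch size, symbols (blank = 0)
log_probs = torch.randn(T, B, C).log_softmax(2)      # per-frame log-probabilities
targets = torch.randint(1, C, (B, 10))               # label sequences, no blanks
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)            # implements the forward-backward sum
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```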
00:41:44
Okay, some results about CTC. So, how do we decode with CTC?
00:41:51
You can still do greedy search, also called max decoding, but it's suboptimal,
00:41:57
and what people do is prefix beam search; it was also proposed by Graves,
00:42:04
and it uses the forward-backward algorithm, as we saw on the previous slide.
00:42:10
So, some numbers: the standard HMM
00:42:18
is at about thirty-nine percent letter error rate
00:42:24
in Graves' paper on TIMIT, and with CTC,
00:42:30
even with greedy search, the CTC model is better.
00:42:35
So, in general, CTC models are
00:42:38
better at the character level but not at the word level.
00:42:43
00:42:45
And in another paper they do n-best rescoring
00:42:52
with a language model — because so far we don't have any language model —
00:42:57
and if you rescore the n-best
00:43:01
character sequences, you of course obtain some gains.
00:43:06
Here you can see it's drastic:
00:43:13
using a language model really helps a lot in this case.
00:43:21
00:43:24
00:43:28
00:43:31
Yeah — so, after this,
00:43:37
the work by Graves,
00:43:39
people incorporated the language model in the first-pass decoding, and not
00:43:46
as a rescoring, as a post-processing step two.
00:43:53
And there is also EESEN, which is an open-source, Kaldi-based toolkit, which
00:44:00
proposed prefix beam search to get character sequences,
00:44:07
plus transducers, just like standard systems; and in
00:44:15
2015 they showed that it's better
00:44:20
to still use the WFST transducers to decode.
00:44:27
00:44:30
so
00:44:36
one comment on CTC outputs:
00:44:41
if you look at the activations,
00:44:46
you see that they are very peaky and sparse.
00:44:53
Contrary to the GMM-HMM systems,
00:44:58
you will obtain, for instance, a single frame with a spike,
00:45:03
and very few frames for vowels and sounds in general.
00:45:08
So I wonder if it's a good tool for speech analysis.
00:45:13
Maybe it is, I don't know, but if you want
00:45:19
alignments that have some meaning acoustically speaking, I'm
00:45:24
not sure CTC is a good way to go,
00:45:28
because of these sparse and peaky activations.
00:45:35
So these are the conclusions on CTC: it's an alignment-agnostic
00:45:40
tool to train models; you make the assumption
00:45:48
that the outputs at different frames are independent, which is not the case;
00:45:54
the models need an external language model;
00:45:59
greedy decoding does not work well;
00:46:02
and it requires several tens of hours of data to work well.
00:46:08
I myself tried to build a model on a small amount of read speech,
00:46:15
and it didn't converge — I mean, it converged, but the results were not satisfactory.
00:46:22
An open question is: is it adaptable
00:46:26
to small-scale speech tasks and to speech analysis studies?
00:46:32
Okay, that's all for CTC. I have to hurry.
00:46:39
Now let's take a look at the encoder-decoder models.
00:46:45
The RNN-based encoder-decoder models were
00:46:49
proposed for machine translation first, by Cho and colleagues,
00:46:54
in 2014. Basically, you have your
00:46:59
audio input here, or your sequence here, and
00:47:03
here you have an RNN, and you take the last hidden state
00:47:10
of this layer — the one in the same colour — and then you take this hidden state
00:47:17
as an input to a decoder, which is also an RNN,
00:47:22
which will generate outputs at each time frame —
00:47:31
and, I mean, "at each time frame" is not correct:
00:47:36
it generates a number of outputs,
00:47:40
as many as needed; the length of the X sequence is
00:47:47
not supposed to be the same as the Y sequence —
00:47:51
if the decoder predicts end-of-sentence, it stops, so
00:47:56
there's no relation between the two lengths.
00:48:00
Recalling the introduction: the hidden state at time t is a function
00:48:06
of the hidden state at the previous time step and the input at time t;
00:48:13
here, in the RNN decoder, it becomes a function of the same things
00:48:19
but also of the previous output, so there is an arrow from here
00:48:25
to here, from here to here, and also
00:48:32
you use the context vector as a representation of your inputs.
00:48:37
This is a kind of model which works well
00:48:41
on short sentences; if you have long sentences,
00:48:48
it's very difficult to model the whole input sequence with a single context vector.
00:48:56
Then people came up with attention mechanisms
00:49:01
to help this context vector be —
00:49:07
well — better. The idea is the same: you have the encoder,
00:49:12
and then you multiply each of the
00:49:19
hidden vectors by an attention value, and these attention values
00:49:24
are estimated at each time step; they are different at each time step.
00:49:32
So the question is where the model should
00:49:39
focus its attention, given the sequence.
00:49:46
This was also for machine translation, initially.
00:49:52
So, how do we compute the context vector?
00:49:56
The context vector now depends on time:
00:50:00
it's the sum of the attention values times
00:50:05
the hidden vectors, c_t = Σ_j α_tj · h_j. How do we compute the alphas?
00:50:10
The alphas are the amount of attention the
00:50:13
output y_t should pay to the hidden vector
00:50:17
h_j, the j-th one. So how do we compute the alphas?
00:50:23
The alphas are a softmax over some
00:50:28
variable e. How do we compute e? It's never-ending, probably. So:
00:50:38
e is in fact the output of a kind of alignment model a,
00:50:44
which takes as input s. So what is s?
00:50:48
s is the hidden state of the decoder at time t minus one,
00:50:54
and h here is the hidden state at step j,
00:51:00
the one from the encoder. So it's quite natural to think:
00:51:04
we want to see how similar the hidden state of the
00:51:12
decoder is to the hidden state of the encoder, okay.
00:51:21
So what is a?
00:51:27
a is a small neural network; it's a fully connected
00:51:32
neural layer. Why do we use a fully connected
00:51:37
neural network? Because we don't know the function that maps
00:51:42
our vectors s and h, so by using a
00:51:48
network, you let the network learn a function that works.
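A sketch of this additive (Bahdanau-style) attention, where the alignment model a is the small fully connected network just described; all dimensions are placeholders:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Scores each encoder state h_j against the decoder state s_{t-1},
    turns the scores e into alphas with a softmax, and returns the
    weighted sum of encoder states as the context vector."""
    def __init__(self, dim_s, dim_h, dim_a=128):
        super().__init__()
        self.a = nn.Sequential(                  # the learned scorer 'a'
            nn.Linear(dim_s + dim_h, dim_a), nn.Tanh(), nn.Linear(dim_a, 1))

    def forward(self, s_prev, h):    # s_prev: (B, dim_s), h: (B, T, dim_h)
        s_rep = s_prev.unsqueeze(1).expand(-1, h.size(1), -1)
        e = self.a(torch.cat([s_rep, h], dim=-1)).squeeze(-1)   # (B, T)
        alpha = torch.softmax(e, dim=-1)                        # attention weights
        context = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)   # c_t: (B, dim_h)
        return context, alpha
```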
00:51:55
So this is the idea of how you include the
00:52:00
attention values in the final context vector.
00:52:05
Okay. In audio processing, in speech processing, this is a
00:52:10
nice, simplified view of this approach: you have an encoder which outputs
00:52:16
hidden vectors, you have attention, and then you have your decoder with
00:52:23
the recurrent connection here, and it outputs something at each time step.
00:52:31
So, quickly: the Listen, Attend and Spell
00:52:36
model, proposed in 2015.
00:52:41
Here you first use a decoder —
00:52:49
an encoder, sorry — to generate hidden vectors,
00:52:52
and then you have a second part, called Attend and Spell,
00:52:56
which takes as inputs the hidden vectors of the encoder
00:53:00
and the previous outputs. So LAS, as we call it — Listen, Attend and Spell —
00:53:06
is a character-level RNN encoder-decoder with attention.
00:53:13
So what does it look like? The Listen part is always the same: it's an RNN
00:53:20
which generates h — it's this part, called the listener —
00:53:27
and you have your hidden vectors here, so it's a matrix.
00:53:32
Here you have the attention part, where
00:53:37
you compute the context vector c
00:53:41
by a function called AttentionContext.
00:53:48
Then you have this context that is fed
00:53:52
into the decoder part, which is the speller
00:53:55
here, which is also an RNN — it could be an LSTM;
00:53:59
I don't know why I wrote RNN there — and then, here,
00:54:04
it will output the hidden states of the decoder and then
00:54:09
the final character transcription. So basically the LAS model
00:54:15
is exactly the same as the machine translation one I just explained.
00:54:23
Some results. On clean speech,
00:54:31
you have a baseline here;
00:54:35
you see that the model performs worse than the standard model
00:54:43
in terms of word error rate, also on noisy speech. Here
00:54:49
they used two thousand hours of training data,
00:54:53
which begins to be quite large.
00:54:59
But the thing is, even if
00:55:03
it's worse here, you don't use any information about pronunciation
00:55:09
nor any language modelling, so it's quite impressive.
00:55:14
If you use rescoring with a language model, then you go down a lot.
00:55:22
But still, you may think: okay, why bother with this kind of approach,
00:55:27
because it's worse than the
00:55:32
conventional approach? Well —
00:55:37
before answering this question: in the LAS paper
00:55:43
they show some alignments like this; it's really nice to see.
00:55:47
You have the spectrogram here, and here the transcription
00:55:51
of the first sentence, "how much wood would a woodchuck chuck",
00:55:56
and what is plotted here are the alphas.
00:55:59
You see the character transcription
00:56:06
and where in the audio input the model focused.
00:56:13
And you can see here that you have "chuck" twice,
00:56:18
and you see some attention here for the
00:56:22
first "chuck" part, which is repeated.
00:56:30
And in a paper that just came out for ICASSP this year,
00:56:36
they show
00:56:40
finally that this kind of model, the LAS models,
00:56:45
outperform a strong baseline.
00:56:50
Here, in terms of word error rate for voice search or dictation,
00:56:56
they have better results than a commercial, conventional strong baseline,
00:57:02
but if you have a look at the number of hours they used, it's like twelve thousand
00:57:10
hours of training speech, which corresponds to fifty million sentences.
00:57:18
Okay, some conclusions. The new state of
00:57:22
the art is sequence-to-sequence models,
00:57:27
more precisely attention-based encoder-decoder
00:57:31
architectures, such as Listen, Attend and Spell.
00:57:37
The really exciting thing about this is that a single neural network
00:57:42
replaces the acoustic, pronunciation and language model components,
00:57:49
but the downsides are that they require massive datasets
00:57:55
and massive compute resources to be able to train them.
00:58:00
So another question is: are they of any interest in our case —
00:58:05
you know, the low-resource case? For now, I'd say no.
00:58:11
There are very few attempts to do
00:58:18
transfer learning on these models. I found this paper
00:58:24
where they train a multilingual LAS model
00:58:29
and they try to adapt it to another language with only a small amount of data,
00:58:36
and what's interesting is that they find that
00:58:42
you need to fine-tune both the encoder and the
00:58:46
decoder: if you just fine-tune the decoder,
00:58:49
the last part of the model, the performance drops.
00:58:55
So, yeah — but still, the word error rates are pretty large
00:59:04
in this paper.
