Transcriptions

Note: this content has been automatically generated.
00:00:00
Okay, so it's my turn.
00:00:04
So we have more or less fifteen minutes before the break,
00:00:11
so we'll see where we get in fifteen minutes.
00:00:18
First, I would like to thank the organizers for inviting
00:00:24
me to talk today. I will try to make things clear
00:00:33
about sequence classification,
00:00:37
and I will take examples from sound event detection
00:00:41
and end-to-end ASR. These last
00:00:46
months I worked more on sound event detection,
00:00:51
so contrary to Mathew, for the second part, the
00:00:56
end-to-end part, I'm not so familiar with it, so I will
00:01:00
talk about works by other
00:01:05
teams, so maybe I might say some
00:01:10
wrong things about end-to-end recognition;
00:01:15
I hope you will forgive me.
00:01:19
Okay, so the outline.
00:01:24
We need to have
00:01:28
some basic knowledge of recurrent neural networks
00:01:32
in order to talk about end-to-end speech recognition,
00:01:39
so I will try my best to explain how RNNs work.
00:01:47
Then I will talk a bit about sound event detection,
00:01:52
and then, maybe the most interesting part for you,
00:01:58
will be the second part on end-to-end
00:02:01
speech recognition. I will try to motivate why
00:02:05
people do research on this.
00:02:11
I will present the CTC approach,
00:02:16
and then I will talk about the most recent
00:02:19
works on attention models
00:02:25
and the so-called Listen, Attend and Spell model,
00:02:32
and I will say a few words on transfer learning; I think I have one slide
00:02:39
about transfer learning in the framework of end-to-end ASR.
00:02:46
Okay, so why do we care about recurrent neural networks, RNNs?
00:02:54
Because we deal with sequential
00:02:58
data, with sequence data.
00:03:02
Here in this slide you can see a series of examples
00:03:07
dealing with sequences: of course speech recognition. In speech recognition
00:03:13
you have a sequence of audio frames
00:03:17
and you want to predict a sequence of words, so it would be a sequence-to-sequence model that
00:03:24
you need to do this. In the case of music generation, you would start with nothing
00:03:31
and then try to generate a melody,
00:03:36
so in this case it would be a zero-
00:03:40
to-many model that you would need.
00:03:46
And so on: for sentiment classification it would be a many-to-one
00:03:52
model, because you have sentences and you want to predict,
00:03:59
on a five-star scale,
00:04:03
what the sentiment about a movie would be.
00:04:09
This is also the case for video activity recognition, where
00:04:15
you have a sequence of images, a video,
00:04:20
and you want to predict an activity; it could be one
00:04:25
word, but it could be several words describing an image.
00:04:32
(inaudible)
00:04:40
Okay.
00:04:43
So let's start with a bird's-eye view on recurrent neural networks.
00:04:48
Feel free to interrupt me if you
00:04:52
have questions.
00:04:58
From a generic point of view, an RNN
00:05:04
is a neural network where you have an input,
00:05:11
an input x, which is the time sequence,
00:05:16
and you feed it to a layer, and this layer has an output,
00:05:24
which we will call h, that depends on the time t,
00:05:29
and this output is fed back to the input
00:05:35
of the layer, of the recurrent layer.
00:05:38
So you can unroll, we call it unrolling the network,
00:05:45
this way: at time t = 0 you feed x_0 to the layer and
00:05:53
you obtain a hidden state, or an output, which will be h_0.
00:06:00
Then at time t = 1, you feed
00:06:06
x_1 to the layer, but you also feed
00:06:11
h_0 to the layer, and so on,
00:06:17
until the end of the sequence x.
00:06:22
And then you output your last result, your last prediction,
00:06:27
that takes into account x_T but also all the history,
00:06:33
based on what the layer saw before, so on the past.
00:06:39
More formally, generically speaking, you can define
00:06:44
an RNN with a hidden state h
00:06:48
and an optional output y (here it's called h, but it could be y),
00:06:56
which operates on a variable-length sequence x
00:07:00
of length, let's say, capital T.
00:07:04
At each time step t, you update
00:07:10
the hidden state with some function f,
00:07:15
which can be a sigmoid function or whatever; it's usually a nonlinear
00:07:22
function like a sigmoid, but it could be something much more complex,
00:07:27
like an LSTM or a GRU, but we
00:07:33
will say more about this later. I don't know if it's very visible,
00:07:42
but if you have a layer here, and it
00:07:48
takes as input x_t, it outputs some hidden state,
00:07:53
the h_t, and then if you want an output y
00:07:58
on this, you can take it as an input to
00:08:02
another layer, a feed-forward layer, so
00:08:09
like this, and you will have some weights here,
00:08:18
the weights related to the output y that you want to predict,
00:08:24
and then what you have from this layer is your y prediction.
00:08:33
And usually you have, I have here, an
00:08:41
activation function at the output of this layer.
00:08:46
Sometimes y will be equal to h, sometimes not:
00:08:51
you need one additional layer to get your final output.
00:08:59
Let's take a good example, one of the
00:09:03
simplest RNNs,
00:09:07
which looks like this. Sometimes it's
00:09:13
more difficult to understand the picture than the equations,
00:09:17
but here you have the input x_t,
00:09:23
and you multiply it by some matrix W,
00:09:29
as you would do in a standard fully connected
00:09:33
neural network, and you obtain your hidden state h.
00:09:39
Then you sum it with the preceding output
00:09:47
multiplied by another matrix, which is called the recurrent kernel
00:09:51
and denoted U here.
00:09:56
So basically you have h, the hidden state, which
00:10:03
depends on x multiplied by some matrix, and then the final output y_t
00:10:09
is some nonlinear function, like a tanh,
00:10:15
of the sum of the hidden state plus the preceding
00:10:20
output multiplied by the recurrent kernel.
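To make the recurrence concrete, here is a minimal sketch of one such step in plain Python, with no deep-learning library. The function name and the toy weights are mine, and the fed-back quantity is the hidden state, as in the standard Elman formulation:

```python
import math

def rnn_step(x_t, h_prev, W, U):
    """One step of the simple RNN described above:
    h_t = tanh(W x_t + U h_{t-1}), applied elementwise."""
    n = len(U)  # number of units
    h_t = []
    for i in range(n):
        s = sum(W[i][j] * x_t[j] for j in range(len(x_t)))  # input projection
        s += sum(U[i][j] * h_prev[j] for j in range(n))     # recurrent term
        h_t.append(math.tanh(s))
    return h_t

# Unroll over a toy sequence of 2-dimensional inputs with 2 units.
W = [[0.5, 0.0], [0.0, 0.5]]  # input kernel (n x d)
U = [[0.1, 0.0], [0.0, 0.1]]  # recurrent kernel (n x n)
h = [0.0, 0.0]                # h_0 initialised to zero
for x in [[1.0, 0.0], [0.0, 1.0]]:
    h = rnn_step(x, h, W, U)
```

After the loop, h depends on the whole input history, which is exactly the point of the recurrence.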
00:10:25
It's a very simple model. For instance, if you
00:10:30
define your network with ten units, ten cells,
00:10:37
you have your input x of dimension d, let's say;
00:10:44
each input at time t is of dimension d.
00:10:50
Then you would have W, which is called the kernel, of
00:10:54
size d times ten, the number of cells.
00:11:00
The recurrent kernel itself is of dimension the number of cells
00:11:05
times the number of cells, so it would be ten by ten.
00:11:11
You also have biases, but let's forget those for the moment.
00:11:16
Then you obtain vectors as outputs:
00:11:21
the hidden state h and the output y will
00:11:26
be of size the number of cells.
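As a sanity check on these shapes, a small sketch (the function name is mine) that counts the parameters of such a layer: kernel d by n, recurrent kernel n by n, plus n biases:

```python
def rnn_param_count(d, n):
    """Parameters of a simple RNN layer with inputs of dimension d and
    n cells: kernel W (d x n) + recurrent kernel U (n x n) + n biases."""
    return d * n + n * n + n

# With 20-dimensional inputs and ten cells: 20*10 + 10*10 + 10 = 310 parameters.
print(rnn_param_count(20, 10))
```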
00:11:32
Here you see that the matrix U, the
00:11:37
recurrent kernel, is a square matrix,
00:11:41
and it means that the cells are not independent,
00:11:47
because when you do this matrix product
00:11:52
you have a dependency between the ten cells; the
00:11:56
ten cells are not independent of each other.
00:12:03
Okay.
00:12:04
You can have much more complex boxes,
00:12:11
and maybe the most complex one
00:12:16
is the LSTM box.
00:12:21
I won't say too much here about the LSTM;
00:12:27
it means long short-term memory. Basically, you have the input here,
00:12:35
but you also define gates, and the gates are here to control how much information
00:12:41
is passed to the next part of the cell.
00:12:48
So you have an input gate, and the input gate will tell you
00:12:55
how much of the input information to keep to take a decision later on.
00:13:02
The same for these two gates: there is a forget gate
00:13:06
and an output gate. The forget gate
00:13:10
concerns the hidden state of the cell and will say how much of the
00:13:15
hidden state we need to take a final decision; the same for the output gate:
00:13:21
how much of the output I want to keep for the next iterations.
00:13:27
All these gates are like small neural networks, small layers,
00:13:34
with a sigmoid function that
00:13:39
maps everything to the range zero to one.
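A minimal sketch of one LSTM step with scalar states, just to show the role of the three gates; the weight names in the p dictionary are mine, and real implementations use weight matrices and biases:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step (everything 1-dimensional for readability).
    p holds the weights of the three gates and the candidate update."""
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev)    # input gate: how much new info to keep
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev)    # forget gate: how much old cell state to keep
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev)    # output gate: how much of the cell to expose
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev)  # candidate cell update
    c = f * c_prev + i * g                         # new cell state
    h = o * math.tanh(c)                           # new hidden state / output
    return h, c
```

Each gate is indeed a small sigmoid layer squashed to the range zero to one, and the cell state c is what carries long-term memory across steps.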
00:13:44
Okay, so this is a pretty old
00:13:50
proposal, because it was proposed in ninety-seven,
00:13:55
but it's very much used today
00:13:58
in speech recognition and sequence classification.
00:14:03
People came up with simpler models, and one which
00:14:10
is famous now is the GRU, for gated recurrent unit,
00:14:16
and basically it's very similar to an LSTM;
00:14:21
the difference is that instead of three gates
00:14:24
you have two gates, so it's a bit simpler.
00:14:30
And this was shown to have better performance on smaller datasets.
00:14:39
If you want more information about this: it was proposed in 2014
00:14:46
by Cho et al., and this was for machine translation.
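For comparison with the LSTM sketch, here is a scalar sketch of a GRU step with its two gates; again the weight names are mine, and everything is 1-dimensional for readability:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h_prev, p):
    """One scalar GRU step: an update gate z and a reset gate r,
    with no separate cell state, unlike the LSTM."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev)          # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev)          # reset gate
    g = math.tanh(p["wg"] * x + p["ug"] * (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * g                    # mix old and candidate state
```

The update gate interpolates directly between the old state and the candidate, which is what makes the GRU a bit simpler than the LSTM.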
00:14:54
Okay.
00:14:58
So let's go back to the generic idea of RNNs.
00:15:04
What do we do with them? We want to model sequences,
00:15:10
so we want to train our RNN
00:15:15
to predict the next symbol in the sequence.
00:15:19
In that case, the output at each time step t
00:15:24
is the conditional probability of the next symbol x_t
00:15:32
knowing all the past symbols. So it's the
00:15:37
probability of x_t knowing all the past symbols; we want to
00:15:44
estimate this probability for all the possible symbols.
00:15:50
What we could do, for instance for language modelling,
00:15:54
would be to estimate this probability for any word
00:16:00
that would follow a sequence of previous words,
00:16:05
and this could be done by using a soft-
00:16:09
max function, which is the exponential of some
00:16:12
score, which is a dot product between the weights
00:16:18
and the hidden state of the cell.
00:16:24
This is basically what I drew here: you have another layer here
00:16:30
to get the dot product, the matrix product, to get
00:16:34
the final score, which would be y here,
00:16:39
and this is just a normalisation, so that when you sum over all
00:16:43
the possible word types you get one here.
00:16:50
This is basically what people do for language modelling.
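The softmax step described above can be sketched as follows; in practice the scores would be the dot products between each word's weight vector and the hidden state h_t:

```python
import math

def softmax(scores):
    """Normalise a list of scores into probabilities that sum to one."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

Dividing each exponential by the sum z is exactly the normalisation that makes the outputs sum to one over the vocabulary.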
00:16:54
Then, once you're done with this, how do you
00:16:58
estimate the probability of the whole sequence?
00:17:03
By simply multiplying them together:
00:17:08
this would be the probability of x_1
00:17:12
times the probability of x_2 knowing that
00:17:16
there was x_1 before, et cetera,
00:17:20
until the full length of the sequence. So here you have all you need
00:17:28
to estimate how likely your sequence is.
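This product of conditionals is usually computed in log space to avoid underflow on long sequences; a minimal sketch, where step_probs holds the conditional probability of each observed symbol given its history:

```python
import math

def sequence_log_prob(step_probs):
    """log P(x_1..x_T) = sum over t of log P(x_t | x_1..x_{t-1}),
    given the per-step conditional probabilities of the observed symbols."""
    return sum(math.log(p) for p in step_probs)

# Example: P(x_1) = 0.5 and P(x_2 | x_1) = 0.25 give P(sequence) = 0.125.
```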
00:17:36
Okay, so how do we do a forward
00:17:42
pass, and how do we train the model?
00:17:45
Basically, it's with the same algorithm
00:17:51
as a fully connected neural network
00:17:55
or a convolutional neural network,
00:18:00
so it's the backpropagation algorithm. The
00:18:05
difference here is that you need to
00:18:10
unroll your network and backpropagate the gradients
00:18:17
through time, hence the
00:18:20
name backpropagation through time. I wanted to go
00:18:27
through something here.
00:18:32
Basically, if you have an x_0 here and your cell here,
00:18:42
you get some hidden activations; at time one, h_1 here
00:18:49
has as input h_0, which is initialised to zero normally. Then at time
00:18:57
two you get x_2, et cetera,
00:19:03
until h_T,
00:19:10
which goes with x_T,
00:19:10
and you get some y_1, y_2, et cetera,
00:19:18
y_T. Then what you do is compute
00:19:24
the error you make at each time step. So here you will compute
00:19:31
a loss function at time one, here a loss at time two, et cetera,
00:19:39
and the loss at time T. And then what you do is compute the sum of
00:19:46
all these losses, so you get the final loss, which is this sum
00:19:53
of loss one, loss two, et cetera.
00:20:00
Then you differentiate each term to get
00:20:05
your gradients, needed to update the weights.
00:20:11
So what you do is differentiate, and the error will flow like this,
00:20:21
and then you differentiate this one, and it will flow like this,
00:20:27
et cetera, and here the last one, so it goes like this.
00:20:36
And you see that, in the end,
00:20:40
the arrows are going backwards; this is why
00:20:44
it's called backpropagation through time:
00:20:50
you reverse the time.
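The whole procedure can be sketched end to end for a scalar, linear RNN (linear so the derivatives stay short; the function name and the squared-error loss are my choices, not something from the talk):

```python
def bptt_scalar(xs, ys, w, u):
    """Backprop through time for a scalar linear RNN h_t = w*x_t + u*h_{t-1},
    with per-step loss 0.5*(h_t - y_t)^2 summed over time."""
    # Forward pass: unroll the network and store the hidden states.
    hs = [0.0]  # h_0 initialised to zero
    for x in xs:
        hs.append(w * x + u * hs[-1])
    loss = sum(0.5 * (h - y) ** 2 for h, y in zip(hs[1:], ys))
    # Backward pass: the gradients flow backwards through time.
    dw = du = 0.0
    dh_next = 0.0  # dL/dh_{t+1}, zero beyond the last step
    for t in reversed(range(len(xs))):
        dh = (hs[t + 1] - ys[t]) + u * dh_next  # dL/dh_t
        dw += dh * xs[t]
        du += dh * hs[t]
        dh_next = dh
    return loss, dw, du
```

Note how dh_next carries the gradient from step t+1 back to step t: that flow against the arrow of time is exactly what gives backpropagation through time its name.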
00:20:55
And before the break, a summary of all the
00:21:01
types of RNNs you can imagine.
00:21:07
You can have a one-to-one, which is not really
00:21:12
a recurrent one; it would be a standard neural network.
00:21:17
You can have a one-to-many; one-to-many
00:21:21
would be the case where you generate sentences:
00:21:25
you start from a first word, like a start-of-sentence symbol,
00:21:31
and then you predict the next word, and then you feed this word
00:21:38
as input at the next time step, and you predict
00:21:45
a new word, and again you feed it to the layer and produce a new word,
00:21:51
and you go on like this: you have a
00:21:54
generative model; you can generate sentences like this.
00:21:59
You have the many-to-one: this would be the case where you
00:22:03
have a speech utterance and you want to know
00:22:09
if it's positive or negative, like sentiment analysis, so you would output
00:22:16
just a single label at the end of the whole sequence.
00:22:21
In fact, these models always output something
00:22:28
at every time step; you just discard them,
00:22:33
you don't care about the intermediate outputs, but you have them.
00:22:38
You have the many-to-many; this one and this one are different. In this one
00:22:46
you have potentially a different length T_x
00:22:51
for your x and T_y for your y,
00:22:54
but basically at each time step you see an input and you output something.
00:23:02
In this version it's completely different: you first see
00:23:07
the whole input sequence and then you begin to predict.
00:23:11
In this kind of architecture, you
00:23:16
will call it an encoder-decoder architecture.
00:23:21
Basically, here you obtain the last hidden state
00:23:26
of the encoder, this is the encoder, so you have the last
00:23:31
hidden state of the input layer here, that is fed to the decoder, and
00:23:39
this hidden state is like a summary of the whole
00:23:44
input sequence. So you start by generating some output
00:23:49
from a representation of your input sequence that is a summary of
00:23:55
this sequence, which is the hidden state here.
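A toy sketch of that encoder-decoder idea with scalar states (the weights and the tanh recurrence are illustrative; a real decoder would also feed back its previously generated symbol at each step):

```python
import math

def encode(xs, w, u):
    """Encoder: run the recurrence over the whole input sequence;
    the last hidden state is the fixed-size 'summary' of the sequence."""
    h = 0.0
    for x in xs:
        h = math.tanh(w * x + u * h)
    return h

def decode(h, u, steps):
    """Decoder: start from the encoder summary and unroll to produce outputs."""
    ys = []
    for _ in range(steps):
        h = math.tanh(u * h)  # no new input: condition only on the summary
        ys.append(h)
    return ys
```

Whatever the input length, the decoder only ever sees the single summary state h, which is the bottleneck that attention models were later introduced to relax.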
00:24:02
Okay, and I think we stop here for the break. Thank you.



Conference Program

Raw Waveform-based Acoustic Modeling and its analysis
Mathew Magimai Doss, Idiap Research Institute
14 Feb. 2019 · 9:12 a.m.
About Sequence Classification for Sound Event Detection and end-to-end ASR
Thomas Pellegrini, IRIT, France
14 Feb. 2019 · 10:14 a.m.
Case study: Weakly-labeled Sound Event Detection
Thomas Pellegrini, IRIT, France
14 Feb. 2019 · 11:05 a.m.
Introduction to Pytorch 1
14 Feb. 2019 · 12:06 p.m.
Introduction to Pytorch 2
14 Feb. 2019 · 12:26 p.m.