Transcriptions

Note: this content has been automatically generated.
00:00:00
Okay, so it's my turn.
00:00:04
So we have more or less fifteen minutes before the break,
00:00:11
so we'll see where we get in fifteen minutes.
00:00:18
First, I would like to thank the organizers for inviting
00:00:24
me to talk today. I will try to make things clear
00:00:33
about sequence classification,
00:00:37
and I will take examples from sound event detection
00:00:41
and end-to-end ASR. These last
00:00:46
months I worked more on sound event detection,
00:00:51
so contrary to Mathew, for the second part, the
00:00:56
end-to-end part, I'm not so familiar with it, so I will
00:01:00
talk about works by other
00:01:05
teams, so maybe I might say some
00:01:10
wrong things about end-to-end recognition;
00:01:15
I hope you will forgive me.
00:01:19
Okay, so the outline.
00:01:24
We need to have
00:01:28
some basic knowledge of recurrent neural networks
00:01:32
in order to talk about end-to-end speech recognition,
00:01:39
so I will try my best to explain how RNNs work.
00:01:47
Then I will talk a bit about sound event detection,
00:01:52
and then, maybe the most interesting part for you,
00:01:58
will be the second part on end-to-end
00:02:01
speech recognition. I will try to motivate why
00:02:05
people do research on this.
00:02:11
I will present the CTC approach,
00:02:16
and then I will talk about the most recent
00:02:19
works on attention models
00:02:25
and the so-called Listen, Attend and Spell model,
00:02:32
and I will say a few words on transfer learning; I think I have one slide
00:02:39
about transfer learning in the framework of end-to-end ASR.
00:02:46
Okay, so why do we care about recurrent neural networks, RNNs?
00:02:54
Because we deal with sequential
00:02:58
data, with sequence data.
00:03:02
Here in this slide you can see a series of examples
00:03:07
dealing with sequences: of course speech recognition. In speech recognition
00:03:13
you have a sequence of audio frames
00:03:17
and you want to predict a sequence of words, so it would be a sequence-to-sequence model that
00:03:24
you need to do this. In the case of music generation, you would start with nothing
00:03:31
and then try to generate a melody,
00:03:36
so in this case it would be a zero-
00:03:40
to-many model that you would need.
00:03:46
And so on: for sentiment classification it would be a many-to-one
00:03:52
model, because you have sentences and you want to predict,
00:03:59
on a five-star scale,
00:04:03
what the sentiment about a movie would be.
00:04:09
This is also the case for video activity recognition, where
00:04:15
you have a sequence of images, a video,
00:04:20
and you want to predict an activity; it could be one
00:04:25
word, but it could be several words describing an image.
00:04:32
(inaudible)
00:04:40
Okay.
00:04:43
So let's start with a bird's-eye view on recurrent neural networks.
00:04:48
Feel free to interrupt me if you
00:04:52
have questions.
00:04:58
From a generic point of view, an RNN
00:05:04
is a neural network where you have an input,
00:05:11
an input x, which is the time sequence,
00:05:16
and you feed it to a layer, and this layer has an output,
00:05:24
which we will call h, that depends on the time t,
00:05:29
and this output is fed back to the input
00:05:35
of the layer, of the recurrent layer.
00:05:38
So you can unroll, we call it unrolling the network,
00:05:45
this way: at time t = 0 you feed x_0 to the layer and
00:05:53
you obtain a hidden state, or an output, which will be h_0.
00:06:00
Then at time t = 1, you feed
00:06:06
x_1 to the layer, but you also feed
00:06:11
h_0 to the layer, and so on,
00:06:17
until the end of the sequence x.
00:06:22
And then you output your last result, your last prediction,
00:06:27
that takes into account x_T but also all the history,
00:06:33
based on what the layer saw before, so on the past.
00:06:39
More formally, generically speaking, you can define
00:06:44
an RNN with a hidden state h
00:06:48
and an optional output y (here it's called h, but it could be y),
00:06:56
which operates on a variable-length sequence x
00:07:00
of length, let's say, capital T.
00:07:04
At each time step t, you update
00:07:10
the hidden state with some function f,
00:07:15
which can be a sigmoid function or whatever; it's usually a nonlinear
00:07:22
function like a sigmoid, but it could be something much more complex,
00:07:27
like an LSTM or a GRU, but we
00:07:33
will say more about this later. I don't know if it's very visible,
00:07:42
but if you have a layer here, and it
00:07:48
takes as input x_t, it outputs some hidden state,
00:07:53
the h_t, and then if you want an output y
00:07:58
on this, you can take it as an input to
00:08:02
another layer, a feed-forward layer, so
00:08:09
like this, and you will have some weights here,
00:08:18
the weights related to the output y that you want to predict,
00:08:24
and then what you have from this layer is your y prediction.
00:08:33
And usually you have, I have here, an
00:08:41
activation function at the output of this layer.
00:08:46
Sometimes y will be equal to h, sometimes not:
00:08:51
you need one additional layer to get your final output.
00:08:59
Let's take a good example, one of the
00:09:03
simplest RNNs,
00:09:07
which looks like this. Sometimes it's
00:09:13
more difficult to understand the picture than the equations,
00:09:17
but here you have the input x_t,
00:09:23
and you multiply it by some matrix W,
00:09:29
as you would do in a standard fully connected
00:09:33
neural network, and you obtain your hidden state h.
00:09:39
Then you sum it with the preceding output
00:09:47
multiplied by another matrix, which is called the recurrent kernel
00:09:51
and denoted U here.
00:09:56
So basically you have h, the hidden state, which
00:10:03
depends on x multiplied by some matrix, and then the final output y_t
00:10:09
is some nonlinear function, like a tanh,
00:10:15
of the sum of the hidden state plus the preceding
00:10:20
output multiplied by the recurrent kernel.
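To make the recurrence concrete, here is a minimal sketch of one such step in plain Python, with no deep-learning library. The function name and the toy weights are mine, and the fed-back quantity is the hidden state, as in the standard Elman formulation:

```python
import math

def rnn_step(x_t, h_prev, W, U):
    """One step of the simple RNN described above:
    h_t = tanh(W x_t + U h_{t-1}), applied elementwise."""
    n = len(U)  # number of units
    h_t = []
    for i in range(n):
        s = sum(W[i][j] * x_t[j] for j in range(len(x_t)))  # input projection
        s += sum(U[i][j] * h_prev[j] for j in range(n))     # recurrent term
        h_t.append(math.tanh(s))
    return h_t

# Unroll over a toy sequence of 2-dimensional inputs with 2 units.
W = [[0.5, 0.0], [0.0, 0.5]]  # input kernel (n x d)
U = [[0.1, 0.0], [0.0, 0.1]]  # recurrent kernel (n x n)
h = [0.0, 0.0]                # h_0 initialised to zero
for x in [[1.0, 0.0], [0.0, 1.0]]:
    h = rnn_step(x, h, W, U)
```

After the loop, h depends on the whole input history, which is exactly the point of the recurrence.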
00:10:25
It's a very simple model. For instance, if you
00:10:30
define your network with ten units, ten cells,
00:10:37
you have your input x of dimension d, let's say;
00:10:44
each input at time t is of dimension d.
00:10:50
Then you would have W, which is called the kernel, of
00:10:54
size d times ten, the number of cells.
00:11:00
The recurrent kernel itself is of dimension the number of cells
00:11:05
times the number of cells, so it would be ten by ten.
00:11:11
You also have biases, but let's forget those for the moment.
00:11:16
Then you obtain vectors as outputs:
00:11:21
the hidden state h and the output y will
00:11:26
be of size the number of cells.
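As a sanity check on these shapes, a small sketch (the function name is mine) that counts the parameters of such a layer: kernel d by n, recurrent kernel n by n, plus n biases:

```python
def rnn_param_count(d, n):
    """Parameters of a simple RNN layer with inputs of dimension d and
    n cells: kernel W (d x n) + recurrent kernel U (n x n) + n biases."""
    return d * n + n * n + n

# With 20-dimensional inputs and ten cells: 20*10 + 10*10 + 10 = 310 parameters.
print(rnn_param_count(20, 10))
```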
00:11:32
Here you see that the matrix U, the
00:11:37
recurrent kernel, is a square matrix,
00:11:41
and it means that the cells are not independent,
00:11:47
because when you do this matrix product
00:11:52
you have a dependency between the ten cells; the
00:11:56
ten cells are not independent of each other.
00:12:03
Okay.
00:12:04
You can have much more complex boxes,
00:12:11
and maybe the most complex one
00:12:16
is the LSTM box.
00:12:21
I won't say too much here about the LSTM;
00:12:27
it means long short-term memory. Basically, you have the input here,
00:12:35
but you also define gates, and the gates are here to control how much information
00:12:41
is passed to the next part of the cell.
00:12:48
So you have an input gate, and the input gate will tell you
00:12:55
how much of the input information to keep to take a decision later on.
00:13:02
The same for these two gates: there is a forget gate
00:13:06
and an output gate. The forget gate
00:13:10
concerns the hidden state of the cell and will say how much of the
00:13:15
hidden state we need to take a final decision; the same for the output gate:
00:13:21
how much of the output I want to keep for the next iterations.
00:13:27
All these gates are like small neural networks, small layers,
00:13:34
with a sigmoid function that
00:13:39
maps everything to the range zero to one.
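A minimal sketch of one LSTM step with scalar states, just to show the role of the three gates; the weight names in the p dictionary are mine, and real implementations use weight matrices and biases:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step (everything 1-dimensional for readability).
    p holds the weights of the three gates and the candidate update."""
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev)    # input gate: how much new info to keep
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev)    # forget gate: how much old cell state to keep
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev)    # output gate: how much of the cell to expose
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev)  # candidate cell update
    c = f * c_prev + i * g                         # new cell state
    h = o * math.tanh(c)                           # new hidden state / output
    return h, c
```

Each gate is indeed a small sigmoid layer squashed to the range zero to one, and the cell state c is what carries long-term memory across steps.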
00:13:44
Okay, so this is a pretty old
00:13:50
proposal, because it was proposed in ninety-seven,
00:13:55
but it's very much used today
00:13:58
in speech recognition and sequence classification.
00:14:03
People came up with simpler models, and one which
00:14:10
is famous now is the GRU, for gated recurrent unit,
00:14:16
and basically it's very similar to an LSTM;
00:14:21
the difference is that instead of three gates
00:14:24
you have two gates, so it's a bit simpler.
00:14:30
And this was shown to have better performance on smaller datasets.
00:14:39
If you want more information about this: it was proposed in 2014
00:14:46
by Cho et al., and this was for machine translation.
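For comparison with the LSTM sketch, here is a scalar sketch of a GRU step with its two gates; again the weight names are mine, and everything is 1-dimensional for readability:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h_prev, p):
    """One scalar GRU step: an update gate z and a reset gate r,
    with no separate cell state, unlike the LSTM."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev)          # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev)          # reset gate
    g = math.tanh(p["wg"] * x + p["ug"] * (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * g                    # mix old and candidate state
```

The update gate interpolates directly between the old state and the candidate, which is what makes the GRU a bit simpler than the LSTM.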
00:14:54
Okay.
00:14:58
So let's go back to the generic idea of RNNs.
00:15:04
What do we do with them? We want to model sequences,
00:15:10
so we want to train our RNN
00:15:15
to predict the next symbol in the sequence.
00:15:19
In that case, the output at each time step t
00:15:24
is the conditional probability of the next symbol x_t
00:15:32
knowing all the past symbols. So it's the
00:15:37
probability of x_t knowing all the past symbols; we want to
00:15:44
estimate this probability for all the possible symbols.
00:15:50
What we could do, for instance for language modelling,
00:15:54
would be to estimate this probability for any word
00:16:00
that would follow a sequence of previous words,
00:16:05
and this could be done by using a soft-
00:16:09
max function, which is the exponential of some
00:16:12
score, which is a dot product between the weights
00:16:18
and the hidden state of the cell.
00:16:24
This is basically what I drew here: you have another layer here
00:16:30
to get the dot product, the matrix product, to get
00:16:34
the final score, which would be y here,
00:16:39
and this is just a normalisation, so that when you sum over all
00:16:43
the possible word types you get one here.
00:16:50
This is basically what people do for language modelling.
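The softmax step described above can be sketched as follows; in practice the scores would be the dot products between each word's weight vector and the hidden state h_t:

```python
import math

def softmax(scores):
    """Normalise a list of scores into probabilities that sum to one."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

Dividing each exponential by the sum z is exactly the normalisation that makes the outputs sum to one over the vocabulary.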
00:16:54
Then, once you're done with this, how do you
00:16:58
estimate the probability of the whole sequence?
00:17:03
By simply multiplying them together:
00:17:08
this would be the probability of x_1
00:17:12
times the probability of x_2 knowing that
00:17:16
there was x_1 before, et cetera,
00:17:20
until the full length of the sequence. So here you have all you need
00:17:28
to estimate how likely your sequence is.
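This product of conditionals is usually computed in log space to avoid underflow on long sequences; a minimal sketch, where step_probs holds the conditional probability of each observed symbol given its history:

```python
import math

def sequence_log_prob(step_probs):
    """log P(x_1..x_T) = sum over t of log P(x_t | x_1..x_{t-1}),
    given the per-step conditional probabilities of the observed symbols."""
    return sum(math.log(p) for p in step_probs)

# Example: P(x_1) = 0.5 and P(x_2 | x_1) = 0.25 give P(sequence) = 0.125.
```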
00:17:36
Okay, so how do we do a forward
00:17:42
pass, and how do we train the model?
00:17:45
Basically, it's with the same algorithm
00:17:51
as a fully connected neural network
00:17:55
or a convolutional neural network,
00:18:00
so it's the backpropagation algorithm. The
00:18:05
difference here is that you need to
00:18:10
unroll your network and backpropagate the gradients
00:18:17
through time, hence the
00:18:20
name backpropagation through time. I wanted to go
00:18:27
through something here.
00:18:32
Basically, if you have an x_0 here and your cell here,
00:18:42
you get some hidden activations; at time one, h_1 here
00:18:49
has as input h_0, which is initialised to zero normally. Then at time
00:18:57
two you get x_2, et cetera,
00:19:03
until h_T,
00:19:10
which goes with x_T,
00:19:10
and you get some y_1, y_2, et cetera,
00:19:18
y_T. Then what you do is compute
00:19:24
the error you make at each time step. So here you will compute
00:19:31
a loss function at time one, here a loss at time two, et cetera,
00:19:39
and the loss at time T. And then what you do is compute the sum of
00:19:46
all these losses, so you get the final loss, which is this sum
00:19:53
of loss one, loss two, et cetera.
00:20:00
Then you differentiate each term to get
00:20:05
your gradients, needed to update the weights.
00:20:11
So what you do is differentiate, and the error will flow like this,
00:20:21
and then you differentiate this one, and it will flow like this,
00:20:27
et cetera, and here the last one, so it goes like this.
00:20:36
And you see that, in the end,
00:20:40
the arrows are going backwards; this is why
00:20:44
it's called backpropagation through time:
00:20:50
you reverse the time.
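The whole procedure can be sketched end to end for a scalar, linear RNN (linear so the derivatives stay short; the function name and the squared-error loss are my choices, not something from the talk):

```python
def bptt_scalar(xs, ys, w, u):
    """Backprop through time for a scalar linear RNN h_t = w*x_t + u*h_{t-1},
    with per-step loss 0.5*(h_t - y_t)^2 summed over time."""
    # Forward pass: unroll the network and store the hidden states.
    hs = [0.0]  # h_0 initialised to zero
    for x in xs:
        hs.append(w * x + u * hs[-1])
    loss = sum(0.5 * (h - y) ** 2 for h, y in zip(hs[1:], ys))
    # Backward pass: the gradients flow backwards through time.
    dw = du = 0.0
    dh_next = 0.0  # dL/dh_{t+1}, zero beyond the last step
    for t in reversed(range(len(xs))):
        dh = (hs[t + 1] - ys[t]) + u * dh_next  # dL/dh_t
        dw += dh * xs[t]
        du += dh * hs[t]
        dh_next = dh
    return loss, dw, du
```

Note how dh_next carries the gradient from step t+1 back to step t: that flow against the arrow of time is exactly what gives backpropagation through time its name.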
00:20:55
And before the break, a summary of all the
00:21:01
types of RNNs you can imagine.
00:21:07
You can have a one-to-one, which is not really
00:21:12
a recurrent one; it would be a standard neural network.
00:21:17
You can have a one-to-many; one-to-many
00:21:21
would be the case where you generate sentences:
00:21:25
you start from a first word, like a start-of-sentence symbol,
00:21:31
and then you predict the next word, and then you feed this word
00:21:38
as input at the next time step, and you predict
00:21:45
a new word, and again you feed it to the layer and produce a new word,
00:21:51
and you go on like this: you have a
00:21:54
generative model; you can generate sentences like this.
00:21:59
You have the many-to-one: this would be the case where you
00:22:03
have a speech utterance and you want to know
00:22:09
if it's positive or negative, like sentiment analysis, so you would output
00:22:16
just a single label at the end of the whole sequence.
00:22:21
In fact, these models always output something
00:22:28
at every time step; you just discard them,
00:22:33
you don't care about the intermediate outputs, but you have them.
00:22:38
You have the many-to-many; this one and this one are different. In this one
00:22:46
you have potentially a different length T_x
00:22:51
for your x and T_y for your y,
00:22:54
but basically at each time step you see an input and you output something.
00:23:02
In this version it's completely different: you first see
00:23:07
the whole input sequence and then you begin to predict.
00:23:11
In this kind of architecture, you
00:23:16
will call it an encoder-decoder architecture.
00:23:21
Basically, here you obtain the last hidden state
00:23:26
of the encoder, this is the encoder, so you have the last
00:23:31
hidden state of the input layer here, that is fed to the decoder, and
00:23:39
this hidden state is like a summary of the whole
00:23:44
input sequence. So you start by generating some output
00:23:49
from a representation of your input sequence that is a summary of
00:23:55
this sequence, which is the hidden state here.
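A toy sketch of that encoder-decoder idea with scalar states (the weights and the tanh recurrence are illustrative; a real decoder would also feed back its previously generated symbol at each step):

```python
import math

def encode(xs, w, u):
    """Encoder: run the recurrence over the whole input sequence;
    the last hidden state is the fixed-size 'summary' of the sequence."""
    h = 0.0
    for x in xs:
        h = math.tanh(w * x + u * h)
    return h

def decode(h, u, steps):
    """Decoder: start from the encoder summary and unroll to produce outputs."""
    ys = []
    for _ in range(steps):
        h = math.tanh(u * h)  # no new input: condition only on the summary
        ys.append(h)
    return ys
```

Whatever the input length, the decoder only ever sees the single summary state h, which is the bottleneck that attention models were later introduced to relax.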
00:24:02
Okay, and I think we stop here for the break. Thank you.



Conference Program

Raw Waveform-based Acoustic Modeling and its analysis
Mathew Magimai Doss, Idiap Research Institute
14 Feb. 2019 · 9:12 a.m.
About Sequence Classification for Sound Event Detection and end-to-end ASR
Thomas Pellegrini, IRIT, France
14 Feb. 2019 · 10:14 a.m.
Case study: Weakly-labeled Sound Event Detection
Thomas Pellegrini, IRIT, France
14 Feb. 2019 · 11:05 a.m.
Introduction to Pytorch 1
14 Feb. 2019 · 12:06 p.m.
Introduction to Pytorch 2
14 Feb. 2019 · 12:26 p.m.