Transcriptions

Note: this content has been automatically generated.
00:00:01
I'm sorry for the delay.
00:00:05
What I'm going to talk about is, of course, convolutional neural network work, but I have tried to frame it slightly differently.
00:00:13
The topic is raw waveform-based acoustic modelling and its analysis, as the title says.
00:00:20
But please don't get the impression that it only works for speech recognition.
00:00:24
What I mean to say is that acoustic modelling is pretty much whatever you do:
00:00:30
you do speech processing and build it for whichever classification problem you want to address.
00:00:38
As you'll see in the talk, there is very little to say about the architecture or the
00:00:46
training and so on; that's because pretty much that has
00:00:52
not changed over all these years.
00:00:56
So I'll briefly give some examples of how it performs across tasks, with
00:01:04
some motivation for why it's something interesting to look at.
00:01:11
Then I'm going to focus a bit more on
00:01:15
what could be called analysing this kind of network:
00:01:21
how you can really analyse these networks to get an understanding
00:01:26
of what kind of information the networks are learning.
00:01:31
And of course, that's what we always want to know:
00:01:36
whether they learn something we don't know yet. It would really be nice to see if we can
00:01:41
find something like that, but most of what you will see is that
00:01:47
they model things we already know how to process, but with some interesting task-dependent aspects.
00:02:00
Okay, so before I go into the motivation for this, I have to give a bit of background.
00:02:09
This talk is mainly based on work we have been doing at Idiap.
00:02:15
We started this work in 2011
00:02:20
with the first PhD student who started with me on this topic,
00:02:26
in the context of a project we had at the time.
00:02:32
He worked on speech recognition, and then
00:02:39
a student, Hannah, started with me on how to use this method for speaker recognition.
00:02:47
Then there was a postdoc working along the way with Hannah.
00:02:54
Then another student came who wanted to do a semester project, and I said:
00:03:00
take this network and try to study it for gender recognition; it's simple, really.
00:03:06
And more recently, one of the PhD
00:03:11
students is using it for speech assessment problems.
00:03:17
I will mainly be focusing on the work we have done ourselves, not so much on the rest;
00:03:23
that is because at least I have understood what we have been doing,
00:03:28
so I can elaborate on it. Of course, I can point you to the references:
00:03:34
all the papers exist, with more coming out. Here, though,
00:03:38
I'll try to orient the talk more towards
00:03:43
how it relates to traditional speech processing.
00:03:48
The basic motivation is this: in conventional feature extraction, we
00:03:53
are going to assume some kind of quasi-stationarity and do windowing,
00:03:57
and there will be issues of time-frequency resolution. Often what we do is take a short-term window
00:04:06
of the signal, and then we are going to split it into source and
00:04:11
system components, as was explained on the first day of the event,
00:04:16
and others also talked about the same aspects.
00:04:21
Then what we do is parametrise it; most often,
00:04:26
we are going to parametrise this information, the vocal tract filter,
00:04:31
but before that, we also incorporate some form
00:04:35
of speech perception knowledge, like we have something called critical bands,
00:04:41
where there's a dominant frequency in a critical band and the other frequencies are not as perceptible,
00:04:49
and there's a nonlinear relationship between the intensity and loudness
00:04:55
So we incorporate all this kind of
00:04:59
speech perception knowledge, and then we parametrise; that is what we are traditionally doing.
00:05:05
After you extract the features, what you do is
00:05:09
compute the temporal derivatives, the deltas, and
00:05:16
then feed everything directly into a classifier that you train.
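As a concrete anchor for this conventional pipeline, here is a minimal Python sketch, using librosa with illustrative parameter values and a synthetic stand-in signal; none of this is the exact configuration used in the talk:

```python
# Conventional pipeline: short-term windowing -> MFCCs -> time derivatives
# -> feature matrix handed to a classifier.
import numpy as np
import librosa

sr = 16000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 120 * t).astype(np.float32)   # stand-in for speech

# 25 ms window, 10 ms shift: the quasi-stationarity assumption in code.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))

d1 = librosa.feature.delta(mfcc, order=1)            # deltas
d2 = librosa.feature.delta(mfcc, order=2)            # delta-deltas

features = np.vstack([mfcc, d1, d2]).T               # (frames, 39)
print(features.shape)                                # classifier input
```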
00:05:20
So this is how we traditionally did it, and then people started going back and asking:
00:05:26
well, do I really need to do all this feature extraction? Let me go back
00:05:31
a step. And people said: oh, can we go back to the filter bank,
00:05:36
or further? So where we tried to start was: okay,
00:05:42
why do you have to do all those things? Why not take the speech signal,
00:05:47
put convolution layers on it, and then
00:05:53
add, on top, a multi-
00:05:57
layer perceptron as the classifier; the convolution part is the feature stage,
00:06:00
and why don't we train everything jointly?
00:06:06
Why can't we do that? The reasoning we started with initially was that
00:06:13
most of the feature extraction process can be written down as filtering operations; you will
00:06:18
see that even the FFT is like a filtering operation.
00:06:23
So we started from that point.
00:06:27
so
00:06:28
what it can do is help us in overcoming
00:06:33
limitations of conventional hand-crafted short-term speech processing,
00:06:38
and it can also help us towards a better understanding of the speech signal characteristics.
00:06:44
So these two aspects were the main motivation
00:06:49
to take up this kind of architecture and study it.
00:06:58
Let me illustrate the basics. In this convolutional network, the basic building block is
00:07:05
a convolution, then a nonlinearity; I have put tanh here, but you could use ReLU, and it is
00:07:11
known that ReLU will probably improve your performance a bit.
00:07:19
And when we started building this architecture, we built on
00:07:24
little prior knowledge. The one piece of prior knowledge we wanted to use was
00:07:29
that we need short-term processing, because the signal is non-stationary;
00:07:35
that means the statistical properties of the signal, like the mean and autocorrelation,
00:07:41
keep changing over time, and that's why we do short-term processing.
00:07:47
But we said we don't want to tell it beforehand what the short-term window size
00:07:53
should be. Then, as I said, feature extraction can be seen as a filtering operation,
00:08:01
and the task-relevant information can be spread across time.
00:08:05
For example, if you take a speech recogniser today,
00:08:09
you will see that the input to the neural network is not just one frame
00:08:13
every ten milliseconds; you are going to feed 250 to 300 milliseconds of speech into the neural network.
00:08:23
The whole idea when we initially started was to do everything end-to-end, and then, later,
00:08:30
once we understood it a little, to analyse what is inside this network and focus on that.
00:08:37
All of this is trained using backpropagation with a cost function based on cross-entropy.
00:08:45
As I said, this has not changed: this is how we train the neural
00:08:48
network. What you do is feed in the input signal,
00:08:53
get the output prediction, then compare it to the label and
00:08:58
backpropagate the error to adjust the weights.
00:09:02
Simple. And the error function we use here is cross-entropy because
00:09:06
it's a classification problem, so we use the cross-entropy error function.
00:09:12
This has not changed for a long time.
00:09:17
Now let's look at the input layer; the input layer is where something interesting happens.
00:09:24
Here is, let's say, a signal with some context.
00:09:28
The convolution layer is designed to operate on a small portion
00:09:34
of the signal, like we do in short-term processing.
00:09:38
So there is a window size and a window
00:09:41
shift, and the convolution layer has a
00:09:47
number of filters,
00:09:52
and each window in time is convolved with the filters; what you collect
00:09:58
is exactly your number of filters: for each window you get an output vector,
00:10:05
and you keep getting them. Now the question is: what should the window size be?
00:10:11
The window size, or what we're going to talk about, is
00:10:17
what I call sub-segmental and segmental. The reasoning is this: what we do in short-
00:10:25
term processing, as was being explained
00:10:29
on the first day, is
00:10:33
that you need to have a certain number of pitch cycles to be
00:10:38
able to decompose the source and system components.
00:10:44
If you don't have enough pitch cycles, you may only be able to see the source information, not the system information.
00:10:52
Now, where does this whole logic
00:10:58
come from? It did not come from within
00:11:00
speech processing itself; it came basically from speech coding studies,
00:11:06
where they said: we can split the signal into source and system, we can parametrise
00:11:12
this information, transmit it, save bits, and reconstruct the speech.
00:11:17
Basically, that's where this whole analysis-synthesis model emerged.
00:11:24
So we said, okay, here is the terminology:
00:11:30
what I call segmental, one pitch period or more, is what people generally use in short-
00:11:33
term processing, the conventional segmental analysis; sub-segmental means it can be less than one pitch period, whatever that period happens to be.
00:11:40
Often, what we do, you'll see, is just keep the window below two milliseconds:
00:11:46
below two milliseconds, a full period is unlikely to fit, because if
00:11:52
the fundamental period were two milliseconds, that would be 500 Hz,
00:11:57
and normal adult speakers rarely get to that level.
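To make the sub-segmental/segmental distinction concrete, here is a small back-of-the-envelope sketch in Python, assuming a 16 kHz sampling rate and a 10 ms pitch period; the values are illustrative:

```python
sr = 16000                       # assumed sampling rate
pitch_period_ms = 10.0           # ~100 Hz F0, a typical adult value

for window_samples in (30, 300): # the two window sizes used later in the talk
    window_ms = 1000.0 * window_samples / sr
    kind = "sub-segmental" if window_ms < pitch_period_ms else "segmental"
    print(f"{window_samples} samples = {window_ms:.2f} ms -> {kind}")
# 30 samples  =  1.88 ms -> sub-segmental (below one pitch period)
# 300 samples = 18.75 ms -> segmental (roughly two pitch periods)
```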
00:12:03
Okay, so what happens in this kind of network is that the first layer is going to take this input.
00:12:12
Here I'm giving the kernel width in samples
00:12:17
and the shift in samples, and I'm giving the window size being used
00:12:23
in time: thirty samples is about 1.8 milliseconds at a 16 kHz sampling rate. Then you
00:12:28
get the filter outputs, and you do max pooling.
00:12:32
Basically, the interpretation is that when you do the max
00:12:36
pooling, you're operating here on the first convolution outputs, which operate
00:12:40
on windows of, say, 1.8 milliseconds; with max pooling you're effectively getting a window size of about 2.5 milliseconds.
00:12:50
And then you add one more convolution layer, then max pooling, and so on.
00:12:56
You keep doing this with the convolution layers, and
00:13:00
then you flatten the output and give it to an ANN.
00:13:04
Some people will say: no, we will put an LSTM there instead;
00:13:09
that's a different design choice, but this is the typical way; that's what everybody does.
00:13:15
So you can put any kind of ANN on top, and predicting the class at the end is what you're doing.
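To fix ideas, here is a minimal PyTorch sketch of this kind of raw-waveform CNN: convolution plus max pooling as the feature stage, then a small MLP classifier, trained jointly with cross-entropy. The layer sizes are illustrative assumptions, not the exact configuration from the talk:

```python
import torch
import torch.nn as nn

class RawWaveformCNN(nn.Module):
    def __init__(self, n_classes, input_samples=4000):   # 250 ms at 16 kHz
        super().__init__()
        self.features = nn.Sequential(
            # First layer: 30-sample kernel (~1.9 ms, sub-segmental),
            # 10-sample shift; max pooling widens the effective window.
            nn.Conv1d(1, 16, kernel_size=30, stride=10), nn.Tanh(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=10, stride=5), nn.Tanh(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=3, stride=1), nn.Tanh(),
            nn.MaxPool1d(2),
        )
        # Infer the flattened feature size with a dummy forward pass.
        with torch.no_grad():
            n_feat = self.features(torch.zeros(1, 1, input_samples)).numel()
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(n_feat, 128), nn.Tanh(),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):               # x: (batch, 1, samples)
        return self.classifier(self.features(x))

model = RawWaveformCNN(n_classes=40)                  # e.g. phone classes
logits = model(torch.randn(8, 1, 4000))               # 8 windows of 250 ms
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 40, (8,)))
```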
00:13:22
So you can see that the input here is 250 milliseconds, and the shift:
00:13:28
what you do is take 250 milliseconds, and if you're
00:13:31
shifting by ten milliseconds, that is the frame shift we have been using in traditional processing.
00:13:39
Now, what we did initially was in speech recognition; as was discussed on the second day,
00:13:45
you do feature extraction and then the ANN. What we're doing here is
00:13:50
making this one block that gets the phone likelihood estimates.
00:13:56
There is no decoding here. And this is something Thomas also touched upon, where
00:14:03
we also tried to go all the way up; we can go further up,
00:14:08
but I'm not going to talk about that part.
00:14:12
What I'm talking about here is replacing this block by
00:14:16
the methodology we have. So here is a study on speech recognition.
00:14:23
We started it on the PF-STAR dataset.
00:14:29
We evaluated the case where the children's speech data is augmented with adult speech data as well.
00:14:37
We built systems with monophones and triphones,
00:14:43
and trained different systems, like the
00:14:46
GMM-HMM systems in Kaldi,
00:14:51
then a traditional hybrid
00:14:57
HMM/ANN system, where we use a multilayer perceptron with one hidden layer. And then
00:15:05
we trained CNN systems with four, sorry, three convolution
00:15:10
layers, but only one hidden layer; we don't have more than that.
00:15:17
So here you have one hidden layer after you extract the MFCCs;
00:15:23
because the cepstral coefficients already decorrelate, you get the
00:15:26
signal decorrelation for free there. And
00:15:31
the GMM-HMM system is trained with multi-pass training, so
00:15:37
you do speaker adaptation and all those things inside it; that is the multi-pass training.
00:15:43
And the hybrid ANN and the CNN systems are based on single-
00:15:48
pass training, so we don't have the speaker adaptation and all that stuff.
00:15:54
What we found was that the CNN performs pretty well.
00:16:00
We also see that with the adult
00:16:04
speech added, the performance is improving.
00:16:08
Note that the hybrid ANN system is not able to beat the GMM-
00:16:12
based system; without any kind of adaptation, it's not able to beat that adapted system.
00:16:19
But the CNN, while not doing any adaptation and learning the features directly,
00:16:23
performs comparably to the multi-pass system.
00:16:30
So this is just one study, to show what
00:16:34
these systems look like and how things work.
00:16:40
Setting that aside, we were looking at something else, which was:
00:16:45
okay, we are learning this representation layer by
00:16:50
layer, layer by layer,
00:16:53
and each layer should lead to an abstraction.
00:16:58
So we started asking: what is the role of all these layers internally,
00:17:05
if you go layer after layer? So we did a simple experiment.
00:17:11
The experiment is like this: you take
00:17:18
MFCCs on TIMIT data, and you train what is called a single-layer perceptron:
00:17:26
there is no hidden layer, you just have a linear classifier. That gives a baseline performance.
00:17:36
Then we take the raw speech; we have one convolution layer
00:17:40
and the ANN, with almost the same number of parameters as the MFCC
00:17:49
setup; we adjusted the parameters to get roughly the same parameter count.
00:17:55
This is what you get, and lower is better here.
00:18:00
You get comparable performance to what you get with the single-layer perceptron.
00:18:06
Then you increase the number of convolution layers, and basically the parameters
00:18:13
in the classifier layer go down and move into the parameters of the feature layers, and we see
00:18:19
that the more feature stages you have, the more the performance improves.
00:18:28
We also did the same experiment including a hidden
00:18:35
layer, a multilayer perceptron, and the trend is similar, whatever the configuration.
00:18:41
Of course, in this case you see that you can even
00:18:44
have fewer parameters than a regular multilayer perceptron system,
00:18:50
and still get better performance than it. So the
00:18:57
basic message it was giving us is that these
00:18:59
layers are in some way doing some form of abstraction, in a data-driven and task-dependent manner.
00:19:06
That is what it implied. So, moving on:
00:19:11
we extended this work to speaker verification.
00:19:16
Speaker verification is a problem where you give
00:19:19
the system a speech signal and an identity claim,
00:19:26
and the system will accept or reject that identity
00:19:29
claim. It's exactly like an ATM: you put in your bank card,
00:19:33
then you key in your password, and it will tell whether you are allowed access or not. It's exactly the
00:19:40
same problem. So we came up with an approach
00:19:44
where we first train a speaker identification system.
00:19:51
Then what we do is take that system, and
00:19:58
take out this output classification layer,
00:20:02
and we just replace it by a binary classifier. Because it's speaker verification, you have
00:20:08
target and impostor trials with people not related to the background training set; you train it on that,
00:20:15
and then you do an enrolment: basically, you adapt that network on a small amount of speaker data,
00:20:22
and then you can use it to verify whether a claim is genuine or not.
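A hedged sketch of that enrolment step, reusing the RawWaveformCNN sketch from earlier; the speaker counts, learning rate, and toy data are illustrative assumptions, not the talk's actual setup:

```python
import torch
import torch.nn as nn

# Background network: pretrained for speaker identification over, say,
# 1000 background speakers (the pretrained weights would be loaded here).
background = RawWaveformCNN(n_classes=1000)

# Enrolment: swap the speaker-ID output layer for a binary accept/reject head.
background.classifier[-1] = nn.Linear(128, 2)

# Toy enrolment data: target-speaker (label 1) and impostor (label 0) windows.
enrolment_batches = [(torch.randn(4, 1, 4000), torch.randint(0, 2, (4,)))]

opt = torch.optim.SGD(background.parameters(), lr=1e-3)
for waveforms, labels in enrolment_batches:           # adapt on little data
    loss = nn.CrossEntropyLoss()(background(waveforms), labels)
    opt.zero_grad(); loss.backward(); opt.step()
```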
00:20:29
We tried it on the Voxforge dataset, which is a relatively clean dataset.
00:20:37
We compared it with the traditional methods, like
00:20:42
the GMM-based universal background model,
00:20:46
i-vector based systems, factor analysis, and so on. All these
00:20:51
systems we compared to the case
00:20:55
with our CNN. The input to the CNN's
00:20:58
first convolution layer was 300 samples, or 30 samples.
00:21:04
The 300-sample case is what we call segmental; the 30-sample case is
00:21:07
sub-segmental. And you see that this is pretty much performing well:
00:21:13
the system is able to discriminate the speakers, and it is
00:21:19
able to learn the discriminative information directly from the waveform.
00:21:25
This method can also be combined in other ways, where we add some form of speech processing:
00:21:32
you process the signal, take the output of that processing, and
00:21:38
train the network on it; that way you can incorporate knowledge-driven processing as well.
00:21:45
Earlier today there was a presentation about
00:21:50
what are called voice quality features.
00:21:55
The source varies across speakers, and the source characteristics depend on how
00:21:59
the vocal folds are closing; the pitch,
00:22:03
the fundamental frequency, keeps changing, as does the strength of the excitation.
00:22:11
This kind of information can be useful for depression detection, because
00:22:16
on one side there is the idea that when
00:22:21
depression happens, we also somewhat lose control of our vocal folds;
00:22:27
that is what we differentiate in terms of jitter and shimmer. And yesterday Nick was talking about the frequency-
00:22:33
domain perspective: how we can go to the cepstral domain and measure those aspects.
00:22:41
Now, the problem with jitter and shimmer estimation is that
00:22:48
we want to model the source, but your signal contains the system information as well,
00:22:54
and removing the system information is, I believe, a difficult problem.
00:22:59
If you do linear prediction analysis, you get a residual signal, but that
00:23:06
doesn't mean you got rid of the whole system information inside the residual signal:
00:23:10
you may still have some system information in the residual signal.
00:23:14
So, in the time domain, removing the system information from
00:23:20
the speech signal is a challenging problem.
00:23:26
So what we did is look at where the system information sits. Most of the time,
00:23:32
in the spectrum, your formants, the resonances, are going to lie in a somewhat higher frequency range;
00:23:39
they're not going to be really low: for most sounds the first formant is found
00:23:46
around 300 Hz; some sounds are going to have
00:23:53
it around 250 Hz, and others are going to have more than that.
00:23:56
So we said: let's filter the signal with a low-pass filter and then train this convolutional
00:24:02
neural network for depression detection, where you have to decide whether this is depressed speech or not. We also tried
00:24:10
the linear prediction route: basically, do the prediction analysis, get the residual, and feed that to the system.
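For reference, here is a minimal numpy/scipy sketch of computing an LP residual with the autocorrelation method; the order and the toy signal are illustrative, and, as the talk notes, the residual still carries some system information:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lp_residual(frame, order=12):
    """Inverse-filter a frame with its LP coefficients (autocorrelation method)."""
    x = frame - np.mean(frame)
    r = np.correlate(x, x, mode="full")[len(x) - 1:]            # autocorrelation
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])  # normal equations
    inverse_filter = np.concatenate(([1.0], -a))                # A(z)
    return np.convolve(x, inverse_filter, mode="full")[:len(x)]

sr = 16000
t = np.arange(int(0.03 * sr)) / sr                              # 30 ms frame
frame = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)
residual = lp_residual(frame)   # excitation-like; system largely (not fully) removed
```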
00:24:17
We can also do what is called cepstral analysis; again, this is what I was talking about, and what
00:24:23
was presented on the first day. With cepstral analysis,
00:24:27
you can split the source and system components:
00:24:31
basically, you look at the low and high quefrency parts of the cepstrum:
00:24:36
the low quefrency part relates
00:24:39
to the system, and the high quefrency part is related to the source. So we take the high
00:24:44
quefrency coefficients, and you can invert that back.
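A minimal numpy sketch of that cepstral split; the 2 ms cutoff quefrency is an illustrative assumption: low quefrencies are kept for the system, high quefrencies for the source:

```python
import numpy as np

def high_quefrency_spectrum(frame, cutoff_ms=2.0, sr=16000):
    """Zero out the low (system) quefrencies; return the source-dominated
    log-magnitude spectrum reconstructed from the high quefrencies."""
    cepstrum = np.fft.ifft(np.log(np.abs(np.fft.fft(frame)) + 1e-12)).real
    cutoff = int(cutoff_ms * sr / 1000)          # 2 ms ~ 32 samples at 16 kHz
    liftered = cepstrum.copy()
    liftered[:cutoff] = 0.0                      # remove the system part ...
    liftered[-cutoff:] = 0.0                     # ... (cepstrum is symmetric)
    return np.fft.fft(liftered).real             # harmonic (source) structure
```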
00:24:49
And then another interesting analysis is
00:24:53
called zero-frequency analysis. The zero-frequency analysis
00:24:57
follows this reasoning: when the vocal folds are exciting
00:25:00
the system, the excitation is more like an impulse signal,
00:25:05
so its energy is going to spread across all the frequencies,
00:25:11
including zero frequency. So what you can do is pass the signal through
00:25:14
a resonator at zero frequency and filter the signal there.
00:25:19
What happens in this case is that the system information, we can expect, will more or less get filtered
00:25:26
out, and what will remain is information related
00:25:31
more to the excitation, we would say:
00:25:34
a kind of representation of the signal related to the source is what we get.
00:25:42
And there's no windowing or anything being done;
00:25:46
it's a simple IIR filter, and you train the con-
00:25:49
volutional neural network with that. So the basic idea is that you can
00:25:53
combine speech processing methods with raw waveform modelling.
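In the spirit of that zero-frequency filtering idea, here is a hedged numpy sketch: repeated integration acts as resonators at 0 Hz, and a moving-average subtraction removes the resulting trend. The trend-window length and number of passes are common heuristics assumed here, not necessarily the talk's exact recipe:

```python
import numpy as np

def zero_frequency_filter(s, sr=16000, trend_ms=15.0, passes=2):
    x = np.diff(s, prepend=s[:1])          # remove DC offset / drift
    y = x.astype(float)
    for _ in range(2):                     # one 0 Hz resonator = double pole at z = 1,
        y = np.cumsum(np.cumsum(y))        # i.e. a double cumulative sum; two cascaded
    w = int(trend_ms * sr / 1000) | 1      # odd moving-average window (~1.5 pitch periods)
    kernel = np.ones(w) / w
    for _ in range(passes):                # subtract the local mean to remove the trend
        y = y - np.convolve(y, kernel, mode="same")
    return y                               # zero crossings ~ glottal excitation instants

sr = 16000
t = np.arange(sr // 2) / sr
zff = zero_frequency_filter(np.sin(2 * np.pi * 100 * t), sr)  # toy 100 Hz input
```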
00:25:59
We did this on the AVEC corpus.
00:26:03
The baseline is the regular approach, where we extract features,
00:26:08
like the LLD features, and train an SVM system.
00:26:14
Then we have the LLD features with an LSTM, and
00:26:18
then there are spectral features fed in as well, where
00:26:22
the spectral features are nothing but log critical-band energies.
00:26:28
And we also trained, just for ourselves, an MFCC-based system. These
00:26:36
two systems, I would say, the spectral and the MFCC ones, are mainly focusing on the system information,
00:26:42
because when you do the MFCC with a mel filter bank and all that, you're
00:26:48
pretty much focusing mainly on the system; you're getting rid of the harmonic information,
00:26:54
the MFCC especially so. So these are the trends; we'll come back to that.
00:27:01
When we do this even with the raw speech, we see that
00:27:07
we are not able to detect the depression very well with that.
00:27:12
But when we do sub-segmental modelling, that is, around thirty samples
00:27:17
at the input, with the CNN's first layer processing thirty samples at
00:27:24
a time, it tries to model valuable information and better separates depression and control.
00:27:31
If you do low-pass filtering, there is a gain in the sub-segmental analysis
00:27:36
as well as a gain with the segmental analysis, and we see that
00:27:42
only when you apply that knowledge and filter the signal do you get improved performance.
00:27:49
And we see that with the zero-frequency filtered signal
00:27:52
we get a better overall score, the average of the
00:27:55
depression and control detection rates, and we see that
00:28:00
the sub-segmental and the segmental approaches are kind of performing well as
00:28:06
systems. So, I have listed things here, but I'm not going to go through many applications;
00:28:17
I have just tried to list some of the findings. What we observed is: in speech recognition,
00:28:23
the input window size is like 250 to 300 milliseconds with a ten-millisecond shift,
00:28:28
and it is sub-segmental modelling in the first layer;
00:28:33
usually there will be three to five convolution layers, and one to three hidden layers give a competitive system.
00:28:40
In speaker recognition, what we found is that the input is about 500
00:28:44
milliseconds, but we have tried even two seconds and it works;
00:28:49
the first layer can model
00:28:53
segmental or sub-segmental information; the number of convolution layers here is
00:28:56
between two and five, and we need one hidden layer.
00:29:05
Then there is presentation attack detection. What a presentation
00:29:09
attack means is that someone fools a speaker verification system,
00:29:13
say, by recording my voice and claiming to the system that they are me, and the system has to
00:29:19
somehow detect that it's a spoof. In that case, from the analysis, we found that
00:29:24
segmental analysis is quite good; of course, sub-segmental is also good.
00:29:30
That is also now in a paper. We need a number of convolution layers around two,
00:29:36
and we can do this kind of detection even without a hidden layer. Then there's gender recognition and
00:29:42
paralinguistics, for which I presented the depression detection. So what all this says is that
00:29:50
for all these kinds of problems, we are able to build really competitive systems
00:29:56
compared to the regular idea that you first extract the features and then classify them.
00:30:02
So what you may ask is: what exactly are the neural networks doing?
00:30:08
For that, I'm going to present, in the remainder of the talk,
00:30:15
what such systems learn. I'll show a first-layer
00:30:19
analysis, analysing the first convolution layer at the filter level;
00:30:24
then I'll try to analyse, with the relevance idea,
00:30:29
what this whole system, as a whole, is
00:30:34
learning, what kind of information it is focusing on.
00:30:38
So, one thing you can ask about the first
00:30:43
convolution layer, where you have the filters, is:
00:30:47
what is the cumulated response of the filters? You put all the filters
00:30:51
together, take the FFT of each, and compute the cumulated response.
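A minimal numpy sketch of that cumulative response, assuming `filters` holds the learned first-layer kernels, one per row; the FFT size and normalisation are illustrative:

```python
import numpy as np

def cumulative_response(filters, sr=16000, n_fft=512):
    """Sum of the magnitude responses of all first-layer filters."""
    mag = np.abs(np.fft.rfft(filters, n=n_fft, axis=1))   # per-filter response
    cumulative = mag.sum(axis=0)
    cumulative /= cumulative.max()                        # normalise for plotting
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    return freqs, cumulative

# e.g. with the PyTorch sketch above:
# filters = model.features[0].weight.detach().numpy().squeeze(1)  # (16, 30)
```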
00:30:57
Then you can ask another question: how do these filters respond to an input signal?
00:31:04
For that, we came up with the interpretation that
00:31:09
the convolution layer, the first convolution, is doing a matching operation, so that we can
00:31:16
relate it to spectral interpretations easily. We found, in a
00:31:21
way, that it is kind of learning a dictionary:
00:31:25
it learns some form of matched filters which respond to particular classes.
00:31:32
For these matched filters, we can look at the responses:
00:31:36
which aspect of the input is modelled by these filters,
00:31:41
for a given input, as follows. You have the convolution; this is the filtered output;
00:31:46
and you take the Fourier transform of this output of
00:31:50
the filter, and this will give you the magnitude response;
00:31:55
you can get the spectral response, take the magnitude, and analyse it.
00:32:01
Now, what is the interpretation of this? If these filters were sines
00:32:10
and cosines, and the length of your signal were equal to the number of filters
00:32:18
or more, I would say, then what we're seeing here can be seen as
00:32:27
the Fourier transform, the spectrum; that's what it really is at that point.
00:32:33
So you have the signal; with sines and cosines as the basis, you have a projection:
00:32:39
you dot this filter, a cosine or sine function,
00:32:44
each sinusoid with a different frequency, with the window, and basically you collect the projections,
00:32:52
and this is your DFT. What exactly we're doing here is
00:33:03
asking what these filters are modelling, because we know that they are not sines and cosines as a basis.
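That inner-product reading of the first layer can be written compactly; a sketch of the argument, with x the windowed input and w_k the k-th learned filter:

```latex
\[
  y_k \;=\; \sum_{n=0}^{N-1} x[n]\, w_k[n], \qquad k = 0, \dots, K-1 .
\]
% If the filters were the complex sinusoids $w_k[n] = e^{-j 2\pi k n / N}$ and
% the number of filters $K$ equalled the window length $N$, this would reduce
% to the DFT:
\[
  X[k] \;=\; \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N} .
\]
% The learned filters thus act as a data-driven, task-dependent dictionary in
% place of the fixed Fourier basis, and $|\mathrm{FFT}(w_k)|$ reads off each
% filter's frequency response.
```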
00:33:11
What is being modelled? We need some analysis. So this is
00:33:16
the case where the CNN is trained on the Wall Street Journal corpus:
00:33:21
this is the cumulated response. What you see is that the filters
00:33:26
are not modelling all the frequencies equally.
00:33:30
Typically, when people go and implement an MFCC or PLP,
00:33:35
one thing they will try to impose is a certain structure on the filter bank:
00:33:43
the filters' heights reduce as their bandwidths increase, and
00:33:49
usually there is the idea that all frequencies have to be kind
00:33:55
of represented, and you don't emphasise particular frequencies. Nothing like that happens here; what we see is that
00:34:03
it is focusing on something like the telephone bandwidth. People decided on the
00:34:10
telephone bandwidth at some point as four kilohertz, so that it could be sampled at eight kilohertz.
00:34:15
They did all these experiments with humans, with listening
00:34:19
tests, and decided this bandwidth for the telephone:
00:34:23
the telephone bandwidth of about
00:34:28
300 to 3400 Hz. That defined the
00:34:32
band, and we think the network also found that the information is there: the filters are focusing on that range.
00:34:42
So,
00:34:43
that is the cumulative-response part. Now we can ask: what if you
00:34:50
give it a signal as input, and we ask: what is the
00:34:54
spectral response coming out on the filter side?
00:34:58
To do this experiment, what we did is take the trained convolutional neural network
00:35:04
and feed it data from this American English vowel data
00:35:09
set. It is a dataset
00:35:14
where they have words like heed,
00:35:18
who'd, had: they have the same h-d context, and in between you are changing the vowels.
00:35:26
And the vowel corpus comes with the full analysis, giving
00:35:30
you the formant information, the fundamental frequency information; all this information is
00:35:35
available. So we did the analysis the way I was presenting:
00:35:40
you take a speech signal and do that analysis,
00:35:45
and get the spectrum over, like, twenty to thirty milliseconds; this is the average
00:35:52
of that. So here this is the mean over the vowel, for a boy and a girl, and
00:36:01
we see that this information tallies quite well with the vowel.
00:36:06
We did the analysis before, and found that the formants
00:36:11
tally with the formant information they got in the American English vowel dataset.
00:36:17
So what this indicates is: in the speech recognition case,
00:36:23
the first convolution layer, in some way, is focusing on the system information,
00:36:30
the formant kind of information. And I will come back with
00:36:37
another analysis to show this later, but this is what it does for speech recognition.
00:36:44
Does it mean that it will do the same for speaker recognition? You can ask yourself. If you look at the conventional literature,
00:36:52
you use MFCCs for speech recognition, and you're going to use the
00:36:56
same MFCCs, maybe with a few more coefficients, for speaker recognition.
00:37:01
You don't change the processing in between; you probably change how
00:37:06
much, what kind of detail in the spectrum you represent; you change
00:37:10
it by changing the order of the MFCCs, the definition of the MFCCs as a feature.
00:37:18
Well, in the case of speaker recognition,
00:37:22
what we found was: if you take like 300 samples at the first convolution layer, that
00:37:29
is, say, 300 samples as input, then it just focuses on the very low frequencies.
00:37:37
When we take like thirty samples, it focuses on the low frequencies as well as a bit on the high frequencies.
00:37:45
This response is entirely different from what we saw for speech recognition.
00:37:50
There you were training a phone classifier; here you're basically training it to discriminate speakers.
00:37:56
So our first thought from the analysis was: okay,
00:38:02
this means it could be modelling the fundamental frequency information.
00:38:08
To follow up on that, what we did is study the same spectral response
00:38:14
kind of thing. So here is a windowed signal, one voiced and one unvoiced,
00:38:20
and you do the spectral analysis, as I was showing before, for the windowed frame, and you see
00:38:26
that there is a distinct peak for the voiced frame. We used another software tool
00:38:33
to also get the F0 information, the hand calculation and all those things, and we saw that
00:38:39
it's coming close to what the fundamental frequency of that frame was. Meanwhile, for the unvoiced frame,
00:38:45
you don't see any such distinct peak, no F0 energy in it.
00:38:51
We said, okay, this is for one frame; is it really doing
00:38:54
the job, really capturing the fundamental frequency information?
00:38:59
So we took what is called the Keele database. The Keele database is a reference: they have recorded
00:39:05
this material for ten speakers, five male and five female.
00:39:12
They also recorded the electroglottograph, so basically the recording captures
00:39:15
when the glottal closing and opening is happening; you can get all this information,
00:39:20
and that can be a standard for us, because there's no ambiguity about what
00:39:26
the fundamental period is for it. So the Keele database is handy.
00:39:32
We passed the Keele signals through the network we trained on Voxforge,
00:39:40
and this is what we got for female and male.
00:39:46
You see the reference F0 within the band here:
00:39:52
this is coming from the Keele speech database reference, and this is the F0 from our own small detector,
00:39:59
based on computing the spectral response, detecting the peak, and marking that as the F0 estimate.
00:40:08
And we see that it tries to match nicely.
00:40:12
Even though the Keele database was not involved in our training at all, we can see that we
00:40:19
can get the F0 contour itself; it's somewhat less good for males than for females.
00:40:26
For females it's nice; for males it's not so good. One reason could be that we
00:40:32
used the 300-sample window, which is close to twenty milliseconds,
00:40:37
and to model the fundamental period well, this may be enough for
00:40:43
females, but for males it may not be; that's why you can see
00:40:48
a lot of deviations: males have lower fundamental frequencies, so the pitch cycle can be long.
00:40:55
So this is what we found in the case of a speaker recognition
00:41:00
system with like 300 samples as the convolution window input.
00:41:07
When we do the same with thirty samples, the sub-segmental analysis, we see the following:
00:41:14
here is the LP spectrum of a frame, and here is
00:41:17
the spectral response that we computed for the same frame. We see that
00:41:21
it's kind of capturing the formant information, so it's focusing more on
00:41:26
what is called the system-related information for speaker discrimination.
00:41:30
So when training at the segmental level, it was kind of neglecting,
00:41:35
kind of ignoring, all the system-related information,
00:41:41
and focusing more at the source level; when you do the sub-segmental modelling,
00:41:47
it actually models the system information also.
00:41:53
And that explains why, when we fuse the systems, we are also getting improvements,
00:41:58
because we are combining two different kinds of speaker-discriminating information.
00:42:03
So then we asked the question: we have been looking at one convolution layer, the
00:42:09
first convolution layer, but can we recognise what the network as a whole is trying to model?
00:42:16
So we took some inspiration from computer vision.
00:42:22
In computer vision they have visualisation methods where you occlude
00:42:28
parts of the input and look at how the output changes,
00:42:32
or use some reconstruction, things like that.
00:42:37
What they did there, and what we did, is the following: you are given an input image,
00:42:41
and you take the output unit corresponding to the class, before the softmax layer,
00:42:46
and you backpropagate this information all the way back to the input.
00:42:53
What it computes is how much a small variation of each pixel value will impact the prediction score.
00:43:01
That's the basic idea, and it gives what they call the relevance map, of the same
00:43:05
size as the input: you give an input image and you get the relevance map out.
00:43:13
So here is an example: this is the input image,
00:43:18
and these are computed with deconvolution and guided backpropagation.
00:43:23
Some people have come back and said, no, that is not good, and we should do
00:43:28
plain backpropagation and not guided backpropagation. But the idea is that
00:43:34
you're trying to find which portion is relevant, and here it is not so
00:43:41
clear; but, for example, in guided backpropagation you can see the cat and all those things.
00:43:48
This kind of picture we don't have in speech. If you backpropagate these kinds of things and
00:43:55
do it for speech, you do it with the speech signal and you get a signal back.
00:44:01
You can't just look at it and say: this is nice, this is
00:44:05
for this sound; it is not easy to say anything like that from the waveform.
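A hedged PyTorch sketch of computing such a relevance signal for a waveform: backpropagate the pre-softmax class score to the input. This is plain gradient saliency; guided backpropagation, as used in the talk, additionally modifies the backward pass through the nonlinearities. `model` is the raw-waveform CNN sketched earlier:

```python
import torch

def relevance_signal(model, waveform, target_class):
    x = waveform.clone().requires_grad_(True)   # (1, 1, n_samples)
    score = model(x)[0, target_class]           # pre-softmax class score
    score.backward()                            # d(score)/d(input sample)
    return x.grad[0, 0]                         # same length as the input

# relevance = relevance_signal(model, torch.randn(1, 1, 4000), target_class=3)
```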
00:44:09
What we did is: okay, let's look at the autocorrelation aspect, and the power spectrum that you can get from it.
00:44:17
So we did the autocorrelation of the signal and the autocorrelation of this relevance
00:44:21
signal, and we see that the fundamental frequency information is retained,
00:44:25
and it has the same characteristics. So we said: let's
00:44:30
do what is called a regular short-term analysis and see
00:44:33
what this information is, and whether it differs across different problems.
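A minimal numpy sketch of that short-term check: estimate the fundamental period of a frame (of the relevance signal, or the raw signal) from its strongest autocorrelation peak. The 50-400 Hz search range is an assumed plausible F0 range for adult speech:

```python
import numpy as np

def f0_from_autocorr(frame, sr=16000, fmin=50.0, fmax=400.0):
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)     # lag range to search
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag                             # F0 estimate in Hz

sr = 16000
t = np.arange(int(0.03 * sr)) / sr              # 30 ms frame
print(f0_from_autocorr(np.sin(2 * np.pi * 120 * t), sr))   # ~120 Hz
```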
00:44:41
So here is the case for TIMIT, where this is
00:44:47
the waveform spectrum with the fit of the linear prediction,
00:44:53
and here the relevance signal spectrum with the fit for that; similarly here, and we
00:45:01
can see that the two are kind of having the same kind of formant information.
00:45:08
So what we did is we went back to the
00:45:13
American English vowel corpus, and we tried to analyse the relevance signal like this:
00:45:18
we basically got the F0 estimate, we got the F
00:45:21
1 estimate, everything, by applying linear prediction analysis
00:45:26
on the relevance signal, and compared it to the reference results
00:45:32
that come with the corpus.
00:45:36
Of course, we saw that the first convolution layer, when I was analysing it initially, was
00:45:40
not showing any kind of F0 information in it, but
00:45:44
the relevance signal is still carrying definite F0 information. What that means is that the
00:45:50
F0 information is getting captured somewhere in the later layers.
00:45:56
For formant one and formant two, male and female, we see that we are not that bad; this information
00:46:02
is there, the relevance signal is containing it; the neural network is kind of focusing on this
00:46:09
information. This also tallies with our analysis when we did the first-layer
00:46:13
analysis; yes, this is in agreement with that one.
00:46:20
Then we looked at the speaker case: whether, with segmental modelling
00:46:24
or sub-segmental modelling, the fundamental frequency information is retained.
00:46:29
It is retained, as in the regular speech signal, whether you
00:46:34
do segmental or sub-segmental modelling.
00:46:38
And here is what happens at the spectrum level when you do the segmental modelling
00:46:45
and the sub-segmental modelling. What it says is: the segmental modelling
00:46:50
is hardly focusing on the high frequencies; it is mainly focusing on the very low ones,
00:46:55
because, again, going back to the initial analysis of the filters, it agrees with that.
00:47:03
And with the sub-segmental modelling, what
00:47:07
we see is that there is the low-frequency representation, which is related
00:47:10
to the fundamental frequency, and then it is mainly focused on the higher
00:47:14
formants' information, not much on the lower formants.
00:47:21
For the low formants, there is some kind of idea out there that
00:47:27
the low formants, like F1 and F2, are
00:47:30
more sound-specific than speaker-specific; the discrimination coming from them
00:47:36
is about the sound, and the higher formants are more speaker-specific. In fact, we can see that
00:47:42
this is what happens: it is focusing more on the higher-formant regions.
00:47:49
We then also compared everything against controls, doing
00:47:54
the same training and analysis. So, if you train for
00:47:58
phone classification, this is the relevance signal spectrogram we get:
00:48:04
we can clearly see the formant information and the harmonic structure.
00:48:13
Then, to compare: this is the original spectrogram, and this is the
00:48:18
formant plot; and so the relevance spectrogram tries to keep more of this formant spectral information.
00:48:25
When you take the speaker network and analyse it at that level,
00:48:30
with the same sub-segmental and segmental analysis, we see that
00:48:35
the formants are pretty much not so well modelled, but the higher peaks and the source are modelled more and more. What it indicates is:
00:48:42
even though you keep the same short-term waveform input, the same samples, to
00:48:47
do phone recognition or speaker recognition, it is learning different information:
00:48:52
one learns the speaker, one the speech. So it is not learning the same
00:48:57
spectral information for both of them
00:49:02
So let's come back to the point: I talked about all these things with spectral analysis.
00:49:07
Now we'll see a little example of how we can look at the time-domain analysis.
00:49:13
Here I'll come back to the case where we were feeding in
00:49:16
the zero-frequency filtered signal and training this network for depression detection.
00:49:22
How does the zero-frequency filtered signal look? As follows: it's a signal that you
00:49:29
could say is like a nice sustained wave, so it looks really smooth.
00:49:34
Now, where this signal is crossing from positive to negative,
00:49:40
that's a point of source excitation for you; that's where the excitation is happening.
00:49:48
There's a full paper on this, showing how well you can estimate the
00:49:52
source parameters, the glottal closure instants; the excitation is happening at these points.
00:50:01
Now, for depression detection, we were just
00:50:06
feeding this signal into the classifier, and then we asked:
00:50:09
is it really focusing on such fundamental frequency information? We got really good results,
00:50:14
nice results on that, but is it really modelling the excitation information?
00:50:20
To check that, we ran the same relevance analysis.
00:50:25
You get this relevance signal; here we are plotting the
00:50:29
zero-frequency filtered signal, and on top of it the relevance signal,
00:50:34
and then you see the linear prediction residual also plotted,
00:50:37
just to compare everything against the relevance signal. We see that
00:50:44
this point of zero crossing is where you get the maximum relevance; this is where it is getting its information.
00:50:52
Okay, we said, this is nice; can we really see whether it is modelling the fundamental frequency
00:50:58
and this kind of information? So we did an autocorrelation analysis. This is the
00:51:04
autocorrelation of the relevance signal in green, and then
00:51:07
the blue is the linear prediction residual, and
00:51:11
the autocorrelation of the zero-frequency filtered signal; you can clearly see at this point that
00:51:18
the peaks match. Because there is not much short-term system information, you don't see
00:51:24
any other aspects in this; the periodicity stands out.
00:51:30
So we concluded that it's modelling the fundamental period information;
00:51:34
the network is really capturing the fundamental period information
00:51:39
and taking the information related to that. That's what
00:51:43
we understood from the study on depression detection.
00:51:47
So what I have presented until now is
00:51:51
meant to show that although
00:51:55
we have sophisticated spectral-based analysis methods that we have developed,
00:52:00
we can use neural networks to actually, automatically, model directly the
00:52:05
raw signal input, in a class-specific manner.
00:52:10
You can learn it from the data, and what I tried to show you is that in each case the neural network
00:52:19
is learning information in a task-dependent manner. To put it simply, the
00:52:25
case is: if you take the current speaker recognition literature,
00:52:29
almost all the systems, the best systems, are based on what is called
00:52:35
MFCCs; the first thing you do is MFCC extraction, which is more of the system information.
00:52:41
What I tried to show you is that this network is able to even focus on the source information for the speaker discrimination.
00:52:48
We have tested it even on a larger task, like VoxCeleb, and it
00:52:52
works; we're still preparing the paper for that, but it holds there too.
00:52:58
So what it means is that by simply changing
00:53:04
how we present the input to the neural network,
00:53:09
we learn different things. We understood this from the studies, done with cross-validation: in the speech recognition
00:53:14
study we were finding the sub-segmental modelling;
00:53:19
then we did the speaker work, and we saw you can model the segmental information.
00:53:24
So we saw that this knowledge of how the signal needs to be operated
00:53:29
on to capture the information can be embedded in this whole learning process.
00:53:34
We don't need to decide beforehand whether the input should be ten milliseconds or
00:53:39
thirty milliseconds, or even the shift. For some problems we don't
00:53:44
need to worry about the sequence, like speaker recognition and
00:53:47
depression detection: if we're not focusing on the sounds,
00:53:51
you can also relax the shift information;
00:53:54
you can shift by not ten milliseconds but twenty or thirty milliseconds, and it is going to work.
00:54:02
Where can it help? What I think is that
00:54:05
it can help us to better understand the speech signal characteristics.
00:54:12
The relevance signal that I showed is still a debatable point, about the gradient we're getting:
00:54:18
is it really doing the job? It needs some more investigation.
00:54:25
If we are able to do that nicely, then we may be able to gain really good insight into how the
00:54:33
network really decomposes the speech signal, and then we can probably understand it;
00:54:39
it will help us in improving our own understanding of how to process the speech signal. Thank you.
00:54:53
Any questions?
00:54:57
(inaudible audience question)
00:55:22
No; what the analysis was like on that, on TIMIT, you know:
00:55:25
what we were showing is that on the relevance signal we did the short-term analysis,
00:55:30
and also the LP analysis, and then compared the formant and F0 information
00:55:34
inside that; that's the basic thing. So what we
00:55:37
did: you see, even if you're giving it 250 milliseconds, or even two seconds, as input,
00:55:44
we show that we can see this relevance information also in the spectrum; typically we can show
00:55:51
that with backpropagation, the guided backpropagation: we can compute it and you can look at the spectrum.
00:55:57
But if you look at the 250 milliseconds, the 250 milliseconds as
00:56:00
a signal spectrum, it's not going to be localised information for you.
00:56:05
So what we found works well in this sense, to look at this again, is:
00:56:11
we can show that we can analyse that spectrum and say what is there.
00:56:15
But then, in that case, why don't we go back to the traditional short-term analysis? That
00:56:19
is, analyse this relevance signal on a short-term basis
00:56:24
and then see what information it's getting; then we can localise it. That's what I mean to say.
00:56:35
(inaudible audience question)
00:56:59
Yeah, so in speech we did find it going to the sub-segmental
00:57:06
level. But, for example, in the case of speaker recognition, when you're doing it,
00:57:16
we keep the same sub-segmental or segmental analysis
00:57:20
at the input, but the context can go to two seconds, three seconds.
00:57:26
Then what happens is the features that come out before the classifier
00:57:30
layer are kind of suprasegmental information for me, especially
00:57:35
since, when I use like around two
00:57:38
hundred milliseconds of convolution input, I know that it's focusing on the
00:57:42
fundamental frequency information, pitch information. So what I can expect is:
00:57:49
if I'm giving two seconds of speech and modelling it, it should be capturing information that is
00:57:54
more suprasegmental, at that level. So we
00:57:59
don't need to increase this window size; the lower-level layers, when they're
00:58:06
actually stacking up, can do the job; we don't need to change that. Yeah, yes.
00:58:23
That's where we're going back now, to the point of first ascertaining that it is indeed focusing on that,
00:58:29
like going back to the Keele pitch database and then analysing whether
00:58:33
it is exact, whether it is locating things properly.
00:58:36
So we're doing some systematic analysis, and of course we
00:58:41
are expecting to be able to say something like that, but it's open.

Raw Waveform-based Acoustic Modeling and its analysis
Mathew Magimai Doss, Idiap Research Institute
Feb. 14, 2019