Transcriptions

Note: this content has been automatically generated.
00:00:01
I'm sorry for the delay.
00:00:05
What I'm going to talk about is, of course, convolutional neural network work, but I have tried to frame it slightly differently.
00:00:13
The topic is raw waveform-based acoustic modelling and its analysis, as the title says.
00:00:20
But please don't get the impression that it only works for speech recognition.
00:00:24
What I mean to say is that acoustic modelling is pretty much whatever you do:
00:00:30
you do speech processing and build it for whichever classification problem you want to address.
00:00:38
As you'll see in the talk, there is very little to say about the architecture or the
00:00:46
training and so on; that's because pretty much that has
00:00:52
not changed over all these years.
00:00:56
So I'll briefly give some examples of how it performs across tasks, with
00:01:04
some motivation for why it's something interesting to look at.
00:01:11
Then I'm going to focus a bit more on
00:01:15
what could be called analysing this kind of network:
00:01:21
how you can really analyse these networks to get an understanding
00:01:26
of what kind of information the networks are learning.
00:01:31
And of course, that's what we always want to know:
00:01:36
whether they learn something we don't know yet. It would really be nice to see if we can
00:01:41
find something like that, but most of what you will see is that
00:01:47
they model things we already know how to process, but with some interesting task-dependent aspects.
00:02:00
Okay, so before I go into the motivation for this, I have to give a bit of background.
00:02:09
This talk is mainly based on work we have been doing at Idiap.
00:02:15
We started this work in 2011
00:02:20
with the first PhD student who started with me on this topic,
00:02:26
in the context of a project we had at the time.
00:02:32
He worked on speech recognition, and then
00:02:39
a student, Hannah, started with me on how to use this method for speaker recognition.
00:02:47
Then there was a postdoc working along the way with Hannah.
00:02:54
Then another student came who wanted to do a semester project, and I said:
00:03:00
take this network and try to study it for gender recognition; it's simple, really.
00:03:06
And more recently, one of the PhD
00:03:11
students is using it for speech assessment problems.
00:03:17
I will mainly be focusing on the work we have done ourselves, not so much on the rest;
00:03:23
that is because at least I have understood what we have been doing,
00:03:28
so I can elaborate on it. Of course, I can point you to the references:
00:03:34
all the papers exist, with more coming out. Here, though,
00:03:38
I'll try to orient the talk more towards
00:03:43
how it relates to traditional speech processing.
00:03:48
The basic motivation is this: in conventional feature extraction, we
00:03:53
are going to assume some kind of quasi-stationarity and do windowing,
00:03:57
and there will be issues of time-frequency resolution. Often what we do is take a short-term window
00:04:06
of the signal, and then we are going to split it into source and
00:04:11
system components, as was explained on the first day of the event,
00:04:16
and others also talked about the same aspects.
00:04:21
Then what we do is parametrise it; most often,
00:04:26
we are going to parametrise this information, the vocal tract filter,
00:04:31
but before that, we also incorporate some form
00:04:35
of speech perception knowledge, like we have something called critical bands,
00:04:41
where there's a dominant frequency in a critical band and the other frequencies are not as perceptible,
00:04:49
and there's a nonlinear relationship between the intensity and loudness
00:04:55
So we incorporate all this kind of
00:04:59
speech perception knowledge, and then we parametrise; that is what we are traditionally doing.
00:05:05
After you extract the features, what you do is
00:05:09
compute the temporal derivatives, the deltas, and
00:05:16
then feed everything directly into a classifier that you train.
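As a concrete anchor for this conventional pipeline, here is a minimal Python sketch, using librosa with illustrative parameter values and a synthetic stand-in signal; none of this is the exact configuration used in the talk:

```python
# Conventional pipeline: short-term windowing -> MFCCs -> time derivatives
# -> feature matrix handed to a classifier.
import numpy as np
import librosa

sr = 16000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 120 * t).astype(np.float32)   # stand-in for speech

# 25 ms window, 10 ms shift: the quasi-stationarity assumption in code.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))

d1 = librosa.feature.delta(mfcc, order=1)            # deltas
d2 = librosa.feature.delta(mfcc, order=2)            # delta-deltas

features = np.vstack([mfcc, d1, d2]).T               # (frames, 39)
print(features.shape)                                # classifier input
```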
00:05:20
So this is how we traditionally did it, and then people started going back and asking:
00:05:26
well, do I really need to do all this feature extraction? Let me go back
00:05:31
a step. And people said: oh, can we go back to the filter bank,
00:05:36
or further? So where we tried to start was: okay,
00:05:42
why do you have to do all those things? Why not take the speech signal,
00:05:47
put convolution layers on it, and then
00:05:53
add, on top, a multi-
00:05:57
layer perceptron as the classifier; the convolution part is the feature stage,
00:06:00
and why don't we train everything jointly?
00:06:06
Why can't we do that? The reasoning we started with initially was that
00:06:13
most of the feature extraction process can be written down as filtering operations; you will
00:06:18
see that even the FFT is like a filtering operation.
00:06:23
So we started from that point.
00:06:27
so
00:06:28
what it can do is help us in overcoming
00:06:33
limitations of conventional hand-crafted short-term speech processing,
00:06:38
and it can also help us towards a better understanding of the speech signal characteristics.
00:06:44
So these two aspects were the main motivation
00:06:49
to take up this kind of architecture and study it.
00:06:58
Let me illustrate the basics. In this convolutional network, the basic building block is
00:07:05
a convolution, then a nonlinearity; I have put tanh here, but you could use ReLU, and it is
00:07:11
known that ReLU will probably improve your performance a bit.
00:07:19
And when we started building this architecture, we built on
00:07:24
little prior knowledge. The one piece of prior knowledge we wanted to use was
00:07:29
that we need short-term processing, because the signal is non-stationary;
00:07:35
that means the statistical properties of the signal, like the mean and autocorrelation,
00:07:41
keep changing over time, and that's why we do short-term processing.
00:07:47
But we said we don't want to tell it beforehand what the short-term window size
00:07:53
should be. Then, as I said, feature extraction can be seen as a filtering operation,
00:08:01
and the task-relevant information can be spread across time.
00:08:05
For example, if you take a speech recogniser today,
00:08:09
you will see that the input to the neural network is not just one frame
00:08:13
every ten milliseconds; you are going to feed 250 to 300 milliseconds of speech into the neural network.
00:08:23
The whole idea when we initially started was to do everything end-to-end, and then, later,
00:08:30
once we understood it a little, to analyse what is inside this network and focus on that.
00:08:37
All of this is trained using backpropagation with a cost function based on cross-entropy.
00:08:45
As I said, this has not changed: this is how we train the neural
00:08:48
network. What you do is feed in the input signal,
00:08:53
get the output prediction, then compare it to the label and
00:08:58
backpropagate the error to adjust the weights.
00:09:02
Simple. And the error function we use here is cross-entropy because
00:09:06
it's a classification problem, so we use the cross-entropy error function.
00:09:12
This has not changed for a long time.
00:09:17
Now let's look at the input layer; the input layer is where something interesting happens.
00:09:24
Here is, let's say, a signal with some context.
00:09:28
The convolution layer is designed to operate on a small portion
00:09:34
of the signal, like we do in short-term processing.
00:09:38
So there is a window size and a window
00:09:41
shift, and the convolution layer has a
00:09:47
number of filters,
00:09:52
and each window in time is convolved with the filters; what you collect
00:09:58
is exactly your number of filters: for each window you get an output vector,
00:10:05
and you keep getting them. Now the question is: what should the window size be?
00:10:11
The window size, or what we're going to talk about, is
00:10:17
what I call sub-segmental and segmental. The reasoning is this: what we do in short-
00:10:25
term processing, as was being explained
00:10:29
on the first day, is
00:10:33
that you need to have a certain number of pitch cycles to be
00:10:38
able to decompose the source and system components.
00:10:44
If you don't have enough pitch cycles, you may only be able to see the source information, not the system information.
00:10:52
Now, where does this whole logic
00:10:58
come from? It did not come from within
00:11:00
speech processing itself; it came basically from speech coding studies,
00:11:06
where they said: we can split the signal into source and system, we can parametrise
00:11:12
this information, transmit it, save bits, and reconstruct the speech.
00:11:17
Basically, that's where this whole analysis-synthesis model emerged.
00:11:24
So we said, okay, here is the terminology:
00:11:30
what I call segmental, one pitch period or more, is what people generally use in short-
00:11:33
term processing, the conventional segmental analysis; sub-segmental means it can be less than one pitch period, whatever that period happens to be.
00:11:40
Often, what we do, you'll see, is just keep the window below two milliseconds:
00:11:46
below two milliseconds, a full period is unlikely to fit, because if
00:11:52
the fundamental period were two milliseconds, that would be 500 Hz,
00:11:57
and normal adult speakers rarely get to that level.
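To make the sub-segmental/segmental distinction concrete, here is a small back-of-the-envelope sketch in Python, assuming a 16 kHz sampling rate and a 10 ms pitch period; the values are illustrative:

```python
sr = 16000                       # assumed sampling rate
pitch_period_ms = 10.0           # ~100 Hz F0, a typical adult value

for window_samples in (30, 300): # the two window sizes used later in the talk
    window_ms = 1000.0 * window_samples / sr
    kind = "sub-segmental" if window_ms < pitch_period_ms else "segmental"
    print(f"{window_samples} samples = {window_ms:.2f} ms -> {kind}")
# 30 samples  =  1.88 ms -> sub-segmental (below one pitch period)
# 300 samples = 18.75 ms -> segmental (roughly two pitch periods)
```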
00:12:03
Okay, so what happens in this kind of network is that the first layer is going to take this input.
00:12:12
Here I'm giving the kernel width in samples
00:12:17
and the shift in samples, and I'm giving the window size being used
00:12:23
in time: thirty samples is about 1.8 milliseconds at a 16 kHz sampling rate. Then you
00:12:28
get the filter outputs, and you do max pooling.
00:12:32
Basically, the interpretation is that when you do the max
00:12:36
pooling, you're operating here on the first convolution outputs, which operate
00:12:40
on windows of, say, 1.8 milliseconds; with max pooling you're effectively getting a window size of about 2.5 milliseconds.
00:12:50
And then you add one more convolution layer, then max pooling, and so on.
00:12:56
You keep doing this with the convolution layers, and
00:13:00
then you flatten the output and give it to an ANN.
00:13:04
Some people will say: no, we will put an LSTM there instead;
00:13:09
that's a different design choice, but this is the typical way; that's what everybody does.
00:13:15
So you can put any kind of ANN on top, and predicting the class at the end is what you're doing.
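To fix ideas, here is a minimal PyTorch sketch of this kind of raw-waveform CNN: convolution plus max pooling as the feature stage, then a small MLP classifier, trained jointly with cross-entropy. The layer sizes are illustrative assumptions, not the exact configuration from the talk:

```python
import torch
import torch.nn as nn

class RawWaveformCNN(nn.Module):
    def __init__(self, n_classes, input_samples=4000):   # 250 ms at 16 kHz
        super().__init__()
        self.features = nn.Sequential(
            # First layer: 30-sample kernel (~1.9 ms, sub-segmental),
            # 10-sample shift; max pooling widens the effective window.
            nn.Conv1d(1, 16, kernel_size=30, stride=10), nn.Tanh(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=10, stride=5), nn.Tanh(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=3, stride=1), nn.Tanh(),
            nn.MaxPool1d(2),
        )
        # Infer the flattened feature size with a dummy forward pass.
        with torch.no_grad():
            n_feat = self.features(torch.zeros(1, 1, input_samples)).numel()
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(n_feat, 128), nn.Tanh(),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):               # x: (batch, 1, samples)
        return self.classifier(self.features(x))

model = RawWaveformCNN(n_classes=40)                  # e.g. phone classes
logits = model(torch.randn(8, 1, 4000))               # 8 windows of 250 ms
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 40, (8,)))
```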
00:13:22
So you can see that the input here is 250 milliseconds, and the shift:
00:13:28
what you do is take 250 milliseconds, and if you're
00:13:31
shifting by ten milliseconds, that is the frame shift we have been using in traditional processing.
00:13:39
Now, what we did initially was in speech recognition; as was discussed on the second day,
00:13:45
you do feature extraction and then the ANN. What we're doing here is
00:13:50
making this one block that gets the phone likelihood estimates.
00:13:56
There is no decoding here. And this is something Thomas also touched upon, where
00:14:03
we also tried to go all the way up; we can go further up,
00:14:08
but I'm not going to talk about that part.
00:14:12
What I'm talking about here is replacing this block by
00:14:16
the methodology we have. So here is a study on speech recognition.
00:14:23
We started it on the PF-STAR dataset.
00:14:29
We evaluated the case where the children's speech data is augmented with adult speech data as well.
00:14:37
We built systems with monophones and triphones,
00:14:43
and trained different systems, like the
00:14:46
GMM-HMM systems in Kaldi,
00:14:51
then a traditional hybrid
00:14:57
HMM/ANN system, where we use a multilayer perceptron with one hidden layer. And then
00:15:05
we trained CNN systems with four, sorry, three convolution
00:15:10
layers, but only one hidden layer; we don't have more than that.
00:15:17
So here you have one hidden layer after you extract the MFCCs;
00:15:23
because the cepstral coefficients already decorrelate, you get the
00:15:26
signal decorrelation for free there. And
00:15:31
the GMM-HMM system is trained with multi-pass training, so
00:15:37
you do speaker adaptation and all those things inside it; that is the multi-pass training.
00:15:43
And the hybrid ANN and the CNN systems are based on single-
00:15:48
pass training, so we don't have the speaker adaptation and all that stuff.
00:15:54
What we found was that the CNN performs pretty well.
00:16:00
We also see that with the adult
00:16:04
speech added, the performance is improving.
00:16:08
Note that the hybrid ANN system is not able to beat the GMM-
00:16:12
based system; without any kind of adaptation, it's not able to beat that adapted system.
00:16:19
But the CNN, while not doing any adaptation and learning the features directly,
00:16:23
performs comparably to the multi-pass system.
00:16:30
So this is just one study, to show what
00:16:34
these systems look like and how things work.
00:16:40
Setting that aside, we were looking at something else, which was:
00:16:45
okay, we are learning this representation layer by
00:16:50
layer, layer by layer,
00:16:53
and each layer should lead to an abstraction.
00:16:58
So we started asking: what is the role of all these layers internally,
00:17:05
if you go layer after layer? So we did a simple experiment.
00:17:11
The experiment is like this: you take
00:17:18
MFCCs on TIMIT data, and you train what is called a single-layer perceptron:
00:17:26
there is no hidden layer, you just have a linear classifier. That gives a baseline performance.
00:17:36
Then we take the raw speech; we have one convolution layer
00:17:40
and the ANN, with almost the same number of parameters as the MFCC
00:17:49
setup; we adjusted the parameters to get roughly the same parameter count.
00:17:55
This is what you get, and lower is better here.
00:18:00
You get comparable performance to what you get with the single-layer perceptron.
00:18:06
Then you increase the number of convolution layers, and basically the parameters
00:18:13
in the classifier layer go down and move into the parameters of the feature layers, and we see
00:18:19
that the more feature stages you have, the more the performance improves.
00:18:28
We also did the same experiment including a hidden
00:18:35
layer, a multilayer perceptron, and the trend is similar, whatever the configuration.
00:18:41
Of course, in this case you see that you can even
00:18:44
have fewer parameters than a regular multilayer perceptron system,
00:18:50
and still get better performance than it. So the
00:18:57
basic message it was giving us is that these
00:18:59
layers are in some way doing some form of abstraction, in a data-driven and task-dependent manner.
00:19:06
That is what it implied. So, moving on:
00:19:11
we extended this work to speaker verification.
00:19:16
Speaker verification is a problem where you give
00:19:19
the system a speech signal and an identity claim,
00:19:26
and the system will accept or reject that identity
00:19:29
claim. It's exactly like an ATM: you put in your bank card,
00:19:33
then you key in your password, and it will tell whether you are allowed access or not. It's exactly the
00:19:40
same problem. So we came up with an approach
00:19:44
where we first train a speaker identification system.
00:19:51
Then what we do is take that system, and
00:19:58
take out this output classification layer,
00:20:02
and we just replace it by a binary classifier. Because it's speaker verification, you have
00:20:08
target and impostor trials with people not related to the background training set; you train it on that,
00:20:15
and then you do an enrolment: basically, you adapt that network on a small amount of speaker data,
00:20:22
and then you can use it to verify whether a claim is genuine or not.
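A hedged sketch of that enrolment step, reusing the RawWaveformCNN sketch from earlier; the speaker counts, learning rate, and toy data are illustrative assumptions, not the talk's actual setup:

```python
import torch
import torch.nn as nn

# Background network: pretrained for speaker identification over, say,
# 1000 background speakers (the pretrained weights would be loaded here).
background = RawWaveformCNN(n_classes=1000)

# Enrolment: swap the speaker-ID output layer for a binary accept/reject head.
background.classifier[-1] = nn.Linear(128, 2)

# Toy enrolment data: target-speaker (label 1) and impostor (label 0) windows.
enrolment_batches = [(torch.randn(4, 1, 4000), torch.randint(0, 2, (4,)))]

opt = torch.optim.SGD(background.parameters(), lr=1e-3)
for waveforms, labels in enrolment_batches:           # adapt on little data
    loss = nn.CrossEntropyLoss()(background(waveforms), labels)
    opt.zero_grad(); loss.backward(); opt.step()
```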
00:20:29
We tried it on the Voxforge dataset, which is a relatively clean dataset.
00:20:37
We compared it with the traditional methods, like
00:20:42
the GMM-based universal background model,
00:20:46
i-vector based systems, factor analysis, and so on. All these
00:20:51
systems we compared to the case
00:20:55
with our CNN. The input to the CNN's
00:20:58
first convolution layer was 300 samples, or 30 samples.
00:21:04
The 300-sample case is what we call segmental; the 30-sample case is
00:21:07
sub-segmental. And you see that this is pretty much performing well:
00:21:13
the system is able to discriminate the speakers, and it is
00:21:19
able to learn the discriminative information directly from the waveform.
00:21:25
This method can also be combined in other ways, where we add some form of speech processing:
00:21:32
you process the signal, take the output of that processing, and
00:21:38
train the network on it; that way you can incorporate knowledge-driven processing as well.
00:21:45
Earlier today there was a presentation about
00:21:50
what are called voice quality features.
00:21:55
The source varies across speakers, and the source characteristics depend on how
00:21:59
the vocal folds are closing; the pitch,
00:22:03
the fundamental frequency, keeps changing, as does the strength of the excitation.
00:22:11
This kind of information can be useful for depression detection, because
00:22:16
on one side there is the idea that when
00:22:21
depression happens, we also somewhat lose control of our vocal folds;
00:22:27
that is what we differentiate in terms of jitter and shimmer. And yesterday Nick was talking about the frequency-
00:22:33
domain perspective: how we can go to the cepstral domain and measure those aspects.
00:22:41
Now, the problem with jitter and shimmer estimation is that
00:22:48
we want to model the source, but your signal contains the system information as well,
00:22:54
and removing the system information is, I believe, a difficult problem.
00:22:59
If you do linear prediction analysis, you get a residual signal, but that
00:23:06
doesn't mean you got rid of the whole system information inside the residual signal:
00:23:10
you may still have some system information in the residual signal.
00:23:14
So, in the time domain, removing the system information from
00:23:20
the speech signal is a challenging problem.
00:23:26
So what we did is look at where the system information sits. Most of the time,
00:23:32
in the spectrum, your formants, the resonances, are going to lie in a somewhat higher frequency range;
00:23:39
they're not going to be really low: for most sounds the first formant is found
00:23:46
around 300 Hz; some sounds are going to have
00:23:53
it around 250 Hz, and others are going to have more than that.
00:23:56
So we said: let's filter the signal with a low-pass filter and then train this convolutional
00:24:02
neural network for depression detection, where you have to decide whether this is depressed speech or not. We also tried
00:24:10
the linear prediction route: basically, do the prediction analysis, get the residual, and feed that to the system.
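For reference, here is a minimal numpy/scipy sketch of computing an LP residual with the autocorrelation method; the order and the toy signal are illustrative, and, as the talk notes, the residual still carries some system information:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lp_residual(frame, order=12):
    """Inverse-filter a frame with its LP coefficients (autocorrelation method)."""
    x = frame - np.mean(frame)
    r = np.correlate(x, x, mode="full")[len(x) - 1:]            # autocorrelation
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])  # normal equations
    inverse_filter = np.concatenate(([1.0], -a))                # A(z)
    return np.convolve(x, inverse_filter, mode="full")[:len(x)]

sr = 16000
t = np.arange(int(0.03 * sr)) / sr                              # 30 ms frame
frame = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)
residual = lp_residual(frame)   # excitation-like; system largely (not fully) removed
```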
00:24:17
We can also do what is called cepstral analysis; again, this is what I was talking about, and what
00:24:23
was presented on the first day. With cepstral analysis,
00:24:27
you can split the source and system components:
00:24:31
basically, you look at the low and high quefrency parts of the cepstrum:
00:24:36
the low quefrency part relates
00:24:39
to the system, and the high quefrency part is related to the source. So we take the high
00:24:44
quefrency coefficients, and you can invert that back.
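A minimal numpy sketch of that cepstral split; the 2 ms cutoff quefrency is an illustrative assumption: low quefrencies are kept for the system, high quefrencies for the source:

```python
import numpy as np

def high_quefrency_spectrum(frame, cutoff_ms=2.0, sr=16000):
    """Zero out the low (system) quefrencies; return the source-dominated
    log-magnitude spectrum reconstructed from the high quefrencies."""
    cepstrum = np.fft.ifft(np.log(np.abs(np.fft.fft(frame)) + 1e-12)).real
    cutoff = int(cutoff_ms * sr / 1000)          # 2 ms ~ 32 samples at 16 kHz
    liftered = cepstrum.copy()
    liftered[:cutoff] = 0.0                      # remove the system part ...
    liftered[-cutoff:] = 0.0                     # ... (cepstrum is symmetric)
    return np.fft.fft(liftered).real             # harmonic (source) structure
```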
00:24:49
And then another interesting analysis is
00:24:53
called zero-frequency analysis. The zero-frequency analysis
00:24:57
follows this reasoning: when the vocal folds are exciting
00:25:00
the system, the excitation is more like an impulse signal,
00:25:05
so its energy is going to spread across all the frequencies,
00:25:11
including zero frequency. So what you can do is pass the signal through
00:25:14
a resonator at zero frequency and filter the signal there.
00:25:19
What happens in this case is that the system information, we can expect, will more or less get filtered
00:25:26
out, and what will remain is information related
00:25:31
more to the excitation, we would say:
00:25:34
a kind of representation of the signal related to the source is what we get.
00:25:42
And there's no windowing or anything being done;
00:25:46
it's a simple IIR filter, and you train the con-
00:25:49
volutional neural network with that. So the basic idea is that you can
00:25:53
combine speech processing methods with raw waveform modelling.
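In the spirit of that zero-frequency filtering idea, here is a hedged numpy sketch: repeated integration acts as resonators at 0 Hz, and a moving-average subtraction removes the resulting trend. The trend-window length and number of passes are common heuristics assumed here, not necessarily the talk's exact recipe:

```python
import numpy as np

def zero_frequency_filter(s, sr=16000, trend_ms=15.0, passes=2):
    x = np.diff(s, prepend=s[:1])          # remove DC offset / drift
    y = x.astype(float)
    for _ in range(2):                     # one 0 Hz resonator = double pole at z = 1,
        y = np.cumsum(np.cumsum(y))        # i.e. a double cumulative sum; two cascaded
    w = int(trend_ms * sr / 1000) | 1      # odd moving-average window (~1.5 pitch periods)
    kernel = np.ones(w) / w
    for _ in range(passes):                # subtract the local mean to remove the trend
        y = y - np.convolve(y, kernel, mode="same")
    return y                               # zero crossings ~ glottal excitation instants

sr = 16000
t = np.arange(sr // 2) / sr
zff = zero_frequency_filter(np.sin(2 * np.pi * 100 * t), sr)  # toy 100 Hz input
```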
00:25:59
We did this on the AVEC corpus.
00:26:03
The baseline is the regular approach, where we extract features,
00:26:08
like the LLD features, and train an SVM system.
00:26:14
Then we have the LLD features with an LSTM, and
00:26:18
then there are spectral features fed in as well, where
00:26:22
the spectral features are nothing but log critical-band energies.
00:26:28
And we also trained, just for ourselves, an MFCC-based system. These
00:26:36
two systems, I would say, the spectral and the MFCC ones, are mainly focusing on the system information,
00:26:42
because when you do the MFCC with a mel filter bank and all that, you're
00:26:48
pretty much focusing mainly on the system; you're getting rid of the harmonic information,
00:26:54
the MFCC especially so. So these are the trends; we'll come back to that.
00:27:01
When we do this even with the raw speech, we see that
00:27:07
we are not able to detect the depression very well with that.
00:27:12
But when we do sub-segmental modelling, that is, around thirty samples
00:27:17
at the input, with the CNN's first layer processing thirty samples at
00:27:24
a time, it tries to model valuable information and better separates depression and control.
00:27:31
If you do low-pass filtering, there is a gain in the sub-segmental analysis
00:27:36
as well as a gain with the segmental analysis, and we see that
00:27:42
only when you apply that knowledge and filter the signal do you get improved performance.
00:27:49
And we see that with the zero-frequency filtered signal
00:27:52
we get a better overall score, the average of the
00:27:55
depression and control detection rates, and we see that
00:28:00
the sub-segmental and the segmental approaches are kind of performing well as
00:28:06
systems. So, I have listed things here, but I'm not going to go through many applications;
00:28:17
I have just tried to list some of the findings. What we observed is: in speech recognition,
00:28:23
the input window size is like 250 to 300 milliseconds with a ten-millisecond shift,
00:28:28
and it is sub-segmental modelling in the first layer;
00:28:33
usually there will be three to five convolution layers, and one to three hidden layers give a competitive system.
00:28:40
In speaker recognition, what we found is that the input is about 500
00:28:44
milliseconds, but we have tried even two seconds and it works;
00:28:49
the first layer can model
00:28:53
segmental or sub-segmental information; the number of convolution layers here is
00:28:56
between two and five, and we need one hidden layer.
00:29:05
Then there is presentation attack detection. What a presentation
00:29:09
attack means is that someone fools a speaker verification system,
00:29:13
say, by recording my voice and claiming to the system that they are me, and the system has to
00:29:19
somehow detect that it's a spoof. In that case, from the analysis, we found that
00:29:24
segmental analysis is quite good; of course, sub-segmental is also good.
00:29:30
That is also now in a paper. We need a number of convolution layers around two,
00:29:36
and we can do this kind of detection even without a hidden layer. Then there's gender recognition and
00:29:42
paralinguistics, for which I presented the depression detection. So what all this says is that
00:29:50
for all these kinds of problems, we are able to build really competitive systems
00:29:56
compared to the regular idea that you first extract the features and then classify them.
00:30:02
So what you may ask is: what exactly are the neural networks doing?
00:30:08
For that, I'm going to present, in the remainder of the talk,
00:30:15
what such systems learn. I'll show a first-layer
00:30:19
analysis, analysing the first convolution layer at the filter level;
00:30:24
then I'll try to analyse, with the relevance idea,
00:30:29
what this whole system, as a whole, is
00:30:34
learning, what kind of information it is focusing on.
00:30:38
So, one thing you can ask about the first
00:30:43
convolution layer, where you have the filters, is:
00:30:47
what is the cumulated response of the filters? You put all the filters
00:30:51
together, take the FFT of each, and compute the cumulated response.
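A minimal numpy sketch of that cumulative response, assuming `filters` holds the learned first-layer kernels, one per row; the FFT size and normalisation are illustrative:

```python
import numpy as np

def cumulative_response(filters, sr=16000, n_fft=512):
    """Sum of the magnitude responses of all first-layer filters."""
    mag = np.abs(np.fft.rfft(filters, n=n_fft, axis=1))   # per-filter response
    cumulative = mag.sum(axis=0)
    cumulative /= cumulative.max()                        # normalise for plotting
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    return freqs, cumulative

# e.g. with the PyTorch sketch above:
# filters = model.features[0].weight.detach().numpy().squeeze(1)  # (16, 30)
```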
00:30:57
Then you can ask another question: how do these filters respond to an input signal?
00:31:04
For that, we came up with the interpretation that
00:31:09
the convolution layer, the first convolution, is doing a matching operation, so that we can
00:31:16
relate it to spectral interpretations easily. We found, in a
00:31:21
way, that it is kind of learning a dictionary:
00:31:25
it learns some form of matched filters which respond to particular classes.
00:31:32
For these matched filters, we can look at the responses:
00:31:36
which aspect of the input is modelled by these filters,
00:31:41
for a given input, as follows. You have the convolution; this is the filtered output;
00:31:46
and you take the Fourier transform of this output of
00:31:50
the filter, and this will give you the magnitude response;
00:31:55
you can get the spectral response, take the magnitude, and analyse it.
00:32:01
Now, what is the interpretation of this? If these filters were sines
00:32:10
and cosines, and the length of your signal were equal to the number of filters
00:32:18
or more, I would say, then what we're seeing here can be seen as
00:32:27
the Fourier transform, the spectrum; that's what it really is at that point.
00:32:33
So you have the signal; with sines and cosines as the basis, you have a projection:
00:32:39
you dot this filter, a cosine or sine function,
00:32:44
each sinusoid with a different frequency, with the window, and basically you collect the projections,
00:32:52
and this is your DFT. What exactly we're doing here is
00:33:03
asking what these filters are modelling, because we know that they are not sines and cosines as a basis.
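That inner-product reading of the first layer can be written compactly; a sketch of the argument, with x the windowed input and w_k the k-th learned filter:

```latex
\[
  y_k \;=\; \sum_{n=0}^{N-1} x[n]\, w_k[n], \qquad k = 0, \dots, K-1 .
\]
% If the filters were the complex sinusoids $w_k[n] = e^{-j 2\pi k n / N}$ and
% the number of filters $K$ equalled the window length $N$, this would reduce
% to the DFT:
\[
  X[k] \;=\; \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N} .
\]
% The learned filters thus act as a data-driven, task-dependent dictionary in
% place of the fixed Fourier basis, and $|\mathrm{FFT}(w_k)|$ reads off each
% filter's frequency response.
```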
00:33:11
What is being modelled? We need some analysis. So this is
00:33:16
the case where the CNN is trained on the Wall Street Journal corpus:
00:33:21
this is the cumulated response. What you see is that the filters
00:33:26
are not modelling all the frequencies equally.
00:33:30
Typically, when people go and implement an MFCC or PLP,
00:33:35
one thing they will try to impose is a certain structure on the filter bank:
00:33:43
the filters' heights reduce as their bandwidths increase, and
00:33:49
usually there is the idea that all frequencies have to be kind
00:33:55
of represented, and you don't emphasise particular frequencies. Nothing like that happens here; what we see is that
00:34:03
it is focusing on something like the telephone bandwidth. People decided on the
00:34:10
telephone bandwidth at some point as four kilohertz, so that it could be sampled at eight kilohertz.
00:34:15
They did all these experiments with humans, with listening
00:34:19
tests, and decided this bandwidth for the telephone:
00:34:23
the telephone bandwidth of about
00:34:28
300 to 3400 Hz. That defined the
00:34:32
band, and we think the network also found that the information is there: the filters are focusing on that range.
00:34:42
So,
00:34:43
that is the cumulative-response part. Now we can ask: what if you
00:34:50
give it a signal as input, and we ask: what is the
00:34:54
spectral response coming out on the filter side?
00:34:58
To do this experiment, what we did is take the trained convolutional neural network
00:35:04
and feed it data from this American English vowel data
00:35:09
set. It is a dataset
00:35:14
where they have words like heed,
00:35:18
who'd, had: they have the same h-d context, and in between you are changing the vowels.
00:35:26
And the vowel corpus comes with the full analysis, giving
00:35:30
you the formant information, the fundamental frequency information; all this information is
00:35:35
available. So we did the analysis the way I was presenting:
00:35:40
you take a speech signal and do that analysis,
00:35:45
and get the spectrum over, like, twenty to thirty milliseconds; this is the average
00:35:52
of that. So here this is the mean over the vowel, for a boy and a girl, and
00:36:01
we see that this information tallies quite well with the vowel.
00:36:06
We did the analysis before, and found that the formants
00:36:11
tally with the formant information they got in the American English vowel dataset.
00:36:17
So what this indicates is: in the speech recognition case,
00:36:23
the first convolution layer, in some way, is focusing on the system information,
00:36:30
the formant kind of information. And I will come back with
00:36:37
another analysis to show this later, but this is what it does for speech recognition.
00:36:44
Does it mean that it will do the same for speaker recognition? You can ask yourself. If you look at the conventional literature,
00:36:52
you use MFCCs for speech recognition, and you're going to use the
00:36:56
same MFCCs, maybe with a few more coefficients, for speaker recognition.
00:37:01
You don't change the processing in between; you probably change how
00:37:06
much, what kind of detail in the spectrum you represent; you change
00:37:10
it by changing the order of the MFCCs, the definition of the MFCCs as a feature.
00:37:18
Well, in the case of speaker recognition,
00:37:22
what we found was: if you take like 300 samples at the first convolution layer, that
00:37:29
is, say, 300 samples as input, then it just focuses on the very low frequencies.
00:37:37
When we take like thirty samples, it focuses on the low frequencies as well as a bit on the high frequencies.
00:37:45
This response is entirely different from what we saw for speech recognition.
00:37:50
There you were training a phone classifier; here you're basically training it to discriminate speakers.
00:37:56
So our first thought from the analysis was: okay,
00:38:02
this means it could be modelling the fundamental frequency information.
00:38:08
To follow up on that, what we did is study the same spectral response
00:38:14
kind of thing. So here is a windowed signal, one voiced and one unvoiced,
00:38:20
and you do the spectral analysis, as I was showing before, for the windowed frame, and you see
00:38:26
that there is a distinct peak for the voiced frame. We used another software tool
00:38:33
to also get the F0 information, the hand calculation and all those things, and we saw that
00:38:39
it's coming close to what the fundamental frequency of that frame was. Meanwhile, for the unvoiced frame,
00:38:45
you don't see any such distinct peak, no F0 energy in it.
00:38:51
We said, okay, this is for one frame; is it really doing
00:38:54
the job, really capturing the fundamental frequency information?
00:38:59
So we took what is called the Keele database. The Keele database is a reference: they have recorded
00:39:05
this material for ten speakers, five male and five female.
00:39:12
They also recorded the electroglottograph, so basically the recording captures
00:39:15
when the glottal closing and opening is happening; you can get all this information,
00:39:20
and that can be a standard for us, because there's no ambiguity about what
00:39:26
the fundamental period is for it. So the Keele database is handy.
00:39:32
We passed the Keele signals through the network we trained on Voxforge,
00:39:40
and this is what we got for female and male.
00:39:46
You see the reference F0 within the band here:
00:39:52
this is coming from the Keele speech database reference, and this is the F0 from our own small detector,
00:39:59
based on computing the spectral response, detecting the peak, and marking that as the F0 estimate.
00:40:08
And we see that it tries to match nicely.
00:40:12
Even though the Keele database was not involved in our training at all, we can see that we
00:40:19
can get the F0 contour itself; it's somewhat less good for males than for females.
00:40:26
For females it's nice; for males it's not so good. One reason could be that we
00:40:32
used the 300-sample window, which is close to twenty milliseconds,
00:40:37
and to model the fundamental period well, this may be enough for
00:40:43
females, but for males it may not be; that's why you can see
00:40:48
a lot of deviations: males have lower fundamental frequencies, so the pitch cycle can be long.
00:40:55
So this is what we found in the case of a speaker recognition
00:41:00
system with like 300 samples as the convolution window input.
00:41:07
When we do the same with thirty samples, the sub-segmental analysis, we see the following:
00:41:14
here is the LP spectrum of a frame, and here is
00:41:17
the spectral response that we computed for the same frame. We see that
00:41:21
it's kind of capturing the formant information, so it's focusing more on
00:41:26
what is called the system-related information for speaker discrimination.
00:41:30
So when training at the segmental level, it was kind of neglecting,
00:41:35
kind of ignoring, all the system-related information,
00:41:41
and focusing more at the source level; when you do the sub-segmental modelling,
00:41:47
it actually models the system information also.
00:41:53
And that explains why, when we fuse the systems, we are also getting improvements,
00:41:58
because we are combining two different kinds of speaker-discriminating information.
00:42:03
So then we asked the question: we have been looking at one convolution layer, the
00:42:09
first convolution layer, but can we recognise what the network as a whole is trying to model?
00:42:16
So we took some inspiration from computer vision.
00:42:22
In computer vision they have visualisation methods where you occlude
00:42:28
parts of the input and look at how the output changes,
00:42:32
or use some reconstruction, things like that.
00:42:37
What they did there, and what we did, is the following: you are given an input image,
00:42:41
and you take the output unit corresponding to the class, before the softmax layer,
00:42:46
and you backpropagate this information all the way back to the input.
00:42:53
What it computes is how much a small variation of each pixel value will impact the prediction score.
00:43:01
That's the basic idea, and it gives what they call the relevance map, of the same
00:43:05
size as the input: you give an input image and you get the relevance map out.
00:43:13
So here is an example: this is the input image,
00:43:18
and these are computed with deconvolution and guided backpropagation.
00:43:23
Some people have come back and said, no, that is not good, and we should do
00:43:28
plain backpropagation and not guided backpropagation. But the idea is that
00:43:34
you're trying to find which portion is relevant, and here it is not so
00:43:41
clear; but, for example, in guided backpropagation you can see the cat and all those things.
00:43:48
This kind of picture we don't have in speech. If you backpropagate these kinds of things and
00:43:55
do it for speech, you do it with the speech signal and you get a signal back.
00:44:01
You can't just look at it and say: this is nice, this is
00:44:05
for this sound; it is not easy to say anything like that from the waveform.
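A hedged PyTorch sketch of computing such a relevance signal for a waveform: backpropagate the pre-softmax class score to the input. This is plain gradient saliency; guided backpropagation, as used in the talk, additionally modifies the backward pass through the nonlinearities. `model` is the raw-waveform CNN sketched earlier:

```python
import torch

def relevance_signal(model, waveform, target_class):
    x = waveform.clone().requires_grad_(True)   # (1, 1, n_samples)
    score = model(x)[0, target_class]           # pre-softmax class score
    score.backward()                            # d(score)/d(input sample)
    return x.grad[0, 0]                         # same length as the input

# relevance = relevance_signal(model, torch.randn(1, 1, 4000), target_class=3)
```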
00:44:09
What we did is: okay, let's look at the autocorrelation aspect, and the power spectrum that you can get from it.
00:44:17
So we did the autocorrelation of the signal and the autocorrelation of this relevance
00:44:21
signal, and we see that the fundamental frequency information is retained,
00:44:25
and it has the same characteristics. So we said: let's
00:44:30
do what is called a regular short-term analysis and see
00:44:33
what this information is, and whether it differs across different problems.
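A minimal numpy sketch of that short-term check: estimate the fundamental period of a frame (of the relevance signal, or the raw signal) from its strongest autocorrelation peak. The 50-400 Hz search range is an assumed plausible F0 range for adult speech:

```python
import numpy as np

def f0_from_autocorr(frame, sr=16000, fmin=50.0, fmax=400.0):
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)     # lag range to search
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag                             # F0 estimate in Hz

sr = 16000
t = np.arange(int(0.03 * sr)) / sr              # 30 ms frame
print(f0_from_autocorr(np.sin(2 * np.pi * 120 * t), sr))   # ~120 Hz
```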
00:44:41
So here is the case for TIMIT, where this is
00:44:47
the waveform spectrum with the fit of the linear prediction,
00:44:53
and here the relevance signal spectrum with the fit for that; similarly here, and we
00:45:01
can see that the two are kind of having the same kind of formant information.
00:45:08
So what we did is we went back to the
00:45:13
American English vowel corpus, and we tried to analyse the relevance signal like this:
00:45:18
we basically got the F0 estimate, we got the F
00:45:21
1 estimate, everything, by applying linear prediction analysis
00:45:26
on the relevance signal, and compared it to the reference results
00:45:32
that come with the corpus.
00:45:36
Of course, we saw that the first convolution layer, when I was analysing it initially, was
00:45:40
not showing any kind of F0 information in it, but
00:45:44
the relevance signal is still carrying definite F0 information. What that means is that the
00:45:50
F0 information is getting captured somewhere in the later layers.
00:45:56
For formant one and formant two, male and female, we see that we are not that bad; this information
00:46:02
is there, the relevance signal is containing it; the neural network is kind of focusing on this
00:46:09
information. This also tallies with our analysis when we did the first-layer
00:46:13
analysis; yes, this is in agreement with that one.
00:46:20
Then we looked at the speaker case: whether, with segmental modelling
00:46:24
or sub-segmental modelling, the fundamental frequency information is retained.
00:46:29
It is retained, as in the regular speech signal, whether you
00:46:34
do segmental or sub-segmental modelling.
00:46:38
And here is what happens at the spectrum level when you do the segmental modelling
00:46:45
and the sub-segmental modelling. What it says is: the segmental modelling
00:46:50
is hardly focusing on the high frequencies; it is mainly focusing on the very low ones,
00:46:55
because, again, going back to the initial analysis of the filters, it agrees with that.
00:47:03
And with the sub-segmental modelling, what
00:47:07
we see is that there is the low-frequency representation, which is related
00:47:10
to the fundamental frequency, and then it is mainly focused on the higher
00:47:14
formants' information, not much on the lower formants.
00:47:21
For the low formants, there is some kind of idea out there that
00:47:27
the low formants, like F1 and F2, are
00:47:30
more sound-specific than speaker-specific; the discrimination coming from them
00:47:36
is about the sound, and the higher formants are more speaker-specific. In fact, we can see that
00:47:42
this is what happens: it is focusing more on the higher-formant regions.
00:47:49
We then also compared everything against controls, doing
00:47:54
the same training and analysis. So, if you train for
00:47:58
phone classification, this is the relevance signal spectrogram we get:
00:48:04
we can clearly see the formant information and the harmonic structure.
00:48:13
Then, to compare: this is the original spectrogram, and this is the
00:48:18
formant plot; and so the relevance spectrogram tries to keep more of this formant spectral information.
00:48:25
When you take the speaker network and analyse it at that level,
00:48:30
with the same sub-segmental and segmental analysis, we see that
00:48:35
the formants are pretty much not so well modelled, but the higher peaks and the source are modelled more and more. What it indicates is:
00:48:42
even though you keep the same short-term waveform input, the same samples, to
00:48:47
do phone recognition or speaker recognition, it is learning different information:
00:48:52
one learns the speaker, one the speech. So it is not learning the same
00:48:57
spectral information for both of them
00:49:02
So let's come back to the point: I talked about all these things with spectral analysis.
00:49:07
Now we'll see a little example of how we can look at the time-domain analysis.
00:49:13
Here I'll come back to the case where we were feeding in
00:49:16
the zero-frequency filtered signal and training this network for depression detection.
00:49:22
How does the zero-frequency filtered signal look? As follows: it's a signal that you
00:49:29
could say is like a nice sustained wave, so it looks really smooth.
00:49:34
Now, where this signal is crossing from positive to negative,
00:49:40
that's a point of source excitation for you; that's where the excitation is happening.
00:49:48
There's a full paper on this, showing how well you can estimate the
00:49:52
source parameters, the glottal closure instants; the excitation is happening at these points.
00:50:01
Now, for depression detection, we were just
00:50:06
feeding this signal into the classifier, and then we asked:
00:50:09
is it really focusing on such fundamental frequency information? We got really good results,
00:50:14
nice results on that, but is it really modelling the excitation information?
00:50:20
To check that, we ran the same relevance analysis.
00:50:25
You get this relevance signal; here we are plotting the
00:50:29
zero-frequency filtered signal, and on top of it the relevance signal,
00:50:34
and then you see the linear prediction residual also plotted,
00:50:37
just to compare everything against the relevance signal. We see that
00:50:44
this point of zero crossing is where you get the maximum relevance; this is where it is getting its information.
00:50:52
Okay, we said, this is nice; can we really see whether it is modelling the fundamental frequency
00:50:58
and this kind of information? So we did an autocorrelation analysis. This is the
00:51:04
autocorrelation of the relevance signal in green, and then
00:51:07
the blue is the linear prediction residual, and
00:51:11
the autocorrelation of the zero-frequency filtered signal; you can clearly see at this point that
00:51:18
the peaks match. Because there is not much short-term system information, you don't see
00:51:24
any other aspects in this; the periodicity stands out.
00:51:30
So we concluded that it's modelling the fundamental period information;
00:51:34
the network is really capturing the fundamental period information
00:51:39
and taking the information related to that. That's what
00:51:43
we understood from the study on depression detection.
00:51:47
So what I have presented until now is
00:51:51
meant to show that although
00:51:55
we have sophisticated spectral-based analysis methods that we have developed,
00:52:00
we can use neural networks to actually, automatically, model directly the
00:52:05
raw signal input, in a class-specific manner.
00:52:10
You can learn it from the data, and what I tried to show you is that in each case the neural network
00:52:19
is learning information in a task-dependent manner. To put it simply, the
00:52:25
case is: if you take the current speaker recognition literature,
00:52:29
almost all the systems, the best systems, are based on what is called
00:52:35
MFCCs; the first thing you do is MFCC extraction, which is more of the system information.
00:52:41
What I tried to show you is that this network is able to even focus on the source information for the speaker discrimination.
00:52:48
We have tested it even on a larger task, like VoxCeleb, and it
00:52:52
works; we're still preparing the paper for that, but it holds there too.
00:52:58
So what it means is that by simply changing
00:53:04
how we present the input to the neural network,
00:53:09
we learn different things. We understood this from the studies, done with cross-validation: in the speech recognition
00:53:14
study we were finding the sub-segmental modelling;
00:53:19
then we did the speaker work, and we saw you can model the segmental information.
00:53:24
So we saw that this knowledge of how the signal needs to be operated
00:53:29
on to capture the information can be embedded in this whole learning process.
00:53:34
We don't need to decide beforehand whether the input should be ten milliseconds or
00:53:39
thirty milliseconds, or even the shift. For some problems we don't
00:53:44
need to worry about the sequence, like speaker recognition and
00:53:47
depression detection: if we're not focusing on the sounds,
00:53:51
you can also relax the shift information;
00:53:54
you can shift by not ten milliseconds but twenty or thirty milliseconds, and it is going to work.
00:54:02
Where can it help? What I think is that
00:54:05
it can help us to better understand the speech signal characteristics.
00:54:12
The relevance signal that I showed is still a debatable point, about the gradient we're getting:
00:54:18
is it really doing the job? It needs some more investigation.
00:54:25
If we are able to do that nicely, then we may be able to gain really good insight into how the
00:54:33
network really decomposes the speech signal, and then we can probably understand it;
00:54:39
it will help us in improving our own understanding of how to process the speech signal. Thank you.
00:54:53
Any questions?
00:54:57
(inaudible audience question)
00:55:22
No; what the analysis was like on that, on TIMIT, you know:
00:55:25
what we were showing is that on the relevance signal we did the short-term analysis,
00:55:30
and also the LP analysis, and then compared the formant and F0 information
00:55:34
inside that; that's the basic thing. So what we
00:55:37
did: you see, even if you're giving it 250 milliseconds, or even two seconds, as input,
00:55:44
we show that we can see this relevance information also in the spectrum; typically we can show
00:55:51
that with backpropagation, the guided backpropagation: we can compute it and you can look at the spectrum.
00:55:57
But if you look at the 250 milliseconds, the 250 milliseconds as
00:56:00
a signal spectrum, it's not going to be localised information for you.
00:56:05
So what we found works well in this sense, to look at this again, is:
00:56:11
we can show that we can analyse that spectrum and say what is there.
00:56:15
But then, in that case, why don't we go back to the traditional short-term analysis? That
00:56:19
is, analyse this relevance signal on a short-term basis
00:56:24
and then see what information it's getting; then we can localise it. That's what I mean to say.
00:56:35
(inaudible audience question)
00:56:59
Yeah, so in speech we did find it going to the sub-segmental
00:57:06
level. But, for example, in the case of speaker recognition, when you're doing it,
00:57:16
we keep the same sub-segmental or segmental analysis
00:57:20
at the input, but the context can go to two seconds, three seconds.
00:57:26
Then what happens is the features that come out before the classifier
00:57:30
layer are kind of suprasegmental information for me, especially
00:57:35
since, when I use like around two
00:57:38
hundred milliseconds of convolution input, I know that it's focusing on the
00:57:42
fundamental frequency information, pitch information. So what I can expect is:
00:57:49
if I'm giving two seconds of speech and modelling it, it should be capturing information that is
00:57:54
more suprasegmental, at that level. So we
00:57:59
don't need to increase this window size; the lower-level layers, when they're
00:58:06
actually stacking up, can do the job; we don't need to change that. Yeah, yes.
00:58:23
That's where we're going back now, to the point of first ascertaining that it is indeed focusing on that,
00:58:29
like going back to the Keele pitch database and then analysing whether
00:58:33
it is exact, whether it is locating things properly.
00:58:36
So we're doing some systematic analysis, and of course we
00:58:41
are expecting to be able to say something like that, but it's open.

Raw Waveform-based Acoustic Modeling and its analysis
Mathew Magimai Doss, Idiap Research Institute
Feb. 14, 2019