Transcriptions

Note: this content has been automatically generated.
00:00:01
okay if you don't mind i can maybe finish this and then
00:00:05
we'll take a break or go on directly whatever is the best approach
00:00:11
so here i will do that
00:00:17
now so we go back to the string matching problem
00:00:25
so we had these two strings and we saw that you can do dynamic
00:00:30
programming and you can match the two strings that's what we saw
00:00:36
now we can
00:00:41
view this string matching with a probabilistic interpretation
00:00:48
the interpretation is the following
00:00:54
is that with the symbols that we see here there is no ambiguity
00:01:00
so if you know the set of symbols say the english alphabet you
00:01:04
have twenty six plus some other symbols you want to add
00:01:11
so there's no ambiguity so the posterior probability for the symbol
00:01:18
will be like a delta function no ambiguity
00:01:22
means kronecker delta probability distributions
00:01:27
and then you go back to the dynamic programming you keep the same
00:01:33
constraints except that you go and change your local
00:01:37
score you compare these two distributions
00:01:44
you can compare them in many different ways one way is to take the kullback
00:01:47
leibler divergence between the two or you can compute the bhattacharyya distance
00:01:52
there are many possibilities there's a paper where we
00:01:56
talked about many different ways to compare two probability distributions
00:02:02
with these delta distributions in general the local score will be zero or one
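A minimal sketch of the two local scores named above, for categorical distributions stored as plain Python lists. The function names and the `floor` value (used only to avoid taking the log of zero) are assumptions made for illustration, not something given in the talk.

```python
import math

def kl_divergence(p, q, floor=1e-10):
    """KL(p || q) for two categorical distributions given as lists."""
    total = 0.0
    for pk, qk in zip(p, q):
        if pk > 0.0:
            # floor q so a zero entry gives a large finite cost, not an error
            total += pk * math.log(pk / max(qk, floor))
    return total

def bhattacharyya_distance(p, q, floor=1e-10):
    """Minus the log of the Bhattacharyya coefficient sum_k sqrt(p_k * q_k)."""
    coeff = sum(math.sqrt(pk * qk) for pk, qk in zip(p, q))
    return -math.log(max(coeff, floor))

# Two delta (one-hot) distributions: no ambiguity about the symbol.
a = [1.0, 0.0, 0.0]
b = [0.0, 1.0, 0.0]
same_symbol_score = bhattacharyya_distance(a, a)   # zero cost
diff_symbol_score = bhattacharyya_distance(a, b)   # large cost set by the floor
```

With delta distributions both measures only distinguish "same symbol" from "different symbol", which is the hard match/mismatch decision of ordinary string matching.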
00:02:11
so this problem if you formulate it carefully
00:02:16
this string matching is a sequence of hypothesis testing problems
00:02:20
we can show that what you do here is hypothesis testing
00:02:25
it's a hypothesis testing problem you're comparing two distributions and you're
00:02:29
saying whether they're the same or not this is the decision you're making
00:02:34
if they are the same you say zero if not you say one so you are making this kind of decision
00:02:42
here it's a hard decision but this is fine as long as you're absolutely sure about the symbols there's no problem
00:02:54
uh_huh
00:02:58
now we can also define
00:03:02
this whole process of matching a word hypothesis and a speech signal
00:03:11
exactly as a comparison of two sequences of probability distributions
00:03:17
we saw that in the development already i tried to show that this is a categorical distribution
00:03:23
are you the mapping alarm of according to put in for two
00:03:30
this product opponent for to a a part into a plastic pointed depended
00:03:36
this is a categorical distribution that is what we were trying to say
00:03:42
what we do here is instead of estimating the likelihood we estimate the posterior probability
00:03:49
so you don't compute the likelihood anymore
00:03:53
you estimate posterior probabilities and there are many estimators again you can come up with gaussian
00:04:00
mixture models then you'll have to apply bayes rule to get the posterior probabilities
00:04:09
or you can use an ANN which can directly estimate the posterior probabilities for you
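For the likelihood-based route mentioned above, the Bayes-rule step can be sketched as follows. The likelihood and prior values are made up for illustration, and `posteriors_from_likelihoods` is a hypothetical helper name.

```python
def posteriors_from_likelihoods(likelihoods, priors):
    """Bayes rule: P(class | x) is proportional to p(x | class) * P(class)."""
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    evidence = sum(joint)  # p(x), the normalising constant
    return [j / evidence for j in joint]

# Made-up per-class likelihoods for one frame, e.g. from one GMM per phone.
frame_likelihoods = [0.02, 0.06, 0.02]
class_priors = [0.5, 0.25, 0.25]
frame_posterior = posteriors_from_likelihoods(frame_likelihoods, class_priors)
```

A network trained as a classifier would produce a vector like `frame_posterior` directly, without this normalisation step.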
00:04:16
or you can find any other estimator that's fine
00:04:20
you get the sequence of posteriors all you have to do is tell me
00:04:24
you have the best posterior estimator i don't care which one that you're telling me
00:04:31
you can get it
00:04:33
then once you get this
00:04:36
this problem formulates into what is called matching uncertain symbol sequences
00:04:42
there we had certain symbols but here we're uncertain about the symbols
00:04:49
and we're going to match these uncertain symbols
00:04:53
again you can do exactly the same dynamic programming there's
00:04:58
no difference except for the local score the constraints
00:05:05
are the same as what we were talking about
00:05:10
and then what we have is the local score is
00:05:13
a kullback leibler divergence or a bhattacharyya distance you can again
00:05:16
use any kind of distribution comparison measure
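The dynamic programming described above can be sketched as below, under stated assumptions: the local score is a symmetric KL divergence, the local constraints are the usual three DTW moves (horizontal, vertical, diagonal), and the final score is normalised by the sum of the two lengths. These are illustrative choices, not necessarily the configuration used in the talk.

```python
import math

def sym_kl(p, q, floor=1e-10):
    """Symmetric KL divergence: KL(p||q) + KL(q||p) = sum (p - q) * log(p / q)."""
    score = 0.0
    for pk, qk in zip(p, q):
        pk, qk = max(pk, floor), max(qk, floor)
        score += (pk - qk) * math.log(pk / qk)
    return score

def match_posterior_sequences(seq_a, seq_b, local_score=sym_kl):
    """Align two sequences of categorical distributions; lower means closer."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
            acc[i][j] = local_score(seq_a[i - 1], seq_b[j - 1]) + best_prev
    return acc[n][m] / (n + m)  # crude length normalisation

x = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9]]
y = [[0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]   # same symbols, different durations
z = [[0.1, 0.9], [0.9, 0.1]]               # symbols in the opposite order
score_xy = match_posterior_sequences(x, y)
score_xz = match_posterior_sequences(x, z)
```

Swapping `local_score` for another measure, such as a Bhattacharyya distance, changes only the cell cost and leaves the recursion untouched.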
00:05:22
which one you want to choose is that if you are doing hypothesis testing you want
00:05:29
to always choose the one that bounds the error of your detection that's the one you want to use
00:05:36
based on that you can always decide which measure you should use
00:05:42
for example in communication systems if you have a communication
00:05:47
system it has been shown that the bhattacharyya distance
00:05:52
has a better what is called lower bound than
00:05:57
KL divergence on the error of your hypothesis testing
00:06:04
so these things can guide your choice the point i want to
00:06:10
make here is that it's very simple we didn't do anything special
00:06:17
compared to the likelihood based system we just reversed this problem from likelihood to
00:06:23
posterior and then you're going to do the matching
00:06:31
okay so what changes from the likelihood based to the posterior based approach
00:06:37
the first questions the subword unit selection and the symbol set like
00:06:42
context dependent phones all those kind of things they
00:06:47
remain pretty much the same you can borrow them from the likelihood
00:06:51
based approach there's no big difference
00:06:59
then
00:07:04
then the relationship between this what is called context dependent phone and the latent unit
00:07:12
can be deterministic what i show here what i was trying to show sorry
00:07:21
what i show here is a deterministic map
00:07:25
or it's going to be a probabilistic map you can train it
00:07:31
this one i'll talk about
00:07:35
later then your local score compares two posterior probability distributions
00:07:42
so basically the only thing that changes is
00:07:48
we just went from likelihood estimation per frame to posterior probability estimation so
00:07:53
you're classifying phones and estimating the posterior probabilities of all the phones at every frame
00:08:00
at a given time frame that's effectively what we're doing yeah
00:08:08
oh
00:08:09
so you're just estimating posterior probabilities of all the phones at every frame with the network and that's what you're doing here
00:08:18
and i'll
00:08:21
this method doesn't differentiate if i want to replace my model based method by
00:08:28
say an instance based approach
00:08:32
i don't train anything i just copy this part here
00:08:38
basically here it's no more word sequences
00:08:44
it is frames a feature vector extracted for each frame here
00:08:50
and then given that you're going to estimate posterior probabilities so there
00:08:55
are two possible would base so this also begins here
00:08:59
and so then this also becomes uncertain and certain i mean is there and probably is now
00:09:04
going to redo all of all the distribution it's you take everything then when until
00:09:11
and you're going to do the same dynamic programming basically whether i represent my word hypothesis
00:09:18
by a statistical model or i represent my word hypothesis
00:09:22
by an instance the method would be pretty much the same
00:09:27
when we were dealing with likelihood there were model based approaches and the
00:09:33
thing you saw is that for the instance based approach we went to something
00:09:36
like feature extraction and
00:09:40
DTW dynamic time warping is what we were calling that and
00:09:44
what we did there was slightly different from the likelihood based approach
00:09:50
here it doesn't differ for you
00:09:53
the formulation remains the same except that you may have
00:09:57
the local constraints maybe slightly different the reason is because
00:10:03
here we're dealing with the number of frames
00:10:08
so you cannot apply an HMM kind of local constraint
00:10:12
if you apply an HMM kind of local constraint
00:10:16
like this it imposes a restriction that this length n
00:10:21
has to be less than or equal to this length m
00:10:24
that's the thing that kind of restriction cannot
00:10:28
be applied because speech can be fast or it can be slow so basically
00:10:34
you have to go back to the instance based approaches and take one of those local constraints
00:10:42
so the methodology doesn't differ in this way it unifies model based and instance based approaches
00:10:49
so some of the interesting properties are that it retains some properties of
00:10:55
string matching the nice properties that string matching provides
00:11:00
the first thing is it's locally discriminative in that at every point every cell
00:11:05
you're trying to say whether the two symbols are the same or not you are locally discriminating
00:11:12
two symbols and that's exactly what you're doing here except that here the symbols are
00:11:18
uncertain they are probability distributions and you're going to use something like bhattacharyya distance
00:11:23
or kullback leibler divergence all of which can be shown to be performing hypothesis testing
00:11:28
to tell you whether they are the same symbol or not i don't care
00:11:32
which symbol it is only whether they are the same or not the second
00:11:39
this is what we're doing locally
00:11:41
thing is it's globally discriminative in the sense that at the end of the string matching
00:11:47
you can always test a hypothesis whether the two strings match or
00:11:51
not you can set a threshold and accept or reject
00:11:55
this
00:11:57
for example when you're doing text search it is doing string matching
00:12:05
and it pulls that document or the pages for you suppose you typed it wrong
00:12:11
what does it do it's going to find the closest matching string
00:12:15
and it starts giving you suggestions that's pretty much what people
00:12:21
see you're wrong maybe you were talking about this
00:12:25
did you mean this word not that word
00:12:28
so at the end you have a hypothesis and you can test it so it's globally discriminative
00:12:37
so here is the next experiment we did what we did is we went back to this method
00:12:45
um
00:12:47
yeah
00:12:50
so you take off
00:12:53
uh that uh this it doubly k. and that's what the speech
00:12:55
signal i give exactly the same word from different speakers
00:13:02
okay so the same word i give you and i get the score
00:13:09
of course the path length can be different so you have to normalize
00:13:15
yeah
00:13:16
and then i take pairs where the two are different words
00:13:22
from different speakers all kind of things so you can take a list of pairs
00:13:27
and you can split them into matching and non matching pairs
00:13:35
and then we can plot the score and this is the distribution of the scores
00:13:40
so the ones that are matching are going to be close to zero for you
00:13:46
near the minimum and the ones that are not matching score further away
00:13:52
and you can see that there's hardly any kind of
00:13:57
overlap there's very little overlap for you
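When the matching and non-matching score distributions barely overlap, a single threshold is enough for the accept/reject decision. The sketch below picks the threshold with the fewest errors on two lists of scores. The scores are invented for illustration, and `best_threshold` is a hypothetical helper, not part of the experiment described in the talk.

```python
def best_threshold(match_scores, nonmatch_scores):
    """Return (threshold, errors): accept a pair when its score <= threshold."""
    candidates = sorted(set(match_scores) | set(nonmatch_scores))
    best = None
    for thr in candidates:
        misses = sum(1 for s in match_scores if s > thr)            # rejected matches
        false_accepts = sum(1 for s in nonmatch_scores if s <= thr)  # accepted non-matches
        errors = misses + false_accepts
        if best is None or errors < best[1]:
            best = (thr, errors)
    return best

# Invented normalised alignment scores: matching pairs sit near zero.
matching = [0.1, 0.2, 0.3, 0.9]
non_matching = [0.8, 1.5, 2.0, 2.4]
threshold, num_errors = best_threshold(matching, non_matching)
```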
00:14:01
that means what string matching generally provides you
00:14:07
is preserved in this kind of method you're preserving that capability to discriminate
00:14:13
automatically so i'm not arguing for neural networks or any
00:14:18
thing i'm simply saying if you take any estimator you can start doing this
00:14:25
that's it
00:14:27
so this was the word identification we were trying to do and
00:14:33
then we could simply extend the methodology to do keyword spotting
00:14:40
it's a bit like the example of a text document where you're looking for a keyword
00:14:46
and i can just directly extend this method as a simple string matching
00:14:51
problem to get a keyword spotting system there's no problem with that
00:14:59
now what i talk about here is that the relationship
00:15:06
can be deterministic or probabilistic so that is
00:15:11
that's what we can do yeah
00:15:14
so we have a sequence of acoustic features we feed it to the neural network i'm putting a neural
00:15:21
network here again as i said any kind of posterior probability estimator you can put in its place
00:15:28
then you get a sequence of probability vectors
00:15:35
then you take them as observations for a hidden markov model
00:15:40
and the states of this hidden markov model are parameterized by categorical distributions of the same dimension
00:15:49
and then basically you can do what is called viterbi training based
00:15:53
on a cost function like kullback leibler divergence and you can estimate these parameters
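One re-estimation step of such a training loop can be sketched as follows, assuming a hard (Viterbi) alignment of frames to states is already available and assuming the cost is the KL divergence from each frame posterior to the state distribution. For that direction the cost-minimising state distribution is the arithmetic mean of the aligned posteriors; the opposite direction gives a normalised geometric mean instead. The data and names are illustrative only.

```python
def update_state_distributions(posteriors, alignment, num_states):
    """Re-estimate one categorical distribution per state from aligned frames.

    posteriors: one posterior vector per frame
    alignment:  the state index assigned to each frame by the alignment step
    """
    dim = len(posteriors[0])
    sums = [[0.0] * dim for _ in range(num_states)]
    counts = [0] * num_states
    for frame_post, state in zip(posteriors, alignment):
        counts[state] += 1
        for k in range(dim):
            sums[state][k] += frame_post[k]
    updated = []
    for state in range(num_states):
        if counts[state] == 0:
            updated.append([1.0 / dim] * dim)  # no frames: fall back to uniform
        else:
            updated.append([v / counts[state] for v in sums[state]])
    return updated

frame_posteriors = [[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]]
frame_to_state = [0, 0, 1]
state_dists = update_state_distributions(frame_posteriors, frame_to_state, 2)
```

Alternating this update with a fresh alignment until the total cost stops decreasing gives the overall training procedure in outline.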
00:16:00
so first you train the posterior estimator here first you need the estimator once
00:16:05
you have it you go on and basically train these categorical distributions
00:16:13
so it's no more going to be one and zero it will have some entropy
00:16:21
so the distributions here are not going to be kronecker delta
00:16:29
distributions they're not going to have zero entropy anymore
00:16:37
and we can deal with this there's no problem it has some nice properties so i gave you the string matching
00:16:44
perspective so some things i'm going to talk about are what we can do with this kind of models
00:16:53
so in speech recognition when you are training an acoustic model
00:16:58
you need transcribed speech data linguistic expertise
00:17:04
like a pronunciation lexicon and text resources so i talked about this but
00:17:14
for many languages i may not have a good amount of transcribed speech
00:17:20
this problem we are also facing with pathological speech there it is really tough
00:17:29
you also lack linguistic expertise
00:17:33
four languages the languages like gaelic the language that maybe like a scully
00:17:38
for all these languages we don't have much linguistic expertise
00:17:45
for pathological speech we also don't have any linguistic expertise
00:17:53
if you have all the linguistic expert is we have is coming out well you went spokane speaking l.
00:18:02
the enormous we left by itself was trying to show that the firemen so when ups and downs
00:18:08
so what is your phone set which one are you going to talk about so we lack linguistic expertise
00:18:18
so how can we address these two challenges at the same time so
00:18:23
for example i don't have a phonetic dictionary and i have very little transcribed speech
00:18:30
so the problem can be handled in a simple way you train this neural network
00:18:37
on multilingual data
00:18:40
you take
00:18:42
languages you can pool whatever languages you have
00:18:48
you pool the phone sets and train a multilingual neural network it's a language independent model
00:18:58
what you need to train is this parameter which is language dependent
00:19:04
if you don't have a lexicon for the states like a phonetic lexicon
00:19:08
so you don't know what should be here the phones and all
00:19:12
what you can do is you can go back to grapheme based lexicons
00:19:18
that means letters like here the word
00:19:22
grapheme is g r a p h e m e
00:19:25
so this is my dictionary and the HMM states are graphemes
00:19:33
and i take the neural network and i train
00:19:37
my language dependent parameters on the target language data
00:19:44
so here is an example
00:19:47
so we pooled languages like english swiss french
00:19:53
swiss german italian and spanish and we trained this model on that
00:20:03
then we took greek as the target language
00:20:08
and what we did is with that language's training data we're building a speech recognition system for greek
00:20:18
so we are going to learn these parameters for greek
00:20:23
so we were varying the amount of training data we have
00:20:28
we varied the amount of training data starting from five minutes
00:20:32
then more minutes like this we kept adding
00:20:36
and we tried many kinds of methods many kinds of systems
00:20:42
so uh like lot of adaptation but the bit of based on that
00:20:46
means that's without falling you know we were comparing all the systems
00:20:51
and what is a is like most of the they limit that it's like that means with the
00:20:58
standard like little bit upload ecstatic that uh they they're not going to go so well
00:21:04
the standard approaches need a lot of data like up to one fifty
00:21:11
minutes before they basically start catching up with the models that we have
00:21:18
third so
00:21:22
we even went one step further and we tried to show that
00:21:28
this
00:21:31
a base
00:21:33
we went one step further and we tried to show that even if
00:21:37
you have zero minutes of data you can build a speech recogniser
00:21:45
all you need is a little knowledge like a map
00:21:50
between graphemes and phonemes if you have that little knowledge
00:21:56
what you can do is you can set the distributions using a heuristic
00:22:03
you can define them if you know the grapheme to phoneme relationship so how these
00:22:08
multilingual phones are related to the graphemes of that language if you know that
00:22:13
without looking at any acoustic data without any parameter training you may be able to do some speech recognition
00:22:21
the criticism of this idea was that i still assume
00:22:25
that i have this knowledge the grapheme to phoneme mapping what happens when you don't have this knowledge
00:22:32
so people would say i don't even have that knowledge
00:22:40
so this is one of the problems now let me
00:22:46
start doing some kind of modelling like that
00:22:52
what happens is these parameters here on one side this is graphemes and
00:23:00
this dimension is phonemes so it's going to learn grapheme to
00:23:04
phoneme relationships for you if you train the model like that
00:23:11
if you have graphemes as the states and the output of the neural network is phones you start learning grapheme
00:23:18
to phoneme relations so here there's a lot of work that we have been doing
00:23:24
so we tried to go from context
00:23:27
independent grapheme modelling to context dependent
00:23:32
so for example if you do context independent modelling with three HMM states for that
00:23:38
we see that the first date it kept doing so uh the
00:23:42
second it is corrupting charter and this paid these capping photo
00:23:49
if i give you a letter in english
00:23:52
and i ask what would be the possible sounds
00:23:56
you need a lot of context to basically figure out which sound the letter c corresponds to
00:24:01
that is a that those all that kind of the sound we're trying to get into the state
00:24:07
if i give the context i say the c is
00:24:13
at the start of the word and followed by the letter a
00:24:19
then the states one two three have high probability for the k sound
00:24:24
is an advantage to call and go and those that are if
00:24:29
if it's a high probability no i'm i'm taking those but i'm it doesn't
00:24:32
and keeping only the popular probably the same thing in those distributions
00:24:38
and i seem like i say see followed by a but the beginning it
00:24:43
was too so so so the the answers to military go for
00:24:50
see that's hatch it was too tall joe than church a jail
00:24:58
what's happening of course it's interestingly what that's what exactly you're glad you've only been what at best
00:25:05
it's basically going to model the context and say which phoneme i
00:25:08
have to map it to and that's exactly what you're trying to learn
00:25:12
in it didn't even matter giving the from the data to go directly moaning this this one
00:25:20
once you have this kind of parameters this kind of thing you can turn
00:25:25
this thing into a grapheme to phoneme converter
00:25:32
so you have the orthography of the word you split it into graphemes
00:25:41
then you take the trained grapheme based KL-HMM models and then you generate
00:25:46
the states like if we have cat then state one state
00:25:53
two state three for each grapheme like this and
00:25:58
then the sequence of posterior probabilities you run through a phoneme decoder
00:26:05
which is fully connected and that gives the phoneme sequence
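A heavily simplified sketch of this grapheme to phoneme step: instead of a fully connected phoneme decoder, it takes the most probable phone of each state distribution and collapses consecutive repeats. The greedy decoding, the phone labels and all probabilities are assumptions made for illustration; a real decoder would also handle repeated phonemes and durations properly.

```python
def greedy_phoneme_sequence(state_distributions, phone_labels):
    """Pick the top phone per state, then merge consecutive duplicates."""
    sequence = []
    for dist in state_distributions:
        best_index = max(range(len(dist)), key=lambda k: dist[k])
        label = phone_labels[best_index]
        if not sequence or sequence[-1] != label:
            sequence.append(label)
    return sequence

# Hypothetical phone inventory and trained state distributions for "cat",
# with three states per grapheme (c, a, t).
phones = ["k", "ae", "t", "s"]
c_states = [[0.7, 0.1, 0.1, 0.1]] * 3
a_states = [[0.1, 0.8, 0.05, 0.05]] * 3
t_states = [[0.05, 0.05, 0.8, 0.1]] * 3
pronunciation = greedy_phoneme_sequence(c_states + a_states + t_states, phones)
```

Collapsing repeats is the weak point of this shortcut, since two adjacent graphemes that genuinely map to the same phoneme would be merged into one.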
00:26:11
so what we have done mary that we had saying that okay we can do
00:26:18
we can learn grapheme to phoneme relations and that can basically kind of this
00:26:25
my lexical and acoustically so's france thing but the same thing we can also gonna start doing what it's called
00:26:32
what we call across to get everyone's laughing too from which you can do
00:26:38
that uh
00:26:42
now other problems like objective assessment of accentedness
00:26:45
so here i have native speech
00:26:51
and i have non native speech i extract features
00:26:59
and then i pass both the native and the non native speech through the neural network i get
00:27:05
two sequences of posterior vectors then i do dynamic programming and i can get an accentedness score
00:27:14
we can relate these kind of problems even to the learning problems because as i said at each
00:27:20
point you're comparing probability distributions and you can do hypothesis testing
00:27:25
you can say whether some phones are right or wrong those kind of decisions you can make
00:27:32
so we did this on a data set which was coming from EMIME
00:27:39
so there are native english speakers and then three kinds
00:27:44
of non native speakers one was coming from finnish
00:27:49
people one was from mandarin speakers
00:27:53
and then germans what is interesting here though is
00:27:59
this space of latent symbols
00:28:03
so in the case of mandarin what we saw is that
00:28:08
we really don't need
00:28:14
much this space could be just monophones
00:28:20
you are able to get this kind of good correlation between the
00:28:26
subjective accentedness scores and the automatic scores we're getting
00:28:36
if you take finnish we can also get it but for germans we found
00:28:43
that the acoustic units that span the latent symbol space
00:28:48
have to be defined based on context dependent units
00:28:52
if you use monophones context independent it's not going to come up with this nice correlation
00:28:59
so that means when this because uh chaney i i may have to also add that
00:29:05
this the unit space that this this is what i mean to say this uh
00:29:12
so here is the plot i'm going to show
00:29:18
so we had forty four phones as my latent symbol space
00:29:25
for mandarin it was flat
00:29:29
the number of units doesn't matter
00:29:33
from simple to complex context dependent units it doesn't matter much
00:29:39
but for finnish there is some form of improvement
00:29:45
and then for germans you see that basically it really relies on the context dependent space
00:29:54
now this i presented for the accentedness score but
00:29:58
of course you can do the same kind of test for intelligibility
00:30:01
also there's work we have published on that
00:30:05
and also speech codec assessment you can do
00:30:10
you can get a score that correlates with what you're trying to assess
00:30:23
we also looked at objective intelligibility evaluation of synthetic speech so you have synthetic speech
00:30:31
and you have the text that was given to the TTS system to generate the speech
00:30:38
from that you can get a KL-HMM which generates the categorical distributions
00:30:45
from the speech signal we get a sequence of posterior probabilities
00:30:51
and then we define this problem as what we call a recall problem how
00:30:56
many words we can recall from the synthetic speech
00:31:01
this method can be more general than a single human reference so you don't need a human reference there
00:31:08
the KL-HMM here can model pronunciation variability
00:31:12
and we found that there's no need for a matching human recording
00:31:16
so you have a text you get the KL-HMM
00:31:20
and you match these two sequences
00:31:25
and you try to see how many words you can recall the more words you recall
00:31:31
the more intelligible the speech is and the less you recall the less intelligible
00:31:40
so that's what is plotted this was the objective word recall
00:31:45
and this is the subjective word accuracy and we could see a nice
00:31:51
correlation between what we recall
00:31:55
objectively and what subjectively people were trying to
00:32:00
recognise and that was the way we can do this
00:32:10
if you have any questions i can answer later or
00:32:18
if you want a break we can have a break

Conference program

Sequence modelling for speech processing - Part 1
Mathew Magimai Doss, Idiap Research Institute
12 Feb. 2019 · 9:05 a.m.
Sequence modelling for speech processing - Part 2
Mathew Magimai Doss, Idiap Research Institute
12 Feb. 2019 · 9:59 a.m.
Sequence modelling for speech processing - Part 3
Mathew Magimai Doss, Idiap Research Institute
12 Feb. 2019 · 11:10 a.m.