Transcriptions

Note: this content has been automatically generated.
00:00:01
Okay, so let me maybe finish this part, and then we'll take a break directly afterwards, whatever is the best approach. So here is what I will do.
00:00:17
So we come back to the string matching problem. We had these two strings, and we saw that you can do dynamic programming and match the strings; that is what we saw.
00:00:36
Now we can exploit this string matching via a probabilistic interpretation. The interpretation is the following.
00:00:54
The symbols we see here carry no ambiguity. If you know the set of symbols, say English has twenty-six letters plus some other symbols you may want to add, then there is no ambiguity: the posterior probability for a symbol would be like a delta function. No ambiguity means the entropy is zero for all the probability distributions.
00:01:27
Then you go back to the dynamic programming. You keep the same constraints, except you change the local part: you now compare two distributions.
00:01:44
You can compare them in many different ways. One way is to compute the Kullback-Leibler divergence between the two, or you can compute the Bhattacharyya distance; there is a paper we just talked about that covers many different ways to compare two probability distributions. With these delta distributions, the local score is in general zero or one.
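A minimal sketch of these two local scores, assuming the symbols are represented as NumPy probability vectors (the `eps` smoothing constant is an implementation assumption, not from the talk):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # Kullback-Leibler divergence KL(p || q) between two categorical distributions.
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def bhattacharyya_distance(p, q, eps=1e-12):
    # Bhattacharyya distance: -log of the Bhattacharyya coefficient.
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(-np.log(np.sum(np.sqrt(p * q))))

# Zero-entropy ("delta") distributions over a four-symbol alphabet:
a = np.array([1.0, 0.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0, 0.0])
print(bhattacharyya_distance(a, a))  # ~0: same symbol
print(bhattacharyya_distance(a, b))  # large: different symbols
```

With delta distributions the score collapses to the same-or-different decision of plain string matching, which is the point being made here.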
00:02:11
If you formulate this problem carefully, the matching in this dynamic programming is a sequence of hypothesis testing problems; we can show that what you do here is hypothesis testing. You are comparing two distributions and deciding whether they are the same or not; that is the decision you are making. Here it is a hard decision, and that is fine as long as you are absolutely sure about the symbols; then there is no problem.
00:02:58
Now we can also define this whole process of matching a word hypothesis and a speech signal exactly as a comparison of two sequences of probability distributions.
00:03:17
We saw that in the development already; I tried to show that this is a categorical distribution, where you map the word hypothesis onto a sequence of categorical distributions. What we were trying to say is that, instead of estimating the likelihood, what we do here is estimate the posterior probability.
00:03:49
So you no longer compute likelihoods; you estimate posterior probabilities, and there are many estimators. We can come up with Gaussian mixture models or hidden Markov models, where you have to apply Bayes' rule to get the posterior probabilities, or you can use a neural network, which directly estimates the posterior probabilities for you. If you can find any other estimator, that is fine too. All you have to do is give me the sequence of the best posterior estimates; I don't care which estimator you use to get it.
00:04:33
Once you have this, the problem formulates into what is called matching uncertain symbol sequences. Here we are certain about the symbols; here we are uncertain, and we are going to match these uncertain symbols.
00:04:53
Again you can do exactly the same dynamic programming, no difference: the local path constraints are the same ones we were talking about, and the local score is, for example, the Bhattacharyya distance. Again, as I said, it can be any kind of distribution comparison.
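A small sketch of this uncertain-string matching, assuming each sequence is an array of per-frame categorical distributions (rows summing to one); the symmetric local step and the length normalization are common DTW choices, not prescriptions from the talk:

```python
import numpy as np

def bhattacharyya(p, q, eps=1e-12):
    # Local score: Bhattacharyya distance between two categorical distributions.
    return float(-np.log(np.sum(np.sqrt((p + eps) * (q + eps)))))

def match_uncertain_strings(X, Y):
    # Classic DTW over two sequences of posterior distributions.
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = bhattacharyya(X[i - 1], Y[j - 1])
            # symmetric local step: diagonal match, insertion, deletion
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    # normalize by the sequence lengths so scores of different-length
    # pairs are comparable
    return D[n, m] / (n + m)
```

Swapping `bhattacharyya` for a KL-based score changes only the local comparison; the dynamic programming itself is untouched, which is exactly the claim above.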
00:05:22
One thing, if you want to choose: since you are doing hypothesis testing, you always want to use the measure that bounds the error of your detection; based on that you can decide which measure to use. For example, in communication systems it has been shown that the Bhattacharyya distance gives a better bound on the error of your hypothesis test than the KL divergence.
00:06:04
These kinds of results can guide the choice you want to make here. The point is that it is very simple: we didn't change anything compared to a likelihood-based system; we just reversed the problem from likelihoods to posteriors.
00:06:31
Okay, so compared to a likelihood-based system, questions like feature extraction and the selection of the symbol set, like context-dependent phones, all those kinds of things remain pretty much the same; you can borrow them from the likelihood-based approaches. There are no big differences there.
00:07:04
Then there is the relationship between what is called the acoustic unit, the context-dependent phone, and the latent unit. It can be deterministic, which is what I was trying to show here, a deterministic map, or it can be a probabilistic map that you learn; I'll talk about that later. Then your local score compares two posterior probability distributions.
00:07:42
So basically nothing changes, except that you replace frame-level likelihood estimation with posterior estimation: you are classifying phones and estimating the posterior probabilities of all the phones at every frame. At a given time frame, that is effectively what we are doing.
00:08:08
So you are just estimating the posterior probabilities of all the phones at every frame, and that is what you are doing here.
00:08:21
This method doesn't change if I want to replace my model by, say, an instance-based approach. I don't train anything; I just copy this word here. Basically it is no longer a word sequence; it is frames, with a feature vector extracted for each frame, and given that, you are going to estimate posterior probabilities. So this side also becomes uncertain: the distributions are no longer delta distributions.
00:09:11
And you are going to do the same dynamic programming. Basically, whether I represent my word hypothesis by a textual model or by an instance, the method is pretty much the same. When we were dealing with likelihoods and spectral-based approaches, you saw that for the instance-based approach we went through feature extraction and DTW, dynamic time warping as we were calling it; what we do here is only slightly different.
00:09:50
The foundation remains the same, except that the local constraints may be slightly different. The reason is that we are now dealing with a number of states versus a number of frames, so you cannot apply something like an Itakura-type local constraint. If you apply an Itakura-type local constraint, it imposes a restriction that this length N has to be less than or equal to this length M; in the model-based case such a restriction cannot be satisfied, because the model is short and the frame sequence is long. So basically you have to go back to the instance-based approaches and take one of those local constraints.
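To make the constraint issue concrete, here is a hedged sketch of the two kinds of local steps; the exact Itakura predecessor set varies between textbooks, so this is one common variant, not necessarily the one meant in the talk:

```python
import numpy as np

def dtw_cost(local, n, m, slope_limited=False):
    # local(i, j): local score between reference state i and test frame j.
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = local(i - 1, j - 1)
            if slope_limited:
                # Itakura-type step: each test frame consumed advances the
                # reference by 1 or 2, so roughly m <= n <= 2*m is required;
                # impossible when the reference is a short state sequence.
                prev = [D[i - 1, j - 1]]
                if i >= 2:
                    prev.append(D[i - 2, j - 1])
                D[i, j] = c + min(prev)
            else:
                # Unconstrained step: D[i, j - 1] lets one reference state
                # absorb arbitrarily many frames, which a model-based
                # reference (few states, many frames) needs.
                D[i, j] = c + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]
```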
00:10:42
So in this way the methodology unifies model-based and instance-based approaches.
00:10:49
Some interesting properties: it retains some of the nice properties that string matching provides.
00:11:00
First, it is locally discriminative: at every point, every cell, you are trying to say whether two symbols are the same or not; you are locally discriminating two symbols. That is exactly what you are doing here, except that here the symbols are uncertain, they are probability distributions, and you compute something like the Bhattacharyya distance or the Kullback-Leibler divergence, which can all be shown to be performing a hypothesis test that tells you whether the two are the same symbol or not. That is what the local score is doing.
00:11:41
Second, it is globally discriminative, in the sense that at the end of the string matching you can always test the hypothesis of whether the strings match or not; you can set a threshold and accept or reject.
00:11:57
For example, when you are doing text search: the moment it finds an exact string match, it pulls up that document for you. Suppose you typed it wrong; most of the time it is going to find the closest match and start giving you suggestions, "did you mean this word, not that word"; that is pretty much what people do. So you can test the hypothesis; it is globally discriminative.
00:12:37
so here is what the next moment what we did so what we did is we went back to this matter
00:12:45
um
00:12:47
yeah
00:12:50
so you take off
00:12:53
uh that uh this it doubly k. and that's what the speech
00:12:55
signal i gave exactly the same word from different speakers
00:13:02
So I have the same word twice, and I get the score. Of course the path lengths can be different, so you have to normalize.
00:13:16
And I also take pairs where the two are different words, or from different speakers, all kinds of things. So you take a list of pairs and split them into matching and non-matching pairs, and then we plot the scores; this is the distribution of the scores.
00:13:40
The matching pairs go close to zero, and the non-matching pairs end up further away, and you can see that there is hardly any overlap, very little overlap. That means what string matching generally provides is preserved in this kind of method: you are preserving the capability to discriminate.
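A sketch of that experiment's bookkeeping, assuming `scorer` is something like the `match_uncertain_strings` function above and each pair carries a same-word label; the midpoint threshold is a simplification for illustration:

```python
import numpy as np

def score_separation(pairs, scorer):
    # pairs: iterable of (seq_a, seq_b, is_same_word)
    match = np.array([scorer(a, b) for a, b, same in pairs if same])
    nonmatch = np.array([scorer(a, b) for a, b, same in pairs if not same])
    # Matching pairs should pile up near zero, non-matching ones further out;
    # little overlap means a single threshold accepts/rejects reliably.
    threshold = (match.mean() + nonmatch.mean()) / 2.0
    errors = int(np.sum(match > threshold) + np.sum(nonmatch <= threshold))
    return threshold, errors
```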
00:14:13
that automatically so i i'm not going to go for the neural networks are any
00:14:18
thinks i'm simply if if you take any estimated it can start doing this
00:14:25
that's it
00:14:27
This was the word identification we were trying to do, and we could simply translate the methodology to do keyword spotting. Say there is a document and you are looking for a keyword in it; I can directly extend this method, as simple string matching, to get a keyword spotting system, no problem.
00:14:59
Now, what I mentioned earlier: the relationship can be deterministic or probabilistic. Here is what we can do.
00:15:14
We have a sequence of acoustic features and feed it to, I am putting a neural network here, but again, as I said, any posterior probability estimator can be put in its place. Then you get a sequence of probability vectors, and you take them as observations for a hidden Markov model whose states are parameterized by categorical distributions of the same dimension. Then you can basically do what is called Viterbi training, and based on a cost function like the Kullback-Leibler divergence you can estimate these parameters.
00:16:00
First you train the posterior estimator here; then you basically train these categorical distributions. They are no longer going to have entropy zero; the distributions here are not going to be Kronecker delta distributions anymore, they will have some entropy.
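Here is a compact sketch of that training loop for a single utterance and a left-to-right model; the reverse-KL cost direction (and therefore the arithmetic-mean update) is one variant used in KL-HMM-style work, so treat both as assumptions of this sketch:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def viterbi_train(posteriors, n_states, n_iters=10):
    # posteriors: (T, K) frame-level posterior vectors from the estimator.
    # Each state holds a categorical distribution of dimension K.
    # Assumes T >= n_states.
    T, K = posteriors.shape
    bounds = np.linspace(0, T, n_states + 1).astype(int)  # uniform init
    y = np.stack([posteriors[bounds[s]:bounds[s + 1]].mean(0)
                  for s in range(n_states)])
    for _ in range(n_iters):
        # Viterbi step: monotone left-to-right alignment by dynamic
        # programming, local cost KL(z_t || y_s).
        D = np.full((T + 1, n_states + 1), np.inf)
        back = np.zeros((T + 1, n_states + 1), dtype=int)
        D[0, 0] = 0.0
        for t in range(1, T + 1):
            for s in range(1, n_states + 1):
                c = kl(posteriors[t - 1], y[s - 1])
                stay, advance = D[t - 1, s], D[t - 1, s - 1]
                D[t, s] = c + min(stay, advance)
                back[t, s] = s if stay <= advance else s - 1
        # Backtrack, then re-estimate: with this cost direction the optimal
        # categorical distribution per state is the mean of its frames.
        assign, s = np.zeros(T, dtype=int), n_states
        for t in range(T, 0, -1):
            assign[t - 1] = s - 1
            s = back[t, s]
        for st in range(n_states):
            frames = posteriors[assign == st]
            if len(frames):
                y[st] = frames.mean(0)
    return y
```

After training, the state distributions typically have nonzero entropy, exactly as described above.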
00:16:37
And we can work with this, no problem; it has some nice properties. So I gave you the string matching perspective; now I am going to talk about what we can do with this kind of model.
00:16:53
In speech recognition, when you are training an acoustic model, you need transcribed speech data, linguistic expertise in the form of a phonetic lexicon, and computational resources. For many languages I may not have good transcribed speech.
00:17:20
We face this problem with pathological speech too; there we really face it. You also lack linguistic expertise: for languages like Gaelic and others, there is little linguistic expertise available, and for pathological speech we also don't have much linguistic expertise. Whatever linguistic expertise we have comes from normally spoken speech; for pathological speech it is hard to even say where the phones are or which phone you are talking about, so we lack the linguistics for this.
00:18:18
So how can we handle these challenges at the same time, when, for example, I don't have a phonetic dictionary and I have very little transcribed speech? The problem can be addressed in a simple way: you train this neural network on multilingual data.
00:18:40
You take languages, you can pool whatever languages you have, you pool the phone sets, and you train a multilingual neural network; it is a language-independent model. What you need to train is this parameter, which is language-dependent.
00:19:04
If you don't have a lexicon, the phonetic lexicon, then you don't know what the states here, the phones, should be. What we can use instead is grapheme-based lexicons: the graphemes, the letters G, R, A, P, H, E, M, E, so the spelling itself is my dictionary, and I make the states graphemes. Then I take the neural network and train my language-dependent parameters on these graphemes.
00:19:44
Here is an example of what we did. We took languages like English, Swiss French, Swiss German, Italian, and Spanish, and trained this model on them. Then we took Greek as the target language, and with that language's data we are building a speech recognition system for Greek.
00:20:18
So we are going to learn these parameters, and what we did was vary the amount of training data: starting from five minutes, we keep adding more.
00:20:36
We tried many kinds of methods, many kinds of systems, like adaptation-based and grapheme-based approaches, and we were comparing all the systems. What you see is that with limited data the standard likelihood-based state-of-the-art systems are not going to do so well; the standard approaches need a lot of data, up to around one hundred fifty minutes, before they basically start catching up with the models we have.
00:21:18
Then we went one step further and tried to show that even if you have zero minutes of data, you can build a speech recognizer.
00:21:45
All you have to do, and you do need a little knowledge here, is have a map between graphemes and phonemes. If you have that little knowledge, you can define the categorical distributions by hand: if you know the grapheme-to-phoneme relationship, how the multilingual phones are related to the graphemes of that language, then without looking at any acoustic data, without any parameter training, you may be able to do some speech recognition.
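A toy illustration of this zero-data construction; the phone set, graphemes, and probabilities below are invented for the example, not from the talk:

```python
import numpy as np

PHONES = ["k", "s", "tS", "o", "a"]   # hypothetical multilingual phone set
G2P_PRIOR = {                          # hypothetical prior knowledge
    "c": {"k": 0.7, "s": 0.2, "tS": 0.1},
    "o": {"o": 1.0},
}

def state_distribution(grapheme):
    # Build a grapheme state's categorical distribution directly from
    # grapheme-to-phoneme knowledge: no acoustic data, no training.
    y = np.zeros(len(PHONES))
    for phone, prob in G2P_PRIOR.get(grapheme, {}).items():
        y[PHONES.index(phone)] = prob
    if y.sum() == 0.0:
        y[:] = 1.0                     # unknown grapheme: uniform fallback
    return y / y.sum()

print(state_distribution("c"))         # [0.7 0.2 0.1 0.  0. ]
```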
00:22:21
But the idea here still assumes that I have this knowledge, the grapheme-to-phoneme mapping. And people would push back and say: I don't even have that knowledge.
00:22:40
So this is one of the problems; now let me start doing some kind of modelling that addresses it. What happens is that in these parameters here, one dimension is graphemes and the other dimension is phonemes, so if you train the model like that, it is going to learn a grapheme-to-phoneme relation for you.
00:23:11
If you have graphemes as the states and phones at the output of the neural network, you start learning grapheme-to-phoneme relations. Here is one of the first things we were doing: we went from a context-independent grapheme model to context-dependent ones. For example, if you do context-independent modelling and look at the HMM states, we can see what each state is capturing.
00:23:49
If you are given a letter in English and I ask what the possible sounds would be, you need a lot of context to figure out which letter corresponds to which sound; all of that sound information is what we are trying to get into the states.
00:24:07
If I give the context, say the letter C is at the start of the word and followed by a particular letter, then states one, two, three have high probability for the /k/ sound, as in call and go. I am taking those high-probability entries and keeping only the top probabilities in those distributions.
00:24:38
Similarly, if I say C followed by H at the beginning of a word, then the states go to sounds like /tS/, as in church. What is happening, interestingly, is exactly what a grapheme-to-phoneme converter does: it models the context and says which phoneme a grapheme has to be mapped to, and that is exactly what the model is learning, directly from the data.
00:25:20
Once you have these kinds of parameters, you can turn this into grapheme-to-phoneme conversion. You take the orthographic transcription, pass it through the trained grapheme-based hidden Markov models, and generate the state sequence, like state one, state two, state three, and so on, and then the sequence of posterior probabilities, which you run through a phone decoder, fully connected, and then you get the phoneme sequence.
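As a toy readout of what such a conversion does, assuming `state_dists` maps each grapheme to its trained categorical distribution over `phones`; a real system would run the generated posterior sequence through a fully connected phone-loop decoder, so this argmax-and-collapse version is only illustrative:

```python
import numpy as np

def graphemes_to_phonemes(word, state_dists, phones):
    # Pick the most probable phone for each grapheme state, then collapse
    # consecutive repeats the way a decoder merges repeated states.
    seq = [phones[int(np.argmax(state_dists[g]))] for g in word]
    return [p for i, p in enumerate(seq) if i == 0 or p != seq[i - 1]]
```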
00:26:11
So what we have shown is that we can learn grapheme-to-phoneme relations, and that basically covers the lexicon and the acoustics at the same time. In the same way, we can also start doing what we call acoustic data-driven grapheme-to-phoneme conversion.
00:26:42
Now, another problem: objective assessment of accentedness. Here I have native speech and I have non-native speech. I extract features, then I pass the non-native speech through the neural network, I get two sequences of posterior vectors, do the dynamic programming, and I can get an accentedness score.
00:27:14
We can extend this even to pronunciation-learning problems, because as I said, at each point you are comparing probability distributions, and since you can do hypothesis testing, you can say which phones are right or wrong; those are the kinds of decisions you can make.
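A sketch of that per-frame decision, assuming an alignment `path` of (i, j) index pairs from the dynamic programming above; the threshold is a tunable assumption, not a value from the talk:

```python
import numpy as np

def bhattacharyya(p, q, eps=1e-12):
    return float(-np.log(np.sum(np.sqrt((p + eps) * (q + eps)))))

def flag_mismatched_frames(path, native, learner, threshold=0.5):
    # Hypothesis test at every aligned frame pair: a large distance means
    # the two posterior distributions disagree, i.e. a possibly wrong phone.
    return [(i, j) for i, j in path
            if bhattacharyya(native[i], learner[j]) > threshold]
```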
00:27:32
We evaluated this on a data set with native English speakers and several kinds of non-native speakers: one group was French, another Finnish, another Italian, and then German. What was interesting is the space of latent symbols: in the case of Italian, we saw that we really don't need much; this space could be just monophones.
00:28:20
You are able to get a good correlation between the subjective accentedness scores and the automatic scores we are getting.
00:28:36
If you take Finnish, we can also get comparable results. However, for the Germans we found that the acoustic units, the latent symbol space, have to be defined based on context-dependent units; if you use monophones, context-independent units, it is not going to produce this nice correlation. That means the choice of the unit space matters; this is what I mean to say here.
00:29:12
Here is the plot I am going to show. We had forty-four phones as the latent symbol space. For Italian the curve was flat: the number of units doesn't matter, and going from simple units to complex context-dependent types doesn't matter much either. For Finnish there is some improvement, but for German you see that it really relies on the context-dependent space.
00:29:54
I presented this for accentedness scoring, but of course you can do a similar kind of test for intelligibility; there is also work we have published on intelligibility and other assessment tasks, even codec assessment. You can compute exactly this kind of score, and it correlates with whatever you are trying to assess.
00:30:23
We also applied this to objective evaluation of speech synthesis: you have synthetic speech, and you have the text that was rendered by the TTS system to produce that speech. From the text you can generate the categorical distributions, and from the speech signal we get a sequence of posterior probabilities.
00:30:51
Then we defined the problem as: how many words can we recall from the synthetic speech? This method can be more general than a single human listener, so you don't need human listeners; the model here can also capture pronunciation variability, and we found there is no need for matching human transcriptions.
00:31:16
So you have the text, you match these two sequences of posteriors, and you try to see how many words you can recall: the more words you recall, the more intelligible the speech is; the fewer you recall, the less intelligible.
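The recall measure itself is simple; a sketch, where `recalled_words` would come out of the posterior-sequence matching described above:

```python
def word_recall(reference_words, recalled_words):
    # Fraction of the reference words recovered: higher recall is read as
    # more intelligible synthetic speech.
    hits = sum(1 for w in reference_words if w in set(recalled_words))
    return hits / len(reference_words)
```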
00:31:40
So this axis is the word recall, and this is the subjective word accuracy, and we could see a nice correlation between what we call the objective measure and the subjective one, where people were trying to recognize the speech. That is the way we can do this.
00:32:10
If you have any questions I can answer them later, or if you want we can take a break.

Conference Program

Sequence modelling for speech processing - Part 1
Mathew Magimai Doss, Idiap Research Institute
Feb. 12, 2019 · 9:05 a.m.
Sequence modelling for speech processing - Part 2
Mathew Magimai Doss, Idiap Research Institute
Feb. 12, 2019 · 9:59 a.m.
Sequence modelling for speech processing - Part 3
Mathew Magimai Doss, Idiap Research Institute
Feb. 12, 2019 · 11:10 a.m.
