
Transcriptions

Note: this content has been automatically generated.
00:00:00
Anyway, I hope everybody is ready. Okay. So,
00:00:10
this talk is about three hours, so you will have to
00:00:14
bear with me. What I have tried to do
00:00:19
is make it more engaging and interesting,
00:00:23
and that's why it took me some time to prepare the slides for this presentation.
00:00:29
So I'm going to talk about sequence matching for speech processing.
00:00:35
Okay. So the basic thing is:
00:00:45
first, why. Why are we going to do this?
00:00:50
Of course some of you may already know, but still I would like to present why we want to do it.
00:01:00
Then I'm going to talk a lot about the how.
00:01:05
And the problem is
00:01:08
that there are many things in the literature like this.
00:01:13
So what I tried to do a few years back, about four or five years back: I started writing a simple document,
00:01:21
asking myself: what is the main problem
00:01:26
in sequence modelling in speech processing? What are the main,
00:01:32
the key problems? And I was trying to explain to myself how others have addressed them,
00:01:39
how things have evolved,
00:01:42
and to build on that. I want to talk mostly about the key ideas;
00:01:48
this will differ from most of your textbooks or even journal papers, because
00:01:57
what I'm also going to do is not go deep into the technical aspects.
00:02:04
It's not that I don't like them, but there's not enough time to talk about it all in detail.
00:02:09
So what I have tried to do is this:
00:02:14
for almost every topic, I have provided a suggested reading.
00:02:22
Then again, it is difficult to cover all of it in three hours,
00:02:26
so you will have to bear with me for two hours, or two and a half hours,
00:02:32
and then we will have a hands-on session, a MATLAB exercise.
00:02:38
It's a hands-on exercise on HMM programming which I have prepared,
00:02:44
which you can do by yourselves; you don't have to do it today, I think.
00:02:48
Try it yourself and see if you can understand what I've talked about today.
00:02:55
If you do the MATLAB exercise: it's a good
00:03:01
exercise we made mainly for master students and
00:03:05
for people who don't really want to know every detail of how
00:03:11
to implement their own hidden Markov models,
00:03:16
but who want to visualise and understand the basic things, so that you
00:03:23
can actually see some of the technical aspects. The last session will also get into that.
00:03:29
So: the why, and the for-what.
00:03:34
So I asked myself.
00:03:37
Speech is the most common,
00:03:42
is the most common mode of communication, so that,
00:03:47
even if you are using a laptop and typing today, people will eventually just speak to it.
00:03:53
Think of children:
00:03:59
they are going to acquire reading and writing capabilities at some point, and
00:04:03
all those things; until then, they're going to speak to you.
00:04:08
And then, what happens is, the speech signal carries multiple kinds of information: we have the message,
00:04:18
the speaker identity, the personality, the emotional state, the accent,
00:04:25
many things we can ask about when we look at a speech signal.
00:04:31
And problems in speech processing typically focus on one
00:04:37
or another of all these problems.
00:04:41
But what's most important in speech communication is
00:04:46
the message.
00:04:48
The message relates to an ability which we humans have
00:04:56
got, which is language: how to express our thoughts.
00:05:04
If you lose that ability to pass on the message,
00:05:09
I believe the utility of speech as a form of communication is pretty much lost.
00:05:16
So if you take the message component:
00:05:20
as you have probably all seen, there is a production process;
00:05:24
the production process, the speech signal, and the message.
00:05:30
What we understand is that the message can be heard as a sequence of phones or syllables,
00:05:37
depending on the language, or as a sequence of words, phrases, and so on,
00:05:46
and you can go on building on this. I'm not
00:05:51
going to go to the complete picture where you derive the whole language from this, so:
00:05:58
so
00:06:00
Speech technologies: for example, speech recognition systems or keyword detection, where
00:06:07
you have speech in the spoken mode and you want to convert it to a written representation.
00:06:14
Similarly, text-to-speech: in this case, you have the written mode of
00:06:18
communication and you want to go to the spoken mode of communication.
00:06:24
yeah
00:06:25
Then there are a million things in assessment you can ask about:
00:06:31
is the speech intelligible,
00:06:35
what is the proficiency of the learner,
00:06:39
and so on; there are many problems in assessment.
00:06:43
Well, those
00:06:45
kinds of questions you can ask are all related to the message, to what is being delivered.
00:06:56
So I asked myself, on this: what is the most important question we face?
00:07:06
So, to me, the one question which becomes important is:
00:07:12
how are we going to match a spoken signal with a word hypothesis?
00:07:17
And note that it is not just any hypothesis:
00:07:22
okay, there are many possible hypotheses, but I want to match
00:07:26
the signal, with the word hypothesis, with this
00:07:30
one hypothesis. So here I'm illustrating what happens in a speech recognition system.
00:07:37
A word has been spoken:
00:07:40
what you do is you try to match it against many possible hypotheses,
00:07:45
and then you just select the best matching hypothesis.
00:07:48
That's exactly what speech recognition systems try to do.
00:07:53
So the first question I ask is how we are going to match them.
00:07:58
If you are going to ask the question of intelligibility assessment or proficiency assessment, again this question becomes relevant.
00:08:05
If you are working on speaker recognition, if you build a
00:08:08
text-dependent speaker verification system, again this question is relevant.
00:08:14
Yeah.
00:08:15
So,
00:08:18
so I went and tried to look at all the literature,
00:08:25
and to say: okay, what exactly have people done until now?
00:08:30
So it looks like this;
00:08:34
it more or less melts down to this: you have a signal and you have word hypotheses.
00:08:42
You map both of these to a shared latent symbol space,
00:08:48
a space of shared latent symbols. We'll see that sometimes this is easy to
00:08:52
define and sometimes it's not easy; it is still an open research problem,
00:08:58
we'll see that. And then you match the resulting two latent symbol sequences, A and B.
00:09:06
Simple, easy, right?
00:09:10
How many of you have done string matching?
00:09:16
String matching:
00:09:19
most of you have done string matching, exactly, as you show.
00:09:23
Simple, right? But the problem is that it gets complicated
00:09:27
the moment you start asking many other questions. So here I pose four questions for us.
00:09:36
What is the set of latent symbols? That's the first question.
00:09:41
Then: how do we map the signal to the latent symbol sequence B, here the data?
00:09:50
How do we map the word hypothesis to the latent symbol sequence A?
00:09:56
And how are we going to match the two latent sequences, A and B?
00:10:03
Now, when I was looking at many, many, many papers,
00:10:08
it's a matter of, as we'll see later,
00:10:11
how they differ in how they answer these four helpful questions.
00:10:18
The many different methods differ along those lines;
00:10:21
essentially, you just change the way you answer and link together these four questions.
00:10:29
So I'm going to link the methods that I have come across to these questions.
00:10:36
You'll see that I use this colour coding quite often, in the places where
00:10:42
I want to show which part we're talking about; that's why
00:10:46
I put things in green, red, blue, and so on.
00:10:52
So the first question is string matching; forget about speech for now.
00:10:59
Say we have a string, A, B, C, C, D, and another one, say A, B, C, D,
00:11:05
and we want to match the two strings: we want to find how we can transform this string into that one,
00:11:14
or go from one to the other, with minimum cost.
00:11:21
And if you have implemented it once, you know it's simple dynamic programming.
00:11:29
So it's simple dynamic programming,
00:11:33
which is built on the idea that Bellman put forward: optimal policies are composed of optimal sub-policies.
00:11:40
So if you solve the local problems optimally, then you can optimise the global problem.
00:11:50
So what you ask is, for each character pair, you see here, you compute a local
00:11:58
distance: if the two are the same you put zero, if they're different you put one.
00:12:09
And then you apply some local constraints, and then you can get the solution
00:12:19
out of it. The local path constraints, for example: at
00:12:24
any point here, you can come from
00:12:28
the left, from below, or diagonally.
00:12:33
You apply these constraints, and when you do that, you will get
00:12:39
the string matching; and if you backtrack the path, to see how
00:12:44
this whole process evolved, at the end you get whether characters were substituted,
00:12:49
inserted, deleted, and all these kinds of things.
00:12:56
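The zero/one local distance and the local path moves described above are exactly the classic edit-distance recursion; here is a minimal sketch (my own illustration, not code from the lecture materials):

```python
def edit_distance(a, b):
    """Minimum edit distance via dynamic programming (Bellman's principle).

    Local distance: 0 if the symbols match, 1 if they differ;
    insertions and deletions (the other two local moves) also cost 1.
    """
    m, n = len(a), len(b)
    # D[i][j] = cheapest way to transform a[:i] into b[:j]
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i                    # only deletions possible
    for j in range(1, n + 1):
        D[0][j] = j                    # only insertions possible
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + sub,  # diagonal: match/substitute
                          D[i - 1][j] + 1,        # vertical: delete
                          D[i][j - 1] + 1)        # horizontal: insert
    return D[m][n]
```

Backtracking through `D` (keeping the argmin at each cell) recovers which characters were substituted, inserted, or deleted, as described above.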
So one of the first approaches was pretty much built on this idea.
00:13:02
They used phones as the linguistic knowledge: the notion would be that phones are the symbols.
00:13:11
And they said: okay, we have a word hypothesis; we can
00:13:14
apply linguistic knowledge and convert it into a sequence of
00:13:18
phones. Then we have speech, and a linguist, who can be assumed to understand almost everything about speech,
00:13:25
will segment the speech and label it into phones,
00:13:32
with no training,
00:13:35
and then we'll match these strings of phone sequences.
00:13:41
That's one of the very first approaches that came, and you can read about it in
00:13:48
this paper here. But what are the limitations? It relies overly on knowledge,
00:13:55
and, as was later famously said:
00:13:58
the moment we fired
00:14:02
a linguist, the performance of the speech recognition system improved.
00:14:07
The other limitation is that you make an early decision
00:14:13
and you cannot recover from it, like segmentation and
00:14:16
labelling errors: you cannot recover from those. If you made a mistake, you made a mistake; there's no chance of getting back from that.
00:14:24
So this was the first approach.
00:14:28
So, after that,
00:14:31
people started looking at data-driven
00:14:35
methods, or machine-learning-based methods.
00:14:42
In that case, I'm going to distinguish two cases, based on how your
00:14:47
word hypothesis is going to be represented.
00:14:50
It can be a simple recording: say I have recorded all the words in the vocabulary myself;
00:14:58
that is what happens in the template-based approaches, where the word units are
00:15:04
recordings. Or, the word hypothesis is given in the text mode.
00:15:08
So there are two ways of doing it. In the first case we'll
00:15:16
see the classical template-based method; the other is the likelihood-based method.
00:15:23
There's also another method beyond those; it came probably
00:15:29
almost thirty years back, but it didn't take off until now;
00:15:35
we could never quite get around its problems.
00:15:41
Now,
00:15:43
the template-based approach. Here I'm giving a couple of papers,
00:15:49
the famous ones people point to,
00:15:52
but if you start digging, you'll find other people who started working
00:15:57
on this problem too. The basic idea is coming from the idea that the
00:16:02
speech signal can be decomposed into source and system;
00:16:07
that's the source-filter model, that was the basic idea behind it.
00:16:13
And the thought was that the resonances in the vocal tract
00:16:20
depend mostly on the sound being produced, on the shape of the vocal tract.
00:16:26
So people started reasoning: well, if we start parameterising that,
00:16:31
and we compare these parameters, we should be able to
00:16:35
do speech recognition, or match speech that way.
00:16:42
So that's where all the thinking about this kind of model started, and there are many methods:
00:16:49
linear prediction, which was talked about in the production lecture, or cepstral analysis.
00:16:54
There are many ways to go: time-domain analysis, as in the
00:16:57
production lecture, or, if you do cepstral-domain analysis, you get cepstra.
00:17:05
So in that case, the basic idea was: speech can be decomposed,
00:17:11
and the vocal-tract information differs for different sounds; that's the assumption here.
00:17:16
So, by extracting the vocal-tract system information, and integrating knowledge from speech perception,
00:17:22
what we do is a feature-based representation and comparison.
00:17:27
That was the basic idea. So you have a speech signal, a reference and a hypothesis;
00:17:33
you do short-time processing, that is, windowing of the signal, to extract
00:17:38
a sequence of features from it; of course, the sequences are not going to be of the same length.
00:17:43
And then what you do is dynamic programming,
00:17:50
and this is all the same as before, except that you're going to have
00:17:55
a local cost which has to be built on the features you use. So, for example, if
00:18:01
you use cepstra, you can use a Euclidean distance or a Mahalanobis distance;
00:18:06
if you use linear prediction coefficients, you use the Itakura distance; for spectral information, you use
00:18:12
the Itakura-Saito distance. So the main thing I'm trying to say is that
00:18:17
the local score you choose has to be built on the feature space you're working in.
00:18:24
Only with the right local score
00:18:29
will the problem we're solving be optimal for you;
00:18:34
if you choose the wrong local score, you will always be suboptimal.
00:18:40
So don't blindly say 'I'll apply Euclidean distance in this method'; no, it's not right,
00:18:47
it's not always the case. You should know which feature
00:18:51
you're using. Second thing: the local path constraints can change,
00:18:57
and if you read the Sakoe and Chiba paper, they studied many different local path constraints in it.
00:19:06
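The template-matching recipe just described, the same dynamic programming but with a feature-space distance as the local score, is dynamic time warping. Here is a minimal sketch on toy one-dimensional "feature" sequences, using a plain absolute-difference local distance (as the lecture stresses, a real system should pick the local score to match the feature type, e.g. the Itakura distance for LPC features):

```python
import math

def dtw(ref, test):
    """Align two feature sequences of different lengths with DTW.

    Local score: absolute difference between frames (scalars here).
    Local path constraints: horizontal, vertical, or diagonal steps.
    """
    m, n = len(ref), len(test)
    INF = math.inf
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(ref[i - 1] - test[j - 1])   # local distance
            D[i][j] = cost + min(D[i - 1][j],      # vertical
                                 D[i][j - 1],      # horizontal
                                 D[i - 1][j - 1])  # diagonal
    return D[m][n]
```

With real features each frame would be a vector and `cost` the chosen feature-space distance; the recursion itself is unchanged.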
So, I'm not going to talk much about that study; you can
00:19:11
go and read it. Okay, so if I start asking myself the four questions again:
00:19:18
in this case, the short-time feature vectors of the speech are my latent symbols.
00:19:24
The set of latent symbols is, for me, ill-defined,
00:19:29
because there is no unique feature-vector representation for speech, as you will really see.
00:19:36
First of all, this set is unbounded;
00:19:39
that means, well, we'll come to the point where the variability problem comes in.
00:19:45
How do you get these? Short-time processing, which is feature extraction,
00:19:50
and dynamic programming with appropriate local scores and path constraints.
00:19:55
So questions one and two actually become the biggest problems for this
00:20:00
approach, because of variability. For example, here is the linear prediction
00:20:06
spectrum of a sustained 'aa' from a female, a male, and a child speaker,
00:20:14
and you can see that they are considerably different. So if I take one as the template and compare against another,
00:20:21
the system is not going to work so well. So if it's a male speaker, it's fine, but for a
00:20:27
female, if I take the template from a male, it may not work, for example.
00:20:34
Even within a speaker there is variability:
00:20:39
here is a case where a female produced the same sustained 'aa' three times, and you can see the differences.
00:20:48
Now, because of this variability, you don't have a unique feature
00:20:52
representation, so you don't have a closed set of symbols.
00:20:56
So what do you end up doing? You go and collect many, many, many examples,
00:21:04
and there's no limit to that. So,
00:21:08
limitations: first, it works well mainly in clean and controlled conditions.
00:21:13
In the late nineties, there was a name-dialling system on the phone:
00:21:17
you would just say a name three times, so you
00:21:21
would have three templates, usually, for each name,
00:21:25
and then you could do name dialling. That was the
00:21:31
late nineties.
00:21:33
Then, generalising to new speakers and conditions is a highly challenging problem.
00:21:38
Each template typically represents a word unit, so every new word needs a new template.
00:21:44
Think of the amount of CPU and memory required: for example, in the late nineties there was
00:21:50
not much memory on a cell phone, so they would give you a maximum of ten names to store, like that.
00:21:57
So you had ten names in total for the whole name dialling. So
00:22:02
what is the one advantage of this whole method? There's really no training needed:
00:22:08
no training at all, you just process the speech to extract features
00:22:14
and compare. So many people have been on a holy grail:
00:22:20
find the short-time speech representation that
00:22:23
captures linguistic-unit-related information and
00:22:26
is robust. The wonder of it is that, if you can find it, your problem is pretty much solved.
00:22:33
Many people have tried, and still
00:22:38
there is no single feature that can do it.
00:22:43
So that leads us, basically, to what is called the model-based approach, where
00:22:48
we start asking how to incorporate
00:22:55
data and knowledge, how we can use learning methods inside this.
00:23:06
So, the statistical formulation: matching
00:23:11
a speech signal against a word hypothesis
00:23:17
can be formulated as estimating a posterior probability
00:23:23
of the word sequence given the speech signal.
00:23:27
Of course, most of the time this is going to be a difficult problem to solve,
00:23:32
because it's very difficult to estimate it directly, so people go to the other side with the Bayes rule:
00:23:39
they apply the likelihood of the data given the words, multiplied by the prior of the word sequence.
00:23:46
Now, based on this, we can define two approaches: one which just focuses on this likelihood quantity,
00:23:55
and the other which focuses on the posterior.
00:24:01
Those are the two statistical quantities, and in both cases we'll see that we are
00:24:07
again going to come back to the same four questions about the mapping.
00:24:14
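The likelihood-times-prior decomposition just described is the standard noisy-channel formulation; written out, with $X$ the acoustic feature sequence and $W$ a word sequence:

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{p(X \mid W)\, P(W)}{p(X)}
        = \arg\max_{W} \underbrace{p(X \mid W)}_{\text{acoustic model}}\,
                        \underbrace{P(W)}_{\text{language model}}
```

where the evidence $p(X)$ can be dropped from the maximisation because it does not depend on $W$.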
So,
00:24:17
well, before I go further, I'll try to recap things a
00:24:22
little. So we, as humans, produce the speech signal;
00:24:25
it goes through short-time processing, windowing, and for each window you
00:24:28
extract the features, the standard PLPs or
00:24:32
MFCCs or filter-bank energies; we have a number of features like that.
00:24:38
So that's one side. And then, on the other side,
00:24:47
what happens to the word hypothesis is this:
00:24:50
you have the sequence of words; you pass it through a dictionary, prior knowledge you have,
00:24:57
and based on that, you convert it into a sequence of phones, like, for 'the cat',
00:25:04
'dh' and 'ax' and so on; then there's a short-pause model to distinguish between
00:25:08
word endings and word beginnings. So then you have this.
00:25:13
So we have these two processing chains;
00:25:17
given this processing, how do we go about building the statistical models?
00:25:24
So the likelihood-based approach: it splits into two parts.
00:25:29
One part is what is called the acoustic model,
00:25:32
the likelihood of the feature vectors given the word hypothesis; and then there's the prior
00:25:42
probability of the word hypothesis itself.
00:25:48
So here, for the first part, we are going to use what are called hidden Markov models, and the second quantity
00:25:55
is estimated using a discrete Markov model, which we also call the language model.
00:26:04
So, discrete Markov models are simple in the sense that they
00:26:09
have a set of states, and what you have is observations,
00:26:16
and there's no distinction between what you observe and the state: if you observe
00:26:24
the observation, you also know the state.
00:26:29
So I can ask questions with this kind of thing. Here is an example: a fully connected
00:26:33
discrete Markov model with three states: cloudy, rainy, sunny.
00:26:39
So I can ask: okay, today is cloudy, tomorrow is sunny,
00:26:45
what will the next day be? I can ask that question.
00:26:50
I can ask: will it be cloudy, cloudy, cloudy, cloudy for many days? What
00:26:56
are the chances of cloudy being followed by cloudy for many days?
00:27:03
What are the chances of cloudy being followed by rainy?
00:27:09
It can be like that. So I can pose these kinds of questions, and what you do is you use the Markov model.
00:27:17
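The weather questions above reduce to multiplying transition probabilities along the state sequence; a small sketch (the probabilities below are invented for illustration):

```python
# Hypothetical transition probabilities P(next | current); each row sums to 1.
TRANS = {
    "cloudy": {"cloudy": 0.5, "rainy": 0.3, "sunny": 0.2},
    "rainy":  {"cloudy": 0.4, "rainy": 0.4, "sunny": 0.2},
    "sunny":  {"cloudy": 0.3, "rainy": 0.1, "sunny": 0.6},
}

def sequence_probability(states):
    """P(s2, ..., sn | s1): product of first-order transition probabilities."""
    p = 1.0
    for cur, nxt in zip(states, states[1:]):
        p *= TRANS[cur][nxt]
    return p
```

For example, "cloudy for three days in a row" is just two cloudy-to-cloudy transitions multiplied together.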
And in this discrete Markov model,
00:27:21
what do you have? What you have, in
00:27:25
this discrete Markov model, is a set of states; in the case of our processing,
00:27:31
the states are the words from your vocabulary,
00:27:36
and each of those words I was showing here,
00:27:44
the word hypotheses here, is an element of this set.
00:27:54
And then we can estimate this quantity by
00:27:59
applying a simple expansion,
00:28:04
a chain-rule-like expansion, so that it becomes a multiplication
00:28:10
of word probabilities:
00:28:15
the probability of the last word, conditioned on the
00:28:21
previous words, and so on, down to the first word; like this, you can keep multiplying
00:28:27
these probabilities, and then you can estimate this quantity.
00:28:33
Now, the problem is that this history is of arbitrary length for you,
00:28:41
so it would lead to a tough estimation problem. So
00:28:44
you're going to make some kind of assumption that limits the history.
00:28:49
So you put a fixed-length history, and you say:
00:28:54
okay, word i will only depend on the previous
00:28:58
word, and then it becomes what is called the bigram
00:29:02
representation; similarly, if you put two previous words, it becomes a trigram, and so on.
00:29:08
So the question is now: how am I going to estimate this quantity?
00:29:13
For that, we need a lot of text. Since it can be any text,
00:29:19
whatever material you can find, and text material is abundant,
00:29:25
one method is to do simple counting: you count how many times a word pair appears,
00:29:32
and you normalise it to get the conditional probability.
00:29:38
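The count-and-normalise estimate just described can be sketched as follows (toy corpus; maximum-likelihood bigrams, no smoothing yet):

```python
from collections import Counter

def bigram_probs(corpus):
    """Maximum-likelihood bigram estimates: P(w2 | w1) = C(w1 w2) / C(w1 .)."""
    pair_counts = Counter()      # C(w1, w2)
    history_counts = Counter()   # times w1 occurs as a history (sum over w2)
    for sentence in corpus:
        words = sentence.split()
        for w1, w2 in zip(words, words[1:]):
            pair_counts[(w1, w2)] += 1
            history_counts[w1] += 1
    # Normalise each pair count by its history count.
    return {pair: c / history_counts[pair[0]]
            for pair, c in pair_counts.items()}
```

Any pair absent from the corpus simply gets no entry here, which is exactly the zero-probability problem that the discounting and back-off methods discussed next address.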
Now, the chances are,
00:29:43
the chances are that a lot of word pairs will simply not appear in your corpus,
00:29:48
so you will not know whether they exist in the language and you just didn't see them, or whether they don't exist at all.
00:29:54
If you give them zero probability, you can have problems with that.
00:29:59
So what you do is you take a small probability mass,
00:30:04
you take this probability mass mostly
00:30:08
out of the seen events, and you redistribute it; so it's a kind of discounting.
00:30:14
If you haven't seen a particular bigram, you
00:30:19
back off, for example from the bigram you go to the unigram,
00:30:24
or you interpolate the different orders, like, for example, for a
00:30:30
bigram whose history I don't know well, I'll interpolate the bigram with the unigram.
00:30:37
So there are many different methods, and today there are also methods based on recurrent neural networks and so on.
00:30:46
And so this is how we can estimate this quantity.
00:30:54
Sorry. So now, what we discussed were discrete Markov models; how do they differ from
00:31:01
hidden Markov models? Here's the simplest example: imagine a coin-tossing game,
00:31:08
and only the result of each trial, whether it is heads or tails, is revealed to you.
00:31:15
So: heads, heads, tails, tails, heads, heads, heads, heads, tails, tails.
00:31:22
Now, if you have a one-coin model, this is like a discrete Markov model with two states,
00:31:29
where heads is one state and tails is the other state, and
00:31:33
simple transition probabilities, say 0.5 each, inside.
00:31:39
Now, suppose someone else is tossing: they have two coins
00:31:44
hidden behind a curtain, and you only observe heads, heads, tails, tails, tails, like this.
00:31:50
This becomes a hidden Markov model with two states, where coin one and coin
00:31:54
two become the states. Each state is associated with an emission distribution,
00:32:00
like: what is the probability that coin one will produce a head, or that coin two will produce a tail?
00:32:07
And the transition probabilities reflect the probability
00:32:11
of choosing one coin or the other; and the coins can possibly be biased.
00:32:16
So it's a simple illustration, the one from the Rabiner tutorial.
00:32:22
So what do we do in speech? There are two ways to estimate this
00:32:28
quantity, the acoustic likelihood. One is what's called full likelihood estimation.
00:32:35
I'm going to skip fast from here to here, but it's simple to derive: with
00:32:41
the i.i.d. assumption, independent and identically distributed, each vector
00:32:47
depends only on the current state,
00:32:51
and each state depends only on the previous state, so those are the transitions,
00:32:57
and you can look at all possible paths in the hidden Markov model.
00:33:04
So if you look at it, whatever you have,
00:33:13
you take a hidden Markov model based on this, and you ask what the different possible paths are.
00:33:22
I can go from one state to another, and I can look at the paths given the observations, of course.
00:33:31
So you ask this question,
00:33:34
or you can solve it with an approximate solution, with what is called the Viterbi
00:33:40
approximation, the best path only. So instead of looking at all possible
00:33:47
paths, you look for the path with the maximum likelihood.
00:33:57
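The best-path idea can be sketched with a textbook Viterbi decoder on a tiny discrete HMM; the two-coin-style parameters in the test below are invented for illustration. Replacing the max over predecessors by a sum would give the forward algorithm's full likelihood instead:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Best state path and its probability.

    Same recursion as the forward algorithm, but taking the max over
    predecessor states instead of summing over all paths.
    """
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][r] * trans_p[r][s] * emit_p[s][obs[t]], r)
                for r in states)
            V[t][s] = prob
            back[t][s] = prev
    # Backtrack from the best final state.
    prob, last = max((V[-1][s], s) for s in states)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), prob
```

A real recogniser would work with log-probabilities to avoid underflow, which also turns the products into the sums discussed later in the lecture.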
So now,
00:34:01
in this, we can do a little bit more expansion of this quantity here.
00:34:09
What we have here is what is called the emission likelihood, and there's a transition probability.
00:34:15
The emission likelihood we can further factorise,
00:34:20
marginalise it via a latent variable,
00:34:25
and we can go a little further: if you do an expansion here, and assume
00:34:32
that the vector x_t is independent of the state given
00:34:37
the latent variable, then you will end up with this approximation here.
00:34:43
Now, simply, if we apply the logarithm to that function here, the product will become a sum;
00:34:52
then you have the log of this emission term plus the log of the transition probability.
00:34:59
Now, this whole process you can depict in
00:35:05
this fashion here. We have a word hypothesis, and we saw
00:35:11
how we can convert it to a sequence of phones or states; these are HMM states,
00:35:18
and there is a categorical distribution here for each state.
00:35:24
And then we have the input speech, from which you extract the features and then compute the
00:35:31
likelihoods, and based on these likelihoods you're going to do the alignment.
00:35:37
You can do it, you can try it yourself; there's no magic here.
00:35:45
For example, this whole thing I have implemented, and I'm just illustrating, in this plot,
00:35:52
what exactly the match between the word hypothesis and the speech signal yields inside.
00:36:06
In that match, what happens is this:
00:36:09
you have the initial condition, as in any
00:36:14
dynamic programming; there's a cost, and that cost
00:36:18
here is what we call the emission likelihood,
00:36:22
which comes out as a kind of dot product between the likelihood vector
00:36:28
and the categorical distribution, and then you take the logarithm. And then,
00:36:34
at each time step, you have state transitions:
00:36:40
at the next time step, you can remain in the same state, or you can come from the previous state.
00:36:48
So this is the usual model you have, and if it's a left-to-
00:36:53
right model with no state skipping, this is what your local constraint is,
00:36:57
except that each of these arcs is going to
00:37:02
have a cost based on the transition probabilities.
00:37:08
So if you look here, you see that suddenly this path is more appealing for you, and
00:37:13
that's because of the cost on that arc. So, having said that:
00:37:21
so, the four questions. I go back to the four questions. What are my latent symbols? Typically,
00:37:29
in the case of HMM-based speech recognition systems, the states of the context-dependent subword units act as my latent symbols.
00:37:37
There are many possible emission likelihood estimation methods; I'm just listing some,
00:37:43
a few of them: Gaussian mixture models, artificial neural networks,
00:37:49
and you can also go for vector quantisation;
00:37:54
you can use any density estimation that works for you.
00:38:00
Then, the estimation of the categorical distribution: typically, in HMM-based speech recognition
00:38:05
systems, there is a one-to-one mapping between the context-dependent subword unit and
00:38:12
the HMM state, so what it yields is a deterministic distribution, and that is why you see that I was putting
00:38:20
ones and zeros like that; it's just not a soft distribution.
00:38:26
And the local score is based on the dot product between the likelihood vector and the categorical
00:38:33
distribution with respect to each state, and the path constraints are based on the HMM topology.
00:38:40
So if you define a different topology, you will have different path constraints.
00:38:46
This is exactly what the emission likelihood estimation will give you inside.
00:38:53
Essentially, we have not really moved away from the four questions; we still have this problem of finding
00:38:59
what the symbol set is, and people don't pay much attention to that.
00:39:04
So, for example, if we take the case of yesterday's talk,
00:39:08
where they
00:39:12
were talking about atypical speech: if your symbol space was defined, as yesterday,
00:39:16
using typical speech, that space is going to change.
00:39:22
This is not the same space for a normal,
00:39:26
healthy adult speaker as it is for, say, a child or an impaired speaker.
00:39:32
So finding the symbol set A, I believe, is the most important problem.
00:39:40
It
00:39:41
cannot be fully defined, it cannot be defined by prior knowledge alone, because our prior knowledge
00:39:46
is very limited; we have very limited prior knowledge,
00:39:51
and I think there is still quite a lot of room for us to make progress on this.
00:40:00
so it's
00:40:02
so how i mean how am i going to start training the emission likelihood estimated
00:40:10
basically i'm going to go to an expectation mathematician about it and so in
00:40:14
this given word level transfers be then like they can initialise the models
00:40:21
the expectations that you estimate what call state occupancy poverty that the probably d.
00:40:28
that
00:40:30
state and he's um the probably be in state and
00:40:35
at a stay so they probably uh of being
00:40:41
in in the in this phone here l. k. at
00:40:46
a time and given the whole observation sequence
00:40:50
And this you can compute in two ways. One is with the forward-backward
00:40:55
algorithm; in that case these will be soft probabilities, which means
00:41:00
each frame will belong to all the states
00:41:05
with some probability; it will not be a hard, one-to-one assignment.
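To make the soft case concrete, here is a minimal sketch (my own illustration, not code from the lecture; the array shapes and names are my choices) of how the forward-backward recursions yield the state occupancy probabilities gamma[t, i] = P(q_t = i | X):

```python
import numpy as np

def state_occupancies(log_b, log_A, log_pi):
    """Soft state-occupancy probabilities gamma[t, i] = P(q_t = i | X)
    via the forward-backward algorithm, in the log domain for stability.

    log_b  : (T, N) frame log-likelihoods log p(x_t | state i)
    log_A  : (N, N) log transition probabilities log P(j | i)
    log_pi : (N,)   log initial state probabilities
    """
    T, N = log_b.shape
    log_alpha = np.empty((T, N))
    log_beta = np.zeros((T, N))

    # forward pass: alpha_t(j) = b_t(j) * sum_i alpha_{t-1}(i) A[i, j]
    log_alpha[0] = log_pi + log_b[0]
    for t in range(1, T):
        log_alpha[t] = log_b[t] + np.logaddexp.reduce(
            log_alpha[t - 1][:, None] + log_A, axis=0)

    # backward pass: beta_t(i) = sum_j A[i, j] b_{t+1}(j) beta_{t+1}(j)
    for t in range(T - 2, -1, -1):
        log_beta[t] = np.logaddexp.reduce(
            log_A + log_b[t + 1] + log_beta[t + 1], axis=1)

    # normalise: every frame gets a soft distribution over all states
    log_gamma = log_alpha + log_beta
    log_gamma -= np.logaddexp.reduce(log_gamma, axis=1)[:, None]
    return np.exp(log_gamma)
```

Each row of the returned matrix sums to one: every frame belongs to every state with some probability, which is exactly the soft assignment described above.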
00:41:11
So
00:41:13
the other way is the case where you align, for example with Viterbi:
00:41:19
here you get a hard alignment, so you know which vector belongs to
00:41:26
which state; you have that information, and given that information
00:41:33
you can go and re-estimate the
00:41:39
parameters of the HMM, in this case the means and variances,
00:41:43
or whatever your parameters are; and then you repeat these E and M steps until convergence.
00:41:50
That's it; there is no simpler way of putting it than that. So, to illustrate Viterbi
00:41:58
training: here is my acoustic observation sequence, and here is my sequence of states.
00:42:05
So the first thing we do is what is called a flat start, which divides the data uniformly: basically
00:42:13
we divide the sequence equally and associate each segment with a state.
00:42:19
Then you estimate: if you have a GMM, you estimate
00:42:26
the parameters of the GMM; if you have a neural network, then a neural network.
00:42:30
Given the segmentation you have the label for each state, so you can train; and then,
00:42:36
once trained, you realign and you get a new segmentation; and once you get a new segmentation,
00:42:45
you go back and retrain the parameters of the emission distribution;
00:42:50
the emission distribution is re-estimated, and you keep repeating this process
00:42:55
until convergence; one criterion would be to say that you stop when, at some point, your parameters change very little.
00:43:03
That is what is called the Viterbi EM algorithm, or Viterbi training.
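To make the loop concrete, here is a toy sketch of this flat-start Viterbi training (my own simplification, not the lecture's code: one-dimensional features, a single left-to-right HMM with one Gaussian per state, and hard Viterbi realignment):

```python
import numpy as np

def viterbi_train(x, n_states, n_iter=10):
    """Flat-start segmental (Viterbi) training of a left-to-right HMM
    with a single Gaussian per state, on one 1-D observation sequence."""
    T = len(x)
    # flat start: divide the sequence equally and associate each part with a state
    bounds = np.linspace(0, T, n_states + 1).astype(int)
    align = np.repeat(np.arange(n_states), np.diff(bounds))
    for _ in range(n_iter):
        # M-step: re-estimate means and variances from the hard alignment
        mu = np.array([x[align == s].mean() for s in range(n_states)])
        var = np.array([x[align == s].var() + 1e-3 for s in range(n_states)])
        # frame log-likelihoods under each state's Gaussian
        logp = -0.5 * (np.log(2 * np.pi * var)[None, :]
                       + (x[:, None] - mu[None, :]) ** 2 / var[None, :])
        # E-step (hard): best monotonic alignment by dynamic programming
        D = np.full((T, n_states), -np.inf)
        back = np.zeros((T, n_states), dtype=int)
        D[0, 0] = logp[0, 0]
        for t in range(1, T):
            for s in range(n_states):
                stay = D[t - 1, s]
                move = D[t - 1, s - 1] if s > 0 else -np.inf
                back[t, s] = s if stay >= move else s - 1
                D[t, s] = max(stay, move) + logp[t, s]
        # backtrack from the final state to get the new segmentation
        new = np.empty(T, dtype=int)
        new[-1] = n_states - 1
        for t in range(T - 1, 0, -1):
            new[t - 1] = back[t, new[t]]
        if np.array_equal(new, align):
            break  # segmentation no longer changes: converged
        align = new
    return align, mu, var
```

On a sequence with a few clearly separated levels, one or two iterations are enough for the boundaries to move from the uniform flat-start split to the true segment boundaries.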
00:43:12
Now, that is what is typically used; many people, even in
00:43:16
Kaldi if I am not mistaken, use the Viterbi EM
00:43:21
algorithm. So you will see that usually
00:43:26
people go for this rather than the full-likelihood (forward-backward) training.
00:43:30
In other words, sometimes people in the deep-nets era say
00:43:36
"oh, we are doing alignment-free training" and all
00:43:40
this kind of thing, or that CTC can train without alignments; but basically,
00:43:45
what happens there is: if you do the proper Viterbi EM step,
00:43:50
or if you go for the full posterior, the full
00:43:54
forward-backward algorithm, then what you are going to do
00:44:00
is, in effect,
00:44:05
what is called label-less, or rather alignment-less, training anyway.
00:44:12
So,
00:44:15
going back to one of the applications I was just talking about:
00:44:20
we are taking the max over the hypotheses, and we are looking at this posterior here.
00:44:26
In speech recognition you are interested in finding the best matching word hypothesis.
00:44:32
This is the posterior; then you apply Bayes' rule.
00:44:37
Since the denominator is independent of the word hypothesis, we can drop it,
00:44:43
and that is your standard formulation.
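Written out, the decision rule just described is the standard formulation (W a word sequence, X the acoustic observation sequence):

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{p(X \mid W)\,P(W)}{p(X)}
        = \arg\max_{W} \, p(X \mid W)\,P(W),
```

where the last step holds because p(X) does not depend on W; p(X | W) is the acoustic (emission) likelihood and P(W) the language model probability.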
00:44:49
So you have the speech signal; a front-end processing stage gives you spectral features; then the likelihood estimator,
00:44:58
which typically uses GMMs or neural networks or
00:45:04
whatever it is, gives you the likelihoods;
00:45:08
these go into the decoder, along with the lexicon and the grammar, and out comes the recognised sequence of words.
00:45:16
The most difficult part is the decoder.
00:45:21
Training is easy for HMM systems; the decoder is the tough
00:45:27
part, because the decoder is going to do everything on the fly.
00:45:33
Just to illustrate the point, here is the picture: what happens in the decoder?
00:45:40
Your language model generates word hypotheses; it can
00:45:44
generate any kind of word hypothesis allowed by
00:45:50
the language.
00:45:53
Then you bring in all the lexical knowledge and everything, and
00:46:00
for each word hypothesis you are going to get a sequence of categorical distributions here.
00:46:06
And then you apply the maximisation, the decision rule written above. And the problem
00:46:12
is that you cannot do a full search over this, in the sense that
00:46:17
the number of hypotheses is quite large; in fact
00:46:24
it grows, I would say, combinatorially, basically.
00:46:31
So if you do a full search it is practically
00:46:36
infeasible. So what do people do? They go for some search heuristics.
00:46:41
One of them is beam-pruned decoding. When you are doing the dynamic programming, which
00:46:46
is exactly what you are doing with Viterbi, for example,
00:46:55
the very same dynamic programming as before,
00:47:02
you apply some search heuristics: for example, in beam search you only keep,
00:47:11
at each step, the likely
00:47:15
hypotheses; you do not keep all the hypotheses on the fly.
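As a toy sketch of the beam pruning just described (my own minimal example, not a real decoder; the state names and scores are made up): a frame-synchronous Viterbi search where hypotheses falling more than a fixed log-score margin below the current best are dropped.

```python
def beam_viterbi(frames_logp, trans_logp, beam=5.0):
    """Frame-synchronous Viterbi decoding with beam pruning.

    frames_logp : list of dicts {state: emission log-likelihood}, one per frame
    trans_logp  : dict {(prev_state, state): transition log-probability}
    beam        : hypotheses scoring more than `beam` below the best are dropped
    """
    # active hypotheses: state -> (accumulated log-score, best path so far)
    hyps = {s: (lp, [s]) for s, lp in frames_logp[0].items()}
    for frame in frames_logp[1:]:
        new = {}
        for s, emit in frame.items():
            for prev, (score, path) in hyps.items():
                t = trans_logp.get((prev, s))
                if t is None:
                    continue  # transition not allowed by the topology
                cand = score + t + emit
                if s not in new or cand > new[s][0]:
                    new[s] = (cand, path + [s])
        # beam pruning: keep only hypotheses close to the current best
        best = max(score for score, _ in new.values())
        hyps = {s: (sc, p) for s, (sc, p) in new.items() if sc >= best - beam}
    return max(hyps.values())  # (best score, best state sequence)
```

With a narrow beam the search can of course lose the globally best path; that is exactly the speed-versus-exactness trade-off these heuristics make.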
00:47:21
Another thing that comes into this whole computation is the scaling
00:47:26
factors you will have for the emission likelihoods and the language model probabilities.
00:47:32
That is because the emission likelihood has a different dynamic range than the
00:47:38
language model, which is a probability; the likelihood has a different dynamic
00:47:43
range, so you are going to have scaling factors associated with that;
00:47:48
and also, to avoid the insertion of short words,
00:47:53
you will have a word insertion penalty. Then there is the question of how you can efficiently do the
00:47:59
word-level dynamic programming and the backtracking without too much computation.
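In the log domain, the criterion that is actually optimised in the decoder therefore typically looks like this (with alpha the language-model scale, beta the word insertion penalty, and N(W) the number of words in hypothesis W):

```latex
\hat{W} = \arg\max_{W} \; \log p(X \mid W) + \alpha \,\log P(W) - \beta\, N(W)
```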
00:48:06
One of the papers in the suggested readings is on the use of one-pass
00:48:11
dynamic programming with backtracking for connected word recognition. Now, what I presented here,
00:48:18
I have presented from the HMM-based perspective; the same
00:48:22
thing you can actually do with instance-based approaches.
00:48:25
So basically the search, the Viterbi and all those things remain pretty much the same.
00:48:36
So now to HMM-based text-to-speech. One of the main things about HMMs is
00:48:43
that we can use them as a recognition model, but they can also be a generative model for us.
00:48:51
So I am going to give a quick, short pointer on this.
00:48:55
In text-to-speech, what happens is that you are given a text and you process it;
00:49:02
then, somewhere along the way, you are going to do grapheme-to-phoneme
00:49:06
conversion, and then you are going to get a context-dependent unit sequence.
00:49:13
This context is pretty long compared to regular speech recognition; I will try to show an example here.
00:49:21
Then you take these units and you map them to clustered,
00:49:27
context-dependent units; so you map them to these clustered states.
00:49:34
Then a state sequence is chosen based on the duration model, that is, how many frames are to be generated per state.
00:49:41
And having done that, you are going to generate the vocoder parameters,
00:49:45
that is, the vocal tract system parameters and the voice source parameters;
00:49:49
and this generation can happen with GMMs or with
00:49:54
neural networks, and then the vocoder synthesises the speech.
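A toy sketch of the generation step just outlined (names and shapes are my own; a real HMM synthesiser generates from the state output distributions under dynamic-feature constraints, e.g. MLPG smoothing, rather than copying state means):

```python
import numpy as np

def generate_trajectory(state_seq, durations, state_means):
    """Expand an HMM state sequence into a frame-level vocoder-parameter
    trajectory: each state emits `durations[k]` frames (chosen by the
    duration model), here crudely set to the state's mean parameter vector."""
    frames = []
    for state, n_frames in zip(state_seq, durations):
        frames.extend([state_means[state]] * n_frames)
    return np.array(frames)
```

The resulting (frames x parameters) matrix is what would be handed to the vocoder for waveform synthesis.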
00:49:59
So the HMM here is a generative model; you are reversing the whole process.
00:50:04
When you do speech recognition it is an analysis model, and
00:50:10
when you go for text-to-speech it is going to be a synthesis model.
00:50:15
Similarly, speech production has both analysis and synthesis sides:
00:50:19
we can analyse and we can synthesise; and in speech processing we
00:50:23
can likewise analyse and synthesise, with the same HMM.
00:50:29
So what are the differences when HMMs are used for synthesis?
00:50:33
Usually, if you are doing speech recognition, you will have
00:50:36
a fairly low number of cepstral coefficients as features,
00:50:43
whereas in synthesis you really do need more, forty to sixty parameters at least.
00:50:49
And usually in speech recognition we do not use the full context; the
00:50:56
modelling typically goes up to what is called a
00:51:00
triphone or, you know, at most a quinphone.
00:51:05
What that means is that you are considering the preceding and following phoneme as context; and it is easy to see that
00:51:12
the same phone in one context
00:51:20
and in another context is going to be pretty different. The full context in synthesis considers the
00:51:26
preceding two and following two phonemes, plus all sorts of prosodic features.
00:51:32
In the chapter by Simon King you can see this.
00:51:39
Yes,
00:51:43
you can see it here:
00:51:49
you can see the preceding and following phone, the position
00:51:53
of the segment relative to the syllable boundaries, the position of the syllable in the word and phrase. If
00:51:58
you use the full context, all of this information is what is laid out there.
00:52:09
And that
00:52:18
is it.
00:52:22
And then, for duration modelling: in recognition we usually just have the state self-transitions;
00:52:29
here you need an explicit parametric model for the state durations.
00:52:35
Training is Baum-Welch, a version of Baum-Welch adapted for this task; some
00:52:39
more about it you can find in the paper.
00:52:43
For decoding there is no search needed; usually you do not search, though of course you can
00:52:50
if you want. A search is not required: decoding is maximum-likelihood parameter generation.
00:52:56
So, what we have seen: in synthesis, as
00:53:02
in speech recognition, in speech processing we are trying to match
00:53:07
two hypotheses; in synthesis you are generating the speech signal that is going to match
00:53:13
the word hypothesis, and it relies on precisely the same kind of information; you reverse the problem.
00:53:20
So I am going to stop here and take questions.



Conference program

Sequence modelling for speech processing - Part 1
Mathew Magimai Doss, Idiap Research Institute
12 Feb. 2019 · 9:05 a.m.
Sequence modelling for speech processing - Part 2
Mathew Magimai Doss, Idiap Research Institute
12 Feb. 2019 · 9:59 a.m.
Sequence modelling for speech processing - Part 3
Mathew Magimai Doss, Idiap Research Institute
12 Feb. 2019 · 11:10 a.m.