Transcriptions

Note: this content has been automatically generated.
00:00:00
so um so there's two weeklies um recap from the point
00:00:05
we left so we were talking about okay that's a
00:00:09
a texas based system that and thanks be given in fact and we can do like uh what they call problem
00:00:16
if you don't have that you can also give a human reference and still that score will match
00:00:23
to the uh what activists this this even if you give a human happens this method would
00:00:31
would would yield this kind of correlation um basically when we were putting this
00:00:38
paper was the improvement over that that that that that the f. f.
00:00:43
so why it's interesting because if you recall a human um way of uh
00:00:50
uh intelligibility assessment is that we give it to humans and usually in text to speech you have like
00:00:58
what is called semantically unpredictable sentences and you ask
00:01:02
humans to listen to it and then write it down
00:01:06
after all the human can also make errors
00:01:09
uh like they insert a new word which is not there but in the end
00:01:15
you are usually looking at the accuracy how many words the human was able to recall
00:01:21
so pretty much the same kind of thing you are basically objectively trying to find it
00:01:29
so what we do as a set the uh so even if
00:01:32
you do something like um syllable test like nonsense syllable test
00:01:38
those kind of thing again what you have to do at the end you view if you if you can't really see
00:01:44
you have a syllable uh you have a syllable test you have a sequence of
00:01:50
ones you are looking at first and then the human is going
00:01:54
to give some other reference sequence and then what you do
00:01:57
is a string matching and you're going to compute how many of
00:02:00
them were correctly detected and how many were not
00:02:06
and the whole problem as i say again comes back to whatever even a human when you're doing a
00:02:11
subjective analysis we are actually doing some form of this string matching with a human in the loop
00:02:18
so that's where the method automatically brings in that kind of capabilities
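
Editorial sketch: the string matching described here, for scoring a listening test, can be illustrated with a standard Levenshtein alignment between the sequence that was played and the sequence the listener wrote down. The function name `align_counts` and the example sentences are assumptions made for illustration; the talk does not specify a particular scoring tool.

```python
def align_counts(reference, hypothesis):
    """Align two token sequences with Levenshtein distance and count
    correct tokens, substitutions, deletions and insertions."""
    n, m = len(reference), len(hypothesis)
    # dist[i][j] = edit distance between reference[:i] and hypothesis[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = reference[i - 1] != hypothesis[j - 1]
            dist[i][j] = min(dist[i - 1][j - 1] + mismatch,  # match or substitution
                             dist[i - 1][j] + 1,             # deletion
                             dist[i][j - 1] + 1)             # insertion

    # Trace back one optimal path to recover the error types.
    counts = {"correct": 0, "substitutions": 0, "deletions": 0, "insertions": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            mismatch = reference[i - 1] != hypothesis[j - 1]
            if dist[i][j] == dist[i - 1][j - 1] + mismatch:
                counts["substitutions" if mismatch else "correct"] += 1
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletions"] += 1
            i -= 1
        else:
            counts["insertions"] += 1
            j -= 1
    return counts


# Hypothetical example: what was played vs. what a listener wrote down.
played = "the cat sat on the mat".split()
written = "the cat sat on a the hat".split()
counts = align_counts(played, written)
accuracy = counts["correct"] / len(played)
```

The same routine works on syllable or phone sequences, since it only compares tokens for equality.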
00:02:27
having said that if i'm going to talk a little more
00:02:32
um talk inside that so i'll what i know that
00:02:38
is that we have a speech production phenomenon and a perception phenomenon and then
00:02:45
and then i think this is inundated in this uh perception and the production phenomena
00:02:52
at some point of time um there are people like um Jakobson
00:03:01
uh Halle and Fant they wanted to link this what
00:03:06
you perceive here to how you articulate here
00:03:12
and that led to this theory of distinctive features
00:03:19
then there are people a little bit later uh like Liberman and all
00:03:26
i didn't put the reference here but i will add it in the final paid yep
00:03:31
they were coming up with what was the motor theory that's saying that
00:03:36
when you perceive the speech you try to map it to the articulatory gestures and try to match it
00:03:45
so in perception their argument was that there is some form of
00:03:50
inherent articulation involved some form of production and perception process they go together
00:03:59
and in most of the speech processing until now what we have seen
00:04:04
is that basically you have a speech signal you have
00:04:07
a feature i go to phones i go to syllables
00:04:11
basically yeah but you're not looking at the synergy between the production and the perception
00:04:20
and often you will see that in pathological thing this is whether the one thing start breaking because
00:04:26
production phenomena steps some problem perception phenomena people don't know
00:04:31
how to solder that a perception a production problem
00:04:34
well they percy and basically that's just that's extract the for creating the value for it
00:04:42
so what we did is we came up with the same posterior based formulation
00:04:50
we came up with a way where we can link this production knowledge this
00:04:55
perception phenomenon and speech signal in one explainable way
00:05:03
so the idea was the following
00:05:06
you have a speech signal
00:05:09
and you can define what we call articulatory features
00:05:15
of course there are multiple theories that exist for this one is
00:05:19
uh the binary one positive or negative
00:05:23
each sound can be described by like a fixed set of features as positive or negative
00:05:28
then there are people like mother for the the no the
00:05:31
pushed for the idea called multi-valued articulatory features
00:05:35
so basically they said there is something like a place of articulation
00:05:41
a manner of articulation then the degree of constriction
00:05:45
nasality and so on they had different sets
00:05:50
that i didn't realise how about exactly the features
00:05:54
so i was part of a i think the the this a workshop which
00:05:58
was working on multi-valued articulatory features in two thousand six
00:06:04
so we came up with nice methods to
00:06:06
model them for pronunciation variation on understanding
00:06:12
trying to put them in dynamic bayesian networks and see if we can build a speech recogniser with that
00:06:19
um which they could not make it at that point of time
00:06:24
but yeah at the and the whatever doing with them yet uh
00:06:30
one of the first people are just on this problem
00:06:34
so in that case so we have articulatory features
00:06:40
you get a speech signal a feature you feed it to different posterior probability
00:06:46
estimators for place of articulation manner of articulation
00:06:51
height of articulation and so on so we would be training these
00:06:57
and then you stack those features so you create a stacked feature
00:07:03
and then you use them as a feature observation for an HMM
00:07:08
a KL-HMM where the parameters are categorical
00:07:13
distributions a stack of categorical distributions corresponding to each
00:07:18
articulatory multi-valued feature
00:07:23
again the score to match these two things is Kullback-Leibler divergence based
00:07:30
you can do the same standard Viterbi methods and you can train these parameters
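
Editorial sketch: one way to picture the stacked-feature matching is a local score that sums a Kullback-Leibler divergence over the articulatory channels, comparing a state's categorical distributions with the posteriors estimated for one frame. The channel names, the example probabilities and the direction of the divergence (state distribution first) are assumptions made for illustration; the talk does not fix these details.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two categorical distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def stacked_local_score(state_params, frame_posteriors):
    """Sum of per-channel KL divergences for one HMM state and one frame.

    Both arguments map a channel name (e.g. 'place', 'manner') to a
    categorical distribution over that channel's values.
    """
    return sum(kl_divergence(state_params[ch], frame_posteriors[ch])
               for ch in state_params)


# Hypothetical state: categorical parameters per articulatory channel.
state = {
    "place":  [0.8, 0.1, 0.1],   # e.g. bilabial, alveolar, velar
    "manner": [0.7, 0.2, 0.1],   # e.g. stop, fricative, nasal
}
# Hypothetical posterior estimates for two frames.
matching_frame = {"place": [0.8, 0.1, 0.1], "manner": [0.7, 0.2, 0.1]}
other_frame = {"place": [0.1, 0.1, 0.8], "manner": [0.1, 0.8, 0.1]}

good = stacked_local_score(state, matching_frame)   # close to zero
bad = stacked_local_score(state, other_frame)       # clearly larger
```

Because the score is a sum over channels, each channel's contribution can be inspected separately, which is the property the later analysis of the channels relies on.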
00:07:38
so this guide model directly kind of uses the link to this whole maybe appealing to what
00:07:45
at some point this motor theory of uh perception was rejected by many people
00:07:51
at some point uh you if you go back and related to have um the from the billing you stayed at some
00:07:58
point they were really rejecting this theory but here what the formulation is is you have a speech signal
00:08:06
from acoustic space you go to the production space from perception space you come
00:08:10
to the production space and you do the match in the production space
00:08:16
you know my think everything the production levels and say that
00:08:22
i hope in making the point is i hope we've got the point so we're the
00:08:28
whole match the whole symbol space is lying for me in the production space
00:08:34
and it is a multichannel space um this symbol space too is multichannel
00:08:40
if it's all like that and then they and i'm doing that
00:08:44
so bent on that i i mean i'm not the flesh dupe
00:08:49
clearly show that you can do speech recognition like this
00:08:55
until then most of the people using these features would do some form of um
00:09:01
question i additions and everything and then feed it to russia for questions we said we can retain all
00:09:08
this information as it is and we can still build the speech recogniser with that
00:09:15
so
00:09:18
maybe like so first so that so what happened it like that
00:09:24
a man a place for the town like ah manner
00:09:28
there were three state they're all synchronise the valve will last a five uh uh in the model
00:09:36
so yeah that the banana lighting so what we did is basically what we did is
00:09:41
we retain would it different for for here and we started analysing those for models
00:09:48
and looking at what in the states capturing product is it making any sense to me
00:09:53
uh are they really making a uh in a sense this parameters what they are going to learn yeah so i'll
00:10:03
so we say well ah that's a man and it's available and places
00:10:07
back similarly the like the wives top place value rollers like this
00:10:15
uh we must reading like mile diva analysing these parameters in some
00:10:20
cases like our our beep uh all or a ball
00:10:26
the like this kind of sound we found there is some form of asynchrony in these channels
00:10:33
so this whole process is not like synchronised that you attain the place
00:10:37
of articulation doesn't mean that you attain also the manner of articulation at the same time
00:10:42
so we found that this kind of model actually can capture the synchronous case
00:10:48
as well as the asynchronous case it can model both kinds of information
00:10:53
although your posterior estimator is trained with the synchronous
00:10:57
case and not the asynchronous case that is what we were saying
00:11:03
then we also didn't experiment that really
00:11:08
uh we said instead of phones let's use graphemes here instead
00:11:16
so this is the case for the word below
00:11:21
and this is the gap in case of the of the disciplines of uh of a b. d. l. or w.
00:11:29
and we think that there is a good uh like this low
00:11:34
yeah it's low cost both here you can see that
00:11:38
be it by uh it it covers points the uh and he was here and there was
00:11:43
a slight differences what we were thing to what we saw that well we'll see
00:11:49
so this yeah that's an idea okay i'm so this is a model where we can
00:11:55
the perception space i can even look yeah no date result i themes because
00:12:02
but they then is bigger than black teams in some way is on the embedding the phoneme information other than human can have
00:12:08
we all speak with numbers that space it can be within those
00:12:12
information for that's inside that and i could learn distillation
00:12:18
now um so this kind of understanding far less yeah that's a little
00:12:25
bit further motivation for going for that than just speech problem
00:12:31
the problem is the following so in speech you have a production phenomenon
00:12:36
where you speak but then people also communicate with signs
00:12:41
which is a visual mode of communication in sign what happens is
00:12:46
there are some manual aspects like you move your hands you change your hand shape
00:12:52
you can do some non-manual stuff like changing facial gestures moving shoulders
00:12:56
and all and this is the multichannel information that you're getting
00:13:02
and then whoever you are talking to is going to perceive that as the sequence
00:13:07
of words or phrases and they're going to interpret what has been said
00:13:14
so we said okay if our idea of production
00:13:19
and perception phenomena and all those things is reasonable
00:13:22
we should be able to take this method from speech and move it to sign language
00:13:31
so uh so that's what we did it
00:13:35
all this hand shape and position all this becomes my articulatory space
00:13:43
so this visual space perception faces a sequence of atoms
00:13:48
date i really don't know how to define it
00:13:52
this space because in speech we would say
00:13:57
that uh we have a dictionary that tells me
00:14:01
this is the phone sequence and based on that i can already find
00:14:05
the sequence here there's nothing like that in sign
00:14:11
it went laughing to combine it there's a sequence and if so that
00:14:16
it didn't fall off um of uh there's no standard that
00:14:20
actually just take this big and there's really no standardised way of
00:14:24
writing uh a sign language
00:14:30
what they are not like he's actually what you produce if for example if i say below
00:14:36
what they're writing any in speech would be like i have to close my lips
00:14:41
then i have to go to you know more my little bit far how light will make file i understand that
00:14:50
exactly so you is that i think that that's an addition and that's exactly what sign people
00:14:56
annotating sign language do that is what is called HamNoSys coming from the Hamburg Notation System
00:15:04
it doesn't mean that exactly that was the perception processes going
00:15:07
here it just you're not getting the articulation process
00:15:13
and if i know that for speech can heal you can you tell me the some
00:15:17
sequence and i think you can say you cannot say anything perception for them not
00:15:22
so what we did in the study is that they say this and this problem would be sold in them in the data they
00:15:29
were man i became of that that that uh this bit we have people in i guess this coming up in this
00:15:39
so you have from the visual signal you can estimate the hand shape what city are so by them handles like
00:15:46
this um handles like this you know and i whether
00:15:50
habits shape was like this at the moment information
00:15:57
and then we came up with the hand position and so how
00:16:02
far it is from the body how you move the hand
00:16:06
all this information we came up with some nice subunit extraction based on that
00:16:12
like we took some methods and applied this and that and we came up with these subunits
00:16:18
and they're all my uh production space and we did the match
00:16:24
and interestingly it worked well in the sense that i'm going to show a demo we showed recently
00:16:33
basically we built a system where a sign language learner comes
00:16:39
and they are shown how the sign has to be produced
00:16:43
novice or expert it doesn't matter
00:16:48
um and then they try to produce the sign
00:16:53
and based on the sign production we can tell them whether your hand
00:16:57
shape was wrong or your movement was wrong we can tell that
00:17:35
so we that we do not sell it and we did not
00:17:41
yeah i have to use powerpoint for that
00:17:49
yes
00:18:41
oh yeah it's something it's only let me go back
00:19:04
okay
00:19:19
oh it's nice remember just doesn't
00:19:22
the basic idea is you can use it in two different modes
00:19:29
that's cool
00:19:31
megan
00:19:40
you've used
00:19:42
oh
00:19:43
the
00:19:48
this is too was dues you so full
00:19:54
no
00:19:55
does directly
00:20:12
you get feedback on things like uh hand motion hand position
00:20:15
hand shape and uh orientation and it will um
00:20:20
use that information to present you with visual feedback that's basically going
00:20:25
to be a red circle around the hand the more red it is the more wrong the hand shape
00:20:30
uh and the red line for the movement where again the more red it is the more incorrect it is
00:20:37
so there's two components basically uh so this is a
00:20:40
um uh Kinect sensor and what it does is
00:20:43
it records uh the colour images and records depth images so
00:20:48
it's basically an image but for every pixel
00:20:52
instead of colour it's the distance to the nearest object um so
00:20:57
when you're recording a person you can basically tell where the whole body is
00:21:01
and what you can do is some uh processing and get a skeleton which is the green stuff you see there
00:21:07
um and so we extract the uh colour depth and the skeleton
00:21:12
from the Kinect and then we run those things through
00:21:16
a selection of neural networks and hidden markov models which are
00:21:20
sorts of classifiers that then uh give us um
00:21:27
confidences on how correct the sign or a particular aspect of the sign was
00:21:47
so it works really well when you produce the sign close to what it should be
00:21:51
if one of the limitations is if you do something that was just completely
00:22:06
knowledge
00:22:10
tools
00:22:19
what do you think
00:22:25
you mean
00:22:28
oh
00:22:31
oh yeah so the big problem for me
00:22:36
is to find what are those phonemes in sign
00:22:41
do they have similar kind of uh things to what we have in human speech
00:22:46
like syllables or some kind of structure of these different movements and all
00:22:51
yeah but it is that it it like a very long complicit line for that
00:22:58
so so the last slide is uh that's just the summary
00:23:06
matching of sequences is the fundamental problem in many speech applications that's what
00:23:12
i tried to show that if you solve that problem
00:23:16
uh then uh many things for you are sitting on this the point
00:23:23
what i tried to emphasise in this talk is that invariably
00:23:27
in almost all methods it is the latent symbol sequence matching problem
00:23:32
so you take any method in the literature even the end-to-end methods i was talking about
00:23:39
and you may be able to see that you can cast them into this formalism
00:23:46
we don't know yet what is the ideal latent symbol set or
00:23:50
how to find the best latent symbol set
00:23:55
so the currently what we're working on is basically it can
00:23:59
can is broadly new technology and data the remnants of
00:24:02
the moment you dupont independent up laughter condor different than you need it it had it is both linguistic
00:24:08
last did everyone you're gonna be what you're doing and that's the best what
00:24:12
you're doing with the sars and everything that's what you're working turned him
00:24:19
had some numbers approach suitable for both recognition and synthesis to so basically
00:24:24
it's the next model it can analysis it gets into nice
00:24:30
what i like to in the and was talking of a lot lot
00:24:35
going away from the traditional likelihood based approach to the posterior based approach
00:24:41
um i didn't go too much into the theory and all those things but we can show
00:24:45
a strong to information daily and detection taping this there is a very old proof
00:24:52
it is a modular approach i tried to show how you can play with these different components
00:24:59
like whether you want to play with the symbol set whether you want to play with the
00:25:03
um mapping the speech signal to the symbols or you want to map the what happens to the table it's
00:25:09
a model of problems and you can solve many of
00:25:13
your talent is thing called working on this problems
00:25:21
the idea is that the better the posterior uh probability estimator
00:25:27
um the better the approach would be the basic point i'm trying to drive is
00:25:33
today we have seen a lot of things about neural networks and all
00:25:37
i can use that today but tomorrow you come and tell me
00:25:43
okay this is the best posterior probability estimator i'll take it
00:25:48
the theory doesn't depend upon whether i'm using gaussians or a neural network
00:25:55
what it needs is a posterior uh probability estimator
00:26:03
and the model if you come up with it it is independent of that
00:26:09
uh independent of what you use as an estimator
00:26:15
i'd like to show that it can unify a number of problems so you
00:26:19
can do recognition you can do synthesis you can do assessment you can do
00:26:23
um also articulatory production knowledge and perception knowledge all
00:26:29
those things you can integrate in this framework
00:26:36
so that's my talk and thank you so i'm open to questions if you have any
00:26:49
oh you are you said oh that's too fast can you explain me something i read we happen to be in that
00:26:58
um
00:27:06
but not not not questions so in the lab sessions you have um um if you're not
00:27:13
familiar with hidden markov models or something you can go to the matlab exercise
00:27:17
as i said there is a hands on exercise sheet which you can take
00:27:21
home and you can do it by hand what dynamic programming does
00:27:25
as i said this is a and it's uh it's it's different it's not changing yes
00:27:31
that much it thing across um different problems the it it's it's one duty
00:27:37
and the fun part is that dynamic
00:27:41
programming i mean dynamic time warping was one of the first proper
00:27:45
computer program i wrote in my life so i i i i thought i i i i understood better
00:27:54
so it with the weather very effectively with this and i really did in my life so
00:27:59
and then i found that the whole thing is connected to this problem of dynamic programming
00:28:06
except that the cost function and the estimator change and that's the thing so
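
Editorial sketch: dynamic time warping makes the point about dynamic programming concrete, because the recursion stays the same and only the local cost function is swapped. The function name `dtw`, the step pattern (insertion, deletion, match) and the toy sequences are assumptions made for illustration, not taken from the lab exercise.

```python
def dtw(seq_a, seq_b, local_cost):
    """Accumulated cost of the best monotonic alignment of two sequences.

    local_cost(a, b) can be any frame-level cost: an absolute difference,
    a Euclidean distance, a KL divergence between posterior vectors, etc.
    """
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # acc[i][j] = best accumulated cost aligning seq_a[:i] with seq_b[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = local_cost(seq_a[i - 1], seq_b[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j],      # stay on frame j of seq_b
                                   acc[i][j - 1],      # stay on frame i of seq_a
                                   acc[i - 1][j - 1])  # advance both
    return acc[n][m]


def absolute_difference(a, b):
    return abs(a - b)


# Same pattern spoken slower: one frame repeated, so the cost is zero.
same_pattern = dtw([1, 2, 3], [1, 2, 2, 3], absolute_difference)
# Shifted pattern: some cost remains after the best alignment.
shifted_pattern = dtw([1, 2, 3], [2, 3, 4], absolute_difference)
```

Replacing `absolute_difference` with a divergence between posterior vectors changes the scores but not the recursion.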
00:28:12
in the in the in the yes not exercise i didn't deal with
00:28:16
some things like oh how that the decision to even look
00:28:19
like so the something this kind of things you can get into
00:28:24
uh the exercise carly excess i i didn't believe it
00:28:27
uh how you can estimate uh the parameters of GMMs the means and variances
00:28:34
how many of you know how to estimate the parameters of a GMM
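
Editorial sketch: for readers who have not seen it, estimating GMM means and variances is usually done with the expectation-maximisation (EM) algorithm. The one-dimensional version below is a minimal illustration; the function names, the initial values and the toy data are assumptions, and this is not the exercise from the lab session.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, iterations=50, var_floor=1e-6):
    """EM re-estimation of a 1-D Gaussian mixture from initial parameters."""
    k = len(means)
    means, variances, weights = list(means), list(variances), list(weights)
    for _ in range(iterations):
        # E-step: posterior probability of each component for each sample.
        resp = []
        for x in data:
            joint = [weights[c] * gaussian_pdf(x, means[c], variances[c])
                     for c in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: weighted means, variances and mixture weights.
        for c in range(k):
            occupancy = sum(r[c] for r in resp)
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / occupancy
            variances[c] = max(
                sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / occupancy,
                var_floor)  # floor keeps a variance from collapsing to zero
            weights[c] = occupancy / len(data)
    return means, variances, weights


# Toy data: two well-separated clusters around 0 and 10.
samples = [-0.2, 0.0, 0.2, 9.8, 10.0, 10.2]
est_means, est_vars, est_weights = em_gmm_1d(
    samples, means=[1.0, 9.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
```

With real speech features the same two steps apply per dimension (for diagonal covariances), and the initial means would normally come from a clustering step such as k-means.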
00:28:45
well you would i if you thought uh the that i had given a a meetings edition
00:28:53
and i i don't think i complete it didn't do l. into was um
00:28:58
something i want to say before they finish this is something i did well
00:29:05
into was
00:29:08
this uh
00:29:12
so i didn't deal with a little bit of what is called
00:29:18
discriminative training of HMMs
00:29:23
and how you do MLLR like transforms and all those things
00:29:28
i it was a bit too much to go into that that action for me so i prefer
00:29:32
to go but if you want to do i indeed you can go about this book
00:29:37
by gail maggie is the t. v. and um it pretty nicely this right and if you go there
00:29:45
and you look at uh discriminative training like MMI MPE minimum phone error
00:29:52
all those kind of methods if you look they all come back to string matching
00:29:57
but those those training you will at it you can see that
00:30:02
so i the point was that it was too much to believe this um
00:30:07
some of the extra things i took from this paper what by interpretation i gave in terms
00:30:12
of dynamic programming except i am from these people that do that of course for
00:30:17
i had amounts and the amounts this is the this is the ball and five yeah hybrid system
00:30:25
willow than more than i can i get for nothing has changed except for us
00:30:30
architectures and different things you can do with the the kiddies here you go

Conference Program

Sequence modelling for speech processing - Part 1
Mathew Magimai Doss, Idiap Research Institute
Feb. 12, 2019 · 9:05 a.m.
Sequence modelling for speech processing - Part 2
Mathew Magimai Doss, Idiap Research Institute
Feb. 12, 2019 · 9:59 a.m.
Sequence modelling for speech processing - Part 3
Mathew Magimai Doss, Idiap Research Institute
Feb. 12, 2019 · 11:10 a.m.
