Transcriptions

Note: this content has been automatically generated.
00:00:00
Since we're going back in the past, as Hervé was saying: I remember that when I was at Cambridge I had a fellowship, and one of the main points of that proposal was to do speaker adaptation for neural network acoustic models. Although we did some things, we never really got it to work properly. Part of getting it to work properly is getting things to work with essentially some kind of unsupervised adaptation, with labels that are not real labels. But we can do that now, so I'll talk about what I think we can do, a little bit anyway.
00:00:46
So this work is part of a large collaborative project in the UK called Natural Speech Technology, which is a collaboration between Edinburgh, Cambridge and Sheffield. It's a really nice project actually, and, as you may or may not know, these are the three largest groups in speech recognition and speech synthesis in the UK, and it's the first time we've actually all worked together on the same project, so it's been a lot of fun. A great thing about it is that the project really had a focus on the core technology, so things relating to learning and adaptation for speech recognition and speech synthesis, with some example applications as well.
00:01:33
And at the time, as I say, we've been looking back a little bit, because this project started in 2011 and the proposal was written in 2010, so a lot of things have happened in speech technology since then, but we had to look at the motivations we had for writing the project. So: the models we have are weakly factored. If we want to learn about some condition, or adapt to a condition, we just have to collect data from that condition and let the standard learning approaches do it, rather than teasing apart the causes of variability. That's still a problem. There's this domain fragility: if you want a speech recognition system for, say, a new domain, the first thing you typically do is collect some data for that domain. That's still a problem. Speech recognition and synthesis have been largely separate fields; actually that's now a lot less of a problem, and I'll give one example here of how they are fairly similar: the same models and the same algorithms apply to the different problems, acoustic modelling for recognition and for synthesis, and you know that very well from the work of people like Phil. And the fact that we don't react to the environment or the context very well: that's still a problem.
00:02:49
So I think part of the point here is that, despite what you might read, speech recognition and speech synthesis are not yet solved problems. We're not particularly good at incorporating speech knowledge: the best systems we have tend to be as knowledge-free as possible and use the most powerful machine learning techniques we have, and it has proven frustratingly difficult to actually usefully incorporate whatever metadata you might have about a speech recognition or speech synthesis problem in a way that really improves systems. But I think it is still a reasonable belief that we can do that somehow. And we went into this project with a somewhat too aggressive view: we said we were going to do no transcription in the project, and of course if you do speech recognition or speech synthesis research you know it isn't going to work quite like that. So we did do some transcription, particularly of test sets, so we could see how we were doing, but we did want to move further away from purely supervised approaches, and this work on adaptation is part of the work towards that.
00:04:08
So, by adaptation I'm primarily talking about acoustic adaptation; I'm not talking about language models this morning. We're looking at adaptation to different speakers, adaptation to the acoustic environment and to different channels, adaptation across tasks (although again not much of that this morning), and, although we've done quite a bit of work in the area that I won't talk about this morning, adaptation across multiple languages; to some degree I think it's quite profitable to think of building multilingual systems as just another example of adaptation in speech processing. What I want to do is talk about two or three things that we've done at Edinburgh, but I also want to give a slightly broader picture of some of what we've done across the NST project, because we really have had a strong focus across the three labs on adaptation.
00:05:04
So why is adapting neural network acoustic models a challenge? Very simply, we have these big neural network models and they can have a lot of parameters, tens of millions, even hundreds of millions, and some of the things people are developing are getting even bigger than that. So potentially you might think you need a lot of adaptation data, in particular because these parameters are not obviously structured: you have huge weight matrices, you might have an output weight matrix with twenty million weights or something like that, and it can be rather difficult to know what structure you can take advantage of.
00:05:51
We also want to do some kind of unsupervised adaptation. In speech recognition, by unsupervised adaptation we typically mean adaptation using what some people in machine learning call pseudo-labels: the labels we're using are the labels that a first-pass recogniser supplies, so they may have a high word error rate, twenty percent, thirty percent, whatever, in some cases, but you would still like to be able to adapt the parameters you want to adapt using these pseudo-labels. It's relatively uncommon to have a situation with supervised data, where you have transcriptions for your adaptation task. If you buy a dictation system off the shelf and it asks you to read something, that's one example where you do have supervised adaptation, but in most cases, you're calling into a call centre, or it's a TV programme or something, there's no labelled data to do the adaptation.
00:06:52
We would also like adaptation to be compact. You can adapt all the weights of the neural network, but if you adapt sixty million weights, that means that per user you have sixty million parameters, and even if you are Google you don't want sixty million parameters for a billion users; that's a lot of storage. So you really want more like a few thousand parameters per user; plus, you might believe from a learning point of view that that's a good way to do it.
00:07:32
And finally there are choices to make about whether you jointly optimise the core parameters and the adaptation parameters, as in speaker adaptive training settings. There are pros and cons to this: you might think it's always a good thing to do this joint optimisation, and that's correct, but sometimes you might want to adapt a system where you haven't already done some kind of speaker adaptive training process. So it's nice to have adaptation approaches that can work in a test-only setting.
00:08:00
So, approaches to model adaptation. You can slice and dice these things in different ways, and of course there are overlaps between the different approaches, but I think it's helpful to think in terms of three approaches to adapting acoustic models. First, there are the feature-space approaches, where you try to transform or normalise the acoustic features for each speaker; you typically do this in an adaptive training setup, but you can do it test-only, and the classic example is feature-space MLLR (fMLLR). As I was saying, there is a linkage between the different approaches, because you could equally interpret feature-space MLLR as a model-based approach: you are doing a constrained linear regression on the model parameters of your GMMs, with a single transform, which is a popular approach to adaptation. And as I'll show you on the next slide, it's something that is always worth considering as a baseline.
00:09:05
Second, you can work using auxiliary features; i-vectors and speaker codes are examples of this, where you have some additional features added to the input which somehow represent the speaker, and you somehow estimate or transfer these speaker-specific features for new speakers. And then there are the model-based approaches, where you want to update all the weights, or maybe some of the weights, or you want to define a particular adaptation set of weights, and this can certainly be done in an adaptive training setup or a test-only setup.
00:09:44
I'm going to pack the talk with results, and the papers have the very specific details of the experiments; I'm not always going to be a hundred percent specific about all the details, but do ask if you want to know something and I can tell you. Mostly I'll be showing results on three different corpora: the AMI corpus, Switchboard, and TED talks, and unfortunately, for historical reasons, we tend to use different front ends for those. So again, if you're interested, look at the papers or ask me, but I think the general message holds.
00:10:32
So this is fMLLR; this is something you should always think of as a baseline. These are adaptation results on the three different corpora: the blue bars indicate unadapted DNN acoustic models, and the red bars show what happens when you use fMLLR-adapted features. The fact that there are two bars per corpus is just showing two different evaluation sets for those corpora; in the Switchboard case the left-hand bar is the Switchboard evaluation set and the right-hand bar is the CallHome evaluation set, and it's the left-hand one that Microsoft have talked about a lot, rather than the CallHome case.
00:11:21
So what can we see here? These results use DNN acoustic models, and nearly always in this talk we'll be using the same basic structure for the DNNs, which we have empirically found works pretty well: six layers of two thousand units per layer, using sigmoid units (you can do slightly better with different units such as ReLUs or maxout, but we use sigmoids), and typically plus or minus five frames of input context. When we do fMLLR adaptation, we train a GMM-HMM system and estimate a single MLLR transform in which the mean and covariance transforms are constrained to be the same, so it corresponds to a feature-space transform, and then we can use the transformed features at training and test time for our neural networks.
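As a rough illustration of the feature-space idea described above, here is a minimal sketch of applying a per-speaker constrained MLLR (fMLLR) transform to cepstral features before they go into the neural network. It is not the system from the talk: the transform estimation itself (an EM procedure against a GMM-HMM, as done by toolkits such as Kaldi) is assumed to have happened elsewhere, and the dimensions and values below are made up.

```python
import numpy as np

def apply_fmllr(features: np.ndarray, A: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Apply a per-speaker constrained MLLR (fMLLR) transform to a
    (num_frames, feat_dim) matrix of cepstral features: y = A x + b.

    A and b would normally be estimated with EM against a GMM-HMM
    (e.g. by a toolkit such as Kaldi); here they are simply given.
    """
    return features @ A.T + b

# Hypothetical usage: one transform per speaker, applied both when
# preparing the DNN training data and again at test time.
frames = np.random.randn(300, 39)                 # e.g. MFCCs plus deltas
A = np.eye(39) + 0.01 * np.random.randn(39, 39)   # placeholder transform
b = np.zeros(39)
adapted = apply_fmllr(frames, A, b)
```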
00:12:14
These results are all on MFCC or PLP cepstral coefficients, because it is rather tedious, at best, to estimate an fMLLR transform if you want to use correlated filter-bank features: you are then not using a diagonal-covariance GMM system, and it gets a bit complicated. So this is using MFCCs, or mainly PLPs in one case I think, and you can see you get a consistent sort of five to ten percent relative improvement using the fMLLR transform. The best improvements that we see consistently for adaptation are on the TED talks, and the reason is that you get quite a lot of speech from one speaker, because you typically get a seven or eight minute talk, and the allocation of the speech to that speaker is nearly always correct.
00:13:15
I'm not showing the results in this talk, but when you come to do something like broadcast speech, as in the MGB challenge, which is a multi-genre broadcast task, the adaptation is much worse, because you're having to infer the link to the speakers you want to adapt to, and you get very little improvement from adaptation, largely because you're not reliably adapting to a single speaker all the time. As Mark can tell you, speaker identification on that data is quite challenging; if you look at the diarisation results in the MGB challenge from ASRU, and you had thought that diarisation was a solved problem, you can stop thinking that. So that's the fMLLR baseline. Fortunately, most of the things I'm going to talk about are complementary to fMLLR, so we can take the gain from fMLLR and get improvements on top of that.
00:14:18
So first I want to talk a little bit about auxiliary features. This is work that was primarily done elsewhere, the first part originally at Cambridge and the second part at Sheffield, so I just mean to name-check this sort of work. There is this work on using i-vectors: as you know, an i-vector is a low-dimensional speaker representation which has, over the past few years, basically defined the state of the art in speaker identification. If I remember correctly, the first people to use i-vectors for ASR used them in a GMM system about five years ago, and then people at IBM and elsewhere used them in DNN systems a couple of years later. And I have to say it's not always the most straightforward thing to get good performance, or good improvements, using i-vectors. A lot of people do use i-vectors now because they're in some standard Kaldi recipes, so a lot of people actually use i-vector adaptation in speech recognition without even really knowing it, because it's just in the standard recipe. But it can work quite well.
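A minimal sketch of the auxiliary-feature idea mentioned here: a fixed-dimensional i-vector for the speaker (or utterance) is simply concatenated onto every acoustic input frame of the network. This is illustrative only; the i-vector extractor is assumed to exist elsewhere, and the dimensions and names are my own, not details from the talk.

```python
import numpy as np

def append_ivector(frames: np.ndarray, ivector: np.ndarray) -> np.ndarray:
    """Concatenate a per-speaker (or per-utterance) i-vector onto every
    acoustic frame, so the DNN input becomes [acoustic ; speaker code].

    frames:  (num_frames, feat_dim) spliced acoustic features
    ivector: (ivec_dim,) speaker representation estimated elsewhere
    """
    tiled = np.tile(ivector, (frames.shape[0], 1))
    return np.concatenate([frames, tiled], axis=1)

frames = np.random.randn(500, 440)   # e.g. +/-5 frames of 40-dim features
ivector = np.random.randn(100)       # a 100-dimensional i-vector
dnn_input = append_ivector(frames, ivector)   # shape (500, 540)
```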
00:15:39
One of the things we were interested in is whether you can factorise the adaptation: whether you can adapt to both the speaker and the acoustic environment. Penny Karanasou and Mark Gales at Cambridge looked at doing this on a task using Wall Street Journal with added noise. In this case they were basically looking to extract two separate i-vectors, one concerned with speaker information and one concerned with information about the noise, about the acoustic environment. These were used like the weights in a cluster-adaptive-training style GMM system, and they went on to use them as inputs to a neural network system. They found that doing this sort of factorisation on a noisy Wall Street Journal task can improve things by about five to ten percent. They then went on to improve things a little further: they were interested in whether you could do adaptation on just a single utterance, basically by providing a Gaussian prior, which gives you something that, in the equations, looks very similar to MAP, since you are basically interpolating between the prior and the observed statistics. And they found that, in particular with gender-dependent priors, they could get a few percent improvement relative to just using a speaker-independent prior.
00:17:09
But what I think is more interesting, also on auxiliary features, is some work that Mortaza Doulaty has been doing at Sheffield. He has been looking at using latent Dirichlet allocation (LDA). LDA, as you may know, has been a very hot topic in NLP for fifteen years, with people doing automatic topic modelling based on bag-of-words models; what he has been doing is using a bag-of-sounds model and doing LDA at the acoustic level. His first experiments, which were very much proof of concept and were published at Interspeech last year, looked at really quite different acoustic data, from radio and TV, conversational telephone speech, meetings and so on, and automatically defined domains, learning acoustic domains for this data, using LDA; the way the acoustics went into the LDA was basically by doing vector quantisation and then running LDA over the code words. And he was getting a small improvement by building domain-specific acoustic models in this way, as you should, because these are very different domains. But what he then went on to do was to apply this to the MGB challenge data, which is multi-genre broadcast data.
00:18:42
In this case we had a better way of extracting the LDA domain codes: we trained a GMM over all the speech and basically did a decoding to get, for each utterance, the sequence of most likely component Gaussians. So you can represent each utterance as a sequence of the most likely component Gaussians, treat that as a bag of Gaussians, and once you have that you can perform latent Dirichlet allocation on it. We typically work with something like sixty-four domains, and we did nothing fancy for the LDA domain code: it's just a one-hot representation of the domain, appended to the acoustic features. When you use that in addition to speaker adaptation using fMLLR, in this case I think we got a five to ten percent relative reduction in word error rate on the MGB challenge data, which is actually quite significant: it's two or three percent absolute in the kind of error-rate range people are working in there. So I think that's a very interesting way to do adaptation, and this idea of doing LDA over acoustic symbols rather than over text is, I think, quite a powerful one, and one that is interesting to look at.
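A sketch of the kind of pipeline described above, using scikit-learn for the two modelling steps: quantise frames to their most likely GMM component, treat each utterance as a bag of "acoustic words", run LDA to get latent domains, and append a one-hot domain code to the features. The component counts, number of domains and the placeholder inputs (`all_frames`, `utterances`) are illustrative assumptions, not the actual configuration from the talk.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import LatentDirichletAllocation

# 1. A GMM trained over all the acoustic data acts as the "vocabulary".
gmm = GaussianMixture(n_components=256, covariance_type='diag')
gmm.fit(all_frames)                          # all_frames: (N, feat_dim)

# 2. Each utterance becomes a bag of Gaussian indices ("acoustic words").
def bag_of_gaussians(utt_frames):
    idx = gmm.predict(utt_frames)            # most likely component per frame
    return np.bincount(idx, minlength=gmm.n_components)

counts = np.stack([bag_of_gaussians(u) for u in utterances])

# 3. LDA over the count matrix gives a distribution over latent domains.
lda = LatentDirichletAllocation(n_components=64)    # ~64 domains, as in the talk
domain_posteriors = lda.fit_transform(counts)       # (num_utts, 64)

# 4. One-hot code of the most likely domain, appended to every frame of
#    that utterance as an auxiliary DNN input.
one_hot_domain = np.eye(64)[domain_posteriors.argmax(axis=1)]
```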
00:20:11
The other thing I'm going to talk about is model-based adaptation. Model-based adaptation has been around; it's perhaps what you first think of when you want to adapt neural networks. There was a detailed study done by Hank Liao, who is at Google, looking at adapting different weight subsets of a large DNN, with the slightly depressing result that the best subset to use was all the weights. So you got something like a five percent relative decrease in word error rate when all sixty million weights were adapted, but it didn't work so well when you came to do it in an unsupervised, pseudo-label way. People have also been looking at adapting specific parameter subsets, for example just the biases or the slopes, and that has been interesting, but I don't think you would argue that it gives you a significant and consistent improvement.
00:21:12
There is also some work done at Microsoft which looked at changing the adaptation cost: adaptation approaches where you look at the Kullback-Leibler divergence between the speaker-independent and speaker-adapted output distributions, and use that as a regulariser in your adaptation cost. That again gave a very small but, I think, consistent improvement when they looked at Switchboard.
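For concreteness, here is one common way to write down a KL-regularised adaptation cost of the kind described: the cross-entropy against the first-pass pseudo-labels is interpolated with a term that keeps the adapted output distribution close to the frozen speaker-independent one. This is a sketch in PyTorch under my own naming, not the Microsoft implementation, and the interpolation weight is illustrative.

```python
import torch
import torch.nn.functional as F

def kl_regularised_loss(adapted_logits, si_logits, pseudo_labels, rho=0.5):
    """Adaptation cost with KL regularisation towards the
    speaker-independent model.

    adapted_logits: outputs of the model being adapted
    si_logits:      outputs of the frozen speaker-independent model
    pseudo_labels:  senone targets from the first-pass decode
    rho:            weight on staying close to the SI distribution
    """
    ce = F.cross_entropy(adapted_logits, pseudo_labels)
    si_post = F.softmax(si_logits, dim=-1).detach()
    log_adapted = F.log_softmax(adapted_logits, dim=-1)
    # KL(SI || adapted), up to the constant entropy of the SI posteriors
    kl = -(si_post * log_adapted).sum(dim=-1).mean()
    return (1.0 - rho) * ce + rho * kl
```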
00:21:41
And then people have been looking at ways to get compact transforms, so that you don't adapt all the weights. Again, people at Microsoft have been quite prominent here, looking at things like an SVD factorisation of the weight matrix, which you can think of as putting a kind of bottleneck, or more than one bottleneck, into the weight matrix, and that naturally gives you a more compact transform. We've been looking at some other ways to get compact transforms, and the thing that we found works really well, and is actually remarkably simple, is the thing we call Learning Hidden Unit Contributions, or LHUC.
00:22:24
It is a very simple idea. The idea behind LHUC is that we have one extra parameter for each hidden unit, so we're making the adaptation compact by adapting the units rather than the weights. In the sort of networks we're looking at, with six layers of two thousand hidden units, we have twelve thousand hidden units, and for every unit we provide an amplitude, and it's these amplitude parameters that we adapt. We don't use a raw linear amplitude: we tend to put it through a sigmoid scaled to lie between zero and two, which saturates it a bit, and empirically that works better.
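A minimal sketch of what an LHUC-style layer looks like: one extra parameter per hidden unit, passed through a sigmoid scaled to (0, 2) so that the unadapted model corresponds to an amplitude of exactly one. Written here in PyTorch purely for illustration (the original work predates this library), and the layer sizes are just examples.

```python
import torch
import torch.nn as nn

class LHUCLayer(nn.Module):
    """A fully connected sigmoid layer with an LHUC amplitude per hidden
    unit. The amplitude is re-parameterised as 2*sigmoid(r), so it lies
    in (0, 2) and equals 1 (no effect) when r = 0.
    """
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.r = nn.Parameter(torch.zeros(out_dim))   # LHUC parameters

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.sigmoid(self.linear(x))
        return 2.0 * torch.sigmoid(self.r) * h

# Test-only adaptation: freeze the weights and update only r for one
# speaker, using pseudo-labels from the first-pass decode.
layer = LHUCLayer(440, 2000)
for p in layer.linear.parameters():
    p.requires_grad = False
optimiser = torch.optim.SGD([layer.r], lr=0.1)
```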
00:23:23
If you are just doing this in a test-only case, then in the speaker-independent model these amplitudes are all set to one, so there is no differentiation between hidden units, and in a speaker-dependent model we learn the amplitudes from the data for that speaker, with each speaker getting its own set of amplitude parameters. The idea is that we're thinking of speaker adaptation as emphasising the contributions of some hidden units and de-emphasising the contributions of other hidden units. A particular set of hidden units has learned, if you like, a set of feature detectors over the previous layer, and for some speakers we want to emphasise particular filters and de-emphasise others, and we learn that from the data.
00:24:18
We can also do this in a speaker adaptive way: we can train the hidden units to learn both good speaker-independent representations and speaker-specific amplitudes. That means we want to train speaker-specific transforms at the training phase, but we're also interested in getting a good speaker-independent transform as well, not least because we can then use that model to produce the first-pass labels that we will use for the adaptation. We tried a couple of ways of doing that, and the main way is that, when we are training, for any given training item there is a choice: we can either update the speaker-independent amplitudes or update the amplitudes for that speaker, and we basically toss a coin, possibly a weighted coin, to decide what to do. And it isn't strongly dependent on what that weight is.
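The SAT-LHUC training choice could be sketched like this: for each training example, sample whether the forward/backward pass goes through the shared speaker-independent LHUC parameters or through that speaker's own set, so only the chosen set receives gradient. The probability value below is illustrative, not the setting used in the talk.

```python
import torch

def pick_lhuc_set(si_r: torch.Tensor, speaker_r: torch.Tensor,
                  p_speaker: float = 0.7) -> torch.Tensor:
    """SAT-LHUC style selection: with probability p_speaker use (and hence
    update) the speaker-dependent LHUC parameters for this training
    example, otherwise use the shared speaker-independent set.

    si_r, speaker_r: the two alternative LHUC parameter tensors.
    """
    use_speaker = torch.rand(()).item() < p_speaker
    return speaker_r if use_speaker else si_r

# During training the chosen tensor is plugged into the LHUC layers for
# the forward/backward pass, so gradients flow only to that set, while
# the shared weights of the network are updated in every case.
```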
00:25:36
This is a slightly complicated graph, because it's showing a couple of different things. The colours: green is where we do the coin flip at the frame level, blue is at the speaker level, and red is at the segment level. The solid lines are when we just do the speaker-independent decode, and the dotted lines are when we do speaker adaptation, the speaker-dependent decode. The black lines are the baselines: the speaker-independent system and the non-speaker-adaptively-trained LHUC system. What you can see from this graph of speaker adaptive training is, firstly, that you get quite a nice improvement, of about three percent, when you do the adaptation. The x-axis is the coin-flip parameter between speaker-dependent and speaker-independent; at the value we tend to use, thirty percent of the time we use the data to learn the speaker-independent transform and seventy percent of the time we use it to learn the speaker-dependent transform, but you can see it's not too sensitive to that. It works better if you do the selection at the frame level rather than at the speaker or segment level; the frame level is what we did for the ICASSP paper, but we thought we had better try the speaker and segment levels to see what difference it made.
00:27:31
So we can see that, as long as you use at least about thirty percent of the data to train the speaker-independent part of the model, you get quite good performance. The main difference, and I don't think I put that graph in here, comes when you start to use too little of the data to train the speaker-independent LHUC parameters: the speaker-independent model is then weaker, so the first-pass decode has a higher word error rate, so the targets for your adaptation are weaker, and you get poorer performance. What we do find is that one of the really important things for all of these adaptation approaches where we use pseudo-labels is that they can be a bit sensitive to the accuracy of the first-pass system.
00:28:38
So let's look at the results of this LHUC approach using different amounts of adaptation data. This is on a particular TED test set, going from ten seconds up to five minutes, and in most cases there is about seven or eight minutes of adaptation data available in total. The solid lines indicate what I'll call the unsupervised LHUC transform, and the dotted lines are what I'll call the oracle transform, where we are actually using the true alignments to do the speaker adaptation. What we can see is that even with just a small amount of data, just a few seconds of adaptation data, you can get roughly a one percent improvement: something like one percent from something like ten or twenty seconds of adaptation data, and that's the case when using adaptive training. And as we go on to two, three, five minutes, then we get a two or three percent improvement, or at least a couple of percent, from doing the speaker adaptation. Again, I should say this is TED; you can get good adaptation results on TED because you know who the speaker is and there are typically no changing channel effects, or anything like that.
00:30:18
The interesting thing is that if you compare the solid lines to the dotted lines of the same colour, then up to a couple of minutes there is not a huge difference between the oracle adaptation and the adaptation using the labels from the first-pass decode. What you do find is that for the first-pass decoding approach, after a couple of minutes the improvement curve starts to flatten out, whereas in the oracle, supervised case it continues to improve, and in the speaker-adaptive-training oracle setting you are getting rid of something like thirty percent of the errors by the time you are using seven minutes, which is really a lot.
00:31:07
So that's a nice, powerful technique. The other thing we did was to see how well it worked across all the speakers: was it giving a consistent improvement for each speaker, or was it a bit variable? And this again is quite a nice outcome. What you can see here, I think this is a histogram, is the word error rate along the x-axis for two hundred speakers; we're mixing up the tasks, mixing up TED, AMI and Switchboard, because we're just interested in the relative performance for each speaker before and after adaptation. The word error rate for each speaker varies right up to seventy percent. The blue curve is the word error rates we get with the speaker-independent system, the red curve is what you get using test-only LHUC, and the green curve is what you get from speaker-adaptively-trained LHUC. The main, and I think most important, point is that, with just a few exceptions, literally something like five percent of them, all the speakers improve when we do this adaptation. So it really does give a consistent improvement. What we also find, with just a couple of exceptions, is that speaker adaptive training usually gives an improvement for speakers, or at least doesn't make things worse, compared to test-only adaptation. So for any given speaker you get this typically five to ten percent relative improvement by doing this LHUC adaptation approach. We have also looked at doing it in a factorised setup.
00:33:17
We did these experiments on Aurora-4, which is Wall Street Journal speech mixed with noise. I was looking at this diagram this morning, and a diagram can always be misleading; what we're basically doing is computing separate transforms for the speaker and for the environment, and then we linearly interpolate the transforms together for the combined adaptation. Now, you could just do a joint adaptation for speaker and environment together, and the results will show that too. The reason I think this is interesting is that it's relatively unusual to actually have adaptation data for the right speaker in the right noise condition, whereas if you can mix a noise adaptation with a speaker adaptation, that is a much more general approach.
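A sketch of the factorised combination: two LHUC parameter vectors, one estimated on speaker data and one on environment data, linearly interpolated before being turned into per-unit amplitudes. Whether the interpolation is applied to the raw parameters or to the resulting amplitudes, and the interpolation weight itself, are assumptions made for this illustration.

```python
import torch

def combine_lhuc(r_speaker: torch.Tensor,
                 r_environment: torch.Tensor,
                 alpha: float = 0.5) -> torch.Tensor:
    """Linearly interpolate two independently estimated LHUC parameter
    vectors (one for the speaker, one for the acoustic environment) and
    return the combined per-unit amplitudes in (0, 2).
    """
    r = alpha * r_speaker + (1.0 - alpha) * r_environment
    return 2.0 * torch.sigmoid(r)
```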
00:34:19
If we do this, what do we see? First just look at the blue bars; these are adaptation results on Aurora-4. Starting from the DNN, if we do either speaker adaptation or environment adaptation we get an improvement of about a percent absolute; if we do a joint adaptation we get something like another percent of improvement; and if we do the factored adaptation, where we just take the relevant speaker adaptation and the relevant environment adaptation and interpolate them, we get a very similar performance, I think about 0.3 percent worse than doing the joint adaptation. That is a nice result, because it shows that you can combine different adaptations in a very simple way; I'm sure there are better ways to do the combination, this is just a linear interpolation. The other bars on the right-hand side are there because we wanted to see what would happen with a slightly fancier, more up-to-date acoustic model, so those bars are for a maxout CNN model trained with dropout. That helps significantly: if you compare the DNN, which is I guess about fourteen percent, with this one, which is probably eleven percent or so, you get a significant improvement, and if you then go on to do the joint speaker and environment adaptation I think it's down to about 7.9 percent on Aurora-4. On some of these tasks you have to be careful when you compare numbers, because we're doing speaker adaptation as well as environment adaptation, and some of the numbers you see elsewhere are doing only speaker adaptation, so treat the comparisons with care. But I think the other point this makes is that the LHUC approach doesn't just work on simple DNNs; it works on more recent systems as well, which is nice.
00:36:45
We have also been looking at some related approaches, which we call adaptive pooling. Pooling, which is used in CNNs, is also used in things like maxout hidden units and so on; it is basically concerned with combining a set of hidden units into a single summary statistic. In a typical maxout setup you might take three units and take the max, or you could take an average, which is what we did with CNNs in the nineteen-nineties, but max usually works quite a lot better than an average. There is a nice paper from about four years ago that introduced the notion of differentiable pooling operators: rather than having a fixed pooling operation like a max, you have an operator that does the pooling and whose parameters can be learned. The sorts we have been looking at are, first, a kind of Lp-norm operator, where we take the Lp norm over the pooling region; the default might be the L2 norm, but the exponent p can be a parameter that you learn, and you can learn it using gradient descent. We have also looked at a kind of Gaussian pooling setup: if your pooling region is a set of units, typically three, four or five units, you have a Gaussian kernel over those units and use it to weight how much each unit contributes; the parameters of that pooling are then the mean and precision of the Gaussian kernel, and potentially also an amplitude for each pooling unit. Again, those are parameters you can learn using gradient descent and backpropagation. And the idea we have is that these parameters might give us another nice compact representation with which to do speaker adaptation.
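A sketch of a differentiable Lp-norm pooling unit with a learnable exponent per pooling region, the kind of operator described above; the handful of exponents (one per region) is then the compact parameter set that could be re-estimated per speaker. The re-parameterisation keeping p above one and the initialisation at p = 2 are my own choices for the illustration.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class LpPooling(nn.Module):
    """Differentiable L_p pooling over non-overlapping groups of hidden
    units. Each pooling region has its own learnable exponent p, kept
    above 1 via a softplus re-parameterisation; p = 2 recovers the usual
    Euclidean (L2) pooling. Activations are assumed non-negative (e.g.
    sigmoid outputs).
    """
    def __init__(self, num_regions: int, region_size: int):
        super().__init__()
        self.region_size = region_size
        # rho chosen so that softplus(rho) + 1 = 2 initially (p = 2)
        init = math.log(math.expm1(1.0))
        self.rho = nn.Parameter(torch.full((num_regions,), init))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, num_regions * region_size)
        batch = h.shape[0]
        h = h.view(batch, -1, self.region_size)
        p = (F.softplus(self.rho) + 1.0).view(1, -1, 1)
        pooled = (h.clamp(min=1e-6) ** p).sum(dim=-1) ** (1.0 / p.squeeze(-1))
        return pooled                       # (batch, num_regions)

# Adaptation would then re-estimate only self.rho for each speaker.
pool = LpPooling(num_regions=1000, region_size=2)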
00:39:00
Apologies for this figure: I put it in at the last minute and could only find this vertical version, so it sits a bit oddly on the slide, but it is trying to show what the Lp-norm pooling does. The figure in the top left is just showing what happens to the unit circle when you use different orders of norm: L1 is the red dotted line, the L2 norm is the green one, there is one at p equal to 1.5, and the black square of course is the L-infinity case. And say we have some data and a decision boundary, where we happen to define the decision boundary to look like the unit circle for a given p, so we have one class in the middle and a second class around the outside. You can then learn quite interesting units using this kind of function, and you can see you get different shapes depending on p: this is the p-equals-one case, and this is the p-equals-infinity example. If you do adaptation of these units, you can get some quite interesting behaviours: you can adapt the origin, or the bias, which basically translates the region across the space in this two-dimensional example, or you can expand or contract the region by adapting the amplitudes. It is always dangerous to think of things in two dimensions when we're dealing with high-dimensional data, but this is just to give a kind of motivation for why we think this sort of unit might work well.
00:40:52
We used these pooling adaptation ideas again in the same set of contexts, TED, AMI and Switchboard, with the same six-hidden-layer networks. If we look at what happens to the parameters: the red is the distribution of the mean parameters of the Gaussian kernel used for the pooling, and the blue is the distribution of the precision parameters. What you can see is that in the higher layers they are quite well clustered around the default, but in the lower layers they are a lot more spread out, so there is something going on in the lower layers. You can also see this sort of thing if we look at the distribution of the p parameter for the Lp norm: its default is p equal to two, so Euclidean, which is the black spike, and we can see the lower layers get quite a spread of values of p. The different colours here refer to the different corpora, I believe Switchboard is red and AMI is green, but again, as we go to the higher layers there is less, though still some, adaptive effect, and the values get closer to the default, the black spike. So there is quite interesting behaviour happening when you do this adaptation of these pooling parameters.
00:42:35
If we look at the results, we got small improvements from doing this differentiable pooling on top of LHUC. Looking at the solid lines, the red line is a baseline using LHUC; the black line, which is typically slightly worse, at least when you have more adaptation data, is using the Gaussian pooling; and the blue line, which is typically a bit better, is using the Lp pooling. The dotted lines show what you get when you combine the pooling with LHUC, and you get a consistent further improvement, of around a percent, when you do differentiable Lp pooling on top of LHUC.
00:43:31
So that was a kind of quick tour of some of these hidden-unit adaptations that we have been trying for acoustic modelling. There are other things happening in NST which I don't have time to talk about, including some of Mark Gales' nice work on what he calls multi-basis adaptive neural networks, which is not dissimilar to cluster adaptive training. We have applied this kind of adaptation, Heidi Christensen at Sheffield has applied it, to disordered speech, which is highly speaker-dependent, because you don't just have the different accents and so on that you have with typical speech: people have very different physiologies and different speech affects. So doing some form of adaptation, and, because the data is limited, selecting pools or groups of speakers to do the adaptation on, can give significant improvements in word error rate: down to about forty percent, compared with forty-five percent for a speaker-dependent system trained on a reasonable amount of data, or fifty percent doing a standard MAP adaptation.
00:44:49
I'll also talk about applying this to speech synthesis, because we tend to do speech synthesis, at least in research labs, using statistical parametric approaches, and adaptation has, as you know, been very successful there. That is the basis of the work we have done on developing speech synthesis, particularly for people with some kind of speech disorder, and we do a lot of work on having multiple average voice models, clusters of average voices, and then doing the adaptation from there. More recently we have been applying the techniques I have been talking about to DNN speech synthesis. In DNN speech synthesis you basically learn a mapping from a bunch of linguistic features, which are things related to the phones plus things like position and so on, to the vocoder parameters, which is what you use to drive the speech synthesiser. So you are trying to predict the speech parameters, and we use a DNN to do that mapping.
00:46:11
In this case, and this is primarily work by Zhizheng Wu and colleagues, we were using i-vector adaptation alongside the linguistic features, LHUC adaptation on the hidden units of the DNN, and also a feature mapping on the vocoder parameters, just a linear feature mapping in this case, I think. And what you find is that these techniques work reasonably well.
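Schematically, the three adaptation points described for DNN synthesis could sit together like this: the i-vector enters alongside the linguistic features, LHUC amplitudes scale the hidden units, and a linear mapping is applied to the predicted vocoder parameters. All dimensions and names here are illustrative; this is a sketch, not the actual system from the talk.

```python
import torch
import torch.nn as nn

class AdaptableTTSDNN(nn.Module):
    """Schematic DNN for parametric synthesis with three adaptation
    points: an i-vector appended to the linguistic input, LHUC
    amplitudes on the hidden units, and a linear transform of the
    predicted vocoder parameters. Dimensions are illustrative.
    """
    def __init__(self, ling_dim=400, ivec_dim=32, hid=1024, voc_dim=187):
        super().__init__()
        self.hidden = nn.Linear(ling_dim + ivec_dim, hid)
        self.r = nn.Parameter(torch.zeros(hid))       # LHUC parameters
        self.out = nn.Linear(hid, voc_dim)
        self.feat_map = nn.Linear(voc_dim, voc_dim)   # per-speaker output mapping

    def forward(self, ling_feats, ivector):
        x = torch.cat([ling_feats, ivector.expand(ling_feats.shape[0], -1)], dim=-1)
        h = torch.sigmoid(self.hidden(x))
        h = 2.0 * torch.sigmoid(self.r) * h           # LHUC scaling
        voc = self.out(h)
        return self.feat_map(voc)                     # adapted vocoder parameters
```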
00:46:48
These were evaluated using MUSHRA tests. In a MUSHRA test you are rating speech synthesis: you hear a set of renderings of the same utterance, one of which is natural and the others synthetic, and the listener has to score all of them between zero and a hundred, with the constraint that they must give a hundred to one utterance, which would presumably be the naturally spoken one. We did two evaluations, one on naturalness and one on similarity to the actual speaker. This shows the different adaptations, the i-vectors, LHUC and the vocoder feature transform alone, then the pairwise combinations, and then the combination of all three, and what you see, and it is significant, is that you get the best performance when you combine all three: LHUC with the i-vectors and with the feature transform. You get a similar pattern when you look at the evaluation based on similarity. And if you compare doing the best HMM-based adaptation with doing this DNN adaptation, in terms of both naturalness and similarity, with preference scores based on a scale of a hundred and a scale of ten, we find a strong preference for the DNN system over the HMM system. That sort of matches what we expected, because we really just took the techniques we had developed for adaptation in speech recognition and more or less directly applied them to the speech synthesis case.
00:48:53
Just to finish off, maybe I can show how we use these things in different applications, because it is always interesting to show a demo. You may have seen this demo before, but I'll show it again; it is an example of why we want to do personalised speech synthesis. [audio plays] So that is Euan MacDonald, who, as you can hear, has motor neurone disease, and that is the adapted synthetic voice. In his case we didn't have recordings from before his voice became quite distorted, so the adaptation data was his own voice, and it needed some manual intervention to set aside the most disordered parts; it was also mixed in with his brother's voice, who has somewhat similar voice characteristics, and then the automatic adaptation was used. One of the things we found when we did this, and it is an important point, is that you can't routinely use family members as a way to adapt somebody's voice for that person's communication aid, because people actually don't want to sound like their brother: they have always sounded different to their brother or sister, and that is quite an important thing, actually. When we started doing this work the choice of voice was American male or American female, which is potentially even worse if you're Scottish.
00:50:52
This is something that Junichi Yamagishi, Christophe Veaux and Simon King have been working on a lot; we are just starting the process of doing trials, and Junichi is setting up a similar project to do this in Japan.
00:51:08
The second demo is just an example of multi-genre ASR; it is just a demo I like: they simply applied the system that they built for MGB to this bit of footage, and I think it's a nice example. [video plays] The interesting thing there is that there is no real modelling done for the background sound, the foreground sound effects, the music and so on. We are really just building the system, doing the best speech activity detection we can, which is a difficult thing, and it works surprisingly well.
00:52:22
A final thing, well, two final things. One, and unfortunately I have somehow lost the audio for this, as I just discovered, is work we have been doing on oral history archives. This is people, typically elderly people, talking about, in this case, things that happened in the nineteen-forties in a particular building, and we are transcribing that and making it searchable. That is quite challenging, because there are some quite strong dialects and accents.
00:53:03
We also did this thing with people from the BBC at a news hack, which is a system combining speech recognition with geotagging and named entities. I'll switch off the audio here: basically what is going on is that as the speech is recognised you get entities coming up, and you can find out which geographical locations are related to that news story. You can also get representations of the main words and names that are used, and the main places, so it gives you multiple different ways of navigating a news story. This is actually quite a nice way of combining things.
00:54:02
So I just gave a very brief run through three demos there, to finish off with. What I wanted to say is that we have done a lot of work on adaptation of DNN acoustic models, and the take-home messages are that the techniques work reasonably well and that they are complementary. The thing you have to be very careful about when using adaptation is to make the adaptation robust, because you might be adapting quite a lot of parameters, even in a compact case, from relatively small amounts of data, and you want to make sure you have good performance in the first-pass decode. What we haven't done so much of is looking at how well these approaches work for adapting RNN models and end-to-end models. What we have tried very hard to do is to make explicit use of metadata: in particular, in the broadcast setting you get a lot of metadata about genre, and related teletext things and so on, and that has been quite hard to use more successfully than the automatic methods. And the noise conditions we have looked at specifically so far are basically those of Aurora, which is rather artificial, so we are very interested in starting to work on more realistic noise conditions, for example sports commentary and so on. So I'll end there; this is highly collaborative work from all these people in the NST project. So thanks.
00:55:41
Yes, that's a good question. The ten seconds is on the automatically segmented audio; yes, it's effectively ten seconds of speech, after you've done speech activity detection.
00:57:19
If you're thinking in terms of different languages, having some labels is very important. One of the things that we find, for example on MGB, is that you have to be quite careful about these kinds of semi-supervised approaches, because you can end up in a situation where the data selection is selecting data that already matches the model you have. So you're getting better and better on matched data, and you're just not selecting the data that is not well matched. I think what that means is that some kind of not-fully-automatic active learning is going to be required, because particularly when starting new languages, being able to have correct annotations of the data that you're not selecting can be very helpful. One of the best ways to do this is to have a working system that does something useful, so that people actually want to annotate the data, which is not always easy to do. But I think we're quite a long way from getting really accurate systems when starting to work without labels. What people are doing on unsupervised training is really interesting, but at the moment, if you want to get as accurate a system as possible, sometimes it can be better to spend a month doing annotation than a month writing a new module.
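As a toy sketch of the selection trap being described (the field names and the 0.9 threshold are illustrative, not anything from the actual MGB setup): selecting by recogniser confidence keeps the data the current model already handles, and the rejected pool, where the model is weakest, is exactly where targeted human annotation would help most.

```python
# Sketch: confidence-based selection for semi-supervised training. The kept
# set reinforces what the model already does well; the rejected set is the
# natural target for (partly manual) active-learning annotation.
def split_by_confidence(utterances, threshold=0.9):
    """utterances: list of dicts with 'hypothesis' text and a 'confidence' score."""
    kept = [u for u in utterances if u["confidence"] >= threshold]
    rejected = [u for u in utterances if u["confidence"] < threshold]
    return kept, rejected
```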
00:59:17
There are a couple of things to it. There's the very pragmatic view that we just wanted a set of parameters that captured what the whole network was learning but was quite compact. And in some sense this goes back to a kind of old-fashioned view of hidden units, where you think of the hidden units as capturing sorts of facts or features about the input, and you believe that if you've got a lot of these things, then weighting them differently for different speakers or different domains is the right way to do it. You can also, if you want, start to think about it in terms of doing a kind of basis reconstruction of the problem, where again you're directly manipulating the bases, but I don't think we have a very strong theoretical handle on that, so I don't know if that's a good answer or not. And then the other reason is that, honestly, we tried quite a lot of things and this one works pretty well.
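A minimal sketch of that hidden-unit re-weighting idea (LHUC), written here in PyTorch; the layer width and the number of speakers are illustrative assumptions, though the 2*sigmoid amplitude parameterisation is the standard published form.

```python
# Sketch of LHUC-style adaptation: each speaker gets a vector of amplitudes
# that re-weights the hidden units of a trained layer. The base network is
# frozen; only the amplitudes are learned per speaker.
import torch
import torch.nn as nn

class LHUC(nn.Module):
    def __init__(self, hidden_dim: int, num_speakers: int):
        super().__init__()
        # Initialised to zero so 2*sigmoid(0) = 1 leaves the
        # speaker-independent network unchanged before adaptation.
        self.r = nn.Parameter(torch.zeros(num_speakers, hidden_dim))

    def forward(self, hidden: torch.Tensor, speaker: int) -> torch.Tensor:
        amplitude = 2.0 * torch.sigmoid(self.r[speaker])  # amplitudes in (0, 2)
        return hidden * amplitude                         # element-wise re-weighting

lhuc = LHUC(hidden_dim=2000, num_speakers=4)
hidden_batch = torch.randn(8, 2000)        # activations from one hidden layer
adapted = lhuc(hidden_batch, speaker=2)    # speaker-dependent re-weighting
```

Seen this way, the amplitudes act like weights on a fixed set of basis functions (the hidden units), which is the "basis reconstruction" reading mentioned above.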
01:00:43
Actually, one thing I didn't show is the kind of t-SNE visualisations that people do, where they project things into two dimensions. That was sort of interesting for the Aurora data, because what LHUC really does there is reduce the variance on the noisy part. So it's clearly doing something, but it's really doing its job not so much on the speech as on the noise in that situation. When you do the t-SNE things just on the speech it's a lot less clear; you know how these visualisations are, you can come up with some story, but it's not so clear to me. But you do make a difference: if you go from fifteen to thirteen percent you're making some difference, though it's not a huge difference.
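A sketch of the kind of check being described, assuming you can dump hidden-layer activations for the same noisy utterances before and after adaptation (sklearn's t-SNE; projecting the two sets jointly and summarising each by its 2-D variance are editorial choices):

```python
# Sketch: project baseline and adapted activations for the same noisy data
# into 2-D with t-SNE and compare how spread out each cloud is.
import numpy as np
from sklearn.manifold import TSNE

def projected_variances(baseline_acts: np.ndarray, adapted_acts: np.ndarray):
    """Each array is (num_frames, hidden_dim). Returns (baseline, adapted)
    total variance of the 2-D embedding, computed from a joint projection."""
    joint = np.vstack([baseline_acts, adapted_acts])
    points = TSNE(n_components=2, random_state=0).fit_transform(joint)
    n = len(baseline_acts)
    return points[:n].var(axis=0).sum(), points[n:].var(axis=0).sum()
```

A smaller value for the adapted activations would be consistent with the "reduced variance on the noisy part" observation, although, as noted in the answer, such plots support stories rather than proofs.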
01:02:00
That's actually in the Transactions paper, which is now on the journal website, and I think we put the Aurora t-SNE plots at the end of that paper, so if you go to the last page there are a couple that you can stare at.
01:02:37
Yes, it's true that you don't get a bigger relative improvement. You might get a bigger absolute improvement at the lower error rate because the labels are better, but you don't see that directly. What you do see is that you get a better improvement if the speaker allocation is correct; that's a big thing, and that's probably the reason it works quite well. But you're right, there isn't a strong word error rate dependence, which is sort of... I mean, you can argue it both ways.
01:03:35
I think it would be an interesting thing to do, and frankly we haven't done it at all; we've only just talked about it. I was at Dimitri's thesis defence yesterday, thinking of the things he'd been doing, looking at what the filters are learning on the raw waveforms, and potentially what might be happening with the filters of the filters and things like that. Relating those to speech knowledge I think is interesting, and then applying LHUC in that situation to see what happens to the filters would be very interesting actually. So we haven't done it, but I can think of lots of things you could do. The thing that would be interesting, and again we haven't done it and your guess is as good as mine as to what the right way to do it would be, would be to start to think in terms of more specifically predicting articulatory features somewhere, seeing how that might work, and linking it to specific speaker-based things; you might get some more interpretation that way. But this is all like trying to start a PhD on this tomorrow and seeing where you are in a couple of years.
01:05:41
So you already have a system in the language. Yes, it's kind of what we do on MGB, what we call matched error rates, a sort of lightly supervised matching. We basically take the system, and if you've got some approximate transcription, then you can see how well the two match, and you can choose data based on that. In the case where we have some kind of lightly supervised material you can do that; if it's completely unsupervised then you're in the world of confidence measures or matching multiple systems, and then you get the problem I mentioned before: you start to just select things you can already do a bit, you focus in, and then how you actually keep the diversity of the other things is, I think, a really big problem. I don't see any way around actually using the same systems to predict which bits you want to get transcribed, or at least to select some parts or something. In my view, if you want to do accurate speech recognition, it's somehow hard to avoid that. If you want to do something that's interesting, you can do the zero-resource things and so on, but if you actually want to get accurate words coming out, then I think you need to actually select stuff for transcription. I don't know what your view is.
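A rough sketch of that lightly supervised matching, not the MGB recipe itself: compare each segment's ASR hypothesis against the approximate transcript (for example broadcast subtitles) and keep the segments whose word-level mismatch is low (the 20% threshold is an arbitrary illustration).

```python
# Sketch: lightly supervised data selection by matching ASR output against an
# approximate transcript such as subtitles.
def word_edit_distance(ref, hyp):
    """Levenshtein distance over word sequences."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[-1][-1]

def select_matched_segments(segments, max_mismatch=0.2):
    """segments: list of (subtitle_words, hypothesis_words) per segment."""
    return [(ref, hyp) for ref, hyp in segments
            if ref and word_edit_distance(ref, hyp) / len(ref) <= max_mismatch]
```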
01:07:44
I think that's the big problem actually, yes. The other thing that becomes interesting is if you have systems in related languages, and how you can leverage those. I guess we're going to be thinking quite hard about, for example, Russian and Ukrainian: if we have data for Russian but not very much for Ukrainian, how can you leverage that?
01:08:33
The main thing is that it's many fewer parameters, right: you get two thousand per layer. Yes, you could draw it that way if you wanted, I think. Yes, of course.
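To make the parameter comparison concrete (the network shape below is illustrative, not the system in the talk): with hidden layers of two thousand units, re-estimating the weights touches millions of parameters per layer, whereas LHUC amplitudes add only two thousand per layer.

```python
# Back-of-the-envelope parameter counts for speaker-dependent adaptation.
layers, hidden = 6, 2000
full_finetune = layers * hidden * hidden   # hidden-to-hidden weights only: 24,000,000
lhuc_amplitudes = layers * hidden          # one amplitude per hidden unit: 12,000
print(full_finetune, lhuc_amplitudes)
```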
01:08:54
Adapting biases doesn't work as well as adapting amplitudes; that's an empirical result. You can think of it as the biases just translating things around.
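The two forms being contrasted, as a tiny sketch with illustrative sizes: amplitude adaptation rescales each hidden unit's contribution, while bias adaptation only shifts (translates) the activations.

```python
# Sketch: amplitude adaptation versus bias adaptation of one hidden layer.
import torch

hidden = torch.randn(2000)                 # activations for one frame
amplitudes = 2 * torch.rand(2000)          # stand-in for learned amplitudes in (0, 2)
biases = 0.1 * torch.randn(2000)           # stand-in for learned offsets

rescaled = hidden * amplitudes             # amplitude (LHUC-style) re-weighting
shifted = hidden + biases                  # bias adaptation: a translation only
```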
01:09:33
But I think it's nearly all at the utterance level, because anything finer may be too expensive.
01:09:53
some expensive may not oh I mean it's
01:10:14
really things we do we do a lot of
01:10:15
things for the features because they
01:10:17
feel like as an LDA transform as well.
01:10:20
Um so they can be quite heavily well
