Transcriptions

Note: this content has been automatically generated.
00:00:00
Okay, hello everybody. I'm Enno Hermann, based at the Idiap Research Institute, and I'll present my first-year work on bridging the gap between typical and pathological speech recognition. I'll give a brief introduction of what my project is about, then show you what I've done so far, and finish with some plans for the future.

00:00:29
You've already seen talks on other topics in pathological speech processing, for example detecting and classifying speech pathologies, or building tools for therapy. My focus is on building automatic speech recognition systems that work for everyone, regardless of what their speech sounds like. That can benefit users who might also have mobility impairments, at home for example, by having automated systems like the one shown here that can perform actions controlled by people's voices.
00:01:08
As was shown briefly before, it's important for these systems to deal with perturbations that occur at different levels of the human speech production process. I'm focusing on the ones highlighted at the bottom, which relate to the acoustic level and pronunciation.

00:01:30
There is a multitude of challenges in this field: there are many different speech disorders, and they can manifest themselves with different severities. Each patient is different, and a given individual's speech can also change over time depending on treatment. All this means that there is very little data available to train systems on, which we would otherwise need for well-performing speech recognition systems.
00:02:04
So currently available commercial and open-source speech recognisers perform much worse on pathological than on typical speech, which is what I aim to address. So far my work has mostly followed two lines: first, investigating the use of models trained with the lattice-free MMI (LF-MMI) objective function, which is a state-of-the-art method in speech recognition; and second, looking at pretrained speech representations, where we train models on large corpora and then fine-tune them on our data sets.
00:02:53
So far I've worked with the TORGO corpus of dysarthric speech, which is very standard in the field. It contains about fifteen hours of speech from fifteen speakers, half of them dysarthric and half of them healthy control speakers, and it consists mostly of isolated words; only about a quarter of the utterances are sentences.
00:03:23
Because there are so few speakers, the evaluation and training procedure usually works in a leave-one-out cross-validation setup, where we train on fourteen speakers, evaluate on the remaining one, and repeat that for all fifteen speakers. All the results I'm showing are averaged across all speakers, so they don't show all the details; obviously, individual speakers' results vary a lot according to severity.
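To make the setup concrete, here is a minimal sketch of such a leave-one-speaker-out loop in Python; `train_asr` and `evaluate_wer` are hypothetical stand-ins for the actual Kaldi training and scoring steps, not code from the talk.

    # Leave-one-speaker-out cross-validation over the 15 TORGO speakers.
    # `train_asr` and `evaluate_wer` are hypothetical stand-ins for the
    # real training and scoring pipeline.
    def cross_validate(utterances, speakers, train_asr, evaluate_wer):
        wers = {}
        for held_out in speakers:
            train_set = [u for u in utterances if u["speaker"] != held_out]
            test_set = [u for u in utterances if u["speaker"] == held_out]
            model = train_asr(train_set)                    # 14 speakers
            wers[held_out] = evaluate_wer(model, test_set)  # 1 speaker
        # Averaging over speakers hides the large per-speaker variation.
        return sum(wers.values()) / len(wers), wers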
00:03:59
Comparing more old-school GMM-based models with a neural network system trained with the LF-MMI objective function, we see big improvements, in this case actually only for the dysarthric speakers. I've split the evaluation here between the isolated words and the sentences because, as seen in an earlier presentation, the problem with the sentences is that they are repeated for all speakers, so if we train the language model only on this corpus, the language model alone allows us to achieve close to zero word error rates for those speakers.
00:04:47
However, my work so far has focused on the acoustic models, so I've left the language model fixed, and we see here that there are a lot of improvements to be gained by working on the acoustic model for the dysarthric speakers.
00:05:07
If we drill down further into the errors that the system makes, we see that a lot of the errors are insertions of words. This is probably because the dysarthric speakers often speak much slower; the speech recogniser fails to compensate for that, assumes a normal speaking rate, and inserts many additional words, which is where a lot of the errors come from. With LF-MMI training we can already reduce these to some degree.
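This insertion-dominated error pattern can be quantified with a standard Levenshtein alignment between reference and hypothesis; a generic sketch, not the scoring tool used for the actual results:

    # Count substitutions, deletions, and insertions between a reference
    # and a hypothesis word sequence via Levenshtein alignment.
    def error_counts(ref, hyp):
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        subs = dels = ins = 0
        i, j = len(ref), len(hyp)
        while i > 0 or j > 0:  # backtrace to attribute each error
            if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
                subs += ref[i - 1] != hyp[j - 1]
                i, j = i - 1, j - 1
            elif i > 0 and d[i][j] == d[i - 1][j] + 1:
                dels += 1
                i -= 1
            else:
                ins += 1
                j -= 1
        wer = (subs + dels + ins) / len(ref)
        return wer, subs, dels, ins

    # Slow speech decoded at a normal rate tends to produce extra words:
    print(error_counts("switch on the light".split(),
                       "switch switch on the the light".split()))
    # -> (0.5, 0, 0, 2): all of the errors are insertions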
00:05:45
We can illustrate this further by constraining the decoding grammar for the isolated words, that is, constraining the output to be exactly one word instead of possibly generating more than one. Here the improvements for the dysarthric speakers are much larger than for the control speakers in the second column.
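A toy illustration of the idea: with a one-word grammar, decoding reduces to picking the single best vocabulary entry for the whole utterance, so insertions become impossible by construction. Here `word_log_likelihood` is a hypothetical scoring function; the real system constrains the decoding graph instead.

    # One-word decoding: hypothesise exactly one word per utterance.
    def decode_isolated_word(audio, vocabulary, word_log_likelihood):
        return max(vocabulary, key=lambda w: word_log_likelihood(audio, w))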
00:06:17
So one large area of focus will be to deal with this speaking rate variability.
00:06:25
One difference between those models is that the neural network based models in Kaldi are usually trained with speed perturbation, augmenting the training data with speed-perturbed versions of itself, which also helps with speaking rate variability. Just to control for that, we also left it out, and we still see that the LF-MMI based models perform far better than the GMM ones.
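For reference, the common 3-way recipe duplicates each utterance at 0.9x and 1.1x speed; a minimal sketch using torchaudio's sox effects (the factors are the usual defaults, not specifics from the talk):

    import torchaudio

    # 3-way speed perturbation as commonly used when training Kaldi
    # neural network models: each utterance is also seen at 0.9x and 1.1x.
    def speed_perturb(path, factors=(0.9, 1.0, 1.1)):
        waveform, sr = torchaudio.load(path)
        versions = []
        for f in factors:
            # "speed" changes the tempo; "rate" resamples back to sr.
            perturbed, _ = torchaudio.sox_effects.apply_effects_tensor(
                waveform, sr, [["speed", str(f)], ["rate", str(sr)]])
            versions.append(perturbed)
        return versions, sr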
00:07:04
One other method that has been employed in the past is to use a different frame shift for the dysarthric and the typical speakers in the MFCC feature extraction process, that is, a larger shift between frames for pathological speakers to account for their slower speaking rates. Because we saw that the LF-MMI models already handle this much better, we asked whether we could just treat all speakers uniformly and use the same frame shift we would use otherwise, so that we don't have to make additional assumptions. That is indeed the case: for the dysarthric speakers, across all these models, we don't actually see a noticeable difference.
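The frame shift in question is simply the hop between analysis windows during MFCC extraction; a sketch with torchaudio, where the 10 ms and 15 ms values are illustrative assumptions rather than numbers from the talk:

    import torchaudio

    # MFCC extraction with a configurable frame shift (hop length).
    # At 16 kHz, a 10 ms shift is 160 samples.
    def make_mfcc(sample_rate=16000, frame_shift_ms=10.0):
        hop = int(sample_rate * frame_shift_ms / 1000)
        return torchaudio.transforms.MFCC(
            sample_rate=sample_rate,
            n_mfcc=13,
            melkwargs={"n_fft": 400, "hop_length": hop, "n_mels": 23})

    mfcc_control = make_mfcc(frame_shift_ms=10.0)  # typical speakers
    mfcc_slow = make_mfcc(frame_shift_ms=15.0)     # older per-group method
    # The finding above: with LF-MMI models, one uniform shift works as well.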
00:08:03
Moving on to a different topic that I got interested in recently: using pretrained models. This is already very common in computer vision and especially natural language processing, where we take the largest data sets we have and train a very big model, and then, if we want to move to a new task where we only have little data, we take this big pretrained model, like this one here for computer vision, and just fine-tune it on the new data sets.
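The pretrain-then-fine-tune pattern looks roughly like this in PyTorch; a generic sketch with a torchvision image model standing in for the big pretrained model, and a 10-class head as an arbitrary example:

    import torch
    import torchvision

    # Take a model pretrained on a large corpus, replace the task head,
    # and train only the new head on the small target data set.
    model = torchvision.models.resnet50(pretrained=True)
    for p in model.parameters():
        p.requires_grad = False                   # freeze the backbone
    model.fc = torch.nn.Linear(model.fc.in_features, 10)  # new task head
    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    # Optionally unfreeze (parts of) the backbone later for fine-tuning.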
00:08:42
This is currently not very common for speech processing; there is no single model that everybody starts to fine-tune and adapt. But some works have been popping up recently that start to look at this, and here are three examples that are all being presented at Interspeech this year. I've started to investigate the first one and how it could be applied to pathological speech recognition as well.
00:09:17
This problem-agnostic speech encoder, or PASE, is a convolutional encoder that works on the raw waveform signal. With multiple convolutional layers it downsamples the waveform and encodes it into a higher-dimensional representation at a lower sampling rate, and it is trained in a completely unsupervised way; no labelled training data is required.
00:09:51
From this encoded representation, these rectangles here, we predict different features of the input signal that we can also obtain in an unsupervised or self-supervised manner: we predict back the original waveform, MFCC features, and different prosodic features, as well as information that we can infer about the speaker or about subsequent speech samples.
00:10:26
The intuition behind this is that, by having to predict all these different kinds of things, the features will form a generic representation of the speech that is suitable for many different tasks, not only a single task like speech recognition, whereas a pretrained large speech recognition model might be less useful if you want to work on other tasks. If you want to try this out, there is a model and code available as well.
00:11:03
These are just some results from the original work on a corpus of noisy speech, where they showed that these PASE features outperform MFCCs even without further fine-tuning, and that fine-tuning them during the training of the speech recogniser gives further improvements.
00:11:28
I have been replicating some of these results and also using the features in Kaldi with the LF-MMI objective function, to see whether that is compatible as well. One thing I saw is that I already got better results than they had just by using MFCCs with the models in Kaldi, but actually swapping those out for the PASE features gave large improvements in my case on this corpus too, even without further fine-tuning of the encoder. That is currently not possible anyway if I want to train it in Kaldi, because it is a PyTorch model, so all the experiments I show here just use the frozen features, which you extract directly.
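The freeze-and-extract pattern is straightforward; a schematic PyTorch sketch where `encoder` stands in for the pretrained PASE model loaded from the authors' released checkpoint:

    import torch

    # Extract frame-level features from a frozen pretrained encoder.
    # `encoder` stands in for the PASE model from the authors' repository.
    def extract_frozen_features(encoder, waveform):
        encoder.eval()
        for p in encoder.parameters():
            p.requires_grad = False      # the encoder is never updated
        with torch.no_grad():
            # (1, 1, samples) -> (1, feature_dim, frames)
            return encoder(waveform.view(1, 1, -1))
    # These frozen features then replace MFCCs as input to the Kaldi
    # LF-MMI acoustic model training.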
00:12:26
Then I also wanted to check this on a much more challenging corpus where the speech differs much more from the data that the encoder was trained on, namely the AMI conversational speech corpus, and here, without fine-tuning, we don't yet see any gains.
00:12:48
Similarly, for the TORGO dysarthric speech corpus, without fine-tuning there are no improvements, at least for the dysarthric speakers.
00:13:08
So the next step here will be to investigate how to fine-tune this encoder and to look at different strategies for combining pretraining and fine-tuning with these kinds of models, to see whether they could also work for pathological speech recognition and for other tasks in general, since all of this is still just starting to be developed.
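One such strategy might be discriminative learning rates, updating the pretrained encoder much more gently than the newly added task layers so that the small pathological data set does not wipe out the pretraining; a hypothetical sketch, not a method reported in the talk:

    import torch

    # Fine-tune the pretrained encoder with a much smaller learning rate
    # than the randomly initialised task head.
    def build_optimizer(encoder, head):
        return torch.optim.Adam([
            {"params": encoder.parameters(), "lr": 1e-5},  # gentle updates
            {"params": head.parameters(), "lr": 1e-3},     # faster learning
        ])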
00:13:41
In the future I also plan to work on additional corpora, like the Nemours speech corpus and the homeService corpus, to better validate the results; to continue the current experiments on speech representation learning and on learning from the raw waveform in general with different methods; and to analyse these representations, focusing especially on adaptation and on handling speaking rate variability. In the longer term I also want to look at specific pronunciation differences and how they are handled by speech recognisers. So, thank you very much, and I'm happy to answer any questions.
