Transcriptions

Note: this content has been automatically generated.
00:00:00
okay, so the working title of my
00:00:05
research is currently "Interpretable speech pathology detection".
00:00:10
I structured it into a short description of
00:00:14
speech pathology, the previous work on sleepiness estimation,
00:00:19
and current and future work. The way we are
00:00:24
structuring the research is by considering that
00:00:29
speech pathology and related phenomena can be divided into problems of
00:00:36
phonation, articulation, and language-level problems. But
00:00:43
obviously, intelligibility is the
00:00:49
combination of phonation, articulation, and prosody, which
00:00:53
is why we want to develop a framework
00:00:56
to analyse and detect them first individually,
00:01:00
and preferably, in the end, a combination of them.
00:01:04
the work on sleepiness estimation used
00:01:07
the data from the Interspeech ComParE challenge of this year,
00:01:12
which we did not submit to the challenge, but we
00:01:16
got the license and are still somewhat working on it.
00:01:20
The hypothesis is that
00:01:27
sleepiness leads to a monotonic and less crisp pronunciation.
00:01:32
The data is recordings of reading and speaking,
00:01:37
and it is a classification or regression
00:01:41
task with labels from
00:01:45
one to nine. We tackled it as a classification task by
00:01:52
using a raw-speech CNN framework, which has as its input
00:01:57
raw speech. Then comes the filter stage of 1-D conv layers,
00:02:03
max pooling, and ReLU activation.
00:02:07
Then comes the classification stage with fully connected layers.
00:02:13
From that we get a class posterior vector
00:02:19
per frame; we average those to get a score per utterance.
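A toy numpy sketch of that pipeline (made-up filter counts and sizes, untrained random weights, purely to illustrate averaging per-frame posteriors into one utterance score) might look like:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1d(x, kernels):
    # x: (n_samples,), kernels: (n_filters, width) -> (n_filters, n_out)
    width = kernels.shape[1]
    windows = np.lib.stride_tricks.sliding_window_view(x, width)
    return kernels @ windows.T

def max_pool(h, size):
    n = h.shape[1] // size
    return h[:, :n * size].reshape(h.shape[0], n, size).max(axis=2)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
frames = [rng.standard_normal(4000) for _ in range(5)]  # toy raw-speech frames

kernels = rng.standard_normal((8, 30)) * 0.1  # filter stage: 8 filters, width 30 samples
W_fc = rng.standard_normal((9, 8)) * 0.1      # classification stage: 9 sleepiness classes

posteriors = []
for frame in frames:
    h = relu(conv1d(frame, kernels))  # filter stage: conv + ReLU
    h = max_pool(h, h.shape[1])       # global max pooling -> (8, 1)
    posteriors.append(softmax(W_fc @ h[:, 0]))  # per-frame class posterior

utterance_posterior = np.mean(posteriors, axis=0)  # average over frames
score = int(np.argmax(utterance_posterior)) + 1    # utterance score in 1..9
```

With trained weights the averaged posterior would carry the actual sleepiness evidence; here it only demonstrates the shapes and the averaging step.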
00:02:25
now I want to highlight some of the
00:02:29
hyperparameters of the first convolutional layer,
00:02:34
which has as its input a segment,
00:02:39
which is typically something like 250 to 500
00:02:44
milliseconds. Then the conv width is quite relevant, and we
00:02:52
distinguish sub-segmental modelling, with a conv width of 2 milliseconds, which
00:02:58
is less than a pitch period, and segmental modelling,
00:03:03
with a conv width of 20 milliseconds, which is more like a few pitch periods.
00:03:10
There is also the conv shift and the number of frames, and I'll show later
00:03:14
that, depending on the conv width, different information is modelled from the raw speech.
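Assuming a 16 kHz sampling rate (not stated in the talk), those durations translate into sample counts as follows:

```python
def ms_to_samples(ms, sr=16000):
    """Convert a duration in milliseconds to a sample count at rate sr."""
    return int(round(ms * sr / 1000))

# assumed 16 kHz sampling rate
sub_segmental = ms_to_samples(2)    # 2 ms conv width  -> 32 samples (< one pitch period)
segmental = ms_to_samples(20)       # 20 ms conv width -> 320 samples (a few pitch periods)
segment_lo = ms_to_samples(250)     # 250 ms input segment -> 4000 samples
segment_hi = ms_to_samples(500)     # 500 ms input segment -> 8000 samples
```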
00:03:20
apart from the plain
00:03:26
raw-speech approach, we also tried to integrate some
00:03:30
speech production knowledge and transfer learning. What we did
00:03:34
was: we took the AMI corpus,
00:03:39
used frame-level features of the AMI corpus, ran them through
00:03:43
the Kaldi speech recogniser to get a frame-to-phone alignment,
00:03:49
and with that frame-to-phone alignment we used a phone-to-articulation mapping
00:03:56
to train the prediction of articulation classes, which
00:04:00
are: height of articulation, manner, place, and vowel.
00:04:06
So we used the phones, used the mapping, and then
00:04:12
trained, based on the alignment, the prediction of
00:04:16
articulatory features.
00:04:27
With those CNNs trained
00:04:36
for predicting the articulatory features,
00:04:41
we then copied the weights and
00:04:47
again used raw speech, now on the sleepiness data, to predict sleepiness.
00:04:53
So basically the transfer learning and speech
00:04:56
production knowledge are just for initialising four different models,
00:05:01
which are initialised by predicting
00:05:06
the four different categories of articulatory features.
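The frame-to-phone alignment plus a phone-to-articulatory-feature mapping could be sketched like this; the mapping entries and phone symbols below are illustrative placeholders, not the actual table used:

```python
# hypothetical phone -> articulatory-feature mapping (illustrative entries only)
PHONE_TO_AF = {
    "p":  {"manner": "plosive",   "place": "bilabial", "height": "consonant", "vowel": "none"},
    "s":  {"manner": "fricative", "place": "alveolar", "height": "consonant", "vowel": "none"},
    "iy": {"manner": "vowel",     "place": "front",    "height": "high",      "vowel": "iy"},
    "aa": {"manner": "vowel",     "place": "back",     "height": "low",       "vowel": "aa"},
}

def frames_to_af_targets(frame_phones, category):
    """Turn a frame-level phone alignment into per-frame articulatory targets."""
    return [PHONE_TO_AF[p][category] for p in frame_phones]

alignment = ["s", "s", "iy", "iy", "iy", "p", "aa"]  # toy frame-to-phone alignment
manner_targets = frames_to_af_targets(alignment, "manner")
```

These per-frame targets are what the four CNNs (one per category) would be trained against before the weights are copied over.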
00:05:13
now, as I said,
00:05:17
choosing the conv width matters,
00:05:22
in terms of what information is modelled.
00:05:25
We can see that, when summing up the first-layer filters
00:05:30
and looking at the frequency response of that,
00:05:33
the frequency response has a peak
00:05:37
between 1000 and 2000 Hertz, which is
00:05:44
not the source information
00:05:48
but the system, the vocal tract, information.
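Inspecting the summed first-layer filters could be done like this; the kernels here are random stand-ins for the learned ones, and 16 kHz is an assumed sampling rate:

```python
import numpy as np

def summed_filter_response(kernels, sr=16000, n_fft=1024):
    """Sum first-layer kernels and return (frequencies in Hz, magnitude response)."""
    summed = kernels.sum(axis=0)                 # combine all first-layer filters
    spectrum = np.abs(np.fft.rfft(summed, n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    return freqs, spectrum

rng = np.random.default_rng(0)
kernels = rng.standard_normal((8, 320))  # stand-in for 8 learned 20 ms filters at 16 kHz
freqs, mag = summed_filter_response(kernels)
peak_hz = freqs[int(np.argmax(mag))]     # with the trained filters this peaked at 1-2 kHz
```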
00:05:59
okay, so the results. Compared to the baseline
00:06:07
results, I just listed two of them: ComParE features and bag-of-audio-words
00:06:12
gave a 0.25 Spearman
00:06:16
correlation on the dev set and 0.3 on the test set,
00:06:22
and our models: the vanilla approach gave
00:06:29
similar performance on the dev set, 0.28.
00:06:35
With the test results here, we messed up the protocol, so we were only training on
00:06:40
the train data and not on the train and the development data; this is why these are bad.
00:06:46
We changed that here: when initialising the models with the articulatory
00:06:51
features and then predicting sleepiness, you get something like 0.29
00:06:58
on the dev set, and similar performance on the test set.
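The evaluation metric here, Spearman rank correlation, can be computed with plain numpy; this minimal version ignores tie correction, which is fine for illustration:

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation (no tie correction, for illustration)."""
    rx = np.argsort(np.argsort(x)).astype(float)  # ranks of x
    ry = np.argsort(np.argsort(y)).astype(float)  # ranks of y
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

true_labels = [3, 1, 9, 5, 7]   # toy per-utterance sleepiness labels (1..9)
predictions = [4, 2, 8, 3, 6]   # toy model scores
rho = spearman(predictions, true_labels)
```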
00:07:04
and we also tried fusing the
00:07:07
posterior vectors with another multilayer perceptron and
00:07:12
got a little bit of an improvement when throwing in either
00:07:16
different categories or even the baseline system. Our observation was that,
00:07:23
say, merging the manner CNN
00:07:28
and the segmental CNN and the baseline
00:07:32
gives better performance, indicating that there is different information
00:07:36
in the posterior vectors; so that gave some improvement.
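Late fusion of that kind could be sketched as concatenating the systems' posterior vectors and scoring them with a small MLP; weights and dimensions below are random stand-ins for the trained fusion network:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def fuse(posterior_vectors, W1, W2):
    """Late fusion: concatenate per-system posteriors, score with a small MLP."""
    z = np.concatenate(posterior_vectors)
    h = relu(W1 @ z)
    return W2 @ h  # scores for the 9 sleepiness classes

rng = np.random.default_rng(0)
manner_post = rng.dirichlet(np.ones(9))     # toy posterior from the manner CNN
segmental_post = rng.dirichlet(np.ones(9))  # toy posterior from the segmental CNN
baseline_post = rng.dirichlet(np.ones(9))   # toy posterior from the baseline system

W1 = rng.standard_normal((16, 27)) * 0.1    # untrained stand-in fusion weights
W2 = rng.standard_normal((9, 16)) * 0.1
scores = fuse([manner_post, segmental_post, baseline_post], W1, W2)
```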
00:07:41
okay, now the current work.
00:07:46
These articulatory feature categories were earlier
00:07:52
just used for initialising our CNNs; we now want to use them to
00:07:58
assess articulation, so I just listed them here to give you an idea.
00:08:04
We have about ten values per category,
00:08:11
except for the vowel class, which is basically a
00:08:15
one-to-one mapping from the vowels of the phone set of
00:08:18
the language to the vowels, plus four consonants.
00:08:23
So basically, when we look at the posteriors of, say, the
00:08:30
place-of-articulation values, we would look at,
00:08:35
say, labial or dental, to get some information about whether
00:08:40
there is some labial or dental sound in the speech segment.
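Reading the evidence for one place class off a segment's posteriors could be as simple as this; the class list and the posterior matrix are toy stand-ins:

```python
import numpy as np

PLACE_CLASSES = ["labial", "dental", "alveolar", "velar"]  # hypothetical subset

def class_evidence(posteriors, cls):
    """Average posterior mass of one place class over all frames of a segment."""
    idx = PLACE_CLASSES.index(cls)
    return float(posteriors[:, idx].mean())

# toy per-frame place posteriors: 3 frames x 4 classes, each row sums to 1
post = np.array([[0.7, 0.1, 0.1, 0.1],
                 [0.6, 0.2, 0.1, 0.1],
                 [0.1, 0.1, 0.7, 0.1]])
labial_evidence = class_evidence(post, "labial")  # high value -> likely a labial sound present
```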
00:08:46
so how do we use these posteriors?
00:08:51
We are currently working on some data
00:08:54
from the Medical University of Vienna, which is
00:08:58
from after lip surgery, a hyaluronic acid filling, which is a plastic surgery,
00:09:06
and we use the
00:09:09
posteriors of the articulatory features and
00:09:15
feed them into a dynamic time warping to get a distance.
00:09:20
Now, we tried this per utterance, but
00:09:24
that did not give very good results, so now we are
00:09:28
investigating whether we can get an alignment of that
00:09:31
data and analyse only sub-word units, because we are
00:09:37
hypothesising that certain parts of
00:09:40
the speech are more relevant than some other sounds.
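Comparing two posterior sequences with dynamic time warping could be sketched in plain numpy, with Euclidean distance as the local cost (the actual local cost used in the work is not specified here):

```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between two sequences of posterior vectors (frames x classes)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # local Euclidean cost
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

x = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])  # toy posterior sequence
y = np.array([[1.0, 0.0], [0.0, 1.0]])              # same content, different tempo
d = dtw_distance(x, y)  # identical content absorbed by warping -> distance 0
```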
00:09:48
to the future work; let's read this bottom-up.
00:09:54
We want to continue on learning acoustic-level information
00:09:57
using CNNs, such as phonation and articulation.
00:10:01
We will also continue on articulation assessment with the
00:10:06
features that I presented, but will probably also do some work on acoustic landmarks,
00:10:12
and we will throw in all this knowledge to also get some
00:10:19
prosody analysis on atypical speech, and pause pattern analysis.
00:10:28
Thank you.

ESR03 : Interpretable speech pathology detection
Julian Fritsch
Sept. 4, 2019 · 2:30 p.m.