Transcriptions

Note: this content has been automatically generated.
00:00:02
Okay, I'm taking a bit too long to set up the presentation. So, good morning everyone; I am going to review my first year. I will focus on the learning outcomes and on what I have done in this first year, so I will summarise my main research activities, including the secondment I have done at Oxford. I will also mention what additional activities I have participated in during this year, reflect on what I have learned so far with respect to the learning outcomes, and try to give some direction on what is going to happen in the future.
00:00:53
First, let me remind everyone what my project is about, because I am sure no one remembers: it is to synthesise oral cancer speech samples. From a black-box point of view, we have some pre-operative speech samples of patients recorded before an operation, and we want to simulate, that is, to synthesise, how they will sound after the operation.
00:01:24
That sounds good so far, but the problem is that there is not much oral cancer speech data available, and the oral cancer speech data we do have has high variance because of the different treatment modalities. There are also different tumour sizes in oral cancer, and different places where the tumour can occur, which all cause large differences in how the cancer patient's speech sounds after the operation.
00:01:58
There is also the additional issue that a black-box model is not ideal: there should be some systematic articulatory analysis of what is realised when someone with a particular problem is speaking, not just some sort of transformation from one acoustic scenario to another acoustic scenario.
00:02:27
So the first question that I have investigated is: can we use articulatory representations to synthesise speech? What that really means is that the idea was to train a neural network using paired articulation data and speech data, and as a first step to synthesise the speech from the articulation data.
00:02:55
Normally, if we want to synthesise speech from articulation, we provide the articulation information to a statistical model, which is a neural network or a Gaussian mixture model. In our case, I used a simple vocoder to decompose the speech into various parameters: the fundamental frequency, the MFCCs, and the band aperiodicities. I then trained a neural network to predict the MFCCs of the speech in order to synthesise the speech itself. We retain the pitch and the band aperiodicities; we do not predict those, because what we are interested in is how the articulation changes and how the articulation can be predicted in different scenarios.
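As a rough sketch of that pipeline, here is a toy version in which a linear least-squares map stands in for the neural network, mapping hypothetical articulator coordinates to vocoder spectral features; the pitch and band aperiodicities would simply be copied through to the vocoder. All dimensions and data here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical paired training data: articulator positions and the
# corresponding vocoder spectral features (e.g. MFCCs) per frame.
X = rng.normal(size=(200, 6))      # articulation: 200 frames x 6 sensor coordinates
W_true = rng.normal(size=(6, 13))
Y = X @ W_true                     # spectral features: 200 frames x 13 coefficients

# Stand-in for the neural network: a linear least-squares mapping.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# At synthesis time only the spectral envelope is predicted; the pitch (F0)
# and the band aperiodicities are retained from the reference and would be
# passed straight to the vocoder together with Y_pred.
Y_pred = X @ W
print(np.allclose(Y_pred, Y))      # True on this noiseless toy data
```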
00:03:58
That is fine for healthy speech, but how do we actually make pathological speech? That is a very valid question, and something I am still investigating. What I used as a naive approach to synthesise pathological speech was to think about how we can slow down, or limit, the velocity of the speech in the articulatory domain. For that, I simply take the time derivative of the articulatory trajectory and threshold it, which limits the velocity of the articulators in the articulation data.
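A minimal sketch of this velocity-limiting idea (my reconstruction from the description, not the actual code used): differentiate the trajectory, clip the per-frame velocity, and reintegrate:

```python
import numpy as np

def limit_articulator_velocity(traj, v_max):
    """Clip the frame-to-frame velocity of an articulatory trajectory.

    traj: (T, D) array of articulator positions per frame.
    v_max: maximum allowed absolute position change per frame.
    """
    vel = np.diff(traj, axis=0)              # time derivative: per-frame velocity
    vel = np.clip(vel, -v_max, v_max)        # threshold the velocity
    # Reintegrate from the first frame to get the slowed-down trajectory.
    return np.concatenate([traj[:1], traj[:1] + np.cumsum(vel, axis=0)])

traj = np.arange(5.0).reshape(-1, 1)          # ramp moving 1 unit per frame
slow = limit_articulator_velocity(traj, 0.5)  # now moves at most 0.5 per frame
print(slow.ravel())                           # [0.  0.5 1.  1.5 2. ]
```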
00:04:41
What was very interesting to see is that the errors add up: there is a sort of domino effect in synthesising such speech. There is the error of the model which tries to predict the speech from the articulation, and there is also the limitation of the vocoder, and together these degrade the overall speech quality.
00:05:20
An additional psychoacoustic point that I think we had not considered is that synthesised pathological speech has to be even more natural than natural speech to be perceived as pathological rather than simply as bad computer speech. If the computer-synthesised speech is not natural enough, it just sounds like a computer making errors in the production, a kind of smoothing effect, which is sometimes evident, for example, with hidden Markov models and with my naive method.
00:06:01
So such a naive model is obviously not a good model for doing this, and that was to be expected. What is surprising is that while it cannot produce a consistent pathology, it still shows, according to a speech-language pathologist, some aspects of pathological speech in some of the samples.
00:06:25
Another problem that I have continuously met during my PhD is that there is no established, published objective evaluation method for pathological speech. Even for natural speech we only have measures like the mean opinion score or the mel cepstral distortion to evaluate how good the synthesised speech is; for pathological speech, the only thing we can do is to compare it with other pathological speech. And we do not have a reference, because in pathological speech synthesis the idea is to have a generative model rather than something that compares against existing examples.
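As a reference point, here is a minimal sketch of the mel cepstral distortion just mentioned, using the common dB formulation over paired MFCC sequences with the energy coefficient excluded (the toy data is made up):

```python
import numpy as np

def mel_cepstral_distortion(c_ref, c_syn):
    """Frame-averaged mel cepstral distortion in dB (c0/energy excluded)."""
    diff = c_ref[:, 1:] - c_syn[:, 1:]
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff**2, axis=1))
    return float(per_frame.mean())

c_ref = np.zeros((5, 13))
c_syn = np.zeros((5, 13))
c_syn[:, 1] = 1.0                  # constant error in a single coefficient
print(round(mel_cepstral_distortion(c_ref, c_syn), 3))   # 6.142
```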
00:07:11
The project that I did for two months during my PhD was a secondment at Oxford Wave Research. My project was about deciding whether a given speech sample is fake, also called spoofed, or natural, which is also called bona fide in the jargon of the field. Such a system is called a spoofing countermeasure, and in order to develop it I used the ASVspoof 2019 benchmark, which is a benchmark released for this year's Interspeech conference.
00:08:00
In order to understand what is going on, it is good to know what the purpose of such a system really is: to protect automatic speaker verification against attacks, and there are really two cases of that. One type of attack is when somebody comes next to you, tries to record you while you are speaking, and plays it back into a speaker verification system, which might be, say, a self-checkout machine at the supermarket asking for your age, or rather for your identity; that would be a fairly natural scenario, but it is just an example. Another is when you are logging into a bank and have to repeat an exact sentence; for that, an attacker would use something like voice conversion or speech synthesis. The latter is called the logical access scenario and the former the physical access scenario in spoofed speech detection.
00:09:01
What I built is basically a method for detecting spoofed speech, using good combinations of existing techniques plus some of my own intuition. What we use for the audio features is either log spectrograms or constant-Q transforms, which are a really interesting kind of feature if you have never heard of them; they have mainly been used for musical signal processing. The other one is just a simple log spectrogram, not a mel spectrogram, interestingly.
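To make the simpler of those two features concrete, here is a hand-rolled log power spectrogram in plain numpy (a sketch, not the actual feature extractor; a constant-Q transform would instead use geometrically spaced frequency bins, which is why it is popular for music):

```python
import numpy as np

def log_spectrogram(x, n_fft=512, hop=128, eps=1e-10):
    """Log power spectrogram from a Hann-windowed short-time Fourier transform."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i*hop:i*hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + eps)               # shape: (n_frames, n_fft // 2 + 1)

fs = 16000
x = np.sin(2 * np.pi * 1000 * np.arange(8000) / fs)   # 0.5 s of a 1 kHz tone
S = log_spectrogram(x)
print(S.shape, S[0].argmax() * fs // 512)    # peak lands in the 1000 Hz bin
```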
00:09:48
The model I have used is a variant of the ResNet combined with a Gaussian mixture model, similar to the x-vector architecture. What is really interesting about the x-vector-style part of the architecture is that it is basically learning features, sort of like an autoencoder, if you are familiar with that: it embeds the input in a compressed space. Such embeddings can be useful for speaker verification, which is what x-vectors were designed for, but here these kinds of embeddings are used for spoof detection, and they are fed to a Gaussian mixture model.
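A toy sketch of that last scoring step (my illustration with random stand-in embeddings; a single diagonal Gaussian per class plays the role of the GMM, and a real system would use the ResNet embeddings instead):

```python
import numpy as np

def fit_gaussian(E):
    """Fit one diagonal Gaussian (a 1-component GMM) to a set of embeddings."""
    return E.mean(axis=0), E.var(axis=0) + 1e-6

def log_likelihood(e, mean, var):
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (e - mean) ** 2 / var)

rng = np.random.default_rng(1)
emb_bona  = rng.normal(0.0, 1.0, size=(500, 32))  # stand-in bona fide embeddings
emb_spoof = rng.normal(2.0, 1.0, size=(500, 32))  # stand-in spoof embeddings

g_bona, g_spoof = fit_gaussian(emb_bona), fit_gaussian(emb_spoof)

def score(e):
    """Log-likelihood ratio: positive means the model favours bona fide."""
    return log_likelihood(e, *g_bona) - log_likelihood(e, *g_spoof)

print(score(np.zeros(32)) > 0, score(np.full(32, 2.0)) < 0)   # True True
```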
00:10:37
There are some lessons from this that I think may be interesting to share. There is the widely discussed deepfake phenomenon: spoofed speech, that is, speech cloned from samples that are already available online. We know that these samples can be convincing, but they are actually not too difficult to detect, because in most cases the spoofing algorithm optimises for perceived speech quality. If you use some other feature which focuses on, for example, low-frequency bands that are not well perceived by human hearing, you can still detect that something is off. So I think one of the messages of this project is: do not panic over the apparition of these spoofing tools, because in practical applications we can still detect whether they are real speech.
00:11:43
One of the surprising things is what works as features: it turns out that phase spectra are useful features here. That is something people tend to forget about, because these days we mostly just use magnitude spectrograms, but phase spectra exist and they can be used for spoofed speech detection.
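For completeness, phase spectra of this kind are easy to extract; this sketch just takes the angle of the short-time Fourier transform, the part a magnitude spectrogram throws away (numpy-only, illustrative):

```python
import numpy as np

def phase_spectrogram(x, n_fft=512, hop=128):
    """Frame-wise phase spectra in radians, i.e. the part that a magnitude
    spectrogram discards."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i*hop:i*hop + n_fft] * window for i in range(n_frames)])
    return np.angle(np.fft.rfft(frames, axis=1))   # values in (-pi, pi]

x = np.random.default_rng(2).normal(size=4000)     # stand-in audio signal
P = phase_spectrogram(x)
print(P.shape, bool(np.all(np.abs(P) <= np.pi)))   # (28, 257) True
```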
00:12:03
Another thing is that in ASVspoof 2019 a simulated replay database was used, and many people found during the challenge that it is not really representative of actual replay attacks, so in some sense it is not a good benchmark. Also, one really interesting finding, in the middle of the deep-learning era, is that a simple Gaussian mixture model turned out to be very competitive in the challenge, so GMMs still seem to have a place in the deep-learning era.
00:12:39
I would also like to talk quickly about my other activities. I started blogging and have already written two articles, one of them about the summer school experience I had in Norway. Another one is a post with a list of the speech articles I find interesting; I do not know whether anybody else finds them interesting, but it is there for other people to see. I plan to write an article about my secondment research, an article about this event, and, if I have the time, also one about Interspeech. I also plan an article targeted at a general audience about what is really going on with these widespread voice conversion tools, because I think there is a general misunderstanding in the media about these things, and we should talk to ordinary people about what the real dangers of these voice conversion tools are and how they can protect themselves against them.
00:13:41
Other than that, I have done quite a lot of courses. I counted the ECTS credits for my entire PhD and I seem to have twenty-three of them, so I think I am progressing quite well with the studies during the PhD. This September I will also teach a speech synthesis and recognition course at the University of Amsterdam, in the BA Artificial Intelligence programme. We are also going to collect some actual electromagnetic articulography data together with speech recordings of oral cancer speakers, to better understand oral cancer speech and validate our models with it.
00:14:31
Finally, regarding my future plans: I plan to do a second secondment, sort of a follow-up of the first one, but this time focusing on translating the models into a product pipeline. An additional direction may be to study oral cancer pathologies using explainable machine learning, and otherwise to just continue my research. Thank you for listening; I am happy to answer any questions.

ESR11 : First year review
Bence Halpern
Sept. 5, 2019 · 11:20 a.m.