Transcriptions

Note: this content has been automatically generated.
00:00:03
Okay. My name is Peter Steiner, I am a PhD student at
00:00:08
TU Dresden, and today I want to talk about speech synthesis.
00:00:13
I think you didn't have any lecture about speech synthesis so far; because
00:00:17
of that, I want to start with a small introduction to speech synthesis.
00:00:22
I want to start from the past, maybe from the eighteenth century, but then I will
00:00:27
quickly come to the current state of the art. I will talk about
00:00:33
the text preprocessing, because text-to-speech synthesis actually consists of two parts,
00:00:38
and then later I will come to different systems of speech synthesis. Because articulatory speech
00:00:44
synthesis might be interesting for you, I will have one special part about this
00:00:49
topic, and I will prepare you for the workshop in the afternoon, where we
00:00:54
use the VocalTractLab software that Professor Birkholz developed.
00:01:02
Okay, why should we use speech synthesis, and why is this important for you? Just
00:01:07
a small motivation: spoken language is used in many parts of our life.
00:01:13
For example, if we sit in a car, it is dangerous if we look at a
00:01:17
map at the same time, so it might be useful if
00:01:20
we have, for example, a mobile device that can tell us the way automatically. Then,
00:01:27
communication in an uncomfortable environment might also be difficult
00:01:32
without speech synthesis, because it is often very loud, so
00:01:37
maybe it is difficult to talk even if we sit next to each other, and we cannot understand each other anymore
00:01:42
because it is so loud. Then we can, for example, use a speech synthesis system which at first
00:01:49
recognises the articulatory positions, just as in normal speech, and
00:01:53
then, as the next step, the person we
00:01:56
want to talk to has headphones, and
00:02:00
these gestures will be re-synthesised using speech synthesis.
00:02:05
There are other applications: perhaps we don't have the space to read
00:02:11
a textbook, but we need some instructions, so there, too, we can use speech synthesis.
00:02:16
Another good thing: many people in the world now have a smartphone which
00:02:21
can be controlled using spoken language, and the voice assistant typically uses speech synthesis.
00:02:29
Spoken language includes more information: it also includes information about emotions.
00:02:34
They are mostly encoded by intonation, for example by
00:02:38
F0 changes, and, for example, the place
00:02:42
of articulation is also conveyed in spoken language.
00:02:47
We can also have different voice qualities, so modal
00:02:50
or breathy voice; we already talked about whispering, for example,
00:02:54
and this is also important for speech synthesis. And for us here
00:02:59
it is interesting that speech synthesis can help to understand and treat diseases.
00:03:04
For example, in a current project at TU Dresden we try to help stroke
00:03:11
patients to recover, because we can try to
00:03:17
tell them how they should, for example, move the tongue to articulate correctly.
00:03:26
Okay. In general, speech synthesis is defined as the
00:03:30
process in which a machine generates speech.
00:03:35
Actually, the very first approach was by Wolfgang von Kempelen.
00:03:41
It was a purely mechanical talking machine: here, a reed excited a leather resonator
00:03:47
to produce different vowels, and it had several auxiliary
00:03:55
specific tools to produce, for example, consonants.
00:04:00
It is quite an interesting system, and there are rebuilt machines; we have
00:04:04
one interesting one, and I think Saarbrücken also has one famous one.
00:04:08
Basically, the air stream is provided and produced by the bellows, so you need to push it, and then
00:04:15
air flows into the system, and then we have a reed here.
00:04:21
This reed is basically the glottis: it vibrates quasi-periodically, and this is actually the
00:04:26
source for vowels. So then we have this excitation signal that we saw in the
00:04:31
previous lectures today, and we have a leather resonator that we can use to produce
00:04:39
one specific vowel, for example. We can press it, and then its shape
00:04:44
will be changed, and this is the way we can produce the different vowels.
00:04:50
As you can see here, there are additional auxiliary tools; for example, here
00:04:55
we can also produce fricatives. It actually was
00:04:59
possible to produce many sounds with this box.
00:05:06
Okay, now I want to come to the electric devices. In the nineteen
00:05:10
fifties, a technique called Pattern Playback was invented.
00:05:16
This is basically the conversion of a picture to sound: you can see the spectrogram as a
00:05:22
picture, and then, with an electrical system,
00:05:26
this picture was converted back to sound.
00:05:30
Basically, light was projected onto this
00:05:34
spectrogram, the modulated light was detected,
00:05:38
the signal was sent to an amplifier, and we got speech out of that.
00:05:50
If we go a bit more into the signal processing direction, one
00:05:55
of the first approaches which really used formants was
00:06:00
formant synthesis, starting with the first electrical formant synthesiser.
00:06:06
Its inventor already had knowledge about the glottal source,
00:06:10
so he could directly model it using electric
00:06:13
circuits, and behind that he also had some filters, so it was possible to
00:06:20
synthesise formants with this device. And you can see here it has some controls,
00:06:26
so it is possible to move these controls around, changing the formant frequencies
00:06:32
dynamically; so it is possible, for example, to also produce combinations of several vowels.
00:06:43
Nowadays we have computer-based speech synthesis. This actually started with the
00:06:48
KlattTalk system; this was a purely rule-based formant synthesiser,
00:06:54
and here Klatt also used the glottal source signal, but now everything is
00:07:00
computational: the glottal source signal that you saw before is now mathematically
00:07:05
described and then sent to a bank of filters, mainly
00:07:11
bandpass filters of the second order, and then
00:07:16
it is possible to produce even different voices, so
00:07:21
this is a mostly speaker-independent
00:07:24
synthesis system.
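To make this filter-bank idea concrete, here is a minimal Python sketch of a cascade of second-order resonators excited by a crude pulse train. This is not the system from the talk: the formant frequencies, bandwidths and the impulse-train source are illustrative assumptions (a real synthesiser would use a proper mathematical glottal pulse model), but the recursion is the standard two-pole resonator used in Klatt-style formant synthesis.

import numpy as np

def resonator(x, f, bw, fs):
    # Two-pole digital resonator (second-order bandpass section):
    #   y[n] = A*x[n] + B*y[n-1] + C*y[n-2]
    r = np.exp(-np.pi * bw / fs)               # pole radius from the bandwidth
    B = 2.0 * r * np.cos(2.0 * np.pi * f / fs)
    C = -r * r
    A = 1.0 - B - C                            # normalises the gain at DC
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = A * x[n]
        if n >= 1:
            y[n] += B * y[n - 1]
        if n >= 2:
            y[n] += C * y[n - 2]
    return y

fs = 16000                                     # sample rate in Hz
f0 = 110.0                                     # fundamental frequency in Hz
n = int(0.5 * fs)                              # half a second of audio
# Crude excitation: one impulse per glottal period
source = (np.arange(n) % int(fs / f0) == 0).astype(float)
# Cascade of bandpass sections, one per formant (illustrative values)
signal = source
for f, bw in [(500.0, 60.0), (1500.0, 90.0), (2500.0, 120.0)]:
    signal = resonator(signal, f, bw, fs)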
00:07:31
When we were able to collect more data, other approaches became possible; this led us to data-driven speech synthesis.
00:07:37
Here we use a database of recorded speech:
00:07:42
we can split the speech into
00:07:46
diphones, for example, or into other units, and then
00:07:50
the synthesis is actually only
00:07:55
combining these diphones or these units back into a new utterance. Around the year two thousand,
00:08:03
large databases became available, and then it was possible to use HMM-
00:08:08
based statistical signal processing for speech synthesis. That means we actually
00:08:15
went back a bit towards the KlattTalk system: now we again have a source-filter model,
00:08:20
and we have information about how the source really works and how the filter really works,
00:08:26
because we have a lot of speech data which we can analyse, unlike before.
00:08:32
And since two thousand and sixteen, when WaveNet was introduced, the
00:08:37
state of the art uses more and more neural networks for speech synthesis.
00:08:44
Okay, now I want to come to the text-to-speech problem. Actually we have
00:08:48
a pipeline like this: we start with a text input, and we
00:08:52
have many steps of text processing. For example, here we have
00:08:57
segmentation; segmentation means that we need to
00:09:03
segment the text, for example into words or into other structures. Then
00:09:08
we need to expand non-words, for example numbers or abbreviations,
00:09:13
and then we need to find the parts of speech. There are many
00:09:16
sub-tasks that
00:09:22
text processing includes. These two last boxes are actually the only
00:09:27
signal processing boxes, and there we
00:09:32
can use several techniques to synthesise speech;
00:09:36
because of this, I will exclude this part, and it will come in a later talk.
00:09:44
Okay, let's go through the steps. First we have
00:09:48
the segmentation step and the text normalisation.
00:09:52
I will use this utterance; it will now be fed through all of the steps.
00:09:59
At first we will split the text into tokens; a token in this case is a single word or punctuation.
00:10:06
So we can see how the text is split up into tokens: here we have the single words,
00:10:11
and the punctuation also becomes a token. Next
00:10:15
we normalise the tokens. That means we
00:10:19
don't need capitalised letters anymore, but we need to
00:10:25
write out the numbers; we have a "2" in this case,
00:10:29
and this is quite a difficult task, because
00:10:34
at first we don't really know what this "2" represents: it
00:10:39
can be "two", it can be a time, so we need to decode it in the correct form.
00:10:44
This is often done using statistical methods,
00:10:49
for example by modelling the context around this number.
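A minimal Python sketch of these two steps, tokenisation and normalisation; the tiny lookup table and the example sentence are my own illustrations, and a real system would use a statistical model of the context to disambiguate cases like "2" as a number versus a time.

import re

NUMBERS = {"1": "one", "2": "two", "3": "three"}   # toy expansion table

def tokenize(text):
    # A token is a single word or a single punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

def normalize(tokens):
    out = []
    for tok in tokens:
        if tok in NUMBERS:
            out.append(NUMBERS[tok])   # expand the number into words
        else:
            out.append(tok.lower())    # no capitalised letters anymore
    return out

print(normalize(tokenize("We meet at 2.")))
# -> ['we', 'meet', 'at', 'two', '.']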
00:10:54
Part-of-speech tagging is the next step: basically, we need
00:10:59
to find the part of speech of all the tokens that we found before.
00:11:03
This is mostly done using a dictionary, so a text-to-speech
00:11:07
system usually has a dictionary with the underlying parts of speech,
00:11:12
because sometimes one word has even more than only one part of
00:11:17
speech, and we need to find the correct one out of the dictionary.
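As a sketch of this dictionary-based lookup, continuing the running example: the lexicon, tag names and the disambiguation rule below are toy assumptions of mine; real systems use large dictionaries plus statistical taggers to resolve ambiguous entries.

# Toy part-of-speech lexicon; ambiguous words list several candidates.
LEXICON = {
    "we":   ["PRON"],
    "meet": ["VERB", "NOUN"],   # ambiguous: "we meet" vs. "a meet"
    "at":   ["ADP"],
    "two":  ["NUM"],
}

def pos_tag(tokens):
    tags = []
    for tok in tokens:
        candidates = LEXICON.get(tok, ["NOUN"])  # unknown words default to NOUN
        if len(candidates) > 1 and tags and tags[-1] == "PRON":
            # Naive contextual rule: after a pronoun, prefer the verb reading
            tag = "VERB" if "VERB" in candidates else candidates[0]
        else:
            tag = candidates[0]
        tags.append(tag)
    return list(zip(tokens, tags))

print(pos_tag(["we", "meet", "at", "two"]))
# -> [('we', 'PRON'), ('meet', 'VERB'), ('at', 'ADP'), ('two', 'NUM')]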
00:11:25
The next big step is chunk parsing. Chunk parsing basically means
00:11:29
we want to segment our entire sentence into smaller
00:11:32
segments, and now we want to identify,
00:11:37
for example, phrase boundaries. Phrase boundaries means we have
00:11:41
prosodic features that span more than
00:11:47
a single word. Here you can see "S" denotes the entire sentence;
00:11:52
in many cases, well, usually, we can
00:11:58
say a sentence consists of one nominal phrase and one verbal phrase,
00:12:02
and now we need to split the sentence into these two phrases. If we
00:12:09
wanted to go further, then we could in theory also
00:12:13
split the verbal phrase again into a verbal phrase and into a second nominal
00:12:18
phrase, but this is getting complicated, and here we would need big trees; usually
00:12:23
this is not considered, and only this flat phrasing is used.
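A minimal sketch of such flat chunking over the part-of-speech tags from the previous step; the grouping rules are deliberately crude illustrations, not the lecture's parser. Prosodic phrase boundaries can then be placed at the edges of these chunks.

NOMINAL = {"DET", "ADJ", "NOUN", "PRON", "NUM"}

def chunk(tagged):
    # Flat (non-recursive) chunking: greedily collect nominal material
    # into NPs and label verbs as VPs; everything else passes through.
    chunks, i = [], 0
    while i < len(tagged):
        tok, tag = tagged[i]
        if tag in NOMINAL:
            phrase = []
            while i < len(tagged) and tagged[i][1] in NOMINAL:
                phrase.append(tagged[i][0])
                i += 1
            chunks.append(("NP", phrase))
        elif tag == "VERB":
            chunks.append(("VP", [tok]))
            i += 1
        else:
            chunks.append((tag, [tok]))
            i += 1
    return chunks

print(chunk([("we", "PRON"), ("meet", "VERB"), ("at", "ADP"), ("two", "NUM")]))
# -> [('NP', ['we']), ('VP', ['meet']), ('ADP', ['at']), ('NP', ['two'])]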
00:12:32
Okay, the next step, which is quite important, is the step from graphemes to phonemes. Usually,
00:12:39
text-to-speech systems try to find the phonemic transcription of a word in an online dictionary. That
00:12:45
means we have characters, but we cannot simply translate the characters directly into
00:12:52
speech; before that, we need to find the correct phonemes, because an "a" can, for example, be
00:12:58
pronounced long or short, and these are different phonemes; or, for
00:13:02
example, we have voiced and unvoiced fricatives in
00:13:06
specific contexts. Therefore we need the dictionary, where we
00:13:11
can find a word and its corresponding phonetic transcription.
00:13:15
This annotation also contains
00:13:20
information about syllable boundaries and stressed syllables in the word.
00:13:27
It is difficult if the word is not in the dictionary: then all the information has
00:13:31
to be predicted or generated separately. This is a difficult task, because we
00:13:37
need to find the syllables, we need to find the correct phonemic transcription, and
00:13:43
yes, we also need to find which syllables are stressed, because
00:13:48
it is defined that in every word there is one stressed
00:13:54
syllable, and we always need to identify it.
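Here is a minimal dictionary-lookup sketch in the style of the CMU Pronouncing Dictionary, whose phonemes carry stress digits (1 = stressed, 0 = unstressed); the entries and the letter-to-sound fallback are toy assumptions, and real out-of-vocabulary handling uses trained models for syllables, stress and phonemes.

# Toy pronunciation dictionary (CMUdict-style: stress digits on vowels)
PRON_DICT = {
    "we":   ["W", "IY1"],
    "meet": ["M", "IY1", "T"],
    "at":   ["AE1", "T"],
    "two":  ["T", "UW1"],
}

# Extremely naive letter-to-sound fallback for unknown words
LETTER_TO_SOUND = {"a": "AH0", "b": "B", "d": "D", "e": "EH0", "m": "M",
                   "o": "OW0", "s": "S", "t": "T"}

def g2p(word):
    if word in PRON_DICT:
        return PRON_DICT[word]            # the easy, reliable case
    # Out-of-vocabulary: guess one phoneme per letter (a poor stand-in
    # for real grapheme-to-phoneme prediction)
    return [LETTER_TO_SOUND.get(ch, ch.upper()) for ch in word]

print(g2p("meet"))   # -> ['M', 'IY1', 'T']
print(g2p("dome"))   # fallback guess -> ['D', 'OW0', 'M', 'EH0']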
00:13:59
If we now go back to our utterance that we want to synthesise,
00:14:05
then this utterance now looks like this: here we have the phonemes.
00:14:12
This is important: the annotation follows the TIMIT corpus, which is often used for
00:14:19
speech recognition tasks, but here, too, we have a phoneme representation. You
00:14:26
can, for example, see that this word consists of three characters,
00:14:30
but in the end it will be two phonemes, and
00:14:37
the same holds elsewhere. It is also interesting that for the "t" we have at first a closure;
00:14:41
that means we don't have any airflow, then the closure is released, and
00:14:46
the "t" will be articulated; in the end it will be a "tcl t".
00:14:52
And also the punctuation gets one token.
00:15:00
Okay, the next step is the symbolic prosody generation.
00:15:05
Right now we have all the phoneme-level
00:15:09
information that we need, and now we somehow need to
00:15:13
generate prosody, so we need to find a way to describe intonation. Basically,
00:15:18
prosody is concerned not with isolated phonetic segments, but with larger elements of speech.
00:15:25
So now we are dealing with syllables, we are dealing with words, or with phrases again.
00:15:31
The most important features for prosody are the pitch contour, which,
00:15:37
for example, determines whether we have a question or a common declarative sentence, because at the end of a question
00:15:43
the F0 will be increased, and in other cases it will decrease;
00:15:50
and the phone durations, or also the articulatory gestures,
00:15:55
so whether we talk fast or slow: all of this is important for the prosody.
00:16:01
It depends on linguistic features, it depends on phrases, and it also depends on emotions. For example, if we
00:16:08
talk in a relaxed way, we will probably talk slowly, but if we
00:16:14
are excited, then we mostly talk
00:16:19
fast, and we talk maybe with a high F0,
00:16:23
and the dynamics of the F0 can change. So there are many factors which influence the prosody.
00:16:32
Here we want to predict the symbolic prosody. This is
00:16:38
not yet the actual contour generation; it is
00:16:44
more a task that serves as
00:16:48
a preprocessing step for the actual prosody generation.
00:16:53
As you already know now, intonation depends on many different factors, so
00:17:00
often rule-based approaches for intonation are used. For example, we have ToBI.
00:17:06
ToBI defines a set of, I think in total four or five,
00:17:12
accents like this, and they can be used to produce labels
00:17:19
and commands from which we can generate the prosody later.
00:17:24
For example, here we have a high accent; that means in this case the F0 must be high.
00:17:32
This one means the F0 goes from low to high within the
00:17:36
accent, and in this one the F0 is low.
00:17:41
And typically, accented words, sentence boundaries or phrase boundaries get prosody labels, so we don't
00:17:48
go deeper down to single phonemes, but we stay on a higher
00:17:54
level. Which words get accents depends on the part of speech, it depends
00:18:00
on the type of sentence, and it depends on phrase boundaries, as I told you before.
00:18:04
For example, we differentiate whether we have a question
00:18:08
or other sentence types; all of these sentences have different intonations.
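A rough sketch of such symbolic labelling, using ToBI-flavoured symbols (H* for a high pitch accent, L-L% and H-H% for falling and rising boundary tones); the content-word rule and the tag set are simplifying assumptions of mine, not the full ToBI standard.

CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "NUM"}   # words that typically get accents

def label_prosody(tagged, is_question):
    # Assign a pitch accent to content words, nothing to function words,
    # and a boundary tone at the phrase end depending on the sentence type.
    accents = [(tok, "H*" if tag in CONTENT_TAGS else None)
               for tok, tag in tagged]
    boundary = "H-H%" if is_question else "L-L%"   # final rise vs. final fall
    return accents, boundary

print(label_prosody([("we", "PRON"), ("meet", "VERB"),
                     ("at", "ADP"), ("two", "NUM")], is_question=False))
# -> ([('we', None), ('meet', 'H*'), ('at', None), ('two', 'H*')], 'L-L%')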
00:18:17
Okay, if we apply some accents to our utterance, we come to this point. You
00:18:23
can see the F0 goes up, and
00:18:27
then it will mostly stay up; here, we have an accent
00:18:32
inside the "two": we go from low to high because we want to emphasise "two",
00:18:37
and it gets a high accent, and then the F0 decreases towards the end of the sentence, so this is a common declarative sentence.
00:18:45
Important is also, because we talked about emotions and I forgot
00:18:49
that before: imagine one sentence, "I have a car". I can
00:18:55
emphasise different words, and all of this will result
00:19:01
in a different intonation. For example, you can say "I have a car",
00:19:04
meaning I want to emphasise that it is I who has a car, but
00:19:09
the intonation would be a bit different for "I have a CAR",
00:19:12
where I want to emphasise that I have a car and nothing else, maybe.
00:19:17
So this is also something which is difficult, and
00:19:21
setting these accents correctly is quite difficult.
00:19:29
Now we can generate the signal parameters. So far it was only
00:19:35
text processing, and now we want to go more towards the signal processing.
00:19:40
We have all the symbolic information to generate signals:
00:19:45
we split the text into phonemes, and
00:19:50
we generated the prosody labels.
00:19:54
So now we can calculate the phone durations and the pitch contour.
00:19:59
The phone durations can be produced rule-based.
00:20:06
Klatt is famous for his Klatt rules; this was a model
00:20:10
where you can compute the phone duration with an equation.
00:20:16
The parameters of the equation are basically these:
00:20:20
we have a minimum duration, this
00:20:24
is how strongly a phone can be compressed, and we have the inherent duration, this is
00:20:31
the normal duration of a phone when it is stressed.
00:20:38
And here we have one free parameter, a; this will be changed using
00:20:43
several rules, and I will show you
00:20:48
the rules on the next slide, but only briefly, because there are many
00:20:51
and it is a lot to see.
00:20:58
Important is also that later this was generalised by van Santen: there is a regression model by
00:21:03
van Santen, who also calculated the phone durations, but these models were,
00:21:09
at least in the beginning, at that time, quite difficult to train, not least because
00:21:14
this also includes decision trees, so a lot of data was needed for that.
00:21:22
Okay, at first some phone durations. Here is what
00:21:27
I already began to mention: you have "beat", "bit",
00:21:30
and so on; we have different, short and long, vowels, and they all differ in their durations,
00:21:37
and basically we have a table of almost all English vowels.
00:21:45
And then the rules are more important. Here, for example, you can see
00:21:49
context information: if we are now in an unstressed
00:21:55
segment, that means the current vowel has, for example, no
00:22:01
accent, or the current consonant has no accent right now, then the phone duration will be
00:22:08
shortened; this means the free parameter a will be reduced.
00:22:13
Important: this a is adapted iteratively. This means we go rule by rule, trying
00:22:19
to find one rule that applies,
00:22:24
and then we go to the next one; a was already changed, but if
00:22:30
another rule applies, a will be modified again, until we come to
00:22:34
the final a, which gives us our actual phone duration.
00:22:42
And here are even more rules. You can see here, also within phrases:
00:22:49
if we are at certain positions in phrases, we will modify
00:22:52
a again in another way, or if we have consonant clusters.
00:22:57
In the end, everything influences our phone duration.
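The published form of Klatt's rule system computes DUR = (INHDUR - MINDUR) * PRCNT / 100 + MINDUR, where the percentage (the free parameter the talk calls a) is adjusted rule by rule. A minimal Python sketch; the table values and the two rule factors are illustrative assumptions, not Klatt's full rule set.

# Inherent and minimum durations per phone, in milliseconds
# (illustrative values; the real tables cover all English phones)
INHDUR = {"IY": 155.0, "T": 75.0, "M": 70.0}
MINDUR = {"IY": 55.0, "T": 25.0, "M": 25.0}

def klatt_duration(phone, stressed, phrase_final):
    prcnt = 100.0                 # the free parameter, adapted iteratively
    if not stressed:
        prcnt *= 0.7              # rule: unstressed segments are shortened
    if phrase_final:
        prcnt *= 1.4              # rule: phrase-final lengthening
    # ... the original model applies many more such rules in sequence ...
    return (INHDUR[phone] - MINDUR[phone]) * prcnt / 100.0 + MINDUR[phone]

print(klatt_duration("IY", stressed=True, phrase_final=False))   # 155.0 ms
print(klatt_duration("IY", stressed=False, phrase_final=True))   # 153.0 ms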
00:23:05
Now we need to predict the pitch targets, the pitch contour. There are many models; I want to
00:23:13
introduce the Fujisaki model today, because the Fujisaki model
00:23:17
is actually quite a simple model, but it works quite well.
00:23:21
The Fujisaki model wants to emulate the mechanism of F0 production
00:23:28
based on overlapping phrase and accent commands, and this is actually the important thing: we already talked about
00:23:34
phrases and accents before, so this is kind of a natural model, and it therefore mostly works quite well.
00:23:41
The phrase components are actually the impulse response of a critically damped linear system of the second
00:23:47
order; this is a low-pass filter. The parameters of this component are the onset time, so when
00:23:54
do we have a new phrase, and the amplitude of the
00:23:59
phrase command, which tells us how strongly the F0 increases at that time.
00:24:05
The other parameter is the natural frequency of the
00:24:09
filter; this tells us how fast we reach the target pitch.
00:24:14
The accent command is the response to a
00:24:20
rectangular impulse. Here the system is exactly the same, but we have one
00:24:25
more degree of freedom: now we also have the duration,
00:24:29
so how long we hold the accent, and this will also influence our pitch target.
00:24:37
Here you can see the impulse response; this is directly the equation for
00:24:43
one phrase command. And we can model
00:24:47
an accent command with two phrase commands: basically,
00:24:50
we have one phrase command, we hold it until one time, and then we have
00:24:56
a second phrase command with a negative amplitude. This means that in this way we can
00:25:01
model the accent commands with two phrase commands.
00:25:08
If we combine this, then you can see that in this
00:25:12
utterance we have three phrase commands and we have two accent commands, and
00:25:17
here, in the dashed line, you can see how this influences the F0: at first
00:25:22
for the phrases, and then you can see how we add the accents onto this
00:25:28
pitch contour, and in the end you can see that this
00:25:32
can actually be the pitch contour for a normal sentence.
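In the usual published formulation, the model works on the logarithm of F0: ln F0(t) = ln Fb + sum of Ap * Gp(t - T0) over phrases + sum of Aa * (Ga(t - T1) - Ga(t - T2)) over accents, where Gp is the impulse response and Ga the (clipped) step response of critically damped second-order systems. A minimal numpy sketch; the base frequency, command times, amplitudes and the constants alpha, beta, gamma are illustrative assumptions, not values from the talk.

import numpy as np

def Gp(t, alpha=2.0):
    # Phrase component: impulse response of a critically damped
    # second-order (low-pass) system
    return np.where(t >= 0, alpha ** 2 * t * np.exp(-alpha * t), 0.0)

def Ga(t, beta=20.0, gamma=0.9):
    # Accent component: step response of the same kind of system,
    # clipped at a ceiling gamma
    g = 1.0 - (1.0 + beta * t) * np.exp(-beta * t)
    return np.where(t >= 0, np.minimum(g, gamma), 0.0)

t = np.linspace(0.0, 3.0, 600)                 # time axis in seconds
Fb = 80.0                                      # base frequency in Hz
phrases = [(0.0, 0.5), (1.5, 0.3)]             # (onset time T0, amplitude Ap)
accents = [(0.3, 0.7, 0.4), (1.8, 2.2, 0.3)]   # (onset T1, offset T2, amplitude Aa)

ln_f0 = np.log(Fb) * np.ones_like(t)
for T0, Ap in phrases:
    ln_f0 += Ap * Gp(t - T0)
for T1, T2, Aa in accents:
    ln_f0 += Aa * (Ga(t - T1) - Ga(t - T2))
f0 = np.exp(ln_f0)                             # resulting pitch contour in Hz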
00:25:41
If we now add the information about duration and
00:25:45
F0 to our utterance, our final utterance looks like this.
00:25:50
You can see that we have several
00:25:54
phone durations here; we calculated them using the Klatt rules,
00:25:59
and now we added the pitch targets. This is important: we have
00:26:04
at first one relative time, and then we have the frequency.
00:26:09
Relative time means: at zero, so at the beginning of
00:26:13
this phone, we have one hundred and sixty-nine Hertz,
00:26:19
and here, in the middle of the second phone, we have another value in Hertz.
00:26:27
So this is a bit difficult to read at first, but what is important
00:26:31
for you for now: here we can see the entire pitch contour.
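To show how such a specification can be turned into an actual contour, here is a minimal sketch that places the (relative time, frequency) targets on an absolute time axis using the phone durations and interpolates between them; the phones, durations and all values except the 169 Hz mentioned above are made-up illustrations.

import numpy as np

# Each phone: (name, duration in seconds, [(relative_time, f0_in_hz), ...]).
# Relative time 0.0 is the start of the phone, 0.5 its middle, and so on.
phones = [
    ("w",  0.06, [(0.0, 169.0)]),   # 169 Hz at the start, as on the slide
    ("iy", 0.12, [(0.5, 150.0)]),   # illustrative mid-phone target
    ("t",  0.08, []),               # unvoiced phone: no pitch target
]

times, values = [], []
t0 = 0.0
for name, dur, targets in phones:
    for rel, f0 in targets:
        times.append(t0 + rel * dur)   # absolute position of the target
        values.append(f0)
    t0 += dur

# Linear interpolation between the targets yields the full pitch contour
grid = np.linspace(0.0, t0, 100)
contour = np.interp(grid, times, values)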
