Transcriptions

Note: this content has been automatically generated.
00:00:03
Okay. My name is Peter Steiner, I am a PhD student at
00:00:08
TU Dresden, and today I want to talk about speech synthesis.
00:00:13
I think you didn't have any lecture about speech synthesis so far; because
00:00:17
of that, I want to start with a small introduction to speech synthesis.
00:00:22
I want to start from the past, maybe from the eighteenth century, but then I will
00:00:27
quickly come to the current state of the art. I will talk about
00:00:33
the text preprocessing, because text-to-speech synthesis actually consists of two parts,
00:00:38
and then later I will come to different systems of speech synthesis. Because articulatory speech
00:00:44
synthesis might be interesting for you, I will have one special part about this
00:00:49
topic, and I will prepare you for the workshop in the afternoon, where we
00:00:54
use the VocalTractLab software that Professor Birkholz developed.
00:01:02
Okay, why should we use speech synthesis, and why is this important for you? Just
00:01:07
a small motivation: spoken language is used in many parts of our life.
00:01:13
For example, if we sit in a car, it is dangerous if we look at a
00:01:17
map at the same time, so it might be useful if
00:01:20
we have, for example, a mobile device that can tell us the way automatically. Then,
00:01:27
communication in an uncomfortable environment might also be difficult
00:01:32
without speech synthesis, because it is often very loud, so
00:01:37
maybe it is difficult to talk even if we sit next to each other, and we cannot understand each other anymore
00:01:42
because it is so loud. Then we can, for example, use a speech synthesis system which at first
00:01:49
recognises the articulatory positions, just as in normal speech, and
00:01:53
then, as the next step, the person we
00:01:56
want to talk to has headphones, and
00:02:00
these gestures will be re-synthesised using speech synthesis.
00:02:05
There are other applications: perhaps we don't have the space to read
00:02:11
a textbook, but we need some instructions, so there, too, we can use speech synthesis.
00:02:16
Another good thing: many people in the world now have a smartphone which
00:02:21
can be controlled using spoken language, and the voice assistant typically uses speech synthesis.
00:02:29
Spoken language includes more information: it also includes information about emotions.
00:02:34
They are mostly encoded by intonation, for example by
00:02:38
F0 changes, and, for example, the place
00:02:42
of articulation is also conveyed in spoken language.
00:02:47
We can also have different voice qualities, so modal
00:02:50
or breathy voice; we already talked about whispering, for example,
00:02:54
and this is also important for speech synthesis. And for us here
00:02:59
it is interesting that speech synthesis can help to understand and treat diseases.
00:03:04
For example, in a current project at TU Dresden we try to help stroke
00:03:11
patients to recover, because we can try to
00:03:17
tell them how they should, for example, move the tongue to articulate correctly.
00:03:26
Okay. In general, speech synthesis is defined as the
00:03:30
process in which a machine generates speech.
00:03:35
Actually, the very first approach was by Wolfgang von Kempelen.
00:03:41
It was a purely mechanical talking machine: here, a reed excited a leather resonator
00:03:47
to produce different vowels, and it had several auxiliary
00:03:55
specific tools to produce, for example, consonants.
00:04:00
It is quite an interesting system, and there are rebuilt machines; we have
00:04:04
one interesting one, and I think Saarbrücken also has one famous one.
00:04:08
Basically, the air stream is provided and produced by the bellows, so you need to push it, and then
00:04:15
air flows into the system, and then we have a reed here.
00:04:21
This reed is basically the glottis: it vibrates quasi-periodically, and this is actually the
00:04:26
source for vowels. So then we have this excitation signal that we saw in the
00:04:31
previous lectures today, and we have a leather resonator that we can use to produce
00:04:39
one specific vowel, for example. We can press it, and then its shape
00:04:44
will be changed, and this is the way we can produce the different vowels.
00:04:50
As you can see here, there are additional auxiliary tools; for example, here
00:04:55
we can also produce fricatives. It actually was
00:04:59
possible to produce many sounds with this box.
00:05:06
Okay, now I want to come to the electric devices. In the nineteen
00:05:10
fifties, a technique called Pattern Playback was invented.
00:05:16
This is basically the conversion of a picture to sound: you can see the spectrogram as a
00:05:22
picture, and then, with an electrical system,
00:05:26
this picture was converted back to sound.
00:05:30
Basically, light was projected onto this
00:05:34
spectrogram, the modulated light was detected,
00:05:38
the signal was sent to an amplifier, and we got speech out of that.
00:05:50
If we go a bit more into the signal processing direction, one
00:05:55
of the first approaches which really used formants was
00:06:00
formant synthesis, starting with the first electrical formant synthesiser.
00:06:06
Its inventor already had knowledge about the glottal source,
00:06:10
so he could directly model it using electric
00:06:13
circuits, and behind that he also had some filters, so it was possible to
00:06:20
synthesise formants with this device. And you can see here it has some controls,
00:06:26
so it is possible to move these controls around, changing the formant frequencies
00:06:32
dynamically; so it is possible, for example, to also produce combinations of several vowels.
00:06:43
Nowadays we have computer-based speech synthesis. This actually started with the
00:06:48
KlattTalk system; this was a purely rule-based formant synthesiser,
00:06:54
and here Klatt also used the glottal source signal, but now everything is
00:07:00
computational: the glottal source signal that you saw before is now mathematically
00:07:05
described and then sent to a bank of filters, mainly
00:07:11
bandpass filters of the second order, and then
00:07:16
it is possible to produce even different voices, so
00:07:21
this is a mostly speaker-independent
00:07:24
synthesis system.
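To make this filter-bank idea concrete, here is a minimal Python sketch of a cascade of second-order resonators excited by a crude pulse train. This is not the system from the talk: the formant frequencies, bandwidths and the impulse-train source are illustrative assumptions (a real synthesiser would use a proper mathematical glottal pulse model), but the recursion is the standard two-pole resonator used in Klatt-style formant synthesis.

import numpy as np

def resonator(x, f, bw, fs):
    # Two-pole digital resonator (second-order bandpass section):
    #   y[n] = A*x[n] + B*y[n-1] + C*y[n-2]
    r = np.exp(-np.pi * bw / fs)               # pole radius from the bandwidth
    B = 2.0 * r * np.cos(2.0 * np.pi * f / fs)
    C = -r * r
    A = 1.0 - B - C                            # normalises the gain at DC
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = A * x[n]
        if n >= 1:
            y[n] += B * y[n - 1]
        if n >= 2:
            y[n] += C * y[n - 2]
    return y

fs = 16000                                     # sample rate in Hz
f0 = 110.0                                     # fundamental frequency in Hz
n = int(0.5 * fs)                              # half a second of audio
# Crude excitation: one impulse per glottal period
source = (np.arange(n) % int(fs / f0) == 0).astype(float)
# Cascade of bandpass sections, one per formant (illustrative values)
signal = source
for f, bw in [(500.0, 60.0), (1500.0, 90.0), (2500.0, 120.0)]:
    signal = resonator(signal, f, bw, fs)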
00:07:31
When we were able to collect more data, other approaches became possible; this led us to data-driven speech synthesis.
00:07:37
Here we use a database of recorded speech:
00:07:42
we can split the speech into
00:07:46
diphones, for example, or into other units, and then
00:07:50
the synthesis is actually only
00:07:55
combining these diphones or these units back into a new utterance. Around the year two thousand,
00:08:03
large databases became available, and then it was possible to use HMM-
00:08:08
based statistical signal processing for speech synthesis. That means we actually
00:08:15
went back a bit towards the KlattTalk system: now we again have a source-filter model,
00:08:20
and we have information about how the source really works and how the filter really works,
00:08:26
because we have a lot of speech data which we can analyse, unlike before.
00:08:32
And since two thousand and sixteen, when WaveNet was introduced, the
00:08:37
state of the art uses more and more neural networks for speech synthesis.
00:08:44
Okay, now I want to come to the text-to-speech problem. Actually we have
00:08:48
a pipeline like this: we start with a text input, and we
00:08:52
have many steps of text processing. For example, here we have
00:08:57
segmentation; segmentation means that we need to
00:09:03
segment the text, for example into words or into other structures. Then
00:09:08
we need to expand non-words, for example numbers or abbreviations,
00:09:13
and then we need to find the parts of speech. There are many
00:09:16
sub-tasks that
00:09:22
text processing includes. These two last boxes are actually the only
00:09:27
signal processing boxes, and there we
00:09:32
can use several techniques to synthesise speech;
00:09:36
because of this, I will exclude this part, and it will come in a later talk.
00:09:44
Okay, let's go through the steps. First we have
00:09:48
the segmentation step and the text normalisation.
00:09:52
I will use this utterance; it will now be fed through all of the steps.
00:09:59
At first we will split the text into tokens; a token in this case is a single word or punctuation.
00:10:06
So we can see how the text is split up into tokens: here we have the single words,
00:10:11
and the punctuation also becomes a token. Next
00:10:15
we normalise the tokens. That means we
00:10:19
don't need capitalised letters anymore, but we need to
00:10:25
write out the numbers; we have a "2" in this case,
00:10:29
and this is quite a difficult task, because
00:10:34
at first we don't really know what this "2" represents: it
00:10:39
can be "two", it can be a time, so we need to decode it in the correct form.
00:10:44
This is often done using statistical methods,
00:10:49
for example by modelling the context around this number.
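A minimal Python sketch of these two steps, tokenisation and normalisation; the tiny lookup table and the example sentence are my own illustrations, and a real system would use a statistical model of the context to disambiguate cases like "2" as a number versus a time.

import re

NUMBERS = {"1": "one", "2": "two", "3": "three"}   # toy expansion table

def tokenize(text):
    # A token is a single word or a single punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

def normalize(tokens):
    out = []
    for tok in tokens:
        if tok in NUMBERS:
            out.append(NUMBERS[tok])   # expand the number into words
        else:
            out.append(tok.lower())    # no capitalised letters anymore
    return out

print(normalize(tokenize("We meet at 2.")))
# -> ['we', 'meet', 'at', 'two', '.']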
00:10:54
Part-of-speech tagging is the next step: basically, we need
00:10:59
to find the part of speech of all the tokens that we found before.
00:11:03
This is mostly done using a dictionary, so a text-to-speech
00:11:07
system usually has a dictionary with the underlying parts of speech,
00:11:12
because sometimes one word has even more than only one part of
00:11:17
speech, and we need to find the correct one out of the dictionary.
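As a sketch of this dictionary-based lookup, continuing the running example: the lexicon, tag names and the disambiguation rule below are toy assumptions of mine; real systems use large dictionaries plus statistical taggers to resolve ambiguous entries.

# Toy part-of-speech lexicon; ambiguous words list several candidates.
LEXICON = {
    "we":   ["PRON"],
    "meet": ["VERB", "NOUN"],   # ambiguous: "we meet" vs. "a meet"
    "at":   ["ADP"],
    "two":  ["NUM"],
}

def pos_tag(tokens):
    tags = []
    for tok in tokens:
        candidates = LEXICON.get(tok, ["NOUN"])  # unknown words default to NOUN
        if len(candidates) > 1 and tags and tags[-1] == "PRON":
            # Naive contextual rule: after a pronoun, prefer the verb reading
            tag = "VERB" if "VERB" in candidates else candidates[0]
        else:
            tag = candidates[0]
        tags.append(tag)
    return list(zip(tokens, tags))

print(pos_tag(["we", "meet", "at", "two"]))
# -> [('we', 'PRON'), ('meet', 'VERB'), ('at', 'ADP'), ('two', 'NUM')]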
00:11:25
The next big step is chunk parsing. Chunk parsing basically means
00:11:29
we want to segment our entire sentence into smaller
00:11:32
segments, and now we want to identify,
00:11:37
for example, phrase boundaries. Phrase boundaries means we have
00:11:41
prosodic features that span more than
00:11:47
a single word. Here you can see "S" denotes the entire sentence;
00:11:52
in many cases, well, usually, we can
00:11:58
say a sentence consists of one nominal phrase and one verbal phrase,
00:12:02
and now we need to split the sentence into these two phrases. If we
00:12:09
wanted to go further, then we could in theory also
00:12:13
split the verbal phrase again into a verbal phrase and into a second nominal
00:12:18
phrase, but this is getting complicated, and here we would need big trees; usually
00:12:23
this is not considered, and only this flat phrasing is used.
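A minimal sketch of such flat chunking over the part-of-speech tags from the previous step; the grouping rules are deliberately crude illustrations, not the lecture's parser. Prosodic phrase boundaries can then be placed at the edges of these chunks.

NOMINAL = {"DET", "ADJ", "NOUN", "PRON", "NUM"}

def chunk(tagged):
    # Flat (non-recursive) chunking: greedily collect nominal material
    # into NPs and label verbs as VPs; everything else passes through.
    chunks, i = [], 0
    while i < len(tagged):
        tok, tag = tagged[i]
        if tag in NOMINAL:
            phrase = []
            while i < len(tagged) and tagged[i][1] in NOMINAL:
                phrase.append(tagged[i][0])
                i += 1
            chunks.append(("NP", phrase))
        elif tag == "VERB":
            chunks.append(("VP", [tok]))
            i += 1
        else:
            chunks.append((tag, [tok]))
            i += 1
    return chunks

print(chunk([("we", "PRON"), ("meet", "VERB"), ("at", "ADP"), ("two", "NUM")]))
# -> [('NP', ['we']), ('VP', ['meet']), ('ADP', ['at']), ('NP', ['two'])]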
00:12:32
Okay, the next step, which is quite important, is the step from graphemes to phonemes. Usually,
00:12:39
text-to-speech systems try to find the phonemic transcription of a word in an online dictionary. That
00:12:45
means we have characters, but we cannot simply translate the characters directly into
00:12:52
speech; before that, we need to find the correct phonemes, because an "a" can, for example, be
00:12:58
pronounced long or short, and these are different phonemes; or, for
00:13:02
example, we have voiced and unvoiced fricatives in
00:13:06
specific contexts. Therefore we need the dictionary, where we
00:13:11
can find a word and its corresponding phonetic transcription.
00:13:15
This annotation also contains
00:13:20
information about syllable boundaries and stressed syllables in the word.
00:13:27
It is difficult if the word is not in the dictionary: then all the information has
00:13:31
to be predicted or generated separately. This is a difficult task, because we
00:13:37
need to find the syllables, we need to find the correct phonemic transcription, and
00:13:43
yes, we also need to find which syllables are stressed, because
00:13:48
it is defined that in every word there is one stressed
00:13:54
syllable, and we always need to identify it.
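Here is a minimal dictionary-lookup sketch in the style of the CMU Pronouncing Dictionary, whose phonemes carry stress digits (1 = stressed, 0 = unstressed); the entries and the letter-to-sound fallback are toy assumptions, and real out-of-vocabulary handling uses trained models for syllables, stress and phonemes.

# Toy pronunciation dictionary (CMUdict-style: stress digits on vowels)
PRON_DICT = {
    "we":   ["W", "IY1"],
    "meet": ["M", "IY1", "T"],
    "at":   ["AE1", "T"],
    "two":  ["T", "UW1"],
}

# Extremely naive letter-to-sound fallback for unknown words
LETTER_TO_SOUND = {"a": "AH0", "b": "B", "d": "D", "e": "EH0", "m": "M",
                   "o": "OW0", "s": "S", "t": "T"}

def g2p(word):
    if word in PRON_DICT:
        return PRON_DICT[word]            # the easy, reliable case
    # Out-of-vocabulary: guess one phoneme per letter (a poor stand-in
    # for real grapheme-to-phoneme prediction)
    return [LETTER_TO_SOUND.get(ch, ch.upper()) for ch in word]

print(g2p("meet"))   # -> ['M', 'IY1', 'T']
print(g2p("dome"))   # fallback guess -> ['D', 'OW0', 'M', 'EH0']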
00:13:59
If we now go back to our utterance that we want to synthesise,
00:14:05
then this utterance now looks like this: here we have the phonemes.
00:14:12
This is important: the annotation follows the TIMIT corpus, which is often used for
00:14:19
speech recognition tasks, but here, too, we have a phoneme representation. You
00:14:26
can, for example, see that this word consists of three characters,
00:14:30
but in the end it will be two phonemes, and
00:14:37
the same holds elsewhere. It is also interesting that for the "t" we have at first a closure;
00:14:41
that means we don't have any airflow, then the closure is released, and
00:14:46
the "t" will be articulated; in the end it will be a "tcl t".
00:14:52
And also the punctuation gets one token.
00:15:00
Okay, the next step is the symbolic prosody generation.
00:15:05
Right now we have all the phoneme-level
00:15:09
information that we need, and now we somehow need to
00:15:13
generate prosody, so we need to find a way to describe intonation. Basically,
00:15:18
prosody is concerned not with isolated phonetic segments, but with larger elements of speech.
00:15:25
So now we are dealing with syllables, we are dealing with words, or with phrases again.
00:15:31
The most important features for prosody are the pitch contour, which,
00:15:37
for example, determines whether we have a question or a common declarative sentence, because at the end of a question
00:15:43
the F0 will be increased, and in other cases it will decrease;
00:15:50
and the phone durations, or also the articulatory gestures,
00:15:55
so whether we talk fast or slow: all of this is important for the prosody.
00:16:01
It depends on linguistic features, it depends on phrases, and it also depends on emotions. For example, if we
00:16:08
talk in a relaxed way, we will probably talk slowly, but if we
00:16:14
are excited, then we mostly talk
00:16:19
fast, and we talk maybe with a high F0,
00:16:23
and the dynamics of the F0 can change. So there are many factors which influence the prosody.
00:16:32
Here we want to predict the symbolic prosody. This is
00:16:38
not yet the actual contour generation; it is
00:16:44
more a task that serves as
00:16:48
a preprocessing step for the actual prosody generation.
00:16:53
As you already know now, intonation depends on many different factors, so
00:17:00
often rule-based approaches for intonation are used. For example, we have ToBI.
00:17:06
ToBI defines a set of, I think in total four or five,
00:17:12
accents like this, and they can be used to produce labels
00:17:19
and commands from which we can generate the prosody later.
00:17:24
For example, here we have a high accent; that means in this case the F0 must be high.
00:17:32
This one means the F0 goes from low to high within the
00:17:36
accent, and in this one the F0 is low.
00:17:41
And typically, accented words, sentence boundaries or phrase boundaries get prosody labels, so we don't
00:17:48
go deeper down to single phonemes, but we stay on a higher
00:17:54
level. Which words get accents depends on the part of speech, it depends
00:18:00
on the type of sentence, and it depends on phrase boundaries, as I told you before.
00:18:04
For example, we differentiate whether we have a question
00:18:08
or other sentence types; all of these sentences have different intonations.
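A rough sketch of such symbolic labelling, using ToBI-flavoured symbols (H* for a high pitch accent, L-L% and H-H% for falling and rising boundary tones); the content-word rule and the tag set are simplifying assumptions of mine, not the full ToBI standard.

CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "NUM"}   # words that typically get accents

def label_prosody(tagged, is_question):
    # Assign a pitch accent to content words, nothing to function words,
    # and a boundary tone at the phrase end depending on the sentence type.
    accents = [(tok, "H*" if tag in CONTENT_TAGS else None)
               for tok, tag in tagged]
    boundary = "H-H%" if is_question else "L-L%"   # final rise vs. final fall
    return accents, boundary

print(label_prosody([("we", "PRON"), ("meet", "VERB"),
                     ("at", "ADP"), ("two", "NUM")], is_question=False))
# -> ([('we', None), ('meet', 'H*'), ('at', None), ('two', 'H*')], 'L-L%')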
00:18:17
Okay, if we apply some accents to our utterance, we come to this point. You
00:18:23
can see the F0 goes up, and
00:18:27
then it will mostly stay up; here, we have an accent
00:18:32
inside the "two": we go from low to high because we want to emphasise "two",
00:18:37
and it gets a high accent, and then the F0 decreases towards the end of the sentence, so this is a common declarative sentence.
00:18:45
Important is also, because we talked about emotions and I forgot
00:18:49
that before: imagine one sentence, "I have a car". I can
00:18:55
emphasise different words, and all of this will result
00:19:01
in a different intonation. For example, you can say "I have a car",
00:19:04
meaning I want to emphasise that it is I who has a car, but
00:19:09
the intonation would be a bit different for "I have a CAR",
00:19:12
where I want to emphasise that I have a car and nothing else, maybe.
00:19:17
So this is also something which is difficult, and
00:19:21
setting these accents correctly is quite difficult.
00:19:29
Now we can generate the signal parameters. So far it was only
00:19:35
text processing, and now we want to go more towards the signal processing.
00:19:40
We have all the symbolic information to generate signals:
00:19:45
we split the text into phonemes, and
00:19:50
we generated the prosody labels.
00:19:54
So now we can calculate the phone durations and the pitch contour.
00:19:59
The phone durations can be produced rule-based.
00:20:06
Klatt is famous for his Klatt rules; this was a model
00:20:10
where you can compute the phone duration with an equation.
00:20:16
The parameters of the equation are basically these:
00:20:20
we have a minimum duration, this
00:20:24
is how strongly a phone can be compressed, and we have the inherent duration, this is
00:20:31
the normal duration of a phone when it is stressed.
00:20:38
And here we have one free parameter, a; this will be changed using
00:20:43
several rules, and I will show you
00:20:48
the rules on the next slide, but only briefly, because there are many
00:20:51
and it is a lot to see.
00:20:58
Important is also that later this was generalised by van Santen: there is a regression model by
00:21:03
van Santen, who also calculated the phone durations, but these models were,
00:21:09
at least in the beginning, at that time, quite difficult to train, not least because
00:21:14
this also includes decision trees, so a lot of data was needed for that.
00:21:22
Okay, at first some phone durations. Here is what
00:21:27
I already began to mention: you have "beat", "bit",
00:21:30
and so on; we have different, short and long, vowels, and they all differ in their durations,
00:21:37
and basically we have a table of almost all English vowels.
00:21:45
And then the rules are more important. Here, for example, you can see
00:21:49
context information: if we are now in an unstressed
00:21:55
segment, that means the current vowel has, for example, no
00:22:01
accent, or the current consonant has no accent right now, then the phone duration will be
00:22:08
shortened; this means the free parameter a will be reduced.
00:22:13
Important: this a is adapted iteratively. This means we go rule by rule, trying
00:22:19
to find one rule that applies,
00:22:24
and then we go to the next one; a was already changed, but if
00:22:30
another rule applies, a will be modified again, until we come to
00:22:34
the final a, which gives us our actual phone duration.
00:22:42
And here are even more rules. You can see here, also within phrases:
00:22:49
if we are at certain positions in phrases, we will modify
00:22:52
a again in another way, or if we have consonant clusters.
00:22:57
In the end, everything influences our phone duration.
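The published form of Klatt's rule system computes DUR = (INHDUR - MINDUR) * PRCNT / 100 + MINDUR, where the percentage (the free parameter the talk calls a) is adjusted rule by rule. A minimal Python sketch; the table values and the two rule factors are illustrative assumptions, not Klatt's full rule set.

# Inherent and minimum durations per phone, in milliseconds
# (illustrative values; the real tables cover all English phones)
INHDUR = {"IY": 155.0, "T": 75.0, "M": 70.0}
MINDUR = {"IY": 55.0, "T": 25.0, "M": 25.0}

def klatt_duration(phone, stressed, phrase_final):
    prcnt = 100.0                 # the free parameter, adapted iteratively
    if not stressed:
        prcnt *= 0.7              # rule: unstressed segments are shortened
    if phrase_final:
        prcnt *= 1.4              # rule: phrase-final lengthening
    # ... the original model applies many more such rules in sequence ...
    return (INHDUR[phone] - MINDUR[phone]) * prcnt / 100.0 + MINDUR[phone]

print(klatt_duration("IY", stressed=True, phrase_final=False))   # 155.0 ms
print(klatt_duration("IY", stressed=False, phrase_final=True))   # 153.0 ms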
00:23:05
Now we need to predict the pitch targets, the pitch contour. There are many models; I want to
00:23:13
introduce the Fujisaki model today, because the Fujisaki model
00:23:17
is actually quite a simple model, but it works quite well.
00:23:21
The Fujisaki model wants to emulate the mechanism of F0 production
00:23:28
based on overlapping phrase and accent commands, and this is actually the important thing: we already talked about
00:23:34
phrases and accents before, so this is kind of a natural model, and it therefore mostly works quite well.
00:23:41
The phrase components are actually the impulse response of a critically damped linear system of the second
00:23:47
order; this is a low-pass filter. The parameters of this component are the onset time, so when
00:23:54
do we have a new phrase, and the amplitude of the
00:23:59
phrase command, which tells us how strongly the F0 increases at that time.
00:24:05
The other parameter is the natural frequency of the
00:24:09
filter; this tells us how fast we reach the target pitch.
00:24:14
The accent command is the response to a
00:24:20
rectangular impulse. Here the system is exactly the same, but we have one
00:24:25
more degree of freedom: now we also have the duration,
00:24:29
so how long we hold the accent, and this will also influence our pitch target.
00:24:37
Here you can see the impulse response; this is directly the equation for
00:24:43
one phrase command. And we can model
00:24:47
an accent command with two phrase commands: basically,
00:24:50
we have one phrase command, we hold it until one time, and then we have
00:24:56
a second phrase command with a negative amplitude. This means that in this way we can
00:25:01
model the accent commands with two phrase commands.
00:25:08
If we combine this, then you can see that in this
00:25:12
utterance we have three phrase commands and we have two accent commands, and
00:25:17
here, in the dashed line, you can see how this influences the F0: at first
00:25:22
for the phrases, and then you can see how we add the accents onto this
00:25:28
pitch contour, and in the end you can see that this
00:25:32
can actually be the pitch contour for a normal sentence.
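In the usual published formulation, the model works on the logarithm of F0: ln F0(t) = ln Fb + sum of Ap * Gp(t - T0) over phrases + sum of Aa * (Ga(t - T1) - Ga(t - T2)) over accents, where Gp is the impulse response and Ga the (clipped) step response of critically damped second-order systems. A minimal numpy sketch; the base frequency, command times, amplitudes and the constants alpha, beta, gamma are illustrative assumptions, not values from the talk.

import numpy as np

def Gp(t, alpha=2.0):
    # Phrase component: impulse response of a critically damped
    # second-order (low-pass) system
    return np.where(t >= 0, alpha ** 2 * t * np.exp(-alpha * t), 0.0)

def Ga(t, beta=20.0, gamma=0.9):
    # Accent component: step response of the same kind of system,
    # clipped at a ceiling gamma
    g = 1.0 - (1.0 + beta * t) * np.exp(-beta * t)
    return np.where(t >= 0, np.minimum(g, gamma), 0.0)

t = np.linspace(0.0, 3.0, 600)                 # time axis in seconds
Fb = 80.0                                      # base frequency in Hz
phrases = [(0.0, 0.5), (1.5, 0.3)]             # (onset time T0, amplitude Ap)
accents = [(0.3, 0.7, 0.4), (1.8, 2.2, 0.3)]   # (onset T1, offset T2, amplitude Aa)

ln_f0 = np.log(Fb) * np.ones_like(t)
for T0, Ap in phrases:
    ln_f0 += Ap * Gp(t - T0)
for T1, T2, Aa in accents:
    ln_f0 += Aa * (Ga(t - T1) - Ga(t - T2))
f0 = np.exp(ln_f0)                             # resulting pitch contour in Hz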
00:25:41
If we now add the information about duration and
00:25:45
F0 to our utterance, our final utterance looks like this.
00:25:50
You can see that we have several
00:25:54
phone durations here; we calculated them using the Klatt rules,
00:25:59
and now we added the pitch targets. This is important: we have
00:26:04
at first one relative time, and then we have the frequency.
00:26:09
Relative time means: at zero, so at the beginning of
00:26:13
this phone, we have one hundred and sixty-nine Hertz,
00:26:19
and here, in the middle of the second phone, we have another value in Hertz.
00:26:27
So this is a bit difficult to read at first, but what is important
00:26:31
for you for now: here we can see the entire pitch contour.
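To show how such a specification can be turned into an actual contour, here is a minimal sketch that places the (relative time, frequency) targets on an absolute time axis using the phone durations and interpolates between them; the phones, durations and all values except the 169 Hz mentioned above are made-up illustrations.

import numpy as np

# Each phone: (name, duration in seconds, [(relative_time, f0_in_hz), ...]).
# Relative time 0.0 is the start of the phone, 0.5 its middle, and so on.
phones = [
    ("w",  0.06, [(0.0, 169.0)]),   # 169 Hz at the start, as on the slide
    ("iy", 0.12, [(0.5, 150.0)]),   # illustrative mid-phone target
    ("t",  0.08, []),               # unvoiced phone: no pitch target
]

times, values = [], []
t0 = 0.0
for name, dur, targets in phones:
    for rel, f0 in targets:
        times.append(t0 + rel * dur)   # absolute position of the target
        values.append(f0)
    t0 += dur

# Linear interpolation between the targets yields the full pitch contour
grid = np.linspace(0.0, t0, 100)
contour = np.interp(grid, times, values)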
