Elmar Nöth, Erlangen-Nürnberg

Monday, 11 February 2019 · 9:18 a.m. · 48m 20s

Note: this content has been automatically generated.

00:00:00

welcome everybody, and i apologise for the weather

00:00:09

you should visit me more often than you you wanna have okay

00:00:12

what i'm going to start out with is some material

00:00:16

about the fast Fourier transform, windowing and the cepstrum, and then

00:00:22

the next speaker is going to go on and talk about LPC analysis,

00:00:27

and then Steiner will talk about synthesis

00:00:31

in general and then articulatory synthesis

00:00:34

and i think the articulatory synthesis especially can help us a lot in

00:00:39

understanding articulation and the generated sounds, so it's not only how to

00:00:45

synthesise but also how speech is generated

00:00:49

in general, and that will help us with the

00:00:53

computation of features, and i think speed on

00:01:00

nowadays very often people have an end-to-end system: speech

00:01:05

signal in, words out, although i still would like to present it in a

00:01:10

six then show manner and if well we oh we'll concentrate on thursday if

00:01:16

we have a sensor signal, then we have a preprocessing and a

00:01:19

feature extraction. so the preprocessing basically transforms

00:01:24

the signal into a transform signal

00:01:28

it's still in the same order of magnitude, and then you have the

00:01:31

feature extraction, and in the feature extraction you basically throw

00:01:36

away unnecessary information and keep all the necessary information. so if you

00:01:41

want to find out what was said,

00:01:46

to some extent you have to keep a different part of the information than for who said it or how he said it

00:01:53

but basically we reduce by an order of magnitude

00:01:57

and then we have a classification and in this case it's a single classification but of course

00:02:02

very of when speech analysis we go ashore that i'm analysis and then roll you know analysis

00:02:08

by now if we have a sequence of i think him and we're pretty much here

00:02:14

so basically what we see here, and we'll get into that, is

00:02:17

the time and frequency representation of the speech signal

00:02:21

so basically this is what hits the human ear, what

00:02:25

hits the microphone: the time signal

00:02:29

down here is the spectrogram, which shows

00:02:34

at this moment which frequencies are present

00:02:37

this is the fundamental frequency,

00:02:41

if i go up or down with my voice and this is the energy

00:02:46

and so we have this speech signal, and basically

00:02:50

we go in the first step from here to here

00:02:53

so why do we do that? well, mainly because it's a

00:02:57

frequency analysis. now if we look at the discrete Fourier transform,

00:03:01

basically we see that we have a discrete signal. we had the

00:03:06

analogue signal but it is transformed into a discrete signal so

00:03:11

we will look at a finite signal. we can't look at the complete signal as a whole, so we cut it

00:03:17

short. now if you want to characterise the frequencies that are present in this discrete

00:03:24

finite time signal, we know that this will give us an

00:03:32

infinite,

00:03:33

continuous and periodic frequency representation

00:03:40

if we now assume that our finite signal is extended towards infinity

00:03:47

periodically, so we just copy that signal over and over,

00:03:52

then we have the same representation, which is discrete now

00:03:58

and is periodic, and it's also infinite, but for all the information that

00:04:04

is present in it, it suffices to look at this part

00:04:08

so if we have the spectrum, that is the Fourier transform of a time

00:04:15

signal, and if the time signal is N-periodic, that is, if for a certain N

00:04:23

the value at j plus N is the same as the value at j,

00:04:29

that means that all other spectral components disappear. the resulting line spectrum is computed with the Fourier

00:04:35

spectrum, and we can recover the complete signal by the inverse Fourier transform

00:04:44

and if it's not periodic, then we just treat it as if it were

00:04:51

so the Fourier coefficient describes

00:04:55

the spectral density at this frequency

00:05:00

nominal resolution: i just put that down so that when you look at it later, i'm going to

00:05:05

upload them, you can look at the slides. so let's not go through all these

00:05:10

formulas. this one is really important, because when we look at the spectrogram

00:05:17

we only see half of it, because it's periodic and it's

00:05:20

symmetric, so instead of going from minus

00:05:26

sampling frequency by two to plus sampling frequency by two, we go from zero to plus sampling frequency by two, so

00:05:33

oh sorry

00:05:35

so if you look at the spectrum, think about this being copied into the negative domain, okay

00:05:45

uh_huh

00:05:46

um

00:05:51

so we know that the spectrum is symmetric and we only

00:05:57

look at one half of it. and to compute the DFT

00:06:03

we need N squared complex multiplications, and

00:06:06

for the fast Fourier transform

00:06:11

i only need N times log N, and if we use the fact that we have a real

00:06:17

input signal, we can even get down to log N minus one, so we save one step
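
A minimal sketch of the two points just made, the symmetry of the spectrum for a real input and the N squared versus N times log N operation counts. This is not from the lecture: the helper name dft and the test signal are assumptions chosen for illustration, and the counts are orders of magnitude only.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform: N*N complex multiplications."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

# A real test signal (hypothetical values, for illustration only).
signal = [math.sin(2 * math.pi * 3 * t / 16) + 0.5 * math.cos(2 * math.pi * 5 * t / 16)
          for t in range(16)]
spectrum = dft(signal)

# For a real input, bin N-k is the complex conjugate of bin k,
# so the upper half of the spectrum carries no new information.
n = len(signal)
symmetric = all(abs(spectrum[n - k] - spectrum[k].conjugate()) < 1e-9
                for k in range(1, n))

naive_multiplications = n * n                  # N squared
fft_multiplications = n * int(math.log2(n))    # N times log N (order of magnitude)
```

This is why a spectrogram only needs to display the bins from zero up to half the sampling frequency.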

00:06:24

and the sampling theorem: this

00:06:28

is our continuous time signal now we

00:06:33

discretise it, we sample the

00:06:37

continuous signal with a certain

00:06:42

sampling frequency and we quantise. typically people

00:06:49

don't talk that much about it, but this is

00:06:52

also analogue: not only that one, but this is analogue too, and we

00:06:55

typically quantise it with sixteen bits, which i indicated here

00:07:02

and the sampling theorem says that if the signal is

00:07:06

band-limited, that is, there are no frequencies above a certain

00:07:12

limit, then it suffices to sample it with twice that

00:07:18

frequency. if we know there are no more frequencies in the signal beyond

00:07:25

eight kilohertz, we can sample with sixteen kilohertz and we recover it

00:07:30

what the sampling theorem does not mention is that

00:07:34

you assume that there is no error in quantisation

00:07:38

that's why, when you look at CD quality,

00:07:42

you assume the human ear goes not further than twenty kilohertz

00:07:47

which may hold for most of you guys; for me it definitely does not

00:07:55

but that's why the sampling frequency is forty-four kilohertz on the CD, so you have

00:08:01

twenty times two plus something, and that mostly covers the quantisation

00:08:06

error that we make here

00:08:11

so basically the sampling theorem says that we can recover,

00:08:16

with this interpolation, we can recover that signal without loss
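
A small sketch of why the band limit matters, using the sixteen kilohertz example from above. It is an illustration, not part of the lecture: the function names and the two test frequencies (7000 Hz and 9000 Hz) are assumptions. A tone above half the sampling frequency produces exactly the same samples as a tone below it, so it could not be recovered.

```python
import math

SAMPLE_RATE = 16000  # Hz, as in the lecture's example

def nyquist_limit(sample_rate):
    """Highest frequency that can be represented at this sampling rate."""
    return sample_rate / 2

def sample_cosine(freq_hz, n_samples, sample_rate=SAMPLE_RATE):
    """Sample a cosine of the given frequency at the given rate."""
    return [math.cos(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]

below = sample_cosine(7000, 32)   # inside the band limit of 8000 Hz
above = sample_cosine(9000, 32)   # outside the band limit

# The 9000 Hz tone is indistinguishable from the 7000 Hz tone after sampling.
aliased = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(below, above))
```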

00:08:22

so you can represent the same signal either in the time domain or in the frequency domain. now

00:08:29

we heard that before, we saw that before; so here is the signal

00:08:36

so even if you don't know German you can say

00:08:41

yeah, that's a girl who says something in German. now let me

00:08:49

look at this representation, and it's the same signal

00:08:56

but notice that this is different from that

00:09:01

so what's happening there? and i can even go further and i still have the same signal

00:09:09

but i have a

00:09:13

so basically i look at these three frequency representations

00:09:20

and they represent the same time signal. well, what we do here is

00:09:24

we look at what i said before: we take this finite signal,

00:09:29

so we cut out some part and we take the frequency representation of it. now basically what

00:09:35

we do here is we take a time signal of four or

00:09:40

sixteen or two hundred fifty-six milliseconds, we cut that out and we think: okay, from

00:09:45

now on this is infinite, and then we get the frequencies, and we display them here

00:09:52

okay, so we take that time signal, we make the Fourier transform, we

00:09:58

have a spectrum and this is the part of the complete picture

00:10:03

and of course if you take a different window length you get a different representation

00:10:09

so we cut out that signal. now there are two important things. the one thing:

00:10:16

notice here and notice here, there is a discontinuity

00:10:23

if we really think we're gonna take that and go into infinity

00:10:30

so we introduce a discontinuity here

00:10:35

the other thing is, if you look really closely, this is a representation of sample

00:10:41

and hold: each one of those little steps is one sixteen-thousandth of a second

00:10:49

you sample and you hold that for one sixteen-thousandth of a second. so basically if you look at

00:10:55

this representation you see a clear periodicity: this one shows up here again and it shows up here again

00:11:02

so this here is the fundamental period, the time between

00:11:07

two openings and closings of the vocal folds

00:11:11

and it looks like it's symmetric, but if you take it exactly it's not; this

00:11:16

one looks a little bit different from that one, so it's almost periodic but it's not

00:11:22

but this

00:11:25

fundamental period has to show up somehow in the spectrogram if we take enough of this

00:11:31

signal, because then we have this quasi-periodicity. we'll see that soon

00:11:37

so we take that and then we make that

00:11:40

Fourier transform of this short time segment, and if

00:11:45

you now look at this from the top

00:11:49

and you code this axis

00:11:55

by colour, and you turn it around, zero to eight,

00:12:03

zero to eight, turn it and flip it, and you look at it so that the colour is the intensity here,

00:12:12

this is the time, this is the frequency. that's it, that's how we get the spectrogram, okay

00:12:19

now here are the spectrograms, the different ones i just showed

00:12:28

what we see here: this is very short and we see no

00:12:32

yeah, the form seems to be the same

00:12:37

except for the use the things here in your well this thing

00:12:42

says it's also the frequencies that are present except in the time signal

00:12:48

this girl speaks with about two hundred and seventy hertz,

00:12:53

so one fundamental period, one thousand milliseconds divided by

00:12:59

two hundred and seventy hertz, means about four milliseconds per period

00:13:05

so if the time segment is that short,

00:13:13

if it is less than or equal to one of those fundamental periods, if it holds less than a fundamental period,

00:13:21

then you will not see any periodicity in the spectrum
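
The arithmetic behind this can be sketched in a few lines of Python. The function names are assumptions made for illustration; the 270 hertz value and the window lengths of 4, 16 and 256 milliseconds are the ones mentioned in the lecture.

```python
def fundamental_period_ms(f0_hz):
    """Duration of one fundamental period in milliseconds."""
    return 1000.0 / f0_hz

def periods_in_window(window_ms, f0_hz):
    """How many fundamental periods fit into an analysis window."""
    return window_ms / fundamental_period_ms(f0_hz)

# Print how many periods of a 270 Hz voice each window length covers.
for window_ms in (4, 16, 256):
    count = periods_in_window(window_ms, 270)
    print(window_ms, "ms window:", round(count, 1), "periods")
```

A window that covers only about one period cannot show harmonic structure, because there is no repetition inside it to measure.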

00:13:27

this one has on average about two fundamental periods, this one has about three to four

00:13:34

and so you see the fundamental frequency here, and if it were

00:13:38

truly periodic the spectrum would be a line spectrum with the fundamental

00:13:43

frequency, twice, three times, four times: the harmonics. that's what we

00:13:48

see here: the fundamental, one, two, three,

00:13:52

first, second, third and so on harmonic. so here we don't see any fundamental period, here we see

00:14:00

a little bit one two three four here it's much stronger one two three four and so on

00:14:09

and here the fundamental period is even more

00:14:15

distinct, but the higher harmonics

00:14:20

are, like, ten times the fundamental frequency; with a fundamental frequency of

00:14:26

two seventy, ten times is two thousand seven hundred. if i

00:14:31

have a quarter of a second and the periodicity

00:14:35

changes a little bit around two hundred seventy, then i still

00:14:39

see the fundamental period but ten times it's what about

00:14:44

see the difference: this one has, let's say, five or

00:14:50

six fundamental periods and this one maybe has sixty fundamental periods, and if those

00:14:56

differ too much, if it goes from two seventy to two seventy-four to two seventy-five,

00:15:01

then the other one, the one up here, goes to two thousand seven hundred fifty,

00:15:05

so we have a fifty hertz difference here for a five hertz step, and so the fundamental

00:15:10

period shows up here but not really anymore around here. so you have to,

00:15:17

depending on the analysis window size that you take,

00:15:21

you either stress the global information

00:15:26

or a little bit more the exact frequencies. on the other hand,

00:15:34

the shorter you are, the better your time resolution,

00:15:38

and the longer you are, the worse your time resolution is

00:15:42

so here we have the time-frequency representations: this one is very good in frequency resolution,

00:15:48

this one is very bad in frequency resolution but very good in time resolution, that one is very

00:15:53

bad in time. so here is another speech signal. it looks more like this one, right,

00:16:04

more like this one than like this representation, but in fact it isn't

00:16:11

uh_huh

00:16:13

it's also a sixteen millisecond window. this is a four

00:16:17

millisecond window this is the sixteen millisecond window

00:16:21

what's the difference

00:16:24

mm

00:16:29

uh_huh it's a male voice

00:16:33

a male voice with a lower fundamental frequency. i heard it: it's me, so

00:16:39

i speak with about a hundred and ten, hundred twenty hertz; the girl

00:16:43

speaks with like two hundred seventy hertz. so for the same analysis window,

00:16:50

if i take a four millisecond window when my fundamental period is ten milliseconds, then i have forty percent of one period

00:16:57

that's why for my voice there is no fundamental visible if i take

00:17:03

the same analysis window as i would for a high-pitched voice. so,

00:17:09

short-term analysis is what we're going to look at. so what do we want?

00:17:12

a description of the spectral composition of the speech signal

00:17:17

the problem: the discrete Fourier transform is only meaningful for stationary signals, and

00:17:22

a speech signal is by definition not stationary; it changes over time

00:17:27

the observation: for very short time segments the speech signal is,

00:17:32

as a first approximation, stationary, that is, the spectral composition

00:17:36

is constant so we take a small interval

00:17:41

and so for each point n in time we cut out a window

00:17:46

from the signal and we analyse it using the Fourier transform

00:17:51

we implicitly assume that the cut-out region is repeated periodically,

00:17:57

so we have a periodic continuation of the window of the speech signal

00:18:03

now there's a couple questions that come up immediately if we cut

00:18:07

out a time signal basically we multiply it by a window

00:18:12

if i cut out sixty milliseconds, i multiply the sixty milliseconds by one and the rest of the

00:18:21

speech signal by zero. now, a multiplication in the time domain

00:18:28

is a convolution in the frequency domain, and a convolution in time is a multiplication in frequency

00:18:37

cutting out corresponds to the multiplication of the speech signal with a window function

00:18:42

which is non-zero exactly in the interesting interval, and the question is

00:18:48

how that influences the spectrum. so which window

00:18:53

function should be chosen and how big should the window be

00:18:57

and should the windows overlap, and if so by how much? now if we cut out the window,

00:19:04

we can cut it out with the rectangular window, multiply each sample by one, or we can cut it out

00:19:10

with a window that kind of dampens the edges. why would

00:19:14

we do that? well, remember the discontinuity we introduced

00:19:18

the discontinuity: if we have the window, it starts here,

00:19:25

it ends here, and now we think of this window as being infinite,

00:19:30

then here we have a sharp discontinuity and then we go on with

00:19:34

the same window. so if we take a window that smooths this,

00:19:40

then we avoid that discontinuity. on the other hand we manipulate the

00:19:47

signal and we dampen those parts of the time signal so maybe if we

00:19:52

then take the next window and we overlap a little bit, then that part of the signal gets in

00:20:00

now here are the typical windows that we use: that is

00:20:04

the Hamming window, the Hanning window, Kaiser,

00:20:09

lots of them, and the rectangular window of course would go up here and down here

00:20:16

now basically, as i said, we modify the signal, and that means that when we take the Fourier

00:20:23

transform here we have to take the frequency response

00:20:31

of that window into account. now think about it:

00:20:36

if this is the frequency spectrum of your time signal

00:20:41

and if you convolve that with the frequency response of the window,

00:20:48

what would be the ideal window that would completely

00:20:54

recover the spectrum of the time signal?

00:21:01

it would be a Dirac at zero, because you go over it, you convolve it, and each

00:21:09

frequency you get is exactly reproduced and no other

00:21:14

frequencies are taken into account to explain that frequency

00:21:19

okay, so basically you would have a frequency response that goes like this, up, like that

00:21:29

now we look at the frequency response only in this half because it's symmetric, so i would

00:21:34

like to have something at zero and nothing here. if we look

00:21:42

at the rectangular window, there is something at zero and something here,

00:21:51

and the scale here is logarithmic

00:21:56

so if i have a window that has

00:22:01

this response and i convolve this response with

00:22:08

my spectrum, then to recover

00:22:12

that frequency i use this frequency, which is good, which is what i want,

00:22:19

but also these frequencies, which are not even present in my original spectrum

00:22:25

okay now

00:22:27

this is the Hamming window. see here: this is my time signal after the rectangular window, this is

00:22:34

my Hamming window function; it keeps the samples in the middle but forces those at the edges towards zero

00:22:42

okay now if you look at the frequency response here

00:22:47

when it recovers a frequency,

00:22:53

it takes the neighbouring frequencies into account much more strongly,

00:23:00

what's wrong with the pistols down but this is

00:23:03

much smaller than this one, but the ones that are far away are dampened much more

00:23:10

okay, so this signal that i started with here

00:23:16

was originally a superposition of three sines, so

00:23:24

the original spectrum that i would love

00:23:28

to recover with my window function would have nothing until here

00:23:33

one here one here and one here and nothing else that would only

00:23:39

have these three frequencies. this is what i get with the

00:23:44

rectangular window, this is what i get with the Hamming window, and as we see the frequency

00:23:51

resolution around there is not as good as this here, but on the other hand these frequencies are dampened a lot

00:24:00

okay, so to some extent this is a much better representation of the original signal than this one,

00:24:06

and that's why we typically apply a window that dampens
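
The trade-off between the rectangular and the Hamming window can be checked numerically. The sketch below is an illustration under assumptions, not the lecture's own example: it uses a single sine instead of three, a 64-sample window, the common Hamming coefficients 0.54 and 0.46, and a hypothetical helper far_leakage that measures how much energy ends up in bins far away from the true frequency.

```python
import cmath
import math

def dft_magnitudes(x):
    """Magnitudes of the first half of the naive DFT of a real frame."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2)]

def hamming(n):
    """Hamming window with the commonly used coefficients 0.54 and 0.46."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * t / (n - 1)) for t in range(n)]

N = 64
# 10.5 cycles per window: the sine does not fit the window exactly,
# so the periodic continuation has a discontinuity at the edges.
tone = [math.sin(2 * math.pi * 10.5 * t / N) for t in range(N)]

def far_leakage(frame):
    """Largest magnitude in bins 20 and above, relative to the spectral peak."""
    mags = dft_magnitudes(frame)
    return max(mags[20:]) / max(mags)

rect_leak = far_leakage(tone)
hamming_leak = far_leakage([s * w for s, w in zip(tone, hamming(N))])
```

With these settings the Hamming window leaves less energy in the distant bins than the rectangular window, at the price of a wider peak around the true frequency.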

00:24:12

still

00:24:14

the bigger the window in the time domain, the higher the frequency resolution, and

00:24:18

you know you you can think about it if you have a

00:24:22

window size of two fifty-six, and sixteen thousand divided by two fifty-six, you have about sixty-two hertz:

00:24:29

each frequency

00:24:32

component now represents a filter bank element of sixty-two hertz

00:24:38

it says: i'm here to represent the frequencies from two thousand to two thousand and sixty

00:24:48

and of course if you have only sixty-four

00:24:51

samples, then each one represents two fifty hertz
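
The numbers above follow directly from dividing the sampling frequency by the window length. A minimal sketch, with function names that are assumptions made for illustration:

```python
def bin_width_hz(sample_rate_hz, window_samples):
    """Frequency range covered by one DFT coefficient."""
    return sample_rate_hz / window_samples

def window_duration_ms(sample_rate_hz, window_samples):
    """Length of the analysis window in milliseconds."""
    return 1000.0 * window_samples / sample_rate_hz

# At 16 kHz: 256 samples are 16 ms, 64 samples are 4 ms.
long_window = (window_duration_ms(16000, 256), bin_width_hz(16000, 256))
short_window = (window_duration_ms(16000, 64), bin_width_hz(16000, 64))
```

Making the window four times shorter makes each coefficient cover a band four times wider, which is the time and frequency trade-off discussed here.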

00:24:56

if the window is too big, the speech signal is not stationary anymore, so it should be shorter than one phoneme

00:25:04

and you know if it's too big then

00:25:07

it's not stationary, and, you know, the uncertainty principle: the better

00:25:12

the time resolution, the worse the frequency resolution

00:25:16

and very typical analysis settings are ten milliseconds step size,

00:25:21

so we sample the frequency representation at a hundred hertz,

00:25:26

and the window size around twenty-five milliseconds. with a phoneme of eighty milliseconds,

00:25:32

that is about one third of a phoneme, on average, in one window
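
These settings translate into sample counts as follows. The sketch assumes a sampling frequency of 16 kHz, as used earlier in the lecture, and a hypothetical helper name frame_starts; frames that would run past the end of the signal are simply dropped.

```python
def frame_starts(n_samples, sample_rate_hz=16000, window_ms=25, step_ms=10):
    """Return window length, step size (both in samples) and frame start indices."""
    window = sample_rate_hz * window_ms // 1000
    step = sample_rate_hz * step_ms // 1000
    starts = list(range(0, n_samples - window + 1, step))
    return window, step, starts

# One second of signal at 16 kHz.
window, step, starts = frame_starts(16000)
overlap = window - step   # samples shared by two neighbouring windows
```

With a 25 millisecond window and a 10 millisecond step, neighbouring windows share more than half of their samples, so the parts dampened at the edges of one window sit near the middle of another.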

00:25:38

and if we look at the spectrograms well the spectrogram is what we look

00:25:42

at the whole time. basically it is the

00:25:45

spectrum of the squares of the absolute values; remember we have a complex spectrum, but we take the absolute values

00:25:52

time is displayed on the x axis, frequency on the y axis, and the intensity by the

00:25:57

optical density or by the colour
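
Put together, a spectrogram is a sequence of short-time power spectra. The following sketch is a plain-Python illustration under assumptions (a naive DFT, a Hamming window, a hypothetical 1000 Hz test tone sampled at 8000 Hz); a real system would use an FFT library.

```python
import cmath
import math

def power_spectrum(frame):
    """Squared absolute values of the DFT, bins 0 .. N/2."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n // 2 + 1)]

def spectrogram(signal, window, step):
    """One power spectrum per analysis window: time index first, frequency bin second."""
    taper = [0.54 - 0.46 * math.cos(2 * math.pi * t / (window - 1))
             for t in range(window)]
    columns = []
    for start in range(0, len(signal) - window + 1, step):
        frame = [signal[start + t] * taper[t] for t in range(window)]
        columns.append(power_spectrum(frame))
    return columns

# Hypothetical test tone: 1000 Hz sampled at 8000 Hz, 64-sample windows.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(256)]
spec = spectrogram(tone, window=64, step=32)
peak_bins = [max(range(len(col)), key=col.__getitem__) for col in spec]
```

Each bin is 8000 / 64 = 125 Hz wide, so the tone should appear in bin 8 of every column; plotting the columns side by side with intensity as colour gives the picture described above.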

00:26:01

broadband spectrogram: low frequency resolution, high time resolution,

00:26:06

vertical bars at the distance of the fundamental period, detection of short

00:26:11

phases of plosives. narrowband: high frequency resolution, low time resolution,

00:26:16

bars at a vertical distance of the fundamental frequency,

00:26:21

differentiation of formants that are close to each other

00:26:27

so this is a broadband spectrogram and you see those striations here:

00:26:35

those are the distances of the fundamental period. you see these striations here:

00:26:43

this is the fundamental frequency, first, second, third and so on harmonic

00:26:48

okay

00:26:51

and what you see, well, for one thing you see these bars, which are the resonance

00:26:57

frequencies of the vocal tract, and there will be much more about that in the next talk

00:27:05

again you see the striations, which are the distances,

00:27:09

the durations of the fundamental periods

00:27:16

and the other thing is, and again we'll hear more about it in the next talk,

00:27:25

when we have the linear model

00:27:28

of speech, it's a convolution

00:27:32

of the excitation signal and the transfer function of

00:27:37

the vocal tract, and the excitation

00:27:41

function has to do with the opening and closure of the vocal cords

00:27:48

and depending on whether you have like a a soft voice or a harsh voice

00:27:54

your excitation signal looks more like this or like that

00:28:00

now this is more like a sinusoid,

00:28:02

so it has the fundamental frequency, and the higher harmonics are dampened very much

00:28:09

this is a more harsh voice, where the higher harmonics are not dampened as

00:28:14

much. imagine you want to recover this

00:28:18

by a sum of sinusoids: then you need

00:28:23

the higher sines as well; you need the fundamental

00:28:27

sinusoid and the harmonics of it to get a complete periodic signal back

00:28:34

so here it's mostly the fundamental frequency, and here the

00:28:39

fundamental frequency and harmonics, not as much dampened

00:28:42

and of course that also shows up in your spectrum, so your

00:28:49

spectrum has both the vocal tract information and the

00:28:53

excitation information, which leads me to cepstrum coefficients

00:29:00

and if you have heard of the mel cepstrum, forget the mel for the moment: just cepstrum

00:29:05

so what we see is the source-filter model,

00:29:10

where we take a speech signal and see it as the convolution

00:29:14

of the excitation signal, whose characteristics we just looked at, and the vocal tract transfer function

00:29:22

now, if we have a convolution, right, convolution means:

00:29:28

if you go from the time signal into the frequency domain,

00:29:31

you have a multiplication; convolution becomes multiplication. now if you go into the log domain, this becomes an addition

00:29:40

so basically the log of the Fourier transform of your speech signal is

00:29:44

the log of the spectrum of the excitation signal plus the log of the transfer function
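
Written out, the chain just described looks as follows. The symbols are assumptions chosen for illustration and are not taken from the slides: s is the speech signal, e the excitation signal, h the impulse response of the vocal tract, and capital letters denote the Fourier transforms.

```latex
s(n) = e(n) * h(n)
\;\Longrightarrow\;
S(\omega) = E(\omega)\, H(\omega)
\;\Longrightarrow\;
\log |S(\omega)| = \log |E(\omega)| + \log |H(\omega)|
```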

00:29:51

okay

00:29:53

and i'm going to put the slides online, so you

00:29:58

don't really have to take down the formulas. the important thing is:

00:30:04

i can now do some manipulation and take the inverse Fourier transform

00:30:09

and go back, and if they are separable in that

00:30:17

domain then i can you know recover the information about

00:30:21

the excitation signal or about the transfer function

00:30:27

so i get the cepstrum as a homomorphic analysis,

00:30:33

and in case we have a real input

00:30:40

signal i can replace the inverse Fourier transform by the

00:30:46

cosine transform. so anyway, this is how i get the cepstrum coefficients,

00:30:54

and i can take the discrete cosine transform instead of the inverse

00:31:00

FFT, and this is how i get my cepstrum coefficients
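
A minimal sketch of this recipe: Fourier transform, log of the magnitudes, then a cosine sum in place of the inverse transform. It is an illustration under assumptions, not the lecture's implementation: the names dft and real_cepstrum, the small floor that avoids the log of zero, and the toy frame (a pulse plus a weaker copy 8 samples later) are all hypothetical.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def real_cepstrum(frame, floor=1e-12):
    """Log magnitude spectrum followed by a cosine transform."""
    n = len(frame)
    log_mag = [math.log(max(abs(v), floor)) for v in dft(frame)]
    # The log magnitude spectrum of a real frame is real and symmetric,
    # so the inverse Fourier transform reduces to a sum of cosines.
    return [sum(log_mag[k] * math.cos(2 * math.pi * k * q / n) for k in range(n)) / n
            for q in range(n)]

# Toy frame: a pulse and a weaker repetition PERIOD samples later.
N, PERIOD = 64, 8
frame = [0.0] * N
frame[0], frame[PERIOD] = 1.0, 0.5
cepstrum = real_cepstrum(frame)
peak_quefrency = max(range(1, N // 2), key=lambda q: cepstrum[q])
```

The repetition in time shows up as a ripple in the log spectrum, and the cepstrum turns that ripple into a single peak at the index of the repetition period.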

00:31:06

basically i want to show you how i can now separate this information. what we see here is a spectrum,

00:31:14

and that spectrum comes from a convolution of my excitation signal and my transfer function

00:31:25

and basically what is in the spectrum, i tried to show you that on the images before, is the rough

00:31:33

shape of the transfer function,

00:31:39

okay, superimposed by the fundamental frequency: the first,

00:31:45

so you have one, two, three, four,

00:31:47

five six seven eight nine ten you go down there it's about two hundred and thirty hertz

00:31:55

times ten, two thousand three hundred; it's much easier to

00:31:59

read that off up here: okay, that's about two thousand

00:32:03

three hundred, divided by ten, is two hundred thirty hertz. so if i now look at the signal

00:32:11

and i think of this as my signal, and i want to describe it by a sum of cosines,

00:32:18

well, you get the fundamental cosine, which gives you the rough shape,

00:32:25

and then you get higher cosines, which give you the finer shape,

00:32:30

the smaller changes, and those small changes are the harmonics, that's the fundamental frequency

00:32:37

and there is one cosine that is important to describe something that oscillates like that here

00:32:42

and it's not a frequency, it's a quefrency: it's a spectrum

00:32:48

of the spectrum, so i turned around the names:

00:32:50

spectrum, cepstrum. so i need a cepstrum coefficient

00:32:57

that will describe, in the quefrency domain,

00:33:03

this movement, and these here, well, they describe my rough shape

00:33:12

okay you with me yeah uh_huh so now

00:33:18

the cepstrum coefficients: resonances in the power spectrum, in this case

00:33:22

it's about five hundred hertz and and two thousand and three downwards with on something and without work some you you're

00:33:30

so five hundred, two thousand and two seventy: they are superimposed by the harmonics

00:33:37

the cepstrum is a spectrum of the spectrum; peaks

00:33:40

among the cepstrum coefficients indicate spectral oscillations:

00:33:45

the slow portions, lower cepstrum coefficients; the fast portions, the harmonics, upper

00:33:51

cepstrum coefficients. here we have a quefrency of thirty-five units,

00:33:56

here's thirty five

00:33:59

times one eighth of a millisecond, since it's eight kilohertz; we had

00:34:06

from zero to eight kilohertz and from minus eight to zero,

00:34:12

so thirty-five times one eighth is about four point three milliseconds, and then we divide one

00:34:19

second, one thousand milliseconds, by the four point three milliseconds and we get about two hundred and thirty hertz
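
The conversion from a cepstral peak to a fundamental frequency is a one-liner. The sketch below uses the values from the lecture (index 35, one unit equal to one eighth of a millisecond); the function name quefrency_to_f0 is an assumption. The exact product is 4.375 milliseconds, which the lecture rounds to four point three.

```python
def quefrency_to_f0(index, unit_ms):
    """Fundamental frequency in hertz for a cepstral peak at the given index."""
    period_ms = index * unit_ms
    return 1000.0 / period_ms

# Peak at 35 units of 1/8 ms each.
f0 = quefrency_to_f0(35, 1 / 8)
```

The result lies close to the two hundred and thirty hertz read off the harmonics in the spectrum.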

00:34:26

so that's exactly what we saw: we recover the fundamental frequency. we can

00:34:30

say this oscillation, from two hundred and thirty to four sixty

00:34:34

to six ninety and so on, is represented by a cosine with this quefrency

00:34:43

now we can

00:34:46

we can apply a low-pass filter which

00:34:52

kills all these coefficients, or a high-pass filter which kills all these coefficients

00:34:59

and basically we then go back. when we're in the quefrency domain

00:35:06

we just set all the lower coefficients to zero, then we get a high-pass filter;

00:35:12

or we set all the higher ones to zero, then we get the low-pass filter

00:35:18

if we look at that, this is what we see: this is the low-pass

00:35:22

filtered and this is the high-pass filtered, so we separated the excitation

00:35:28

and the vocal tract transfer function: that's the vocal tract transfer function, that's the excitation

00:35:34

signal, two hundred thirty, four sixty, six ninety and so on,

00:35:39

and it's just the information about the harmonics, and

00:35:43

this is just the information about the transfer function
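
The separation can be sketched as follows: go from the log spectrum to the quefrency domain, set either the high or the low coefficients to zero, and transform back. This is an illustration under assumptions, not the lecture's code: the names dft and lifter, the cutoff of 4, and the synthetic log spectrum (a slow cosine standing in for the vocal tract shape plus a fast ripple standing in for the harmonics) are all hypothetical.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def lifter(log_spectrum, cutoff):
    """Split a log spectrum into a smooth envelope and the fine structure."""
    n = len(log_spectrum)
    # Inverse DFT expressed through the forward DFT (conjugate, transform, conjugate, scale).
    cepstrum = [v.conjugate() / n
                for v in dft([complex(s).conjugate() for s in log_spectrum])]
    # Low quefrencies sit at both ends of the array because the cepstrum is symmetric.
    keep_low = [q < cutoff or q > n - cutoff for q in range(n)]
    low = [c if keep else 0 for c, keep in zip(cepstrum, keep_low)]
    high = [0 if keep else c for c, keep in zip(cepstrum, keep_low)]
    envelope = [v.real for v in dft(low)]
    fine_structure = [v.real for v in dft(high)]
    return envelope, fine_structure

# Synthetic log spectrum: slow shape plus a fast ripple with 8 cycles.
N = 64
slow = [math.cos(2 * math.pi * k / N) for k in range(N)]
ripple = [0.3 * math.cos(2 * math.pi * 8 * k / N) for k in range(N)]
log_spec = [a + b for a, b in zip(slow, ripple)]
envelope, fine_structure = lifter(log_spec, cutoff=4)
```

In this toy case the low-pass part returns the slow shape and the high-pass part returns the ripple, which mirrors the separation of vocal tract and excitation information described above.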

00:35:49

here we have a time signal, and here we have the cepstrogram,

00:35:54

so that's the counterpart of the spectrogram,

00:35:58

and here we have the cepstrogram. notice that down here,

00:36:02

the red here and yellow, that's information about the vocal tract,

00:36:08

and up here is the quefrency that's responsible for the fundamental frequency

00:36:14

now i low-pass filter this and i high-pass filter this

00:36:20

what we see here: we see the same

00:36:24

information as here, except that the striations, the harmonics, are removed

00:36:29

the long

00:36:33

so if we set this whole thing in here to zero

00:36:38

and only use this little part, we get this. if we set this in here to zero

00:36:44

and we only use this information here we get that and you see the fundamentals are much

00:36:50

much more straightened out, much easier to see than over here,

00:36:56

it's more equalised here, the vocal tract information is

00:37:00

okay. if on the other hand we put our cut-off frequency here, then we have the vocal

00:37:06

tract information here, and if we set all the rest to zero we get a smoothed version

00:37:12

of this spectrogram and almost only noise from this information, okay

00:37:22

okay, so much for the cepstrum. next we deal with something where we look at spectral bands. so

00:37:33

we take that spectrum

00:37:36

and we say: well, our Fourier coefficients, they correlate

00:37:40

highly with each other, the neighbouring ones; let's just take

00:37:44

bands which just integrate over frequency ranges, and reduce by an order of magnitude

00:37:53

so let's say we have two hundred fifty six quotations we want to go down to twenty ten

00:38:00

How do we do it? Well, we

00:38:03

put together

00:38:06

information of our two-hundred-fifty-six-dimensional vector that is

00:38:11

highly correlated. What is highly correlated? Neighbouring Fourier coefficients. How many do

00:38:16

we combine? Do we do it linearly? Let's look at the ear.

00:38:19

Let's look at the Bark or mel scale.

00:38:25

What does it do? Low frequencies: good frequency resolution. High frequencies: bad frequency resolution.

00:38:34

On the other hand, low frequencies: bad time resolution. High frequencies: good time resolution.

00:38:43

Okay, so there's this trade-off. So in the low frequencies we have few

00:38:50

Fourier coefficients to integrate over, and in the high ones we have several.

00:38:56

uh_huh

00:38:58

So this is indicated here: the time signal is put into different

00:39:04

band-pass filters, and they give us the energy in these bands.

00:39:09

And

00:39:12

basically this is what I indicate here with these triangles.

00:39:18

Each triangle now covers a certain portion of the spectrum, a frequency portion,

00:39:26

and basically you integrate over it by doing this: if this is my spectrum,

00:39:35

I take a frequency component, I multiply it with the value of the triangle,

00:39:42

and I add it up. So I have my

00:39:47

spectrum coefficients, and I get a band spectrum coefficient by taking

00:39:53

the value of the triangle times the coefficient, plus the next one, and so on. Okay.
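As a minimal sketch of that weighted sum: the function names and the bin positions 2, 4 and 6 below are made up for illustration, and the spectrum is a flat toy spectrum.

```python
def triangle_weight(k, left, centre, right):
    """Value of a triangular filter at bin k: 0 at the edges, 1 at the centre."""
    if left < k <= centre:
        return (k - left) / (centre - left)
    if centre < k < right:
        return (right - k) / (right - centre)
    return 0.0

def band_coefficient(spectrum, left, centre, right):
    """Triangle value times spectrum coefficient, summed over all bins."""
    return sum(triangle_weight(k, left, centre, right) * spectrum[k]
               for k in range(len(spectrum)))

# Flat toy spectrum with 8 bins; the triangle spans bins 2..6 and peaks at bin 4.
spectrum = [1.0] * 8
band = band_coefficient(spectrum, left=2, centre=4, right=6)
```

Here only bins 3, 4 and 5 contribute (with weights 0.5, 1.0 and 0.5), so one band coefficient replaces several neighbouring spectrum coefficients.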

00:40:03

And so the mel spectrum does it. By the way, there is also the Bark spectrum, but

00:40:07

it's minor changes. For the moment we just say they are relatively equivalent,

00:40:16

and Rafael will say a bit about the differences. So we have

00:40:20

seven triangle filters around centre frequencies of a hundred fifty, so the low frequencies,

00:40:26

and three ah the filters of four five hundred four thousand that

00:40:30

and so on. Each band ends at the centre frequency of its neighbouring band.

00:40:36

The spectral distribution is smoothed. Remember the harmonics, those

00:40:42

multiples of the fundamental frequency: they are destroyed because

00:40:47

we integrate over these frequency ranges. So

00:40:52

the spectral distribution is smoothed, the harmonic structure disappears, and the resonances of the vocal tract emerge.
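The band layout can be sketched as follows. The mel formula used here (2595 · log10(1 + f/700)) is one common convention and an assumption on my part, as are the function names, the 25 bands and the 0 to 8000 Hz range; the slides may use different numbers.

```python
import math

def hz_to_mel(f):
    """One common mel-scale convention (assumed, not taken from the talk)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(num_bands, f_min, f_max):
    """Return (left, centre, right) in Hz for each triangular band.

    The points are equally spaced on the mel scale, and each band starts and
    ends at the centre frequencies of its two neighbours.
    """
    m_lo, m_hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_hi - m_lo) / (num_bands + 1)
    points = [mel_to_hz(m_lo + i * step) for i in range(num_bands + 2)]
    return [(points[i], points[i + 1], points[i + 2]) for i in range(num_bands)]

bands = mel_band_edges(25, 0.0, 8000.0)
widths = [right - left for left, centre, right in bands]
```

The widths grow with frequency, so a low band integrates over only a few Fourier coefficients while a high band integrates over many, which is what removes the harmonic structure.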

00:41:00

So here is the Fourier spectrum and here is

00:41:04

a spectrum, the Bark spectrum, that I would get

00:41:09

if I, think of it as sampling the

00:41:15

spectrum non-equidistantly: I sample it

00:41:25

with a small sampling rate down here and with a large sampling rate over there.

00:41:32

And of course now I can go from that mel spectrum back

00:41:37

into the spectrum, and this is what I do here.

00:41:41

I get my, in this case, Bark spectrum. Now

00:41:47

I go back from the Bark spectrum to the full spectrum, and this is what I get.

00:41:55

okay so so uh

00:41:59

To recover this whole thing I only have one sample;

00:42:05

to recover that I have several samples. It's not equidistant. Okay.

00:42:12

So this would be the spectrum of the equivalent representation of this.

00:42:21

Okay, and so what we see is we get very smooth information.

00:42:29

So the mel cepstrum is still the most used

00:42:33

feature in speech recognition. In new deep learning approaches,

00:42:37

very often we don't do the cepstrum; we take the spectrum and

00:42:43

put it into a convolutional network, and the kernels of the convolutional network kind of do the same thing.

00:42:50

If you look at them, they kind of look at the higher frequencies with a bad

00:42:54

resolution, but they learn themselves what is best. So

00:43:04

basically we have the cosine transform of the logarithm of the mel spectrum.

00:43:09

There are no more harmonics in the mel spectrum, that's important. The cosine

00:43:13

transform here decorrelates the features, similar to a principal component analysis.

00:43:19

Typically the coefficient c zero, which is the energy, is replaced by the log energy,

00:43:25

and taking the logarithm results in a compression of the energy scale; the human ear does it too.

00:43:32

And if you have a low signal-to-noise

00:43:39

ratio, very often the logarithm is replaced by a root function.

00:43:45

So basically this sums up what we do: we take the time signal, we take a

00:43:51

power spectrum, we weight it with the windows, we get the mel spectrum of those,

00:43:59

from the mel spectrum we take the logarithm, then we take the cosine transform, and then we get

00:44:05

twenty-five, and out of these we take the first twelve, so another

00:44:09

smoothing, and that is our feature vector. And typically,

00:44:16

let me jump over that, typically this would be your spectrum

00:44:23

and that would be the representation of your cepstrum, and

00:44:29

uh uh

00:44:31

what is typically important is not only

00:44:36

what the current value

00:44:39

of that is, but where did it come from and where does it go to.

00:44:43

And in the olden days, or so far, very often we just take one of those,

00:44:51

one of those coefficients, and we take the derivative over time.

00:44:56

okay

00:44:58

Now very often people work directly in,

00:45:02

people work directly in the spectrum, and

00:45:05

instead of only taking the derivative in that direction you allow

00:45:09

the derivatives also in this direction, in this direction, and so on.

00:45:14

But basically, you have one of the cepstrum coefficients

00:45:19

over time, and you calculate some kind of regression, right?
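One common way to compute such a regression over time is the slope of a least-squares line through a few neighbouring frames. The sketch below assumes that formulation; the function name `delta`, the window of two frames on each side and the edge handling (repeating the first and last frame) are illustrative choices, not something stated in the talk.

```python
def delta(track, width=2):
    """Regression slope of one cepstrum coefficient over time.

    track: values of a single coefficient, one per frame.
    width: number of frames used on each side (an assumed default).
    """
    denom = 2 * sum(n * n for n in range(1, width + 1))
    last = len(track) - 1
    out = []
    for t in range(len(track)):
        # Frames beyond the ends are replaced by the first or last frame.
        num = sum(n * (track[min(t + n, last)] - track[max(t - n, 0)])
                  for n in range(1, width + 1))
        out.append(num / denom)
    return out

# A coefficient that rises by 1 per frame has slope 1 away from the edges.
track = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
slopes = delta(track)
```

Applying the same function to the slopes gives the second derivative that appears in the standard feature vector.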

00:45:26

So the standard feature vector in speech recognition is a short-term energy,

00:45:31

twelve mel cepstrum coefficients, first derivative, second derivative. There are plenty of variations. And let me finish

00:45:39

by showing what we hear, what is left from the speech signal,

00:45:44

if we use just the mel cepstrum coefficients, just those twelve

00:45:49

coefficients and so we packed authentic know that were hurt

00:45:54

It was sampled at sixteen kilohertz, with a Hamming window with twenty milliseconds window size,

00:45:59

ten milliseconds step size, a one-thousand-twenty-four-point FFT, and

00:46:04

a mel spectrum with twenty-five coefficients, then reduced to twelve mel frequency cepstrum coefficients.
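With these parameters, the chain for a single frame can be sketched as below. This is a rough stdlib-only illustration, not the code behind the demo: the mel formula, the triangle placement, the choice to drop c zero and keep c1 to c12, the small floor inside the logarithm and the synthetic test frame are all assumptions, and a real system would use an FFT rather than this naive transform.

```python
import cmath
import math

SAMPLE_RATE = 16000   # Hz
FRAME_LEN = 320       # 20 ms at 16 kHz
FFT_SIZE = 1024
NUM_MEL = 25
NUM_CEPS = 12

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)   # assumed mel convention

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Hamming-window the frame and return |X[k]|^2 for k = 0 .. FFT_SIZE/2."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spec = []
    for k in range(FFT_SIZE // 2 + 1):
        # Zero-padding to FFT_SIZE is implicit: samples beyond the frame are zero.
        s = sum(x * cmath.exp(-2j * math.pi * k * i / FFT_SIZE)
                for i, x in enumerate(windowed))
        spec.append(abs(s) ** 2)
    return spec

def mel_spectrum(power):
    """Integrate the power spectrum with NUM_MEL triangular filters."""
    top = hz_to_mel(SAMPLE_RATE / 2)
    # NUM_MEL + 2 points equally spaced in mel, expressed as fractional FFT bins.
    pts = [mel_to_hz(top * i / (NUM_MEL + 1)) * FFT_SIZE / SAMPLE_RATE
           for i in range(NUM_MEL + 2)]
    out = []
    for b in range(NUM_MEL):
        left, centre, right = pts[b], pts[b + 1], pts[b + 2]
        total = 0.0
        for k in range(math.ceil(left), math.floor(right) + 1):
            if k <= centre:
                w = (k - left) / (centre - left)
            else:
                w = (right - k) / (right - centre)
            total += w * power[k]
        out.append(total)
    return out

def dct(values, num_out):
    """Unnormalised DCT-II, first num_out coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * q * (j + 0.5) / n) for j, v in enumerate(values))
            for q in range(num_out)]

def mfcc(frame):
    log_mel = [math.log(max(e, 1e-12)) for e in mel_spectrum(power_spectrum(frame))]
    ceps = dct(log_mel, NUM_CEPS + 1)
    return ceps[1:]   # assumption: c0 is dropped and energy is handled separately

# Synthetic voiced-like frame: harmonics of 200 Hz with falling amplitude.
frame = [sum(math.sin(2 * math.pi * h * 200 * t / SAMPLE_RATE) / h for h in range(1, 20))
         for t in range(FRAME_LEN)]
features = mfcc(frame)
```

Because a gain change only adds a constant to every log mel value, it ends up in c zero and leaves the twelve coefficients kept here unchanged. In a full front end this function would be applied every 10 ms along the signal.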

00:46:12

so this is the spectrum the log spectrum

00:46:17

uh_huh

00:46:22

oh

00:46:25

Now we go into a spectrum, for instance from the "a",

00:46:30

and you see the harmonics. And now we go into the

00:46:36

mel cepstrum, and then we do the inverse: instead of the logarithm

00:46:41

we take the exponential and so on, and we have this

00:46:45

non-equidistantly sampled spectrum, and we reconstruct it, and that's the red curve.
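The way back from cepstrum coefficients to a smooth mel spectrum can be sketched like this. It assumes the unnormalised DCT-II as the cosine transform and keeps c zero so that the overall level survives; the 25 toy mel values and the function names are made up for illustration, and the final interpolation from the non-equidistant mel points to a full spectrum is left out.

```python
import math

def dct(values, num_out):
    """Unnormalised DCT-II, first num_out coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * q * (j + 0.5) / n) for j, v in enumerate(values))
            for q in range(num_out)]

def smooth_mel_spectrum(ceps, num_bands):
    """Invert a (possibly truncated) DCT-II, then undo the logarithm."""
    log_mel = []
    for j in range(num_bands):
        v = ceps[0] / num_bands
        v += (2.0 / num_bands) * sum(
            c * math.cos(math.pi * q * (j + 0.5) / num_bands)
            for q, c in enumerate(ceps) if q > 0)
        log_mel.append(v)
    # Instead of the logarithm, take the exponential.
    return [math.exp(v) for v in log_mel]

# Toy mel spectrum: a slow trend plus a fast zig-zag between neighbouring bands.
mel = [1.0 + 0.5 * math.sin(0.4 * j) + 0.2 * (-1) ** j for j in range(25)]
log_mel = [math.log(m) for m in mel]

exact = smooth_mel_spectrum(dct(log_mel, 25), 25)    # all coefficients: exact inverse
smooth = smooth_mel_spectrum(dct(log_mel, 13), 25)   # c0..c12 only: smoothed version
```

Keeping all 25 coefficients returns the original values; truncating to the low-order ones gives the smoothed curve, which is the extra smoothing step mentioned earlier.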

00:46:54

same for the e. in i'm

00:46:57

Okay, the red curve is now the representation of

00:47:02

and all that at length

00:47:07

this is what we have

00:47:10

This is what is left from the

00:47:15

original spectrum. Now the thing is, we can now

00:47:19

excite it

00:47:22

and look at what comes out. Now we don't have the fundamental frequency any more, that's lost. So the

00:47:28

only thing we can do is take those ten milliseconds and excite them with white noise.

00:47:35

What if we excite it with white noise? You know what that is.

00:47:42

i see a cellist yes

00:47:49

okay uh_huh

00:47:54

oh

00:47:56

so this is left

00:47:58

This is your best representation of the original signal. This is what the computer hears.

00:48:07

okay
