Transcriptions

Note: this content has been automatically generated.
00:00:00
Welcome, everybody. [unclear] I apologise for the weather.
00:00:09
you should visit me more often than you you wanna have okay
00:00:12
What I'm going to start out with is some material
00:00:16
about the Fourier transform, windowing and the cepstrum, and then
00:00:22
Rafael is going to go on and talk about LPC analysis,
00:00:27
and then [name unclear] will talk about synthesis
00:00:31
in general and then articulatory synthesis.
00:00:34
And I think the articulatory synthesis especially can help us a lot with
00:00:39
understanding the articulation and the generation of sounds. So it's not only about how to
00:00:45
synthesise but also about how speech is generated
00:00:49
in general, and that will help us with the
00:00:53
computation of features and so on.
00:01:00
Nowadays, very often, people have an end-to-end system: speech
00:01:05
signal in, words out. I would still like to present it in a
00:01:10
[unclear] manner, and we'll concentrate on the first stages.
00:01:16
We have a sensor signal, then we have a preprocessing and a
00:01:19
feature extraction. The preprocessing basically transfers
00:01:24
the signal into a transformed signal;
00:01:28
it's still in the same order of magnitude. And then you have the
00:01:31
feature extraction, and in the feature extraction you basically throw
00:01:36
away unnecessary information and keep all the necessary information. So if you
00:01:41
want to find out what was said,
00:01:46
to some extent you have to keep a different part of the information than for who said it or how he said it.
00:01:53
But basically we reduce by an order of magnitude.
00:01:57
And then we have a classification. In this case it's a single classification, but of course
00:02:02
very of when speech analysis we go ashore that i'm analysis and then roll you know analysis
00:02:08
by now if we have a sequence of i think him and we're pretty much here
00:02:14
So basically what we see here, and we'll get into that, is
00:02:17
the time and frequency representation of the speech signal.
00:02:21
So basically this is what hits the human ear, what
00:02:25
hits the microphone: the time signal.
00:02:29
Down here is the spectrogram, which shows
00:02:34
at each moment which frequencies are present.
00:02:37
This is the fundamental frequency,
00:02:41
whether I go up or down with my voice, and this is the energy.
00:02:46
So we have this speech signal, and basically
00:02:50
in the first step we go from here to here.
00:02:53
So why do we do that? Well, mainly because the ear, too, does a
00:02:57
frequency analysis. Now if we look at the discrete Fourier transform,
00:03:01
basically we see that we have a discrete signal. We had the
00:03:06
analogue signal, but it is transformed into a discrete signal. So
00:03:11
we look at a finite signal; we can't look at the complete signal as a whole, so we cut it
00:03:17
short. Now if you want to characterise the frequencies that are present in this discrete,
00:03:24
finite time signal, we know that this will give us an
00:03:32
infinite,
00:03:33
continuous and periodic frequency representation.
00:03:40
If we now assume that our finite signal is extended towards infinity
00:03:47
periodically, so we just copy that signal over and over again, then
00:03:52
we have a representation which is discrete
00:03:58
and periodic. It's also infinite, but all the information
00:04:04
is present in one period, so it suffices to look at that.
00:04:08
So that's where we go. If we have the spectrum, that is, the Fourier transform of a time
00:04:15
signal, and if the time signal is periodic with a certain period N,
00:04:23
it uh uh the fifth at the end of the thing that ever and last j. comes and
00:04:29
that means that all other spectral components disappear. The resulting line spectrum is computed with the Fourier
00:04:35
transform, and we can recover the complete signal by the inverse Fourier transform.
00:04:44
and if it's not periodic then we just think it's it's in fun
00:04:51
So the Fourier coefficient describes
00:04:55
the spectral density at this frequency.
00:05:00
Nominal resolution: I just put that down so that when you look at it later, I'm going to
00:05:05
upload them, so you can look at the slides, so let's not go through all these
00:05:10
formulas. And this is really what's important, because when we look at the spectrogram
00:05:17
we only see half of it, because it's periodic and it's
00:05:20
symmetric. So instead of going from minus
00:05:26
the sampling frequency by two to plus, we go from zero to plus.
00:05:33
Oh, sorry.
00:05:35
So if you look at the spectrum, think about this being copied into the negative domain, okay?
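The symmetry mentioned here can be checked numerically. The following is a minimal sketch, not taken from the slides: the helper `dft` and the sample values are made up for illustration, and a naive DFT is used instead of a library FFT so that it runs with the Python standard library alone.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform of a sequence x (O(N^2))."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

# A short real-valued test signal (values chosen arbitrarily for illustration).
signal = [0.0, 1.0, 0.5, -0.3, -1.0, 0.2, 0.7, -0.4]
spectrum = dft(signal)
n = len(signal)

# For real input, bin N-k is the complex conjugate of bin k, so the
# magnitudes in the upper half mirror those in the lower half.
max_mirror_error = max(abs(spectrum[n - k] - spectrum[k].conjugate())
                       for k in range(1, n))
print(max_mirror_error)  # only floating-point rounding remains
```

Because the upper half carries no new information for a real signal, displaying only the bins from zero to half the sampling frequency loses nothing.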
00:05:45
uh_huh
00:05:46
um
00:05:51
So we know that the spectrum is symmetric and we only
00:05:57
look at one half of it. And with the DFT
00:06:03
we need N squared complex multiplications, and
00:06:06
with the fast Fourier transform
00:06:11
I only need N times log N. And if we use the fact that we have a real
00:06:17
input signal, we can even get to log N minus one, so we save one step.
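To get a feeling for the difference between N squared and N times log N, here is a small order-of-magnitude sketch. It assumes a base-2 logarithm (a radix-2 FFT) and ignores constant factors; the function names are invented for this example.

```python
import math

def dft_multiplications(n):
    """Rough count of complex multiplications for a direct DFT."""
    return n * n

def fft_multiplications(n):
    """Rough count for a radix-2 FFT (order of magnitude only)."""
    return n * math.log2(n)

n = 256
direct = dft_multiplications(n)
fast = fft_multiplications(n)
print(direct, fast, direct / fast)
```

For a window of 256 samples the direct computation needs on the order of 32 times more multiplications than the fast one.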
00:06:24
And the sampling theorem: this
00:06:28
is our continuous time signal. Now we
00:06:33
discretise it: we sample
00:06:37
the continuous signal with a certain
00:06:42
sampling frequency, and we quantise. Typically people
00:06:49
don't talk that much about it, but this is
00:06:52
also analogue, not only that one, but this is analogue, and we
00:06:55
typically quantise it with sixteen bits, which I indicated here.
00:07:02
And the sampling theorem says that if the signal is
00:07:06
band-limited, that is, there are no frequencies above a certain
00:07:12
limit, then it suffices to sample it with twice that
00:07:18
frequency. If we know there are no more frequencies in the signal beyond
00:07:25
eight kilohertz, we can sample with sixteen kilohertz and we recover
00:07:30
the signal. With that sampling theorem, though,
00:07:34
you assume that there is no error in quantisation.
00:07:38
That's why, when you look at CD quality,
00:07:42
you assume the human ear goes no further than twenty kilohertz,
00:07:47
which holds for most of you; for me, definitely not, I don't hear twenty kilohertz.
00:07:55
But that's why the sampling frequency is forty-four kilohertz on the CD, so you have
00:08:01
twenty times two plus something, and that mostly recovers the quantisation
00:08:06
error that we make here.
00:08:11
So basically the sampling theorem says that we can recover,
00:08:16
with this interpolation, the signal without loss.
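What goes wrong when the band limit is violated can be shown with a short sketch. The frequencies of 7 and 9 kilohertz are chosen for illustration and are not from the talk; the sampling rate of 16 kilohertz is the one used in the example above.

```python
import math

def sample_cosine(freq_hz, rate_hz, count):
    """Sample a unit-amplitude cosine of freq_hz at rate_hz."""
    return [math.cos(2 * math.pi * freq_hz * n / rate_hz) for n in range(count)]

rate = 16000                             # sampling frequency in Hz
below = sample_cosine(7000, rate, 64)    # below half the sampling frequency
above = sample_cosine(9000, rate, 64)    # above half the sampling frequency

# After sampling, the 9 kHz cosine is indistinguishable from the 7 kHz one.
max_diff = max(abs(a - b) for a, b in zip(below, above))
print(max_diff)
```

The two sample sequences agree up to rounding, which is why the signal has to be limited to half the sampling frequency before it is sampled.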
00:08:22
so it's uh uh uh you can represent the same signal even the time domain for the for him now
00:08:29
We saw that before, and now we hear the signal.
00:08:36
So even if you don't know German you can say:
00:08:41
yeah, that's a girl who says something in German. Now let me
00:08:49
look at this representation, and it's the same signal,
00:08:56
but notice that this is different from that.
00:09:01
So what's happening there? And I can even go further and I still have the same signal,
00:09:09
but i have a
00:09:13
So basically I look at these three frequency representations,
00:09:20
and they represent the same time signal. What we do here is
00:09:24
what I said before: we take this finite signal,
00:09:29
so we cut out some segment, and we take the frequency representation. Basically, what
00:09:35
we do here is: we take a signal of four,
00:09:40
sixteen or two hundred fifty-six milliseconds, we cut that out, and we think: okay, from
00:09:45
now on this is infinite. And then we get the frequencies, and we display them here.
00:09:52
Okay, so we take that time signal, we compute the Fourier transform, we
00:09:58
have a spectrum, and this is one part of the complete picture.
00:10:03
And of course, if you take a different window length you get a different representation.
00:10:09
So we cut out that signal. Now there are two important things. The one thing: if you
00:10:16
notice here and notice here, there is a discontinuity
00:10:23
if we really think we're going to take that and go into infinity.
00:10:30
So we introduce a discontinuity here.
00:10:35
The other thing is, if you look really closely, this is a representation of sample and
00:10:41
hold: each one of those little things is one sixteen-thousandth of a second,
00:10:49
and you sample and you hold that for one sixteen-thousandth of a second. So basically if you look at
00:10:55
this representation you see a clear periodicity: that one shows up here again, and it shows up here again.
00:11:02
So this here is the fundamental period, the time between
00:11:07
two openings and closings of the vocal folds.
00:11:11
And it looks like it's periodic, but if you take it exactly, it's not: that
00:11:16
one looks a little bit different from that one. So it's almost periodic, but it's not.
00:11:22
But this
00:11:25
fundamental period has to show up somehow in the spectrogram if we take enough of this
00:11:31
signal, because then we have this quasi-periodicity; we'll see that in a second.
00:11:37
So we take that and then we compute the
00:11:40
Fourier transform of this short time segment, and if
00:11:45
you now look at this from the top
00:11:49
and you code this axis
00:11:55
by colour, and you turn it around, zero to eight,
00:12:03
zero to eight, turn it and flip it, and you look at it: the colour is the intensity here,
00:12:12
this is the time, this is the frequency. That's how we get the spectrogram, okay?
00:12:19
now here is that uh spectrograms that willow that that will that that's the different ones i just you know
00:12:28
the what we see here this is very short and we see no
00:12:32
uh uh yeah that is that the form seems to be the same
00:12:37
except for the use the things here in your well this thing
00:12:42
says it's also the frequencies that are present except in the time signal
00:12:48
This is a girl who speaks with about two hundred and seventy hertz,
00:12:53
so one fundamental period: one thousand milliseconds divided by
00:12:59
two hundred and seventy hertz means about four milliseconds per period.
00:13:05
So if the time segment
00:13:13
is less than or equal to one of those fundamental periods,
00:13:21
then you will not see any periodicity in the spectrum.
00:13:27
This one has on average about two fundamental periods, this one has about three to four,
00:13:34
and so you see the fundamental frequency here. And if it were
00:13:38
truly periodic, the spectrum would be a line spectrum with the fundamental
00:13:43
frequency, twice, three times, four times that: the harmonics. That's what we
00:13:48
see here: fundamental frequency, one, two, three,
00:13:52
first, second, third and so on harmonic. So here we don't see any fundamental period; here we see
00:14:00
a little bit, one, two, three, four; here it's much stronger, one, two, three, four, and so on.
00:14:09
And here the fundamental period is even more
00:14:15
distinct. But if you think of the higher harmonics,
00:14:20
which are like ten times the fundamental frequency: fundamental frequency
00:14:26
two seventy, ten times that, two thousand seven hundred. If I
00:14:31
have a quarter of a second and the periodicity
00:14:35
changes a little bit around two hundred seventy, then I still
00:14:39
see the fundamental period, but at ten times that, it's blurred.
00:14:44
See the difference: this has, let's say, five or
00:14:50
six fundamental periods, and this maybe has sixty fundamental periods, and if those
00:14:56
differ too much, if it goes from two seventy to two seventy-four to two seventy-five,
00:15:01
then the other one, the one up here, is close to two thousand seven hundred fifty.
00:15:05
So we have a fifty hertz difference here for a five hertz step, and so the fundamental
00:15:10
period shows up here, but not really any more around here. So you have to choose:
00:15:17
depending on the analysis size that you take,
00:15:21
you either stress the global information
00:15:26
or a little bit more the exact frequencies. On the other hand,
00:15:34
the shorter you are, the better your time resolution,
00:15:38
and the longer you are, the worse your time resolution is.
00:15:42
So here we have the time-frequency representations: this one is very good in frequency resolution,
00:15:48
this one is very bad in frequency resolution; this one is very good in time resolution, that one is very
00:15:53
bad in time resolution. So here's another speech signal. It looks more like this one, right?
00:16:04
More like this one than like this representation. But in fact it isn't.
00:16:11
uh_huh
00:16:13
But it's also a sixteen millisecond window: this is a four
00:16:17
millisecond window, this is the sixteen millisecond window.
00:16:21
What's the difference?
00:16:24
mm
00:16:29
It's a male voice.
00:16:33
A male voice with a lower fundamental frequency. You heard it, it's me. So
00:16:39
I speak with about a hundred and ten, hundred and twenty hertz; the girl,
00:16:43
nearly three times that, is at something like two hundred seventy hertz. So for the same analysis window:
00:16:50
if I take a four millisecond window when my fundamental period is ten milliseconds, then I have about forty percent of one period.
00:16:57
That's why, for my voice, there is no fundamental visible if I take
00:17:03
the same analysis window as I would for a high-pitched voice.
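The arithmetic behind this comparison can be written down as a small sketch. The fundamental frequencies (270 and 110 hertz) and the window lengths (4, 16 and 256 milliseconds) are the approximate values mentioned in the talk; the function names are invented for this example.

```python
def fundamental_period_ms(f0_hz):
    """Duration of one fundamental period in milliseconds."""
    return 1000.0 / f0_hz

def periods_in_window(window_ms, f0_hz):
    """How many fundamental periods fit into an analysis window."""
    return window_ms / fundamental_period_ms(f0_hz)

for label, f0 in (("high-pitched voice", 270.0), ("low-pitched voice", 110.0)):
    for window_ms in (4.0, 16.0, 256.0):
        count = periods_in_window(window_ms, f0)
        print(f"{label}: {window_ms:5.0f} ms window holds {count:5.1f} periods")
```

A window has to span more than one period before any harmonic structure can appear in the spectrum, so the same window length behaves very differently for the two voices.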
00:17:09
Now, the short-term analysis that we're going to look at. What do we
00:17:12
want? The description of the spectral composition of the speech signal.
00:17:17
The problem: the discrete Fourier transform is only meaningful for periodic signals, and
00:17:22
a speech signal is not periodic; its characteristics change over time.
00:17:27
The observation: for very short time segments, speech signals are,
00:17:32
as a first approximation, stationary, that is, the spectral composition
00:17:36
is constant. So we take a small interval,
00:17:41
and for each point n in time we cut out a window
00:17:46
from the signal, which we analyse using the Fourier transform.
00:17:51
We implicitly assume that the cut-out region is repeated periodically,
00:17:57
so we have a periodic continuation of the window of the speech signal.
00:18:03
Now there are a couple of questions that come up immediately. If we cut
00:18:07
out a time signal, basically we multiply it by a window:
00:18:12
if I cut out sixty milliseconds, I multiply the sixty milliseconds by one and the rest of
00:18:21
the speech signal by zero. Now, to multiply a time signal in the time domain:
00:18:28
in the frequency domain that is a convolution. And a convolution in time is a multiplication in frequency.
00:18:37
Cutting out corresponds to the multiplication of the speech signal with a window function
00:18:42
which is non-zero only in the interesting interval. And the question is how
00:18:48
that influences the spectrum. So which window
00:18:53
function should be chosen, and how big should the window be,
00:18:57
and should the windows overlap, and if so, by how much? Now if we cut out the window,
00:19:04
we can cut it out with the rectangular window, multiply each sample by one, or we can cut it out
00:19:10
with a window that kind of dampens the edges. Why would
00:19:14
we do that? Well, remember the discontinuity we introduce.
00:19:18
That discontinuity: if we have the window, it starts here,
00:19:25
it ends here, and now we think of this window as being infinitely repeated;
00:19:30
then here we have a sharp discontinuity, and then we go on with
00:19:34
the same window. So if we take a window that smooths the edges,
00:19:40
then we avoid that discontinuity. On the other hand, we manipulate the
00:19:47
signal, and we dampen those parts of the time signal. So maybe if we
00:19:52
then take the next window, we overlap a little bit, so that that signal gets in.
00:20:00
Now, here are the typical windows that we use:
00:20:04
the Hamming window, the Hanning window, [unclear], Kaiser,
00:20:09
lots of them. And the rectangular window of course would go up here and down here.
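As a sketch of what these windows look like in numbers, the following uses the common textbook definitions of the Hamming and Hann (Hanning) windows. The coefficients 0.54 and 0.46 are the standard ones and are not quoted from the slides; the function names are chosen for this example.

```python
import math

def rectangular(n_points):
    """Rectangular window: every sample is multiplied by one."""
    return [1.0] * n_points

def hamming(n_points):
    """Symmetric Hamming window: edges are damped to 0.08, not to zero."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]

def hann(n_points):
    """Symmetric Hann (Hanning) window: edges go all the way to zero."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]

for name, window in (("rectangular", rectangular(5)),
                     ("hamming", hamming(5)),
                     ("hann", hann(5))):
    print(name, [round(w, 3) for w in window])
```

Both tapered windows leave the middle of the segment almost untouched and pull the edges down, which removes the jump that the periodic continuation would otherwise introduce.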
00:20:16
Now basically, as I say, we modify the signal, and that means that when we take the Fourier
00:20:23
transform here, we have to take the frequency response
00:20:31
of that window into account. Now if you think about it:
00:20:36
if this is the spectrum of your time signal
00:20:41
and if you convolve that with the frequency response of the window,
00:20:48
what would be the ideal window that would completely
00:20:54
recover the spectrum of the time signal?
00:21:01
It would be a Dirac at zero, because you convolve it and each
00:21:09
frequency you get is exactly reproduced, and no other
00:21:14
frequencies are taken into account to explain that frequency.
00:21:19
Okay, so basically you would have a frequency response that goes like this: up, like that.
00:21:29
Now we look at the frequency response only in this half, because it's symmetric. So I would
00:21:34
like to have something at zero and nothing here. If we look at
00:21:42
the rectangular window, there is something at zero and something here,
00:21:51
and the scale here is logarithmic.
00:21:56
So if I have a window that has
00:22:01
this response, and I convolve this response with
00:22:08
my spectrum, then to recover,
00:22:12
say I want to recover that frequency, I use this frequency, which is good, which is what I want,
00:22:19
but also these frequencies, which are not even present in my original spectrum.
00:22:25
Okay, now
00:22:27
this is the Hamming window. See here, this is my time signal after the rectangular window; this is
00:22:34
my Hamming window function: it keeps those in the middle but forces those towards zero.
00:22:42
Okay, now if you look at the frequency response here:
00:22:47
to recover a given frequency,
00:22:53
it takes the neighbouring frequencies into account much more strongly.
00:23:00
That's what's wrong with it. But this is
00:23:03
much smaller than this one, and the ones that are far away are dampened much more.
00:23:10
Okay, so this signal that I started with here
00:23:16
was originally the superposition of three sinusoids, so
00:23:24
the original spectrum that I would love
00:23:28
to recover with my window function would have nothing until here,
00:23:33
one here, one here and one here, and nothing else; it would only
00:23:39
have these three frequencies. This is what I get with the
00:23:44
rectangular window, this is what I get with the Hamming window, and as we see, the frequency
00:23:51
resolution around there is not as good as here, but on the other hand these frequencies are gone now.
00:24:00
Okay, so to some extent this is a much better representation of the original signal than this one,
00:24:06
and that's why we typically apply a window that dampens.
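The comparison of the two windows on a sum of three sinusoids can be reproduced roughly as follows. This is only a sketch under assumptions: the three frequencies (placed halfway between DFT bins so that leakage is visible), the window length of 256 samples and the "far away" band of bins 100 to 127 are chosen for illustration and are not the values behind the slide.

```python
import cmath
import math

def magnitude_spectrum(x):
    """Magnitudes of DFT bins 0 .. N/2-1, computed with a naive O(N^2) DFT."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2)]

def hamming(n_points):
    """Symmetric Hamming window (standard textbook coefficients)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]

N = 256
components = (20.5, 40.5, 60.5)   # frequencies in units of DFT bins (assumed)
signal = [sum(math.sin(2 * math.pi * b * t / N) for b in components)
          for t in range(N)]

rect_spec = magnitude_spectrum(signal)                      # rectangular window
hamm_spec = magnitude_spectrum([s * w for s, w in zip(signal, hamming(N))])

def far_leakage_db(spec, lo=100, hi=128):
    """Mean energy in bins lo..hi-1 relative to the strongest bin, in dB."""
    peak = max(spec)
    leak = sum(m * m for m in spec[lo:hi]) / (hi - lo)
    return 10 * math.log10(leak / (peak * peak))

print("rectangular:", round(far_leakage_db(rect_spec), 1), "dB")
print("hamming:    ", round(far_leakage_db(hamm_spec), 1), "dB")
```

Far away from the three components, the Hamming-windowed spectrum carries much less spurious energy than the rectangular one, at the price of broader peaks around each component.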
00:24:12
still
00:24:14
The bigger the window in the time domain, the higher the frequency resolution, and
00:24:18
you can think about it: if you have
00:24:22
a window size of two fifty-six, then sixteen thousand divided by two fifty-six gives you sixty-two hertz;
00:24:29
each frequency
00:24:32
component now represents a filter bank element of sixty-two hertz.
00:24:38
It says: I'm here to represent the frequencies from two thousand to two thousand and sixty.
00:24:48
And of course if you have only sixty-four
00:24:51
samples, then each one represents two fifty hertz.
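The bin widths quoted here follow directly from dividing the sampling frequency by the window length. A minimal sketch, with function names invented for this example:

```python
def bin_width_hz(sample_rate_hz, window_samples):
    """Frequency range covered by one DFT coefficient."""
    return sample_rate_hz / window_samples

def window_duration_ms(sample_rate_hz, window_samples):
    """Duration of the analysis window in milliseconds."""
    return 1000.0 * window_samples / sample_rate_hz

for samples in (256, 64):
    print(samples, "samples:",
          bin_width_hz(16000, samples), "Hz per bin,",
          window_duration_ms(16000, samples), "ms long")
```

At sixteen kilohertz, 256 samples correspond to the sixteen millisecond window and 64 samples to the four millisecond window, so the shorter window pays for its time resolution with bins that are four times wider.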
00:24:56
If the window is too big, the speech signal is not stationary any more, so it should be shorter than one phoneme.
00:25:04
And if it's too big, then
00:25:07
it's not stationary, and, you know, the uncertainty principle: the better
00:25:12
the time resolution, the worse the frequency resolution.
00:25:16
And very typical analysis windows are: ten milliseconds step size,
00:25:21
so we sample the spectrum with a hundred hertz,
00:25:26
and a window size of around twenty-five milliseconds. With a phoneme at eighty milliseconds,
00:25:32
that's about one third of an average phoneme in one window.
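The typical setting (ten millisecond step, twenty-five millisecond window) can be sketched as a framing routine. The sampling rate of sixteen kilohertz is the one used elsewhere in the talk; `frame_signal` is a hypothetical helper written for this illustration, and incomplete frames at the end are simply dropped.

```python
def frame_signal(samples, sample_rate_hz, window_ms=25.0, step_ms=10.0):
    """Cut a signal into overlapping analysis windows."""
    window_len = int(round(sample_rate_hz * window_ms / 1000.0))
    step_len = int(round(sample_rate_hz * step_ms / 1000.0))
    frames = []
    start = 0
    # Only keep windows that fit completely into the signal.
    while start + window_len <= len(samples):
        frames.append(samples[start:start + window_len])
        start += step_len
    return frames

one_second = [0.0] * 16000            # placeholder for one second of speech
frames = frame_signal(one_second, 16000)
print(len(frames), "frames of", len(frames[0]), "samples each")
```

Each frame would then be multiplied by a window function and passed to the Fourier transform; neighbouring frames overlap by fifteen milliseconds.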
00:25:38
And if we look at the spectrogram, well, the spectrogram is what we have looked
00:25:42
at the whole time: it is the
00:25:45
spectrum of the squares of the absolute values. Remember, we have a complex spectrum, but we take the absolute values.
00:25:52
Time is displayed on the x-axis, frequency on the y-axis, and the intensity by the
00:25:57
optical density or by the colour.
00:26:01
Broadband spectrogram: small frequency resolution, high time resolution,
00:26:06
vertical stripes at the distance of the fundamental period, detection of short
00:26:11
phases of plosives. Narrowband: high frequency resolution, low time resolution,
00:26:16
horizontal stripes at the distance of the fundamental frequency,
00:26:21
differentiation of formants that are close to each other.
00:26:27
So this is a broadband spectrogram, and you see those striations here;
00:26:35
those are the distances of the fundamental period. You see these striations here:
00:26:43
this is the fundamental frequency, first, second, third and so on harmonic.
00:26:48
okay
00:26:51
And what you see: well, for one thing you see these parts which are the resonance
00:26:57
frequencies of the vocal tract, and Rafael will say much more about that in his talk.
00:27:05
Again you see the striations, which are
00:27:09
the durations of the fundamental periods.
00:27:16
And the other thing is, and again we'll hear more about it in the next talk:
00:27:25
when we have the linear model
00:27:28
of speech, it's a convolution
00:27:32
of the excitation signal and the transfer function of
00:27:37
the vocal tract, and the excitation
00:27:41
function has to do with the opening and closing of the vocal cords.
00:27:48
And depending on whether you have a soft voice or a harsh voice,
00:27:54
your excitation signal looks more like this or like that.
00:28:00
This is more like a sinusoid,
00:28:02
so it has the fundamental frequency, and the higher harmonics are dampened very much.
00:28:09
This is a more harsh voice: the higher harmonics are not dampened as
00:28:14
much. Imagine you want to recover this
00:28:18
by a sum of sinusoids; then you need
00:28:23
higher sinusoids. You need the fundamental
00:28:27
sinusoid and the harmonics of it to get a complete periodic signal back.
00:28:34
So here the harmonics are dampened very much; there, the
00:28:39
fundamental frequency and the harmonics are not dampened as much.
00:28:42
And of course that also shows up in your spectrum. So your
00:28:49
spectrum has both the vocal tract information and the
00:28:53
excitation information, which leads me to cepstrum coefficients.
00:29:00
And if you've heard of mel cepstrum, forget mel for the moment: cepstrum.
00:29:05
So what we assume is the source-filter model,
00:29:10
where we take a speech signal that we see as the convolution
00:29:14
of the excitation signal, whose characteristics we just looked at, and the vocal tract transfer function.
00:29:22
Now, convolution:
00:29:28
if you go from the time signal into the frequency domain,
00:29:31
you have a multiplication; convolution becomes multiplication. Now if you go into the log domain, this becomes an addition.
00:29:40
So basically the log of the Fourier transform of your speech signal is
00:29:44
the log of the excitation signal plus the log of the transfer function.
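The chain "convolution, then multiplication, then addition" can be verified on toy data. In this sketch the two short sequences stand in for excitation and vocal tract impulse response; their values are made up, a circular convolution is used so that the DFT relation holds exactly, and all helper names are invented for the example.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (O(N^2))."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def circular_convolve(a, b):
    """Circular convolution of two sequences of equal length."""
    n = len(a)
    return [sum(a[m] * b[(t - m) % n] for m in range(n)) for t in range(n)]

# Toy stand-ins (arbitrary values) for excitation and vocal tract response.
excitation = [1.0, 0.2, 0.0, 0.0, 0.9, 0.1, 0.0, 0.0]
vocal_tract = [1.0, 0.6, 0.3, 0.1, 0.0, 0.0, 0.0, 0.0]
speech = circular_convolve(excitation, vocal_tract)

log_speech = [math.log(abs(v)) for v in dft(speech)]
log_sum = [math.log(abs(e)) + math.log(abs(h))
           for e, h in zip(dft(excitation), dft(vocal_tract))]

max_error = max(abs(a - b) for a, b in zip(log_speech, log_sum))
print(max_error)  # only floating-point rounding remains
```

Once the two contributions are additive, a linear operation such as another Fourier transform can keep them apart, which is the idea behind the cepstrum.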
00:29:51
okay
00:29:53
And I'm going to put the slides online, so you
00:29:58
don't really have to take down the formulas. The important thing is:
00:30:04
I can now do some manipulation and take the inverse Fourier transform
00:30:09
and go back, and if they are separable in that
00:30:17
domain, then I can recover the information about
00:30:21
the excitation signal or about the transfer function.
00:30:27
So I get the cepstrum as a homomorphic analysis,
00:30:33
and in fact, if we have a real input
00:30:40
signal, I can replace the inverse Fourier transform
00:30:46
by the cosine transform. So anyway, this is how I get the cepstrum coefficients:
00:30:54
I can use the discrete cosine transform instead of the inverse
00:31:00
FFT, and this is how I get my cepstrum coefficients.
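The steps just described (window, Fourier transform, log magnitude, cosine transform) can be sketched on a synthetic voiced frame. Everything specific here is an assumption made for illustration: the sampling rate of 8 kilohertz, the fundamental of 200 hertz, the damped 1 kilohertz resonance standing in for the vocal tract, the Hamming window, and the search range of 80 to 400 hertz.

```python
import cmath
import math

SAMPLE_RATE = 8000      # Hz, assumed for this toy example
FRAME_LEN = 256         # samples (32 ms)
TRUE_F0 = 200.0         # Hz, i.e. a period of 40 samples

def synthetic_voiced_frame():
    """Pulse train (excitation) convolved with a damped resonance (toy vocal tract)."""
    period = int(SAMPLE_RATE / TRUE_F0)
    response = [0.8 ** m * math.cos(2 * math.pi * 1000 * m / SAMPLE_RATE)
                for m in range(30)]
    frame = [0.0] * FRAME_LEN
    for pulse in range(0, FRAME_LEN, period):
        for m, h in enumerate(response):
            if pulse + m < FRAME_LEN:
                frame[pulse + m] += h
    return frame

def real_cepstrum(x):
    """Cepstrum coefficients 0 .. N/2-1 of a Hamming-windowed frame."""
    n = len(x)
    windowed = [x[t] * (0.54 - 0.46 * math.cos(2 * math.pi * t / (n - 1)))
                for t in range(n)]
    # Log magnitude spectrum; the small constant guards against log(0).
    log_mag = [math.log(abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                                for t in range(n))) + 1e-9)
               for k in range(n)]
    # log_mag is real and even, so the inverse DFT reduces to a cosine sum.
    return [sum(log_mag[k] * math.cos(2 * math.pi * k * q / n) for k in range(n)) / n
            for q in range(n // 2)]

cepstrum = real_cepstrum(synthetic_voiced_frame())

# Search for the strongest coefficient between 400 Hz and 80 Hz.
lo, hi = SAMPLE_RATE // 400, SAMPLE_RATE // 80
peak_quefrency = max(range(lo, hi), key=lambda q: cepstrum[q])
estimated_f0 = SAMPLE_RATE / peak_quefrency
print(peak_quefrency, estimated_f0)
```

The low coefficients describe the slowly varying envelope, while the peak further up sits at the fundamental period measured in samples, so dividing the sampling rate by its index gives an estimate of the fundamental frequency.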
00:31:06
Now, basically I want to show you how I can separate this information. What we see here is a spectrum,
00:31:14
and that spectrum is a product of the spectra of my excitation signal and my transfer function.
00:31:25
And basically what is in the spectrum, I tried to show you that on the images before: you see a rough
00:31:33
form, the envelope, the transfer function,
00:31:39
superimposed by the fundamental frequency and its harmonics.
00:31:45
So you have one, two, three, four,
00:31:47
five, six, seven, eight, nine, ten. You go down there; it's about two hundred and thirty hertz
00:31:55
times ten, two thousand three hundred. It's much easier
00:31:59
to read that off down there: okay, that's about two thousand
00:32:03
three hundred, divided by ten: two hundred thirty hertz. So if I now look at this signal
00:32:11
and I think this is my signal, and I want to characterise it, I want to describe it as a sum of cosines:
00:32:18
well, you get the fundamental cosine, which gives you the rough shape,
00:32:25
and then you get higher cosines, which give you the finer shape,
00:32:30
the small changes. But those small changes are the harmonics of the fundamental frequency.
00:32:37
And there is some cosine that is important to describe something that goes like that here.
00:32:42
It's a frequency of a frequency: it's a spectrum
00:32:48
of the spectrum, so they turned the letters around:
00:32:50
spectrum, cepstrum. So I have cepstrum coefficients
00:32:57
that describe, in the quefrency domain,
00:33:03
this movement, and these here describe my rough shape.
00:33:12
Okay, you with me? So now
00:33:18
the cepstrum coefficients. Resonances in the power spectrum: in this case
00:33:22
at about five hundred hertz and two thousand and something hertz [unclear],
00:33:30
so five hundred, two thousand and [unclear]; they are superimposed by the harmonics.
00:33:37
The cepstrum is a spectrum of the spectrum. Peaks
00:33:40
among the cepstrum coefficients indicate spectral oscillations:
00:33:45
the slow portions, lower cepstrum coefficients; the fast portions, the harmonics, upper
00:33:51
cepstrum coefficients. A quefrency of thirty-five units:
00:33:56
here's thirty-five
00:33:59
times one eighth of a millisecond, the sampling, you know, it's eight kilohertz; we had
00:34:06
from zero to eight kilohertz and minus eight to zero.
00:34:12
So thirty-five times one eighth makes four point three milliseconds, and then we divide one
00:34:19
second, one thousand milliseconds, by the four point three milliseconds: we get two hundred and thirty hertz.
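The conversion from the position of the cepstral peak to a fundamental frequency is a two-line calculation. The sketch below takes the numbers as spoken (index thirty-five, one eighth of a millisecond per unit); the function name is invented for this example.

```python
def quefrency_to_f0(peak_index, unit_ms):
    """Convert a cepstral peak position into a fundamental frequency in Hz."""
    period_ms = peak_index * unit_ms      # fundamental period in milliseconds
    return 1000.0 / period_ms

period_ms = 35 * (1.0 / 8.0)
f0_hz = quefrency_to_f0(35, 1.0 / 8.0)
print(period_ms, round(f0_hz, 1))
```

Unrounded, the period comes out slightly above the quoted four point three milliseconds, and the resulting frequency is close to the two hundred and thirty hertz read off the harmonics.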
00:34:26
So that's exactly what we saw: we recover the fundamental frequency. We can
00:34:30
say this oscillation, from two hundred and thirty to four sixty
00:34:34
to six ninety and so on, is represented by a cosine with this quefrency.
00:34:43
Now we can
00:34:46
put a low-pass filter which
00:34:52
kills all these coefficients, or a high-pass filter which kills all these coefficients.
00:34:59
And basically, when we're in the quefrency domain
00:35:06
and just set all the lower coefficients to zero, then we get a high-pass filter;
00:35:12
if we set all the high ones to zero, we get the low-pass filter.
00:35:18
If we look at that, this is what we see: this is the low-pass
00:35:22
filtered, and this is the high-pass filtered. So we separated the excitation
00:35:28
and the vocal tract transfer function: the excitation
00:35:34
signal, two thirty, four sixty, six ninety and so on,
00:35:39
that is just the information about the harmonics, and
00:35:43
this is just the information about the transfer function.
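Setting cepstrum coefficients to zero and transforming back can be sketched on an artificial log spectrum. The log spectrum below is made up (one slow cosine as "envelope" plus one fast cosine as "harmonic ripple"), and the cut-off index of 8 is an arbitrary choice between the two; `lifter` is a hypothetical helper name.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive DFT; the inverse includes the 1/N factor."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def lifter(log_spec, cutoff, keep_low=True):
    """Keep only the low (or only the high) cepstrum coefficients."""
    n = len(log_spec)
    cepstrum = dft(log_spec, inverse=True)
    kept = []
    for q, c in enumerate(cepstrum):
        is_low = q < cutoff or q > n - cutoff   # low quefrencies and their mirror
        kept.append(c if is_low == keep_low else 0)
    return [v.real for v in dft(kept)]

N = 128
envelope = [math.cos(2 * math.pi * k / N) for k in range(N)]            # slow part
ripple = [0.5 * math.cos(2 * math.pi * 16 * k / N) for k in range(N)]   # fast part
log_spectrum = [e + r for e, r in zip(envelope, ripple)]

smooth = lifter(log_spectrum, cutoff=8, keep_low=True)    # "low-pass"
fine = lifter(log_spectrum, cutoff=8, keep_low=False)     # "high-pass"

env_error = max(abs(a - b) for a, b in zip(smooth, envelope))
rip_error = max(abs(a - b) for a, b in zip(fine, ripple))
print(env_error, rip_error)
```

In this idealised case the two parts separate exactly; with real speech the separation is only approximate, because envelope and harmonics are not confined to disjoint sets of coefficients.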
00:35:49
Here we have a time signal, and here we have the cepstrogram.
00:35:54
So up on top I've got the spectrogram,
00:35:58
and here we have the cepstrogram. Notice that down here,
00:36:02
the red and yellow here, that's information about the vocal tract,
00:36:08
and up here, that's the quefrency that's responsible for the fundamental frequency.
00:36:14
Now I low-pass filter this and I high-pass filter this.
00:36:20
What we see here: we see the same
00:36:24
information as here, except that the striations, the harmonics, are removed.
00:36:29
the long
00:36:33
So if we set this whole thing here to zero
00:36:38
and only use this little part, we get this. If we set this here to zero
00:36:44
and we only use this information here, we get that, and you see the fundamentals are
00:36:50
much more straightened out, much easier to see than over here;
00:36:56
here the vocal tract information is more equalised.
00:37:00
Okay, if on the other hand we put our cut-off quefrency here, then we have the vocal
00:37:06
tract information here, and we set all that to zero; we get a smooth version
00:37:12
of this spectrogram, and almost only noise from this information. Okay?
00:37:22
Okay, so much for the cepstrum. We'll deal with something else, where we look at spectral bands. So
00:37:33
we take that spectrum
00:37:36
and we say: well, our Fourier coefficients correlate
00:37:40
highly with each other, the neighbouring ones. Let's just take
00:37:44
bands which integrate over frequency ranges, reducing by an order of magnitude.
00:37:53
So let's say we have two hundred fifty-six coefficients and we want to go down to twenty or ten.
00:38:00
How do we do it? Well, we
00:38:03
put together
00:38:06
the information of our two hundred fifty-six-dimensional vector that
00:38:11
highly correlates. What highly correlates? Neighbouring frequencies. And how many do
00:38:16
we take? Do we do it linearly? Let's look at the ear,
00:38:19
let's look at the Bark or mel scale.
00:38:25
What does it do? Low frequencies: good resolution; high frequencies: bad resolution.
00:38:34
On the other hand, low frequencies: bad time resolution; high frequencies: good time resolution.
00:38:43
Okay. So at the low frequencies we have few
00:38:50
Fourier coefficients to integrate over, and at the high ones we have several.
00:38:56
uh_huh
00:38:58
so this is indicated here you know that time signal is put into different
00:39:04
band pass filters and they give us the energy in these bands
00:39:09
and
00:39:12
basically this is what i indicate here with these triangles
00:39:18
each triangle now covers a certain spectrum portion, frequency portion
00:39:26
and basically you integrate over those by doing this, you know if this is my spectrum
00:39:35
i have a, you know, frequency component, i multiply it with the value of the triangle
00:39:42
and i add it up, so i have my
00:39:47
spectrum coefficients and i get a band spectrum coefficient by you know taking
00:39:53
the value of the triangle times the coefficient, plus, and so on okay
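The multiply-and-add step just described can be written as a weighted sum. The following is a minimal sketch; the function names and the convention of passing the band edges as FFT bin indices are assumptions for illustration.

```python
import numpy as np

def triangular_filterbank(edge_bins, n_bins):
    """One triangle per band; edge_bins holds n_bands + 2 strictly increasing bin indices."""
    n_bands = len(edge_bins) - 2
    bank = np.zeros((n_bands, n_bins))
    for i in range(n_bands):
        left, centre, right = edge_bins[i], edge_bins[i + 1], edge_bins[i + 2]
        # Rising slope from the left neighbour's centre up to this band's centre
        bank[i, left:centre] = (np.arange(left, centre) - left) / (centre - left)
        # Falling slope from this band's centre down to the right neighbour's centre
        bank[i, centre:right] = (right - np.arange(centre, right)) / (right - centre)
    return bank

def band_spectrum(spectrum, bank):
    # Each band value is the sum over bins of triangle weight times spectral coefficient
    return bank @ spectrum
```

For a flat spectrum each band value is simply the area of its triangle, so wider triangles at high frequencies collect more coefficients.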
00:40:03
and so the mel spectrum does it, by the way also the bark spectrum, but
00:40:07
it's minor changes, for the moment we just say it's relatively equivalent
00:40:16
and there is a bit to say about the differences, so we have
00:40:20
seven triangle filters around centre frequencies hundred fifty so the low frequencies
00:40:26
and three ah the filters of four five hundred four thousand that
00:40:30
and so on, each band ends in the centre frequency of its neighbouring band
00:40:36
the spectral distribution is smoothed, remember the harmonics, those
00:40:42
multiples of the fundamental frequency, they are destroyed because
00:40:47
we integrate over these frequency ranges so
00:40:52
the spectral distribution is smoothed, the harmonic structure disappears and the resonances of the vocal tract emerge
00:41:00
so here is the fourier spectrum and here is
00:41:04
a spectrum, the bark spectrum, that i would get
00:41:09
if i think of it as sampling the
00:41:15
spectrum non-equidistantly, like i sample it
00:41:25
with a small sampling rate down here and with a lot of sampling rate over there
00:41:32
and of course now i can go from that mel cepstrum back
00:41:37
into the spectrum and this is what i would do here
00:41:41
i have this, i get my, in this case, bark spectrum, now
00:41:47
i go back from the bark spectrum to the full spectrum and this is what i get
00:41:55
okay so so uh
00:41:59
to recover this whole thing i only have one sample,
00:42:05
to recover that i have several samples, it's not equidistant okay
00:42:12
so this would be the spectrum, the equivalent representation of this
00:42:21
okay and so what we see is we get very smooth information
00:42:29
so the mel cepstrum is still the most used
00:42:33
feature in speech recognition, in new deep learning approaches
00:42:37
very often we don't do these steps, we take the spectrum and
00:42:43
put it into a convolutional network, and the kernels of the convolutional network, they kind of do the same thing
00:42:50
if you look at it they kind of look at the higher frequencies with bad
00:42:54
resolution, but they learn themselves what is best, so
00:43:04
basically we have the cosine transform of the logarithm of the mel spectrum
00:43:09
there are no more harmonics in the mel spectrum, that's important, the cosine
00:43:13
transform here decorrelates the features, similar to a principal component analysis
00:43:19
typically the coefficient c. zero, which is the energy, is replaced by the log energy
00:43:25
and taking the logarithm gives a compression of the energy scale, the human ear does it too
00:43:32
and if you have a low signal to noise
00:43:39
ratio very often the logarithm is replaced by a root function
00:43:45
so basically this sums up what we do, we take the time signal, we take a
00:43:51
power spectrum, we apply the windows, we get the mel spectrum of those
00:43:59
from the mel spectrum we take the logarithm, then we take the cosine transform and then we get
00:44:05
twenty five, and out of these we take the first twelve, so
00:44:09
another smoothing, and that is our feature vector, and typically
00:44:16
let me jump over that, typically this would be your spectrum
00:44:23
and that would be the representation of your cepstrum and
00:44:29
uh uh
00:44:31
what typically is important though is not only
00:44:36
what's the current value
00:44:39
of that, but where did it come from, where does it go to
00:44:43
and in the olden days, and still very often, we just take one of those
00:44:51
one of those coefficients and we take the derivative over time
00:44:56
okay
00:44:58
now very often people work directly in
00:45:02
the spectrum and
00:45:05
instead of only taking the derivative in that direction you allow
00:45:09
derivatives also in this direction, in this direction and so on
00:45:14
but basically you have one of the cepstrum coefficients
00:45:19
over time and you know you calculate some kind of regression, right
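The regression over time can be sketched as follows. The talk does not give a formula, so this uses the widely used delta formula (a least-squares slope over a few neighbouring frames); the function name `delta`, the default width of two frames and the edge padding are assumptions.

```python
import numpy as np

def delta(features, width=2):
    """Regression slope over time for each coefficient; features has shape (frames, coeffs)."""
    n_frames = features.shape[0]
    # Repeat the first and last frame so that every frame has neighbours on both sides
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    out = np.zeros(features.shape, dtype=float)
    for n in range(1, width + 1):
        ahead = padded[width + n : width + n + n_frames]
        behind = padded[width - n : width - n + n_frames]
        out += n * (ahead - behind)
    return out / denom
```

Applying `delta` once gives the first derivative, applying it to the result again gives the second derivative.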
00:45:26
so the standard feature vector in speech recognition is short-term energy,
00:45:31
twelve mel cepstrum coefficients, first derivative, second derivative, there's plenty of variations and let me finish
00:45:39
by showing what we hear, what is left from the speech signal
00:45:44
if we use just the mel cepstrum coefficients, just those twelve
00:45:49
coefficients and so we packed authentic know that were hurt
00:45:54
it was sampled at sixteen kilohertz with a hamming window with twenty millisecond window size
00:45:59
ten milliseconds step size and a one thousand twenty four point f. f. t. and
00:46:04
the mel spectrum with twenty five coefficients, reduced to twelve mel frequency cepstrum coefficients
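With the parameters just listed, the whole chain can be sketched end to end. This is an illustration under assumptions, not the code used for the demo: the mel formula, the helper names, the plain matrix form of the cosine transform and the choice to keep coefficients one to twelve (leaving out c0) are all assumed.

```python
import numpy as np

def mel_filterbank(n_bands, n_fft, sample_rate):
    """Triangular filters with edges equidistant on an assumed mel scale."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2), n_bands + 2))
    bins = np.floor((n_fft + 1) * edges / sample_rate).astype(int)
    bank = np.zeros((n_bands, n_fft // 2 + 1))
    for i in range(n_bands):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        bank[i, left:centre] = (np.arange(left, centre) - left) / (centre - left)
        bank[i, centre:right] = (right - np.arange(centre, right)) / (right - centre)
    return bank

def mfcc(signal, sample_rate=16000, win_ms=20, step_ms=10,
         n_fft=1024, n_bands=25, n_ceps=12):
    win = int(sample_rate * win_ms / 1000)    # 320 samples at 16 kHz
    step = int(sample_rate * step_ms / 1000)  # 160 samples at 16 kHz
    n_frames = 1 + (len(signal) - win) // step
    frames = np.stack([signal[i * step : i * step + win] for i in range(n_frames)])
    # Hamming window, zero-padded FFT, power spectrum
    power = np.abs(np.fft.rfft(frames * np.hamming(win), n_fft)) ** 2
    # Mel band energies, then the logarithm (small offset avoids log(0))
    log_mel = np.log(power @ mel_filterbank(n_bands, n_fft, sample_rate).T + 1e-10)
    # Cosine transform (DCT-II) written as a matrix; rows 1..n_ceps, so c0 is left out
    k = np.arange(1, n_ceps + 1)[:, None]
    m = np.arange(n_bands)[None, :]
    dct = np.cos(np.pi * k * (2 * m + 1) / (2 * n_bands))
    return log_mel @ dct.T
```

One second of audio at sixteen kilohertz gives 99 frames of twelve coefficients each.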
00:46:12
so this is the spectrum the log spectrum
00:46:17
uh_huh
00:46:22
oh
00:46:25
now we go in a spectrum for instance from the ah from i
00:46:30
and you see the harmonics, and now we go into the cepstrum,
00:46:36
the mel cepstrum, and then we do the inverse, instead of the logarithm
00:46:41
we take the exponential and so on and we have this
00:46:45
non-equidistantly sampled spectrum and we reconstruct it, and that's the red curve
00:46:54
same for the e. in i'm
00:46:57
okay the red curve is now the representation of
00:47:02
and all that at length
00:47:07
this is what we have
00:47:10
this is what is left from the
00:47:15
original spectrum now the thing is we can now
00:47:19
excite it
00:47:22
and look at what comes out, now we don't have the fundamental frequency anymore, that's lost, so the
00:47:28
only thing we can do is take those ten milliseconds and excite them with white noise
00:47:35
what if we excite it with white noise, you know what that is
00:47:42
i see a cellist yes
00:47:49
okay uh_huh
00:47:54
oh
00:47:56
so this is left
00:47:58
this is your best representation of the original signal, this is what the computer hears
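The white noise excitation can be sketched like this. It assumes the smooth spectra have already been reconstructed from the cepstrum coefficients, one per ten millisecond step; the function name, the Hann window and the overlap-add scheme are assumptions and not necessarily how the demo was produced.

```python
import numpy as np

def noise_excited_resynthesis(envelopes, win=320, step=160, n_fft=1024, seed=0):
    """envelopes: (frames, n_fft // 2 + 1) smooth magnitude spectra, one per step."""
    rng = np.random.default_rng(seed)
    window = np.hanning(win)
    out = np.zeros(step * (len(envelopes) - 1) + win)
    for i, envelope in enumerate(envelopes):
        # White noise has a flat spectrum, so the envelope alone shapes the sound
        noise = rng.standard_normal(n_fft)
        shaped = np.fft.irfft(np.fft.rfft(noise) * envelope, n_fft)
        # Overlap-add the windowed frame at its position in time
        out[i * step : i * step + win] += shaped[:win] * window
    return out
```

Because the fundamental frequency is lost, the output has no pitch; only the spectral envelope of each frame survives.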
00:48:07
okay

Conference program

Speech analysis and characterisation
Elmar Nöth, Erlangen-Nürnberg
11 Feb. 2019 · 9:18 a.m.
Voice source analysis
Prof Juan Rafael Orozco - Arroyave, Colombia
11 Feb. 2019 · 10:10 a.m.
Speech synthesis 1
Peter Steiner, TU Dresden, Germany
11 Feb. 2019 · 11:12 a.m.
Speech synthesis 2
Peter Steiner, TU Dresden, Germany
11 Feb. 2019 · 11:39 a.m.