Transcriptions

Note: this content has been automatically generated.
00:00:00
Good morning. My name is Juan Rafael Orozco. I am from Colombia. I did my PhD
00:00:07
[unclear]. Currently I am a professor
00:00:11
at the University of Antioquia, Colombia,
00:00:14
and I am also affiliated with [unclear]. Today I'm going to talk
00:00:20
about speech signal representation using linear prediction, mainly about LPC.
00:00:29
So we're going to talk a little bit about the vocal
00:00:32
tract model, then about the source-filter model,
00:00:38
then about what we can do with this kind of model, and at
00:00:43
the end about what we can do with the residual signal and PLP.
00:00:48
Most of the information here comes from the book of [unclear] et al.
00:00:57
OK, so what's the vocal tract? Typically the vocal tract is divided into three
00:01:05
pieces. Let's say this area is called the
00:01:09
sublaryngeal tract, which is mainly the lungs and
00:01:14
the part of the tract before the glottis. Then,
00:01:18
between the glottis and the velum, it is called the vocal tract,
00:01:24
and the part through the nasal cavity is called the nasal tract.
00:01:29
So when we talk about the vocal tract we mainly talk about this area,
00:01:35
but we have to keep in mind all of these parts, because this is where the energy is produced
00:01:41
to generate the speech, and also the nasal
00:01:45
cavity, because it is important for producing nasal sounds,
00:01:49
for instance the "n". In Portuguese it is also important, and in French too.
00:01:55
And here in the vocal tract we have to keep in mind
00:02:00
what are called the articulators. The articulators are mainly the velum,
00:02:06
the tongue,
00:02:08
the lips and the jaw. So the task
00:02:14
of the velum is mainly to open or close
00:02:17
the path of the air passing through the nasal cavity. So when you don't have the velum, or
00:02:22
you are not able to control it, your speech sounds different.
00:02:30
OK, and this is just a zoom-in of the previous figure
00:02:34
to highlight where the vocal folds are, here, exactly at the
00:02:38
glottis. So basically
00:02:42
the air is produced in the lungs, and then the pressure here
00:02:47
starts to go up, up to the point where
00:02:53
the vocal folds have to open to let the air pass through. Then
00:02:58
the pressure starts to go down and the vocal folds close again, and then they open again when the pressure is high enough.
00:03:05
That's why the vocal folds are vibrating when you are speaking.
00:03:10
OK, so the vocal folds are crucial to produce speech. Then the tongue.
00:03:16
The tongue is very important because depending on where you put the tongue you can produce different sounds. So
00:03:22
basically, when you model the vocal tract you try to model the resonances here in the vocal cavity.
00:03:30
And these are the upper and lower lips, which
00:03:34
are also important because with them you change
00:03:38
the shape of the vocal tract, and depending on the shape you get different sounds and different resonances.
00:03:44
And again, this is the velum, which lets the air pass through the nasal cavity.
00:03:51
OK, so
00:03:53
we model all of this, the vocal tract, through a linear filter.
00:03:59
Linear in this case means that we are
00:04:02
not considering changes due to thermal effects
00:04:06
or due to viscosity inside the cavities.
00:04:13
And we consider the input of the filter to be the air that is coming
00:04:18
through the vocal folds, and the filter itself is modelling the resonances
00:04:25
here in the vocal tract. Note that we are not considering the nasal cavity.
00:04:32
So if we want to model phenomena related to
00:04:37
nasal problems, or problems controlling the velum, as in
00:04:42
children with cleft lip and palate, then we have to introduce something else.
00:04:49
OK, so here we have the excitation signal, this is the linear filter, and the output is the speech
00:04:56
signal. And this is what I said before,
00:04:59
and Elmar also showed it: the glottal volume velocity starts
00:05:04
to increase when the pressure here is increasing, up to the point where the vocal folds open. Then
00:05:12
it starts to increase because air is passing through, and then it starts to go down again, and then the
00:05:18
vocal folds close, and then they open again. It is a periodic or quasi-periodic phenomenon.
00:05:27
And we are also assuming in this model of the vocal tract that the wave,
00:05:32
which is actually a mechanical signal, is a plane wave.
00:05:39
That is important for the model later on, and it is propagating through the tube.
00:05:50
So, as I said, the shape of the vocal tract is changing when you are speaking.
00:05:59
This shows those changes: the cross-sectional area
00:06:05
of the tube is changing over time
00:06:09
and over distance. In this case the distance is from here to here, from the glottis to the lips.
00:06:16
So from the glottis to the lips, for the model we assume that this area can be modelled by
00:06:23
an array of concatenated small slices
00:06:28
of tubes, without any losses among them.
00:06:33
So the only thing we consider is the changes in the area, but no changes in time.
00:06:39
We are not including the time domain when we assume these slices, and now
00:06:45
we are going to model what happens in one of those slices.
00:06:53
So we take one of those slices.
00:06:58
We consider that there is a fluid flowing into the tube, which is the air in this case.
00:07:05
So there is a certain pressure at the beginning and at the
00:07:10
end of this small slice, and we assume that it is the same on both sides.
00:07:15
The slice is just a function of the distance, and if we
00:07:20
take another slice that is a little bit larger, by some delta,
00:07:26
then there is a difference in pressure. Here we also have changes in distance, not in time.
00:07:33
And when we solve the fluid dynamics equations,
00:07:39
we get this equation system that basically models
00:07:43
the changes in pressure over distance and the changes
00:07:47
in volume velocity over distance.
00:07:53
When we solve that equation system we find these two
00:07:58
equations, for the volume velocity and for the pressure.
00:08:01
Basically, this u-plus symbol is telling us about the air
00:08:07
flowing from this slice to the next
00:08:10
one, and the minus sign is
00:08:14
telling us about the air coming from this slice back to the previous slice.
00:08:19
And the length of each of them is constant, so
00:08:24
we are not considering different shapes in the slices.
00:08:32
Now, when we take the Z-transform of these two equations
00:08:39
and we discretise the model, we find this all-pole
00:08:43
model, which is the so-called linear predictive coding.
00:08:47
Basically this is modelling the transfer function of the vocal tract.
00:08:53
And the input, as Elmar showed, could be
00:08:57
the glottal excitation, which is quasi-periodic,
00:09:01
or Gaussian noise, white noise.
00:09:05
Depending on which phoneme you are
00:09:08
going to produce, you have either this one or this one, or a combination of both.
00:09:14
And the output is the speech signal.
00:09:20
So when we take the inverse Z-transform
00:09:24
of this all-pole model, we get this expression, where
00:09:31
this is the error that you make
00:09:35
when you try to model the speech signal.
00:09:40
What we are doing with this kind of model, or with this
00:09:45
kind of filter, is just trying to predict how the speech signal behaves
00:09:53
in one sample considering the past p samples, where p is the order of the model.
00:10:00
So that means that we are going to predict a sample of the speech
00:10:06
signal using the previous p samples of the same signal, which we already have.
00:10:12
So that means the prediction error is the difference between the
00:10:16
current speech signal and the signal that we are predicting.
00:10:22
This is expressed like this, and we want to minimise the
00:10:27
prediction error. So in order to minimise the prediction error, what we do is
00:10:32
find these coefficients, which are the linear coefficients that
00:10:37
allow us to reach a minimal error.
00:10:43
So in order to compute the error we sum
00:10:47
over all of the possible samples, and this is the total prediction error.
00:10:54
So this is the expression that we have to minimise, and
00:10:58
those coefficients that minimise this expression are called the LPC coefficients.
00:11:04
So in order to find the optimal coefficients, what we do is take the derivative with respect to
00:11:10
the a_i, which are the linear coefficients, and make it equal to zero.
00:11:18
After taking the derivative we find this expression, and
00:11:25
we can see that, instead of having this sum, we can
00:11:29
say that we have a set of linear equations here.
00:11:34
Then we can play around a little bit with this and this
00:11:39
in here, and put this one inside here, and
00:11:46
we can say that this multiplication is actually a correlation
00:11:51
function. If we change this expression and use
00:11:57
the correlation coefficient instead, what we have is
00:12:03
the correlation, in which i and j
00:12:08
appear here and here; it is the correlation function multiplied by each of the
00:12:14
p coefficients and summed. This expression
00:12:18
is well known as the Yule-Walker equations,
00:12:22
and it can be efficiently solved following an algorithm
00:12:28
proposed by Levinson and Durbin, the Levinson-Durbin recursion.
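The derivation just described can be written out as a short worked sketch. The slides are not part of this transcript, so the symbols below (s[n] for the speech samples, a_k for the coefficients, p for the order, phi for the correlation terms) and the sign convention are assumed notation, not necessarily the speaker's.

```latex
% Prediction of the current sample from the past p samples, and its error
\hat{s}[n] = \sum_{k=1}^{p} a_k\, s[n-k], \qquad
e[n] = s[n] - \hat{s}[n]

% Total prediction error to be minimised
E = \sum_{n} e^2[n]
  = \sum_{n} \Bigl( s[n] - \sum_{k=1}^{p} a_k\, s[n-k] \Bigr)^2

% Setting the derivative with respect to each a_i to zero gives p linear equations
\frac{\partial E}{\partial a_i} = 0
\;\Rightarrow\;
\sum_{k=1}^{p} a_k\, \phi(i,k) = \phi(i,0), \quad i = 1,\dots,p,
\qquad
\phi(i,k) = \sum_{n} s[n-i]\, s[n-k]
```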
00:12:38
So one of the methods to solve that equation system is the autocorrelation method.
00:12:45
In order to do that, the first thing we have to do
00:12:48
is take only one interval of
00:12:52
the signal and assume that the signal is zero
00:12:57
in the rest of the intervals. Notice that so far
00:13:01
we had not defined any interval for
00:13:05
the window to be analysed.
00:13:08
Now we are saying that we are going to take N samples of the signal.
00:13:14
So that means that for the total prediction error we have N for the speech signal and p
00:13:20
for the predicted signal. Remember that we took a filter with p samples to
00:13:25
predict the next one, so in total we have N plus p samples,
00:13:30
and that is the new total error that we have to minimise.
00:13:36
If we rewrite the previous equation, we find that
00:13:41
the correlation over the two samples is
00:13:47
basically the autocorrelation with a certain delay, so it is the same signal
00:13:51
correlated with itself with a certain delay.
00:13:57
And it can be written just following
00:14:02
the definition of the autocorrelation.
00:14:07
If we use the matrix representation, we take
00:14:12
this and we find this, and then you can see that
00:14:15
the diagonals of the matrix are all the same.
00:14:21
That symmetry is called Toeplitz symmetry,
00:14:26
and the Levinson-Durbin recursion just takes advantage
00:14:31
of this symmetry in order to find
00:14:34
efficiently the a_j coefficients, which are
00:14:38
the linear coefficients,
00:14:44
the coefficients of the linear filter that allows us to model the vocal tract.
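The autocorrelation method and the Levinson-Durbin recursion described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the speaker's code: the function names `autocorrelation` and `levinson_durbin` are hypothetical, the sign convention is "predicted sample = sum of a_k times s[n-k]", and the test signal is a synthetic impulse response with made-up coefficients.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of a frame assumed to be zero outside its N samples."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations for the predictor coefficients.

    Returns (a, error): a[k-1] multiplies s[n-k] in the prediction, and
    error is the prediction error energy left after the last stage.
    """
    a = [0.0] * (order + 1)   # a[0] is unused; a[1..order] hold the coefficients
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error        # reflection coefficient of stage i
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:], error


# Toy check: impulse response of a known 2nd-order all-pole filter,
# s[n] = 1.2*s[n-1] - 0.5*s[n-2] + impulse (values chosen only for illustration).
signal = [0.0] * 200
for n in range(200):
    excitation = 1.0 if n == 0 else 0.0
    s1 = signal[n - 1] if n >= 1 else 0.0
    s2 = signal[n - 2] if n >= 2 else 0.0
    signal[n] = 1.2 * s1 - 0.5 * s2 + excitation

coeffs, err = levinson_durbin(autocorrelation(signal, 2), 2)
print(coeffs)  # close to [1.2, -0.5]
```

The recursion needs on the order of p squared operations instead of the p cubed of a general linear solver, which is the efficiency gain the Toeplitz structure buys.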
00:14:51
OK, so now
00:14:57
we have efficiently found the optimal coefficients that allow us to model the vocal tract.
00:15:03
Now the question is what we can do with those coefficients.
00:15:09
We can do several things, but before talking about that
00:15:14
I will recall what Elmar started talking about,
00:15:19
and that is the influence of the time window
00:15:24
in speech processing, in this case in linear prediction.
00:15:28
First, let's assume we have a vowel
00:15:33
of a person with a pitch of a hundred and ten
00:15:37
hertz; that means a fundamental period of about nine milliseconds.
00:15:43
And here we can see the resulting spectrum
00:15:49
using rectangular windows, here with thirty milliseconds and here with fifteen milliseconds.
00:15:55
You can see here the harmonics that Elmar was talking about.
00:16:01
But here, when we use a Hamming window,
00:16:04
if the Hamming window is not long enough, that means if we don't include at least
00:16:10
two fundamental periods,
00:16:13
then we cannot properly see
00:16:18
the harmonics. So this is a requirement of Hamming windowing,
00:16:22
which is not the case for the rectangular window.
00:16:26
So if we are using Hamming windows,
00:16:32
we have to make sure that we include at least two fundamental periods.
00:16:39
And here we can see the same with a recording of my own voice. I have
00:16:45
a different pitch, of course. When we use a Hamming window that doesn't include
00:16:52
more than two fundamental periods, you don't see all of
00:16:59
the harmonics, but when we take a longer window
00:17:03
you start to see all the harmonics.
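The window-length rule above is simple arithmetic, sketched here in Python. The helper names `min_window_ms` and `hamming`, and the 16 kHz sampling rate, are assumptions made for the example; the window uses the common 0.54 - 0.46 cos form.

```python
import math


def min_window_ms(f0_hz, periods=2.0):
    """Shortest window, in milliseconds, that spans `periods` fundamental periods."""
    return periods * 1000.0 / f0_hz


def hamming(length):
    """Hamming window coefficients in the common 0.54 - 0.46*cos form."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


fs = 16000                      # assumed sampling rate in Hz
needed_ms = min_window_ms(110)  # pitch of 110 Hz, as in the example in the talk
frame_len = round(0.030 * fs)   # number of samples in a 30 ms frame
window = hamming(frame_len)
print(round(needed_ms, 2), frame_len)
```

For a 110 Hz voice two periods take roughly 18 ms, so a 15 ms Hamming window is too short while a 25 to 30 ms one is comfortably long enough.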
00:17:09
Another phenomenon that appears in the spectrum when you are windowing is leakage.
00:17:15
Elmar was also talking about it. That phenomenon
00:17:19
is due to the windowing: when you cut
00:17:22
the signal you introduce a discontinuity,
00:17:28
and what you introduce in the spectrum is additional spectral components that
00:17:33
are not part of the speech signal.
00:17:38
When you multiply the signal by the window, which convolves their spectra,
00:17:44
the result is that you cancel components or add components that you don't want to see.
00:17:49
That is what is happening here. Here we use rectangular
00:17:56
windows, and it doesn't matter whether you take a
00:18:00
long or a short window; it always appears, due to the discontinuity.
00:18:04
And here, when you take Hamming windows, as
00:18:08
they are softer and the changes are slower,
00:18:13
you can observe that the harmonics are clearly described
00:18:20
in the spectrum. So it is important to keep
00:18:25
both things in mind: how long the window has to be,
00:18:29
and which kind of window you want to choose.
00:18:34
Normally, if you want to be on the safe side, you go for thirty or twenty-five milliseconds of windowing.
00:18:44
And this is the leakage in the case of
00:18:47
my voice, using a rectangular window of thirty milliseconds.
00:18:54
OK, now for LPC analysis. This is also the sound "a",
00:19:01
and this is a portion of thirty milliseconds of the signal.
00:19:08
Now we take the spectrum over that window and
00:19:13
we compute the LPC coefficients, and this is
00:19:16
the transfer function of the resulting filter.
00:19:22
The important thing for us is not only the fundamental frequency but also those peaks
00:19:29
over the spectrum, because all of those peaks are very much
00:19:33
related to resonances in the vocal tract.
00:19:37
So using the information about the position of these
00:19:41
formants, which is the name of these peaks,
00:19:45
you can infer which kind of vowel,
00:19:48
which kind of open sound, you are producing.
00:19:54
How can we do that? We know that there are these peaks,
00:20:00
and if you go to the Z-domain,
00:20:04
in the Z-domain representation you can easily find
00:20:09
all of these resonances by taking
00:20:13
the angle of those poles.
00:20:16
And we can identify different sounds, like vowels,
00:20:21
in that representation. That means that computationally you don't need to find the peaks over the
00:20:27
representation of the envelope; you can come here and pick the exact ones.
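Reading a resonance off a pole angle can be sketched as follows. The function name `pole_to_formant`, the 8 kHz sampling rate and the example pole are assumptions for illustration; the relation frequency = angle * fs / (2*pi) is the standard mapping from the unit circle to hertz, and the bandwidth estimate from the pole radius is a common approximation.

```python
import cmath
import math


def pole_to_formant(pole, fs_hz):
    """Map a complex pole of the all-pole filter to (frequency, bandwidth) in Hz."""
    frequency = cmath.phase(pole) * fs_hz / (2.0 * math.pi)
    bandwidth = -math.log(abs(pole)) * fs_hz / math.pi
    return frequency, bandwidth


fs = 8000.0  # assumed sampling rate in Hz
# Hypothetical pole placed at the angle that corresponds to 850 Hz.
pole = cmath.rect(0.97, 2.0 * math.pi * 850.0 / fs)
freq, bw = pole_to_formant(pole, fs)
print(round(freq, 1), round(bw, 1))
```

In practice the poles come from the roots of the LPC polynomial, and only those with positive angle and a radius close to one are usually kept as formant candidates.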
00:20:36
How can we identify different vowels? This is the well-known vowel space:
00:20:41
this is the "a" sound, this is the "u", this is the "i",
00:20:45
and the rest are more vowels. These three vowels are called the corner vowels, and
00:20:52
they are very important because to some extent they represent,
00:20:58
no matter which language you speak,
00:21:03
the whole range of possibilities for moving
00:21:08
the tongue, independent of the language.
00:21:12
So for instance, for the "a" vowel the average
00:21:17
F1, the first formant, is eight hundred fifty hertz
00:21:21
and the second one is about sixteen hundred hertz. So
00:21:27
if you look here, we are around six hundred for the first peak, and for the second,
00:21:36
in my case, this is for my voice, we are at about
00:21:39
eleven or twelve hundred for the "a".
00:21:44
So we are around here for the "a". For the vowel "i" we are around two hundred
00:21:54
for the first formant, and the second one is
00:21:57
about two thousand. So in my
00:22:01
case I am around here for the "i". For the "u" vowel the average is two hundred fifty;
00:22:10
in my case I am more or less at two hundred fifty, and
00:22:15
the second formant is around here, close to six hundred hertz.
00:22:21
OK, so you can use that, but it is
00:22:27
not just to confirm that the person is producing a certain vowel.
00:22:32
It is very useful to know how capable a person is of
00:22:37
moving the tongue properly; we'll see some examples of that.
00:22:42
And we can also track the stability of the vocal fold vibration and the capability of
00:22:48
the person to keep the tongue in a certain position for a certain amount of time.
00:22:54
So for instance, for the "a" vowel this is the time signal,
00:22:58
this is the fundamental frequency and this is the first
00:23:02
formant; you can do the same with the second formant.
00:23:05
This is for my voice and this is
00:23:08
for a person with Parkinson's, and you can see that the fundamental frequency
00:23:12
is chaotic, and so is the first formant: it is hard for
00:23:17
them to keep the tongue in a certain position.
00:23:20
So you can use, for instance, just
00:23:25
the standard deviation of these curves to model how able
00:23:29
these people are to move the tongue properly.
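The stability measure mentioned above, the deviation of a contour over time, can be sketched with the standard library. The function name `contour_instability` and both contours are made up for illustration; real contours would come from frame-wise pitch or formant tracking.

```python
import statistics


def contour_instability(contour_hz):
    """Standard deviation of a frame-wise F0 or formant contour, in Hz."""
    return statistics.pstdev(contour_hz)


# Made-up contours, only to show the comparison.
steady = [120.0, 121.0, 119.5, 120.5, 120.0]
unsteady = [120.0, 135.0, 108.0, 142.0, 115.0]
print(contour_instability(steady) < contour_instability(unsteady))
```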
00:23:36
You can also plot the vowel triangle. As I said, the corner
00:23:41
vowels are very informative, so the "a", "i" and "u".
00:23:45
The area of the triangle gives you information
00:23:49
about the articulation capability of a person.
00:23:54
Note here, this is the speech of a person, this is not my
00:23:58
speech but that of another person close to sixty years old,
00:24:03
and this is another person with Parkinson's disease. The references are the
00:24:09
same in the plot, and you can see the compression of the vowel triangle.
00:24:13
So just the area of the triangle gives you enough information about
00:24:17
their articulation capability.
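The area of the vowel triangle can be computed with the shoelace formula once the three corner vowels are placed in the (F1, F2) plane. This is a sketch under assumptions: the function name `vowel_triangle_area` is hypothetical, and the formant values are illustrative numbers loosely based on the figures quoted earlier in the talk, not measurements.

```python
def vowel_triangle_area(a, i, u):
    """Area of the triangle whose corners are (F1, F2) pairs in Hz (shoelace formula)."""
    (x1, y1), (x2, y2), (x3, y3) = a, i, u
    return abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2.0


# Illustrative corner values in Hz for "a", "i" and "u".
reference = vowel_triangle_area((850.0, 1600.0), (250.0, 2000.0), (250.0, 600.0))
# A hypothetical compressed triangle, with the corners pulled toward the centre.
compressed = vowel_triangle_area((650.0, 1400.0), (350.0, 1700.0), (350.0, 900.0))
print(reference, compressed)
```

A smaller area means the three vowels sit closer together in formant space, which is the compression the speaker points out in the plot.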
00:24:24
OK, now let's talk about the residual. When you model
00:24:30
the vocal tract, you get this transfer
00:24:35
function. If you take the inverse of that
00:24:39
filter, you can obtain the residual. Then you can compare
00:24:48
the original signal with the reconstructed one, and the difference is the error,
00:24:54
or the residual. You can see that for a sustained vowel
00:24:59
there is always a peak at the beginning of each fundamental period, so that is
00:25:04
useful to detect when the voiced sound starts. Like here, for instance, we have
00:25:11
the transition from "s" to "a".
00:25:17
OK, this should work. This is just "sa", "sa".
00:25:25
So this is an "s" and this is an "a", and so on.
00:25:29
We are interested in modelling, or in understanding, the transition here:
00:25:35
how useful is the prediction error, or
00:25:38
the residual signal, to find that transition?
00:25:44
You see here that for the "s" sound there is no
00:25:49
periodicity, basically no peaks at all, but here, as soon as the vocal folds start
00:25:55
to vibrate to produce the "a" sound, a peak appears here, and then
00:26:00
appears here again when another period starts, and so on.
00:26:04
So the prediction error is useful to detect the starting
00:26:10
points of the vocal fold vibration.
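The inverse-filtering idea can be sketched on a synthetic signal. Everything here is an assumption made for illustration: the function name `lpc_residual`, the second-order coefficients, the 50-sample period and the 0.5 threshold. With the true coefficients the residual reduces to the excitation, so its peaks mark where each period starts.

```python
def lpc_residual(signal, coeffs):
    """Prediction error e[n] = s[n] - sum_k coeffs[k-1] * s[n-k] (inverse filtering)."""
    residual = []
    for n in range(len(signal)):
        predicted = sum(c * signal[n - k]
                        for k, c in enumerate(coeffs, start=1) if n - k >= 0)
        residual.append(signal[n] - predicted)
    return residual


# Synthetic "voiced" signal: an impulse every 50 samples drives an assumed
# 2nd-order all-pole filter (coefficients chosen only for illustration).
coeffs = [1.2, -0.5]
period = 50
signal = []
for n in range(200):
    excitation = 1.0 if n % period == 0 else 0.0
    s1 = signal[n - 1] if n >= 1 else 0.0
    s2 = signal[n - 2] if n >= 2 else 0.0
    signal.append(1.2 * s1 - 0.5 * s2 + excitation)

residual = lpc_residual(signal, coeffs)
onsets = [n for n, e in enumerate(residual) if abs(e) > 0.5]
print(onsets)  # [0, 50, 100, 150]
```

On real speech the coefficients are estimated per frame and the residual is noisier, so a fixed threshold like this one would need to be replaced by a proper peak picker.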
00:26:17
Now, the LPCs are useful and very
00:26:23
important, but as I said, you
00:26:26
make several assumptions, like linearity
00:26:29
in the representation of the vocal tract, and you also don't consider
00:26:36
that the human auditory system has less resolution above
00:26:42
about eight hundred hertz. That means that
00:26:47
at lower frequencies humans hear very well, but above eight
00:26:52
hundred hertz not so well, so we don't need
00:26:57
to have the same resolution in the upper side of the spectrum.
00:27:04
And also, linear prediction is not
00:27:09
properly considering
00:27:12
the characteristics of human perception.
00:27:17
So in order to work on these problems, those weak points of the modelling,
00:27:25
Hynek Hermansky proposed a different way of doing it, and that
00:27:29
is PLP, which stands for perceptual linear prediction.
00:27:34
It considers all of these psychoacoustic aspects of human hearing.
00:27:39
It consists of taking the speech signal, then you take the concept of critical
00:27:45
bands, and here you can use Bark bands. In the case of the original
00:27:49
PLP you use Bark bands, but you can also use mel bands with no
00:27:53
problem, or a different band scale. What is typical in these cases is the Bark bands.
00:28:00
Then you do a process of equal-loudness pre-emphasis, and
00:28:03
then you do the conversion from intensity to loudness.
00:28:07
Then you take the inverse Fourier transform, and then you find again the same
00:28:14
coefficients for the linear representation, and so what you find is again an all-pole model.
00:28:21
So here is the representation of the critical bands following the
00:28:26
Bark scale, and as I showed
00:28:30
before, here in the low part of the spectrum you have
00:28:35
a better resolution, and here in the upper part of the spectrum you have
00:28:39
a coarser resolution of the signal.
00:28:44
So here is the important part of the spectrum, and here it is not
00:28:46
so important, so you use a coarser
00:28:51
representation. Here is the equal-loudness pre-emphasis.
00:28:56
This is basically in order to approximate
00:28:59
what humans do, the non-equal sensitivity when you are listening:
00:29:04
you are not hearing all of the frequency bands equally.
00:29:08
So in order to compensate for that way of listening,
00:29:12
this pre-emphasis approximates what happens
00:29:17
in human hearing. And now the intensity-to-loudness conversion
00:29:23
consists basically of taking
00:29:28
the power one third of the
00:29:32
intensity, in order to get the loudness instead of the intensity.
00:29:39
That is mainly to reduce the dynamics of the amplitude in the spectrum,
00:29:44
and then you have a better resolution of the changes of the spectrum.
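Two of the PLP steps just described, the Bark frequency warping and the cube-root intensity-to-loudness conversion, can be sketched as follows. The function names are hypothetical. The warping formula 6 * asinh(f / 600) is the Bark approximation associated with Hermansky's PLP; the equal-loudness pre-emphasis curve is deliberately left out of this sketch.

```python
import math


def hz_to_bark(f_hz):
    """Bark warping used in PLP: 6 * asinh(f / 600), with f in Hz."""
    return 6.0 * math.asinh(f_hz / 600.0)


def intensity_to_loudness(intensity):
    """Cube-root amplitude compression (power one third)."""
    return intensity ** (1.0 / 3.0)


# Equal steps in Hz shrink on the Bark axis as frequency grows,
# which is the coarser resolution in the upper part of the spectrum.
for f in (100, 800, 4000):
    print(f, round(hz_to_bark(f), 2))
print(intensity_to_loudness(1000.0))
```

The cube root maps a thousandfold change in intensity to roughly a tenfold change in the compressed value, which is the reduction in dynamics the speaker mentions.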
00:29:52
What are the advantages of PLP over
00:29:54
LPC? Basically, it considers psychoacoustic characteristics of human hearing,
00:30:00
and it has shown good results in
00:30:03
speaker-independent speech recognition,
00:30:07
and additionally it does so with a reduced number of coefficients. That is because of
00:30:14
what I said about the reduced dynamics of the changes in the speech spectrum.
00:30:21
It is also more sensitive to certain phonetic units, like nasals,
00:30:26
due to the same thing and also due to the ability to better model the bandwidth
00:30:31
of the formants, which is important for modelling nasal vowels.
00:30:37
OK, so that's all I have. These are the references.
00:30:44
As I said, most of the information in the slides comes from this book.
00:30:50
This one is also a very handy source of information.

Conference Program

Speech analysis and characterisation
Elmar Nöth, Erlangen-Nürnberg
11 Feb. 2019 · 9:18 a.m.
Voice source analysis
Prof Juan Rafael Orozco-Arroyave, Colombia
11 Feb. 2019 · 10:10 a.m.
Speech synthesis 1
Peter Steiner, TU Dresden, Germany
11 Feb. 2019 · 11:12 a.m.
Speech synthesis 2
Peter Steiner, TU Dresden, Germany
11 Feb. 2019 · 11:39 a.m.