Transcriptions

Note: this content has been automatically generated.
00:00:01
Welcome. Today's lecture is actually just a bit of revision, mainly, for everybody in the room.
00:00:08
We're really going to be talking about speech processing, with just a basic introduction to a lot of the features
00:00:13
that we commonly use in paralinguistics, that we commonly use in the openSMILE toolkit. We'll reinforce that
00:00:19
within the tutorial this afternoon, really learning how to extract a few of these, and then
00:00:24
a bit of an introduction to machine learning. I'll spend a bit of time talking about
00:00:28
generalisation, and why and how we set up a machine
00:00:33
learning problem, and then also introduce you guys to a lot of the conventional,
00:00:38
pre-deep-learning machine learning techniques that I use quite a lot.
00:00:42
I'll spend a little bit of time on support vector machines, because that's what we'll focus on again in the lab this afternoon.
00:00:49
So who am I? Why am I up here talking to you guys?
00:00:53
I'm Dr Nicholas Cummins, but please just call me Nick. I'm a
00:00:56
Habilitand at the Chair of Embedded Intelligence for Health Care
00:01:00
and Wellbeing at the University of Augsburg in Germany.
00:01:04
The habilitation doesn't really make sense to a lot of people, and
00:01:08
it's essentially a professor-in-training position.
00:01:13
I think the equivalent position in the UK would be lecturer,
00:01:18
or in the States it would be assistant professor. That's roughly what I do.
00:01:23
My main research interests really focus around machine learning in healthcare, and
00:01:27
multisensory sensing of healthcare and different wellbeing aspects,
00:01:31
really focusing a lot on affective and behavioural computing. My real background as
00:01:36
well is in this sort of area of neurological disorders,
00:01:42
mental disorders, things like this, mental health. My PhD was
00:01:47
actually on using speech to see if we could find a marker for depression itself.
00:01:52
I currently work on a few different projects, including the TAPAS project. I also work on a few
00:01:56
others: sustAGE, RADAR-CNS and DE-ENIGMA. I'll talk a little bit more about them in a moment.
00:02:01
My previous role was a postdoc position at the Chair of
00:02:07
Complex and Intelligent Systems at the University of Passau, also in Germany,
00:02:11
and I did all my research, my undergraduate and my PhD, at the University of
00:02:16
New South Wales, which is in Sydney, Australia. So yes, I'm actually Australian.
00:02:21
I originally come from right down the bottom, in Hobart, which is in
00:02:24
Tasmania and is a really, really lovely part of the world.
00:02:27
I've spent most of my life in Sydney; it's a very nice city.
00:02:32
I'll do my bit for the Australian tourism board in encouraging everyone to please go visit. Of
00:02:35
course, saying g'day, as the traditional welcome from us.
00:02:41
I have an Australian accent; I can't really apologise. If you do struggle to
00:02:45
understand anything I say, which sometimes happens, please let me know and I'll re-express myself,
00:02:52
I'll say it again until we eventually get there. So what I need to do is
00:02:56
just a little bit more about my research institute, which, as I said, is the
00:03:02
Chair of Embedded Intelligence for Health Care and Wellbeing. We're quite a
00:03:05
new research institute, founded in two thousand and seventeen,
00:03:09
so we've just had our anniversary, really. We have one professor, Professor Björn Schuller,
00:03:14
who most of you guys probably know from the literature you should have been reading.
00:03:19
He's openSMILE, he's the ComParE challenges, he's the AVEC challenges, which are used
00:03:24
all over speech pathology quite a bit, especially when we're talking about emotion
00:03:28
and affective computing. We have one Habilitand, namely myself, we have six or so postdocs,
00:03:33
twelve doctoral students and two visiting positions at the moment as well.
00:03:38
And the core research of the group is really this idea of machine learning and signal processing
00:03:43
in healthcare, focusing on embedded, ubiquitous sensing, IoT devices,
00:03:48
intelligence, deep learning, machine learning, conventional methods, and then really
00:03:52
looking at this in healthcare, things like
00:03:55
speech pathology, or in wellness, things like affective computing. As I said, I work on a few different projects, the first one
00:04:02
being TAPAS, so obviously here we're talking about speech pathology, different ways to diagnose and improve diagnosis in
00:04:07
speech pathology. I also work on the sustAGE project, where we're looking at ways of creating sustainable ageing:
00:04:13
how can we develop apps that really support elderly people to live a more active life? We
00:04:19
look at current mood, current working environment, current physical activity, and recommend what things
00:04:25
they should do at work, what things they should do outside of work,
00:04:28
to maintain an active and healthy lifestyle, with the aim of keeping people working longer.
00:04:32
I also work on the RADAR-CNS project, which has big overlaps with TAPAS. We're
00:04:37
really looking at ways of using ubiquitous sensing, using smartphones and using
00:04:42
signals that we collect from the phones as well, to monitor people with depression,
00:04:46
multiple sclerosis and epilepsy. We're really looking in terms of relapse here:
00:04:49
can we predict if somebody's going to have a relapse? It's a massive
00:04:53
big project with, I think, roughly thirty partners, and it's looking at
00:04:58
collecting samples from well over a thousand people during the entire course of the project.
00:05:03
Lastly, I work on DE-ENIGMA, which again has some overlap with TAPAS. We're looking at improving
00:05:09
emotion therapies for children with autism. So: okay, we have a child, and
00:05:14
how is the child expressing themselves in terms of affect, arousal
00:05:19
and valence levels? And then also looking at different vocalisations that may convey certain needs,
00:05:25
and trying to really aid therapies and aid therapists, and make
00:05:27
more automated therapy sessions. It's a really interesting project.
00:05:32
But that's enough about Augsburg; really, on to the
00:05:35
lecture today. We're basing it in and around speech pathology, and
00:05:39
speech pathology is the study, diagnosis and treatment of communication
00:05:43
disorders. At least that's how I'm defining it for today.
00:05:47
This includes difficulty with speech, language, fluency and voice itself.
00:05:52
And difficulty in communication is really just a dysfunction in any part of the
00:05:57
speech production apparatus. This could happen as some sort of cognitive effect, this
00:06:01
could happen as some sort of physiological effect, some sort of neurological effect,
00:06:04
something that's affecting our fine control within speech itself.
00:06:08
We can also see developmental and learning disabilities coming in here as well, and then
00:06:13
things like stroke, dementia and brain injuries that commonly affect speech production.
00:06:19
So what I'm really interested in, in this sort of realm, is finding objective measures to
00:06:24
help the diagnosis of different conditions. We aid people who are trying to
00:06:28
diagnose people, get them into the right healthcare, and monitor how they might be doing on different treatments. That's my
00:06:33
real core interest within speech pathology, and what we're focusing on here.
00:06:38
So how do we do this? How does this overlap with doing machine learning? This really relates to the field of what's
00:06:44
known as empirical learning. This is basing a decision on
00:06:47
existing data, so trying to learn from past events, essentially.
00:06:51
Just a very simple example of what I mean here. Imagine it's Friday night.
00:06:56
We've ordered a pizza delivery, there's a game of football or a good movie starting in about thirty minutes, and all of a
00:07:01
sudden your friend rings. He wants to come over; he wants you to come and pick him up from his house.
00:07:07
Can you actually go to his house in time? Can you pick him up and make it back in time for the pizza to be delivered? Will we get back
00:07:13
in time for the start of the movie? The main thing here is: will you make it back in time for the pizza delivery?
00:07:18
So how do we actually make this decision? How do we normally do this as humans? What we normally do,
00:07:24
not exactly, but we build some sort of statistical model. We look at
00:07:29
past events and realise what might be happening, what might be going on here.
00:07:33
So the last times we ordered pizzas to our house, we found that on average it sometimes,
00:07:39
sorry, over all the pizzas, we found that four times out of eight it had been delivered late.
00:07:44
So we think, okay, fifty-fifty: maybe I've got time to go and pick up my friend, maybe I've got time to get back in time.
00:07:51
We know it might be okay to come back a bit late. So we're going: well, maybe, maybe not. So we've got to
00:07:57
make the decision; we've got to look at more information, we've got to pull in more.
00:08:01
What else can we learn? What else can we use to actually help
00:08:05
make a decision in time? So we need to consider all the relevant
00:08:08
information, and this is what we're doing with machine learning: we try to gather
00:08:12
lots of different information from speech, pull it in and learn from it, learning from relevant information. So
00:08:17
we always have this idea of some sort of dependent variable: what are we trying to predict,
00:08:21
and what does it represent? In this very simple case it's just pizza
00:08:25
delivery time. We also have independent variables: these are the features, this
00:08:29
is the information that we're really pushing into a machine learning algorithm.
00:08:33
This is what we actually use to form and make a decision.
00:08:36
We want to break something down. For pizza delivery we might look at different days of the week, to see if there's a pattern in when drivers might be
00:08:42
coming in and being late, or when drivers might be on time within the delivery slot.
00:08:47
So we want to build relationships between all these different pieces of information, build them up into a model,
00:08:53
and find relationships, find patterns, and use these patterns to really
00:08:57
make our decision. So we notice that when we look
00:09:00
at days of the week and different pizza deliveries, we see that on Mondays three out of four deliveries
00:09:05
had been on time, in this particular example. We could also include other information too,
00:09:12
like up-to-date traffic information. I mean, nowadays we'll have apps that track this,
00:09:16
so we know a little bit more. So we can build this decision tree up and go: okay, it's Monday night, I've ordered my food,
00:09:22
will it be late? Looking just at "was it late before, was it late previously?", I'd go
00:09:26
fifty-fifty. Then I break this down in a very simple decision tree
00:09:30
and realise that there's probably a twenty-five percent chance my pizza might be
00:09:34
late in this particular example. So using this information I build
00:09:37
a slightly better model. I've put in more information and I go: okay, I'm not going to go and pick up my friend
00:09:42
at the moment, the pizza is most likely going to be on time. My friend can just walk over; he's not really that
00:09:47
good a friend anyway that it's worth missing pizza for. So with this sort of information, this is how we really build up
00:09:52
this idea of speech pathology and empirical learning.
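
Editor's sketch: the pizza reasoning can be written down in a few lines of Python. The delivery log below is hypothetical and was chosen only to match the proportions in the example (half of all deliveries late, one in four late on Mondays); the names `past_deliveries`, `late_rate` and `late_rate_given_day` are illustrative.

```python
# Hypothetical delivery log: (day of week, was the pizza late?).
# Counts are assumptions chosen to match the spoken example:
# 4 of 8 deliveries late overall, 1 of 4 late on Mondays.
past_deliveries = [
    ("Mon", False), ("Mon", False), ("Mon", True), ("Mon", False),
    ("Fri", True), ("Fri", True), ("Sat", True), ("Sat", False),
]


def late_rate(records):
    """Fraction of the given deliveries that arrived late."""
    if not records:
        raise ValueError("no past deliveries to learn from")
    return sum(late for _, late in records) / len(records)


def late_rate_given_day(records, day):
    """Late fraction after one decision-tree split on the day of the week."""
    return late_rate([r for r in records if r[0] == day])


overall = late_rate(past_deliveries)                   # dependent variable only
monday = late_rate_given_day(past_deliveries, "Mon")   # plus one feature

print(f"P(late) = {overall:.2f}")
print(f"P(late | Monday) = {monday:.2f}")
```

Adding the day of the week as a feature moves the estimate from fifty-fifty to one in four, which is the same refinement the decision tree in the example makes.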
00:09:56
So how do we really do this? How do we pull this all back into
00:10:00
speech pathology itself, and into speech pathology detection? We always start with data,
00:10:05
and really that's labelled data, normally, as well. So this is collecting speech from different
00:10:10
people with different pathological conditions, and maybe from control samples as well. So things we work with, like
00:10:15
depression, Parkinson's or autism, all the different sorts of speech we might collect in
00:10:21
collaboration with some clinical partners. We might try to get good, decent medical labels that go along with these as well.
00:10:27
We clean the data up a little bit and we start to remove erroneous factors:
00:10:30
we might have an outlier sample, someone might have forgotten to turn the microphone on,
00:10:34
someone might have had a massive coughing fit halfway through it and nobody else really had this. We might look for
00:10:40
really big, obvious outliers that are really going to affect our decision, that could affect the
00:10:44
machine learning in the different aspects we're using. So we clean up the data a little bit.
00:10:48
Then we've really got to extract relevant information. This is the first
00:10:51
part of the lecture today: I'm going to talk about feature extraction itself.
00:10:55
This is taking a whole speech signal. Speech signals are very, very
00:11:00
complex to look at; there's a lot of different information going on.
00:11:03
We break them down into smaller and smaller chunks, where we can start to see patterns, we can start to
00:11:08
extract little bits of extra information from these, and these are the features that we use.
00:11:13
And then we push this into a machine learning model. We go: okay, I'm going to
00:11:17
use deep learning, I'm going to use support vector machines, I'm going to use decision trees.
00:11:21
We feed this information in, we train our model, we test our model, and we go: okay, cool,
00:11:26
I've got something that's working particularly accurately. Great, I'm going to publish a paper on
00:11:31
it, and everyone's going to be happy. So that's roughly what we're doing here.
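
Editor's sketch: the four stages just described (data collection, clean-up, feature extraction, train and test) can be laid out end to end on toy data. Everything here is an assumption for illustration: the "recordings" are synthetic noise, the choice to make the "patient" class quieter is arbitrary and not a clinical claim, and the nearest-class-mean model merely stands in for the classifiers named in the lecture.

```python
import random
import statistics

random.seed(0)


def make_recording(label, n=200):
    """Toy stand-in for a recording: noise whose level depends on the label."""
    scale = 1.0 if label == "control" else 0.5  # arbitrary assumption
    return [random.gauss(0.0, scale) for _ in range(n)]


# 1. Data collection: labelled recordings, plus one empty "microphone off" sample.
data = [(make_recording(lab), lab) for lab in ["control", "patient"] * 20]
data.append(([], "control"))

# 2. Preprocessing: drop obviously erroneous samples.
data = [(x, y) for x, y in data if len(x) > 0]

# 3. Feature extraction: one number per recording (mean absolute amplitude).
features = [(statistics.fmean([abs(s) for s in x]), y) for x, y in data]

# 4. Machine learning: split, train a nearest-class-mean model, then test.
random.shuffle(features)
train, test = features[:30], features[30:]
centroids = {
    lab: statistics.fmean([f for f, y in train if y == lab])
    for lab in ("control", "patient")
}


def predict(f):
    """Assign the class whose training mean is closest to the feature value."""
    return min(centroids, key=lambda lab: abs(f - centroids[lab]))


accuracy = sum(predict(f) == y for f, y in test) / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The accuracy is measured only on recordings the model never saw during training, which is the "we test our model" step.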
00:11:34
But in turn, this is what I'm talking about in the second part of the
00:11:37
lecture: within machine learning, how can we really give ourselves the best chance
00:11:41
to train the best model possible? So looking at making sure we have generalisation in our machine
00:11:46
learning models, and the different areas we've got to look out for within here,
00:11:49
and then also talking about different machine learning algorithms that we can actually use as well. Just
00:11:54
a bit of a breakdown of today: it's really the use of computational intelligence
00:11:58
to solve some sort of detection problem. In the case of TAPAS, this is some sort
00:12:01
of speech pathology detection problem. As I said: data collection, preprocessing, feature extraction, machine learning.
00:12:07
I mean, this is a pretty standard track for any particular computational intelligence task, any particular machine learning task.
00:12:13
And as I said, today we're really focusing on feature extraction, extracting useful information from raw data itself,
00:12:19
and machine learning algorithms: learning rules, learning how we find these different patterns in the data, how we break it down,
00:12:25
find the relevant bits of information, and then use them to make a final decision.
00:12:31
But before really getting into them, looking at a very high
00:12:34
level: what are features themselves? What do I actually mean when
00:12:37
I talk about features, what are they, how can we think about this as the bigger, wider topic?
00:12:43
So features themselves are just the representation of the data that we're feeding into
00:12:48
a particular machine learning algorithm, and they really just represent little single pieces of information.
00:12:55
Every single one of these pieces of information is something that the machine learning algorithm
00:12:58
uses to make its actual decision. In my very simple example we used
00:13:04
previous deliveries and days of the week to break down the decision, but we might
00:13:08
have to feed more information into that algorithm, traffic conditions, weather conditions, and so on
00:13:12
and so on, and build a finer and finer and more accurate model. Of course, if we start adding too much information in, these patterns
00:13:18
get too hard to find, so we've got to be careful about what we feed in and exactly how we feed it in.
00:13:24
Typically we might feed in hundreds or thousands of these bits of
00:13:27
information, and we concatenate them together into what's known as a feature vector.
00:13:31
We're going to extract different feature vectors in the labs this afternoon, looking at the openSMILE toolkit
00:13:36
and a few different ways of extracting feature vectors from there, from the
00:13:39
very focused GeMAPS, a very small, condensed
00:13:44
set of features, to the very wide, big ComParE feature set, which is I think six thousand
00:13:49
seven hundred odd features, pieces of information we can possibly feed into a machine learning algorithm.
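
Editor's sketch: a feature vector is built by computing a few per-frame descriptors and then concatenating summary statistics of each one. The sketch below is plain Python and is not the openSMILE API; the two descriptors (RMS energy and zero-crossing rate), the frame sizes and the function names are assumptions chosen for illustration, and a synthetic tone stands in for speech.

```python
import math
import statistics


def frames(signal, size, hop):
    """Split a signal into overlapping fixed-length frames."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]


def rms_energy(frame):
    """Root-mean-square amplitude of one frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose sign differs."""
    crossings = sum((a < 0) != (b < 0) for a, b in zip(frame, frame[1:]))
    return crossings / (len(frame) - 1)


def feature_vector(signal, size=400, hop=160):
    """Summarise each per-frame descriptor by mean and std, then concatenate."""
    fr = frames(signal, size, hop)
    vector = []
    for descriptor in (rms_energy, zero_crossing_rate):
        contour = [descriptor(f) for f in fr]
        vector += [statistics.fmean(contour), statistics.pstdev(contour)]
    return vector


# One second of a synthetic 100 Hz tone at 16 kHz stands in for speech.
sr = 16000
tone = [math.sin(2 * math.pi * 100 * n / sr) for n in range(sr)]
vec = feature_vector(tone)
print(len(vec), vec)
```

Two descriptors times two statistics gives a four-dimensional vector here; large sets reach thousands of dimensions by the same principle, with many more descriptors and statistics.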
00:13:55
The role of the machine learning algorithm in terms of the features here is to really identify patterns, to look through the features,
00:14:00
and find out anything we can learn. What can we learn from it? How can we
00:14:03
continually improve this finding? How can we actually make the call that we need?
00:14:08
So what is machine learning itself? Again at a very high level, it's really the creation
00:14:12
of very robust models. We want to do some sort of prediction, some classification,
00:14:17
get some sort of output, predict our dependent variables from our independent variables, from a particular data set
00:14:23
itself. As I said, we're primarily concerned with identifying and finding these patterns in such
00:14:28
a way that we can target them towards a particular goal, towards a particular task.
00:14:33
We perform this process normally via some sort of iterative updating. We have
00:14:36
some sort of cost function, and we continually improve the
00:14:39
parameters of the algorithm towards this cost function over time, to actually get a better and better estimate. But we want to learn
00:14:46
such that the estimate is not just improving on the data that we're continually
00:14:49
feeding into the algorithm, but such that it's more
00:14:53
deployable on the wider, broader probability distribution of actual features,
00:14:58
based on the actual problem space that we're dealing with.
00:15:01
And there's this idea of learning phases and training phases: we
00:15:04
calibrate our algorithm, adjust the parameters, but we've always got to continually test this
00:15:09
and make sure that it's not overfitting, that it's not just getting very good at recognising a pattern
00:15:14
just in the information we're giving it, but that the pattern holds in the wider set.
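
Editor's sketch: iterative updating against a cost function, followed by a check on data the model never trained on, can be shown with a straight-line fit. The synthetic data, the mean-squared-error cost, the learning rate and the 30/10 split are all assumptions made for illustration; the lecture does not prescribe a particular cost function or optimiser.

```python
import random

random.seed(1)

# Synthetic data from y = 2x + 1 plus noise (purely illustrative).
points = [(x / 10, 2 * (x / 10) + 1 + random.gauss(0, 0.1)) for x in range(40)]
random.shuffle(points)
train, held_out = points[:30], points[30:]


def cost(w, b, data):
    """Mean squared error of the line y = w*x + b on the given data."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


w, b, learning_rate = 0.0, 0.0, 0.05
for step in range(2000):
    # Gradient of the training cost with respect to each parameter.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in train) / len(train)
    grad_b = sum(2 * (w * x + b - y) for x, y in train) / len(train)
    # Move the parameters a small step in the direction that lowers the cost.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

train_cost = cost(w, b, train)
held_out_cost = cost(w, b, held_out)
print(f"w={w:.2f} b={b:.2f} train={train_cost:.4f} held-out={held_out_cost:.4f}")
```

A held-out cost close to the training cost suggests the learned pattern holds beyond the training data; a much larger held-out cost would be the sign of overfitting described above.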
00:15:18
So that's the rough, broad introduction, and I'm going to break it down now into
00:15:23
two particular parts. The first, which we'll focus on before the morning break, is feature extraction:
00:15:28
looking at low-level descriptors, looking at suprasegmental features. I'll also briefly introduce you to
00:15:34
bag-of-audio-words, just another way of organising the data itself,
00:15:38
and then also talk about feature representation learning, so just a little bit
00:15:41
of an introduction to using convolutional neural networks and the role they can play
00:15:46
in actually allowing us to learn features. So the first part is really about hand-crafted,
00:15:51
knowledge-based features, and then we move into how we can learn, how we use
00:15:55
the big advances now in deep neural networks to actually learn useful and relevant features themselves.
00:16:01
And then in the second part, after the morning break, we'll talk about machine learning. We'll
00:16:04
talk about generalisation, then we'll talk about discriminative models, and we'll talk about
00:16:08
generative models themselves, and just really give a brief overview of the
00:16:12
core ideas of a lot of different models, their advantages and disadvantages,
00:16:16
skipping over deep learning and skipping over hidden Markov models a little bit, because you
00:16:20
guys have covered this elsewhere over the course of this week, or will
00:16:23
cover it over the course of this week. So first up, on to feature extraction and on to low-level descriptors.
00:16:32
yeah
00:16:39
uh_huh
00:16:44
So, the features. Again, taking the bigger, broader look, a feature can really be thought
00:16:49
of as some sort of abstract representation of our data, just converting raw data into some other form.
00:16:56
Normally we do this in such a way that it's actually extracting useful information for
00:17:00
the task that we want. When we plan to do this, we
00:17:03
really extract information relevant to the task at hand. If we just took raw data and
00:17:08
fed this into some sort of speech pathology algorithm, some sort of speech
00:17:12
pathology detection algorithm, there are going to be a lot of different confounding factors. Something like
00:17:16
linguistic variability is going to get in the way of this very, very quickly,
00:17:20
and we might just be learning not quite what we want to learn. So we really want to focus our
00:17:23
feature extraction, to extract particular information we know is going to be good.
00:17:28
And there's this idea of reducing redundancies, which is about not feeding unwanted information
00:17:33
into any machine learning algorithm. Machine learning algorithms, essentially, as I said,
00:17:37
look for patterns, but they really look for variations. They're looking for this
00:17:41
variation in the data, and if the variation you're feeding in is wrong,
00:17:45
it can actually confound the decision the machine learning algorithm is going to arrive at. So
00:17:50
a very simple example, even starting at a very broad level, when we're thinking about how to collect speech:
00:17:55
say we collect all our patients in one particular room, sitting in
00:17:58
a very, very small, good soundproof room, with really high-quality audio,
00:18:02
and then we go: okay, I need some control samples, I'm just going to go to the lecture theatre with some microphones
00:18:07
and collect some speech in some sort of open room, with lots of reverb.
00:18:12
We're going to start off on a very wrong foot, already feeding
00:18:15
redundant information in, really, because the
00:18:19
speech recordings are going to be so different in nature. We're just going to learn that they're recorded on microphone
00:18:26
A or recorded on microphone B. These things come through into the feature extraction algorithms.
00:18:31
So reducing redundant information isn't only about feature extraction itself; it's looking at
00:18:36
the whole pipeline: how can we really construct the best algorithm to actually begin with?
00:18:41
The real classic examples of speech features include pitch,
00:18:45
which we'll talk about very briefly, and energy.
00:18:48
We'll talk a little bit about spectral features themselves, and then talk about how we
00:18:52
put this all together in such a way that we can actually learn from it, to do our speech pathology, to do
00:18:57
these paralinguistic feature representations that come up quite a lot of the time in the literature.
00:19:03
But before I start into feature extraction, I find it's just very
00:19:07
useful to do a very quick overview of speech production.
00:19:11
With an understanding of a little bit about speech production, we can understand a little bit
00:19:14
about why we're extracting particular features and what they actually represent within the speech signals themselves.
00:19:20
So speech is a very, very complex action that we do. You guys have
00:19:24
probably had this stressed throughout your years of training so far.
00:19:28
It's actually the most complex action in terms of muscular movements and fine control that we
00:19:33
do; nothing requires more coordination of different muscles, different muscle groups within our body.
00:19:39
Speech always starts with some sort of processing, some sort of
00:19:42
cognitive thought: I have a message, I want to transmit this message.
00:19:46
So that's just the linguistic content. We also process: do I want to stress something in the sentence,
00:19:51
do I want to make a point clear, do I want to emphasise something? So we're looking at putting
00:19:55
the prosodic information in. Within this prosodic information, things like our emotional responses
00:20:00
might get in the way, you know: we might have something like fear coming across, something like anger coming across
00:20:05
over the top, joy, happiness; these get in the way of this prosodic information itself.
00:20:10
We've decided then what we want to say and how we want to say it. Then we need to put this into action:
00:20:14
we need to actually generate all the neuromuscular commands, start to give the muscles directions to produce speech.
00:20:20
The main action always comes in a few parts. The lungs are pumping tiny little bits of air
00:20:26
out, giving us some sort of energy. Then we're either using the vocal folds in an active manner,
00:20:31
where they're vibrating at the fundamental frequency, which puts some sort of tone, some sort of pitch into the speech signal,
00:20:37
and then shaping the vocal tract with the articulators, and this is doing some particular
00:20:42
filtering action. This allows us to shape the sounds, to
00:20:47
distribute them to make very specific speech sounds. This requires a lot of training. If you think about it,
00:20:52
it takes five to ten years to really learn how
00:20:55
to speak properly, to really control these muscles in a way that produces
00:20:58
the same sounds time and time again, and to learn the language
00:21:02
patterns that go with them, to actually emphasise and get this message across.
00:21:06
Speech itself is very, very particular. We have this phonation, this action
00:21:10
of the vocal folds; we have the shaping of the vocal tract, and this
00:21:13
is what's known as articulation, and it's done not just in the vocal tract itself
00:21:17
but also using your lips, teeth, tongue, hard and soft palate, and nasal cavities.
00:21:22
There's quite a lot involved, not just the vocal tract. The vocal tract is really everything from your vocal folds
00:21:27
pretty much to your lips, and all of this out of your nose as well. So
00:21:31
there are quite a lot of things happening when we're actually just producing a single speech sound.
00:21:35
So respiration can be thought of as the power source. This is the energy. What this does is,
00:21:40
if we really want to change how loudly we speak, we have to take more air
00:21:44
in and push more air out. And if we want to
00:21:49
supply pressure, if we want to do very, very long sentences, well, we need
00:21:53
more and more air in our lungs to talk very fast or for a particularly long time.
00:21:57
So it can be thought of as some sort of battery, some sort of power source for the actual speech production itself.
00:22:02
Phonation is the conversion of this source energy coming out of the lungs into
00:22:08
the first bit of speech production itself. As I've already
00:22:12
said, we have this idea of voiced speech production, where we
00:22:15
vibrate our vocal folds at some particular fundamental frequency, and this gives
00:22:20
us a rough shape, a rough sort of sinusoidal shape,
00:22:24
in a very bad way of saying it, for the voiced sounds.
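
Editor's sketch: because a voiced frame is roughly periodic at the fundamental frequency, that frequency can be estimated from how far the frame must be shifted to line up with itself again. The autocorrelation method below is one common approach and is an assumption here, not something the lecture specifies; the synthetic frame, the 60 to 400 Hz search range and the name `estimate_f0` are illustrative.

```python
import math


def estimate_f0(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate the fundamental frequency (Hz) from the autocorrelation peak."""
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag, best_value = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Similarity between the frame and a copy of itself delayed by `lag`.
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag


# Synthetic "voiced" frame: a 125 Hz fundamental plus one weaker harmonic.
sr = 8000
f0_true = 125.0
frame = [
    math.sin(2 * math.pi * f0_true * n / sr)
    + 0.5 * math.sin(2 * math.pi * 2 * f0_true * n / sr)
    for n in range(512)
]
f0 = estimate_f0(frame, sr)
print(f"estimated F0: {f0:.1f} Hz")
```

Real speech is only quasi-periodic and noisy, so practical pitch trackers add voicing decisions and smoothing on top of this basic idea.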
00:22:28
For the unvoiced sounds, we're actually just holding our vocal folds open and rushing air
00:22:33
out of our lungs, forcing it out very, very quickly, then relying on the movement,
00:22:37
the positioning of the articulators, to create different bits of
00:22:42
(I've forgotten the word in English) different bits of constriction in
00:22:48
the actual vocal tract, and these bits of constriction then allow for the different speech sounds
00:22:52
we hear. Think of making a "t" sound: we're really constricting very much at the front and
00:22:56
using the tongue very quickly to get this sort of action. "t", "s", "f"
00:23:00
are all very much examples of unvoiced speech sounds, without the vocal folds playing a role,
00:23:04
but at the same time we're using the articulators to produce different sounds.
00:23:09
This process is articulation: the process of forming speech sounds, recognisable speech sounds,
00:23:14
by the movement of the articulators. We're shaping the vocal tract, and we're using it and shaping it in
00:23:19
such a way that we produce recognisable sounds, recognisable sounds of our language.
00:23:25
This all works in tight coordination together, and this is why speech is such a valuable
00:23:30
marker of a lot of different neurological conditions. There are many different ways
00:23:35
that different conditions, different pathologies can get in and interrupt the speech signal,
00:23:40
and a lot of the time they can interrupt the speech signal in such a way that we can find common patterns
00:23:45
between groups of people, and this allows us to really do the sort of work that we want to do.
00:23:51
So how do we model speech? How do we take this idea of speech production and
00:23:55
throw it into some sort of model, allowing us to extract features
00:23:58
from it, allowing us to understand a little bit about what's going on? The most common way we do this is known as the source-filter model.
00:24:05
So, breaking the speech system down, really the main aspects are respiration, phonation and articulation.
00:24:10
They must function together, and they must function while encoding prosodic information into the speech as well. And the source-
00:24:16
filter model really allows us to explain this, and allows us to put speech production together
00:24:25
as a series of separable linear filters, so that we can take each
00:24:29
alone and model these aspects, which allows us to start extracting features.
00:24:33
I'll display this really just to give an idea of the sheer number of things
00:24:39
that are moving, the sheer number of things that are happening.
00:24:51
uh_huh
00:25:00
The idea is looking at all the different aspects,
00:25:03
everything that's moving within this as he's speaking.
00:25:09
This is actually someone beatboxing. I forgot which one I had up, but this one is really
00:25:13
cool. It's a little off topic, but I'm just putting it up because it's a cool example.
00:25:18
yeah
00:25:31
uh_huh
00:25:35
uh_huh
00:25:38
okay so in the source filter model itself we always have the source in the sauce is essentially the ad being come from
00:25:44
the lungs through the vocal folds themselves already talked about this
00:25:47
uh it's voiced speech we consider the vocal folds active
00:25:50
so we've got some sort of oscillation happening we've got these initial energy pulses that come through and
00:25:55
we're then convolving them with the sort of filtering action of the actual vocal folds opening and shutting
00:26:00
then we have this sort of impulse response this impulse response comes along from the sort of vocal folds i think
00:26:05
i've already rushed through it yeah so it's to do with bernoulli's principle
00:26:09
there's sort of built up pressure on the underside of the vocal folds
00:26:12
that sort of makes them snap open then the elastic aspect of the vocal folds snaps them
00:26:15
shut they close again the air pressure builds up under the vocal folds
00:26:19
we have them open and so on and so on we get this sort of impulse train here
00:26:23
and when this goes through the sort of glottal model we get the excitation signal actually in
00:26:28
voiced speech production itself this model is normally a second order low pass filter
00:26:34
um i hope everybody is familiar with the z transform familiar
00:26:37
enough with the z transform if not it's you know it's really
00:26:40
handy to know 'cause it comes up a lot of the time the z transform is just the sort of
00:26:46
wrap version of the f. f. t. which is replacing it really is that simple is that
00:26:51
that explanation of what's going on within there itself but it's just another way of representing a frequency transform
00:26:57
with the signal itself when it's unvoiced speech we normally have the vocal folds open so within the
00:27:01
source filter model this is always just a random noise generator that's actually generating the different speech sounds itself
00:27:08
the next aspect of the source filter model is of course the filter and this is how we sort of get this
00:27:12
energy signal excitation signal which might just come across sounding like some sort of hum at a particular frequency itself
00:27:19
and then actually allow us to change it into some sort of speech
00:27:22
representation themselves and this is done by altering the shape of the
00:27:25
vocal tract and this produces a very particular set of filter characteristics and
00:27:29
itself the vocal tract we can consider as some sort of um
00:27:34
uneven cross section of tubes and the sort of cross sections
00:27:37
of the tubes change over time we have sort of partial reflections
00:27:41
at each of the different junctions of these tubes
00:27:44
itself these partial reflections essentially allow us
00:27:48
to produce the sort of filtering action so as this sort of vocal tract changes over time
00:28:01
uh_huh
00:28:05
so as this vocal tract tube changes with movements the points
00:28:10
of constriction are changing up and down just producing different sounds
00:28:14
her
00:28:17
yeah
00:28:19
we normally assume so this is the output of the actual glottis it's some sort of low pass
00:28:24
filter response a second order low pass filter response and it's then shaped by some sort of band pass filtering itself
00:28:30
and this gives us the spectrum actual frequency distribution the peaks
00:28:33
being the formant frequencies so the spectral
00:28:37
peaks of the vocal tract spectrum being the formant frequencies themselves so every time we try to build some sort of model
00:28:43
of speech production formants are one of the big things we have to take into account so they're the really dominant peaks
00:28:48
within the vocal tract spectrum itself these dominant peaks are pretty consistent across different speech
00:28:53
sounds itself this allows us to actually create different speech sounds with very particular positions
00:28:58
we generally try to build some sort of vocal tract model only accounting for three four maybe
00:29:04
even five up to five sort of formant positions itself it really depends on how
00:29:09
well we've sampled exactly what we wanna do but it's the sort of positions and holding
00:29:13
the vocal tract in similar positions producing similar formants that allows us to produce recognisable sounds
00:29:18
and languages and pull them all together so how do we actually model this we model this also as second order
00:29:24
resonances so we're just stacking lots of second order filters together
00:29:28
second order low pass cascaded low pass resonators
00:29:32
and ideally we have two poles within this particular model a very sort of quick brief introduction i'll sort of jump over the
00:29:38
sort of filtering aspects of it there so the last part of speech production is always the role of the lips
00:29:44
the role of the lips in the sort of source filter model anyway is to really get the
00:29:49
sort of air pressure that's coming out the shaped pressure that's coming out of our vocal tract
00:29:53
and really just amplify it and push it out into the room and we model this
00:29:56
sort of aspect together as a sort of single pole high pass filter
00:30:00
and then once we've got all these filters we can just jam them together essentially and build
00:30:04
up some sort of vocal tract response some sort of frequency response
00:30:08
of the actual model of the speech production when we look at a single window of speech and this is a
00:30:13
very very short window we can see all aspects pretty much of the source filter model happening themselves
00:30:19
so we always have our impulse response these are the sort of harmonics that we can actually see within the different signals themselves
00:30:25
we always have the vocal tract response we can see the different peaks of the actual vocal tract response
00:30:30
we have some underlying low pass filtering shape as well normally we have a tiny bit more information at low
00:30:35
frequencies than we generally have at high frequencies and this comes across really from the glottal response
00:30:41
which is the second order low pass filter and the lip response which is the first order high pass filter so when we sort of put these together
00:30:47
we get this general first order low pass filtering shape that we can really see in any sort of speech production
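The chain described above (pulse source, second order glottal low pass, cascaded two pole formant resonators, first order lip high pass) can be sketched in a few lines. This is only an illustrative toy, not the lecturer's implementation: the function names, the formant frequencies and bandwidths, the smoothing constant 0.9 and the lip coefficient 0.97 are all assumed values chosen for the example, and plain Python lists are used so the block runs without extra libraries.

```python
import math

def impulse_train(n_samples, period):
    """Idealised voiced source: one unit pulse per glottal cycle."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def one_pole_lowpass(x, a):
    """y[n] = (1 - a) * x[n] + a * y[n-1]; two in series give a second order low pass."""
    y, prev = [], 0.0
    for s in x:
        prev = (1.0 - a) * s + a * prev
        y.append(prev)
    return y

def resonator(x, freq, bandwidth, fs):
    """Second order (two pole) resonator centred on one formant frequency."""
    r = math.exp(-math.pi * bandwidth / fs)             # pole radius from bandwidth
    c1 = 2.0 * r * math.cos(2.0 * math.pi * freq / fs)  # feedback on y[n-1]
    c2 = -r * r                                          # feedback on y[n-2]
    y, y1, y2 = [], 0.0, 0.0
    for s in x:
        out = s + c1 * y1 + c2 * y2
        y.append(out)
        y1, y2 = out, y1
    return y

def lip_radiation(x, alpha=0.97):
    """First order high pass: y[n] = x[n] - alpha * x[n-1] (alpha is an assumed value)."""
    return [s - alpha * p for s, p in zip(x, [0.0] + x[:-1])]

def synthesise_vowel(fs=16000, f0=120.0, duration=0.05,
                     formants=((700.0, 130.0), (1200.0, 70.0), (2600.0, 160.0))):
    """Source -> glottal low pass -> cascaded formant resonators -> lip radiation."""
    n = int(fs * duration)
    source = impulse_train(n, int(round(fs / f0)))
    signal = one_pole_lowpass(one_pole_lowpass(source, 0.9), 0.9)
    for freq, bw in formants:          # hypothetical vowel-like formant values
        signal = resonator(signal, freq, bw, fs)
    return lip_radiation(signal)

speech = synthesise_vowel()
print(len(speech), "samples synthesised")
```

For unvoiced sounds the same filter chain could be driven by random noise instead of `impulse_train`, which matches the random noise generator mentioned for the unvoiced case.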
00:30:53
so we can use this sort of source filter model and the different aspects of sort of speech production voiced
00:30:58
speech production vocal tract responses and really start to get different bits of information
00:31:03
pull different low level descriptors out of the actual speech signal itself
00:31:07
so i've roughly broken this down into sort of three different
00:31:09
aspects itself first there are the um prosodic features themselves
00:31:15
they're really used a lot in paralinguistics to really identify differences
00:31:20
in speaking styles they really give the rhythm
00:31:23
the life the intonation emotion especially sort of arousal emotions
00:31:27
we can really hear within the prosodic features themselves
00:31:31
we have source features that i'll talk about as well and these sort of model glottal flow or actually what
00:31:36
is known as irregular phonation so when we talk about the source filter model
00:31:39
we're always presuming the vocal folds are operating at sort of one hundred percent they always
00:31:45
open they always shut at a sort of rhythmic rate they always sort of shut fully open fully but actually in
00:31:51
essence in real life speech this never really happens this is what's known as irregular phonation we can really hear
00:31:56
some sort of different alterations within what's happening within the sort of glottal
00:32:00
flow and the actions of the vocal folds within the source features themselves
00:32:04
i'll also talk briefly about sort of formant and spectral properties this is really
00:32:08
detailed information of what's happening in the vocal tract what's happening in the
00:32:11
filter what's allowing us to sort of produce and hear these different sounds so we're
00:32:15
just breaking this down so we have a sort of vocal folds
00:32:19
the source features we have the sort of vocal tract which is the formant and spectral features and we have the sort
00:32:24
of prosodic features which sort of sit over the top and really reflect
00:32:28
some sort of differences in how that sounds
00:32:32
really before we start even extracting features from speech we need to think about how we wanna do this
00:32:36
if we extract features say um i want pitch over some five second speech interval
00:32:42
one sort of pitch value yeah it's pretty much meaningless for what
00:32:45
we've got what we've really got is that speech is a highly complex
00:32:50
signal it's very much time varying it varies and changes
00:32:53
very very quickly we change our vocal tract shapes
00:32:58
somewhere between sort of ten to forty milliseconds on average so
00:33:01
we're changing these characteristics changing these filter operations
00:33:04
very very quickly so we need to start sort of breaking the speech down actually doing what's known as windowing
00:33:09
so this is where we just assume that we've got some sort of speech and that our features will change
00:33:15
slightly slower than the sort of speech complex itself we start to break things down to go from an entire utterance
00:33:21
we sort of break this down break this down we break this down to very tiny sort of windows we normally extract these windows
00:33:27
on average for most sort of speech tasks at something like twenty five milliseconds we
00:33:33
overlap them by ten milliseconds so we sort of step the windows across
00:33:36
maybe if we're doing pitch extraction we might extend this out i'll talk a little bit more about that later on during the morning itself
00:33:42
the other reason we sort of break this down into sort of smaller and smaller milliseconds
00:33:47
is really to do with sort of spectral analysis and the idea of fourier transforms and properties of the fourier
00:33:52
transforms again which i'll also go very briefly over as we go into the sort of main aspects of it
00:33:57
so within very very tiny windows of speech we can actually
00:34:01
presume that is make some sort of um
00:34:05
assumption that the signal is periodic this periodicity assumption essentially enables us to do fourier transforms
00:34:11
uh presuming that we're working on some sort of periodic speech signal itself
00:34:16
this enables fourier analysis and fourier analysis is really the basis of most spectral information that we extract
00:34:22
from the speech signals so windowing as i said a typical windowing so
00:34:25
we use a lot within paralinguistics is twenty five milliseconds
00:34:29
and we normally have some sort of shift some sort of overlap by ten milliseconds themselves
00:34:33
one of the standard windows we can use is just the rectangular window where we just take the speech signal in go bang
00:34:39
and we sort of just cut that we take bits of the signal itself we do this time and
00:34:43
time again this is the function i've got here so yeah we have some sort of period
00:34:48
where those have some sort of overlap from this period and we start to extract different bits of information themselves
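As a rough illustration of the rectangular windowing just described, the sketch below cuts a signal into 25 millisecond frames and steps the window start forward by 10 milliseconds each time. Reading the "ten milliseconds" as the step between window starts is an assumption, and `frame_signal` with its parameter names is hypothetical rather than anything taken from a toolkit.

```python
def frame_signal(signal, sample_rate, win_ms=25.0, hop_ms=10.0):
    """Cut a list of samples into overlapping rectangular windows.

    win_ms is the window length and hop_ms the step between window starts
    (assumed reading of the 25 ms / 10 ms figures in the lecture).
    """
    win_len = int(round(sample_rate * win_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    frames = []
    start = 0
    while start + win_len <= len(signal):
        frames.append(signal[start:start + win_len])  # plain cut, no tapering
        start += hop_len
    return frames

# One second of a dummy signal sampled at 16 kHz.
sample_rate = 16000
signal = [0.0] * sample_rate
frames = frame_signal(signal, sample_rate)
print(len(frames), "frames of", len(frames[0]), "samples")
```

Each frame is then treated as one short, roughly stable piece of speech from which a single value of each low level descriptor is computed.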
00:34:55
but actually using a rectangular window starts to bring up some sort of issues
00:34:59
depending on exactly the speech feature we're looking at the information we want
00:35:03
if we're doing rectangular windowing we always sort of end up introducing
00:35:07
some sort of discontinuities we're presuming some sort of periodicity
00:35:11
within the speech windows themselves but we're just randomly sampling it there's no guarantee we're gonna cut where the sinusoidal
00:35:17
signal is reaching the zero crossing points we're always gonna cut somewhere where we're just
00:35:21
sort of introducing so we never have this nice sine wave we're never
00:35:24
cutting and doing the windowing at very nice positions we're always introducing some sort
00:35:29
of discontinuity into this if we're actually using rectangular windows themselves
00:35:34
this generally creates some sort of high frequency content that we may or may not wanna live with within a signal depending
00:35:39
on exactly what we wanna do how we wanna do it so sometimes we use different sort of windowing functions
00:35:44
these windowing functions have normally got some sort of shape some sort of rough gaussian shape to them
00:35:49
they're essentially just doing a bit of tapering so it's sort of bringing the ends down and extracting zero
00:35:54
bits of information there and with this attenuation we're not having the sort of high frequencies
00:35:59
so you really have the maximum amount of information in the centre and sort of bring the edges down different
00:36:03
shapes here are the hamming window the kaiser window the hanning window gaussian windows
00:36:07
they're all about this sort of idea so they sort of dampen the discontinuities
00:36:11
at the actual start and end of the windowing itself and leave sort of the
00:36:14
maximum information in the middle of the signals themselves exactly which
00:36:19
window we choose really just depends on how complex you wanna make the actual signal
00:36:24
how real time you want things how long these sort of processes take
00:36:27
the hamming window the hamming window is generally used quite a lot in different um
00:36:32
tasks themselves i think the hamming window is most likely the one used in most of the default opensmile scripts
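To make the tapering idea concrete, here is a small sketch of a Hamming window applied to one frame. The 0.54 and 0.46 coefficients are the standard Hamming definition; the helper names and the 210 Hz test tone are assumptions made for the example.

```python
import math

def hamming(n):
    """Standard Hamming window of length n: 0.54 - 0.46 * cos(2*pi*i / (n - 1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def apply_window(frame, window):
    """Multiply a frame sample by sample with a window of the same length."""
    return [s * w for s, w in zip(frame, window)]

# A 25 ms frame (400 samples at 16 kHz) of a tone that does not complete a whole
# number of cycles, so a rectangular cut ends well away from a zero crossing.
fs = 16000
frame = [math.sin(2.0 * math.pi * 210.0 * n / fs) for n in range(400)]
tapered = apply_window(frame, hamming(len(frame)))

print("last sample, rectangular cut:", round(frame[-1], 3))
print("last sample, Hamming tapered:", round(tapered[-1], 3))
```

The tapered frame keeps nearly full weight in the middle and only about 8 percent at the edges, which is how the window damps the discontinuity while leaving most information in the centre.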
00:36:47
uh_huh
00:36:50
okay so on to different um
00:36:53
different features and different low level descriptors we call these low level descriptors 'cause we're extracting them at the
00:36:59
sort of window level we're taking these tiny tiny windows and extracting this sort of information here
00:37:04
so the real basic speech feature we can actually take is short term energy so what is
00:37:08
short term energy it's essentially the loudness of the actual speech signal itself that's what reflects
00:37:13
what we're trying to do and we're actually extracting short time energy
00:37:16
we're really tracking the envelope of the actual speech signal itself
00:37:21
so to track the envelope through time we're just using a simple squared function here we're just squaring and summing the values within the
00:37:27
particular window function the window length is particularly important here if we set the window too long we sort of um
00:37:35
essentially create very much a low pass filter we lose a lot of information itself
00:37:39
if we take the window too short we lose a lot of information but as i said generally twenty five
00:37:43
milliseconds for window length really does allow us to sort of track the envelope quite nicely itself
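A minimal sketch of the short time energy computation described above: square the samples in each window and sum them, then step the window along. The function name and the toy signal are assumptions for illustration, and no window taper is applied here.

```python
def energy_contour(signal, win_len, hop_len):
    """Sum of squared samples in each window, stepping hop_len samples at a time."""
    return [
        sum(s * s for s in signal[start:start + win_len])
        for start in range(0, len(signal) - win_len + 1, hop_len)
    ]

# Toy signal: a quiet stretch followed by a louder stretch.
quiet = [0.1, -0.1] * 200   # 400 samples
loud = [0.5, -0.5] * 200    # 400 samples
contour = energy_contour(quiet + loud, win_len=400, hop_len=160)
print([round(e, 2) for e in contour])
```

The values rise as the window slides from the quiet part into the loud part, which is the envelope tracking behaviour the lecture describes.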
00:37:49
and this is a very simple very useful feature especially when we put it
00:37:53
together with something like pitch we can start to actually build even
00:37:56
a simple emotion classifier just using speech just using pitch
00:38:00
values and just using energy values pitch and energy go up for
00:38:03
the sort of positive happy emotions and energy values go down for more negative emotions
00:38:07
so even with very very simple features you can build very crude sort
00:38:11
of classifiers itself pitch detection is sort of relying on um
00:38:17
extracting the sort of fundamental frequency of the actual
00:38:20
speech signal itself so the rate of vocal fold vibration itself
00:38:24
pitch is a sort of perceptual characteristic that we can't really extract or measure
00:38:29
fully itself so when i talk about pitch when i talk about fundamental frequency i'm really talking about the rate
00:38:33
of vocal fold vibration because as the rate of vocal fold vibration changes essentially this perceptual quality of speech
00:38:40
actually changes and that's a little
00:39:14
yeah
00:39:17
so you can see now that as the uh sort of rate of vibration changes
00:39:20
the actual pitch that we hear changes but we're not measuring
00:39:24
exactly the pitch we hear but as i said actually measuring the sort of rate of vibration of the vocal folds themselves
00:39:29
and this is actually one of the most difficult tasks we can do in sort of
00:39:33
speech processing i'm not gonna go too much sort of into the real details
00:39:37
of very complex pitch extraction algorithms today because of the sheer
00:39:41
sort of sheer difficulty we could do a whole lecture on sort of
00:39:45
pitch detection and very good ways of doing pitch detection itself but
00:39:50
the reason it's an exceptionally difficult task is 'cause speech signals have this sort of
00:39:54
quasi periodic nonstationary property to them as i said they're constantly changing
00:39:59
changing at a very very quick rate so this rate of vocal fold vibration
00:40:02
is constantly changing and we have this filter that sits on top
00:40:06
of it too that we've gotta get rid of to actually find and locate and identify this sort of
00:40:11
rate of vocal fold vibration and these filter positions are pretty much infinite in exactly where they could be so
00:40:18
we're trying to really get this very fast accurate information and filter to get rid of this other information itself
00:40:24
and at the same time dealing with natural variations in sort of human voices all the
00:40:29
possible ranges of the temporal structures so it's exceptionally difficult to do very very accurate
00:40:34
sort of pitch detection at the same time in saying this it's very
00:40:38
very easy to do very very crude pitch detection algorithms quite quickly itself
00:40:42
and some of the more crude ways we can actually do this are just using very simple things like uh yeah um
00:40:48
the short time autocorrelation function or the average magnitude difference function this is essentially just correlating the signal with itself
00:40:55
looking at where the peaks are between the correlations and using that as a sort of measure of pitch itself
00:41:00
we only really talk about pitch as a voiced speech characteristic because
00:41:05
it's the only time the vocal folds themselves are actually vibrating itself and we can make this task
00:41:09
a little bit easier when using sort of autocorrelation functions or average magnitude difference functions
00:41:14
by just doing a bit of filtering we know the sort of typical ranges the pitch will appear in
00:41:19
so we don't need to start looking for pitch values up in the thousands of hertz themselves
00:41:23
so we can sort of filter away a lot of the high frequency information remove
00:41:27
this and then sort of try and perform these other ways of doing it
00:41:31
the autocorrelation function we'll also come back and revisit when we talk about linear
00:41:35
prediction analysis and how to actually get the vocal tract information itself
00:41:39
so the autocorrelation is simply the um correlation of the signal with itself and this
00:41:45
is just essentially expressed as a function of sort of time delays itself
00:41:49
and we'd express this as some sort of function of lag the key properties of
00:41:53
the autocorrelation function are that it is even it has a global
00:41:56
maximum at zero itself and this global maximum is always equal to
00:42:00
the energy of the signal itself so the sort of
00:42:03
zero point here is energy and the autocorrelation is periodic
00:42:08
itself for periodic signals that we feed into it
00:42:10
we can use this periodicity to essentially allow us to determine the pitch period from the signals themselves so
00:42:22
what we find when we actually feed in speech signals which are perfectly periodic with
00:42:27
some sort of periodicity itself we can generally find that the uh sorry for perfectly
00:42:32
periodic if we were to have a perfectly periodic signal itself we then find the
00:42:36
first maximum peak that we can actually see within the sort of um
00:42:42
autocorrelation function is the actual pitch period of the signal itself that we're looking for it's
00:42:47
the first harmonic we're able then to use this sort of time difference to work out
00:42:51
the fundamental frequency knowing the sampling frequency of the signal to then work backwards and get the pitch value itself
00:42:56
and this isn't exactly true for speech signals 'cause they're not really periodic itself
00:43:01
but presuming it we can look at the autocorrelation function find the first peak find the next first maximum peak
00:43:07
after this and look at the number of samples between them to actually extract the pitch of the signals themselves
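The procedure above can be sketched as a crude autocorrelation pitch estimator: compute the short time autocorrelation, ignore lags outside a plausible pitch range, take the lag of the largest remaining peak as the pitch period, and convert it to hertz with the sampling frequency. The 75 to 400 Hz search range, the function names and the synthetic test tone are assumptions for the example, and this is not meant to represent any particular toolkit's pitch tracker.

```python
import math

def autocorrelation(frame, max_lag):
    """Short time autocorrelation r[lag] = sum_i frame[i] * frame[i + lag]."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]

def pitch_autocorr(frame, sample_rate, f_min=75.0, f_max=400.0):
    """Crude F0 estimate: lag of the largest autocorrelation peak in the allowed range."""
    min_lag = int(sample_rate / f_max)   # shortest period considered
    max_lag = int(sample_rate / f_min)   # longest period considered
    r = autocorrelation(frame, max_lag)
    best_lag = max(range(min_lag, max_lag + 1), key=lambda lag: r[lag])
    return sample_rate / best_lag        # period in samples -> frequency in Hz

# Synthetic voiced frame: 200 Hz fundamental plus one harmonic, 40 ms at 8 kHz.
fs = 8000
frame = [
    math.sin(2.0 * math.pi * 200.0 * n / fs) + 0.5 * math.sin(2.0 * math.pi * 400.0 * n / fs)
    for n in range(320)
]
print("estimated F0:", pitch_autocorr(frame, fs), "Hz")
```

Restricting the lag range plays the same role as the filtering step mentioned earlier: it stops the search from locking on to values far outside where pitch normally sits.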
00:43:13
the average magnitude difference function is roughly the autocorrelation function but we're doing minuses instead of
00:43:18
multiplications and we're looking for troughs instead of peaks within the same signals itself
00:43:23
so if we have some sort of voiced analysis we have some sort of periodic function
00:43:27
we can see that we have a first peak being the energy of the signal
00:43:31
itself or the first trough sort of again sitting at the zero lag point of the signal
00:43:35
itself we have the next maximum value and it's this sort of sample distance here
00:43:39
the number of samples here which we'll know as it's set by the sampling frequency that we actually extract the signal
00:43:44
at and from there we can extract a very quick crude measure of the actual pitch of the signal itself
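The same idea with differences instead of products can be sketched as follows. The estimator looks for the deepest trough of the average magnitude difference function inside an assumed 75 to 400 Hz range. The slowly decaying test signal is an assumption chosen so that the frame is only quasi periodic, as real speech is; the function names are hypothetical.

```python
import math

def amdf(frame, max_lag):
    """Average magnitude difference: mean of |frame[i] - frame[i + lag]| for each lag."""
    n = len(frame)
    return [
        sum(abs(frame[i] - frame[i + lag]) for i in range(n - lag)) / (n - lag)
        for lag in range(max_lag + 1)
    ]

def pitch_amdf(frame, sample_rate, f_min=75.0, f_max=400.0):
    """Crude F0 estimate: lag of the deepest AMDF trough in the allowed range."""
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    d = amdf(frame, max_lag)
    best_lag = min(range(min_lag, max_lag + 1), key=lambda lag: d[lag])
    return sample_rate / best_lag

# Quasi periodic frame: a 200 Hz tone with one harmonic and a slowly decaying amplitude.
fs = 8000
frame = [
    (1.0 - 0.001 * n)
    * (math.sin(2.0 * math.pi * 200.0 * n / fs) + 0.5 * math.sin(2.0 * math.pi * 400.0 * n / fs))
    for n in range(320)
]
print("estimated F0:", pitch_amdf(frame, fs), "Hz")
```

Because it only needs subtractions and absolute values, this kind of estimator is cheap, but like the autocorrelation version it is a rough measure rather than an accurate pitch tracker.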
00:43:50
again it's not the most accurate one and not one that's used a lot but it gives you an idea
00:43:56
of how we could do it on a very very rough basis if we wanna sort of do more
00:44:01
fine tuned exact pitch detection we probably wouldn't do it in the time domain
00:44:06
to begin with we probably would do it more in the frequency domain
00:44:09
we're looking at things like sub harmonic structures where we start to break the signal down into different groups of harmonics
00:44:14
start to sum these harmonics on top of each other and from this
00:44:17
sort of summation of harmonics a sort of natural dominant harmonic comes up
00:44:22
and so that gives us the idea of where the pitch is in the signal itself we also might use
00:44:26
the cepstrum so this is the inverse fourier transform of the log magnitude spectrum itself
00:44:31
so just taking the log magnitude then taking the inverse transform of that itself and pitch can often come up as a dominant peak
00:44:37
within the cepstrum so we can use these different methods which are roughly more in line with what is done
00:44:43
in essence most of them use one of these two methods to extract the
00:44:46
pitch information this is just a rough idea of what's happening with sub harmonic
00:44:50
summations themselves it's breaking the signal down into different harmonic groups doing
00:44:54
some filtering and extracting harmonics eventually we just add up and
00:44:58
add up these different harmonics of the window itself and the
00:45:01
dominant peak essentially comes out as the actual
00:45:06
pitch of the signal so if we're looking at a straight spectrogram it's very hard
00:45:11
to track the exact pitch contour itself the sub harmonic spectrum really allows us to very very
00:45:17
easily identify the pitch contour itself we're just gonna trace the sort of dominant peaks every time
00:45:23
itself again the cepstrum taking the log of the fourier transform and taking the inverse
00:45:29
and we can find the dominant peak within here correlating again very much with the pitch of the actual signal
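A small sketch of the cepstral approach: take the log magnitude of the Fourier transform of a frame, transform back, and look for the dominant peak in the range of quefrencies that correspond to plausible pitch periods. A naive DFT is written out so the block needs no external libraries (in practice an FFT routine would be used). The toy frame, a decaying pulse train with a 40 sample period, the 75 to 400 Hz search range and all function names are assumptions for illustration.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform, O(n^2); fine for one short frame."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def real_cepstrum(frame):
    """Inverse DFT of the log magnitude spectrum of the frame."""
    n = len(frame)
    log_mag = [math.log(abs(c) + 1e-12) for c in dft(frame)]  # small offset avoids log(0)
    return [
        sum(log_mag[k] * cmath.exp(2j * cmath.pi * k * q / n) for k in range(n)).real / n
        for q in range(n)
    ]

def pitch_cepstrum(frame, sample_rate, f_min=75.0, f_max=400.0):
    """Crude F0 estimate: quefrency of the largest cepstral peak in the allowed range."""
    ceps = real_cepstrum(frame)
    min_q = int(sample_rate / f_max)
    max_q = int(sample_rate / f_min)
    best_q = max(range(min_q, max_q + 1), key=lambda q: ceps[q])
    return sample_rate / best_q

# Toy excitation: one pulse every 40 samples (200 Hz at 8 kHz), each pulse a bit weaker.
fs = 8000
frame = [0.0] * 320
for k in range(8):
    frame[40 * k] = 0.9 ** k
print("estimated F0:", pitch_cepstrum(frame, fs), "Hz")
```

On real speech the low quefrency part of the cepstrum carries the vocal tract shape, which is why restricting the search to pitch period quefrencies separates the source from the filter.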
00:45:36
very rough introduction if you are more interested definitely read the papers on pitch
00:45:42
algorithms pitch extraction is a big open sort of area a field of
00:45:46
study that's always actively ongoing and lots of different ways
00:45:50
of going on there but sort of moving up our vocal tract
00:45:53
moving up our source filter model now looking at different
00:45:58
ideas and different features and the next sort of grouping of features anyway is sort of voice quality features
00:46:03
so as i said we're uh looking at some sort of measures of some sort of
00:46:06
irregular phonation themselves because in the video that i showed yeah
00:46:10
we always make the assumption as i said that the vocal folds are always opening and shutting and always doing this at some sort of
00:46:16
rate and shutting completely but this never really happens in real life and we can sort of hear this in speech signals themselves
00:46:22
so someone with a very very tense sort of very nervous sounding voice that's generally happening 'cause they've got more sort of
00:46:28
rigid vocal folds themselves a sort of slightly robotic sound that you can often hear it's very easy to sort of do it yourself
00:46:34
by just sort of pulling and pinching your vocal tract really tightly sort of tensing the muscles in and around there
00:46:39
at the same time you can hear things like hoarseness you hear things like sort of breathiness
00:46:44
as someone might not be sort of opening and shutting the vocal folds fully and there's sort of air this sort
00:46:49
of noise this sort of rushing through within here itself and we can extract this and look at this
00:46:55
information a lot of different ways within the opensmile toolbox itself the main sort of voice quality features themselves are jitter
00:47:01
shimmer and also the harmonic to noise ratio jitter
00:47:04
uh is really deviations from perfect pitch periodicity
00:47:08
shimmer is deviations in energy and harmonic to noise ratio is pretty
00:47:12
much exactly what it sounds like the harmonic to noise ratio
00:47:15
so looking at the noise and looking at the level of the sort of different harmonics within
00:47:19
it and that's done by the autocorrelation function we can extract this sort of information
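One common way to turn these three ideas into numbers is sketched below. Jitter and shimmer are computed as the mean absolute difference between consecutive cycles divided by the mean value (the "local" definitions), and the harmonic to noise ratio uses a widely used autocorrelation based estimate. Which exact definitions a given toolbox uses is not stated in the lecture, so treat these formulas, the function names and the example numbers as assumptions.

```python
import math

def local_perturbation(values):
    """Mean absolute difference between consecutive values, relative to the mean value.

    Applied to pitch periods this gives local jitter; applied to per-cycle
    peak amplitudes it gives local shimmer.
    """
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))

def hnr_db(r_norm):
    """Harmonic to noise ratio in dB from the normalised autocorrelation peak.

    r_norm is the autocorrelation at the pitch lag divided by the value at lag
    zero, and must lie strictly between 0 and 1.
    """
    return 10.0 * math.log10(r_norm / (1.0 - r_norm))

# Hypothetical measurements from a handful of consecutive glottal cycles.
periods_ms = [5.00, 5.05, 4.95, 5.02, 4.98]     # pitch periods
peak_amplitudes = [0.80, 0.78, 0.82, 0.79, 0.81]

print("jitter :", round(100.0 * local_perturbation(periods_ms), 2), "%")
print("shimmer:", round(100.0 * local_perturbation(peak_amplitudes), 2), "%")
print("HNR    :", round(hnr_db(0.95), 1), "dB")
```

A perfectly regular voice would give zero jitter and shimmer; larger values indicate the irregular phonation discussed above.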
00:47:23
so just a little example of what i mean so we're looking at shimmer being these perturbations in sort of energy
00:47:29
itself and then jitter being perturbations in the pitch period that we might be looking at
00:47:34
within a particular signal and it's very very easy to sort of see and hear differences
00:47:38
in shimmer and jitter within the speech signals
00:47:48
here as you can see we've got very low jitter and very low shimmer values itself
00:47:54
just someone speaking at a very normal sort of low level of
00:48:02
versus sort of shouting we can hear and sort of see
00:48:05
the differences we can see differences in the sort of distribution of the jitter
00:48:09
and shimmer values themselves so that's roughly voice quality features as sort of
00:48:14
provided by the opensmile toolbox again there's a lot more
00:48:17
different ways of looking at the glottal flow and measuring glottal flow very
00:48:20
precisely looking at very different aspects of the opening and closing um
00:48:26
ratios and again voice quality and glottal flow modelling is a whole nother area
00:48:30
of study that if you guys are interested definitely read up more
00:48:33
on and there's some very cool ways we can hear very much differences in sort of this
00:48:37
idea also the differences in voice quality in these sort of productions that we can actually hear
00:48:42
that are really related to the vocal folds opening and shutting and the properties of that
00:48:49
the next big sort of feature representation i wanna talk about is linear predictive coding which itself
00:48:54
isn't really a feature representation we can use them we can use them
00:48:57
as features but it's really the first step towards identifying what the vocal tract spectrum is
00:49:03
and then from there we can sort of go on extract sort of formant frequency information
00:49:07
from the actual vocal tract spectrum itself so linear prediction coding is um
00:49:13
basically i'm sorry see the slide yep used to estimate the vocal tract transfer function itself
00:49:17
and from this we can identify dominant peaks in the shape of the vocal
00:49:20
tract spectrum and essentially it's determining the filter coefficients of an autoregressive system
00:49:25
itself so any sort of system where we have lots of filters essentially
00:49:29
cascaded together and we wanna identify this information and we can use it to do a lot of things
00:49:34
we can take from the original speech signal we do the linear prediction coding we get the filter coefficients of the actual vocal tract
00:49:40
from this we can then take the original signal we can do the filtering we can extract the excitation signal itself
00:49:46
or we can do some sort of form of actually reproducing speech so we can have some sort of filter characteristics
00:49:50
we can feed in some sort of excitation signal and do some sort of speech synthesis
00:49:55
so lots of different reasons we wanna look at the sort of linear predictive coding
00:49:59
it provides a really good model of speech production it provides
00:50:02
a very very accurate shape of the vocal tract spectrum
00:50:05
and it leads to reasonably good source filter separation if we extract a sort of high enough order filter
00:50:11
that we can actually do that we can do some sort of
00:50:13
inverse filtering and the other advantage here is it's analytically tractable and
00:50:18
despite the chunk of maths i'm gonna run you through in the next couple of slides it's actually quite
00:50:23
simple and quite a straightforward implementation to extract the linear predictive coding and actually do it
00:50:29
so the basic idea of linear predictive coding this doesn't just apply to speech this applies to most signals itself
00:50:35
we're essentially saying the current sample in this case the current speech sample can be
00:50:40
approximated as some sort of linear combination of past speech samples itself
00:50:44
so we have a speech sample we have the last bits of information and we have some sort of
00:50:49
linear combination of these and these actually coincide with the vocal tract filter parameters themselves and these are the
00:50:55
coefficients that we wanna estimate and we do this by minimising the mean squared error
00:50:59
between the prediction and an actual short time segment of the speech signal
00:51:03
looking at differences every time so we're doing this minimum sum of squared errors actually just solving this equation here
00:51:10
and that gives us the values for the filter parameters so it's pretty straightforward to do
00:51:14
so essentially we have our formulation so we start and go okay here's my current speech signal
00:51:19
here's my past speech signals and we set some order of the
00:51:22
filtering generally this is set somewhere between ten to fourteen
00:51:26
we then have a prediction error we know our true sample we know what we want to predict to
00:51:31
get this prediction error therefore we can actually make this prediction yeah so this is the actual speech sample
00:51:37
these are the estimated speech samples and then it's a mean squared
00:51:40
prediction task from there so squaring and taking the mean of it
00:51:44
we're differentiating with respect to each coefficient setting it to zero and essentially solving a system of equations here
00:51:50
to actually break it down i'm not gonna solve the system of equations it does sort of come out here and if we
00:51:56
rearrange slightly what we can actually find when we sort of solve and set up the system of equations itself
00:52:02
is that uh we can express it as the autocorrelation function here
00:52:07
so we're essentially trying to find one part of the autocorrelation using an estimated version of another part
00:52:12
of the autocorrelation function this then allows us to directly sort of set this up
00:52:17
as a system of equations so we take a speech signal take the autocorrelation of the speech signal itself
00:52:22
set this up as a system of equations and then essentially we wanna solve for the a matrix
00:52:27
to find these particular values themselves so all the r values come
00:52:32
from the autocorrelation function and then we estimate these filter parameters
00:52:36
what we find the very important property is that it's a toeplitz
00:52:39
matrix so it's symmetric with all the elements along each diagonal being equal itself
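Since the slides are not part of the transcript, the derivation referred to above can be written out as follows. The symbols are assumed notation: s[n] is the speech sample, a_k the predictor coefficients, p the prediction order, e[n] the prediction error, and R(i) the short time autocorrelation at lag i.

```latex
% Linear prediction of the current sample from the p previous samples
\hat{s}[n] = \sum_{k=1}^{p} a_k \, s[n-k], \qquad
e[n] = s[n] - \hat{s}[n]

% Squared prediction error over the analysis window
E = \sum_{n} e[n]^2
  = \sum_{n} \Bigl( s[n] - \sum_{k=1}^{p} a_k \, s[n-k] \Bigr)^{2}

% Setting each partial derivative to zero gives the normal equations,
% with R(i) = \sum_{n} s[n] \, s[n-i]
\frac{\partial E}{\partial a_i} = 0
\;\Longrightarrow\;
\sum_{k=1}^{p} a_k \, R(|i-k|) = R(i), \qquad i = 1, \dots, p

% In matrix form the coefficient matrix is symmetric Toeplitz
\begin{bmatrix}
R(0)   & R(1)   & \cdots & R(p-1) \\
R(1)   & R(0)   & \cdots & R(p-2) \\
\vdots & \vdots & \ddots & \vdots \\
R(p-1) & R(p-2) & \cdots & R(0)
\end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix}
=
\begin{bmatrix} R(1) \\ R(2) \\ \vdots \\ R(p) \end{bmatrix}
```

Every entry of the matrix depends only on the distance from the main diagonal, which is the Toeplitz property that the iterative solvers exploit.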
00:52:43
this allows us to solve the system of equations using sort of various different methods itself again
00:52:50
there's a lot of background information 'cause i've gotta sort of go quickly through a lot of different aspects of speech production
00:52:55
and filtering and excitation sitting behind this so essentially we wanna invert
00:53:01
the autocorrelation matrix itself and then solve so this of
00:53:05
course can be very computationally intensive we're uh
00:53:08
inverting some big matrix of order between ten to fourteen but because it's toeplitz it's got
00:53:12
the symmetry property to it there are various iterative methods that we actually used to solve the
00:53:17
so the doubles algorithm which is a sort of setting up a sort of iterative way
00:53:21
to actually sort of worked through and soul for the first coefficients alter the second
00:53:24
television and so on and so on we pulled up sort of each reduce
00:53:28
solutions here we can use a gradient descent algorithm us into what we go much doing machine learning
00:53:33
some sort of optimisation task to actually allows to solve the solar system of equation and slide
00:53:42
uh i'm gonna skip over exactly how they haven't algorithm than gradient descent algorithms work
00:53:48
but they do work that allows to solve this and then from this model here we can and
00:53:52
sort of build up the vocal tract spectrum we can build up the filter probably themselves
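The lecture skips the details of Durbin's algorithm, so the following is only a minimal sketch of the standard Levinson-Durbin recursion for the Toeplitz autocorrelation system. The function names `autocorrelation` and `levinson_durbin` are illustrative and not taken from the lecture, and the sign convention assumed here is that the predicted sample is the sum of `a[k] * s[n - k]`.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one (already windowed) speech frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations iteratively.

    Returns the predictor coefficients a[1..order] and the final
    prediction error energy. Each pass solves the next-higher order
    using the solution of the previous order, so no explicit matrix
    inversion is needed.
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    error = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for order i.
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= (1.0 - k * k)
    return a[1:], error
```

For a typical speech frame one might call `levinson_durbin(autocorrelation(frame, 12), 12)`, which matches the order range of ten to fourteen mentioned in the lecture.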
00:53:57
so what we can see is that as we change p what essentially happens is we get sort of more and more
00:54:01
detailed information of the vocal tract spectrum so we have some sort of frame some sort of input signal itself
00:54:07
we take some sort of fourier transform and we're getting the short-time frame so this is the spectrum itself
00:54:12
and then with the linear prediction coefficients what we're looking for is the envelope of the vocal tract
00:54:17
so if we start with a sort of very small number of linear prediction
00:54:20
coefficients we'll get a rough very loose sort of shape
00:54:24
you get a little bit of a low pass shape as we start to put more and more increase the order of the vocal tract
00:54:30
model that we're putting into the l. p. c. algorithm we start to get a more detailed shape
00:54:35
we start to see that the frequencies in the spectrum the formant frequencies become
00:54:38
more clear as we sort of go higher and higher in order
00:54:42
the reason we don't go too high in order is essentially we stop
00:54:45
looking at formant information and we start looking at harmonic information itself
00:54:49
if we start to go to orders of sort of twenty we start to see that the sort of harmonic
00:54:54
information that we can see in the spectrum really starts to come into the l. p. c. spectrum itself
00:54:59
starts to get rid of the formant information makes formant tracking more difficult and um
00:55:05
as i say makes formant tracking a more difficult task and we've got the fourier transform for getting really the
00:55:11
short-term spectrum we don't need to do that by some long-winded linear prediction coefficient method as well
00:55:17
so one thing we might use linear prediction coefficients for is to help our pitch extraction task so
00:55:22
it can make a time domain task such as autocorrelation functions and the average magnitude difference functions
00:55:27
a lot easier to do so we can essentially get the error signal from
00:55:32
the l. p. c. algorithm itself and then find the pitch
00:55:35
periods directly from the error signal itself
00:55:39
by looking at the prominent peaks and looking at the sort of autocorrelation function there it essentially will get rid of
00:55:45
all the vocal tract information but retains the sort of key pitch information within the signals
00:55:50
and also from the linear prediction coefficients what we're very interested in is getting the formant frequencies
00:55:56
so essentially the formant frequencies are the dominant peaks within here and
00:56:00
we want to pretty much find the dominant peaks and maxima
00:56:03
and they're determined using some numerical analysis we'll look at the poles and uh sort
00:56:08
of zeroes of the actual function and this allows us then to sort of backtrack
00:56:12
via sort of electrical engineering style analysis and find the different formant positions
00:56:17
formant tracking can be an exceptionally hard task to do we need a good estimate of the l. p. c.
00:56:23
and then we can still have sort of a lot of things between the first and second
00:56:26
formant that can cross over at various different points of time and make this sort of algorithm
00:56:32
a lot harder to run again it's a whole nother lecture on sort of formant tracking so i'm just gonna sort of skip over here
00:56:40
but there's different ways we can actually track and do these algorithms you can
00:56:43
use root finding that's sort of the main one here looking at the peaks
00:56:47
so the last group of low level features that i'll talk about
00:56:51
this morning are the power spectrum so spectral analysis
00:57:00
um so this is just taking the spectrum looking at the
00:57:04
different harmonic structures looking how different frequency information changes from sort of low
00:57:08
frequency to high frequency information in the signals themselves if we sort of build
00:57:12
up this information over time we can really look at temporal
00:57:15
displacements of sort of frequencies and how they interact so the power spectrum itself is just
00:57:20
the log of the discrete time fourier transform as implemented by the fast fourier transform itself
00:57:25
so we take some sort of hamming window and then we essentially just do
00:57:28
the f. f. t. and then after the f. f. t. we can square it and take a log value
00:57:32
and we get sort of the underlying signal here this is the l. p. c. analysis on top of it but you can actually see within the
00:57:39
power spectrum it broadly does follow the formant shape we actually get within itself
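The window, transform, square and log steps just described can be sketched as below. This is an illustrative version only: it uses a naive discrete Fourier transform from the standard library so that it runs without extra packages, whereas in practice an FFT routine would be used. The function names and the decibel scaling are assumptions, not something stated in the lecture.

```python
import cmath
import math


def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def log_power_spectrum(frame):
    """Window one frame, take the DFT, square the magnitude, take the log.

    Returns values in dB for bins 0..N/2. A small constant avoids log(0).
    """
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hamming(n))]
    spectrum = []
    for k in range(n // 2 + 1):
        # Naive DFT bin k; an FFT computes the same values much faster.
        coeff = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        spectrum.append(10 * math.log10(abs(coeff) ** 2 + 1e-12))
    return spectrum
```

With a 256 or 512 sample frame this yields the 129 or 257 spectral values that can be used directly as features or summarised further.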
00:57:44
from this sort of l. p. c. from the sort of formant analysis we can either use directly
00:57:50
the sort of two hundred fifty six or five hundred and twelve features that we
00:57:53
might actually get out that represent the information the spectral information
00:57:56
from the f. f. t. or we might sort of break this down so we can look at aspects such as spectral gradient
00:58:01
look at the overall sort of linear shape the line distribution from low to
00:58:05
high frequency of the actual power spectrum itself
00:58:10
or we might look at different roll-off points where different bits of the energy occur
00:58:13
or the entropy so the higher the entropy the more noise like
00:58:17
the more information is essentially embedded within it itself so the spectral gradient spectral roll-off point
00:58:23
and entropy really give us a lot of information that we can then go and use in sort of
00:58:28
algorithms and this is all extracted within opensmile i'll talk
00:58:31
about as noted i'll talk about opensmile a bit more
00:58:35
in the afternoon or we might take the spectrogram as the features we might be interested
00:58:39
in doing some feature representation learning from spectrograms i'll talk a bit more about that
00:58:43
very very shortly the spectrogram is just essentially taking successive f. f. t. estimates of the actual
00:58:49
signal through a sort of sliding window itself so that's what this algorithm is saying
00:58:53
i'm just sort of taking f. f. t. estimates like this every time and extracting different
00:58:57
sliding over time with the windowing function extracting different energy at different frequencies
00:59:01
using the exponential here is all that's really going on there
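As a rough sketch of the sliding-window idea, the function below moves a Hamming window along the signal and computes a magnitude spectrum for each position. The name `stft_magnitudes`, the frame length and the hop size are illustrative assumptions, and a naive DFT is used again purely to keep the example self-contained.

```python
import cmath
import math


def stft_magnitudes(signal, frame_len, hop):
    """Spectrogram as a list of magnitude spectra, one per window position."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]
    frames = []
    # Slide the window over the signal in steps of `hop` samples.
    for start in range(0, len(signal) - frame_len + 1, hop):
        x = [signal[start + i] * window[i] for i in range(frame_len)]
        frames.append([
            abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / frame_len)
                    for t in range(frame_len)))
            for k in range(frame_len // 2 + 1)
        ])
    return frames
```

Each inner list is one column of the spectrogram; stacking them over time gives the time-frequency image that is later fed to representation learning.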
00:59:06
um when we're doing spectral analysis we've always gotta consider that there's some sort of
00:59:11
trade off going on and this really relates to different properties of fourier transform analysis
00:59:16
so when doing fourier transform analysis and doing spectral analysis essentially
00:59:21
our two parameters of interest are directly related to each other so temporal
00:59:25
resolution and frequency resolution are set by one parameter being the window length
00:59:29
and this then essentially means we have to have some sort of
00:59:33
trade off when performing some sort of spectrogram generation i put this here
00:59:37
a lot of times we skip over it and actually use the sort of same window lengths time and time again
00:59:42
but essentially every time we do spectral analysis we can't really get both good frequency information and good
00:59:48
temporal information because they're controlled by the same parameter there's no ideal window length
00:59:53
as i said we sort of fall into some default standards but it's good to remember
00:59:57
there's a whole lot of theory that sits under here around sort of different properties here
01:00:01
so what can happen is essentially if we choose a very very
01:00:04
long window length we get very very good frequency information but
01:00:08
obviously a longer window gives us very very poor temporal information within
01:00:12
the signal if we choose a very short window length
01:00:15
we can get very good temporal information but we lose the frequency resolution so we've always got this trade off
01:00:21
the trade off really depends on exactly what you want it could be some sort of almost tunable parameter
01:00:26
if you're sort of doing this style of analysis within the actual um machine learning algorithms itself
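The trade-off can be made concrete with a little arithmetic: for a window of N samples at sampling rate fs, the spacing between DFT bins is fs / N while the window spans N / fs seconds. The helper below is a hypothetical illustration, and the 16 kHz sampling rate in the usage note is an assumed example value.

```python
def window_resolution(window_ms, sample_rate_hz):
    """Return (samples, frequency bin spacing in Hz, time span in seconds)."""
    n = round(window_ms * sample_rate_hz / 1000)
    return n, sample_rate_hz / n, n / sample_rate_hz
```

At 16 kHz a 25 ms window covers 400 samples with bins 40 Hz apart, while a 100 ms window covers 1600 samples with bins 10 Hz apart but smears everything that happens within that 100 ms.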
01:00:32
uh when we always talk about spectral analysis mel
01:00:36
frequency cepstral coefficients are probably the most
01:00:39
dominant spectral feature uh i always recommend them as the sort of go to feature uh
01:00:45
in any sort of speech processing task if you really don't know
01:00:48
where to start what to do mel frequency cepstral coefficients are always a good place to start
01:00:53
something like a support vector machine back end is always a good place to start as well you can get a lot of different information from them
01:00:58
um mel frequency cepstral coefficients are used everywhere they're very dominant in things
01:01:02
like automatic speech recognition they're sort of the main features
01:01:05
when we're thinking about deep learning nowadays the mel spectrum is probably the most common sort
01:01:10
of feature representation that's fed into a lot of deep learning algorithms currently
01:01:14
so what are m. f. c. c.s and what do they represent so it's essentially filtering the
01:01:20
signal using the mel filter which is based very much on uh sort of our hearing response as humans
01:01:26
so all this is showing here is that we're listening
01:01:31
with better resolution at lower frequencies in our hearing response than we
01:01:35
have at high frequencies this is represented in the mel
01:01:37
filter by the sort of series of triangular filters essentially you can see that the filters are sort of narrow uh
01:01:43
at the lower frequencies and they get more spread out have a sort of
01:01:46
wider bandwidth at the high frequencies themselves and that's the mel filter
01:01:50
so when we're forming m. f. c. c.s itself it's just really
01:01:53
a matter of making the features more and more sort of relevant
01:01:57
trying to lose redundant information from the spectrograms and make it sort of more relevant to speech processing itself
01:02:03
so we start with the extraction of the spectrogram extraction of the magnitude spectrum
01:02:08
so we take the fast fourier transform of the speech signal itself we take the magnitude of this
01:02:13
we take the logarithm of this this allows us to sort of separate excitation and vocal tract
01:02:17
allows us in a rough way to have some sort of distribution towards human loudness perception on a log scale itself
01:02:23
we then do after the first lot a sort of information reduction itself so we might have taken
01:02:28
an f. f. t. of two hundred fifty six points or five hundred and twelve points
01:02:32
and then we sort of filter this down to forty points using triangular filters equidistant
01:02:37
over the mel scale itself and this gives us the mel band spectrogram itself
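To illustrate what "equidistant over the mel scale" means, the sketch below places filter centre frequencies evenly in mel and converts them back to hertz. The conversion formula used here (2595 log10(1 + f / 700)) is one common choice and is an assumption, since the lecture does not state a formula; the function names are likewise illustrative.

```python
import math


def hz_to_mel(f_hz):
    """One common Hz-to-mel mapping (assumed here, not given in the lecture)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)


def mel_centres(n_filters, f_min_hz, f_max_hz):
    """Centre frequencies (Hz) of triangular filters spaced evenly in mel."""
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]
```

Calling `mel_centres(40, 0, 8000)` gives forty centres that sit close together at low frequencies and spread out towards high frequencies, which is the narrow-to-wide filter pattern described above.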
01:02:41
as i said the mel band spectrogram is often fed a lot into different deep learning
01:02:44
algorithms within c. n. n.s we use it a lot but the
01:02:49
mel band still has a sort of lot of redundant information in it we've got
01:02:53
rid of a lot of the high frequency information but it still can
01:02:56
be sort of compressed and we can really drill this sort of information further
01:03:00
down so we decorrelate it further using the discrete cosine transform
01:03:04
from that we take the first twelve coefficients we generally replace the zeroth coefficient with some
01:03:09
sort of energy representation to get out the mel frequency cepstral coefficients which
01:03:13
don't look like much when you actually just got them on the screen but when you
01:03:16
feed them into algorithms you find they're a particularly very very sort of powerful representation
01:03:21
probably the most dominant as i said speech feature that we have
01:03:25
and use sort of across all aspects of speech processing themselves
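The final decorrelation step can be sketched as follows, assuming the log mel band energies for one frame are already available. The unnormalised DCT-II used here and the helper names are assumptions for illustration; toolkits differ in scaling and in how the energy term is computed.

```python
import math


def dct_ii(values, n_coeffs):
    """Unnormalised type-II discrete cosine transform, first n_coeffs terms."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n_coeffs)]


def mfcc_from_log_mel(log_mel, log_energy, n_ceps=12):
    """Keep coefficients 1..n_ceps and put an energy term in place of c0."""
    ceps = dct_ii(log_mel, n_ceps + 1)
    return [log_energy] + ceps[1:]
```

With forty log mel energies per frame this yields a thirteen-dimensional vector: one energy term plus the first twelve cepstral coefficients.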
01:03:29
typically what we do then is take the mel frequency cepstral vector itself
01:03:34
and then as i said we've often put log energy in and then we take
01:03:37
the delta and delta delta coefficients so this is essentially looking at how different distributions
01:03:43
change over time so we have a frame and we look at how it sort of
01:03:47
changes from the frame before it and delta delta coefficients are the change of the change
01:03:51
so we stack these normally together so some sort of twelve or thirteen once we've got log energy there for the coefficients so
01:03:57
we stack the delta stack the delta delta and it's thirty six
01:04:00
coefficients that generally represent one of the best sort of most powerful
01:04:05
low level descriptors that we have within speech processing nowadays a. s. r. is very
01:04:09
heavily based on the mel spectrum or mel frequency cepstral coefficients themselves
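A minimal sketch of the stacking described above, using the simplest possible delta: the difference between a frame and the frame before it. Real toolkits often compute deltas with a regression over several neighbouring frames, so treat this as an assumption-laden illustration; the function names are hypothetical.

```python
def delta(frames):
    """Frame-to-frame difference; the first frame gets a zero delta."""
    out = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        out.append([c - p for p, c in zip(prev, cur)])
    return out


def stack_with_deltas(frames):
    """Concatenate static, delta and delta-delta features per frame."""
    d = delta(frames)
    dd = delta(d)  # the change of the change
    return [f + a + b for f, a, b in zip(frames, d, dd)]
```

Twelve static coefficients per frame become thirty six values per frame after stacking, in line with the count given in the lecture.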
01:04:15
okay so there's two more little things that i'll cover quickly before we break for this morning
01:04:23
three things i'll cover quickly before we break for the morning and this is sort of how we go from these low level
01:04:29
features and start to form something that we're actually feeding into
01:04:31
machine learning algorithms a lot of the time within speech pathology itself
01:04:35
we're not really interested in the low level descriptors we're not interested in this sort of very very quick
01:04:40
change because this often correlates a lot to the sort of linguistic
01:04:45
content we're often interested more in how these distributions vary over the course
01:04:49
of a chunk of the speech signal or even in terms of the whole utterance of an actual speech signal itself
01:04:54
this is the idea of suprasegmental feature analysis this is essentially saying we have some sort of
01:05:00
we create essentially a single fixed length vector uh which describes the sequences of short term features
01:05:06
or as i said their evolution over time over the course of
01:05:10
some sort of window maybe five seconds or a whole utterance itself
01:05:13
so we start with our frame levels we have our frames of twenty five milliseconds overlapped by ten
01:05:18
milliseconds we extract our l. l. d.s from all these frames essentially what we're doing here
01:05:23
is creating this sort of single fixed length vector here that's summarising all these l. l. d.s
01:05:27
using some sort of statistical functional measures like mean
01:05:31
variance standard deviations roll-off points all sorts of things we use to actually do this
01:05:36
really we can get very fixed length representations the real power of this sort of
01:05:41
suprasegmental features is sort of two things one the
01:05:45
information that we're looking for a lot in speech pathology isn't embedded
01:05:49
in the actual short timeframe information it's more embedded in the sort of distribution over the utterance
01:05:54
of the sort of information here so that's what we're capturing using the statistical functionals
01:05:58
and also it provides this sort of single fixed length vector regardless of the utterance length
01:06:03
so it means essentially we're creating a similar vector uh for every single one of our data points in our cohort
01:06:09
we're really feeding in the sort of same information the same amount of information into machine learning algorithms
01:06:16
we've not got the sort of weighting that we'd have to do with low level features when sort of creating
01:06:20
estimates every twenty five milliseconds and summing them up somehow we're really discounting the utterance length
01:06:27
so it doesn't quite matter what small bits of variation the utterance length
01:06:30
has we're getting a single fixed length vector out at the same time
01:06:34
so what do i mean by statistical functionals as i've already said we'll
01:06:37
have things like means moments extremes percentiles slopes regression lines
01:06:42
any sort of summarising information that we can so we start with some sort of feature distribution
01:06:47
we might look at the mean we might look at the standard deviation different spots within here different bits of information we can extract
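As an illustration of statistical functionals, the sketch below summarises one low level descriptor contour (for example a pitch or energy track of any length) into a fixed set of numbers. The particular functionals chosen and the function name are assumptions for the example; toolkits such as openSMILE offer a far larger set.

```python
import statistics


def functionals(contour):
    """Summarise a variable-length contour with a fixed set of statistics.

    Assumes the contour has at least two values so a slope can be fitted.
    """
    n = len(contour)
    mean_x = (n - 1) / 2
    mean_y = statistics.fmean(contour)
    # Least-squares slope of the contour against the frame index.
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in enumerate(contour))
             / sum((x - mean_x) ** 2 for x in range(n)))
    return {
        "mean": mean_y,
        "std": statistics.pstdev(contour),
        "min": min(contour),
        "max": max(contour),
        "median": statistics.median(contour),
        "slope": slope,
    }
```

Whether the contour has fifty frames or five thousand, the output always has the same six entries, which is what makes the representation fixed length.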
01:06:54
this allows us to generate very very big feature spaces very very quickly this is what happens in
01:07:00
opensmile this is what we sort of do here we have our n acoustic features
01:07:05
and these are referred to as low level descriptors often too
01:07:08
we'll extract the sort of delta and delta delta coefficients
01:07:11
and then we use these m functionals to summarise these l. l. d.s out into this sort of utterance level
01:07:16
so we can generate very very large feature spaces and these feature spaces
01:07:20
contain a lot of rich information relating to sort of
01:07:23
speech pathology emotion whatever our particular task is within paralinguistics
01:07:27
essentially what's going on how it's being said really is embedded a lot in these very sort of rich representations
01:07:33
themselves and this is often known as brute forcing of large feature spaces so
01:07:37
starting with our low level descriptors starting to put in delta and delta deltas
01:07:42
in there itself and then starting to use the functionals themselves to come up with a very very rich very sort of wide
01:07:49
distribution and how we extract these there's a little bit more information on different feature representations
01:07:54
again using this you guys will go about and start to extract
01:07:57
some information regarding these in the tutorial this afternoon
01:08:02
so one of the ways we can extract sort of utterance level representations is
01:08:06
by using what's known as bag of audio words which is sort of um
01:08:10
a little bit of advertising mainly for the group and the work we do so bag of words
01:08:14
itself is very much a linguistic representation that comes from natural language processing
01:08:20
so we're looking at a document and then we're looking to sum up the document or the
01:08:23
document's instances as distributions of words essentially a sort of histogram
01:08:28
of occurrences is essentially what's happening here so i have some sort of document the cat
01:08:32
is on the table we might have some sort of codebook or
01:08:36
dictionary and then we're looking at the different frequencies that occur within
01:08:39
this codebook and these frequencies are essentially distributions that we can use
01:08:44
as a sort of feature representation that we actually can feed into some sort of machine learning algorithm
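The text version of the idea fits in a few lines. The sketch below counts how often each codebook word occurs in the lecture's example sentence; the codebook contents and the function name are made up for illustration.

```python
from collections import Counter


def bag_of_words(document, codebook):
    """Histogram of codebook word occurrences in a document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in codebook]
```

For example, `bag_of_words("the cat is on the table", ["the", "cat", "dog", "table"])` counts "the" twice, "cat" and "table" once each and "dog" not at all, and that vector of counts is what gets fed to the learning algorithm.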
01:08:49
so we look at these different histograms every time different features different occurrences and feed this information in these can
01:08:55
come from different sort of groups this could be glottal features this could be spectral features
01:09:00
could be prosodic features so on and so on we can make up very very rich sort of histograms
01:09:05
uh we developed software in the group called um openxbow
01:09:11
and this is just a toolkit for doing multimodal bag of words formations it's based in java and it's really
01:09:17
quick and easy to do the main steps are some sort of normalisation of the l. l. d.s some sort of codebook generation
01:09:23
vector quantisation and maybe some sort of post processing on the actual vector quantisation just normally
01:09:28
to normalise to different time lengths within the actual utterances themselves
01:09:34
so bag of audio words can offer sort of different advantages to sort of
01:09:39
suprasegmental feature analysis themselves um one of the core advantages
01:09:43
is robustness we're always doing quantisation against some sort of codebook
01:09:47
this allows us to sort of account a little bit for noise that might occur for small perturbations in the input data itself
01:09:53
and we've found within our group when working on in the wild data real world data
01:09:59
we can sometimes get better results when instead of using functionals to sum up an utterance
01:10:03
we use this sort of bag of audio words approach to sum up utterances themselves
01:10:07
time invariance again is one of the sort of big advantages of using it so we're getting a fixed
01:10:11
length representation regardless of time we can normalise these distributions according to time as well
01:10:17
multimodal fusion is one good aspect of this because a lot of the times we can have speech
01:10:21
linguistics and video information extracted at very much different sorts of timing intervals
01:10:26
but again we're sort of expressing things as some sort of
01:10:31
utterance level representation and just looking at histograms allows us to do very
01:10:34
simple fusion itself and privacy is one very big aspect that works quite well with sort of bag of audio words
01:10:41
as i said spectral features contain large amounts of linguistic information
01:10:45
histograms contain no amount of linguistic information within them itself
01:10:49
so it's an irreversible mapping you're just mapping occurrences of what's going on here there's no way
01:10:53
you can actually reproduce or recover what was said in the actual utterance itself
01:10:59
bag of words formation is very very easy so uh we always start with
01:11:03
audio instances l. l. d. extraction and then we're generally choosing a
01:11:06
codebook and we're doing this bagging thing via vector quantisation which i'll
01:11:10
go over in a minute with the histogram construction
01:11:14
being the final step the key parameters we normally have are codebook
01:11:18
size and number of assignments i'll talk a little bit more about this
01:11:21
so then we start here with some sort of codebook and this is just some example of some sort of
01:11:27
training set that we wanna quantise against so we have a codebook
01:11:31
and we have a feature set the quantisation is essentially going
01:11:35
what is the nearest codeword my features match to so this could be
01:11:39
just m. f. c. c. data and then we're going okay
01:11:42
using euclidean distance what is essentially the closest distribution
01:11:46
in my codebook and we're finding that and then we're just assigning a histogram
01:11:50
count we do this over time over the course of the actual full
01:11:54
sort of sample that we're doing and we can find we can see these sort of frequency distributions and from here we
01:12:00
can then very very easily find term frequencies over different windows if we want or over the course of the full
01:12:06
sort of utterance itself and find different distributions there and this just quantises
01:12:11
the windows against the actual codebook as actual term frequencies themselves
01:12:16
so we could also do multiple assignments so this is where we're saying instead of assigning
01:12:21
to the nearest codeword we assign to the nearest n words in the codebook itself this is
01:12:26
generally what we wanna do and one of the sort of core parameters when you're doing bag
01:12:29
of audio words is setting your number of assignments you generally want a sparse representation
01:12:34
but a sparse representation of one is probably too sparse there's not enough information there just playing
01:12:39
around with the number of assignments you can find that can actually give you
01:12:43
very much a different sort of set of results as you actually increase it
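The quantisation and histogram steps, including multiple assignments, can be sketched as below. This is not the openXBOW implementation; it is a hypothetical illustration in which each frame is a plain list of feature values, the codebook is a list of such vectors, and `n_assignments` plays the role of the number-of-assignments parameter mentioned above.

```python
import math


def bag_of_audio_words(frames, codebook, n_assignments=1):
    """Histogram of codeword counts for one utterance.

    Each frame votes for its n_assignments nearest codewords
    (Euclidean distance), so the output length equals the codebook
    size regardless of how many frames the utterance has.
    """
    histogram = [0] * len(codebook)
    for frame in frames:
        ranked = sorted(range(len(codebook)),
                        key=lambda i: math.dist(frame, codebook[i]))
        for index in ranked[:n_assignments]:
            histogram[index] += 1
    return histogram


def term_frequencies(histogram):
    """Normalise counts so utterances of different lengths are comparable."""
    total = sum(histogram)
    return [count / total for count in histogram]
```

Increasing `n_assignments` makes the histogram less sparse, which is the knob the lecture suggests experimenting with.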
01:12:47
so the last thing i'll talk about very briefly and i'll skip over some of this this morning as we're supposedly getting towards break time
01:12:54
here is the sort of feature representation learning so the main
01:12:59
low level feature descriptions suprasegmental feature descriptions are very much
01:13:03
based on knowledge as i've sort of talked a lot this
01:13:05
morning and i've skipped over quite a lot of things as well there's a huge amount of information
01:13:10
a huge amount of information behind a lot of pitch features m. f. c. c. features spectral features
01:13:17
based heavily in electrical engineering uh techniques used to develop them very much knowledge based on
01:13:22
the sort of source filter operation extracting very particular aspects of speech production
01:13:27
one of the marvellous things that's sort of come out of deep learning is what we can do now is essentially go i
01:13:33
don't really care about embedding that knowledge i want to you know find a
01:13:37
representation from my data that's very very specific to the task i'm actually doing
01:13:43
and things like convolutional neural networks allow us to do this so feature representation learning
01:13:47
is learning features essentially directly from the data or from some sort of
01:13:51
high level representation of the data something like a spectrogram and
01:13:56
just sort of targeting it to our actual task at hand going
01:14:00
okay i've got a particular task i've got my data itself
01:14:03
get me a representation that's absolutely perfect for this task as a lot
01:14:07
of the times we use m. f. c. c.s for absolutely
01:14:10
everything you can sit there and you can use m. f. c. c.s for absolutely everything with a support vector machine
01:14:15
and someone said to me you always get somewhere between sixty to eighty percent
01:14:19
accuracy when using m. f. c. c.s and a support vector machine for any given task
01:14:23
you're never gonna get a hundred percent accuracy using something like this because of
01:14:27
all the extra information there's so much information embedded in speech so much variability
01:14:32
that it's very very hard to get high levels of accuracy for a particular task in speech pathology learning
01:14:38
because of the sort of confounding information in here one possible way
01:14:43
with the sort of caveat here that we have absolutely enough training data is to actually
01:14:48
really extract the information properly using the convolutional neural networks themselves so feature
01:14:53
representation learning is about doing targeted feature extraction towards a particular end goal itself
01:14:59
and this is mainly done using convolutional neural networks they're a special form of feed forward neural networks
01:15:04
and essentially just do the convolutional operation time and time again at different levels
01:15:08
and sort of targeting the feature extraction towards that task itself
01:15:11
and these convolutional kernels get reused time and time again we're essentially learning the weighting the
01:15:16
importance of different convolutional kernels towards the particular task that we might actually have
01:15:22
convolution as uh sort of a refresher or an introduction to
01:15:27
people who might not be familiar with it is essentially a manipulation
01:15:30
operation we're just taking two signals and forming a third signal itself we're
01:15:34
essentially doing an infinite summation over some sort of convolutional kernel
01:15:38
with the actual signal it's denoted normally within here as a star operation itself
01:15:44
when we think convolution the best way to think of any convolutional
01:15:47
operation is always three particular words being flip shift and multiply
01:15:50
so essentially we have a signal we have some sort of convolutional filter we're essentially flipping one of them uh and then just
01:15:57
shifting it through uh the signal itself and taking a series of dot products
01:16:01
to actually form the convolutional operation and that's what's going on when we're doing images it's the same
01:16:06
thing we have a two d. representation we just flip it and then we just shift it
01:16:11
over the x. and y. dimensions within a particular image itself and again three d. it goes up convolution has
01:16:17
got n dimensionalities itself essentially this is all this sort of maths is showing here i've
01:16:22
taken a convolutional operation uh i'm just really expressing it in terms of a
01:16:26
set of inner products essentially this is what's happening here a time reversed version
01:16:31
and we're doing a shift and multiply operation to actually get the output itself
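Flip, shift and multiply can be written out directly for the one-dimensional case. The sketch below is a plain illustration of discrete convolution with zero padding at the edges; the function name is an assumption.

```python
def convolve(signal, kernel):
    """Full 1-D convolution: flip the kernel, shift it, take dot products."""
    flipped = kernel[::-1]                       # flip
    pad = len(kernel) - 1
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    output = []
    for shift in range(len(signal) + pad):       # shift
        window = padded[shift:shift + len(kernel)]
        output.append(sum(f * p for f, p in zip(flipped, window)))  # multiply
    return output
```

Each output sample is an inner product between the time-reversed kernel and a slice of the signal, which is the reading of the maths given in the lecture.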
01:16:35
so what's sort of happening here within the different operations when we do sort of feature representation learning itself
01:16:41
essentially my animations have gone so i'll just skip to the end
01:16:45
so we're performing convolutional operations themselves and this is using a
01:16:50
set of filters to identify patterns within the signals and the network sort of learns
01:16:54
the weights associated with each of these filters and similar patterns may occur in multiple regions
01:16:59
itself so it learns the importance of these different features towards the task at hand
01:17:03
then we do some sort of normally down sampling operation and this is generally done by max pooling
01:17:08
so this is just saying in a particular uh frame i'm interested
01:17:11
in just the actual maximum value that comes out of here
01:17:14
and this just allows us again sort of invariances in translation a little bit of noise robustness um
01:17:22
we're just a little bit more robust to noise itself and then we can do the operation
01:17:26
time and time again we can repeat many times convolution max pooling convolution max pooling
01:17:30
so on and so on we get the sort of very rich very um a lot of
01:17:34
different filter outputs different feature maps out then we essentially flatten the feature maps
01:17:39
we do a classification this really targets everything towards the end we do our final prediction
01:17:44
then generally what you find is that as you stack these convolutional operators
01:17:48
the high level operators generally reflect sort of high level information and the more we get down the more targeted the
01:17:55
actual features we've extracted are towards the particular task that we're
01:17:58
interested in doing themselves so first step being convolution
01:18:03
so we might have different filters and we essentially just pass these filters over the signal and
01:18:08
get out a different feature map for each filter itself then again same again building up a sort of rich
01:18:13
feature map learning the sort of weights associated with these filters themselves then we do our information reduction
01:18:19
we wanna pick the maximum output when doing max pooling within a small neighbourhood itself
01:18:24
so we actually see there we reduce them and we're just essentially going
01:18:28
in and finding the maximum within these different operations themselves
01:18:31
reducing information and being a bit more invariant towards noise itself
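Max pooling over a one-dimensional feature map can be sketched in one line. The non-overlapping neighbourhoods and the function name are assumptions for this illustration; pooling layers in practice also allow overlapping strides and two-dimensional windows.

```python
def max_pool(feature_map, size):
    """Keep only the maximum of each non-overlapping neighbourhood."""
    return [max(feature_map[i:i + size])
            for i in range(0, len(feature_map) - size + 1, size)]
```

Pooling `[1, 3, 2, 5, 4, 0]` and `[3, 1, 2, 5, 0, 4]` in pairs gives the same result, which shows the small shift invariance the lecture refers to: the peak can move within its neighbourhood without changing the output.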
01:18:36
and we sort of repeat this over time sometimes we might inject some non-linearities in there
01:18:41
if we're interested in doing that and we just sort of do this operation time and time again
01:18:45
convolution max pooling convolution max pooling so on and so on as i said we essentially learn the features
01:18:50
we can flatten our final set of feature maps out this is now sort of
01:18:54
a fully connected layer we can pass that maybe through some sort of feed
01:18:57
forward neural network some sort of softmax layer uh some mapping
01:19:00
essentially to a probability distribution and then find our sort of general output itself
01:19:06
So this is sometimes done in so-called end-to-end networks,
01:19:10
and this is sort of the core function of an end-to-end network.
01:19:13
So an end-to-end network is essentially something where we feed raw data in and we
01:19:17
get the prediction out. In speech, we find this is done using a couple of
01:19:22
convolutional layers, and we normally feed this into two recurrent neural network layers
01:19:26
to learn some sort of temporal dependency within the operation itself.
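
A rough sketch of that arrangement: convolutional filters turn the raw signal into a sequence of feature frames, and a recurrent layer carries a hidden state across those frames. To stay short, it uses one convolution stage and a single simple (Elman-style) recurrent layer instead of the stacked layers mentioned above, and every weight and signal value is an assumption chosen for illustration.

```python
import math


def conv1d(signal, kernel):
    """Valid-mode sliding dot product, as in a convolutional layer."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def rnn_forward(frames, w_in, w_rec, bias):
    """Simple recurrence: h_t = tanh(W_in x_t + W_rec h_(t-1) + b)."""
    hidden = [0.0] * len(bias)
    states = []
    for x_t in frames:
        hidden = [
            math.tanh(
                sum(w * x for w, x in zip(w_in[i], x_t))
                + sum(w * h for w, h in zip(w_rec[i], hidden))
                + bias[i]
            )
            for i in range(len(bias))
        ]
        states.append(hidden)
    return states


# Raw input samples (hypothetical values).
raw = [0.0, 0.2, 0.5, 0.1, -0.3, -0.4, 0.0, 0.3]

# Convolutional front end: two made-up filters give two feature maps.
kernels = [[0.5, 0.5], [-1.0, 1.0]]
maps = [conv1d(raw, kernel) for kernel in kernels]

# Regroup the maps into one two-dimensional feature frame per time step.
frames = [list(frame) for frame in zip(*maps)]

# Made-up recurrent weights; a trained network would learn these.
w_in = [[1.0, 0.0], [0.0, 1.0]]
w_rec = [[0.5, 0.0], [0.0, 0.5]]
bias = [0.0, 0.0]

states = rnn_forward(frames, w_in, w_rec, bias)
print(states[-1])
```

Because each hidden state depends on the previous one, the last state summarises the whole sequence and not just the final frame, which is the temporal dependency the recurrent layers are there to capture.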
01:19:30
We've done a lot of work with this in the chair, and we found that we had some pretty cool results
01:19:34
when we started looking at the activations of the different convolutional layers themselves. So we thought, it's not gonna,
01:19:39
you know, it's just gonna be random, it's gonna learn something, but what we actually found when we
01:19:43
looked at different correlations was that the convolutional neural networks are doing this sort of feature extraction.
01:19:49
Again, not all the sort of parameters found something useful, but we
01:19:53
actually did find activations that correlated very well to
01:19:56
energy, correlated very well to loudness, and correlated very well to pitch
01:19:59
itself. And this is really doing an emotion, um,
01:20:03
an emotion classification task, and generally loudness, energy and pitch are what we use quite a lot in emotion;
01:20:10
they're very highly related to arousal. So this is a really cool result that we found out;
01:20:14
this is only using end-to-end learning and looking at the different activations.
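
One way to run that kind of check is to compute a Pearson correlation between a unit's activation over time and a hand-crafted feature such as frame energy. The sketch below shows the idea only: the signal, the frame length and the two activation tracks ("unit_a", "unit_b") are made-up values, not results from the work described.

```python
import math


def frame_energy(signal, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in signal[i:i + frame_len]) / frame_len
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy signal and its per-frame energy (hypothetical values).
signal = [0.1, -0.1, 0.5, -0.4, 1.0, -0.9, 0.3, -0.2]
energy = frame_energy(signal, 2)

# Made-up per-frame activations of two units in a convolutional layer.
activations = {
    "unit_a": [0.05, 0.4, 1.9, 0.2],  # rises and falls with the energy
    "unit_b": [1.0, 0.2, 0.9, 0.1],   # no clear relation to the energy
}

correlations = {
    name: pearson(track, energy) for name, track in activations.items()
}
for name, r in correlations.items():
    print(name, round(r, 3))
```

A unit whose activation tracks the energy contour gets a coefficient close to 1, while an unrelated unit stays much lower; the same comparison could be made against loudness or pitch contours.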
01:20:19
Okay, that finishes up nicely for the first part of the morning, so, uh, we'll
01:20:22
take a coffee break now and then we'll come back and talk about machine learning and
01:20:27
back to more conventional machine learning methods after the break itself. Thank you for your time.

Conference Program

ML for speech classification, detection and regression (part 1)
Nick Cummins, Universität Augsburg
Feb. 13, 2019 · 9:04 a.m.
478 views
ML for speech classification, detection and regression (part 2)
Nick Cummins, Universität Augsburg
Feb. 13, 2019 · 10:59 a.m.
104 views
Quiz
Nick Cummins, Universität Augsburg
Feb. 13, 2019 · 11:59 a.m.