ML for speech classification, detection and regression (part 1)

Transcriptions

Note: this content has been automatically generated.

00:00:01

um welcome today's lecture is uh actually just a bit of revision mainly for everybody in the room

00:00:08

we're really gonna be talking about speech processing and just a basic introduction to a lot of the features

00:00:13

that we commonly use in paralinguistics that we commonly use in the openSMILE toolkit and we'll reinforce that

00:00:19

within the tutorial this afternoon by really learning how to extract a few of these and then

00:00:24

a bit of an introduction to machine learning i'll spend a bit of time talking about

00:00:28

sort of generalisation and what are the uh sort of why and how we set up a machine

00:00:33

learning problem and then also introduce you guys to a lot of the sort of conventional

00:00:38

pre deep learning machine learning techniques that i use quite a lot

00:00:42

i'll spend a little bit of time on support vector machines 'cause that's what we'll sort of focus on again in the lab this afternoon

00:00:49

so who am i why am i up here talking to you guys uh

00:00:53

i'm doctor nicholas cummins but please just call me nick i'm a

00:00:56

habilitation candidate at the chair of embedded intelligence for health care

00:01:00

and wellbeing at the university of augsburg in germany

00:01:04

so habilitation doesn't really make sense to a lot of people um and

00:01:08

it's essentially a sort of professor in training position

00:01:13

uh i think sort of the equivalent positions in the u. k. would be lecturer uh

00:01:18

or in the states would be a sort of assistant professorship uh that's roughly what i do

00:01:23

my main research interests really focus around machine learning in healthcare so

00:01:27

multi sensory sensing of health care and different wellbeing aspects

00:01:31

really focusing a lot on affective and behavioural computing and my real background as

00:01:36

well is in this sort of area of neurological disorders um

00:01:42

mental disorders um things like this so mental health my p. h. d. was

00:01:47

actually using speech to see if we could find a marker for depression itself

00:01:52

so currently i work on a few different projects including the TAPAS project i also work on a few

00:01:56

others sustAGE RADAR-CNS and DE-ENIGMA and i'll talk a little bit more about them in a moment

00:02:01

my previous role was a postdoc position at the chair of um

00:02:07

of complex and intelligent systems at the university of passau also in germany

00:02:11

and i did all my research my sort of undergraduate and my p. h. d. at the university of

00:02:16

new south wales which is in sydney australia so yes i'm actually australian

00:02:21

i originally come from right down the bottom in hobart which is in

00:02:24

tasmania and is a really really lovely part of the world

00:02:27

i grew up mostly and spent most of my life in sydney uh it's a very nice city

00:02:32

i do my bit for the tourism board encouraging everyone to please go visit of

00:02:35

course saying g'day as the sort of traditional welcome from australia um

00:02:41

i have an australian accent i can't really apologise for that if you do struggle to

00:02:45

understand anything i say which sometimes happens please let me know and i'll re-express myself

00:02:52

i'll say it again until we eventually get there so what we need to do is uh

00:02:56

just a little bit more about my sort of research institute which is uh

00:03:02

the chair of embedded intelligence for health care and wellbeing we're quite a

00:03:05

new sort of research institute founded in two thousand and seventeen

00:03:09

so we just sort of had our anniversary really we have one professor being professor björn schuller

00:03:14

who most of you guys all know probably from the literature you should have been reading

00:03:19

he's openSMILE he's the ComParE challenges he's the AVEC challenges which are used

00:03:24

sort of all over speech pathology quite a bit and especially when we're talking about emotion

00:03:28

and affective computing we have one habilitation candidate myself we have six other postdocs

00:03:33

twelve doctoral students and two visiting positions currently at the moment as well

00:03:38

and the core research of the group is really this idea of sort of machine learning and signal processing

00:03:43

in health care focusing on embedded ubiquitous sensing i. o. t. devices

00:03:48

um intelligence deep learning machine learning conventional methods and then really

00:03:52

looking at this in healthcare sort of things like

00:03:55

speech pathology or into wellness things like affective computing as i said i work on a few different projects the first one

00:04:02

being TAPAS so obviously here talking about speech pathology different ways to sort of diagnose and improve the diagnosis of

00:04:07

speech pathology i also work on the sustAGE project where we're looking at ways of creating sustainable ageing

00:04:13

so how can we develop apps that really support elderly people to sort of live a more active life we

00:04:19

look at their sort of current mood current working environment current physical activity and recommend what things

00:04:25

they should do at work uh what things they should do outside of work

00:04:28

to sort of maintain an active and healthy lifestyle with the aim of keeping people working longer

00:04:32

i also work on the RADAR-CNS project which has big overlaps with TAPAS so we're

00:04:37

really looking at ways of using ubiquitous sensing using smartphones and using sort of

00:04:42

signals that we collect from the phones as well to monitor people with depression

00:04:46

multiple sclerosis and epilepsy we're really looking in terms of relapse here

00:04:49

can we predict if somebody's gonna have a relapse and it's a massive sort

00:04:53

of big project with uh i think roughly thirty partners and it's looking at

00:04:58

collecting samples from well over about a thousand people during the entire course of our project

00:05:03

last is DE-ENIGMA which again has some overlap with TAPAS where we're looking at sort of improving

00:05:09

emotion therapies for children with autism so okay we have these children and

00:05:14

how are they expressing themselves in terms of sort of affect and arousal

00:05:19

and valence sort of levels and then also looking at sort of different vocalisations they may make and their communication needs

00:05:25

and trying to really aid therapies and aid therapists and make

00:05:27

sort of more automated therapy sessions itself it's a really interesting project

00:05:32

but that's sort of enough about us and really on to the sort of

00:05:35

lecture today so we're basing it in and around speech pathology and

00:05:39

speech pathology is the study diagnosis and treatment of sort of communication

00:05:43

disorders at least that's how i'm really defining it for today

00:05:47

this includes difficulty with speech language fluency and voice itself

00:05:52

and difficulty in communication is really just a dysfunction in any sort of part of our

00:05:57

speech production apparatus this could happen as some sort of cognitive effect this

00:06:01

could happen as some sort of physiological effect some sort of neurological effect

00:06:04

something that's affecting our fine control within speech itself

00:06:08

we can also have sort of developmental and learning disabilities and delays coming in here as well as

00:06:13

things like stroke dementia and brain injuries that really um sort of commonly affect speech production

00:06:19

so what i'm really interested in in this sort of realm is really finding objective measures to sort of

00:06:24

help the diagnosis of different conditions so we aid people who are trying to

00:06:28

diagnose people get them into the right healthcare monitor how they might be doing on different treatments that's my sort of

00:06:33

real core interest within the sort of speech pathology and what we're sort of focusing on here

00:06:38

so how do we do this how does this sort of align with doing machine learning so this really relates into the field of what's

00:06:44

known as empirical learning so this is sort of basing a decision on

00:06:47

any existing data so try to learn from past events essentially

00:06:51

so just a very simple example of sort of what i mean here so imagine it's friday night

00:06:56

we've ordered a pizza delivery there's a game of football or a good movie starting in about thirty minutes and all of a

00:07:01

sudden your friend rings he wants to come over he wants you to come and pick him up from his house

00:07:07

can you actually go to his house in time can you actually pick him up and make it in time will the pizza be delivered and will we get back

00:07:13

in time for the sort of start of the movie the main thing here is will you make it back in time for the pizza delivery

00:07:18

to happen how do we actually make this decision how do we normally do this as sort of humans so what we normally do

00:07:24

sort of not exactly but we build some sort of statistical model we look at sort of

00:07:29

past events and sort of realise what might be happening what might be going on here

00:07:33

so the last few times we ordered some pizzas to our house we found that on average it sometimes

00:07:39

sorry uh of all the pizzas on average we found that four times out of eight they'd been delivered late

00:07:44

so we think okay fifty fifty maybe i've got time to go and pick up my friend maybe i've got time to get back in time

00:07:51

we don't know it might be that we come back a bit late it might be okay so we're sort of going well maybe maybe not so we've got to

00:07:57

sort of make the decision we've gotta look at more information we've gotta pull in more

00:08:01

knowledge what else can we learn what else can we use to actually help

00:08:05

make a decision in time so we need to consider all the sort of relevant

00:08:08

information and this is what we're doing with machine learning we're trying to gather

00:08:12

lots of different information from speech pull it in and learn from it and learning from relevant information so

00:08:17

we always have this idea of some sort of dependent variable what are we trying to predict

00:08:21

and what does it sort of represent itself in this very simple case it's just pizza

00:08:25

delivery time we also have independent variables so these are the features this

00:08:29

is the information that we're really sort of pushing into a machine learning algorithm

00:08:33

this is what we actually use to sort of formally make a decision

00:08:36

we wanna break something down like pizza delivery we might look at different days of the week see if there's a pattern in when drivers might be sort of

00:08:42

coming in and being late or drivers might be on time within this sort of delivery slot time

00:08:47

so we wanna build relationships between all these different pieces of information wanna build them up into a model

00:08:53

and find relationships find patterns and use these patterns to really

00:08:57

um make a decision so we notice that when we look

00:09:00

at days of the week and different pizza deliveries we see that on um mondays three out of four deliveries

00:09:05

had been on time so i'm sorry like within this particular example we could also include other information to

00:09:12

add up to date traffic information i mean nowadays we all have apps that sort of track this uh

00:09:16

we know a little bit more so we can sort of build this decision tree up and go okay it's monday night i've ordered my food

00:09:22

will it be late uh looking at just was it late before was it late previously i'd go

00:09:26

fifty fifty and then i sort of break this down in a very simple decision tree

00:09:30

and realise that there's probably a twenty five percent chance my pizza might be

00:09:34

late in this particular example so using this information i build

00:09:37

a slightly better model i've put in sort of more information and i go okay i'm not gonna go and pick up my friend

00:09:42

at the moment the pizza is most likely gonna be on time my friend can just walk over like he's not really that

00:09:47

good a friend anyway that it's worth missing pizza for so this sort of information this is how we really build up

00:09:52

this sort of idea of sort of speech pathology and sort of different empirical learning
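
A minimal Python sketch of the pizza example, assuming a made-up delivery log. The individual records are invented so that the overall late rate comes out at fifty fifty and the Monday late rate at twenty five percent, matching the figures in the talk. The names `history`, `late_rate` and `late_rate_by_day` are hypothetical.

```python
from collections import defaultdict

# Hypothetical delivery history: (day of week, was the pizza late?).
# Eight orders, four late overall; four of them on a Monday, one late.
history = [
    ("mon", False), ("mon", False), ("mon", True), ("mon", False),
    ("fri", True), ("sat", True), ("sun", True), ("wed", False),
]

def late_rate(records):
    """Fraction of deliveries in `records` that were late."""
    return sum(late for _, late in records) / len(records)

def late_rate_by_day(records):
    """Split the history on the day feature, then estimate per branch."""
    by_day = defaultdict(list)
    for day, late in records:
        by_day[day].append((day, late))
    return {day: late_rate(rows) for day, rows in by_day.items()}

overall = late_rate(history)               # no features used
monday = late_rate_by_day(history)["mon"]  # one split on day of week
print(overall, monday)
```

Splitting on a single independent variable (the day) is the smallest possible decision tree; adding traffic or weather would add further splits in the same way.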

00:09:56

so how do we really do this how do we pull this all back into sort of

00:10:00

speech pathology itself and into sort of speech pathology detection we always start with data

00:10:05

and really that's labelled data normally as well so this is collecting speech from different

00:10:10

people with different pathological conditions and maybe from control samples as well so things we work with like

00:10:15

dysarthria uh depression parkinson's or autism all different sorts of speech we might collect in

00:10:21

collaboration with some clinical partners we might try to get good decent medical labels that go along with these as well

00:10:27

we sort of clean the data up a little bit and we start to remove erroneous factors

00:10:30

we might have an outlier sample someone might have forgotten to turn the microphone on

00:10:34

someone might have you know had a massive coughing fit halfway through it and nobody else really had this we might look for sort of

00:10:40

really big obvious outliers that are really gonna affect our decision that could affect the sort of

00:10:44

machine learning and different aspects of what we're doing so we clean up the data a little bit

00:10:48

then we really got to extract relevant information this is the first

00:10:51

part of the lecture today where i'm gonna talk about feature extraction itself

00:10:55

so this is taking a whole speech signal and a speech signal is very very

00:11:00

complex to look at there's a lot of different information going on

00:11:03

we sort of break them down into smaller and smaller chunks we can start to see patterns we can start to

00:11:08

extract little bits of extra sort of information from these and these are sort of the features that we use itself

00:11:13

and then we push this into a machine learning model we go okay i'm gonna

00:11:17

use deep learning i'm gonna use support vector machines i'm gonna use decision trees

00:11:21

we sort of feed this information in we train our model we test our model we go okay cool

00:11:26

i've got something that's working particularly accurately yeah great i'm gonna publish a paper on

00:11:31

it and everyone's gonna be happy so that's roughly sort of what we're doing here

00:11:34

but what we do in between is what i'm talking about in the second part of the

00:11:37

lecture so within the machine learning how can we really give ourselves the best chances

00:11:41

to train the best model possible so looking at making sure we have generalisation in our machine

00:11:46

learning models and the different sort of areas we've gotta look out for within here

00:11:49

and then also talking about different machine learning algorithms that we can actually use as well just

00:11:54

a bit of a break down today so it's really the use of computational intelligence

00:11:58

to solve some sort of prediction problem in the case of TAPAS it's some sort

00:12:01

of speech pathology detection problem so as i said data collection preprocessing feature extraction machine learning

00:12:07

i mean this is pretty standard for any particular computational intelligence task any particular machine learning task

00:12:13

and as i said today we're really focusing on feature extraction extracting useful information from raw data itself

00:12:19

and machine learning algorithms so learning rules learning how we find these different patterns in the data how we break down

00:12:25

find the relevant bits of information itself and then use them to sort of make a final decision itself
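
The four stages named above (data collection, preprocessing, feature extraction, machine learning) can be sketched end to end in a few lines of Python. Everything here is an assumption for illustration: the recordings and their labels are made up, the two features are toy ones, and the nearest-centroid classifier is a stand-in rather than a method from the lecture.

```python
import statistics

def preprocess(recordings):
    """Drop obviously broken recordings, e.g. microphone left off."""
    return [(sig, label) for sig, label in recordings if any(sig)]

def extract(sig):
    """Toy two-dimensional feature vector: mean level and spread."""
    mags = [abs(s) for s in sig]
    return [statistics.mean(mags), statistics.pstdev(mags)]

def train(examples):
    """Nearest-centroid model: one mean feature vector per label."""
    grouped = {}
    for feats, label in examples:
        grouped.setdefault(label, []).append(feats)
    return {label: [statistics.mean(col) for col in zip(*rows)]
            for label, rows in grouped.items()}

def predict(model, feats):
    """Return the label whose centroid is closest to `feats`."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(feats, model[label]))
    return min(model, key=dist)

# Made-up recordings: (samples, label). The silent one is removed.
recordings = [
    ([0.1, -0.1, 0.1, -0.1], "control"),
    ([0.2, -0.1, 0.1, -0.2], "control"),
    ([0.9, -0.2, 0.8, -0.1], "patient"),
    ([0.8, -0.1, 0.9, -0.3], "patient"),
    ([0.0, 0.0, 0.0, 0.0], "control"),
]
clean = preprocess(recordings)
model = train([(extract(sig), label) for sig, label in clean])
print(predict(model, extract([0.85, -0.2, 0.9, -0.2])))
```

Each function corresponds to one stage, so swapping in a real feature extractor or classifier only changes that one step.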

00:12:31

so just before really getting into it looking at a very high

00:12:34

level what are features themselves what do i actually mean when

00:12:37

i talk about features what is that how can we think about this as the sort of bigger wider topic

00:12:43

so features themselves are just the representation of the data that we're feeding into

00:12:48

a particular machine learning algorithm itself and they really just represent little single pieces of information

00:12:55

and every single one of these pieces of information is something that the machine learning algorithm

00:12:58

uses to make its actual decision itself so in my very simple example we used um

00:13:04

previous deliveries and days of the week to really break down our decision but we might

00:13:08

have to feed more information into that algorithm traffic conditions weather conditions so on

00:13:12

and so on and build a finer and finer and more accurate model of course if we start adding too much information in these patterns

00:13:18

get too hard to find so we've got to be careful about what we feed in and exactly how we feed it in itself

00:13:24

typically we might feed hundreds and thousands of these bits of

00:13:27

information and we concatenate them together into what's known as a feature vector

00:13:31

and we're gonna sort of extract different feature vectors in the labs this afternoon looking at the openSMILE toolkit

00:13:36

and a few different ways of extracting feature vectors from there from the

00:13:39

very sort of focused sort of eGeMAPS very small condensed

00:13:44

set of features to the very wide sort of big uh ComParE feature set which is i think six thousand

00:13:49

seven hundred odd features and pieces of information we can possibly feed into a machine learning algorithm
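
As a rough illustration of how single pieces of information are concatenated into a feature vector, the Python sketch below computes two frame-level descriptors (RMS energy and zero crossing rate) and summarises each by its mean and standard deviation. This is not openSMILE and not one of its feature sets; the function names, the 25 ms frame and 10 ms hop at 16 kHz, and the sine-wave input are all assumptions chosen for illustration.

```python
import math
import statistics

def frames(signal, size, hop):
    """Cut a signal into short overlapping frames."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]

def rms_energy(frame):
    """Root-mean-square energy of one frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose sign differs."""
    crossings = sum((a < 0) != (b < 0) for a, b in zip(frame, frame[1:]))
    return crossings / (len(frame) - 1)

def feature_vector(signal, size=400, hop=160):
    """Frame-level descriptors summarised by mean and standard deviation."""
    descriptors = [rms_energy, zero_crossing_rate]
    vector = []
    for descriptor in descriptors:
        track = [descriptor(f) for f in frames(signal, size, hop)]
        vector += [statistics.mean(track), statistics.pstdev(track)]
    return vector

# Toy input: one second of a 100 Hz sine sampled at 16 kHz.
rate = 16000
signal = [math.sin(2 * math.pi * 100 * n / rate) for n in range(rate)]
print(feature_vector(signal))
```

Two descriptors times two summary statistics gives a four-dimensional vector; larger feature sets follow the same pattern with many more descriptors and statistics.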

00:13:55

the role of the machine learning algorithms in terms of the features here is to really identify patterns look at the features here

00:14:00

find out anything we can learn what can we learn from it how can we

00:14:03

continually improve this finding how can we actually make the call that we need

00:14:08

so what is machine learning itself sort of very again high level it's really the creation

00:14:12

of very robust models we wanna do some sort of prediction do some classification

00:14:17

get some sort of output predict our sort of dependent variables from the independent variables from a particular data set

00:14:23

itself as i said we're primarily concerned with pattern identification and finding these patterns in such

00:14:28

a way that we can target them towards a particular goal towards a particular task

00:14:33

we perform this process normally via some sort of iterative updating we have

00:14:36

some sort of cost function we continually improve the

00:14:39

parameters of the algorithms towards this cost function over time to actually get a better and better estimate but we wanna learn

00:14:46

such that that estimate is not just improving on the data that we're continually

00:14:49

feeding into the algorithm but such that it's wider more sort of um

00:14:53

deployable on the wider broader sort of probability distribution of actual features

00:14:58

based on the actual problem space that we're looking at and dealing with itself

00:15:01

and this idea of sort of learning phases and training phases where we

00:15:04

calibrate our algorithm adjust the parameters but we've always gotta continually test this

00:15:09

and make sure that it's not overfitting that it's not just getting very good at recognising a pattern

00:15:14

just in the information we're giving it but the pattern holds in sort of a wider set
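
The idea of iteratively updating parameters against a cost function, while checking that the estimate also holds on data the algorithm has not seen, can be sketched as follows. This is a minimal illustration under stated assumptions: a straight-line model, a mean squared error cost, plain gradient descent, and synthetic data drawn from y = 2x + 1 with added noise. None of these specific choices come from the lecture.

```python
import random

def mse(w, b, data):
    """Mean squared error cost of the line y = w * x + b on `data`."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def gradient_step(w, b, data, lr):
    """One gradient-descent update of w and b on the cost."""
    n = len(data)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
    return w - lr * grad_w, b - lr * grad_b

# Toy data from y = 2x + 1 plus noise (an assumption for illustration).
rng = random.Random(0)
points = []
for _ in range(200):
    x = rng.uniform(-1, 1)
    points.append((x, 2 * x + 1 + rng.gauss(0, 0.1)))
train, held_out = points[:150], points[150:]

w, b = 0.0, 0.0
for step in range(500):
    w, b = gradient_step(w, b, train, lr=0.1)

train_cost = mse(w, b, train)        # cost on the data we fitted to
held_out_cost = mse(w, b, held_out)  # cost on data never used for fitting
print(train_cost, held_out_cost)
```

If the held-out cost were much larger than the training cost, that would be the overfitting warning sign described above.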

00:15:18

so that's sort of the rough broad sort of introduction and i'm gonna break it down now into sort

00:15:23

of two particular parts the first uh that we'll focus on before the morning break is feature extraction

00:15:28

so looking at low level descriptors looking at suprasegmental features i'll also briefly introduce you to

00:15:34

bag of audio words uh just another way of sort of organising the data itself

00:15:38

and then also talking about feature representation learning so just a little bit

00:15:41

of introduction into using convolutional neural networks and the role they can play

00:15:46

in actually allowing us to learn features so the first part is really about sort of hand crafting and

00:15:51

knowledge based features and then we move into sort of really how can we learn how can we use

00:15:55

sort of big advances now in deep neural networks to actually learn useful and relevant features themselves

00:16:01

and then the second part after the morning break we'll talk about machine learning we'll

00:16:04

talk about generalisation then we'll talk about discriminative models and we'll talk about um

00:16:08

generative models themselves and just uh really a brief overview of the sort of

00:16:12

core ideas of a lot of different models their advantages and disadvantages

00:16:16

skipping over deep learning skipping over hidden markov models a little bit 'cause you

00:16:20

guys have covered this elsewhere over the course of this week or will

00:16:23

cover it over the course of this week itself so first up on to sort of feature extraction and on to low level descriptors

00:16:32

yeah

00:16:39

uh_huh

00:16:44

so features so again taking this bigger sort of broader look it really can be thought

00:16:49

of as some sort of abstract representation of our data just converting raw data into some other form

00:16:56

normally we do this in such a way that it's actually extracting useful information for

00:17:00

that sort of task that we want when we plan to do this

00:17:03

really extract relevant information to the task at hand if we just took raw data and sort

00:17:08

of fed this into some sort of speech pathology algorithm some sort of speech

00:17:12

pathology detection algorithm there's gonna be a lot of different confounding factors so something like

00:17:16

linguistic variability is gonna get in the way of this very very quickly

00:17:20

we might just be learning not quite what we wanna learn so we really wanna focus our

00:17:23

feature extraction extract particular sort of information we know is gonna be good

00:17:28

and this idea of sort of reducing redundancies so we don't wanna be feeding unwanted information

00:17:33

into any machine learning algorithms machine learning algorithms essentially as i said they

00:17:37

look for patterns but they really look for variations and they're looking for this

00:17:41

variation in the data and if the variation you're feeding into it is wrong

00:17:45

in such a way that it can actually confound the decision the machine learning algorithm is gonna look at that so

00:17:50

very simple example even starting at a very broad level when we're sort of thinking about how to collect speech

00:17:55

say we collect all our patients in one particular room let's say in

00:17:58

a very very small good soundproof room really high quality audio

00:18:02

and then we go okay i need some control samples i'm just gonna go to the lecture theatre here with some microphones

00:18:07

and collect some speech in some sort of open room with lots of reverb lots of echo

00:18:12

we're gonna start off on a very sort of wrong foot already feeding

00:18:15

redundant information in initially 'cause the sort of um

00:18:19

speech recordings are gonna be so different in their actual nature we're just gonna learn that they're recorded on microphone

00:18:26

a or recorded on microphone b these things come through into the sort of feature extraction algorithms

00:18:31

so reducing redundant information isn't only about sort of feature extraction itself but it's looking at sort of

00:18:36

the whole pipeline how can we really construct our best algorithm to actually begin with itself

00:18:41

the real classic examples of sort of speech features include pitch

00:18:45

which we'll talk about very sort of briefly energy

00:18:48

we'll talk a little bit about spectral features themselves and then sort of talk about how we

00:18:52

put this all together in such a way that we actually can learn from it do our sort of speech pathology do

00:18:57

these sort of paralinguistic sort of feature representations that come across quite a lot of the time in the literature itself

00:19:03

but before i start into sort of feature extraction i find it's just very

00:19:07

useful to do a very quick overview into sort of speech production

00:19:11

with understanding a little bit about speech production we can understand a little bit

00:19:14

about why we're extracting particular features what they actually represent within the speech signals themselves

00:19:20

so speech is a very very complex motor action that we do you guys have

00:19:24

probably had this stressed throughout your sort of training so far

00:19:28

it's actually the most complex action in terms of muscular movements and sort of fine control that we actually

00:19:33

do nothing requires sort of more coordination of different muscles different muscle groups within our body itself

00:19:39

so speech is always starting with some sort of processing some sort of

00:19:42

cognitive thought i have a message i want to transmit this message

00:19:46

so that's just the linguistic content we also process do i wanna stress something in the sentence

00:19:51

do i wanna make a point clear do i wanna emphasise something so we're looking at putting

00:19:55

the prosodic information within this prosodic information things like our sort of emotional responses

00:20:00

might get in the way of this we might have something like fear coming across something like anger coming across

00:20:05

over the top joy happiness this sort of gets in the way of this sort of prosodic information itself

00:20:10

we've sort of decided then what we wanna say how we wanna say it we then need to put this into action

00:20:14

we need to actually generate all the neuromuscular commands start to get the motor actions to produce speech

00:20:20

the main actions are always in quite a few parts our lungs are sort of pumping tiny little bits of air

00:20:26

out giving us some sort of energy we're then either using our vocal folds in an active manner

00:20:31

where they're sort of vibrating at the fundamental frequency this puts some sort of tone and some sort of pitch into the speech signal itself

00:20:37

we're then shaping our vocal tract for some sort of articulation and this is doing some particular

00:20:42

filtering action this allows us to sort of shape the sounds and sort of tune the sounds

00:20:47

distribute them to make very specific speech sounds this requires a lot of training if you think

00:20:52

it takes a sort of five ten years to really learn how

00:20:55

to speak properly to really control these muscles in a way to produce

00:20:58

sounds similar sounds the same way time and time again and to learn the language

00:21:02

patterns to go with them to actually emphasise and get this message across

00:21:06

so speech itself is very very sort of particular we have this phonation this action

00:21:10

of the vocal folds we have the shaping of the vocal tract and this

00:21:13

is what's known as articulation and it's done by not only the vocal tract itself

00:21:17

we're also using our lips teeth tongue hard and soft palate and nasal cavities

00:21:22

there's quite a lot of information not just the vocal tract well i think the vocal tract is really everything from your vocal folds

00:21:27

pretty much to your lips and also sort of out of your nose as well so

00:21:31

there's quite a lot of things happening when we're actually just producing a single speech sound

00:21:35

so respiration can be thought of as a sort of power source this is the energy so

00:21:40

if we really wanna change how loudly we speak we have to say take more air

00:21:44

in and push more air out itself and if we wanna sort of um

00:21:49

supply pressure if we wanna do very very long sentences well we need to get some

00:21:53

more air into our lungs to talk very fast for some particular long amount of time

00:21:57

so it can be thought of as some sort of battery some sort of power source for the actual speech production as well

00:22:02

so phonation is this sort of conversion of this sort of source from the lungs energy coming out of the lungs into sort of

00:22:08

the first sort of bit of speech production itself so as i already

00:22:12

said we have this idea of voiced speech production so where we

00:22:15

vibrate our vocal folds at some sort of particular fundamental frequency and this gives

00:22:20

us a rough shape rough sort of sinusoidal-ish shape

00:22:24

in a very bad way of saying it there for the sort of voiced sounds itself

00:22:28

and for the unvoiced sounds we're actually just holding our vocal folds open rushing air

00:22:33

out of our lungs forcing it very very quickly then relying on the movement

00:22:37

of the sort of positioning of the articulators to create different bits of um

00:22:42

okay different uh i've forgotten the word in english um different bits of constriction in

00:22:48

the actual vocal tract and these bits of constriction then allow for the different speech sounds

00:22:52

we hear things like the t. sound really constricting very much at the front and

00:22:56

using our tongue very quickly to get this sort of action t. s. f.

00:23:00

are all very much examples of unvoiced speech sounds so without the vocal folds playing a role

00:23:04

but at the same time we're using the articulators to produce different sounds themselves

00:23:09

this process is articulation so the process of forming speech sounds and recognisable speech sounds themselves

00:23:14

by the movement of the articulators so we're shaping the vocal tract we're using it and shaping it in

00:23:19

such a way that we produce recognisable sounds recognisable sounds of sort of our language itself

00:23:25

this all relies on the right coordination together and this is why speech is such a valuable

00:23:30

marker of a lot of different neurological conditions we have um many different ways

00:23:35

the different conditions different pathologies can sort of get in and interrupt the speech signal themselves

00:23:40

and a lot of the time they can interrupt the speech signal in such a way we can find the sort of common patterns

00:23:45

between groups of people and this allows us to really do the sort of work that we really want to

00:23:51

so how do we model speech how do we get this idea of speech production and

00:23:55

sort of throw it into some sort of model allowing us to extract features

00:23:58

from it allowing us to sort of understand a little bit about what's going on the most common way we do this is known as the source filter model

00:24:05

so breaking sort of the speech system down into really the main aspects respiration phonation articulation

00:24:10

they all must function together they must function when we encode prosodic information into the speech as well and the source

00:24:16

filter model really allows us to explain this and that sort of allows us to put speech production together

00:24:25

as a series of separable linear filters so we can look at each one

00:24:29

alone and model these aspects and from that it allows us to sort of start extracting features

00:24:33

this video is really just to give an idea of the sheer number of things

00:24:39

that are moving the sheer number of things that are happening seen from the side in an m. r. i.

00:24:51

uh_huh

00:25:00

the idea is looking uh all the different aspects

00:25:03

everything that's moving within this is easily speaking

00:25:09

this is actually someone beatboxing so i forgot which one would play but this one is really

00:25:13

cool it's sort of a little off topic but i'm just putting it in 'cause it's a cool example

00:25:18

yeah

00:25:31

uh_huh

00:25:35

uh_huh

00:25:38

okay so in the source filter model itself we always have the source and the source is essentially the air coming from

00:25:44

the lungs through the vocal folds themselves already talked about this

00:25:47

uh if it's voiced speech we consider the vocal folds active

00:25:50

so we've got some sort of oscillation happening we've got these initial energy pulses that come through and

00:25:55

then we're convolving them with a sort of filtering action of the actual vocal folds opening and shutting

00:26:00

then we have this sort of impulse response this impulse response comes along from the sort of vocal folds

00:26:05

opening so air rushes through it yeah and then with bernoulli's principle

00:26:09

there's sort of built up pressure at the sides of the vocal folds

00:26:12

which sort of makes them snap shut this is the aspect of the vocal folds snapping

00:26:15

shut they're closed again yeah pressure builds up under the vocal folds

00:26:19

we have them open and so on and so on we get this sort of impulse train here

00:26:23

and when this goes through the sort of glottal model we get the excitation signal actually in

00:26:28

the voiced speech production itself this model is normally generally a second order low pass filter

00:26:34

um i hope everybody's familiar with the z transform or familiar

00:26:37

enough with the z transform no it's you know it's really

00:26:40

handy to know that 'cause it comes up a lot of the time the z transform is just the sort of

00:26:46

wrapped version of the fft with z replacing the complex exponential it really is that simple as an

00:26:51

explanation of what's going on within there itself but it's just another way of representing a frequency transform
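
A short worked note on the relationship mentioned here, using standard textbook definitions rather than anything confirmed from the slides. The z-transform of a discrete signal $x[n]$ reduces to the discrete-time Fourier transform when $z$ is restricted to the unit circle, and a second-order all-pole filter of the kind the lecture uses for the glottal model can be written in terms of it. The coefficient names $a_1$, $a_2$ are illustrative assumptions.

```latex
X(z) = \sum_{n=-\infty}^{\infty} x[n]\, z^{-n},
\qquad
X(e^{j\omega}) = X(z)\Big|_{z = e^{j\omega}}
             = \sum_{n=-\infty}^{\infty} x[n]\, e^{-j\omega n}

H(z) = \frac{1}{1 - a_1 z^{-1} - a_2 z^{-2}}
\quad\Longleftrightarrow\quad
y[n] = x[n] + a_1\, y[n-1] + a_2\, y[n-2]
```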

00:26:57

of the signal itself whereas in unvoiced speech we normally have the vocal folds open so within the

00:27:01

source filter model this is always just a random noise generator that's actually generating the different speech sounds itself

00:27:08

the next aspect of the source filter model is of course the filter and this is how we sort of take this

00:27:12

energy signal excitation signal which might just come across sounding like some sort of tone at a particular frequency itself

00:27:19

and actually allow us to change it into some sort of speech

00:27:22

representation themselves and this is done by altering the shape of the

00:27:25

vocal tract and this produces a very particular set of filter characteristics and

00:27:29

itself the vocal tract can be considered as some sort of um

00:27:34

uneven cross sectional tubes and the sort of cross sections

00:27:37

of the tubes change over time we have sort of partial reflections

00:27:41

at each of the different junctions of these tubes

00:27:44

itself these partial reflections essentially allow us

00:27:48

to produce the sort of filtering action so this is as the sort of vocal tract changes over time

00:28:01

uh_huh

00:28:05

so this is a very good example of the changes the movements in the points

00:28:10

of constriction all changing up and down just producing different sounds

00:28:14

her

00:28:17

yeah

00:28:19

we normally sum so this is the output of the actual glottis as some sort of low pass

00:28:24

filter response a second order low pass filter response and then followed by some sort of band pass filtering itself

00:28:30

and this gives us the spectrum actual frequency distribution the peaks

00:28:33

being the formant frequencies so the spectral

00:28:37

peaks of the vocal tract spectrum being the formant frequencies themselves so every time we try to build some sort of model

00:28:43

of speech production formants are one of the big things we have to take into account so they're the really dominant peaks

00:28:48

within the vocal tract spectrum itself these dominant peaks are pretty consistent across different speech

00:28:53

sounds itself this allows us to actually recognise different speech sounds with very particular positions

00:28:58

we generally try to build some sort of vocal tract model only looking for three four maybe

00:29:04

even five up to five sort of formant positions themselves it really depends on how

00:29:09

well we've sampled exactly what we wanna do but it's the sort of positions and holding

00:29:13

our vocal tract in similar positions producing similar formants that allows us to produce recognisable sounds

00:29:18

in languages and pull them all together so how do we actually model this we model this also as second order

00:29:24

resonators so we're just stacking lots of second order filters together

00:29:28

second order cascaded low pass resonators

00:29:32

and ideally we're having two poles within this particular model very sort of quick brief introduction so i'll sort of jump over the

00:29:38

sort of filtering aspects of it there so the last part of speech production is always the role of the lips

00:29:44

the role of the lips in the sort of source filter model anyway is to really get the

00:29:49

sort of air pressure that's coming out the shaped pressure that's coming out of our vocal tract

00:29:53

and really just amplify it and push it out into the room and we model this

00:29:56

sort of aspect together as a sort of single pole high pass filter

00:30:00

and then once we've got all these filters we can just jam them together essentially and get

00:30:04

some sort of vocal tract response some sort of frequency response

00:30:08

of the actual model of the speech production when we look at a single window of speech and this is a

00:30:13

very very short window we can see all aspects pretty much of the source filter model happening themselves
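
The cascade described above (impulse train source, second-order resonators for the vocal tract, first-order high-pass for the lips) can be sketched in plain Python. This is a minimal illustration, not the lecturer's implementation: the function names, the formant centre frequencies and bandwidths, the `alpha` value and the sample rate are all assumptions, and the second-order glottal low-pass stage is left out to keep the sketch short.

```python
import math


def impulse_train(f0_hz, fs_hz, n_samples):
    """Idealised voiced source: one unit pulse per pitch period."""
    period = round(fs_hz / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator(x, centre_hz, bandwidth_hz, fs_hz):
    """Two-pole resonator: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)   # pole radius (< 1, so stable)
    theta = 2.0 * math.pi * centre_hz / fs_hz       # pole angle
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    y, y1, y2 = [], 0.0, 0.0
    for sample in x:
        out = sample + a1 * y1 + a2 * y2
        y.append(out)
        y2, y1 = y1, out
    return y


def lip_radiation(x, alpha=0.97):
    """Single-zero high-pass: y[n] = x[n] - alpha*x[n-1] (alpha is assumed)."""
    y, prev = [], 0.0
    for sample in x:
        y.append(sample - alpha * prev)
        prev = sample
    return y


def synthesise_vowel(f0_hz=100.0, fs_hz=8000, n_samples=800,
                     formants=((700, 80), (1200, 90), (2600, 120))):
    """Source -> cascaded formant resonators -> lip radiation."""
    signal = impulse_train(f0_hz, fs_hz, n_samples)
    for centre_hz, bandwidth_hz in formants:   # hypothetical formant values
        signal = resonator(signal, centre_hz, bandwidth_hz, fs_hz)
    return lip_radiation(signal)
```

Changing the `formants` tuple while keeping `f0_hz` fixed corresponds to changing the vocal tract shape while the vocal folds keep vibrating at the same rate.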

00:30:19

so we always have our impulse response these are the sort of harmonics that we can actually see within the different signals themselves

00:30:25

we always have the vocal tract response we can see the different peaks of the actual vocal tract response

00:30:30

we have some underlying low pass filtering shape as well normally we have a tiny bit more information in the low

00:30:35

frequencies than we generally have in the high frequencies and this has come across really from the glottal response

00:30:41

which is the second order low pass filter and the lips response which is the first order high pass filter so when we sort of put these together

00:30:47

we get this general first order low pass filtering shape that we can really see in any sort of speech production

00:30:53

so we can use the source filter model to model the different aspects of sort of speech production voiced

00:30:58

speech production vocal tract responses and really start to get different bits of information

00:31:03

pull different low level descriptors out of the actual speech signal itself

00:31:07

so i've roughly broken this down into sort of three different

00:31:09

aspects itself represented features and um prosodic features themselves

00:31:15

really used a lot in paralinguistics so they really identify differences

00:31:20

in speaking styles they really give the rhythm

00:31:23

the life the intonation emotion especially sort of arousal emotions

00:31:27

we can really hear within the prosodic features themselves

00:31:31

we have source features that i'll talk about as well and these sort of model glottal flow or actually what

00:31:36

is known as irregular phonation so when we talk about the source filter model

00:31:39

we're always presuming the vocal folds are operating at sort of one hundred percent they always

00:31:45

open and always shut at a sort of rhythmic rate they always sort of shut fully open fully but actually in

00:31:51

real life speech this never really happens this is what's known as irregular phonation we can really hear

00:31:56

some sort of different alterations within what's happening within the sort of glottal

00:32:00

flow and the actions of the vocal folds within the source features themselves

00:32:04

i'll also talk briefly about sort of formant and spectral properties this is really

00:32:08

detailed information of what's happening in the vocal tract what's happening in the

00:32:11

filter what's allowing us to sort of produce and hear these different sounds so we're

00:32:15

just breaking this down so we have a sort of vocal folds

00:32:19

as source features we have the sort of vocal tract which is the formant and spectral features and we have the sort

00:32:24

of prosodic features which sort of sit over the top and really reflect

00:32:28

some sort of differences in how speech sounds

00:32:32

really before we start even extracting features from speech we need to think about how we wanna do this

00:32:36

if we extract features for um say i want pitch over some five second speech interval

00:32:42

one pitch one sort of pitch value yeah it's pretty much meaningless for what

00:32:45

we've got what we've really got is speech which is a highly complex

00:32:50

signal it's very much time varying it varies in time and changes

00:32:53

very very quickly we change vocal tract shapes

00:32:58

somewhere between sort of ten to forty milliseconds on average so

00:33:01

we're changing these tract characteristics changing these filter operations

00:33:04

very very quickly so we need to start sort of breaking the speech down actually doing what's known as windowing

00:33:09

so this is where we just assume that we've got some sort of speech and that our speech features will change

00:33:15

slightly slower than the sort of speech signal itself we start to break things down we go from an entire utterance

00:33:21

we sort of break this down break this down we break this down to very tiny sort of windows we normally extract these windows

00:33:27

on average for most speech tasks at something like twenty five milliseconds and we

00:33:33

overlap them by ten milliseconds as we sort of window across

00:33:36

maybe if we're doing pitch extraction we might extend this out i'll talk a little bit more about that later on during the morning itself

00:33:42

the other reason we sort of break this down into sort of smaller and smaller milliseconds

00:33:47

is really to do with spectral analysis and the idea of fourier transforms and properties of fourier

00:33:52

transforms again which i'll also go very briefly over as we go into the sort of main aspects of it

00:33:57

so within very very tiny windows of speech we can actually

00:34:01

presume that we can make some sort of um

00:34:05

assumption that the signal's periodic this periodicity assumption essentially enables us to do fourier transforms

00:34:11

always uh presuming that we're working on some sort of periodic speech signal itself

00:34:16

this enables fourier analysis and fourier analysis is really the basis of most spectral information that we extract

00:34:22

from the speech signals so windowing as i said a typical window size

00:34:25

that we use a lot within paralinguistics is twenty five milliseconds

00:34:29

and we normally have some sort of shift some sort of overlap by ten milliseconds themselves
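
The framing described here can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name `frame_signal` is made up for this note, and the ten milliseconds is interpreted as the step between the starts of consecutive 25 ms windows (so neighbouring windows share 15 ms), which is the usual convention but is not spelled out in the lecture.

```python
def frame_signal(samples, fs_hz, win_ms=25.0, hop_ms=10.0):
    """Cut a signal into fixed-length, overlapping rectangular frames.

    win_ms is the window length and hop_ms the step between window starts
    (both assumed defaults, taken from the values quoted in the lecture).
    Trailing samples that do not fill a whole window are dropped.
    """
    win_len = int(round(fs_hz * win_ms / 1000.0))
    hop_len = int(round(fs_hz * hop_ms / 1000.0))
    frames = []
    start = 0
    while start + win_len <= len(samples):
        frames.append(samples[start:start + win_len])
        start += hop_len
    return frames
```

At a 16 kHz sampling rate this gives 400-sample windows that advance by 160 samples, so one second of audio produces 98 frames.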

00:34:33

one of the standard windows we can use is just the rectangular window where we just take the speech signal and go bang

00:34:39

and we sort of just cut that we take bits of the signal itself and just do this time and

00:34:43

time again this is the function i've got here so yeah we have some sort of period

00:34:48

we have some sort of overlap from this period and we start to extract different bits of information themselves

00:34:55

but actually using a rectangular window starts to bring in some sort of issues

00:34:59

depending on exactly the speech feature we're looking at the information we want

00:35:03

if we doing rectangular windowing we always sort of end up introducing

00:35:07

some sort of discontinuity so we're presuming some sort of periodicity

00:35:11

within the speech window themselves but we're just randomly sampling it there's no guarantee we're gonna cut where the

00:35:17

signal is reaching the zero crossing points we're always gonna cut somewhere where we're just

00:35:21

sort of introducing so we never have this nice sine wave we're never

00:35:24

cutting and doing our windowing at very nice positions we're always introducing some sort

00:35:29

of discontinuity into this if you do end up actually using rectangular windows themselves

00:35:34

this generally creates some sort of high frequency component that we may or may not wanna live with within a signal depending

00:35:39

exactly what we wanna do how we wanna do it so sometimes we use different sort of windowing functions

00:35:44

the windowing functions a lot of the time have normally got some sort of shape some sort of rough gaussian to them

00:35:49

they're essentially just doing a bit of tapering so sort of bringing the edges down and extracting fewer

00:35:54

bits of information there and we're having this attenuation so not having the sort of high frequencies

00:35:59

so you really have the maximum amount of information in the centre and sort of bring the edges down different

00:36:03

shapes here the hamming window the kaiser window the hanning window gaussian windows

00:36:07

they're all about this sort of idea to sort of dampen these discontinuities

00:36:11

at the actual start and end of the windowing itself and have the

00:36:14

maximum information in the middle of the signals themselves exactly which

00:36:19

window we choose really just depends on how complex you wanna make the actual signal

00:36:24

how real time you want things how long our sort of processes take

00:36:27

the hamming window is generally used quite a lot in different um

00:36:32

tasks themselves i think the hamming window is most likely the one used in most of the default openSMILE scripts
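
To make the tapering idea concrete, here is a minimal sketch of a Hamming window using the standard textbook formula w[k] = 0.54 - 0.46 cos(2 pi k / (N - 1)). The helper names are assumptions for this note, and nothing here is taken from the openSMILE source.

```python
import math


def hamming(n):
    """Symmetric Hamming window of length n (standard 0.54 / 0.46 form)."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1))
            for k in range(n)]


def apply_window(frame, window):
    """Multiply a frame sample-by-sample with a window of the same length."""
    if len(frame) != len(window):
        raise ValueError("frame and window must have the same length")
    return [s * w for s, w in zip(frame, window)]
```

The window is close to 1 in the middle of the frame and tapers to 0.08 at both ends, which is what damps the edge discontinuities a rectangular cut would introduce.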

00:36:47

uh_huh

00:36:50

okay so on to different um

00:36:53

different features and different low level descriptors we call these low level descriptors 'cause we're extracting them at the

00:36:59

sort of window level we're taking these tiny tiny windows and extracting this sort of information here

00:37:04

so the real basic speech feature we can actually take is short term energy so what is

00:37:08

short term energy it's essentially the loudness of the actual speech signal itself that's what reflects

00:37:13

what we're trying to do when we're actually extracting short time energy

00:37:16

really tracking the envelope of the actual speech signal itself

00:37:21

so to track the envelope through time we're just using a simple squared function here we're just squaring and summing the values within the

00:37:27

particular window function the window length is particularly important here if we set the window too long we're sort of um

00:37:35

essentially creating very much a low pass filter we lose a lot of information itself

00:37:39

if we take the window too short we lose a lot of information but as i said generally twenty five

00:37:43

milliseconds for window length really does allow us to sort of track the envelope quite nicely itself
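
Short-term energy as described here (square the windowed samples and sum them, frame by frame) can be sketched like this. The function name, the choice of a Hamming window and the default sizes are assumptions for illustration; the slide's exact formula is not recoverable from the transcript.

```python
import math


def short_time_energy(samples, fs_hz, win_ms=25.0, hop_ms=10.0):
    """Return one energy value per frame: sum over the frame of (x[m] * w[m])**2."""
    win_len = int(round(fs_hz * win_ms / 1000.0))
    hop_len = int(round(fs_hz * hop_ms / 1000.0))
    # Hamming taper (assumed choice of window function).
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (win_len - 1))
              for k in range(win_len)]
    energies = []
    for start in range(0, len(samples) - win_len + 1, hop_len):
        energy = sum((samples[start + m] * window[m]) ** 2
                     for m in range(win_len))
        energies.append(energy)
    return energies
```

Plotting the returned list against frame index gives the loudness contour: a longer `win_ms` smooths the contour more heavily, a very short one follows individual pitch pulses instead of the envelope.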

00:37:49

and this is a very simple very useful feature especially when we put it

00:37:53

together with something like pitch we can start to actually build even

00:37:56

a simple emotion classifier just using speech just using pitch

00:38:00

values and just using energy values pitch and energy go up

00:38:03

for the sort of positive happy emotions and energy values go down for more negative emotions

00:38:07

so even with very very simple features you can build very crude sort

00:38:11

of classifiers with them pitch detection is sort of relying on um

00:38:17

extracting the sort of fundamental frequency of the actual

00:38:20

speech signal itself so the rate of vocal fold vibration itself

00:38:24

pitch is a sort of perceptual characteristic that we can't really extract or measure

00:38:29

fully itself so when i talk about pitch when i talk about fundamental frequency i'm really talking about the rate

00:38:33

of vocal fold vibration because with the rate of vocal fold vibration essentially this perceptual quality of pitch

00:38:40

actually changes and that's a little

00:39:14

yeah

00:39:17

so you can see now as the uh sort of rate of vibration changes

00:39:20

the actual pitch that we hear changes but we're not measuring

00:39:24

exactly the pitch we hear but as i said actually measuring the sort of rate of vibration of the vocal folds themselves

00:39:29

and this is actually one of the most difficult tasks we can do in sort of

00:39:33

speech processing i'm not gonna go too much sort of into the real details

00:39:37

of very complex pitch extraction algorithms today because of the sheer

00:39:41

sort of difficulty we could do a whole lecture on sort of

00:39:45

pitch detection and very good ways of doing pitch detection itself but

00:39:50

the reason it's an exceptionally difficult task is 'cause speech signals have this sort of

00:39:54

quasi periodic nonstationary property to them as i said they're constantly changing

00:39:59

changing at a very very quick rate so this rate of vocal fold vibration

00:40:02

is constantly changing and we have this filter that sits on top

00:40:06

of it as well that we've gotta get rid of to actually find and locate and identify this sort of

00:40:11

rate of vocal fold vibration and these filter positions are pretty much infinite in exactly where they could be so

00:40:18

we're trying to really get this very fast accurate information with the filter trying to get in the way of this information itself

00:40:24

and at the same time dealing with natural variations in sort of human voices a lot

00:40:29

of possible ranges of the temporal structures so it's exceptionally difficult to do very very accurate

00:40:34

sort of pitch detection at the same time in saying this it's very

00:40:38

very easy to do very very crude pitch detection algorithms quite quickly itself

00:40:42

and some of the more crude ways we can actually do this is just using very simple things like uh yeah um

00:40:48

a short time autocorrelation function or the average magnitude difference function this is essentially just correlating the signal with itself

00:40:55

looking where the peaks are between the correlations and using that as sort of a measure of pitch itself

00:41:00

we only really talk about pitch as a voiced speech characteristic because

00:41:05

it's the only time the vocal folds themselves are actually vibrating itself and we can make this task

00:41:09

a little bit easier when using sort of autocorrelation functions or average magnitude difference functions

00:41:14

by just doing a bit of filtering we know there are sort of typical ranges that pitch will appear in

00:41:19

so we don't need to start looking for pitch values up in the thousands of hertz themselves

00:41:23

so we can sort of throw away a lot of the high frequency information remove

00:41:27

this and then sort of try and perform these other ways of doing it

00:41:31

the autocorrelation function we'll also come back and revisit when we talk about linear

00:41:35

prediction analysis and how to actually get the vocal tract information itself

00:41:39

so the autocorrelation is simply the um correlation of the signal with itself and this

00:41:45

is it's just essentially expressed as a function of sort of time delays itself

00:41:49

and we express this as some sort of function of k the main properties of

00:41:53

the autocorrelation function are that it is even it has a global

00:41:56

maximum at zero itself and this global maximum is always equal to

00:42:00

the energy of the signal itself so sort of the

00:42:03

zero point here is energy and the autocorrelation is periodic

00:42:08

itself for periodic signals that we feed into it

00:42:10

we can use this periodicity to essentially allow us to determine the pitch period from the signals themselves so

00:42:22

what we find when we actually feed in speech signals which are perfectly periodic with

00:42:27

some sort of periodicity itself we can generally find that the uh sorry for perfectly

00:42:32

periodic when we have a perfectly periodic signal itself we find the

00:42:36

first maximum peak that we can actually see within the sort of um

00:42:42

autocorrelation function is the actual pitch of the signal itself that we're looking for this is

00:42:47

the first harmonic we're able then to use this sort of time difference to work out

00:42:51

the fundamental frequency knowing the sampling frequency of the signal to then work backwards and get the pitch value itself

00:42:56

and this isn't exactly really true for speech signals 'cause they're not really periodic itself

00:43:01

but presuming they are we can look at the autocorrelation function find the first peak find the next maximum peak

00:43:07

after this and look at the number of samples between them to actually extract the pitch of the signals themselves
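
The crude autocorrelation pitch estimate described above can be sketched as follows. This is an illustration, not a production pitch tracker: the function names and the 60 to 400 Hz search range are assumptions, and the lag search range plays the role of the filtering step mentioned in the lecture by ignoring implausible pitch values.

```python
import math


def autocorrelation(frame, lag):
    """r[lag] = sum over n of frame[n] * frame[n + lag]; r[0] is the frame energy."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))


def autocorr_pitch(frame, fs_hz, fmin_hz=60.0, fmax_hz=400.0):
    """Estimate F0 as fs / lag, where lag is the highest autocorrelation
    peak inside the plausible pitch-period range."""
    min_lag = int(fs_hz / fmax_hz)
    max_lag = min(int(fs_hz / fmin_hz), len(frame) - 1)
    best_lag = max(range(min_lag, max_lag + 1),
                   key=lambda lag: autocorrelation(frame, lag))
    return fs_hz / best_lag
```

The frame has to contain at least two pitch periods for a peak to appear, which is one reason pitch extraction often uses windows longer than 25 ms.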

00:43:13

the average magnitude difference function is roughly the autocorrelation function but we're doing subtraction instead of

00:43:18

multiplication and we're looking for troughs instead of peaks within the same signals itself

00:43:23

so if we have some sort of voiced analysis we have some sort of periodic function

00:43:27

we can see that we have a first peak being the energy of the signal

00:43:31

itself or the first trough again sort of corresponding with the energy of the signal

00:43:35

itself we have the next maximum value and it's this sort of sample here

00:43:39

the number of samples here which we'll know as it's set by the sampling frequency that we actually extract the signal

00:43:44

at and from there we can extract a very quick crude measure of the actual pitch of the signal itself
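
For comparison, here is a minimal sketch of the average magnitude difference function (AMDF) approach, where the pitch period shows up as the deepest trough rather than the highest peak. The names, the per-lag normalisation and the search range are assumptions made for this note.

```python
import math


def amdf(frame, lag):
    """Mean absolute difference between the frame and itself shifted by lag."""
    n_terms = len(frame) - lag
    return sum(abs(frame[n] - frame[n + lag]) for n in range(n_terms)) / n_terms


def amdf_pitch(frame, fs_hz, fmin_hz=60.0, fmax_hz=400.0):
    """Estimate F0 as fs / lag, where lag is the deepest AMDF trough
    inside the plausible pitch-period range."""
    min_lag = int(fs_hz / fmax_hz)
    max_lag = min(int(fs_hz / fmin_hz), len(frame) - 1)
    best_lag = min(range(min_lag, max_lag + 1),
                   key=lambda lag: amdf(frame, lag))
    return fs_hz / best_lag
```

On a perfectly periodic signal the troughs at one, two and three periods are all equally deep, so a real implementation needs some rule for preferring the shortest lag. Real speech decays and drifts slightly, which usually makes the first trough the deepest.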

00:43:50

again it's not the most accurate one and it's not used a lot but it gives you an idea

00:43:56

of how we could do it on a very very rough basis if we wanna sort of do more

00:44:01

fine grained exact pitch detection we probably wouldn't do it in the time domain

00:44:06

to begin with we probably would do it more in the frequency domain

00:44:09

we're looking at things like sub harmonic structures where we start to break the signal down into different groups of harmonics

00:44:14

start to sum these harmonics on top of each other and from this

00:44:17

sort of summation of harmonics a sort of natural dominant harmonic comes up

00:44:22

and so that gives us the idea of where the pitch is in the signal itself we also might use

00:44:26

the cepstrum so this is the inverse fourier transform of the log magnitude spectrum itself

00:44:31

so just taking the log magnitude then taking the inverse of that itself and pitch can often come up as a dominant peak

00:44:37

within the cepstrum so we can use these two different methods which are roughly more in line with what is done

00:44:43

in practice most of the time we use one of these two methods to extract the

00:44:46

pitch information this is just a rough idea of what's happening with sub harmonic

00:44:50

summation itself it's breaking the signal down into different harmonic groups doing

00:44:54

some filtering and extracting harmonics and eventually we just add up and

00:44:58

add up these different harmonics over the window itself and then the

00:45:01

dominant peak the tallest peak essentially comes out as the actual

00:45:06

pitch of the signal so if we're looking at a straight spectrogram it's very hard

00:45:11

to track the exact pitch contour itself the sub harmonic spectrum really allows us to very very

00:45:17

easily identify the pitch contour itself where we're just gonna trace the sort of dominant peaks every time

00:45:23

itself again the cepstrum taking the log of the fourier transform and taking the inverse

00:45:29

and we can find the dominant peak within here correlating again very much with the pitch of the actual signal
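
The cepstral route (take the log magnitude of the Fourier transform, then the inverse transform, then look for a dominant peak) can be sketched with a deliberately naive DFT so that no third-party library is needed. The function names, the `floor` constant that avoids log(0), and the search range are assumptions; real code would use an FFT and usually a tapered window.

```python
import cmath
import math


def dft_log_magnitude(frame, floor=1e-10):
    """log|X[k]| for every DFT bin, computed with an O(N^2) direct DFT."""
    n = len(frame)
    log_mag = []
    for k in range(n):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        log_mag.append(math.log(abs(acc) + floor))
    return log_mag


def real_cepstrum(frame):
    """Inverse DFT of the log magnitude spectrum.

    The log magnitude of a real signal is real and symmetric, so the
    inverse transform reduces to a cosine sum and the result is real.
    """
    n = len(frame)
    log_mag = dft_log_magnitude(frame)
    return [sum(log_mag[k] * math.cos(2.0 * math.pi * k * q / n)
                for k in range(n)) / n
            for q in range(n)]


def cepstral_pitch(frame, fs_hz, fmin_hz=60.0, fmax_hz=400.0):
    """Estimate F0 as fs / q, where q is the quefrency (in samples) of the
    largest cepstral peak inside the plausible pitch-period range."""
    cepstrum = real_cepstrum(frame)
    min_q = int(fs_hz / fmax_hz)
    max_q = min(int(fs_hz / fmin_hz), len(frame) // 2)
    best_q = max(range(min_q, max_q + 1), key=lambda q: cepstrum[q])
    return fs_hz / best_q
```

The slowly varying vocal tract envelope ends up at low quefrencies and the harmonic ripple from the source ends up at the pitch period, which is why restricting the search to `min_q` and above separates the two. Peaks also appear at multiples of the pitch period, so a narrower `fmin_hz` can help when the signal is very regular.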

00:45:36

very rough introduction if you are more interested definitely read the papers pitch

00:45:42

algorithms pitch extraction is a big open sort of area field of

00:45:46

study that's always actively ongoing and lots of different ways

00:45:50

of going on there but sort of moving up our vocal tract

00:45:53

moving up our source filter model now looking at sort of different

00:45:58

ideas and different features and the next sort of grouping of features anyway is sort of voice quality features

00:46:03

so as i said we're uh looking at some sort of measures of some sort of

00:46:06

irregular phonation themselves because in the video that i showed yeah

00:46:10

we always make the assumption as i said that the vocal folds are always opening and shutting and always doing this at some sort of

00:46:16

rate and shutting completely but this never really happens in real life and we can sort of hear this in speech signals themselves

00:46:22

so someone with a very very tense sort of very nervous sounding voice that's generally happening 'cause they've got more sort of

00:46:28

rigid vocal folds themselves a sort of slightly robotic sound that you can often hear it's very easy to sort of do it yourself

00:46:34

by just sort of pulling and pinching your vocal tract really tightly sort of tensing the muscles in and around there

00:46:39

at the same time you can hear things like hoarseness you hear things like sort of a croakiness

00:46:44

as someone might not be sort of opening and shutting the vocal folds fully and there's sort of yeah this sort

00:46:49

of noise this sort of air rushing through within here itself and we can extract this and look at this

00:46:55

information in a lot of different ways within the openSMILE toolbox itself the main sort of voice quality features themselves are jitter

00:47:01

shimmer and also the harmonic to noise ratio jitter

00:47:04

uh is really deviations from perfect pitch periodicity

00:47:08

shimmer is deviations in energy and harmonic to noise ratio is pretty

00:47:12

much exactly what it sounds like the harmonic to noise ratio

00:47:15

so looking at the noise and looking at the level of the sort of different harmonics within

00:47:19

it and that's done by the autocorrelation function we can extract this sort of information
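
The three voice quality measures named here can be sketched with commonly used definitions. These are assumptions for illustration and not necessarily the exact formulas openSMILE uses: jitter and shimmer are computed as the mean absolute difference between consecutive cycles divided by the mean value (often called the "local" variant), and the harmonic to noise ratio is derived from the normalised autocorrelation value at the pitch lag.

```python
import math


def local_perturbation(values):
    """Mean absolute cycle-to-cycle difference, relative to the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(a - b) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


def jitter(pitch_periods):
    """Cycle-to-cycle variation of the pitch period (any consistent time unit)."""
    return local_perturbation(pitch_periods)


def shimmer(cycle_amplitudes):
    """Cycle-to-cycle variation of the per-cycle peak amplitude."""
    return local_perturbation(cycle_amplitudes)


def hnr_db(normalised_autocorr_peak):
    """HNR in dB from r = R[pitch lag] / R[0], with 0 < r < 1.

    r is treated as the fraction of energy that is periodic and
    1 - r as the fraction that is noise.
    """
    r = normalised_autocorr_peak
    if not 0.0 < r < 1.0:
        raise ValueError("normalised autocorrelation peak must be in (0, 1)")
    return 10.0 * math.log10(r / (1.0 - r))
```

A perfectly regular voice would give zero jitter and zero shimmer; the per-cycle periods and amplitudes passed in here would come from a pitch tracker, which is outside the scope of this sketch.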

00:47:23

so just a little example of what i mean so we're looking at shimmer being these perturbations in sort of energy

00:47:29

itself and then jitter being perturbations in the pitch period that we might be looking at

00:47:34

within a particular signal and it's very very easy to sort of see and hear differences

00:47:38

in shimmer and jitter within the speech signals

00:47:48

here as you can see we've got very low jitter very low shimmer values itself

00:47:54

just someone speaking in a very normal sort of low level of

00:48:02

this is sort of shouting we can hear this and sort of see here

00:48:05

the differences we can see differences in the sort of distribution of the jitter

00:48:09

and shimmer values themselves so that's roughly voice quality features as sort of

00:48:14

provided by the openSMILE toolbox again there's a lot more

00:48:17

different ways of looking at the glottal flow and measuring glottal flow very

00:48:20

precisely looking at very different aspects of the opening and closing um

00:48:26

ratios and again voice quality and glottal flow modelling is a whole nother area

00:48:30

of study that if you guys are interested definitely read up more

00:48:33

on and there's some very cool ways we can hear very much differences in sort of this

00:48:37

idea of the differences in voice quality in these sort of productions that we can actually hear

00:48:42

they're not really related or sorry they are really related to the vocal folds opening and shutting and the properties of that

00:48:49

the next big sort of feature representation i wanna talk about is linear predictive coding which itself

00:48:54

isn't really a feature representation we can use them we can use them

00:48:57

as a feature but it's really the first step towards identifying what the vocal tract spectrum is

00:49:03

and then from there we can sort of go on extract sort of formant frequency information

00:49:07

from the actual vocal tract spectrum itself so linear prediction coding is um

00:49:13

basically um sorry let me see the slide yep it's used to estimate the vocal tract transfer function itself

00:49:17

and from this we can identify dominant peaks in the shape of the vocal

00:49:20

tract spectrum and essentially determine the filter coefficients of any autoregressive system

00:49:25

itself so any sort of system where we have lots of filters essentially

00:49:29

cascaded together and we wanna identify this information and we can use it to do a lot of things

00:49:34

we can take from the original speech signal we do the linear prediction coding we get the filter coefficients of the actual vocal tract

00:49:40

from this we can then take the original signal we can do the filtering we can extract the excitation signal itself
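
The filtering step mentioned here can be sketched as follows, assuming the predictor coefficients are already known. With coefficients a_1..a_p, the excitation (residual) is e[n] = s[n] - sum_k a_k s[n-k], and running the same coefficients as an all-pole filter rebuilds the signal from an excitation. The function names and the sign convention are assumptions for this note.

```python
def inverse_filter(samples, lpc_coeffs):
    """Residual e[n] = s[n] - sum_k a_k * s[n-k], with lpc_coeffs = [a_1, ..., a_p].

    Samples before the start of the signal are treated as zero.
    """
    order = len(lpc_coeffs)
    residual = []
    for n in range(len(samples)):
        predicted = sum(lpc_coeffs[k] * samples[n - 1 - k]
                        for k in range(order) if n - 1 - k >= 0)
        residual.append(samples[n] - predicted)
    return residual


def synthesis_filter(excitation, lpc_coeffs):
    """All-pole filter s[n] = e[n] + sum_k a_k * s[n-k], the inverse of the above."""
    order = len(lpc_coeffs)
    output = []
    for n in range(len(excitation)):
        predicted = sum(lpc_coeffs[k] * output[n - 1 - k]
                        for k in range(order) if n - 1 - k >= 0)
        output.append(excitation[n] + predicted)
    return output
```

Feeding a different excitation (for example an impulse train at a new pitch) through `synthesis_filter` with coefficients estimated from real speech is the basic idea behind the speech synthesis use mentioned in the lecture.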

00:49:46

or we can do some sort of form of actually reproducing speech so we can have some sort of filter characteristics

00:49:50

we can feed in some sort of excitation signal and do some sort of speech synthesis

00:49:55

so a lot of different reasons we wanna look at the sort of linear predictive coding

00:49:59

it provides a really good model of speech production it provides

00:50:02

a very very accurate shape of the vocal tract spectrum

00:50:05

and it leads to reasonably good source filter separation if we extract a sort of high enough order filter

00:50:11

that we can actually do that we can do some sort of

00:50:13

inverse filtering the other advantage here is it's analytically tractable and

00:50:18

despite the chunk of maths i'm gonna show you in the next couple of slides it's actually quite

00:50:23

simple and quite a straightforward implementation to extract the linear predictive coding and actually do it

00:50:29

so the basic idea of linear predictive coding this doesn't just apply to speech this applies to most signals itself

00:50:35

we're essentially saying the current sample in this case the current speech sample can be

00:50:40

approximated as some sort of linear combination of past speech samples itself

00:50:44

so we have a speech sample we have our past bits of information and we have some sort of a

00:50:49

linear combination of these and these actually coincide with the vocal tract filter parameters themselves and these are the

00:50:55

coefficients that we wanna find and we do this by minimising the mean squared error

00:50:59

between the prediction and the actual short time segment of the speech signals

00:51:03

looking at differences every time so we're minimising the sum of squared differences actually just solving this equation here

00:51:10

and then out come the values of the filter parameters so it's pretty straightforward to do

00:51:14

so essentially we have our formulation so we start and go okay here's my current speech signal

00:51:19

here's my past speech signals and we set some order of the

00:51:22

filtering generally this is set somewhere between ten to fourteen

00:51:26

we then have a prediction error we know our true sample we know what we want to predict to

00:51:31

get this prediction error therefore we can actually write this prediction error so this is the actual speech sample

00:51:37

these are the estimated speech samples and we set up the mean squared

00:51:40

prediction error from there so squaring and taking the mean of it

00:51:44

we're differentiating with respect to each coefficient setting it to zero and essentially solving a system of equations here
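
The steps just described can be written out as a short derivation. This is a reconstruction using standard linear prediction notation, since the slide's own equations are not in the transcript: $s[n]$ is the speech sample, $a_k$ are the predictor coefficients, $p$ is the filter order (ten to fourteen in the lecture), and sums over $n$ run over the samples of one short analysis window.

```latex
\hat{s}[n] = \sum_{k=1}^{p} a_k\, s[n-k]
\qquad
e[n] = s[n] - \hat{s}[n] = s[n] - \sum_{k=1}^{p} a_k\, s[n-k]

E = \sum_{n} e[n]^2

\frac{\partial E}{\partial a_i} = 0
\;\Longrightarrow\;
\sum_{k=1}^{p} a_k \sum_{n} s[n-k]\, s[n-i] = \sum_{n} s[n]\, s[n-i],
\qquad i = 1, \dots, p
```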

00:51:50

to actually break it down i'm not gonna solve the system of equations it does sort of come out here and

00:51:56

rearranged slightly what we can actually find when we sort of solve and set up the system of equations itself

00:52:02

it actually comes out that we can express it as the autocorrelation function here

00:52:07

so we're essentially trying to find one part of the autocorrelation using an estimated version from another part

00:52:12

of the autocorrelation function this then allows us to directly sort of set this up

00:52:17

as a system of equations so we take a speech signal take the autocorrelation of the speech signal itself

00:52:22

set this up as a system of equations and then essentially we wanna solve for the a matrix

00:52:27

to find these particular values themselves so while the r values come

00:52:32

from the autocorrelation function we then estimate the filter parameters themselves
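
The slides are not part of the transcript, so the block below is only a reconstruction of the standard autocorrelation formulation being described. The symbols (s for the speech sample, a_k for the predictor coefficients, p for the order, R for the autocorrelation) are assumed names, not necessarily the ones on the slide.

```latex
\hat{s}[n] = \sum_{k=1}^{p} a_k\, s[n-k], \qquad e[n] = s[n] - \hat{s}[n]

E = \sum_{n} e[n]^2, \qquad
\frac{\partial E}{\partial a_i} = 0
\;\Rightarrow\;
\sum_{k=1}^{p} a_k\, R(|i-k|) = R(i), \quad i = 1, \dots, p

R(j) = \sum_{n} s[n]\, s[n-j]
```

The left-hand side of the last system is a matrix whose entry (i, k) depends only on |i - k|, which is the structure the lecture goes on to exploit.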

00:52:36

what we find a very important property is that it's a toeplitz

00:52:39

matrix so it's symmetric with all the diagonal elements being equal

00:52:43

this allows us to solve the system of equations using sort of various different forms itself again

00:52:50

there's a lot of background information 'cause i'm sort of going quickly through a lot of different aspects of speech production

00:52:55

and filtering and excitation sitting behind this so essentially we wanna invert

00:53:01

the autocorrelation matrix itself and then solve so this of

00:53:05

course can be very computationally intensive we're

00:53:08

inverting some big matrix of order between ten to fourteen but because it's toeplitz it's got

00:53:12

the symmetry property to it there are various iterative methods that we actually use to solve this

00:53:17

so there's durbin's algorithm which is setting up a sort of iterative way

00:53:21

to actually sort of work through and solve for the first coefficient then the second

00:53:24

coefficient and so on and so on we build up sort of iterative

00:53:28

solutions here or we can use a gradient descent algorithm essentially what we do a lot in machine learning

00:53:33

some sort of optimisation task that actually allows us to solve this system of equations

00:53:42

uh i'm gonna skip over exactly how the levinson algorithm and gradient descent algorithms work

00:53:48

but they do work that allows us to solve this and then from this model here we can

00:53:52

sort of build up the vocal tract spectrum we can build up the filter parameters themselves
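
The lecture skips over how the recursion works, so here is a minimal sketch of the textbook Levinson-Durbin recursion for the Toeplitz system, written in plain Python so it runs without dependencies. The function names and the toy autocorrelation values are illustrative assumptions, not taken from the lecture material.

```python
def autocorrelation(frame, max_lag):
    """R(j) = sum_n s[n] * s[n - j] for j = 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i - lag] for i in range(lag, n))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve sum_k a_k R(|i-k|) = R(i), i = 1..order.

    Returns (a, error) where a[k-1] is the predictor coefficient a_k
    and error is the remaining prediction error energy.
    """
    a = [0.0] * order
    error = r[0]
    for i in range(order):
        # reflection coefficient for model order i + 1
        acc = r[i + 1] - sum(a[j] * r[i - j] for j in range(i))
        k = acc / error
        new_a = a[:]
        new_a[i] = k
        # update the lower-order coefficients from the previous solution
        for j in range(i):
            new_a[j] = a[j] - k * a[i - 1 - j]
        a = new_a
        error *= 1.0 - k * k
    return a, error


# toy check: an autocorrelation that decays as 0.5 ** lag is a first-order process
coeffs, err = levinson_durbin([1.0, 0.5, 0.25], 2)
```

Each pass solves the problem for one order higher while reusing the previous solution, which is why the cost stays far below a general matrix inversion.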

00:53:57

so what we can see is that as we change p what essentially happens is we get sort of more and more

00:54:01

detailed information of the vocal tract spectrum so we have some sort of frame of the input signal itself

00:54:07

we take some sort of fourier transform and we're getting the short timeframe so this is the spectrum itself

00:54:12

and then with the linear prediction coefficients what we're looking for is the envelope of the vocal tract

00:54:17

so if we start with a very small number of linear prediction

00:54:20

coefficients we'll get a rough very loose shape with order one

00:54:24

you get a little bit of a low pass shape but as we start to put more and more in and increase the order of the vocal tract

00:54:30

order that we're putting into the l. p. c. algorithm we start to get a more detailed shape

00:54:35

we start to see that the frequencies in the spectrum the formant frequencies become

00:54:38

more clear as we sort of go higher and higher in order

00:54:42

the reason we don't go too high in order is essentially we stop

00:54:45

looking at formant information and we start looking at harmonic information itself

00:54:49

if we start to go to orders of sort of twenty we start to see that the sort of harmonic

00:54:54

information that we can see in the spectrum really starts to come into the l. p. c. spectrum itself

00:54:59

it starts to get rid of the formant information makes formant tracking more difficult and um

00:55:05

as i say makes formant tracking a more difficult task and we've got the fourier transform for getting really the

00:55:11

short time spectrum we don't need to do that by some long winded linear prediction coefficient method as well

00:55:17

so one thing we might use linear prediction coefficients for is to help our pitch extraction task so

00:55:22

it can make time domain methods such as autocorrelation functions and the average magnitude difference functions

00:55:27

a lot easier to do so we can essentially get the error signal from

00:55:32

the l. p. c. algorithm itself and then find the pitch

00:55:35

periods directly from the error signal itself

00:55:39

by looking at the prominent peaks and looking at the sort of autocorrelation function there it essentially gets rid of

00:55:45

all the vocal tract information but retains the sort of pitch information within the signal
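
As a rough illustration of pitch estimation from the l. p. c. error signal, the sketch below builds a toy voiced sound (an impulse train passed through a two-pole filter), inverse filters it, and picks the strongest autocorrelation lag of the residual. To stay short it reuses the known filter coefficients rather than estimating them, and every name and number in it is an assumption made for the example.

```python
def lpc_residual(signal, a):
    """Error signal e[n] = s[n] - sum_k a_k s[n-k]; samples before the start count as zero."""
    residual = []
    for n in range(len(signal)):
        prediction = sum(ak * signal[n - 1 - k]
                         for k, ak in enumerate(a) if n - 1 - k >= 0)
        residual.append(signal[n] - prediction)
    return residual


def pitch_period(residual, min_lag, max_lag):
    """Lag (in samples) with the largest autocorrelation of the residual."""
    def acf(lag):
        return sum(residual[i] * residual[i - lag] for i in range(lag, len(residual)))
    return max(range(min_lag, max_lag + 1), key=acf)


# toy "voiced speech": impulse train (period 50 samples) through a stable two-pole filter
a_true = [1.3, -0.8]
period = 50
excitation = [1.0 if n % period == 0 else 0.0 for n in range(400)]
speech = []
for n in range(400):
    sample = excitation[n]
    for k, ak in enumerate(a_true):
        if n - 1 - k >= 0:
            sample += ak * speech[n - 1 - k]
    speech.append(sample)

residual = lpc_residual(speech, a_true)
estimated_period = pitch_period(residual, 20, 150)
```

Inverse filtering removes the ringing of the filter, so the residual is close to the original impulse train and the period shows up as a clean autocorrelation peak.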

00:55:50

and also from the linear prediction coefficients what we're very interested in is getting the formant frequencies

00:55:56

so essentially the formant frequencies are the dominant peaks within here and

00:56:00

we want to pretty much find the dominant peaks and maxima

00:56:03

and they're determined using some numerical analysis we look at the poles and uh sort

00:56:08

of zeroes of the actual function and this allows us then to sort of backtrack

00:56:12

via a sort of electrical engineering style analysis and find the different formant positions
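
A simple way to see the peaks-as-formants idea is to evaluate the l. p. c. envelope on a frequency grid and pick its local maxima. The sketch below does that instead of root finding on the poles, which is a substitution made here to keep the example dependency free. The two resonances at 500 and 1500 hertz and all function names are made-up test values.

```python
import cmath
import math


def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pv in enumerate(p):
        for j, qv in enumerate(q):
            out[i + j] += pv * qv
    return out


def resonance(freq_hz, radius, sample_rate):
    """Second-order factor 1 - 2 r cos(theta) z^-1 + r^2 z^-2 with a pole pair at freq_hz."""
    theta = 2 * math.pi * freq_hz / sample_rate
    return [1.0, -2 * radius * math.cos(theta), radius ** 2]


def lpc_envelope(a, n_points):
    """|1 / A(e^{jw})| for w in [0, pi), where A(z) = 1 - sum_k a_k z^-k."""
    env = []
    for i in range(n_points):
        w = math.pi * i / n_points
        denom = 1.0 - sum(ak * cmath.exp(-1j * w * (k + 1)) for k, ak in enumerate(a))
        env.append(1.0 / abs(denom))
    return env


def formant_candidates(a, sample_rate, n_points=512):
    """Frequencies (Hz) of the local maxima of the envelope."""
    env = lpc_envelope(a, n_points)
    return [i * sample_rate / (2 * n_points) for i in range(1, n_points - 1)
            if env[i - 1] < env[i] > env[i + 1]]


sample_rate = 8000
A = poly_mul(resonance(500, 0.97, sample_rate), resonance(1500, 0.97, sample_rate))
a = [-c for c in A[1:]]  # predictor coefficients a_k from A(z)
formants = formant_candidates(a, sample_rate)
```

On real speech the peaks are broader and can merge, which is part of why the tracking problem mentioned next is hard.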

00:56:17

formant tracking can be an exceptionally hard task to do we need a good estimate of the l. p. c.

00:56:23

and then we can still have sort of a lot of things like the first and second

00:56:26

formant that can cross over at various different points of time which make this sort of algorithm

00:56:32

a lot harder to run again it's a whole nother lecture on sort of formant tracking so i'm just gonna sort of skip over here

00:56:40

but there's different ways we can actually track them and these algorithms you can

00:56:43

look up so the main one here is looking at the peaks

00:56:47

so the last group of low level features that i'll talk about

00:56:51

this morning are the power spectrum so spectral analysis

00:57:00

um so this is just taking the spectrum looking at the

00:57:04

different harmonic structures looking how different frequency information changes from sort of low

00:57:08

frequency to high frequency information in the signals themselves if we sort of build

00:57:12

up this information over time we can really look at temporal

00:57:15

displacements of sort of frequencies and how they shift so the power spectrum itself is just

00:57:20

the log of the discrete time fourier transform as implemented by the fast fourier transform itself

00:57:25

so we take in some sort of hamming window and then we essentially just do

00:57:28

the f. f. t. and then after the f. f. t. we can square it take the log value

00:57:32

and we get the sort of underlying signal here this is the l. p. c. analysis at the top of it but you can actually see within the

00:57:39

power spectrum it broadly does follow the formant shape we actually get within it
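
The window, transform, square and log steps can be sketched as below. A naive discrete fourier transform is used in place of a real f. f. t. so the example needs nothing beyond the standard library; it computes the same values, only more slowly. The 64 sample test tone is an arbitrary choice.

```python
import cmath
import math


def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft(frame):
    """Naive DFT, returning only the non-negative frequency bins."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
            for k in range(n // 2 + 1)]


def log_power_spectrum(frame):
    """Window the frame, transform it, square the magnitude, take the log."""
    windowed = [x * w for x, w in zip(frame, hamming(len(frame)))]
    return [math.log(abs(c) ** 2 + 1e-12) for c in dft(windowed)]


# a pure tone that completes 8 cycles in a 64-sample frame
frame = [math.sin(2 * math.pi * 8 * i / 64) for i in range(64)]
spectrum = log_power_spectrum(frame)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
```

The small constant inside the log only guards against taking the log of zero.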

00:57:44

from this sort of l. p. c. from the sort of formant analysis we can either use directly

00:57:50

the sort of two hundred and fifty six or five hundred and twelve features that we

00:57:53

might actually get out that represent the information the spectral information

00:57:56

from the f. f. t. or we might sort of break this down so we can look at aspects such as spectral gradient

00:58:01

look at the overall roughly linear shape a rough line of the distribution from low to

00:58:05

high frequency of the actual power spectrum itself

00:58:10

or we might look at different roll off points where different bits of the energy occur

00:58:13

or different entropy so the higher the entropy the more noise like

00:58:17

the more information is essentially embedded within it so the spectral gradient spectral roll off point

00:58:23

and entropy really give us a lot of information that we can then go and use in sort of

00:58:28

algorithms and these are extracted with you know open smile
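
A minimal sketch of the three summary descriptors just named, computed from one power spectrum. The exact definitions vary between toolkits, so the least squares slope, the 85 percent roll off fraction and the base two entropy used here are assumptions for illustration rather than what open smile computes.

```python
import math


def spectral_slope(freqs, power):
    """Least-squares slope of power against frequency."""
    n = len(freqs)
    f_mean = sum(freqs) / n
    p_mean = sum(power) / n
    num = sum((f - f_mean) * (p - p_mean) for f, p in zip(freqs, power))
    den = sum((f - f_mean) ** 2 for f in freqs)
    return num / den


def spectral_rolloff(freqs, power, fraction=0.85):
    """Lowest frequency below which `fraction` of the total energy lies."""
    target = fraction * sum(power)
    running = 0.0
    for f, p in zip(freqs, power):
        running += p
        if running >= target:
            return f
    return freqs[-1]


def spectral_entropy(power):
    """Entropy (bits) of the spectrum treated as a probability distribution."""
    total = sum(power)
    probs = [p / total for p in power if p > 0]
    return -sum(p * math.log2(p) for p in probs)


freqs = [1000.0 * i for i in range(8)]
flat = [1.0] * 8                                   # noise-like: energy everywhere
peaky = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # tone-like: one dominant bin
```

The flat spectrum gives the maximum entropy of three bits for eight bins, while the single peak gives zero, which matches the noise-like intuition in the lecture.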

00:58:31

as noted we'll talk about the sort of open smile a bit more

00:58:35

in the afternoon so we might take the spectrogram of the features we might be interested

00:58:39

in doing some feature representation learning from spectrograms i'll talk a bit more about that

00:58:43

very very shortly the spectrogram is just essentially taking successive f. f. t. estimates of the actual

00:58:49

signal through a sort of sliding window itself so that's all this algorithm is

00:58:53

it's just sort of taking f. f. t. estimates like this every time and extracting different

00:58:57

sliding over time with the windowing function extracting the energy at different frequencies

00:59:01

using the exponential here is all that's really going on there
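
The sliding window idea can be written out directly. This is a sketch with a naive transform and made-up parameters (a window of 100 samples moved on by 50), and the test signal switches tone halfway so the change is visible across frames.

```python
import cmath
import math


def spectrogram(signal, win_len, hop):
    """Magnitude spectra of Hamming-windowed frames taken every `hop` samples."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (win_len - 1))
              for i in range(win_len)]
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        segment = [signal[start + i] * window[i] for i in range(win_len)]
        frames.append([
            abs(sum(x * cmath.exp(-2j * math.pi * k * i / win_len)
                    for i, x in enumerate(segment)))
            for k in range(win_len // 2 + 1)
        ])
    return frames


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# 5 cycles per 100 samples for the first half, 20 cycles per 100 samples afterwards
signal = [math.sin(2 * math.pi * (5 if n < 200 else 20) * n / 100) for n in range(400)]
spec = spectrogram(signal, win_len=100, hop=50)
```

Each row of `spec` is one time step and each column one frequency bin, so the dominant bin moves from 5 in the first frame to 20 in the last.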

00:59:06

um when we're doing spectral analysis we've always gotta consider that there's some sort of

00:59:11

trade off going on and this really relates to different properties of fourier transform analysis

00:59:16

so when we're doing fourier transform analysis and doing spectral analysis essentially

00:59:21

our two parameters of interest are directly related to each other so temporal

00:59:25

resolution and frequency resolution are set by one parameter being the window length

00:59:29

and this then essentially means we have to have some sort of

00:59:33

trade off when performing some sort of spectrogram generation i put this here

00:59:37

a lot of times we skip over it and actually use the sort of same window lengths time and time again

00:59:42

but essentially every time we do spectral analysis we can't really get both good frequency information and good

00:59:48

temporal information because the two are controlled by the same parameter there's no ideal window length

00:59:53

as i said we sort of fall into some default standards but it's good to remember

00:59:57

there's a whole lot of theory that sits under here around sort of different properties here

01:00:01

so what can happen is essentially if we choose a very very

01:00:04

long window length we get very very good frequency information but

01:00:08

obviously a longer window gives us very very poor temporal information within

01:00:12

the signal if we choose a very short window length

01:00:15

we can get very good temporal information but we lose the frequency resolution so we've always got this trade off

01:00:21

the trade off really depends on exactly what you want your goals to be it's some sort of almost tunable parameter

01:00:26

if you're sort of doing this style of analysis within the actual um machine learning algorithms itself
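
To put rough numbers on the trade off: a window of N samples at sample rate fs spans N / fs seconds, and the spacing between frequency bins is fs / N hertz, so improving one worsens the other. The sketch below just tabulates this for a few window lengths; the 16 kilohertz sample rate and the window sizes are assumed values, and bin spacing is used as a simple stand in for frequency resolution.

```python
def window_tradeoff(window_samples, sample_rate):
    """Return (window duration in seconds, frequency bin spacing in Hz)."""
    return window_samples / sample_rate, sample_rate / window_samples


sample_rate = 16000
table = {n: window_tradeoff(n, sample_rate) for n in (128, 400, 2048)}
for n, (duration, spacing) in table.items():
    print(f"{n:5d} samples: {duration * 1000:6.1f} ms window, {spacing:6.1f} Hz per bin")
```

The product of the two numbers is always one, which is the reason no window length can be good at both.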

01:00:32

uh when we always talk about spectral analysis mel

01:00:36

frequency cepstral coefficients are probably the most

01:00:39

dominant spectral feature uh i always recommend them as the sort of go to feature uh

01:00:45

in any sort of speech processing task if you really don't know

01:00:48

where to start what to do mel frequency cepstral coefficients are always a good place to start

01:00:53

something like a support vector machine back end is always a good place to start as well you can get a lot of different information from them

01:00:58

um mel frequency cepstral coefficients are used everywhere they're very dominant in things

01:01:02

like automatic speech recognition they're sort of the main features

01:01:05

when we're thinking about deep learning nowadays the mel spectrum is probably the most common sort

01:01:10

of feature representation that's fed into a lot of deep learning algorithms currently

01:01:14

so what are m. f. c. c.s and what do they represent so it's essentially filtering the

01:01:20

signal using the mel filter which is based very much on uh sort of our hearing response as humans

01:01:26

so all this is showing here is that we're effectively listening

01:01:31

with better resolution at lower frequencies in our hearing response than we

01:01:35

have at high frequencies this is represented in the mel

01:01:37

filter by the sort of series of triangular filters essentially you can see that the filters are sort of narrow uh

01:01:43

at the lower frequencies and they get more spread out have a sort of

01:01:46

wider bandwidth at the high frequencies themselves and that's the mel filter

01:01:50

so when we're forming m. f. c. c.s it's just really

01:01:53

a matter of making the features more and more sort of relevant

01:01:57

trying to lose redundant information from the spectrogram and make it sort of more relevant towards speech processing itself

01:02:03

so we start with the extraction of the spectrogram extraction of the magnitude spectrum

01:02:08

so we take the fast fourier transform of the speech signal itself we take the magnitude of this

01:02:13

we take the logarithm of this this allows us to sort of separate excitation and vocal tract

01:02:17

allows us in a rough way to have some sort of distribution towards human loudness perception on a log scale itself

01:02:23

we then do after the first lot a sort of information reduction itself so we might have taken

01:02:28

an f. f. t. being two hundred and fifty six point or five hundred and twelve point

01:02:32

and then we sort of filter this down to forty points using triangular filters equidistant

01:02:37

over the mel scale itself and this gives us the mel band spectrogram itself

01:02:41

as i said the mel band spectrogram is often fed a lot into different deep learning

01:02:44

algorithms and c. n. n.s we use it a lot there itself but the

01:02:49

mel band still has a sort of lot of redundant information in there we've got

01:02:53

rid of a lot of the high frequency information but it still can

01:02:56

be sort of compressed and we can really drive this sort of information further

01:03:00

down so we decorrelate it further using the discrete cosine transform

01:03:04

from that we take the first twelve coefficients we generally replace the zeroth coefficient with some

01:03:09

sort of energy representation to get out the mel frequency cepstral coefficients which

01:03:13

don't look like much when you actually just got them on the screen but when you

01:03:16

feed them into algorithms you find they're a particularly very powerful representation

01:03:21

probably the most dominant as i said speech feature that we have

01:03:25

and use sort of across all aspects of speech processing themselves
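
The pipeline just described can be sketched compactly as below, going from one power spectrum to twelve cepstral coefficients. Two things are assumptions of this sketch rather than statements from the lecture: the log is taken after the mel filtering, which is the more common ordering in toolkits, and the mel scale uses the widely quoted 2595 log10(1 + f / 700) formula. The filter count and sizes follow the numbers mentioned in the talk.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    max_mel = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(max_mel * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = (sample_rate / 2) / (n_bins - 1)
    bank = []
    for m in range(1, n_filters + 1):
        low, centre, high = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for b in range(n_bins):
            f = b * bin_hz
            if low < f <= centre:
                row.append((f - low) / (centre - low))      # rising edge
            elif centre < f < high:
                row.append((high - f) / (high - centre))    # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def dct_ii(values, n_out):
    """Unnormalised DCT-II, used to decorrelate the log mel energies."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
            for k in range(n_out)]


def mfcc(power_spectrum, sample_rate, n_filters=40, n_coeffs=12):
    bank = mel_filterbank(n_filters, len(power_spectrum), sample_rate)
    log_energies = [math.log(sum(w * p for w, p in zip(row, power_spectrum)) + 1e-10)
                    for row in bank]
    # drop the zeroth coefficient; it is usually replaced by an energy term
    return dct_ii(log_energies, n_coeffs + 1)[1:]


# 257 bins corresponds to a 512-point transform; the flat spectrum is a placeholder input
coeffs = mfcc([1.0] * 257, sample_rate=16000)
```

In practice the input would be the power spectrum of each windowed frame, giving one twelve dimensional vector per frame.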

01:03:29

typically what we do then is take the mel frequency cepstral vector itself

01:03:34

and then as i said we've often put log energy in and then we take

01:03:37

the delta and delta delta coefficients so this is essentially looking at how different distributions

01:03:43

change over time so we have a frame and we look at how it's sort of

01:03:47

changed from the frame before it and delta delta coefficients are the change of the change

01:03:51

so we stack these normally together so some sort of twelve or thirteen coefficients once we've got log energy there so

01:03:57

we stack the delta stack the delta delta and it's thirty six

01:04:00

coefficients that generally represent one of the best sort of most powerful

01:04:05

low level descriptors that we have within speech processing nowadays a. s. r. is very

01:04:09

heavily based on the mel spectrum or mel frequency cepstral coefficients themselves
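
A sketch of the delta stacking step. The lecture describes the delta as the change from the frame before; the version below uses the regression over two frames either side that many toolkits use, which is a substitution for illustration. The function names and the ramp test data are made up.

```python
def deltas(frames, width=2):
    """Regression-based rate of change over +/- `width` frames (edges are repeated)."""
    n = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(d * d for d in range(1, width + 1))
    out = []
    for t in range(n):
        row = []
        for c in range(dim):
            num = sum(d * (frames[min(n - 1, t + d)][c] - frames[max(0, t - d)][c])
                      for d in range(1, width + 1))
            row.append(num / denom)
        out.append(row)
    return out


def stack_with_deltas(frames):
    """Concatenate static, delta and delta-delta coefficients per frame."""
    d1 = deltas(frames)
    d2 = deltas(d1)
    return [static + first + second for static, first, second in zip(frames, d1, d2)]


# ten frames of twelve coefficients that all rise by one per frame
frames = [[float(t)] * 12 for t in range(10)]
stacked = stack_with_deltas(frames)
```

Twelve static coefficients become thirty six per frame; for the steadily rising ramp the delta is one and the delta delta is zero away from the edges.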

01:04:15

okay so there's two more little things that i'll cover quickly before we break for this morning

01:04:23

three things that i'll cover quickly before we break for the morning and this is sort of how we go from these low level

01:04:29

features and start to form something that we're actually feeding into a

01:04:31

machine learning algorithm a lot of the time within speech pathology itself

01:04:35

we're not really interested in the low level descriptors we're not interested in this sort of very very quick

01:04:40

change because this is often correlated a lot to the linguistic

01:04:45

content we're often interested more in how these distributions vary over the course

01:04:49

of a chunk of the speech signal or even in terms of the whole utterance of an actual speech signal itself

01:04:54

this is the idea of suprasegmental feature analysis this is essentially saying we have some sort of

01:05:00

we're creating essentially a single fixed length vector uh which describes the sequences of short term features

01:05:06

or as i said summarises their evolution over time over the course of

01:05:10

some sort of window maybe five seconds or a whole utterance itself

01:05:13

so we start with our frame level we have our frames of twenty five milliseconds overlapped by ten

01:05:18

milliseconds we extract our l. l. d.s from all of these frames is essentially what we're doing here

01:05:23

okay this sort of single fixed length vector here is summarising all these

01:05:27

l. l. d.s using some sort of statistical functional measures like mean

01:05:31

variance standard deviations roll off points all sorts of things we use to actually derive this here

01:05:36

really we can get that very fixed length representation the real power of this sort of

01:05:41

suprasegmental features is sort of two things one the

01:05:45

information that we're looking for a lot in speech pathology isn't embedded

01:05:49

in the actual short time frame information it's more embedded in the sort of distribution over the utterance

01:05:54

of the sort of information here so that's what we're capturing using the statistical functionals

01:05:58

and also it provides this sort of single fixed length vector regardless of the utterance length

01:06:03

so it means essentially we're creating a similar vector uh for every single one of our data points in our cohort

01:06:09

we're really feeding in the sort of same information the same amount of information into machine learning algorithms

01:06:16

we've not got the sort of weighting that we have to do with low level features and sort of creating

01:06:20

estimates every twenty five milliseconds and summing them up somehow but we're really discounting the utterance length values

01:06:27

so it doesn't quite matter how much small bits of variation in the utterance length there

01:06:30

is we're getting the single fixed length vector out of it at the same time

01:06:34

so what do i mean by statistical functionals as i've already said we'll

01:06:37

use things like means moments extremes percentiles slopes regression lines

01:06:42

any sort of summarising information that we can so we start with some sort of feature distribution

01:06:47

we might look at the mean look at the standard deviation different spots within here different bits of information we can extract
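
A small sketch of applying statistical functionals to frame level descriptors so that every utterance, whatever its length, ends up as one fixed length vector. The particular set of seven functionals and the function names are chosen for the example and are far smaller than what a toolkit such as open smile would produce.

```python
import statistics


def functionals(contour):
    """Summarise one low-level descriptor contour (a list of per-frame values)."""
    n = len(contour)
    mean = statistics.mean(contour)
    t_mean = (n - 1) / 2
    # slope of the least-squares regression line through the contour
    slope = (sum((t - t_mean) * (x - mean) for t, x in enumerate(contour))
             / sum((t - t_mean) ** 2 for t in range(n)))
    return {
        "mean": mean,
        "std": statistics.pstdev(contour),
        "min": min(contour),
        "max": max(contour),
        "range": max(contour) - min(contour),
        "median": statistics.median(contour),
        "slope": slope,
    }


def utterance_vector(lld_frames):
    """lld_frames: list of frames, each a list of descriptor values."""
    vector = []
    for contour in zip(*lld_frames):          # one contour per descriptor
        vector.extend(functionals(list(contour)).values())
    return vector


short_utterance = [[float(t), 1.0] for t in range(5)]
long_utterance = [[float(t), 1.0] for t in range(50)]
```

Both utterances have two descriptors per frame, so both map to a vector of fourteen values even though one has ten times as many frames.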

01:06:54

this allows us to generate very very big feature spaces very very quickly this is what happens in

01:07:00

open smile this is what we sort of do here we have our n acoustic features

01:07:05

and these are referred to as the low level descriptors often we

01:07:08

easily extract the sort of delta and delta delta coefficients

01:07:11

and then we use these m functionals to summarise these l. l. d.s out into this sort of utterance level

01:07:16

so we can generate very very large feature spaces and these feature spaces

01:07:20

contain a lot of rich information relating to sort of

01:07:23

speech pathology emotion whatever our particular task is within paralinguistics

01:07:27

essentially what's going on how it's being said really is embedded a lot in this very sort of rich representation

01:07:33

itself and this is often known as brute forcing of large feature spaces so

01:07:37

starting with our low level descriptors starting with putting the delta and delta deltas

01:07:42

in there and then starting to use the functionals themselves to come up with a very very rich very sort of wide

01:07:49

distribution and how we extract these there's a little bit more information on the different feature representations

01:07:54

again you guys will learn all about this and start to extract

01:07:57

some information regarding these in the tutorial this afternoon

01:08:02

so one other way we can extract the sort of utterance level representations is

01:08:06

by using what's known as bag of audio words which is sort of um

01:08:10

a little bit of advertising mainly for the group and the work we do so bag of words

01:08:14

itself is very much a linguistic representation that comes from natural language processing

01:08:20

so we're looking at a document and then we're looking to sum up the document in terms of the

01:08:23

document just as instances of distributions of words essentially a sort of histogram

01:08:28

of occurrences is essentially what's happening here so i have some sort of document the cat

01:08:32

is on the table we might have some sort of code book or

01:08:36

dictionary and then we're looking at the different frequencies that occur within

01:08:39

this table and it's these frequencies essentially distributions that we can use

01:08:44

as a sort of feature representation that we actually can feed into some sort of machine learning algorithm
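
The text example from the lecture can be written out in a few lines. The codebook below is an invented dictionary for illustration; words that are not in it are simply ignored.

```python
from collections import Counter


def bag_of_words(document, codebook):
    """Histogram of how often each codebook word occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in codebook]


codebook = ["the", "cat", "dog", "table", "on"]
histogram = bag_of_words("the cat is on the table", codebook)
```

The resulting histogram has one entry per codebook word, so its length is fixed by the codebook and not by the length of the document.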

01:08:49

so we look at these different histograms everytime different features different occurrences and feed this information in these can

01:08:55

come from different sort of groups this could be glottal features this could be spectral features

01:09:00

could be prosodic features so on and so on we can make up very very rich sort of histograms itself

01:09:05

uh we developed software in the group called the bag um openXBOW

01:09:11

itself and this is just a toolkit for doing multimodal bag of words formations it's based in java and really

01:09:17

quick and easy to use the main steps are some sort of normalisation of the l. l. d.s some sort of codebook generation

01:09:23

vector quantisation and maybe some sort of post processing on the actual vector quantisation just normally

01:09:28

to normalise to different time lengths within the actual utterances themselves

01:09:34

so bag of audio words can offer sort of different advantages to sort of

01:09:39

suprasegmental feature analysis themselves um one of the core advantages

01:09:43

is robustness through the vector quantisation against some sort of codebook

01:09:47

this allows us to sort of account a little bit for noise that might occur for small perturbations in the input data itself

01:09:53

and we find within our group working on in the wild data real world data

01:09:59

we can sometimes get better results when actually instead of using functionals to sum up an utterance

01:10:03

we use this sort of bag of audio words approach to sum up the utterances themselves

01:10:07

time invariance again is one of the sort of big advantages of using it so we're fixing

01:10:11

the representation regardless of time we can normalise these distributions according to time as well

01:10:17

multimodal fusion is one good aspect of this because a lot of the times we can have speech

01:10:21

linguistic and video information extracted at very much different sort of timing intervals

01:10:26

but again we're sort of expressing things as uh some sort of

01:10:31

utterance level representation and just looking at histograms allows us to do very

01:10:34

simple fusion itself and privacy is one very big aspect that works quite well with a sort of bag of audio words

01:10:41

as i said spectral features contain large amounts of linguistic information

01:10:45

histograms contain no amount of linguistic information within them itself

01:10:49

so it's an irreversible mapping you're just mapping the occurrences of what's going on here there's no way

01:10:53

you can actually reproduce or recover what was said from the actual histogram itself

01:10:59

bag of words formation is very very easy so uh we always start with

01:11:03

audio instances l. l. d. extraction and then we're generally choosing a

01:11:06

codebook and we're doing this bagging thing via vector quantisation which i'll

01:11:10

go over in a minute and so the histogram construction

01:11:14

being the final step the key parameters we normally have are codebook

01:11:18

size and number of assignments i'll talk a little bit more about this

01:11:21

so then we start here with some sort of codebook and this is just some example of some sort of

01:11:27

training set that we wanna quantise against itself so we have a codebook

01:11:31

and we also have a feature set the quantisation is essentially going

01:11:35

what is the nearest codeword my features match to so this could be

01:11:39

just m. f. c. c. data and then we're going okay

01:11:42

using euclidean distance what is essentially the closest distribution

01:11:46

in my codebook and we're finding that and then we're just assigning a histogram

01:11:50

count we do this over time over the course of the actual full

01:11:54

sort of sample that we're doing and we can see this sort of frequency distribution and from here we

01:12:00

can then very very easily find term frequencies sum these up over different windows if we want or over the course of the full

01:12:06

sort of utterance itself and find different distributions there and this uses the size

01:12:11

of the windows and the actual codebook as the actual term frequencies themselves

01:12:16

so we could also do multiple assignments so this is when we're saying instead of assigning

01:12:21

to the nearest codeword we assign to the nearest n words in the codebook itself this is

01:12:26

generally what we wanna do and one of the sort of core parameters when you're doing bag

01:12:29

of audio words is deciding your number of assignments you generally want a sparse representation

01:12:34

but a sparse representation of one is probably too sparse there's not enough information there just playing

01:12:39

around with the number of assignments you can find that can actually give you

01:12:43

very much a different sort of set of results as you actually increase it
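
The quantisation and histogram steps, including multiple assignments, can be sketched as follows. This is not the openXBOW implementation; the tiny two dimensional codebook, the frame values and the parameter names are invented for the example, and real codebooks are usually learnt or sampled from training data.

```python
import math


def nearest_codewords(frame, codebook, n_assign):
    """Indices of the n_assign codewords closest to the frame (Euclidean distance)."""
    ranked = sorted((math.dist(frame, word), idx) for idx, word in enumerate(codebook))
    return [idx for _, idx in ranked[:n_assign]]


def bag_of_audio_words(frames, codebook, n_assign=1, log_tf=False):
    """Histogram of codeword assignments over all frames of one utterance."""
    histogram = [0.0] * len(codebook)
    for frame in frames:
        for idx in nearest_codewords(frame, codebook, n_assign):
            histogram[idx] += 1.0
    if log_tf:
        # optional compression of the term frequencies
        histogram = [math.log(1.0 + count) for count in histogram]
    return histogram


codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
frames = [[0.1, 0.0], [0.9, 1.0], [1.1, 1.0], [4.0, 5.0]]
single = bag_of_audio_words(frames, codebook, n_assign=1)
double = bag_of_audio_words(frames, codebook, n_assign=2)
```

With one assignment each frame votes once; with two assignments the histogram becomes denser, which is the sparsity trade off described above.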

01:12:47

so the last thing i'll talk about very briefly and i'll skip over some of this this morning as we're supposedly getting towards break time

01:12:54

here is the sort of feature representation learning so i mean

01:12:59

low level feature descriptions suprasegmental feature descriptions are very much

01:13:03

based on knowledge as i've sort of talked a lot about this

01:13:05

morning as i said i've skipped over quite a lot of things and all there's a huge amount of information

01:13:10

a huge amount of information behind a lot of pitch features m. f. c. c. features spectral features

01:13:17

based heavily in electrical engineering uh techniques used to develop them are very much knowledge based on

01:13:22

the sort of source filter operation extracting very particular aspects of speech production that

01:13:27

one of the main things that's sort of come out of deep learning is what we can do now is essentially go i

01:13:33

don't really care about embedding that knowledge i wanna you know find a

01:13:37

representation from my data that's very very specific to the task i'm actually doing

01:13:43

and things like convolutional neural networks allow us to do this so feature representation learning

01:13:47

is learning features essentially directly from the data or from some sort of

01:13:51

high level representation of it something like a spectrogram and

01:13:56

just sort of targeting it towards our actual task at hand going

01:14:00

okay i've got a particular task i've got my data itself

01:14:03

get me a representation that's absolutely perfect for this task as a lot

01:14:07

of the time we use m. f. c. c.s for absolutely

01:14:10

everything you can sit there and you can use m. f. c. c.s for absolutely everything with a support vector machine

01:14:15

and someone said to me you always get somewhere between sixty to eighty percent

01:14:19

accuracy when using m. f. c. c.s and a support vector machine for any given task

01:14:23

you're never gonna get a hundred percent accuracy using something like this because of the

01:14:27

all the extra information there's so much information embedded in speech so much variability

01:14:32

that it's very very hard to get high levels of accuracy for a particular task in speech pathology learning

01:14:38

because of the sort of confounding information in here one possible way

01:14:43

with the sort of caveat here that we have absolutely enough training data to actually

01:14:48

really extract the information properly using convolutional neural networks is feature

01:14:53

representation learning thinking about doing targeted feature extraction towards a particular goal itself

01:14:59

and this is mainly done using convolutional neural networks they're a special form of feed forward neural networks

01:15:04

essentially they just do the convolutional operation time and time again at different levels

01:15:08

and sort of targeting feature extraction towards that task itself

01:15:11

and these convolutional kernels get reused time and time again we're essentially learning the weighting the

01:15:16

importance of different convolutional kernels towards the particular task that we might actually have

01:15:22

convolution as uh sort of a refresher or an introduction for

01:15:27

people who might not be familiar with it is essentially a manipulation

01:15:30

operation we're just taking two signals and forming a third signal itself we're

01:15:34

essentially doing an infinite summation over some sort of convolutional kernel

01:15:38

itself with the actual signal it's denoted normally within here as a star operation

01:15:44

when we think convolution the best way to think of any convolutional

01:15:47

operation is always three particular words being flip shift and multiply

01:15:50

so essentially we have a signal we have some sort of convolutional filter we're essentially flipping one of them uh and then just

01:15:57

shifting it across uh the signal itself and taking a series of dot products

01:16:01

to actually form the convolutional operation and that's what's going on when we're doing images it's the same

01:16:06

thing we have a two d. representation we just flip it and then we just shift it over

01:16:11

both the x. and y. dimensions within a particular image itself and again in three d. it goes on convolution has

01:16:17

got n dimensionality itself essentially this is all this sort of maths is showing here i've

01:16:22

taken a convolutional operation uh and i'm just really expressing it in terms of a

01:16:26

set of inner products essentially this is what's happening here we have a time reversed version

01:16:31

and we're doing a shift and multiply operation to actually get the output itself
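
Flip, shift and multiply in one dimension can be written almost word for word. This sketch computes the full discrete convolution y[n] = sum over k of x[k] h[n - k] as a series of dot products; the zero padding at the edges is an implementation choice for the example.

```python
def convolve(signal, kernel):
    """Full discrete convolution of two finite sequences."""
    flipped = kernel[::-1]                       # flip
    pad = len(kernel) - 1
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    n_out = len(signal) + len(kernel) - 1
    return [
        sum(p * f for p, f in zip(padded[n:n + len(kernel)], flipped))  # multiply and sum
        for n in range(n_out)                    # shift
    ]


smoothed = convolve([1.0, 2.0, 3.0], [1.0, 1.0])   # adds each sample to its neighbour
delayed = convolve([1.0, 2.0, 3.0], [0.0, 1.0])    # a shifted impulse delays the signal
```

Convolving with a shifted impulse just delays the input, which is a quick way to check that the flip is in the right place.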

01:16:35

so what's sort of happening here within the different operations when we do sort of feature representation learning itself

01:16:41

essentially my animations have gone wrong so i'll just skip to the end

01:16:45

so we're performing convolutional operations themselves and this is using a

01:16:50

set of filters to identify patterns within the signals and the network sort of learns

01:16:54

the weights associated with each of these filters and similar patterns may occur in multiple regions

01:16:59

itself so it learns the importance of these different features towards the task at hand

01:17:03

then we normally do some sort of down sampling operation and this is generally done by max pooling

01:17:08

so this is just saying in a particular uh frame i'm interested

01:17:11

in just the actual maximum value that comes out of here
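
Max pooling over a one dimensional feature map is a one liner. The window size of two and the toy values are arbitrary choices for the sketch.

```python
def max_pool(feature_map, size, stride=None):
    """Keep only the maximum of each window of `size` values."""
    stride = stride or size
    return [max(feature_map[i:i + size])
            for i in range(0, len(feature_map) - size + 1, stride)]


pooled = max_pool([1.0, 3.0, 2.0, 0.0, 5.0, 4.0], size=2)
# the same peaks moved by one position inside their windows give the same output
pooled_shifted = max_pool([3.0, 1.0, 2.0, 0.0, 4.0, 5.0], size=2)
```

The output is half the length of the input, and small shifts of a peak within its window leave the result unchanged.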

01:17:14

and this just gives us again sort of invariance in translation and to a little bit of noise um

01:17:22

we're just a little bit more robust to noise itself and then we can do the operation

01:17:26

time and time again we can repeat many times convolution max pooling convolution max pooling

01:17:30

so on and so on we get this sort of very rich um a lot of

01:17:34

different filter outputs different feature maps out then we essentially flatten the feature maps

01:17:39

we do a classification this really targets everything towards the end we do our final prediction

01:17:44

then generally what you find is that as you stack these convolutional operators themselves

01:17:48

the high level operators generally reflect sort of a high level and the more we get down the more targeted the

01:17:55

actual features we're extracting are towards the particular task that we're

01:17:58

interested in doing itself so the first step being convolution

01:18:03

so we might have different filters and we essentially just pass these filters over the signal and

01:18:08

get out a different feature map for each filter itself then again same again building up a sort of rich

01:18:13

feature map learning the sort of weights associated with these filters themselves then we do our information reduction

01:18:19

we want to pick the maximum output when doing max pooling within a small neighbourhood itself

01:18:24

so we actually see the reduction there and we're just essentially going

01:18:28

in and finding the maximum within these different operations themselves

01:18:31

reducing information and being a bit more invariant towards noise itself
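
A minimal sketch of the max pooling step described here, assuming non-overlapping windows of two values (the window size, the function name `max_pool_1d` and the toy feature maps are choices made for this example, not taken from the talk). It keeps only the maximum in each small neighbourhood, which halves the amount of information and makes the output insensitive to small shifts that stay inside a window.

```python
def max_pool_1d(values, size=2, stride=2):
    """Keep only the maximum value inside each pooling window."""
    return [
        max(values[i:i + size])
        for i in range(0, len(values) - size + 1, stride)
    ]


# Toy feature map, and the same map with each peak moved by one sample.
original = [0.0, 0.9, 0.0, 0.0, 0.1, 0.0, 0.0, 0.7]
shifted = [0.9, 0.0, 0.0, 0.0, 0.0, 0.1, 0.7, 0.0]

pooled_original = max_pool_1d(original)
pooled_shifted = max_pool_1d(shifted)

print(pooled_original)  # [0.9, 0.0, 0.1, 0.7]
print(pooled_shifted)   # [0.9, 0.0, 0.1, 0.7]
```

The invariance is only local: a shift that moves a peak across a window boundary would change the pooled output.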

01:18:36

and we sort of repeat this over time every time we might inject some non-linearities in there

01:18:41

if we're interested in doing that and we just sort of do this operation time and time again

01:18:45

convolution max pooling convolution max pooling so on and so on as i said we sort of learn the features

01:18:50

we can flatten our final set of feature maps out so this is now sort of

01:18:54

a fully connected layer we can pass that maybe through some sort of

01:18:57

feedforward neural network with some sort of softmax or some mapping

01:19:00

essentially to a probability distribution and then find our sort of final output itself
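
The final stage described here (flatten the feature maps, pass them through a fully connected layer, map the scores to a probability distribution) could look roughly like the sketch below. The weights, biases and feature-map values are hypothetical numbers picked for the example, and a single dense layer stands in for whatever feedforward network is actually used.

```python
import math


def flatten(feature_maps):
    """Concatenate all feature maps into one long feature vector."""
    return [value for fmap in feature_maps for value in fmap]


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum of the inputs per output unit."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(scores):
    """Map raw scores to a probability distribution."""
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical pooled feature maps and hand-set weights for a two-class problem.
feature_maps = [[0.9, 0.1], [0.0, 0.7]]
weights = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
biases = [0.0, 0.0]

flat = flatten(feature_maps)
probs = softmax(dense(flat, weights, biases))
predicted = probs.index(max(probs))          # index of the most probable class
print(probs, predicted)
```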

01:19:06

so this is sometimes done now in end-to-end networks themselves

01:19:10

and this is sort of the core function of an end-to-end network

01:19:13

so an end-to-end network is essentially something where we feed raw data in and we actually get the output and we

01:19:17

get the prediction out in speech we generally find this is done using a couple of convolutional neural

01:19:22

network layers and we normally feed this into two recurrent neural network layers

01:19:26

to learn some sort of temporal dependency within the operation itself
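
To make the convolution-then-recurrence pipeline concrete, here is a heavily simplified sketch, not the architecture used in the chair's work. It assumes one convolutional layer with two hand-set filters, max pooling, and a single recurrent unit with a scalar tanh state in place of the LSTM or GRU layers that are typical in practice. All names and weight values (`kernels`, `w_in`, `w_rec`, `bias`) are hypothetical.

```python
import math


def conv_relu(signal, kernel):
    """One convolutional filter followed by a ReLU non-linearity."""
    width = len(kernel)
    return [
        max(0.0, sum(s * k for s, k in zip(signal[i:i + width], kernel)))
        for i in range(len(signal) - width + 1)
    ]


def max_pool(values, size):
    """Non-overlapping max pooling."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


def recurrent_layer(frames, w_in, w_rec, bias):
    """Simple recurrent unit: h_t = tanh(w_in . x_t + w_rec * h_(t-1) + bias)."""
    hidden = 0.0
    states = []
    for frame in frames:
        drive = sum(w * x for w, x in zip(w_in, frame))
        hidden = math.tanh(drive + w_rec * hidden + bias)  # carries context over time
        states.append(hidden)
    return states


def end_to_end(waveform, kernels, w_in, w_rec, bias):
    """Raw samples in, one recurrent state per pooled time step out."""
    maps = [max_pool(conv_relu(waveform, k), size=2) for k in kernels]
    frames = list(zip(*maps))  # one feature vector per time step
    return recurrent_layer(frames, w_in, w_rec, bias)


# Toy waveform and hypothetical hand-set weights; a real system learns all of these.
waveform = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5, 0.0, 0.5]
kernels = [[0.5, 0.5], [-1.0, 1.0]]
states = end_to_end(waveform, kernels, w_in=[0.8, 0.4], w_rec=0.5, bias=0.0)
print(states[-1])  # the last state would feed the final prediction layer
```

The point of the recurrent step is that each state depends on the previous one, so the output at a given time reflects what came before it in the signal.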

01:19:30

we've done a lot of work with this in the chair and we found that we have some pretty cool results

01:19:34

when we started looking at the activations of the different convolutional layers themselves so we thought it's not gonna

01:19:39

you know it's just gonna be random it's gonna learn something but what we actually found when we

01:19:43

looked at different correlations we actually found the convolutional neural networks are doing this sort of feature extraction

01:19:49

again not all the sort of parameters found something useful but we

01:19:53

actually did find activations that correlated very well to

01:19:56

energy correlated very well to loudness and correlated very well to pitch

01:19:59

itself and this was really doing the emotion um

01:20:03

emotion classification task and generally loudness energy and pitch are what we use quite a lot in emotion

01:20:10

they're very highly related to arousal so this is a really cool result that we found out

01:20:14

this is only using end-to-end learning and looking at the different activations
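
The kind of check described here, comparing a unit's activations against a hand-crafted feature such as energy, can be sketched as a Pearson correlation between two per-frame sequences. The signal, the frame length and the `activation` values below are made up for the illustration and are not results from the study.

```python
import math


def frame_energy(signal, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in signal[i:i + frame_len]) / frame_len
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Toy signal with four frames: quiet, loud, quiet, louder.
signal = [0.1, -0.1, 0.1, -0.1, 0.8, -0.8, 0.8, -0.8,
          0.2, -0.2, 0.2, -0.2, 1.0, -1.0, 1.0, -1.0]
energy = frame_energy(signal, frame_len=4)

# Hypothetical per-frame activation of one convolutional unit.
activation = [0.05, 0.70, 0.10, 0.95]

r = pearson(activation, energy)
print(round(r, 3))  # close to 1 for this toy data
```

A value of `r` near 1 for a real unit would suggest it has learned something energy-like, which is the type of finding reported for energy, loudness and pitch.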

01:20:19

so that finishes up nicely for the first part of the morning so uh we'll

01:20:22

take a coffee break now and then we'll come back and talk about machine learning and

01:20:27

back to more conventional machine learning methods after the break itself thanks for your time

Conference Program

01:20:37

ML for speech classification, detection and regression (part 1)
Nick Cummins, Universität Augsburg
Feb. 13, 2019 · 9:04 a.m.

478 views

58:41

ML for speech classification, detection and regression (part 2)
Nick Cummins, Universität Augsburg
Feb. 13, 2019 · 10:59 a.m.

104 views

32:20

Quiz
Nick Cummins, Universität Augsburg
Feb. 13, 2019 · 11:59 a.m.

