Note: this content has been automatically generated.
00:00:04
Okay, I'll get started with the second part of this morning's lecture. This part focuses on the machine learning aspects: putting the knowledge, the intelligence, into the computational intelligence chain, so that it can learn something from data to perform a particular task. I'll cover how we set this up as a problem, the key aspects around generalisation, and then the different models of machine learning, their advantages and disadvantages, with a brief introduction to roughly how they work and why we might need each of them at particular points in time. So, what is machine learning? There are lots of different ways we can define it, but the definition we'll use here is: a statistical learning technique to automatically identify patterns in data, and to identify these patterns in such a way that we get better and better at doing a particular task.
00:00:58
We start machine learning by defining some domain of interest, and this is essentially the feature space. We have some feature space, and this feature space has some marginal probability distribution, the distribution of X, the wider distribution of the actual features, with the collected data spanning the feature space of the domain. Then we need to define some generic analysis task: essentially a mapping from the feature space to some particular label space. We denote the label space in these notes as Y, and for our task the labels are some sort of speech pathology label. So we always have corresponding labels, and essentially what we want to do is learn a mapping between this feature space and the label space. We want to learn it in such a way that, when we get a new piece of data (beyond the data we used to actually learn the mapping), we are accurately able to predict the label that goes with it.
00:02:06
What I mean by new data is data not seen by the machine learning algorithm while we were optimising its performance and parameters, while we were learning from the data we generally use. We can set up machine learning algorithms in one of two particular ways. The way we most commonly use in speech pathology detection is of course supervised learning: this is learning a model from labelled data, so we have features and we learn models matched to the actual labels themselves. The second setup is unsupervised learning. This is where we might want to discover labels from the models themselves: we could have loads and loads of different speech samples collected with no idea of what is going on in them, and we might want to cluster them together and learn something from them. So we derive the features, we then do unsupervised learning of some sort of model, and from there we might be able to infer labels, or the particular labels that might be of interest, and discover relationships we might not have thought existed. Today, though, the focus is really on supervised learning; we will touch a little bit on techniques that work for unsupervised learning, but on the whole I'll talk about the supervised mapping, towards learning a particular function to go from features to the actual label space itself.
00:03:15
This is the processing chain, and this is true of any supervised learning that we do in any machine learning task. We always have some input data, in our case an audio file. Then comes feature extraction, which is what we talked about this morning: we get our speaker feature representations. Essentially we then want the algorithm to find a mapping to some sort of output. This could be a probability, a certainty: do they have a pathological condition or not, and ideally how sure we are that they might have this condition. Or it could just be a one-to-one mapping: yes, they have it; no, they don't. This is what the machine learning algorithm itself does: it learns the rules to predict the labels from the actual features themselves.
00:03:59
We normally do this by setting up some sort of cost function. It could be based on probability distributions, it could be formed in a variety of different ways; essentially we just want to optimise this cost function using maximisation or minimisation, so essentially just using some form of calculus. We want to find a local minimum or local maximum, so we set the derivative to zero and simply solve the equation. At a very, very broad level, that is roughly what is happening inside machine learning algorithms: we are finding a way to learn an optimal decision boundary, and the optimal decision boundary is normally the one that pushes the separation between classes apart as far as it can, because that gives us the best chance of labelling new data instances correctly as well. This is very simple to do when we have very clean, separable data; it is very difficult to do, and we always have trade-offs to make in the performance of the algorithm, when we don't actually have separable data and have to make some assumptions. Then we trade off performance parameters within the machine learning algorithm itself.
00:04:57
Machine learning algorithms are really the brains of the system: the building of the mathematical model itself. We roughly group them into two main classes. Discriminative models are essentially where we try to learn the mapping directly: we have training data and we want the mapping to separate the classes of interest, or do regression, within the training data itself. We're not really concerned with any wider probability distributions that might be present. Support vector machines and deep learning models sit here, and a lot of different approaches fall into this group. Then we have generative models. These essentially try to derive the joint probability function between the features and the labels and then do detection by rearranging things using Bayes' rule, which lets us turn it into a classification task. You'll cover hidden Markov models, which are an example of a generative model, and likewise Gaussian mixture models and k-means clustering; we'll talk about some different forms of generative models. Generative models are also used a lot within unsupervised learning, because there we're trying to model the wider probability distributions that might be present in the dataset itself.
00:06:05
Before I really talk about the specifics of algorithms, their advantages and their disadvantages, I'll talk about generalisation, because generalisation is what really separates machine learning from something like a pure optimisation task. It's very easy to say we have a cost function, we want to minimise it or maximise it, we want to do something with it; that is exceptionally easy. What really gives machine learning its brains, its power, is the idea that we want to learn something from the data, and that is this idea of generalisation.
00:06:31
So yes, we do minimise or maximise our cost function, and this is done using various different optimisation techniques. Generally we might set the derivative, the gradient of the cost function, to zero and solve for the model parameters themselves. That is very easy for me to say, but it isn't always possible: not all solutions have a closed form, and even when they do, it might be very computationally expensive, so we have to use iterative methods to get this complex optimisation done. One of the ways we can do this is something like gradient descent, which is used a lot in machine learning, and in neural networks especially. There we use an iterative solution that follows the negative gradient of the function, working our way towards the global minimum of a particular cost function, taking step after step downwards. But we want to do this not so that we simply hit the minimum of a particular training set; we want to learn the wider minimum that might be out there. We want to make sure we're not just saying: I've got one hundred percent accuracy on the data that I've collected, look everybody, I can do this one hundred percent accurately, this is the best speech pathology detection system ever, I'm going to give it to everybody. They put it in the lab and find it is the worst machine learning system ever; it has learned nothing, because if we overfit to the training data we are going to get a really, really bad system. I'll talk a bit more about this in a moment, and about the different errors we've got to look out for.
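As a rough sketch of what gradient descent does (this example is illustrative, not from the lecture; the cost function, learning rate and stopping rule are made up):

    # Minimal gradient-descent sketch: minimise J(w) = (w - 3)^2.
    # Cost function, learning rate and stopping rule are illustrative only.
    def cost(w):
        return (w - 3.0) ** 2

    def gradient(w):
        return 2.0 * (w - 3.0)

    w = 0.0                        # initial guess
    learning_rate = 0.1
    for step in range(100):
        g = gradient(w)
        if abs(g) < 1e-6:          # close enough to a stationary point
            break
        w = w - learning_rate * g  # step against the gradient
    print(w, cost(w))              # w ends up near 3, the minimum of J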
00:08:03
So in machine learning we don't want to just minimise the training error, the mistakes made on the training data; we also want to minimise what's known as the generalisation error. In the training phase we are trying to learn something from the data, to optimise the parameters of our algorithm from the data itself, and at the same time we are always minimising the training error. This is the difference between the actual and the predicted labels within the training set, and we always want to get it as small as we possibly can. At the same time, of course, this introduces a number of different errors, mainly relating to sampling error. Whatever training data we might have, we only have a very small selection of the wider distribution of the actual task we're interested in. We may have collected a couple of hundred samples, but there are thousands, millions of people with the particular disorder of interest, so there is no way we have covered that distribution. If we over-optimise to this very small sample we are not going to get a very good machine learning model, and this is what we need to avoid.
00:09:02
This is the idea of generalisability: the model must accurately and adequately label new test data samples. In the training phase we minimise the training error, and we always test, in the test phase, to measure the generalisation error. So when we use our labelled database for training and optimisation, we are trying to minimise the generalisation error beyond the training data; this is why we train and why we test in machine learning, and we will always have these two steps. The training phase always looks like this: we have our labelled data; we do our preprocessing, removing outliers, perhaps doing some sort of denoising within the data, perhaps some other functions here; we do our feature extraction, running openSMILE or whatever toolkit we're using; and then we do our machine learning. We actually train the model and we optimise the model itself, its set of parameters and its set of hyperparameters, the external things that we can tune about the operation of the model. We do this in the training phase, and in supervised learning we are always doing it with respect to the labels from the labelled training data.
00:10:02
Then we say: great, I've got this model, excellent; how well does it work? That is what we do in the testing phase, and why we need it: we want to measure this generalisation error. So we get a new data sample. It isn't really new; we have just held out some of the labelled data we actually collected. We do the same preprocessing (if we did denoising there, we try to get it into the same conditions), we do the same feature extraction, and then it just passes through our learned model and we get the predicted label. We get a set of predicted labels, and measuring these in the test phase is what really gives us our idea of performance; that's how we know whether we have a good model or not. If we just optimise in the training phase and never really check in the test phase, we never get a good idea of whether we have a good model or a bad model, so we have to spend a lot of time testing and doing this in machine learning. Machine learning research can be incredibly boring, because we are just doing this: training, testing, training, testing, change a couple of parameters, change a couple of features, training, testing. But it is very, very important to do, to make sure you get the best and most generalisable model that you possibly can.
00:11:10
The reason we do this is that we are looking for particular kinds of errors, and we are trying to get rid of them, or essentially not to have them at all. First there is the error of underfitting, which is essentially when we build a model that's too simple: we just make some assumptions about the distribution of the two classes and draw a line down the middle. It is very simple, it is not a complex model, and it is going to make a lot of mistakes; it has what's known as high bias, so it generally makes a lot of similar mistakes. I'll talk about bias and this sort of sensitivity in a little bit. At the same time, what we also want to avoid is overfitting: making a model too complex. We optimise it so hard that it makes absolutely no mistakes on the actual training data, and we learn a convoluted decision function that in no way reflects anything about the real world we might be looking at in the data; when we start to test it, we get pretty much the same sort of errors, and it might just be performing at chance level or even beneath chance level. So we have underfitting and overfitting, and a good fit is the case where we can't really find a neatly separable model and we are just trading off where we want particular errors to be made, trying to make the most accurate system we can. We might tune the system to make fewer false negatives and more false positives, and we do this with different cost functions and different hyperparameters, so we can always find the most robust model we can in terms of how it works on a particular subset of held-out test data, increasing the generalisation, increasing the robustness of the model. This is really, really important to do in machine learning.
00:12:43
It's based essentially on the concepts of bias and variance errors. Bias says, on average, how much our predicted values differ from the actual values, so there's some common way we are making errors: if we make a very, very simple model, we find that the model just makes the same error time and time and time again. At the same time we might have errors where we get very different predictions: we might have two test instances very close to each other that give us wildly different estimates out of the machine learning model, and that is the idea of high variance. So we're really looking for something with both low variance and low bias within the actual machine learning model, but of course that isn't fully achievable; there are ways we can work towards it, but generally we always have to trade off bias errors against variance errors when we're minimising the generalisation error. Essentially what happens is that we start with a very, very simple model, say a decision tree only making two or three decisions; it's going to have very high bias error. We start to increase the complexity of this model, we work with more parameters, more features, and the bias errors drop. We move up to some really cool, big deep learning system, but we don't really have that much training data, and we increase the variance errors instead. So as the model becomes more and more complex the bias errors drop, we get rid of those, but at the same time we increase the variance errors. Essentially what we're doing with continual training and testing of algorithms is looking for this trade-off point, the optimal model complexity for the particular task and the particular algorithm we have.
00:14:17
There is no easy answer here; the no-free-lunch theorem essentially says that no machine learning model is inherently better than any other machine learning model, and we've just got to find the most suitable one for the data we have. If we've got a very small dataset, there's no point in trying to do really cool deep learning with it, because we're going to overfit: we'll probably end up with a model that can't account for the variance in the data, because we simply can't train all of it, and it will overfit very quickly. So it's not about asking which is the best algorithm; it's about asking which is the most suitable algorithm for the actual job we have, for the amount of training data we have, for the feature representation we want to use, and so on. There are lots of things we can choose and change within machine learning, and this is the real reason we do it: we want to trade off model complexity against the different errors the actual system can make.
00:15:02
So in paralinguistics the default way we do this now is not just training and testing; we often use a validation set as well, splitting the data essentially three ways. If we're just doing training and testing, we can still end up with a model that will not generalise well: we could just be overfitting the training data and the test data at the same time if we don't have enough data. This is why we introduce the idea of a validation set. What we do is train our model on maybe sixty or seventy percent of the data: we have a look at things like feature sets, normalisations, different hyperparameters, and we try to optimise the model, to learn the best representation we can from it. Then we use maybe twenty percent of the data to do validation, where we try to minimise this generalisation error: we do our training, we do our testing on the validation data, and we go: okay, this worked, this didn't work, this worked, this didn't work, and so on and so on, until eventually we say: okay, I've got this data here, I've got my feature space here, I've got my machine learning algorithm here, I've got my settings for it, and I think this is the best one possible; this is really cool, this is going to make a difference.
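As a rough sketch of such a split (not from the lecture; it assumes scikit-learn and uses toy data in place of extracted speech features):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # Toy data standing in for extracted speech features (illustrative only).
    X, y = make_classification(n_samples=200, n_features=20, random_state=0)

    # Hold out 20% as the final test set ...
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=0)
    # ... then split the remainder 75/25, i.e. roughly 60% train / 20% validation overall.
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)
    print(len(X_train), len(X_val), len(X_test))   # 120 40 40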
00:16:09
Then, to really confirm that we've got something generalisable, we use the test set itself as the evaluation data. This is saying: here is brand-new data, nothing here was seen while we were doing the optimisation, so now we can find out what the true algorithm performance is using the held-out test data. We generally combine the training set and the validation set, retrain the model using the set of features and optimisation settings that we found, test it again, and find out whether we've got a generalisable model or not. If we've done it this way, it shows up quite efficiently: you can see very quickly that apparently very good models give very bad accuracies; you do training and validation, you move to the test set, you get a big drop in performance, and you know this is probably not the best way to do it. If your performance holds steady, you think: I've probably got a good model. Occasionally your performance even increases, because you've actually improved the model by combining the training and validation data, training on a little bit more, and then you think: right, I've found the best thing we can possibly do.
00:17:13
But it's very, very important, when you're reviewing papers and such, that this is one of the things you look for, whether people really understand what is going on, especially in speech pathology, especially in computational paralinguistics. A proper train/validation/test setup is what I prefer to see; if a paper I'm looking at has just done training and validation, there's really no way to know how generalisable the model is. So this is generally the best and most rigorous way, across different fields, of doing it now. In deep learning, especially when we have a lot of data, we can sort of get away with doing training and validation only, because the validation set still has quite a fair chunk of data, but it's still not really the optimal way to do it.
00:17:49
Often in speech pathology, though, we don't even have enough data to really divide it into sixty/twenty/twenty; we might try other splits, and it still might not be enough to even train basic models. If this is the case we often use cross-validation. This essentially says: we have some sample space, some set of data; I'm going to use maybe ninety percent of my data to actually train the algorithm and ten percent of my data to test the algorithm, and then we just rotate this and repeat it k times, ten times if we're doing ten-fold cross-validation. Then we take the average of all of these performances, and that gives us a rough idea of what the generalisability of the model might be, how well it's actually performing.
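A minimal sketch of ten-fold cross-validation (not from the lecture; it assumes scikit-learn and toy data in place of real speech features):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # Toy data standing in for extracted speech features (illustrative only).
    X, y = make_classification(n_samples=150, n_features=20, random_state=0)

    # Ten folds: train on ~90%, test on ~10%, rotate, then average the scores.
    scores = cross_val_score(SVC(kernel="linear"), X, y, cv=10)
    print(scores.mean(), scores.std())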
00:18:31
Sometimes we can do cross-validation on top of a train/validation setup; we can combine these methods to really try to find the most optimal parameters possible. But really, a held-out test set that we have learned nothing from is still the best way to actually evaluate a model and to get the most suitable representation of any particular model. It's really important when you're doing machine learning, this idea of generalisability: not overfitting, not underfitting, choosing the right model for the particular task. That's where you always spend the time in machine learning research: waiting, testing as you train sets of parameters, and just waiting for a particularly good result to come out.
00:19:09
That's all well and good, but what do we actually do, how do we do this? There are hundreds of different machine learning algorithms, so I'm just pulling out some of the more common ones that come up in speech pathology and within computational paralinguistics. The first class of algorithms I'll talk about is discriminative models. Essentially we're learning some hard or soft decision boundary between the classes of interest. We assume that there is some probability distribution, a joint probability distribution, but we're really just trying to estimate the parameters of the decision function directly from the training data; we're not trying to learn the wider probability distribution of the training set, of the features and of the label space. The advantage here is that we are directly learning a decision function, which is pretty good when we have only a very small number of data points to train with: we're not trying to estimate some wider probability distribution and then make a decision based on it, we're asking, okay, what do I need to know to make the decision at hand? So it's probably one of the better options, and it's used quite widely here. The examples of discriminative models I'll talk through are random forests, k-nearest neighbours, support vector machines and neural networks. You'll have deep learning in much more detail, I think tomorrow or the next day, so I'm not going to go into much more depth on that beyond the convolutional neural networks this morning; you'll go over deep learning a bit more then.
00:20:37
The first one I want to talk about is the random forest. A random forest is essentially an ensemble of tree-based classifiers, built from something like the decision tree. Remember in my very first couple of slides this morning I put up a decision tree for choosing whether your pizza is going to arrive on time or not. A random forest is essentially a group of such trees, each trained on a slightly different view of the data or trained using slightly different parameter settings, so each one looks at the data in a slightly different way. We then essentially sum, or take an average of, what we've learned from the various decision trees to make a final decision. The final decision is a voted value, or some sort of average if we're producing a regression output. So the trees are looking at the data from different views, depending on how we set it up.
00:21:26
At the heart of any random forest is the decision tree classifier, a non-parametric supervised machine learning algorithm. Essentially it learns the target by simply making a set of decisions, and in making that set of decisions it just subdivides the feature space down and down and down until we're actually able to make a classification. It's probably the most interpretable of all the classifiers I'll talk about today; it's known as an explainable one. I haven't put too many slides in on explainability, but I'll talk a little bit about it here and with some of the others. Essentially, explainability is one of the things we really need in speech pathology now: thanks to the GDPR rules, patients now have a right to understand how a decision was reached about them if you want to put machine learning into any clinical practice. With most of the machine learning methods we use, that is very hard to do, so it's a big open field of research that we need to get more into. But decision trees are very easy to explain: we're really just subdividing the feature space, making decisions to maximise the information gain at each point. Which decision will give us the most we can learn from the data at each point in time? We keep breaking this down, down, down. To traverse a tree we essentially start at the top root node and then just go down, down, down, and at every node we visit we make a decision about the data; we split it into further and further decisions until we eventually reach a final decision within the tree.
00:22:53
We'll talk a little bit more and give a more worked example later this afternoon, but here is a very simple illustration of how decision trees work. We have a group of animals that we want to classify, and we have certain aspects and features of the animals that we can use. Essentially we just ask about things like the number of legs, two or four, so essentially a human versus a four-legged animal; we can break them down and maximise the information gain. It's just a rough, crude idea of how decision trees work. The key properties: our class labels are associated with the very bottom of the tree, the leaf nodes; each internal node represents some sort of decision rule, so the classification boils down to making decisions; at every single node we have a set of patterns associated with it; and only the relevant features are used along the way, so it does feature selection automatically for us. It's relatively simple and very easy to understand. Essentially we build it in a recursive style where we just try to maximise the information gain at each decision: which decision returns the maximum amount of information we need to make the next one, and the next one? We break this down until we've reached the class labels we particularly want. So the steps are: we calculate the entropy of the parent node, we identify the individual splits that we might make and the entropy that goes with each of them, and then we always choose the split that gives us the best information gain, the one that, of all the splits we've worked out with this bit of information theory, is the best possible split. Then we carry on and on until we actually get class labels at the bottom.
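A minimal sketch of those steps (not from the lecture; the split and the labels are invented for illustration):

    import numpy as np

    def entropy(labels):
        """Shannon entropy (in bits) of an array of class labels."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(parent, left, right):
        """Entropy of the parent node minus the weighted entropy of the two children."""
        n = len(parent)
        weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        return entropy(parent) - weighted

    # A perfectly pure split of four labelled samples gives the full 1 bit of gain.
    parent = ["pathological", "pathological", "healthy", "healthy"]
    print(information_gain(parent, ["pathological", "pathological"], ["healthy", "healthy"]))  # 1.0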
00:24:25
Working through these very briefly: the advantages are that they're quick and easy to interpret, computationally they're very simple and quick to fit, and they're very useful in data exploration, because they discover the significant variables, what we really need to make a decision, and identify relationships between them. They're pretty robust, they're suitable for small data, they're not easily influenced by outliers because we're always maximising the information gain, they can handle missing values relatively well, and they can handle nonlinear and interaction relationships. But for all these advantages, being a very simple algorithm, they have their downsides. They're unstable: a very small change in the data, or changing our feature space even just a little, can change the tree structure quite dramatically. They generally do poorly on more complex problems and have poor accuracy; a decision tree is what's known as a weak classifier, a classifier which, no matter how much we train it on a particular task, generally only performs a bit above chance level. What we can do, though, is take weak classifiers and, to deal with overfitting, use bagging and ensemble methods, which is exactly what a random forest is: a bagging method over these decision trees. It helps us reduce the variance and increase the robustness of the actual model, because we're combining multiple classifiers modelled on different subsets of the data and reducing the variance in the predictions by doing so.
00:25:54
From the original data we might split off multiple datasets and train multiple different classifiers; in something like Weka I think the default is maybe five hundred decision trees in a random forest, so we're actually making quite a lot of different decisions, and then we come up with one final decision. We do this with bootstrapping and by combining results via a vote in classification; essentially, bootstrapping just means we take a subset of the data with replacement, and we do this multiple times. Bagging in general, for any kind of classifier, is always trying to reduce variance while at the same time having a limited effect on the bias of the actual model. So it's a good method if you have a set of weak predictors to start the classification with, and once you've trained the algorithm it's actually quite quick. We grow a large number of trees, we need independent bootstraps, and at each node we make a split; sometimes we deliberately vary things a bit, so we're generally not too concerned about the accuracy of the individual trees so much as the accuracy of the whole forest of trees. We do a vote and we get our final output.
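A minimal random-forest sketch (not from the lecture; it assumes scikit-learn rather than Weka, uses toy data in place of speech features, and sets 500 trees as mentioned above):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Toy data standing in for extracted speech features (illustrative only).
    X, y = make_classification(n_samples=300, n_features=30, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # 500 bootstrapped trees, each grown on a resampled view of the data;
    # the forest's prediction is the vote over the individual trees.
    forest = RandomForestClassifier(n_estimators=500, bootstrap=True, random_state=0)
    forest.fit(X_train, y_train)
    print(forest.score(X_test, y_test))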
00:27:00
So for random forests the big advantages are good accuracy and fast run time; they work well with little data, they can handle high dimensionalities, and they inherently perform feature selection through the decision trees built inside them, improving on the individual decision trees. At the same time they're a little more susceptible to overfitting, particularly in noisy environments. And again we run into the issue that, for most of the algorithms I'll talk about, we have this black-box behaviour: we have no real understanding of exactly how a decision was made. Yes, we understand how we set the algorithm up, the background of how it works, but in terms of saying which features of the data combined to indicate that this person has a particular speech pathology, it's very hard to trace these things back through black-box operations. As I said, the field needs to start shifting more towards explainable solutions, and also towards finding ways to make these black-box systems themselves explainable.
00:27:58
The simplest approach, in terms of discriminative models, is the k-nearest-neighbours classifier. This is pretty easy: we're not really learning anything from the data; we do the classification just by looking at properties within the training set, evaluating distances as the decision process. Essentially we put in our test sample and ask: what's my nearest neighbour? Okay, that's the class I'm going to assign. And that's pretty much it; we don't really have any training here, which is nice, but it's a very slow method, especially if you have a very large dataset, because you've got to compute these distance measures against the whole of your training data to classify anything. Using something like the Euclidean or L2 norm we might run the nearest-neighbour algorithm, and then, if we find that it's not actually that good, we ask: what about different variants, what about different distance measures, before assigning the test point to a class.
00:28:56
But again, this generally overfits to the training data, so what we want to do is use k nearest neighbours: instead of asking what's my single nearest neighbour, we ask what's the class of my two nearest neighbours, of my three nearest neighbours, and so on. Essentially this is the hyperparameter we choose when we're doing a nearest-neighbours classifier: how many nearest neighbours do we use? And again there's the trade-off between overfitting and underfitting; we've got to find the appropriate complexity. As we take more and more nearest neighbours the decision function changes, and the chances of overfitting or of underfitting go up and down with it. So essentially we're just using distance metrics to find the k nearest neighbours and assigning the class, and we've always got to choose this k; experimentally, we want to find some sort of validation error and minimise it. We find that as k goes up too high we start to increase the variance and increase the validation errors that we get.
00:29:53
In summary, k-nearest neighbours works very well on simple, basic recognition problems, and it is reasonably robust to noise: we can take a weighting into account, so if there are bigger distances to outlying things we give them less of a weight; if a neighbour is particularly far away we don't trust it too much, and we can tune the algorithm to allow for a bit of noise. It's a lazy learner: we're not really learning anything when we do k-nearest neighbours, we're really just making the decision itself, and because of this it has a very high computational cost. For every single prediction we want to make with k-nearest neighbours we essentially compute the distance to, and sort, every single instance within the training data; we've got to work out every single distance and find the minimum set every single time. So we don't actually have a compact model; we're running this whole operation time and time again, and that's a particular disadvantage where speed is important for the decisions we're making.
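A minimal k-nearest-neighbours sketch (not from the lecture; the tiny dataset is invented, and it shows why every prediction touches every training point):

    import numpy as np

    def knn_predict(X_train, y_train, x, k=3):
        """Classify one sample by majority vote over its k nearest (Euclidean) neighbours."""
        distances = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
        nearest = np.argsort(distances)[:k]               # indices of the k closest
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]

    # Tiny illustrative training set with two clusters.
    X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
    y_train = np.array([0, 0, 1, 1])
    print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))   # -> 0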
00:30:51
The most frequently used classifier, I guess, in speech pathology and computational paralinguistics is the support vector machine, and that's because they're just good at making optimal decisions, they're quite easy to train, and they work very well with low amounts of data. How does a support vector machine work? Essentially we're always building a binary classifier, assigning something into class plus one or class minus one, a set of two classes. We're essentially learning a decision boundary in such a way that we can express the decision boundary through the set of training samples nearest to it, known as the support vectors. Geometrically these are the training patterns that sit absolutely closest to the boundary, and a weighted sum of these training patterns sitting on the margin gives us a way to actually define the classifier. It's based very heavily on classic optimisation, using Lagrange multipliers and quadratic optimisation to learn the decision function, which I'll go into briefly.
00:31:51
So we're learning a class-one versus class-two separation: we want to find the hyperplane decision boundary between them, and we want to maximise the distance between our two classes. Then we can use the support vectors, the training instances that sit closest to the decision boundary, to actually form the classifier. It's a very compact form of classification, because we can throw the rest of the training instances away and keep a very small model at the end compared to the whole training space; compared to something like nearest neighbours, where we had to sum over all points, here we reduce the summation to a small number of points. It is quite maths heavy, the classical mathematical optimisation with Lagrange multipliers, and I'm not going to go too deep into it because it gets a bit boring.
00:32:38
So for support vector machines the main advantage is good generalisation: the algorithm is always set up to yield the maximum possible distance between the hyperplane and the classes, which reduces the chances of things like overfitting, and it is really good when we have small amounts of data. It's computationally efficient, expressing the margin as a function of the support vectors, the points nearest the hyperplane. The real advantage of support vector machines comes with the trick of the kernel operations: we can learn nonlinear decision boundaries using essentially the same algorithm, the same setup, just by applying what's known as the kernel trick, implicitly mapping into high-dimensional spaces. We'll talk about this more in a couple of slides; we do these implicit mappings in the form of the dot products that appear in the support vector operations, and that allows us to do nonlinear classification as well.
00:33:31
At its core, when we set up a support vector machine classifier, it's really assigning a label: we've got a mapping function from the feature space to the label space, and essentially we want to learn whether a sample is positive or negative and assign it to class one or class two. It doesn't really matter what we call class one and class two in these instances, but it's always a binary classifier. With support vector machines for multi-class problems you're always performing one-versus-all classification, or you build a set of binary classifiers for the multi-class task, but at heart it's always binary. Essentially we've got a classification function where we have a label, we have a weight, and we have a kernel operator that is doing either an implicit mapping into a high-dimensional space or the plain linear separation, depending on how we set it up, and what we want to do is learn the set of weights. The weights are non-zero essentially only for the support vectors, the points closest to the boundary.
00:34:29
So we have a classification function, and we set it up quite simply: we have a decision boundary, we have some sort of bias for the decision boundary, and we want to learn this. We can always say that the weight vector sits perpendicular to the hyperplane, to the decision function. We can say that points where the function is greater than zero sit in class one, and then we have points that actually sit on the margin boundary itself, and we can find an optimal distance for these. Switching to class two, everything less than zero gets assigned to class two, the minus-one instances, and again we can set up the same thing for the points that sit on that side of the margin. Given some optimal margin, we always want to maximise it, and it works out as two over the L2 norm of the weight vector, which is where the plus one and minus one come in; I've done a bit of rearranging in these equations, but it does all check out, and we want to learn this function. We're able to do that quite easily: we set it up as a minimisation problem and put the constraints into the minimisation itself, and then we use quadratic optimisation to solve it. We substitute in the Lagrange operators, we add a few more constraints to the problem, and that essentially allows us to find the optimal set of weights; the alphas come out of this change of variables as we move to the Lagrange multipliers.
00:35:54
Then we can express the decision function as a weighted sum: the alphas, which are what we actually learn in the quadratic optimisation, the labels, and the inner product between the two vectors, plus the bias. What turns out to be the case is that every single point with an alpha greater than zero must lie essentially on H1 or H2, and all other points have alphas equal to zero. So you get a very efficient summation when evaluating the decision function: this point, this point, this point and this point are all we need to know to make the support vector machine decision. From that information we have a very nice, compact model that lets us do the separation quite well; the classification model is expressed as a weighted sum of the actual support vectors, the points closest to the boundary.
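In standard notation (my summary, not copied from the slides): with weight vector w, bias b, labels y_i in {-1, +1} and Lagrange multipliers alpha_i, the margin being maximised and the resulting decision function are

    \max \frac{2}{\lVert \mathbf{w} \rVert}
    \quad \text{subject to} \quad y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 \;\; \forall i,

    f(\mathbf{x}) = \operatorname{sign}\!\Big( \sum_{i \in \mathrm{SV}} \alpha_i\, y_i\, (\mathbf{x}_i \cdot \mathbf{x}) + b \Big),
    \qquad \alpha_i > 0 \text{ only for the support vectors.}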
00:36:48
That was a very quick flight over the maths; at the heart of support vector machines there is a lot of it, and whole texts have gone through deriving all these different equations in various ways and directions. But, like everything else this morning, for the bigger theory go to the notes and look up some papers if you're really interested in how support vector machines work. The main thing to know is that it gives a really nice, compact classification model that we can sum up efficiently. We can also handle non-separable data in support vector machines, and this is done by putting more constraints into the optimisation itself, allowing the decision function to make mistakes and then weighting how badly it makes those mistakes. This is what we control when we set the complexity of the support vector machine: we introduce a sort of slack variable and permit the support vector machine to make some mistakes, saying, I don't mind you making this tiny little mistake, because you're giving me a more generalisable model. We tune this complexity to control the complexity of the actual model and to control the trade-off between underfitting and overfitting, between the bias and variance errors.
00:37:58
That is essentially what we're tuning: when you're tuning a support vector machine you're just looking for that middle point that I talked about, controlling the bias and variance errors. This is the C value itself. As C gets large we're essentially saying: I don't want you to make errors, I'm going to punish you harder and harder for making them, so the decision boundary gets more and more wavy, more and more complex. As C gets small, we're saying: okay, we're making more and more assumptions about the distribution of the data, which may not necessarily be true, and the model becomes less complex, heading down towards underfitting. As a rough rule of thumb we generally set C to less than one to avoid overfitting in a lot of these tasks; again, it's not always true, it's just a rough rule of thumb. The thing to remember with this complexity value is that as it goes up we get a harder and harder margin and move towards overfitting; as it goes down we get a softer and softer margin and move towards underfitting.
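Written out in the usual soft-margin form (standard notation, not taken from the slides), the slack variables xi_i and the complexity value C enter the objective as

    \min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;\; \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_i \xi_i
    \quad \text{subject to} \quad y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0,

so a large C punishes the slack heavily (a hard margin) and a small C tolerates more mistakes (a soft margin), matching the behaviour described above.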
00:38:52
As I also said, the real advantage, apart from this very compact model and the clear optimisation rules for making an optimal decision function, is that we can handle nonlinear data within support vector machines, and this is done by including what's known as a kernel function. If we look at a plain linear support vector machine, we find that we're taking a dot product between different vectors within the space when making a decision. We can do this in higher and higher dimensionalities by applying a nonlinear transformation, taking the dot product of these nonlinear transformations, and just pushing this information into the support vector machine when computing the final decision function. It's called the kernel trick because we never actually map, within the algorithm itself, into the high-dimensional space; we really do it in an implicit form within the dot product operation. Essentially what's happening is that if you look at things from one particular dimension (if you're looking at my two hands from your viewpoint, you can't really see the separation, but if you move into three dimensions you can), things that aren't separable in low dimensions can become separable as we go up and up in dimensions. That's what we're doing with the kernel trick: we're saying, I want to go into a high-dimensional space and check whether the data is now linearly separable, going into higher and higher dimensional spaces with different mappings. The basic idea is essentially that something that's non-separable in one dimension will eventually be separable in a higher dimension, hopefully at any rate.
00:40:25
That's what's happening when we do the kernel trick: a nonlinear mapping to higher-dimensional spaces. There are a lot of different rules and properties around how we define these and what exactly a kernel is, but again, for brevity I'll skip over this and just say that the main popular kernel functions you'll come across are: the linear kernel, which is just the dot product between the vectors; the polynomial kernel, a polynomial transformation raised to the order of the polynomial; and the Gaussian kernel, where we're looking at settings of the variance of the kernel and controlling different aspects to do with the power we raise the kernel to as well. So there are different trade-offs: every time you do support vector machines and want to use the kernel trick, it takes time, and every parameter, the complexity value and the different kernel parameters, plays off against the others, so you're going to spend time doing very good grid searches to really find the most optimal settings for the support vector machine, and then doing the train/validation/test or cross-validation testing. It can take a while, because there are quite a lot of knobs to turn within support vector machines.
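A minimal grid-search sketch over those settings (not from the lecture; it assumes scikit-learn and toy data, and the grid values are only examples):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Toy data standing in for extracted speech features (illustrative only).
    X, y = make_classification(n_samples=200, n_features=20, random_state=0)

    # Grid over kernel, complexity C and the kernel width gamma; each
    # combination is scored with 5-fold cross-validation and the best one kept.
    param_grid = {
        "kernel": ["linear", "poly", "rbf"],
        "C": [0.01, 0.1, 1.0, 10.0],
        "gamma": ["scale", 0.01, 0.1],
    }
    search = GridSearchCV(SVC(), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)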
00:41:32
The other form of discriminative model that you'll talk about over the course of this week is neural networks; neural networks in general are discriminative models. Very briefly, deep neural networks are just neural networks with at least four layers and with many, many parameters that have to be learned. Through this idea of multiple layers, multiple nonlinear transformations, they essentially give us the ability to learn very highly complex decision functions. But the catch, and one of the big issues we always have within speech pathology, is that we need a substantial amount of training data: the deeper the network goes, the more things there are to learn and the more training data we need. They work very, very well, provided you set the task up properly and know what your feature space is capable of, and what the amount of training data you have is actually capable of as well.
00:42:30
The last set of models I'll talk about is generative models. This is now the idea that we're going to start modelling the distribution: we're looking at the joint distribution between the features and the labels. We want to model this joint probability function, so we assume some functional form for the conditional probability, estimate its parameters from the training data, and then use Bayes' rule to calculate the likelihood of a label given a particular set of features. The real advantage of generative models is that actually learning this wider probability distribution should essentially help us prevent overfitting, and we can potentially update the models, revising them on the fly a little if something is a massive outlier or some sort of mistake remains; we can retrain the models and re-estimate the probability distributions as we get more and more data.
00:43:31
With generative models we always make the classification using Bayes' rule. It's just a little bit of rearrangement: we have a conditional probability, we can rewrite it in terms of the joint probability, and we can rewrite that with Bayes' rule. This says: given our features, what is the most likely label? In a generative classifier we ask, what's the likelihood we came from class i, what's the likelihood we came from class j, and we choose the maximum likelihood and return that. You'll probably go through this during the hidden Markov models session anyway.
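Written out (standard notation, my summary rather than the slide itself): for a feature vector x and classes omega_i, Bayes' rule gives

    P(\omega_i \mid \mathbf{x}) \;=\; \frac{p(\mathbf{x} \mid \omega_i)\, P(\omega_i)}{p(\mathbf{x})},
    \qquad
    \hat{\omega} \;=\; \arg\max_i \; p(\mathbf{x} \mid \omega_i)\, P(\omega_i),

and since p(x) is the same for every class it can be dropped when comparing them.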
00:44:04
to generate models are often based on the sort of concept the clustering so
00:44:07
sort of grouping feature vectors together into sort of common spaces common things
00:44:13
so it's a sort of partitioning the data looking at how well the model think
00:44:17
she's petitions letting which little adjustments of parameters that we actually need to make
00:44:21
and then making these adjustments over time and learning the sort of distributions into different
00:44:26
climbs into different classes itself and then trying to form these up itself
00:44:30
So the basic idea of any sort of clustering algorithm goes back
00:44:34
almost to the k-nearest neighbours and the k-means algorithms themselves:
00:44:38
trying to find clusters where two points belonging to the same class have a very
00:44:43
small distance measure between them, and two points belonging to different classes have a very, very big distance measure
00:44:49
between them. It is a very easy concept: we can just form classes, check them, and then
00:44:54
reassign the boundaries, reassign the boundaries; that is what is happening in a lot of the various different ways of doing this.
00:45:00
Within a cluster the distances are small and the feature similarity is high; between clusters the distances are large
00:45:05
and the feature similarities are low, so those points belong to different classes. That is the general way of looking at it.
00:45:10
The advantage of doing any sort of clustering, as you move from the discriminative
00:45:15
models to the generative models, is that we can often find very compact representations.
00:45:19
Think about what we were doing with k-nearest neighbours: working out all of these distances
00:45:23
to every single training point. Now we just have clusters of data and work out distances,
00:45:27
or some likelihood measure, to the centres of these clusters, something summarising the training points
00:45:32
they actually cover. We assign the test data to
00:45:36
a class by computing distances to the centre, the
00:45:39
middle point, of each class, and find the one that gives us the best fit.
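A minimal sketch of that compact-representation idea, with made-up class names and two-dimensional features (nothing here comes from the lecture materials):

```python
# Illustrative only: classify a test point by its distance to each class centre
# rather than to every single training point (as k-nearest neighbours would).
import numpy as np

centroids = {                                   # one compact summary per class
    "healthy": np.array([0.2, 1.1]),
    "pathological": np.array([1.5, 0.3]),
}

def nearest_centroid(x):
    # the class whose centre has the smallest Euclidean distance wins
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

print(nearest_centroid(np.array([1.4, 0.5])))   # -> "pathological"
```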
00:45:45
The core methods we'll talk about that work this way are k-means, naive Bayes, Gaussian mixture models
00:45:51
and hidden Markov models themselves. K-means is probably the simplest
00:45:57
clustering algorithm: it is essentially the iterative assignment of training points,
00:46:03
one by one, into the different groups that we actually want. In supervised learning we
00:46:07
essentially use the labels to guide the selection of these clusters.
00:46:10
Once clustered, we just use the centroids of the data, and use their labels for any new data points.
00:46:16
When we are doing unsupervised learning, we can find a few different distributions
00:46:21
and then assume that the points joined together represent some sort of new class label that has come up,
00:46:26
which is the kind of information we want when we are doing data mining with something like k-means.
00:46:31
So for k-means, the key parameter is essentially the number of clusters, k, that we actually want;
00:46:37
depending on how we do it, we could also use group labels to refine these.
00:46:42
So the training algorithm is: we select some initial instances to be the centroids,
00:46:47
we assign the remaining instances to one of the k clusters,
00:46:51
and we always assign each one to the centroid that is closest to it.
00:46:54
We then recompute the centroids based on the current pattern of assignments,
00:46:58
reassign everything to the updated clusters, and so on and so on, until essentially we get no change
00:47:04
in the positions of the cluster centroids, or
00:47:09
we just say, okay, we only want to do this for a set number of iterations.
00:47:13
So the idea is to really keep going until there is essentially no change between successive iterations.
00:47:19
We have some sort of random initialisation, and then we just repeat this assignment, repeat this cluster
00:47:24
and centroid update, time and time again until convergence: assigning our points
00:47:31
at iteration one, iteration two, and so on, and the model gets better and better and better.
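A bare-bones sketch of those steps, assuming the data sit in a numpy array `X` of shape (n_samples, n_features); this is illustrative rather than the exact implementation shown in the lecture:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: pick some initial instances to act as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # step 2: assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute each centroid from its current members
        # (empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```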
00:47:36
The one disadvantage of doing k-means clustering, especially with these iterative updates,
00:47:41
is that sometimes the final output of the model is very much dependent on actually selecting good
00:47:47
initial points. If we select all of our initial points somewhere, you know,
00:47:52
not well distributed over the data, it is potentially going to take a longer time to
00:47:56
converge, or what it does converge to is a bit rubbish.
00:48:00
So k-means is easy to implement and works on large data sets, but it does need a bit of tuning to actually
00:48:07
find a suitable number of clusters. As I said, this initial seeding can have a really strong impact
00:48:12
on the final clusters and their final positions, and we can also have some
00:48:16
strong sensitivity to outliers and noise during the clustering:
00:48:20
these can really start to push the different cluster centres out of position.
00:48:23
We are also making some assumptions about the shape of
00:48:29
the data which may or may not hold true when we are looking at real data instances.
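One common way of softening the seeding problem is simply to run the algorithm several times from different random initialisations and keep the best run; for example, with scikit-learn (the choice of k=2 below, and the array `X` from the previous sketch, are just assumptions for illustration):

```python
from sklearn.cluster import KMeans

# n_init controls how many random initialisations are tried; the run with the
# lowest within-cluster sum of squares (inertia) is kept.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.inertia_)
print(km.cluster_centers_)
```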
00:48:34
So a really simple form of generative model is called the naive Bayes classifier.
00:48:39
It is called naive because it makes certain assumptions about the data and the distribution of the data itself.
00:48:44
Essentially we just use Bayes' rule directly, building likelihood tables and posterior tables, and
00:48:50
we assign the probability distributions directly from the data itself.
00:48:53
But we work on the assumption that every feature is class-conditionally independent, so we can learn each feature's
00:49:00
distribution based only on the class labels. In real life this assumption probably does not
00:49:05
hold; it is a pretty naive way of doing it. So we are maximising the posterior probability:
00:49:12
we just look up the likelihood from some likelihood table,
00:49:15
use the prior probability distributions, and the
00:49:18
evidence probability, which is the same for every class, is actually removed
00:49:22
from the algorithm; this is essentially what I have done here.
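Written out, the naive Bayes decision rule for a feature vector x = (x_1, ..., x_D) and classes y_i is (again an editorial rendering, not copied from the slides):

```latex
P(y_i \mid x_1,\dots,x_D) \;\propto\; P(y_i)\prod_{d=1}^{D} P(x_d \mid y_i),
\qquad
\hat{y} \;=\; \arg\max_{i}\; P(y_i)\prod_{d=1}^{D} P(x_d \mid y_i),
```

where the evidence P(x_1, ..., x_D) has been dropped because it is identical for every class.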
00:49:25
So for the classification phase: training is really converting the data set into
00:49:30
frequency tables, looking at the frequencies of the distributions in the data set,
00:49:34
and creating the likelihood tables, finding the probabilities. We might need to make small corrections: we might have
00:49:40
features and classes that do not really come up in the training set, so we might have to make
00:49:44
some assumptions about the distributions themselves if we want to cover everything equally.
00:49:49
Classifying is then just using these equations, using the likelihood tables and the frequency table itself:
00:49:55
calculate the posterior probability for every class, and the class with the highest
00:49:58
posterior probability is the outcome of the actual prediction. That is all there is to it.
00:50:02
The very simple example that comes up time and time again is this idea of: do we go out and play,
00:50:07
or don't we go out? You are able to say, okay, if it is sunny, did someone go out? Overcast, yes? You know,
00:50:13
raining, yes, did someone play? And so on. So we look at the frequency table; this is just saying,
00:50:18
for the classes rainy, sunny and overcast, the total number of, you know, yes and no outcomes, and then we assign
00:50:23
the different probability distributions. The only assumptions here are the ones we read
00:50:27
straight off the data. Then we can form a likelihood table, just looking at the distribution of
00:50:31
the different weather conditions, rainy, overcast and sunny, against play yes or no, and things
00:50:36
like that as well. And essentially, when forming classifications, it is all done
00:50:40
by Bayes' rule: we get all of this information from a lookup
00:50:43
table drawn from the data, and from that we can start to infer things like: if it was sunny,
00:50:49
is 'the players will play' a correct statement? We can find that the likelihood of it being sunny and going out to
00:50:54
play is about sixty percent. We can form these probabilities and we can assign
00:50:59
the class label itself with them. We could also tweak exactly where we
00:51:02
want to make this decision based on prior knowledge as well, within naive Bayes.
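A hedged reconstruction of that calculation; the counts below are illustrative stand-ins for the frequency table rather than the exact numbers on the slide, chosen so that the sunny case works out to the sixty percent mentioned:

```python
# weather -> [count of play = yes, count of play = no]
counts = {
    "sunny":    [3, 2],
    "overcast": [4, 0],
    "rainy":    [2, 3],
}
n_yes = sum(yes for yes, _ in counts.values())
n_no = sum(no for _, no in counts.values())
total = n_yes + n_no

def p_play_given(weather):
    # Bayes' rule: P(yes | weather) proportional to P(weather | yes) * P(yes)
    p_yes = (counts[weather][0] / n_yes) * (n_yes / total)
    p_no = (counts[weather][1] / n_no) * (n_no / total)
    return p_yes / (p_yes + p_no)

print(p_play_given("sunny"))   # -> 0.6 with these illustrative counts
```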
00:51:07
So naive Bayes is relatively easy and fast for making predictions on different test sets
00:51:12
as well, and it can perform multi-class classification very, very well.
00:51:16
A big advantage of naive Bayes over a lot
00:51:20
of different algorithms is that it can actually handle categorical input variables itself,
00:51:24
so we do not actually have to have some numeric distribution within our feature space: we can start using, like,
00:51:30
words and different other forms of features that we might want to use.
00:51:34
The main disadvantage is the idea of zero frequency: what happens if we get a test
00:51:39
vector that has a category that was not observed in the training sample?
00:51:42
Say we had this overcast category here in the frequency table;
00:51:46
now, if we had no occurrences of it within our actual training samples
00:51:50
and it then comes up in our test samples, we do not really know what
00:51:54
to do with it. And we always have this big assumption of independent predictors, which
00:51:59
does not hold true in real-life data itself: everything is slightly correlated with everything
00:52:04
else, and we are making this very naive assumption, hence it is called a naive classifier.
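The zero-frequency problem is usually patched with a small correction such as add-one (Laplace) smoothing, which is one way of making the "assumptions about the distributions" mentioned earlier; a sketch with assumed variable names:

```python
def smoothed_likelihood(n_xy, n_y, n_values, alpha=1.0):
    """P(x | y) with add-one (Laplace) smoothing.

    n_xy: training samples of class y with feature value x
    n_y: training samples of class y
    n_values: number of distinct values the feature can take
    """
    return (n_xy + alpha) / (n_y + alpha * n_values)

# an unseen value no longer gets probability exactly zero
print(smoothed_likelihood(0, 9, 3))
```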
00:52:09
So the last model I'll cover in this particular family is the
00:52:16
Gaussian mixture model itself; you probably did a little with this within hidden Markov models.
00:52:20
A Gaussian mixture model is a convex combination of Gaussian probability density functions
00:52:27
that we actually fit to the data. It is a really compact way of expressing a
00:52:30
model, in that we express a feature distribution in
00:52:34
terms of some mixture weights, the means of the Gaussians and the covariances of the Gaussian components, so it is quite a
00:52:40
compact way of doing a generative model, and it works pretty
00:52:44
well a lot of the time. In this compact representation
00:52:47
we have some sort of mean vector, so we learn some mean distributions for each class; the
00:52:52
mean vector roughly characterises the shape of the feature space itself.
00:52:56
We have a covariance matrix, which characterises the variability within the feature space itself.
00:53:02
We also have the weighting, and this is essentially just saying how important the different clusters are, their different sizes:
00:53:09
small ones are not super important, and the weight reflects the amount of data essentially covered by each mixture component
00:53:14
within this sort of clustering. So we might have
00:53:17
one Gaussian, two Gaussians, three Gaussians coming up in
00:53:21
a single model, and we build a separate model for each class: one for the positive class, one for the negative class.
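In symbols, a GMM with M components models the class-conditional density as (standard notation, added editorially):

```latex
p(\mathbf{x} \mid \lambda) \;=\; \sum_{m=1}^{M} w_m\, \mathcal{N}\!\left(\mathbf{x};\, \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m\right),
\qquad \sum_{m=1}^{M} w_m = 1,
```

with one parameter set λ = {w_m, μ_m, Σ_m} trained per class.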
00:53:26
And so on. Classification then essentially just uses the likelihood function and Bayes' rule.
00:53:30
Training is generally done with the expectation
00:53:34
maximisation algorithm, or with maximum a posteriori (MAP) updates as well.
00:53:39
So what we really need to do is to maximise the GMM likelihood function
00:53:43
on the actual data set itself. We generally do not maximise the
00:53:47
likelihood function directly, though; we maximise the log-likelihood function,
00:53:51
and this is essentially a computational thing: it avoids underflow from
00:53:56
working with very, very small numbers, which is otherwise very easy to
00:53:59
hit, and it also makes our estimation easier: we go
00:54:02
from multiplications to summations, which just makes it a bit more computationally efficient. That is generally why we use the log-likelihood.
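A small sketch of that log-likelihood, assuming SciPy is available and that `w`, `mu` and `cov` hold the weights, means and covariances of one fitted mixture:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, w, mu, cov):
    # log p(X) = sum_n log sum_m w_m N(x_n; mu_m, cov_m), computed in the log
    # domain so that very small densities do not underflow
    log_terms = np.stack(
        [np.log(w[m]) + multivariate_normal.logpdf(X, mu[m], cov[m])
         for m in range(len(w))],
        axis=1,
    )
    return logsumexp(log_terms, axis=1).sum()
```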
00:54:08
So, at a very high level, what is happening within the EM
00:54:12
algorithm is essentially the same as what is happening in
00:54:16
k-means. We start with some initial estimate of what the parameters
00:54:19
might be: we could estimate them directly, randomly assign them,
00:54:23
or seed them with, say, k-means as a first guess of the clusters, and now I am
00:54:28
going to try to fit more finely tuned Gaussians on top of these.
00:54:31
We then compute the likelihood that each component produced each particular data point; this is essentially the expectation
00:54:36
step. We then compute weights based on this likelihood that each component produced the point.
00:54:41
We use these weights, together with the data, in the estimation, the maximisation step,
00:54:46
where we try to improve the likelihood estimates time and time again,
00:54:49
and we essentially do these steps over and over until we reach some sort of form of
00:54:54
convergence, or we say, okay, nice, stop, you have done enough steps; then it is off to classifying what we actually want to do.
00:55:00
So it is an iterative method: we start with two rough estimates of what we
00:55:03
think might be happening, and then, through
00:55:07
the successive EM iterations, we end up with a final model
00:55:10
roughly reflecting that we had two classes, A and B, here:
00:55:14
one with a slightly wider distribution, and we form one with a
00:55:18
narrower distribution here. So that is the final model itself.
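A bare-bones one-dimensional version of those two steps, with made-up initialisation choices and no safeguards against degenerate components, purely to make the E-step and M-step concrete:

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    w = np.full(k, 1.0 / k)                    # initial mixture weights
    mu = rng.choice(x, size=k, replace=False)  # rough initial means
    var = np.full(k, x.var())                  # shared initial variance
    for _ in range(n_iter):
        # E-step: responsibility of each Gaussian for each point
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = w * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances from the responsibilities
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var
```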
00:55:23
Classification is always done with Bayes' rule. So we have some test data on one
00:55:27
side, and on the other the training algorithm and the training sets;
00:55:32
from these we build the class models, so we form a
00:55:35
different Gaussian mixture model for each class of interest itself,
00:55:38
and then classification involves identifying the model that returns the
00:55:42
maximum conditional probability among the different classes; so,
00:55:45
essentially saying, okay, given the test data, which Gaussian mixture returns the maximum probability here?
00:55:51
Computing this directly can actually be quite difficult, so we flip it
00:55:55
around using Bayes' rule: essentially, instead of going from class
00:55:59
to Gaussian, we go from Gaussian to class, and we also have the prior distribution of each model itself.
00:56:04
Classification is then essentially just finding which model returns the most likely result,
00:56:10
using the extra prior term that we have here.
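Put together, classification with per-class GMMs can be sketched as below; scikit-learn is assumed, `X_train`, `y_train` and `X_test` are placeholder arrays, and four components per class is an arbitrary choice:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

classes = np.unique(y_train)
models, log_priors = {}, {}
for c in classes:
    Xc = X_train[y_train == c]
    models[c] = GaussianMixture(n_components=4, random_state=0).fit(Xc)
    log_priors[c] = np.log(len(Xc) / len(X_train))

# log p(x | class) + log P(class) for every class, then pick the largest
scores = np.stack(
    [models[c].score_samples(X_test) + log_priors[c] for c in classes], axis=1
)
y_pred = classes[scores.argmax(axis=1)]
```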
00:56:15
The advantages come down to this compact representation of the model
00:56:19
pdf, which can achieve quite a good level of
00:56:21
accuracy, and it is essentially quite easy to fit using EM, or using the MAP algorithm, which
00:56:26
I did not cover; the MAP algorithm is a more iterative update for the individual classes,
00:56:31
and it is used, for example, when deriving vectors and things like that. Disadvantages:
00:56:37
it is inflexible if we chose a really inappropriate model to begin with; it is not
00:56:41
going to give us the best choice, so we need quite
00:56:44
strong priors, some estimates of what we think the number of Gaussians and the distributions might be.
00:56:50
If the true model is really skewed or quite complicated,
00:56:54
and the distributions do not really have a Gaussian shape, it is not really
00:56:59
the best particular model that we can actually work with itself.
00:57:03
So the last model to briefly mention is hidden Markov models.
00:57:08
This is essentially now almost an extension of the Gaussian mixture
00:57:12
models: we want to map a sequence of observations onto a sequence of
00:57:17
labels, and we are essentially saying the state is hidden, and we can only look at the probabilistic variations that can be observed.
00:57:23
There are a lot of different parameters we can tune with a hidden Markov model compared with a Gaussian mixture.
00:57:27
The way hidden Markov models are usually introduced is: say I was sat
00:57:33
here in the lecture theatre, and what we want to do is classify what the weather
00:57:36
is outside. We cannot see the weather directly, so we have to take observations of what
00:57:41
might be happening. The way you might do that is to look at
00:57:45
how people come in: if they are just in a shirt, just in shorts, it is sunny, it is warm; if they are in raincoats, we can
00:57:51
infer that it might be raining. That is essentially what is happening, and it is a useful way to think about hidden Markov models.
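A toy numerical version of that picture, with invented probabilities: the hidden states are the weather, the observations are what people wear, and the forward algorithm scores an observation sequence under the model:

```python
import numpy as np

states = ["sunny", "rainy"]                 # hidden states we never observe
obs_index = {"shorts": 0, "raincoat": 1}    # what we actually see

pi = np.array([0.6, 0.4])                   # initial state probabilities
A = np.array([[0.7, 0.3],                   # state transition probabilities
              [0.4, 0.6]])
B = np.array([[0.8, 0.2],                   # emission probabilities
              [0.1, 0.9]])                  # rows: states, columns: observations

def forward(observations):
    # forward algorithm: P(observation sequence | model)
    o = [obs_index[s] for s in observations]
    alpha = pi * B[:, o[0]]
    for t in o[1:]:
        alpha = (alpha @ A) * B[:, t]
    return alpha.sum()

print(forward(["shorts", "shorts", "raincoat"]))
```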
00:57:57
The real advantage is modelling sequential data; the real disadvantage is probably the large number of parameters
00:58:03
that come up with hidden Markov models, and the assumption that we have enough data to actually
00:58:08
model them and learn the correct parameters.
00:58:12
So that brings an end to the lecture, but it is not quite
00:58:16
class dismissed yet, because I have got a little quiz online that I would like you to do.
00:58:21
So if you could all get out your phones and answer this little quiz, it just goes over some
00:58:26
of the main points and re-emphasises some of the main things we actually talked about this morning.
