Nick Cummins, Universität Augsburg

Wednesday, 13 February 2019 · 10:59 a.m. · 58m 41s

Note: this content has been automatically generated.

00:00:04

Okay, I'll get started with the second part of this morning's lecture. This part focuses on the machine learning aspects — putting the intelligence into computational paralinguistics, the computational intelligence: learning something from data to perform a particular task. We'll look at how we set this up as a problem, the key aspects around generalisation, and then talk about the different models of machine learning, their advantages and disadvantages, a brief introduction to roughly how they work, and why we might choose a particular one at a particular point in time. So, what is machine learning? There are lots of different ways we can define it for any particular task, but the definition here is: some sort of statistical learning technique used to automatically identify patterns in data, and to identify these patterns in such a way that we get better and better at doing a particular task.

00:00:58

We start machine learning by defining some domain of interest, and this is essentially the feature space. We have some feature space, and this feature space has some marginal probability distribution — the distribution of X — and the actual features we extract from collected data span the feature space of the domain. Then we need to define some generic analysis task: essentially a mapping from the feature space to a particular label space, which in these notes is Y, the labels — for us, in terms of the task, some sort of speech pathology label. We always have corresponding labels, and essentially what we want to do is learn some mapping between this feature space and the label space. We want to learn it in such a way that when we get a new piece of data — beyond the data we used to actually learn the mapping — we are accurately able to predict the label that goes with it.

00:02:06

What I mean by new data is data that has not been seen by the machine learning algorithm while we were optimising its parameters, while we were learning from the data we generally use. We can set up machine learning algorithms in one of two particular ways.

00:02:20

The way we most commonly do it in speech pathology detection is, of course, supervised learning: learning a model from labelled features — we have the features together with their corresponding labels. The other set-up is unsupervised learning, where we might want to discover labels or models from the data itself. We could have loads and loads of different speech samples collected with no idea of what is going on in them; we might want to cluster them together and learn something from that. So we generally define the features, we then do unsupervised learning of some sort of model, and from there we might be able to infer labels, or particular labels of interest, or discover new relationships we might not have thought existed. Today the focus is really on supervised learning. We'll touch a little bit on techniques that work for unsupervised learning, but on the whole we'll just talk about the supervised mapping: learning a particular function to go from the feature space to the actual label space itself.

00:03:15

This is the basic processing chain, and this is true of any supervised learning we do — any machine learning task, and in particular ours. We always have some sort of input data, in our case an audio file. Then there is feature extraction, which is what we talked about this morning: we get our speaker or feature representations. Then, essentially, we want the algorithm to find the mapping so that we have some sort of output. This could be some probability or certainty of an answer — do they have a pathological condition or don't they, and ideally how sure are we that they might have this condition — or it could just be a one-to-one mapping: yes they have it, no they don't. This is what the machine learning algorithm itself does: it learns the rules it needs to predict the labels from the actual features themselves.

00:03:59

How do we do this? We normally do it by setting up some sort of cost function. It could be based on probability distributions, it could be formed in a variety of different ways; essentially we just want to find and optimise this cost function using some sort of maximisation or minimisation — essentially just using some form of calculus. We always want to find a local minimum or a local maximum, so we can set the derivative to zero and simply solve the equation. At a very, very broad level, that is roughly what is happening in machine learning: we are finding a way to learn an optimal decision boundary, and the optimal decision boundary is normally something that pushes the separation between the classes as far apart as it can, to give us the best chance of also labelling new data instances correctly. When we have very clean, separable data this is very simple to do; when we don't have separable data it is very difficult, we have to make some assumptions, and we always have trade-offs to make in the performance of the algorithm itself — we trade off performance parameters within the machine learning algorithm.

00:04:57

Machine learning algorithms are really the brains of the system — the building of the mathematical model itself — and we can roughly group them into two main classes. The first is discriminative models, which is essentially when we try to learn the mapping directly: given the training data, we learn a mapping that separates the classes of interest, or performs regression, within the training data itself. We're not really caring about any wider probability distributions that might be happening here. Support vector machines and deep learning models fall into this class, and a lot of different things go on here. Then we have generative models, which essentially try to learn the joint probability function between the data and the labels, and then do detection by rearranging things using Bayes' rule, which allows us to turn it into a classification task. You will cover hidden Markov models, so you'll see the idea of a generative model there, as well as Gaussian mixture models and k-means clustering — we'll talk about some different forms of generative models. Generative models are also used a lot within unsupervised learning, because there we are trying to model the wider probability distributions that might be present in the data set itself.

00:06:05

Before I really talk about the specifics of algorithms, their advantages and disadvantages, I'll talk about generalisation, because generalisation is what really separates machine learning from something like a pure optimisation task. It is very easy to say: I have a cost function, I want to minimise it, or maximise it, or do something to it — that is exceptionally easy. What really gives machine learning its brains, its power, the idea that we learn something from the data, is this idea of generalisation, not just minimising or maximising your cost function. That part is done using pretty standard optimisation techniques: generally, as I said, we take the derivative or gradient of the cost function, set it to zero and solve for the model parameters themselves.

00:06:47

That is very easy for me to say, but it isn't always possible: not all solutions have some sort of closed form, and even when they do have a closed form it might be very computationally expensive, so we have to use iterative methods to really get this sort of complex optimisation done. One of the ways we can do this is something like gradient descent, which is used a lot in machine learning, and in neural networks especially: an iterative solution that follows the negative gradient of the function, working its way towards the global minimum that we might have within a particular cost function, taking step after step downwards.
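
As a rough illustration of that idea — not taken from the lecture materials, just a minimal sketch assuming a simple one-dimensional quadratic cost with its minimum at w = 3 — gradient descent repeatedly steps against the gradient:

```python
import numpy as np

def cost(w):
    # hypothetical quadratic cost, minimised at w = 3
    return (w - 3.0) ** 2

def grad(w):
    # analytical derivative of the cost above
    return 2.0 * (w - 3.0)

w = 10.0             # arbitrary starting point
learning_rate = 0.1  # step size
for step in range(50):
    w = w - learning_rate * grad(w)   # move against the gradient

print(round(w, 4), round(cost(w), 6))  # w approaches 3, cost approaches 0
```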

00:07:24

But we want to do this in such a way that we're not just fitting the minimum within a particular training set; we want to learn the sort of wider minimum that might be there. We want to make sure we're not just going: "I've got a hundred per cent accuracy on the data that I've collected — look, I can do this a hundred per cent accurately, this is the best speech pathology detection system ever, I'm going to give it to everybody and they can put it straight into practice." That is the worst machine learning system you can have — it has learnt nothing — because if we overfit to the training data we're going to get a really, really bad system. I'll talk a bit more about this in a moment, and about the different errors we have to look out for.

00:08:03

So in machine learning we don't just want to minimise the training error — the mistakes made on the training data — we also want to minimise what is called the generalisation error. In the training phase, when we're trying to learn something from the data, we optimise the parameters of our algorithm from the data itself, and at the same time, during training, we are always minimising the training error: the difference between the actual and the predicted labels within the training set itself.

00:08:27

We always want to get this as small as we possibly can, but at the same time this introduces a lot of different errors, mainly relating to sampling error. In any training data that we might have, we only have a very small selection of the wider distribution of the actual task we're interested in. We might have collected a couple of hundred samples, but there are thousands, millions of people with the particular disorder of interest, so there is no way we've covered all of that. If we just optimise on this very small sample, we're not going to get a very good machine learning model, and this is what we need to deal with.

00:09:02

This is the idea of generalisability: the model must accurately and adequately label new test data samples. So in the training phase we minimise the training error, and we always test — in the test phase we minimise the test error. Even though we use our whole database for training and optimisation, we're really trying to minimise this generalisation error, not just the error on the training data. This is why we do training and why we do testing in machine learning, so we always have this sort of two-step process.

00:09:29

The training phase always looks like this: we have our labelled data, we do our preprocessing — removing outliers, maybe some denoising within the data itself, maybe some other functions here — we do our feature extraction, we run openSMILE or a CNN, whatever we're doing there, and then we do our machine learning: we actually train the model and we optimise the model's set of parameters and hyperparameters — the external things that we can tune about the operation of the model. We do this in the training phase, and in supervised learning we are always doing this with respect to the labels from the labelled training data.

00:10:02

Then we say: great, I've got this model, excellent — how well does it work? This is why we need the testing phase: we want to measure this generalisation error. We get a new data sample — it isn't entirely new, we've just held out some of the labelled data we actually collected — we do the same preprocessing (if we've done denoising here, we try to get it into the same conditions), we do the same feature extraction, and then it just passes through our learnt model and we get the predicted label. We get a set of predicted labels, and we measure these in the test phase.

00:10:34

That really gives us our idea of performance; that's how we know when we've got a good model and when we don't have a good model. If we just optimise in the training phase and never really check in the test phase, we never get a good idea of whether we have a good model or a bad model. We have to spend a lot of time doing this in machine learning — machine learning research can be incredibly boring, because we are just doing this training, testing, training, testing, change a couple of parameters, change a couple more, training, testing — but it is very, very important to do, to make sure you get the best and most generalisable model that you possibly can.
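
As a rough sketch of that two-phase set-up — purely illustrative, assuming scikit-learn and a synthetic feature matrix X with labels y rather than the actual openSMILE features used in the lecture:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# hypothetical data: 200 recordings, 88 acoustic features each, binary pathology label
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 88))
y = rng.integers(0, 2, size=200)

# hold out data that the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# training phase: preprocessing and model fitting use the training data only
scaler = StandardScaler().fit(X_train)
model = SVC(kernel="linear", C=1.0).fit(scaler.transform(X_train), y_train)

# testing phase: apply the same preprocessing, then measure performance on unseen data
y_pred = model.predict(scaler.transform(X_test))
print("test accuracy:", accuracy_score(y_test, y_pred))
```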

00:11:10

The reason we want to do this is that we're looking for a particular set of errors, and we're trying to get rid of them, or at least not have them. We have the error of underfitting, which is essentially when we create a model that's too simple: we just make some assumptions about the distribution of the two classes and draw a line down the middle. It's very, very simple, it's not a complex model, it hasn't had a chance to learn, and it makes a lot of mistakes; it has what's known as high bias — it generally makes a lot of similar mistakes (I'll talk about bias in a little bit). At the same time, what we also want to avoid is overfitting, which is making a model too complex: we optimise it so it makes absolutely no mistakes on the actual training data, and we learn a convoluted decision function that in no way reflects anything about the real world that we might be looking at in the data. When we apply it to new data we're going to get pretty much the same kinds of errors — it might just be at chance level, or even beneath chance level.

00:12:08

So we can have something like underfitting or overfitting, or, in the case where we can't really find a linearly separable model, we are just trading off where we want particular errors to be made and trying to make the most accurate system we can. We might tune the system to make fewer false negatives or fewer false positives; we do this with different cost functions and different hyperparameters, so that we can always find the most robust model we can in terms of working on a particular held-out subset of test data — increasing the generalisation, increasing the robustness of the model. This is really, really important to do in machine learning.

00:12:43

And it's based essentially on the concepts of bias and variance errors. Bias is saying, on average, how much the predicted values differ from the actual values, so it reflects a common, systematic way of making errors: if we make a very, very simple model, we find the model just makes the same error time and time and time again. At the same time we might have errors where we get very, very different predictions: two test instances might be very close to each other but give us wildly different estimations from the output of the machine learning model — this is the idea of high variance. So we're really trying to look for something where we have both low variance and low bias within the actual machine learning model. Of course, this isn't really fully possible; there are ways we can approach it, but generally we always have to trade off bias errors against variance errors when we're minimising for generalisation errors.

00:13:32

Essentially what happens is that we start with a very, very simple model — say a decision tree that's only making two or three decisions — and it's going to have very high bias error. We start to increase the complexity of this model, we add more parameters, add more features, and the bias drops. We move up to some really cool big deep learning system, but we don't really have that much training data, and we increase the variance error as we do it. So as the model becomes more and more complex the bias errors drop — we get rid of those — but at the same time we increase the variance errors. Essentially what we're doing, with continual training and testing of algorithms, is looking for the trade-off point, the optimal model complexity for the particular task and the particular algorithm that we might have.

00:14:17

There is no easy answer here: the "no free lunch" theorem essentially says that no machine learning model is inherently better than any other machine learning model; we just have to find the most suitable one for the data we have. If we've got a very, very small data set, there's no point in trying to do really cool deep learning with it, because we're going to overfit: we'll probably end up with a model that can't account for the variance in the data, because we simply can't train all of it properly. So it's not about asking what the best algorithm is; it's essentially asking what the most suitable algorithm is for the actual job we have, the amount of training data we have, the feature representation we want to use, and so on. There are lots of things we can choose and change within machine learning, and this is the real reason we do all of this: we want to trade off model complexity against the different errors the actual system can make.

00:15:02

In computational paralinguistics, the default way we do this now is not just training and testing: we often use a validation set as well, so we split the data essentially three ways. If we're just doing training and testing, we can still learn a model that does not generalise well — we could just overfit to the training data and to the test data at the same time if we don't have enough of it. That's why we introduce this idea of a validation set. What we're doing here is training our model: we take maybe sixty or seventy per cent of the data and use it to train the model, and we have a look at things like feature sets, normalisations, different hyperparameters, and we try to optimise the model and learn the best representation we can from it. Then we use maybe twenty per cent of the data to do validation: we try to minimise this generalisation error — we do our training, we do our testing on the validation data — and we go, okay, this worked, this didn't work, this worked, this didn't work, and so on, until eventually we say: okay, I've got this data here, I've got my feature space here, I've got my machine learning algorithm here, I've got my settings for it, I think this is the best one possible, this is really cool, this is going to make a difference.

00:16:09

Then, to really confirm that we've got something generalisable, we use the test set, the evaluation data. This is brand-new data — nothing from it has been seen while we were doing this optimisation — and with it we can find what our true algorithm performance is, using this held-out test data. We generally combine the training set and the validation set, retrain the model using the set of features and optimisation settings that we found, test it again, and find out whether we've got a generalisable model. If you've done it this way, you can very quickly see whether you get very bad accuracies from seemingly very good models: you do training and validation, you put it onto the test set, and you get a big drop in performance — then you know, okay, this is probably not the best way to do it. If the performance holds steady, you go: I've probably got a good model. Ultimately, if the performance increases — because you've actually improved the model by combining the training and validation data and training on a bit more — you get this: right, yes, I've found the best thing possible that we can do.
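
A minimal sketch of that 60/20/20 split, again assuming scikit-learn and hypothetical arrays rather than the lecture's actual corpus — two chained train_test_split calls give the three partitions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical feature matrix and labels
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 88)), rng.integers(0, 2, size=200)

# carve off the held-out test set (20 %), then split the remainder into
# training (60 % of the total) and validation (20 % of the total)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

# tune features and hyperparameters against (X_val, y_val) only; once a configuration
# is chosen, retrain on X_trainval and report a single result on (X_test, y_test)
print(len(X_train), len(X_val), len(X_test))   # 120, 40, 40
```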

00:17:13

It's very important, when you're reviewing papers and suchlike, that this is one of the things you look for, to see whether people really understand what's going on — especially in speech pathology, especially in computational paralinguistics. The train / validation / test set-up is generally the preferred one; if, in a paper I'm reviewing, they've only done training and validation, there's really no way to know how generalisable the model is. So this is generally the optimal way of doing it now, in the different fields. In deep learning especially, when we've got a lot of data, we can sort of get away with doing training and validation only, because the validation set is still quite a fair chunk of data, but it's still not really the optimal way to do it.

00:17:49

Often in speech pathology, though, we don't even have enough data to really divide it into sixty/twenty/twenty. We might have ways, but it still might not be enough to really train even basic models. If this is the case, we often use cross-validation. This essentially says: we have some sort of sample space, and we use a huge portion of it — maybe ninety per cent of my data — to actually train the algorithm, and we use ten per cent of my data for testing the algorithm. Then we just rotate and repeat this k times — ten times if we're doing ten-fold cross-validation — and we take the average of all of these performances, and that gives us a rough idea of what the generalisability of the model might be, how well it's actually performing. Sometimes we can combine approaches: we can do cross-validation on some sort of train/validation set-up, combining these methods to really try to find the most optimal parameters possible.
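
A small illustration of k-fold cross-validation with scikit-learn — again a sketch on synthetic data, not the actual experimental set-up from the lecture:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# hypothetical small data set
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 20)), rng.integers(0, 2, size=100)

# 10-fold cross-validation: train on 90 %, test on the remaining 10 %, ten times
scores = cross_val_score(SVC(kernel="linear", C=1.0), X, y, cv=10)
print("fold accuracies:", np.round(scores, 2))
print("mean accuracy:  ", scores.mean())   # rough estimate of generalisability
```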

00:18:39

But really, a held-out test set that we have learnt nothing from is still the best way to evaluate a model and to get the most suitable representation of any particular model. This idea of generalisability is really important when you're doing machine learning: not overfitting, not underfitting, choosing the right model for the particular task. That's where you always spend the time in machine learning research — training, testing, trying huge ranges of parameters, and just waiting for a particularly good result to come out of it.

00:19:09

That's all well and good, but what do we actually do — how do we do this? There are hundreds of different ways we can do machine learning, different machine learning algorithms, so I'm just pulling out some of the more common ones that come up in speech pathology and within computational paralinguistics.

00:19:24

The first class of algorithms I'll talk about is discriminative models. Essentially we're learning some hard or soft decision boundary between the classes of interest. We're assuming that there is some joint probability distribution, but we're really just trying to estimate the parameters of the decision function directly from the training data itself; we're not trying to learn the wider probability distribution of the training set, of the features, or of the label space. The advantage here is that we are directly learning a decision function as the objective, and it's pretty good when we only have a very small number of data points to train with: we're not trying to estimate some wider probability distribution and then make a decision based on that — we're asking, what do I need to know to make the decision at hand? So it's probably one of the better options out there, and it's used quite widely.

00:20:15

The examples of discriminative models I'll talk through are random forests, k-nearest neighbours, support vector machines and neural networks. You will have your deep learning lectures, I think tomorrow or the next day, so I'm not going to go into much more detail on that — we touched on convolutional neural networks this morning, and in terms of deep learning you'll go over that a bit more.

00:20:37

The first one I want to talk about is the random forest. A random forest is essentially an ensemble of decision tree classifiers. Remember my very first couple of slides this morning, where I put up a decision tree for choosing whether your pizza is going to come on time or not? This is essentially just taking a group of trees, each trained on slightly different views of the data, or trained using slightly different parameter settings — so each one is looking at the data in a slightly different way — and then we essentially sum, or take an average of, what we've learned from the various decision trees to actually make the final decision. The final decision is either a vote, or some sort of average if we're producing a regression output. So the trees are essentially looking at the data from different views, depending on how we set it up.

00:21:26

At the heart of any random forest is the decision tree classifier: a non-parametric supervised machine learning algorithm that essentially learns the target by simply making a set of decisions, and in making that set of decisions it subdivides the feature space down and down and down until we're actually able to make a classification. It's probably the most interpretable of all the classifiers I'll talk about today — it's known as an explainable one.

00:21:52

I haven't put too many slides in on explainability, but I'll talk a little bit about it here and when we talk about some of the others. Essentially, explainability is one of the things we really need in speech pathology now: thanks to the GDPR rules, patients — if you want to put machine learning into practice in any critical setting — have a right to understand how a decision got reached about them. For most of the machine learning methods we use, it's very hard to do that, and it's a big open field of research that we need to get more into.

00:22:20

But decision trees are very easy to explain: we're really just subdividing the feature space, making decisions to maximise the information gain at each point. Which decision will give us the most we can learn from the data at each point in time? We keep breaking this down, down, down: the root node is at the top, and then we just go down, down, down, and at every node we make a decision about the data itself, which splits it into further and further decisions, until eventually we reach a final decision within the tree.

00:22:53

We'll give a bit more of a worked example later this afternoon, but here is a very simple example of the way we could do decision trees. We have a group of animals that we want to classify, and we have certain aspects and features about the animals that we can use. Essentially we just ask: number of legs — two or four? — so essentially a human versus a four-legged animal; we can keep breaking them down and splitting. It's just a rough, crude idea of how decision trees work.

00:23:20

The key properties: class labels are associated with the very bottom, the leaf nodes; each internal node really represents some sort of decision rule, and classification boils down to making a decision at every single node; we have a set of patterns associated with each node; and only the relevant features end up being used, so it does feature selection automatically for us. It's relatively simple and very, very easy to understand. Essentially we build it in a recursive-algorithm style, where we're trying to maximise the information gain — the reduction in entropy — at each decision. Which decision returns the maximum amount of information? We make it, move on to the next one and the next one, and break this down until we've reached the class labels that we particularly want. The steps are: calculate the entropy of the parent node, calculate it for each of the individual splits we might do and the nodes that go with them, and then always choose the split that gives us the best information gain — the biggest drop in entropy of all the splits we found. We apply a bit of information theory, find the best possible split, and then we keep going until we actually get down to the class labels.
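
A small numeric sketch of that split criterion (my own illustration, not taken from the slides): compute the entropy of a parent node and the information gain of a candidate binary split.

```python
import numpy as np

def entropy(labels):
    # Shannon entropy of a set of class labels, in bits
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # entropy of the parent minus the weighted entropy of the two children
    n = len(parent)
    weighted_child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_child

# hypothetical node: 5 pathological (1) and 5 healthy (0) samples
parent = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
# a candidate split that separates the classes imperfectly
left, right = np.array([0, 0, 0, 0, 1]), np.array([0, 1, 1, 1, 1])
print(round(information_gain(parent, left, right), 3))   # ~0.278 bits gained
```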

00:24:25

Working through the advantages very briefly, which I've already touched on: they're quick and easy to interpret, computationally they're very simple and quick to fit, and they're very useful in data exploration — the tree discovers the significant variables we really need to make a decision, and identifies different relationships between them. They're pretty robust, they're suitable on small data, they're not easily influenced by outliers (because we're always maximising the information gain), they can handle missing values relatively well, and they can handle non-linear relationships and interactions.

00:24:57

But for all of these genuinely massive advantages, they are a very, very simple algorithm, so they suffer from bias errors. They're unstable: even a very small change in the data can change the tree structure quite dramatically — if we change our feature space even just a little bit. And on more complex problems they generally have poor accuracy. A decision tree is what's known as a weak classifier: a classifier that, no matter how much we train it on a particular task, generally only gets just above chance level.

00:25:27

What we can do, though, is actually use weak classifiers — and this is the antidote to overfitting — with bagging and ensemble methods, which is what a random forest is: essentially a bagging method over these decision trees. It helps us reduce the variance and increase the robustness of the actual model. We're combining multiple classifiers, each modelled on different subsets of the data, and reducing the variance in the predictions by doing this. From the original data we might create multiple data sets and train multiple different classifiers — something like Weka, I think, has a default of maybe five hundred decision trees in a random forest, so we're actually making quite a lot of different decisions — and then we combine them into the final decision that we actually use. This is done with bootstrapping and by combining results through voting in classification; bootstrapping just means we take a subset of the data, with replacement, and we do this multiple times. Bagging in general, for any kind of classifier, is always trying to reduce variance while at the same time having a limited effect on the bias of the actual system. So it's a good method if you have an ensemble of weak predictors to start the classification with, and when you want to train the algorithm it's actually quite quick.

00:26:40

So we grow a forest of many trees: we need independent bootstraps, at each node we make a split (sometimes we deliberately alter things around a bit), and we're generally not too concerned about the accuracy of the individual trees — what matters is the accuracy of the whole forest of trees. We do the voting and we get our final output.
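
For illustration, a minimal random forest sketch in scikit-learn (the 500-tree figure quoted for Weka is reused here only as an assumed value for n_estimators; the data is again synthetic):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = rng.normal(size=(300, 30)), rng.integers(0, 2, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# an ensemble of 500 bootstrapped decision trees, combined by voting
forest = RandomForestClassifier(n_estimators=500, bootstrap=True, random_state=0)
forest.fit(X_tr, y_tr)

print("forest accuracy:", forest.score(X_te, y_te))
# the inherent feature selection: how much each feature contributed on average
print("most important features:", np.argsort(forest.feature_importances_)[-5:])
```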

00:27:00

So for random forests the big advantages are good accuracy, fast run time, they work well with limited data sets, they can handle high dimensionality, and they inherently perform feature selection through the decision trees inside them. At the same time we lose the interpretability of the individual decision trees, and they're a little more susceptible to overfitting, particularly in noisy environments. And again we come to this thing where, for most of the algorithms I'll talk about, we have this idea of a black-box approach: we have no real understanding of exactly how a decision got made. We understand how we set the algorithm up, the background of how it works, but in terms of asking which features of the data combined to indicate that this person has a particular speech pathology — it's very hard to trace these things back; it's a sort of black-box operation. As I said, the field needs to start shifting more towards explainable solutions, and also to find ways to make these black-box systems themselves actually explainable.

00:27:58

One of the most simple approaches, in terms of discriminative methods, is the k-nearest-neighbours classifier. This is pretty easy: we're not actually doing any real learning from the data; we do the classification just by looking at properties within the training set. The distance evaluation is essentially the whole decision process: we put in our test sample and go, okay, what's my nearest neighbour? Okay, that's the class I'm going to assign to it. And that's pretty much it — we don't really have any training here, just this assignment every time — but it is a very slow method, especially if you have a very large data set, because you've got to work out the distance to every single instance you hold in your training data. Using something like the Euclidean distance, the L2 norm, we might do the simple nearest-neighbour algorithm.

00:28:45

And then, if we find it's not actually that good, we just go: okay, what about different variants, what about different distance measures for assigning the test sample to a class? But again, this generally overfits to the training data, so what we want to do is look at using k nearest neighbours: instead of asking what my single nearest neighbour is, we ask what the class of my two nearest neighbours is, of my three nearest neighbours, and so on. This is the hyperparameter we choose when we're doing a nearest-neighbours classifier — essentially, how many nearest neighbours do we use — and again there is this trade-off between overfitting and underfitting: we've got to find the right, appropriate complexity as we take more and more nearest neighbours.

00:29:34

As we change the number of neighbours, the complexity of the decision function changes, and our chances of doing well go up and then come back down again. So essentially we're just using distance metrics to find the k nearest neighbours and assigning the class from them, and we've always got to choose this k. We choose it experimentally: we want to find the value that minimises some sort of validation error.

00:29:53

We watch the validation error as k changes and pick the k that minimises it; past a certain point it starts to increase again. So, to round off k-nearest neighbours: it works very well on simple, basic recognition problems, and it is robust to noise — we can take a sort of weighting into account, so if the bigger distances are to outlying things we give them less of a weight; if a neighbour is particularly far away we don't trust it too much — we can tune the algorithm so that it tolerates a bit of noise. But it's a lazy learner: we're not really learning anything when we're doing k-nearest neighbours, we're just making the decision itself, and because of this it has a very, very high computational cost. For every single prediction we want to make with k-nearest neighbours, we essentially have to compute the distance to every single instance within the training data — we've got to work out every single distance and find the minimum set every single time. So we don't actually have a model which does this in one operation; the algorithm runs time and time again. That's the particular disadvantage — speed — which is something that's particularly important when making this style of decision.
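
A from-scratch sketch of the idea (illustrative only): for each test point, compute the Euclidean distance to every training point and take a majority vote over the k closest.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    # distance to every single training instance -- the expensive, "lazy" part
    distances = np.linalg.norm(X_train - x_test, axis=1)
    nearest = np.argsort(distances)[:k]       # indices of the k closest points
    votes = Counter(y_train[nearest])         # majority vote among their labels
    return votes.most_common(1)[0][0]

# tiny hypothetical 2-D example
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 0.9]), k=3))   # -> 1
```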

00:30:51

The most frequently used classifier, I guess, in speech pathology and in computational paralinguistics is the support vector machine, and this is because they're just good at making optimal decisions, they're pretty easy to train, and they work very well with small data sets. How does a support vector machine work? Essentially we're always building a binary classifier: it assigns something to class one or class minus one, two classes. We're essentially learning a decision boundary such that the set of nearest training samples to it — these are known as the support vectors — defines the boundary. Geometrically, these are the training patterns that sit absolutely closest to the boundary, and a weighted sum of these training patterns sitting on the boundary gives us a way to actually do the classification. It's based very heavily on classical optimisation: using Lagrange multipliers and doing quadratic optimisation to actually learn the decision function, which I'll go into briefly.

00:31:51

We're learning some sort of class one / class two separation: we want to find the hyperplane decision boundary between them, and we want to maximise the distance between our two classes. Then we can use the support vectors — the training instances that sit essentially closest to the decision boundary — to actually form the classifier. It's a very compact form of classification, because we can throw the rest of the training instances away and have a very small model at the end, compared to the whole training space — and compared to something like nearest neighbours, where we had to sum over every point, here we reduce the summation to a small number of points. It's quite mathematics-heavy — classical optimisation methods, Lagrange multipliers — and I'm not going to go too deep into it, because it would get quite boring.

00:32:38

So for support vector machines, the main advantages: good generalisation — the algorithm is always set up to yield the maximum possible distance between the hyperplane and the classes themselves, so it reduces the chances of things like overfitting, and it's really good when we have small amounts of data as well. It's computationally efficient: we express our model as a function of the support vectors, the points nearest the hyperplane. And the real advantage of support vector machines comes with the trick of the kernel operations: we can learn non-linear decision boundaries essentially using the same algorithm, the same set-up, just by doing what's known as the kernel trick — essentially mapping into higher-dimensional spaces. We'll talk about this more in a couple of slides: we do these implicit mappings in the form of dot products, and that's what allows the support vector operations to do non-linear classification as well.

00:33:31

The core of setting up a support vector machine classifier is really assigning a label via a sign: we've got a mapping function that maps from the feature space to the label space, and essentially we want to learn whether this sign is positive or negative — if it's positive it's assigned to class one, if not to class two. It doesn't really matter what we call class one and class two in these instances, but it's always a binary classifier. With support vector machines, for multi-class problems you're always performing one-versus-all classification, or you're performing a set of binary classifiers stacked up for the multi-class problem, but at heart it's always binary.

00:34:03

So we've got some sort of classification function. What's going on here is that we have a label, we have some sort of weight, and we have some sort of kernel operator that is essentially doing the implicit mapping into a high-dimensional space, or doing the linear separation if that's how we set it up. What we want to do is learn the set of weights, and these weights turn out to be non-zero essentially only for the support vectors, the points closest to the boundary. The classification function we set up is quite a simple one: we have a decision boundary with some sort of bias, f(x) = sign(w·x + b), and we want to learn it. We can always say that the weight vector sits perpendicular to the hyperplane, to the decision function itself. Points with w·x + b greater than zero sit in class one, and the points that actually sit on that margin boundary satisfy w·x + b = +1; we can find an optimal distance for this. The same goes for class two: everything less than zero gets assigned to class two, the minus-one instances, and again the points that sit on that margin boundary satisfy w·x + b = -1. Given some optimal margin, we always want to maximise it, and it works out as 2 over the L2 norm of the weight vector, 2/||w|| — that's where the +1 and the -1 come from. I've done a bit of rearranging in these equations, but it does all check out, and this is the function we want to learn.

00:35:24

We're able to do that quite easily: we set this up as a minimisation problem and we put the constraints into the minimisation itself — essentially we use what's called Lagrangian, quadratic optimisation to solve it. We substitute in the Lagrange operators, making the constraints part of the actual objective, and essentially this allows us to sit there and find the optimal set of weights. The alphas come about from this change in terminology when we move to the Lagrange multipliers, and then we can express the decision function as a weighted sum: the alphas — which are what we actually learn in the quadratic optimisation — times the labels, times the inner product between the two vectors, plus the bias. What turns out is that every single point with an alpha greater than zero must essentially lie on H1 or H2, the margin hyperplanes, and all other points have alphas equal to zero. So you get a very, very efficient summation when actually evaluating the decision function: this point, this point, this point is all we need to know to make the support vector machine decision. From very little information we have a very nice, compact model that allows us to do the separation quite well: the classification model is expressed as a weighted sum over the actual support vectors, the points that are closest to the boundary. That was a bit brief and at a very high level.
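
A tiny scikit-learn sketch of those ideas — illustrative only — showing that a fitted model really is expressed through a small number of support vectors and their dual coefficients (the alphas times the labels):

```python
import numpy as np
from sklearn.svm import SVC

# hypothetical, well-separated 2-D data
rng = np.random.default_rng(0)
class0 = rng.normal(loc=[-2, -2], size=(50, 2))
class1 = rng.normal(loc=[+2, +2], size=(50, 2))
X = np.vstack([class0, class1])
y = np.array([0] * 50 + [1] * 50)

svm = SVC(kernel="linear", C=1.0).fit(X, y)

# only the points closest to the boundary are kept; everything else is discarded
print("number of support vectors per class:", svm.n_support_)
print("dual coefficients (alpha_i * y_i):", svm.dual_coef_.round(3))
print("decision value for a new point:", svm.decision_function([[0.5, 0.3]]))
```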

00:36:48

There is a lot more to support vector machines than what I've talked around here — people have spent whole careers deriving all of these different equations in various different ways, with gradients and errors in different directions. But again, like everything this morning, the bigger theory is in the notes: look up some papers if you're really interested in how support vector machines work. The main thing to know is that it's a really nice classification model that we can sum up very compactly.

00:37:13

We can actually handle non-separable data in support vector machines as well, and this is done by putting more constraints into the quadratic optimisation — into the Lagrange operators themselves — allowing the decision function to essentially make mistakes, and then weighting how badly it is punished for making those mistakes. This is what we control when we actually tune the complexity of the support vector machine: we're introducing some sort of slack variables, permitting the support vector machine to make some sort of mistake — going, okay, I don't mind that you're making this tiny little mistake, because you're giving me a more generalisable model. We essentially tune this complexity to control the complexity of the actual model, and to control the trade-off between underfitting and overfitting, between the bias and variance errors.

00:37:58

That's what we're essentially tuning: when you're tuning a support vector machine you're just adjusting that middle point I talked about, controlling the bias and variance errors. As the complexity value C gets large, we're essentially saying: I don't want you to make errors, I'm going to punish you more and more if you make them — so the decision boundary gets more and more wavy, more and more complex. As C gets small we're saying: okay, we're making more and more assumptions about the distribution of the data, which may not necessarily be true, so we're making the model less complex and heading down towards what's known as underfitting. A rough rule of thumb: we generally set C to less than one to avoid overfitting in a lot of these tasks — again, it's not always true, it's just a rough rule of thumb. So the thing to remember with this complexity value: as it goes up we get a harder and harder margin and move towards overfitting; as it goes down we get a softer and softer margin and move towards underfitting.
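
A quick illustrative check of that behaviour (synthetic, overlapping data — the specific numbers are assumptions, not from the lecture): a larger C gives a harder margin that clings to the training data, a smaller C gives a softer margin.

```python
import numpy as np
from sklearn.svm import SVC

# overlapping (non-separable) synthetic data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.5, size=(100, 2)), rng.normal(+1, 1.5, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

for C in (0.01, 1.0, 100.0):
    svm = SVC(kernel="rbf", C=C).fit(X, y)
    # harder margins (large C) typically fit the training data more tightly
    print(f"C={C:>6}: training accuracy = {svm.score(X, y):.2f}, "
          f"support vectors = {svm.support_vectors_.shape[0]}")
```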

00:38:52

As I also said, the real advantage — apart from this idea of a very compact model that makes an optimal decision function following some very clear optimisation rules — is that we can handle non-linear data within support vector machines, and this is done by the inclusion of what's known as a kernel function. If we look at a purely linear support vector machine, we essentially find that we're taking a dot product between different vectors within the space when making a decision. And essentially we can do this in higher and higher dimensionalities by doing a non-linear transformation, taking the dot product of these non-linear transformations, and just pushing that information into the support vector machine when doing the final decision function. It's called the kernel trick because we never actually map, within the algorithm itself, into the high-dimensional space: we really do it in an implicit form within the dot product operation.

00:39:44

Essentially what's happening is this: if you look at things from one particular dimension — say you're looking at my two hands; from your point of view you can't really see a separation — but if we move into three dimensions you can see there actually is a separation, and things can keep going up and up in dimensions. That's essentially what we're doing with the kernel: we're saying, I want to go into a higher-dimensional space and check whether the data is separable there, and then keep going into higher and higher-dimensional spaces. The different mappings, the different rules, are essentially how we do these mappings, but the basic idea is that something that's non-separable in one dimension will eventually be separable in a higher dimension — hopefully, at any rate. That's what's happening when we're doing the kernel trick: a non-linear mapping into high-dimensional spaces. There are lots of different rules and properties about how we define these, and about what exactly makes a kernel space,

00:40:37

but for brevity I'll skip over that and just say that the main popular kernel functions you'll come across are, obviously, the linear kernel (just doing the dot product between the vectors); the polynomial kernel, doing a polynomial transformation of whatever order — a second- or third-order polynomial transformation, say; and the Gaussian kernel, where we're looking at settings like the variance of the kernel, and controlling different aspects to do with the power we raise the kernel to as well. So there are different trade-offs: every time you do support vector machines and want to do kernel tricks, it takes time — every parameter, the complexity and the different kernel parameters play off against each other — so you are going to spend time doing very, very good grid searches to really find the most optimal settings for the support vector machines, while you're doing your training, validation and testing, or your cross-validation testing. It can take a bit of time, because there are quite a lot of things to tune within support vector machines.
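
For illustration, a grid search over the two most common RBF-kernel knobs, C and gamma, with cross-validation inside the training data (scikit-learn, synthetic data; the parameter grid is simply an assumed example):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 20)), rng.integers(0, 2, size=200)

# candidate complexity / kernel-width settings that play off against each other
param_grid = {"C": [0.01, 0.1, 1, 10], "gamma": [0.001, 0.01, 0.1, 1]}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```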

00:41:32

The other form of discriminative model that you'll talk about over the course of this week is neural networks. Neural networks in general are a discriminative model, and, very briefly, deep neural networks are just neural networks with at least four layers and many, many parameters that have to be learned. Through this idea of multiple layers and multiple non-linear transformations, they essentially have the ability to learn very, very highly complex functions. But the catch — and this is one of the big issues we always have within speech pathology — is that we need a substantial amount of training data: the deeper the network goes, the more things there are to learn and the more training data we need. They work very, very well provided you set them up properly for a particular task and know what your feature space, and the amount of training data you have, is actually capable of.

00:42:30

The last set of models I'll talk about is generative models. The idea now is that we're going to start modelling the distributions: we're looking at the joint distribution between the features and the labels. We want to model this joint probability function: we assume some functional form for the conditional probability, estimate its parameters from the training data, and then use Bayes' rule to calculate the likelihood of a label given a particular set of features. The real advantage of generative models is that this assumption that we're actually learning the wider probability distribution should essentially help us to prevent overfitting. We can also potentially update the models locally, revise them sort of on the fly: if there's some massive outlier, or some sort of mistake remains, we can retrain the models, and we can re-estimate the probability distributions as we get more and more data, within the generative models themselves.

00:43:31

With generative models we always want to model, and then make the classification using, Bayes' rule. With just a little bit of rearrangement we have the conditional probability re-defined in terms of the joint probability, and re-defined again via Bayes' rule. This is saying: given our features, what is the most likely label we could have got? With a generative classifier we go: what's the likelihood this came from class i, what's the likelihood it came from class j, choose the maximum likelihood and return that. You'll probably do this during the hidden Markov models lectures anyway.
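
A minimal hand-rolled sketch of that Bayes'-rule decision (my own illustration, assuming one-dimensional Gaussian class-conditional likelihoods and made-up parameters): compute p(x | class) * p(class) for each class and return the argmax.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    # Gaussian class-conditional likelihood p(x | class)
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# hypothetical per-class parameters estimated from training data, plus priors
classes = {
    "healthy":      {"mean": 0.0, "std": 1.0, "prior": 0.7},
    "pathological": {"mean": 2.0, "std": 1.0, "prior": 0.3},
}

def classify(x):
    # Bayes' rule up to the common normalising constant: argmax p(x|c) * p(c)
    scores = {c: gaussian_pdf(x, p["mean"], p["std"]) * p["prior"] for c, p in classes.items()}
    return max(scores, key=scores.get)

print(classify(0.2))   # -> healthy
print(classify(2.5))   # -> pathological
```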

00:44:04

generative models are often based on the concept of clustering, so

00:44:07

sort of grouping feature vectors together into common spaces, common groups,

00:44:13

so it's a sort of partitioning of the data: looking at how well the model thinks

00:44:17

these partitions fit, learning which little adjustments of the parameters we actually need to make,

00:44:21

then making these adjustments over time and learning the distributions of the different

00:44:26

clusters, the different classes, and then trying to firm these up

00:44:30

so the basic idea of any sort of clustering algorithm goes back

00:44:34

almost to the k-nearest neighbours and the k-means algorithms themselves:

00:44:38

we're trying to find clusters where two points belonging to the same class have a very

00:44:43

small distance measure between them, and two points belonging to different classes have a very, very big distance measure

00:44:49

between them. So it's a very easy concept: we can just form classes, check these, and then

00:44:54

reassign boundaries, reassign boundaries again; that's what's happening a lot, and there are various different ways we can do this.

00:45:00

Within a cluster the distance is small and the feature similarity is high; between clusters the distance is large

00:45:05

and the feature similarity is low, so those points belong to different classes. That's the general sort of way

00:45:10

of looking at it. The advantage of doing any sort of clustering, whether you move into the discriminative

00:45:15

models or the generative models, is that we can often find very compact representations.

00:45:19

So when we think about what we were doing in k-nearest neighbours, working out all these distances

00:45:23

to every single training point; here we've just got clusters of data and we're working out distances,

00:45:27

or some likelihood measure, to the centroid of these clusters. So we're minimising: the test points

00:45:32

aren't compared against everything; we assign the test data to

00:45:36

a class by computing distances to the centre, the

00:45:39

middle point, for each class, and we find the one that gives us the best fit. So the real

00:45:45

core methods that we'll talk about within this framework are k-means, naive Bayes, Gaussian mixture models

00:45:51

and hidden Markov models themselves. K-means is probably the simplest

00:45:57

clustering algorithm we could want: it's essentially the iterative assignment of training points,

00:46:03

one by one, into the different groups that we actually want. In supervised learning we

00:46:07

essentially use the labels to sort of guide the selection of these classes;

00:46:10

once clustered, we just use the centroids of the data and use them to label any new data points.

00:46:16

And that's what we use when we're doing unsupervised learning: we can find a few different distributions,

00:46:21

and then we assume that once a point joins one it gets some sort of new class label, and we can come up

00:46:26

with something from that information later on when we're doing data mining, using something like k-means.

00:46:31

So in k-means the key parameter is essentially the number of classes, k, that we actually want,

00:46:37

and depending on how we actually do it, we could use ground-truth labels to really sort of refine these

00:46:42

so the training algorithm is: we select some initial instances to be centroids,

00:46:47

we assign the remaining instances into one of the k classes,

00:46:51

and we always assign each one to the class whose centroid is closest; then

00:46:54

we recompute the centroids based on the current pattern of assignments,

00:46:58

reassign everything to the updated clusters, and so on and so on and so on, until essentially we get no change

00:47:04

in the positions of the cluster centroids, or

00:47:09

we just say, okay, we want to do this for a set number of iterations.

00:47:13

So the idea is really to do it until we get basically no change between successive iterations:

00:47:19

we have some sort of random initialisation and then we just repeat this assignment, repeat this centroid

00:47:24

update, time and time again, until essentially convergence. So we're assigning our points,

00:47:31

and as we go iteration one, iteration two and so on, the model gets better and better and better and better
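A minimal from-scratch sketch of that loop, assuming NumPy and a data matrix X; k and the iteration cap are placeholders, and the empty-cluster corner case is ignored for brevity.

```python
# Hypothetical sketch of the k-means loop described above:
# assign points to the nearest centroid, recompute centroids, repeat until stable.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # random initialisation: pick k training points as the first centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # convergence: no change
            break
        centroids = new_centroids
    return centroids, labels
```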

00:47:36

the one disadvantage of doing k-means clustering, especially when we're doing this sort of iterative assignment of points,

00:47:41

is that sometimes the final output of the model is very much dependent on actually selecting good

00:47:47

initial points: if we select all our initial points somewhere, you know,

00:47:52

not widely distributed through the data, potentially it's going to take a longer time to

00:47:56

converge, or what it does converge to is a bit rubbish.

00:48:00

K-means is easy to implement and works on large data sets, but it can need a bit of tuning to actually

00:48:07

find the right number of classes. As I said, the initial seeding can have a really strong impact

00:48:12

on the final clusters and final centroid positions, and we can also have quite

00:48:16

a strong sensitivity to outliers and noise during this clustering:

00:48:20

this can really start to affect it and push the different clusters out of position.

00:48:23

We're also making some assumptions about the shape and about

00:48:29

the distribution of the data, which may or may not hold true when we're looking at real data instances

00:48:34

so a really simple form of generative model is called the naive Bayes classifier.

00:48:39

It's called naive because we make certain assumptions about the data and the distribution of the data itself:

00:48:44

essentially we just use Bayes' rule directly, we build likelihood tables and posterior tables, and

00:48:50

we just assign probability distributions directly from the data itself,

00:48:53

but we work on the assumption that every feature is class-conditionally independent, so we condition each feature's

00:49:00

distribution just on the class labels. In real life this assumption probably doesn't

00:49:05

hold, so it's a pretty naive way of doing it. So we're looking at maximising the posterior probability:

00:49:12

we just look up the likelihood from some likelihood table,

00:49:15

use the prior probability distributions, and sometimes a particular

00:49:18

prior probability as well; generally the evidence term, which is class independent, is actually removed

00:49:22

from the algorithm, and this is essentially what I've done here

00:49:25

so for the classification phase: well, training is really converting the data set into

00:49:30

frequency tables, so looking at the frequencies of the distributions over the data set,

00:49:34

creating the likelihood tables, finding the probabilities. We might need to do small corrections: we might have

00:49:40

features and classes that don't really come up in the training set, so we might have to make

00:49:44

some assumptions about the distributions themselves if we want to cover everything equally.

00:49:49

And classifying is just using these equations, using the likelihood tables, using the frequency tables:

00:49:55

we calculate the posterior probability for every class, and the class with the highest

00:49:58

posterior probability is the outcome of the actual prediction. That always brings us to

00:50:02

the very simple example that comes up time and time again, this idea of: do we go out

00:50:07

and play or don't we go out and play? If you're able to record, okay, it was sunny, did someone play? Overcast, yes, you know,

00:50:13

rainy, yes, sunny and they didn't play, and so on, then we can look at the frequency table. This is just saying,

00:50:18

for the classes rainy, sunny and overcast, the total number of, you know, yes and no outcomes, and then assigning

00:50:23

the different probability distributions; the naive Bayes assumptions we have come in right

00:50:27

here. And then we can form a likelihood table, just looking at the distribution of

00:50:31

the different weather conditions, rainy, overcast, sunny, against yes, and things

00:50:36

like that as well. And essentially when forming classifications we just go

00:50:40

by Bayes' rule: we get all this information from the lookup

00:50:43

tables trained from the data, and we can then start to infer things like: if it is sunny,

00:50:49

is it a correct statement that the players will go out and play? We can find that the likelihood of playing given sunny

00:50:54

is about sixty percent, we can form the posterior probability like this, and we can assign this as

00:50:59

the class label itself. With this we could also choose exactly where we

00:51:02

want to make this decision based on prior knowledge as well within naive Bayes
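A minimal sketch of that lookup-table calculation; the counts below are invented for illustration and only approximate the classic play/don't-play example, but they reproduce the roughly sixty percent figure mentioned.

```python
# Hypothetical sketch: naive Bayes on the play/don't-play weather example.
# Counts are invented for illustration; the slide's exact numbers may differ.
counts = {          # frequency table: weather -> (days played, days not played)
    "sunny":    (3, 2),
    "overcast": (4, 0),
    "rainy":    (2, 3),
}

n_play    = sum(p for p, n in counts.values())   # total "yes" days
n_no_play = sum(n for p, n in counts.values())   # total "no" days
total     = n_play + n_no_play

def posterior_play(weather):
    # P(play | weather) is proportional to P(weather | play) * P(play);
    # the evidence term is dropped because it is the same for both classes.
    p_w_given_play = counts[weather][0] / n_play
    p_w_given_no   = counts[weather][1] / n_no_play
    prior_play, prior_no = n_play / total, n_no_play / total
    score_play = p_w_given_play * prior_play
    score_no   = p_w_given_no * prior_no
    return score_play / (score_play + score_no)

print("P(play | sunny) =", round(posterior_play("sunny"), 2))   # 0.6
```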

00:51:07

so naive Bayes is relatively easy and fast for predicting on different test sets

00:51:12

as well, and it can perform multiclass classification very, very well.

00:51:16

And a big advantage of naive Bayes over a lot

00:51:20

of different algorithms is that it can actually handle categorical input variables itself,

00:51:24

so we don't actually have to have some numeric distribution within our feature space: you can start using, like,

00:51:30

words and other different forms of features that we might want, which a lot of other algorithms don't handle well

00:51:34

the main disadvantage is this idea of zero frequency: what happens if we get a test

00:51:39

sample that has a category that was not observed in the training samples?

00:51:42

So we had this category here in the frequency table, overcast:

00:51:46

if we had no occurrences of this within our actual training samples

00:51:50

and it then comes up in our test samples, we don't really know what

00:51:54

to do with it. We also always have this big assumption of independent predictors, which

00:51:59

doesn't hold true in real-life data: everything is sort of slightly correlated with everything

00:52:04

else, and we're making this very naive assumption, hence why it's called a naive classifier
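One standard fix for that zero-frequency problem, noted here as a general technique rather than something from the slides, is add-one (Laplace) smoothing:

```python
# Hypothetical sketch: add-one (Laplace) smoothing so that a category unseen
# in training (e.g. "overcast" with zero counts for one class) never gets
# probability exactly zero.
def smoothed_likelihood(count_in_class, class_total, n_categories, alpha=1.0):
    return (count_in_class + alpha) / (class_total + alpha * n_categories)

# e.g. "overcast" never seen with class "no play" out of 5 "no play" days,
# with 3 possible weather categories:
print(smoothed_likelihood(0, 5, 3))   # 0.125 instead of 0.0
```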

00:52:09

so the last, um, other model that deserves particular attention is the

00:52:16

Gaussian mixture model itself. You probably did a little with this within hidden Markov models,

00:52:20

but a Gaussian mixture model is essentially a convex combination of Gaussian probability density functions

00:52:27

that we actually fit to the data. It's a way of expressing a

00:52:30

model where we express the feature distribution in

00:52:34

terms of some weighting coefficients, the means of the Gaussians and the variances of the Gaussians, so it's quite a com-

00:52:40

pact way of doing a generative model, and it works pretty

00:52:44

well a lot of the time. So in this compact representation

00:52:47

we have some sort of mean vector, so we learn some mean distributions for the class; each

00:52:52

mean vector roughly characterises the shape of the feature space itself;

00:52:56

we have a covariance matrix, which characterises the variability within the feature space itself;

00:53:02

and we also have the weighting, which is essentially just saying how important each mixture is: we learn different clusters of different sizes,

00:53:09

small ones aren't super important, and the weight reflects the amount of data essentially covered by each mixture

00:53:14

in this sort of clustering. So we have

00:53:17

one Gaussian, two Gaussians, three Gaussians, however many come up

00:53:21

in the model itself; we fit one model for the positive class and one for the negative class,

00:53:26

and so on, and essentially we just use the likelihood function and use Bayes' rule

00:53:30

to actually start classifying. Training is generally done with the expectation-maximi-

00:53:34

sation algorithm, or by doing maximum a posteriori updates as well

00:53:39

so what we really want to do is maximise the GMM likelihood function

00:53:43

on the actual data set itself. We generally don't maximise the

00:53:47

likelihood function directly, we maximise the log-likelihood function,

00:53:51

and this is essentially a computational thing: it's a way to avoid underflow from

00:53:56

working with very, very small numbers, which is otherwise very easy to

00:53:59

run into, and it also makes our estimations easier: we go

00:54:02

from multiplications to summations, which just makes it a bit more computationally efficient, and that's generally why we use log-likelihoods
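A tiny, self-contained illustration of the numerical point being made here:

```python
# Hypothetical sketch: a product of many small per-frame likelihoods underflows
# to 0.0 in floating point, while the equivalent sum of logs stays usable.
import math

likelihoods = [1e-5] * 100          # e.g. 100 frames, each with likelihood 1e-5

product = 1.0
for p in likelihoods:
    product *= p
print(product)                      # 0.0 -- underflow

log_likelihood = sum(math.log(p) for p in likelihoods)
print(log_likelihood)               # about -1151.3, perfectly representable
```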

00:54:08

so here is a very brief view of what's happening within the EM

00:54:12

algorithm: it's essentially the same as what's happening really

00:54:16

within k-means. We start with some initial estimate of what the parameters

00:54:19

might be; we could just estimate them directly, randomly assign them,

00:54:23

or seed them with k-means first as the initial choice, and now we're

00:54:28

going to try to fit more finely tuned Gaussians on top of these.

00:54:31

We then compute the likelihood that each component produced each particular data point; this is essentially the expectation

00:54:36

step, where we compute weights based on this likelihood that each component produced the point;

00:54:41

we use these weights together with the data in the maximisation step,

00:54:46

where we try to improve the likelihood estimates time and time again,

00:54:49

and we essentially do these steps over and over until we reach some form of

00:54:54

convergence and say, okay, nice, stop, you've done enough steps, now let's look at classifying with what we actually have
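In practice the EM loop is usually left to a library; here is a minimal sketch assuming scikit-learn and a feature matrix X for one class, with the component count and covariance type as placeholder choices.

```python
# Hypothetical sketch: fitting a GMM with EM; scikit-learn runs the
# expectation and maximisation steps internally until convergence.
from sklearn.mixture import GaussianMixture

# X: feature matrix for one class -- assumed to exist.
gmm = GaussianMixture(n_components=3,          # number of Gaussians (a prior choice)
                      covariance_type="diag",  # diagonal covariances, common for speech features
                      max_iter=200,
                      init_params="kmeans")    # seed the means with k-means, as described above
gmm.fit(X)

print(gmm.weights_)               # mixture weights
print(gmm.means_)                 # one mean vector per component
print(gmm.score(X))               # average log-likelihood per sample
```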

00:55:00

so it's this iterative method: we start with two rough estimates of what we

00:55:03

think might be happening, and through the

00:55:07

successive EM iterations we end up getting a final model

00:55:10

roughly reflecting that we had two classes, A and B, here:

00:55:14

one has a slightly wider distribution, and we can form one with a

00:55:18

narrower distribution here, so that's the sort of final model itself

00:55:23

classification is always done by Bayes' rule. So we have some

00:55:27

set of observations in the training algorithm, the training set, right,

00:55:32

and our test data; then we have the class models, so we form a

00:55:35

different Gaussian mixture model essentially for each class of interest,

00:55:38

and then classification involves identifying the model that returns the

00:55:42

maximum conditional probability among the different classes, so

00:55:45

essentially saying, okay, given the test data, which Gaussian mixture returns the maximum likelihood here?

00:55:51

This can actually be potentially quite difficult to compute directly, so we flip it

00:55:55

around using Bayes' rule: essentially instead of going from class

00:55:59

to Gaussian, we go from Gaussian to class, and we also have the prior distribution of each class itself,

00:56:04

and classification is essentially just finding which Gaussian mixture returns the most likely result,

00:56:10

together with the sort of extra prior term that we have here
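Putting that together, a hypothetical sketch of training one GMM per class and classifying by maximum log-likelihood plus log prior, assuming scikit-learn and placeholder arrays X_train, y_train and a test vector x:

```python
# Hypothetical sketch: one GMM per class, classification by maximum
# log-likelihood plus log prior (Bayes' rule with the evidence term dropped).
import numpy as np
from sklearn.mixture import GaussianMixture

def train_class_gmms(X_train, y_train, n_components=3):
    models, priors = {}, {}
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        models[c] = GaussianMixture(n_components=n_components,
                                    covariance_type="diag").fit(Xc)
        priors[c] = len(Xc) / len(X_train)      # class prior from the training data
    return models, priors

def classify(x, models, priors):
    # score_samples returns the log-likelihood of each sample under the GMM
    scores = {c: models[c].score_samples(x.reshape(1, -1))[0] + np.log(priors[c])
              for c in models}
    return max(scores, key=scores.get)
```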

00:56:15

advantages: it comes down to just a compact representation that can model probability

00:56:19

density functions to quite a good, or the required, level of

00:56:21

accuracy; it's essentially quite easy to fit using the EM or the MAP algorithm, which

00:56:26

I didn't cover; the MAP algorithm is more of an iterative update for individual classes,

00:56:31

and it's used, for example, when forming i-vectors and things like that. Disadvantages:

00:56:37

it's inflexible, so if we choose a really inappropriate model to begin with it's not

00:56:41

going to give us the best results, so we need quite

00:56:44

strong priors, some sort of estimate of what we think the number of Gaussians and the distributions might be.

00:56:50

Your data might be really skewed or quite complicated itself,

00:56:54

and for distributions that don't really have a Gaussian shape it's not really

00:56:59

the best particular model that we can actually work with

00:57:03

so the last model to briefly mention is hidden Markov models.

00:57:08

This is essentially now almost an extension onto the Gaussian mixture

00:57:12

models, saying we want to map a sequence of observations onto a sequence of

00:57:17

labels, and we're essentially saying the states are hidden and we can only look at probabilistic observations of them.

00:57:23

We've got a lot of different parameters we can tie in with the Gaussian mixtures within the hidden Markov models

00:57:27

so the way I try to introduce hidden Markov models is: say I was sat

00:57:33

here in the lecture theatre, and what we want to do is classify what the weather

00:57:36

is outside. We can't see the weather directly, so we've got to take observations of what

00:57:41

might be happening. The way you might do that is to look at

00:57:45

how people come in: if they're just in a shirt, or just in shorts, it's sunny, it's warm; if they're in raincoats, we can

00:57:51

infer that it might be raining. That is essentially what's happening, and it's a nice way to think about hidden Markov models.

00:57:57

The real advantage is modelling sequential data; the real disadvantage is probably the large number of parameters

00:58:03

that come up with hidden Markov models, and we are assuming that we have enough data to actually

00:58:08

model and learn the correct parameters.
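Purely as a sketch of what fitting such a model can look like in code, assuming the hmmlearn package and per-recording feature sequences (X, lengths and X_test are placeholders):

```python
# Hypothetical sketch: a Gaussian-emission HMM for one class, scored on a
# test sequence; the hmmlearn package is assumed to be installed.
import numpy as np
from hmmlearn import hmm

# X: stacked feature frames from all training recordings of one class,
# lengths: the number of frames in each recording -- both assumed to exist.
model = hmm.GaussianHMM(n_components=4, covariance_type="diag", n_iter=100)
model.fit(X, lengths)

# Score a test sequence: log-likelihood of the observation sequence under
# this class's HMM; train one model per class and pick the highest scorer.
log_likelihood = model.score(X_test)
print(log_likelihood)
```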

00:58:12

So that brings an end to the lecture, but it's not quite

00:58:16

class dismissed just yet, because I've got a little quiz online that I'd like everyone to do.

00:58:21

So if you could all get your phones out and answer this little quiz, we'll just go over some

00:58:26

of the main points and re-emphasise some of the main things we actually talked about this morning

