Note: this content has been automatically generated.
00:00:04
Okay, I'll get started with the second part of this morning's lecture. This part focuses on the machine learning aspects: putting the knowledge, the intelligence, into the computational intelligence chain, so that it can learn something from data to perform a particular task. I'll cover how we set this up as a problem, the key aspects around generalisation, and then the different models of machine learning, their advantages and disadvantages, with a brief introduction to roughly how they work and why we might need each of them at particular points in time. So, what is machine learning? There are lots of different ways we can define it, but the definition we'll use here is: a statistical learning technique to automatically identify patterns in data, and to identify these patterns in such a way that we get better and better at doing a particular task.
00:00:58
We start machine learning by defining some domain of interest, and this is essentially the feature space. We have some feature space, and this feature space has some marginal probability distribution, the distribution of X, the wider distribution of the actual features, with the collected data spanning the feature space of the domain. Then we need to define some generic analysis task: essentially a mapping from the feature space to some particular label space. We denote the label space in these notes as Y, and for our task the labels are some sort of speech pathology label. So we always have corresponding labels, and essentially what we want to do is learn a mapping between this feature space and the label space. We want to learn it in such a way that, when we get a new piece of data (beyond the data we used to actually learn the mapping), we are accurately able to predict the label that goes with it.
00:02:06
What I mean by new data is data not seen by the machine learning algorithm while we were optimising its performance and parameters, while we were learning from the data we generally use. We can set up machine learning algorithms in one of two particular ways. The way we most commonly use in speech pathology detection is of course supervised learning: this is learning a model from labelled data, so we have features and we learn models matched to the actual labels themselves. The second setup is unsupervised learning. This is where we might want to discover labels from the models themselves: we could have loads and loads of different speech samples collected with no idea of what is going on in them, and we might want to cluster them together and learn something from them. So we derive the features, we then do unsupervised learning of some sort of model, and from there we might be able to infer labels, or the particular labels that might be of interest, and discover relationships we might not have thought existed. Today, though, the focus is really on supervised learning; we will touch a little bit on techniques that work for unsupervised learning, but on the whole I'll talk about the supervised mapping, towards learning a particular function to go from features to the actual label space itself.
00:03:15
This is the processing chain, and this is true of any supervised learning that we do in any machine learning task. We always have some input data, in our case an audio file. Then comes feature extraction, which is what we talked about this morning: we get our speaker feature representations. Essentially we then want the algorithm to find a mapping to some sort of output. This could be a probability, a certainty: do they have a pathological condition or not, and ideally how sure we are that they might have this condition. Or it could just be a one-to-one mapping: yes, they have it; no, they don't. This is what the machine learning algorithm itself does: it learns the rules to predict the labels from the actual features themselves.
00:03:59
We normally do this by setting up some sort of cost function. It could be based on probability distributions, it could be formed in a variety of different ways; essentially we just want to optimise this cost function using maximisation or minimisation, so essentially just using some form of calculus. We want to find a local minimum or local maximum, so we set the derivative to zero and simply solve the equation. At a very, very broad level, that is roughly what is happening inside machine learning algorithms: we are finding a way to learn an optimal decision boundary, and the optimal decision boundary is normally the one that pushes the separation between classes apart as far as it can, because that gives us the best chance of labelling new data instances correctly as well. This is very simple to do when we have very clean, separable data; it is very difficult to do, and we always have trade-offs to make in the performance of the algorithm, when we don't actually have separable data and have to make some assumptions. Then we trade off performance parameters within the machine learning algorithm itself.
00:04:57
Machine learning algorithms are really the brains of the system: the building of the mathematical model itself. We roughly group them into two main classes. Discriminative models are essentially where we try to learn the mapping directly: we have training data and we want the mapping to separate the classes of interest, or do regression, within the training data itself. We're not really concerned with any wider probability distributions that might be present. Support vector machines and deep learning models sit here, and a lot of different approaches fall into this group. Then we have generative models. These essentially try to derive the joint probability function between the features and the labels and then do detection by rearranging things using Bayes' rule, which lets us turn it into a classification task. You'll cover hidden Markov models, which are an example of a generative model, and likewise Gaussian mixture models and k-means clustering; we'll talk about some different forms of generative models. Generative models are also used a lot within unsupervised learning, because there we're trying to model the wider probability distributions that might be present in the dataset itself.
00:06:05
Before I really talk about the specifics of algorithms, their advantages and their disadvantages, I'll talk about generalisation, because generalisation is what really separates machine learning from something like a pure optimisation task. It's very easy to say we have a cost function, we want to minimise it or maximise it, we want to do something with it; that is exceptionally easy. What really gives machine learning its brains, its power, is the idea that we want to learn something from the data, and that is this idea of generalisation.
00:06:31
So yes, we do minimise or maximise our cost function, and this is done using various different optimisation techniques. Generally we might set the derivative, the gradient of the cost function, to zero and solve for the model parameters themselves. That is very easy for me to say, but it isn't always possible: not all solutions have a closed form, and even when they do, it might be very computationally expensive, so we have to use iterative methods to get this complex optimisation done. One of the ways we can do this is something like gradient descent, which is used a lot in machine learning, and in neural networks especially. There we use an iterative solution that follows the negative gradient of the function, working our way towards the global minimum of a particular cost function, taking step after step downwards. But we want to do this not so that we simply hit the minimum of a particular training set; we want to learn the wider minimum that might be out there. We want to make sure we're not just saying: I've got one hundred percent accuracy on the data that I've collected, look everybody, I can do this one hundred percent accurately, this is the best speech pathology detection system ever, I'm going to give it to everybody. They put it in the lab and find it is the worst machine learning system ever; it has learned nothing, because if we overfit to the training data we are going to get a really, really bad system. I'll talk a bit more about this in a moment, and about the different errors we've got to look out for.
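As a rough sketch of what gradient descent does (this example is illustrative, not from the lecture; the cost function, learning rate and stopping rule are made up):

    # Minimal gradient-descent sketch: minimise J(w) = (w - 3)^2.
    # Cost function, learning rate and stopping rule are illustrative only.
    def cost(w):
        return (w - 3.0) ** 2

    def gradient(w):
        return 2.0 * (w - 3.0)

    w = 0.0                        # initial guess
    learning_rate = 0.1
    for step in range(100):
        g = gradient(w)
        if abs(g) < 1e-6:          # close enough to a stationary point
            break
        w = w - learning_rate * g  # step against the gradient
    print(w, cost(w))              # w ends up near 3, the minimum of J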
00:08:03
So in machine learning we don't want to just minimise the training error, the mistakes made on the training data; we also want to minimise what's known as the generalisation error. In the training phase we are trying to learn something from the data, to optimise the parameters of our algorithm from the data itself, and at the same time we are always minimising the training error. This is the difference between the actual and the predicted labels within the training set, and we always want to get it as small as we possibly can. At the same time, of course, this introduces a number of different errors, mainly relating to sampling error. Whatever training data we might have, we only have a very small selection of the wider distribution of the actual task we're interested in. We may have collected a couple of hundred samples, but there are thousands, millions of people with the particular disorder of interest, so there is no way we have covered that distribution. If we over-optimise to this very small sample we are not going to get a very good machine learning model, and this is what we need to avoid.
00:09:02
This is the idea of generalisability: the model must accurately and adequately label new test data samples. In the training phase we minimise the training error, and we always test, in the test phase, to measure the generalisation error. So when we use our labelled database for training and optimisation, we are trying to minimise the generalisation error beyond the training data; this is why we train and why we test in machine learning, and we will always have these two steps. The training phase always looks like this: we have our labelled data; we do our preprocessing, removing outliers, perhaps doing some sort of denoising within the data, perhaps some other functions here; we do our feature extraction, running openSMILE or whatever toolkit we're using; and then we do our machine learning. We actually train the model and we optimise the model itself, its set of parameters and its set of hyperparameters, the external things that we can tune about the operation of the model. We do this in the training phase, and in supervised learning we are always doing it with respect to the labels from the labelled training data.
00:10:02
Then we say: great, I've got this model, excellent; how well does it work? That is what we do in the testing phase, and why we need it: we want to measure this generalisation error. So we get a new data sample. It isn't really new; we have just held out some of the labelled data we actually collected. We do the same preprocessing (if we did denoising there, we try to get it into the same conditions), we do the same feature extraction, and then it just passes through our learned model and we get the predicted label. We get a set of predicted labels, and measuring these in the test phase is what really gives us our idea of performance; that's how we know whether we have a good model or not. If we just optimise in the training phase and never really check in the test phase, we never get a good idea of whether we have a good model or a bad model, so we have to spend a lot of time testing and doing this in machine learning. Machine learning research can be incredibly boring, because we are just doing this: training, testing, training, testing, change a couple of parameters, change a couple of features, training, testing. But it is very, very important to do, to make sure you get the best and most generalisable model that you possibly can.
00:11:10
The reason we do this is that we are looking for particular kinds of errors, and we are trying to get rid of them, or essentially not to have them at all. First there is the error of underfitting, which is essentially when we build a model that's too simple: we just make some assumptions about the distribution of the two classes and draw a line down the middle. It is very simple, it is not a complex model, and it is going to make a lot of mistakes; it has what's known as high bias, so it generally makes a lot of similar mistakes. I'll talk about bias and this sort of sensitivity in a little bit. At the same time, what we also want to avoid is overfitting: making a model too complex. We optimise it so hard that it makes absolutely no mistakes on the actual training data, and we learn a convoluted decision function that in no way reflects anything about the real world we might be looking at in the data; when we start to test it, we get pretty much the same sort of errors, and it might just be performing at chance level or even beneath chance level. So we have underfitting and overfitting, and a good fit is the case where we can't really find a neatly separable model and we are just trading off where we want particular errors to be made, trying to make the most accurate system we can. We might tune the system to make fewer false negatives and more false positives, and we do this with different cost functions and different hyperparameters, so we can always find the most robust model we can in terms of how it works on a particular subset of held-out test data, increasing the generalisation, increasing the robustness of the model. This is really, really important to do in machine learning.
00:12:43
It's based essentially on the concepts of bias and variance errors. Bias says, on average, how much our predicted values differ from the actual values, so there's some common way we are making errors: if we make a very, very simple model, we find that the model just makes the same error time and time and time again. At the same time we might have errors where we get very different predictions: we might have two test instances very close to each other that give us wildly different estimates out of the machine learning model, and that is the idea of high variance. So we're really looking for something with both low variance and low bias within the actual machine learning model, but of course that isn't fully achievable; there are ways we can work towards it, but generally we always have to trade off bias errors against variance errors when we're minimising the generalisation error. Essentially what happens is that we start with a very, very simple model, say a decision tree only making two or three decisions; it's going to have very high bias error. We start to increase the complexity of this model, we work with more parameters, more features, and the bias errors drop. We move up to some really cool, big deep learning system, but we don't really have that much training data, and we increase the variance errors instead. So as the model becomes more and more complex the bias errors drop, we get rid of those, but at the same time we increase the variance errors. Essentially what we're doing with continual training and testing of algorithms is looking for this trade-off point, the optimal model complexity for the particular task and the particular algorithm we have.
00:14:17
There is no easy answer here; the no-free-lunch theorem essentially says that no machine learning model is inherently better than any other machine learning model, and we've just got to find the most suitable one for the data we have. If we've got a very small dataset, there's no point in trying to do really cool deep learning with it, because we're going to overfit: we'll probably end up with a model that can't account for the variance in the data, because we simply can't train all of it, and it will overfit very quickly. So it's not about asking which is the best algorithm; it's about asking which is the most suitable algorithm for the actual job we have, for the amount of training data we have, for the feature representation we want to use, and so on. There are lots of things we can choose and change within machine learning, and this is the real reason we do it: we want to trade off model complexity against the different errors the actual system can make.
00:15:02
So in paralinguistics the default way we do this now is not just training and testing; we often use a validation set as well, splitting the data essentially three ways. If we're just doing training and testing, we can still end up with a model that will not generalise well: we could just be overfitting the training data and the test data at the same time if we don't have enough data. This is why we introduce the idea of a validation set. What we do is train our model on maybe sixty or seventy percent of the data: we have a look at things like feature sets, normalisations, different hyperparameters, and we try to optimise the model, to learn the best representation we can from it. Then we use maybe twenty percent of the data to do validation, where we try to minimise this generalisation error: we do our training, we do our testing on the validation data, and we go: okay, this worked, this didn't work, this worked, this didn't work, and so on and so on, until eventually we say: okay, I've got this data here, I've got my feature space here, I've got my machine learning algorithm here, I've got my settings for it, and I think this is the best one possible; this is really cool, this is going to make a difference.
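As a rough sketch of such a split (not from the lecture; it assumes scikit-learn and uses toy data in place of extracted speech features):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # Toy data standing in for extracted speech features (illustrative only).
    X, y = make_classification(n_samples=200, n_features=20, random_state=0)

    # Hold out 20% as the final test set ...
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=0)
    # ... then split the remainder 75/25, i.e. roughly 60% train / 20% validation overall.
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)
    print(len(X_train), len(X_val), len(X_test))   # 120 40 40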
00:16:09
Then, to really confirm that we've got something generalisable, we use the test set itself as the evaluation data. This is saying: here is brand-new data, nothing here was seen while we were doing the optimisation, so now we can find out what the true algorithm performance is using the held-out test data. We generally combine the training set and the validation set, retrain the model using the set of features and optimisation settings that we found, test it again, and find out whether we've got a generalisable model or not. If we've done it this way, it shows up quite efficiently: you can see very quickly that apparently very good models give very bad accuracies; you do training and validation, you move to the test set, you get a big drop in performance, and you know this is probably not the best way to do it. If your performance holds steady, you think: I've probably got a good model. Occasionally your performance even increases, because you've actually improved the model by combining the training and validation data, training on a little bit more, and then you think: right, I've found the best thing we can possibly do.
00:17:13
But it's very, very important, when you're reviewing papers and such, that this is one of the things you look for, whether people really understand what is going on, especially in speech pathology, especially in computational paralinguistics. A proper train/validation/test setup is what I prefer to see; if a paper I'm looking at has just done training and validation, there's really no way to know how generalisable the model is. So this is generally the best and most rigorous way, across different fields, of doing it now. In deep learning, especially when we have a lot of data, we can sort of get away with doing training and validation only, because the validation set still has quite a fair chunk of data, but it's still not really the optimal way to do it.
00:17:49
Often in speech pathology, though, we don't even have enough data to really divide it into sixty/twenty/twenty; we might try other splits, and it still might not be enough to even train basic models. If this is the case we often use cross-validation. This essentially says: we have some sample space, some set of data; I'm going to use maybe ninety percent of my data to actually train the algorithm and ten percent of my data to test the algorithm, and then we just rotate this and repeat it k times, ten times if we're doing ten-fold cross-validation. Then we take the average of all of these performances, and that gives us a rough idea of what the generalisability of the model might be, how well it's actually performing.
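A minimal sketch of ten-fold cross-validation (not from the lecture; it assumes scikit-learn and toy data in place of real speech features):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # Toy data standing in for extracted speech features (illustrative only).
    X, y = make_classification(n_samples=150, n_features=20, random_state=0)

    # Ten folds: train on ~90%, test on ~10%, rotate, then average the scores.
    scores = cross_val_score(SVC(kernel="linear"), X, y, cv=10)
    print(scores.mean(), scores.std())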
00:18:31
Sometimes we can do cross-validation on top of a train/validation setup; we can combine these methods to really try to find the most optimal parameters possible. But really, a held-out test set that we have learned nothing from is still the best way to actually evaluate a model and to get the most suitable representation of any particular model. It's really important when you're doing machine learning, this idea of generalisability: not overfitting, not underfitting, choosing the right model for the particular task. That's where you always spend the time in machine learning research: waiting, testing as you train sets of parameters, and just waiting for a particularly good result to come out.
00:19:09
That's all well and good, but what do we actually do, how do we do this? There are hundreds of different machine learning algorithms, so I'm just pulling out some of the more common ones that come up in speech pathology and within computational paralinguistics. The first class of algorithms I'll talk about is discriminative models. Essentially we're learning some hard or soft decision boundary between the classes of interest. We assume that there is some probability distribution, a joint probability distribution, but we're really just trying to estimate the parameters of the decision function directly from the training data; we're not trying to learn the wider probability distribution of the training set, of the features and of the label space. The advantage here is that we are directly learning a decision function, which is pretty good when we have only a very small number of data points to train with: we're not trying to estimate some wider probability distribution and then make a decision based on it, we're asking, okay, what do I need to know to make the decision at hand? So it's probably one of the better options, and it's used quite widely here. The examples of discriminative models I'll talk through are random forests, k-nearest neighbours, support vector machines and neural networks. You'll have deep learning in much more detail, I think tomorrow or the next day, so I'm not going to go into much more depth on that beyond the convolutional neural networks this morning; you'll go over deep learning a bit more then.
00:20:37
The first one I want to talk about is the random forest. A random forest is essentially an ensemble of tree-based classifiers, built from something like the decision tree. Remember in my very first couple of slides this morning I put up a decision tree for choosing whether your pizza is going to arrive on time or not. A random forest is essentially a group of such trees, each trained on a slightly different view of the data or trained using slightly different parameter settings, so each one looks at the data in a slightly different way. We then essentially sum, or take an average of, what we've learned from the various decision trees to make a final decision. The final decision is a voted value, or some sort of average if we're producing a regression output. So the trees are looking at the data from different views, depending on how we set it up.
00:21:26
At the heart of any random forest is the decision tree classifier, a non-parametric supervised machine learning algorithm. Essentially it learns the target by simply making a set of decisions, and in making that set of decisions it just subdivides the feature space down and down and down until we're actually able to make a classification. It's probably the most interpretable of all the classifiers I'll talk about today; it's known as an explainable one. I haven't put too many slides in on explainability, but I'll talk a little bit about it here and with some of the others. Essentially, explainability is one of the things we really need in speech pathology now: thanks to the GDPR rules, patients now have a right to understand how a decision was reached about them if you want to put machine learning into any clinical practice. With most of the machine learning methods we use, that is very hard to do, so it's a big open field of research that we need to get more into. But decision trees are very easy to explain: we're really just subdividing the feature space, making decisions to maximise the information gain at each point. Which decision will give us the most we can learn from the data at each point in time? We keep breaking this down, down, down. To traverse a tree we essentially start at the top root node and then just go down, down, down, and at every node we visit we make a decision about the data; we split it into further and further decisions until we eventually reach a final decision within the tree.
00:22:53
We'll talk a little bit more and give a more worked example later this afternoon, but here is a very simple illustration of how decision trees work. We have a group of animals that we want to classify, and we have certain aspects and features of the animals that we can use. Essentially we just ask about things like the number of legs, two or four, so essentially a human versus a four-legged animal; we can break them down and maximise the information gain. It's just a rough, crude idea of how decision trees work. The key properties: our class labels are associated with the very bottom of the tree, the leaf nodes; each internal node represents some sort of decision rule, so the classification boils down to making decisions; at every single node we have a set of patterns associated with it; and only the relevant features are used along the way, so it does feature selection automatically for us. It's relatively simple and very easy to understand. Essentially we build it in a recursive style where we just try to maximise the information gain at each decision: which decision returns the maximum amount of information we need to make the next one, and the next one? We break this down until we've reached the class labels we particularly want. So the steps are: we calculate the entropy of the parent node, we identify the individual splits that we might make and the entropy that goes with each of them, and then we always choose the split that gives us the best information gain, the one that, of all the splits we've worked out with this bit of information theory, is the best possible split. Then we carry on and on until we actually get class labels at the bottom.
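A minimal sketch of those steps (not from the lecture; the split and the labels are invented for illustration):

    import numpy as np

    def entropy(labels):
        """Shannon entropy (in bits) of an array of class labels."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(parent, left, right):
        """Entropy of the parent node minus the weighted entropy of the two children."""
        n = len(parent)
        weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        return entropy(parent) - weighted

    # A perfectly pure split of four labelled samples gives the full 1 bit of gain.
    parent = ["pathological", "pathological", "healthy", "healthy"]
    print(information_gain(parent, ["pathological", "pathological"], ["healthy", "healthy"]))  # 1.0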
00:24:25
Working through these very briefly: the advantages are that they're quick and easy to interpret, computationally they're very simple and quick to fit, and they're very useful in data exploration, because they discover the significant variables, what we really need to make a decision, and identify relationships between them. They're pretty robust, they're suitable for small data, they're not easily influenced by outliers because we're always maximising the information gain, they can handle missing values relatively well, and they can handle nonlinear and interaction relationships. But for all these advantages, being a very simple algorithm, they have their downsides. They're unstable: a very small change in the data, or changing our feature space even just a little, can change the tree structure quite dramatically. They generally do poorly on more complex problems and have poor accuracy; a decision tree is what's known as a weak classifier, a classifier which, no matter how much we train it on a particular task, generally only performs a bit above chance level. What we can do, though, is take weak classifiers and, to deal with overfitting, use bagging and ensemble methods, which is exactly what a random forest is: a bagging method over these decision trees. It helps us reduce the variance and increase the robustness of the actual model, because we're combining multiple classifiers modelled on different subsets of the data and reducing the variance in the predictions by doing so.
00:25:54
From the original data we might split off multiple datasets and train multiple different classifiers; in something like Weka I think the default is maybe five hundred decision trees in a random forest, so we're actually making quite a lot of different decisions, and then we come up with one final decision. We do this with bootstrapping and by combining results via a vote in classification; essentially, bootstrapping just means we take a subset of the data with replacement, and we do this multiple times. Bagging in general, for any kind of classifier, is always trying to reduce variance while at the same time having a limited effect on the bias of the actual model. So it's a good method if you have a set of weak predictors to start the classification with, and once you've trained the algorithm it's actually quite quick. We grow a large number of trees, we need independent bootstraps, and at each node we make a split; sometimes we deliberately vary things a bit, so we're generally not too concerned about the accuracy of the individual trees so much as the accuracy of the whole forest of trees. We do a vote and we get our final output.
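A minimal random-forest sketch (not from the lecture; it assumes scikit-learn rather than Weka, uses toy data in place of speech features, and sets 500 trees as mentioned above):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Toy data standing in for extracted speech features (illustrative only).
    X, y = make_classification(n_samples=300, n_features=30, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # 500 bootstrapped trees, each grown on a resampled view of the data;
    # the forest's prediction is the vote over the individual trees.
    forest = RandomForestClassifier(n_estimators=500, bootstrap=True, random_state=0)
    forest.fit(X_train, y_train)
    print(forest.score(X_test, y_test))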
00:27:00
So for random forests the big advantages are good accuracy and fast run time; they work well with little data, they can handle high dimensionalities, and they inherently perform feature selection through the decision trees built inside them, improving on the individual decision trees. At the same time they're a little more susceptible to overfitting, particularly in noisy environments. And again we run into the issue that, for most of the algorithms I'll talk about, we have this black-box behaviour: we have no real understanding of exactly how a decision was made. Yes, we understand how we set the algorithm up, the background of how it works, but in terms of saying which features of the data combined to indicate that this person has a particular speech pathology, it's very hard to trace these things back through black-box operations. As I said, the field needs to start shifting more towards explainable solutions, and also towards finding ways to make these black-box systems themselves explainable.
00:27:58
The simplest approach, in terms of discriminative models, is the k-nearest-neighbours classifier. This is pretty easy: we're not really learning anything from the data; we do the classification just by looking at properties within the training set, evaluating distances as the decision process. Essentially we put in our test sample and ask: what's my nearest neighbour? Okay, that's the class I'm going to assign. And that's pretty much it; we don't really have any training here, which is nice, but it's a very slow method, especially if you have a very large dataset, because you've got to compute these distance measures against the whole of your training data to classify anything. Using something like the Euclidean or L2 norm we might run the nearest-neighbour algorithm, and then, if we find that it's not actually that good, we ask: what about different variants, what about different distance measures, before assigning the test point to a class.
00:28:56
But again, this generally overfits to the training data, so what we want to do is use k nearest neighbours: instead of asking what's my single nearest neighbour, we ask what's the class of my two nearest neighbours, of my three nearest neighbours, and so on. Essentially this is the hyperparameter we choose when we're doing a nearest-neighbours classifier: how many nearest neighbours do we use? And again there's the trade-off between overfitting and underfitting; we've got to find the appropriate complexity. As we take more and more nearest neighbours the decision function changes, and the chances of overfitting or of underfitting go up and down with it. So essentially we're just using distance metrics to find the k nearest neighbours and assigning the class, and we've always got to choose this k; experimentally, we want to find some sort of validation error and minimise it. We find that as k goes up too high we start to increase the variance and increase the validation errors that we get.
00:29:53
In summary, k-nearest neighbours works very well on simple, basic recognition problems, and it is reasonably robust to noise: we can take a weighting into account, so if there are bigger distances to outlying things we give them less of a weight; if a neighbour is particularly far away we don't trust it too much, and we can tune the algorithm to allow for a bit of noise. It's a lazy learner: we're not really learning anything when we do k-nearest neighbours, we're really just making the decision itself, and because of this it has a very high computational cost. For every single prediction we want to make with k-nearest neighbours we essentially compute the distance to, and sort, every single instance within the training data; we've got to work out every single distance and find the minimum set every single time. So we don't actually have a compact model; we're running this whole operation time and time again, and that's a particular disadvantage where speed is important for the decisions we're making.
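A minimal k-nearest-neighbours sketch (not from the lecture; the tiny dataset is invented, and it shows why every prediction touches every training point):

    import numpy as np

    def knn_predict(X_train, y_train, x, k=3):
        """Classify one sample by majority vote over its k nearest (Euclidean) neighbours."""
        distances = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
        nearest = np.argsort(distances)[:k]               # indices of the k closest
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]

    # Tiny illustrative training set with two clusters.
    X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
    y_train = np.array([0, 0, 1, 1])
    print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))   # -> 0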
00:30:51
The most frequently used classifier, I guess, in speech pathology and computational paralinguistics is the support vector machine, and that's because they're just good at making optimal decisions, they're quite easy to train, and they work very well with low amounts of data. How does a support vector machine work? Essentially we're always building a binary classifier, assigning something into class plus one or class minus one, a set of two classes. We're essentially learning a decision boundary in such a way that we can express the decision boundary through the set of training samples nearest to it, known as the support vectors. Geometrically these are the training patterns that sit absolutely closest to the boundary, and a weighted sum of these training patterns sitting on the margin gives us a way to actually define the classifier. It's based very heavily on classic optimisation, using Lagrange multipliers and quadratic optimisation to learn the decision function, which I'll go into briefly.
00:31:51
So we're learning a class-one versus class-two separation: we want to find the hyperplane decision boundary between them, and we want to maximise the distance between our two classes. Then we can use the support vectors, the training instances that sit closest to the decision boundary, to actually form the classifier. It's a very compact form of classification, because we can throw the rest of the training instances away and keep a very small model at the end compared to the whole training space; compared to something like nearest neighbours, where we had to sum over all points, here we reduce the summation to a small number of points. It is quite maths heavy, the classical mathematical optimisation with Lagrange multipliers, and I'm not going to go too deep into it because it gets a bit boring.
00:32:38
So for support vector machines the main advantage is good generalisation: the algorithm is always set up to yield the maximum possible distance between the hyperplane and the classes, which reduces the chances of things like overfitting, and it is really good when we have small amounts of data. It's computationally efficient, expressing the margin as a function of the support vectors, the points nearest the hyperplane. The real advantage of support vector machines comes with the trick of the kernel operations: we can learn nonlinear decision boundaries using essentially the same algorithm, the same setup, just by applying what's known as the kernel trick, implicitly mapping into high-dimensional spaces. We'll talk about this more in a couple of slides; we do these implicit mappings in the form of the dot products that appear in the support vector operations, and that allows us to do nonlinear classification as well.
00:33:31
At its core, when we set up a support vector machine classifier, it's really assigning a label: we've got a mapping function from the feature space to the label space, and essentially we want to learn whether a sample is positive or negative and assign it to class one or class two. It doesn't really matter what we call class one and class two in these instances, but it's always a binary classifier. With support vector machines for multi-class problems you're always performing one-versus-all classification, or you build a set of binary classifiers for the multi-class task, but at heart it's always binary. Essentially we've got a classification function where we have a label, we have a weight, and we have a kernel operator that is doing either an implicit mapping into a high-dimensional space or the plain linear separation, depending on how we set it up, and what we want to do is learn the set of weights. The weights are non-zero essentially only for the support vectors, the points closest to the boundary.
00:34:29
So we have a classification function, and we set it up quite simply: we have a decision boundary, we have some sort of bias for the decision boundary, and we want to learn this. We can always say that the weight vector sits perpendicular to the hyperplane, to the decision function. We can say that points where the function is greater than zero sit in class one, and then we have points that actually sit on the margin boundary itself, and we can find an optimal distance for these. Switching to class two, everything less than zero gets assigned to class two, the minus-one instances, and again we can set up the same thing for the points that sit on that side of the margin. Given some optimal margin, we always want to maximise it, and it works out as two over the L2 norm of the weight vector, which is where the plus one and minus one come in; I've done a bit of rearranging in these equations, but it does all check out, and we want to learn this function. We're able to do that quite easily: we set it up as a minimisation problem and put the constraints into the minimisation itself, and then we use quadratic optimisation to solve it. We substitute in the Lagrange operators, we add a few more constraints to the problem, and that essentially allows us to find the optimal set of weights; the alphas come out of this change of variables as we move to the Lagrange multipliers.
00:35:54
Then we can express the decision function as a weighted sum: the alphas, which are what we actually learn in the quadratic optimisation, the labels, and the inner product between the two vectors, plus the bias. What turns out to be the case is that every single point with an alpha greater than zero must lie essentially on H1 or H2, and all other points have alphas equal to zero. So you get a very efficient summation when evaluating the decision function: this point, this point, this point and this point are all we need to know to make the support vector machine decision. From that information we have a very nice, compact model that lets us do the separation quite well; the classification model is expressed as a weighted sum of the actual support vectors, the points closest to the boundary.
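In standard notation (my summary, not copied from the slides): with weight vector w, bias b, labels y_i in {-1, +1} and Lagrange multipliers alpha_i, the margin being maximised and the resulting decision function are

    \max \frac{2}{\lVert \mathbf{w} \rVert}
    \quad \text{subject to} \quad y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 \;\; \forall i,

    f(\mathbf{x}) = \operatorname{sign}\!\Big( \sum_{i \in \mathrm{SV}} \alpha_i\, y_i\, (\mathbf{x}_i \cdot \mathbf{x}) + b \Big),
    \qquad \alpha_i > 0 \text{ only for the support vectors.}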
00:36:48
That was a very quick flight over the maths; at the heart of support vector machines there is a lot of it, and whole texts have gone through deriving all these different equations in various ways and directions. But, like everything else this morning, for the bigger theory go to the notes and look up some papers if you're really interested in how support vector machines work. The main thing to know is that it gives a really nice, compact classification model that we can sum up efficiently. We can also handle non-separable data in support vector machines, and this is done by putting more constraints into the optimisation itself, allowing the decision function to make mistakes and then weighting how badly it makes those mistakes. This is what we control when we set the complexity of the support vector machine: we introduce a sort of slack variable and permit the support vector machine to make some mistakes, saying, I don't mind you making this tiny little mistake, because you're giving me a more generalisable model. We tune this complexity to control the complexity of the actual model and to control the trade-off between underfitting and overfitting, between the bias and variance errors.
00:37:58
That is essentially what we're tuning: when you're tuning a support vector machine you're just looking for that middle point that I talked about, controlling the bias and variance errors. This is the C value itself. As C gets large we're essentially saying: I don't want you to make errors, I'm going to punish you harder and harder for making them, so the decision boundary gets more and more wavy, more and more complex. As C gets small, we're saying: okay, we're making more and more assumptions about the distribution of the data, which may not necessarily be true, and the model becomes less complex, heading down towards underfitting. As a rough rule of thumb we generally set C to less than one to avoid overfitting in a lot of these tasks; again, it's not always true, it's just a rough rule of thumb. The thing to remember with this complexity value is that as it goes up we get a harder and harder margin and move towards overfitting; as it goes down we get a softer and softer margin and move towards underfitting.
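Written out in the usual soft-margin form (standard notation, not taken from the slides), the slack variables xi_i and the complexity value C enter the objective as

    \min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;\; \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_i \xi_i
    \quad \text{subject to} \quad y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0,

so a large C punishes the slack heavily (a hard margin) and a small C tolerates more mistakes (a soft margin), matching the behaviour described above.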
00:38:52
As I also said, the real advantage, apart from this very compact model and the clear optimisation rules for making an optimal decision function, is that we can handle nonlinear data within support vector machines, and this is done by including what's known as a kernel function. If we look at a plain linear support vector machine, we find that we're taking a dot product between different vectors within the space when making a decision. We can do this in higher and higher dimensionalities by applying a nonlinear transformation, taking the dot product of these nonlinear transformations, and just pushing this information into the support vector machine when computing the final decision function. It's called the kernel trick because we never actually map, within the algorithm itself, into the high-dimensional space; we really do it in an implicit form within the dot product operation. Essentially what's happening is that if you look at things from one particular dimension (if you're looking at my two hands from your viewpoint, you can't really see the separation, but if you move into three dimensions you can), things that aren't separable in low dimensions can become separable as we go up and up in dimensions. That's what we're doing with the kernel trick: we're saying, I want to go into a high-dimensional space and check whether the data is now linearly separable, going into higher and higher dimensional spaces with different mappings. The basic idea is essentially that something that's non-separable in one dimension will eventually be separable in a higher dimension, hopefully at any rate.
00:40:25
That's what's happening when we do the kernel trick: a nonlinear mapping to higher-dimensional spaces. There are a lot of different rules and properties around how we define these and what exactly a kernel is, but again, for brevity I'll skip over this and just say that the main popular kernel functions you'll come across are: the linear kernel, which is just the dot product between the vectors; the polynomial kernel, a polynomial transformation raised to the order of the polynomial; and the Gaussian kernel, where we're looking at settings of the variance of the kernel and controlling different aspects to do with the power we raise the kernel to as well. So there are different trade-offs: every time you do support vector machines and want to use the kernel trick, it takes time, and every parameter, the complexity value and the different kernel parameters, plays off against the others, so you're going to spend time doing very good grid searches to really find the most optimal settings for the support vector machine, and then doing the train/validation/test or cross-validation testing. It can take a while, because there are quite a lot of knobs to turn within support vector machines.
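A minimal grid-search sketch over those settings (not from the lecture; it assumes scikit-learn and toy data, and the grid values are only examples):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Toy data standing in for extracted speech features (illustrative only).
    X, y = make_classification(n_samples=200, n_features=20, random_state=0)

    # Grid over kernel, complexity C and the kernel width gamma; each
    # combination is scored with 5-fold cross-validation and the best one kept.
    param_grid = {
        "kernel": ["linear", "poly", "rbf"],
        "C": [0.01, 0.1, 1.0, 10.0],
        "gamma": ["scale", 0.01, 0.1],
    }
    search = GridSearchCV(SVC(), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)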
00:41:32
The other form of discriminative model that you'll talk about over the course of this week is neural networks; neural networks in general are discriminative models. Very briefly, deep neural networks are just neural networks with at least four layers and with many, many parameters that have to be learned. Through this idea of multiple layers, multiple nonlinear transformations, they essentially give us the ability to learn very highly complex decision functions. But the catch, and one of the big issues we always have within speech pathology, is that we need a substantial amount of training data: the deeper the network goes, the more things there are to learn and the more training data we need. They work very, very well, provided you set the task up properly and know what your feature space is capable of, and what the amount of training data you have is actually capable of as well.
00:42:30
The last set of models I'll talk about is generative models. This is now the idea that we're going to start modelling the distribution: we're looking at the joint distribution between the features and the labels. We want to model this joint probability function, so we assume some functional form for the conditional probability, estimate its parameters from the training data, and then use Bayes' rule to calculate the likelihood of a label given a particular set of features. The real advantage of generative models is that actually learning this wider probability distribution should essentially help us prevent overfitting, and we can potentially update the models, revising them on the fly a little if something is a massive outlier or some sort of mistake remains; we can retrain the models and re-estimate the probability distributions as we get more and more data.
00:43:31
With generative models we always make the classification using Bayes' rule. It's just a little bit of rearrangement: we have a conditional probability, we can rewrite it in terms of the joint probability, and we can rewrite that with Bayes' rule. This says: given our features, what is the most likely label? In a generative classifier we ask, what's the likelihood we came from class i, what's the likelihood we came from class j, and we choose the maximum likelihood and return that. You'll probably go through this during the hidden Markov models session anyway.
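Written out (standard notation, my summary rather than the slide itself): for a feature vector x and classes omega_i, Bayes' rule gives

    P(\omega_i \mid \mathbf{x}) \;=\; \frac{p(\mathbf{x} \mid \omega_i)\, P(\omega_i)}{p(\mathbf{x})},
    \qquad
    \hat{\omega} \;=\; \arg\max_i \; p(\mathbf{x} \mid \omega_i)\, P(\omega_i),

and since p(x) is the same for every class it can be dropped when comparing them.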
00:44:04
to generate models are often based on the sort of concept the clustering so
00:44:07
sort of grouping feature vectors together into sort of common spaces common things
00:44:13
so it's a sort of partitioning the data looking at how well the model think
00:44:17
she's petitions letting which little adjustments of parameters that we actually need to make
00:44:21
and then making these adjustments over time and learning the sort of distributions into different
00:44:26
climbs into different classes itself and then trying to form these up itself
00:44:30
So the basic idea of any sort of clustering algorithm goes back
00:44:34
almost to the k-nearest neighbours and the k-means algorithms themselves:
00:44:38
trying to find clusters where two points belonging to the same class have a very
00:44:43
small distance measure between them, and two points belonging to different classes have a very, very big distance measure
00:44:49
between them. It is a very easy concept: we can just form classes, check them, and then
00:44:54
reassign the boundaries, reassign the boundaries; that is what is happening in a lot of the various different ways of doing this.
00:45:00
Within a cluster the distances are small and the feature similarity is high; between clusters the distances are large
00:45:05
and the feature similarities are low, so those points belong to different classes. That is the general way of looking at it.
00:45:10
The advantage of doing any sort of clustering, as you move from the discriminative
00:45:15
models to the generative models, is that we can often find very compact representations.
00:45:19
Think about what we were doing with k-nearest neighbours: working out all of these distances
00:45:23
to every single training point. Now we just have clusters of data and work out distances,
00:45:27
or some likelihood measure, to the centres of these clusters, something summarising the training points
00:45:32
they actually cover. We assign the test data to
00:45:36
a class by computing distances to the centre, the
00:45:39
middle point, of each class, and find the one that gives us the best fit.
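A minimal sketch of that compact-representation idea, with made-up class names and two-dimensional features (nothing here comes from the lecture materials):

```python
# Illustrative only: classify a test point by its distance to each class centre
# rather than to every single training point (as k-nearest neighbours would).
import numpy as np

centroids = {                                   # one compact summary per class
    "healthy": np.array([0.2, 1.1]),
    "pathological": np.array([1.5, 0.3]),
}

def nearest_centroid(x):
    # the class whose centre has the smallest Euclidean distance wins
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

print(nearest_centroid(np.array([1.4, 0.5])))   # -> "pathological"
```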
00:45:45
The core methods we'll talk about that work this way are k-means, naive Bayes, Gaussian mixture models
00:45:51
and hidden Markov models themselves. K-means is probably the simplest
00:45:57
clustering algorithm: it is essentially the iterative assignment of training points,
00:46:03
one by one, into the different groups that we actually want. In supervised learning we
00:46:07
essentially use the labels to guide the selection of these clusters.
00:46:10
Once clustered, we just use the centroids of the data, and use their labels for any new data points.
00:46:16
When we are doing unsupervised learning, we can find a few different distributions
00:46:21
and then assume that the points joined together represent some sort of new class label that has come up,
00:46:26
which is the kind of information we want when we are doing data mining with something like k-means.
00:46:31
So for k-means, the key parameter is essentially the number of clusters, k, that we actually want;
00:46:37
depending on how we do it, we could also use group labels to refine these.
00:46:42
So the training algorithm is: we select some initial instances to be the centroids,
00:46:47
we assign the remaining instances to one of the k clusters,
00:46:51
and we always assign each one to the centroid that is closest to it.
00:46:54
We then recompute the centroids based on the current pattern of assignments,
00:46:58
reassign everything to the updated clusters, and so on and so on, until essentially we get no change
00:47:04
in the positions of the cluster centroids, or
00:47:09
we just say, okay, we only want to do this for a set number of iterations.
00:47:13
So the idea is to really keep going until there is essentially no change between successive iterations.
00:47:19
We have some sort of random initialisation, and then we just repeat this assignment, repeat this cluster
00:47:24
and centroid update, time and time again until convergence: assigning our points
00:47:31
at iteration one, iteration two, and so on, and the model gets better and better and better.
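A bare-bones sketch of those steps, assuming the data sit in a numpy array `X` of shape (n_samples, n_features); this is illustrative rather than the exact implementation shown in the lecture:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: pick some initial instances to act as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # step 2: assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute each centroid from its current members
        # (empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```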
00:47:36
The one disadvantage of doing k-means clustering, especially with these iterative updates,
00:47:41
is that sometimes the final output of the model is very much dependent on actually selecting good
00:47:47
initial points. If we select all of our initial points somewhere, you know,
00:47:52
not well distributed over the data, it is potentially going to take a longer time to
00:47:56
converge, or what it does converge to is a bit rubbish.
00:48:00
So k-means is easy to implement and works on large data sets, but it does need a bit of tuning to actually
00:48:07
find a suitable number of clusters. As I said, this initial seeding can have a really strong impact
00:48:12
on the final clusters and their final positions, and we can also have some
00:48:16
strong sensitivity to outliers and noise during the clustering:
00:48:20
these can really start to push the different cluster centres out of position.
00:48:23
We are also making some assumptions about the shape of
00:48:29
the data which may or may not hold true when we are looking at real data instances.
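One common way of softening the seeding problem is simply to run the algorithm several times from different random initialisations and keep the best run; for example, with scikit-learn (the choice of k=2 below, and the array `X` from the previous sketch, are just assumptions for illustration):

```python
from sklearn.cluster import KMeans

# n_init controls how many random initialisations are tried; the run with the
# lowest within-cluster sum of squares (inertia) is kept.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.inertia_)
print(km.cluster_centers_)
```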
00:48:34
So a really simple form of generative model is called the naive Bayes classifier.
00:48:39
It is called naive because it makes certain assumptions about the data and the distribution of the data itself.
00:48:44
Essentially we just use Bayes' rule directly, building likelihood tables and posterior tables, and
00:48:50
we assign the probability distributions directly from the data itself.
00:48:53
But we work on the assumption that every feature is class-conditionally independent, so we can learn each feature's
00:49:00
distribution based only on the class labels. In real life this assumption probably does not
00:49:05
hold; it is a pretty naive way of doing it. So we are maximising the posterior probability:
00:49:12
we just look up the likelihood from some likelihood table,
00:49:15
use the prior probability distributions, and the
00:49:18
evidence probability, which is the same for every class, is actually removed
00:49:22
from the algorithm; this is essentially what I have done here.
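Written out, the naive Bayes decision rule for a feature vector x = (x_1, ..., x_D) and classes y_i is (again an editorial rendering, not copied from the slides):

```latex
P(y_i \mid x_1,\dots,x_D) \;\propto\; P(y_i)\prod_{d=1}^{D} P(x_d \mid y_i),
\qquad
\hat{y} \;=\; \arg\max_{i}\; P(y_i)\prod_{d=1}^{D} P(x_d \mid y_i),
```

where the evidence P(x_1, ..., x_D) has been dropped because it is identical for every class.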
00:49:25
So for the classification phase: training is really converting the data set into
00:49:30
frequency tables, looking at the frequencies of the distributions in the data set,
00:49:34
and creating the likelihood tables, finding the probabilities. We might need to make small corrections: we might have
00:49:40
features and classes that do not really come up in the training set, so we might have to make
00:49:44
some assumptions about the distributions themselves if we want to cover everything equally.
00:49:49
Classifying is then just using these equations, using the likelihood tables and the frequency table itself:
00:49:55
calculate the posterior probability for every class, and the class with the highest
00:49:58
posterior probability is the outcome of the actual prediction. That is all there is to it.
00:50:02
The very simple example that comes up time and time again is this idea of: do we go out and play,
00:50:07
or don't we go out? You are able to say, okay, if it is sunny, did someone go out? Overcast, yes? You know,
00:50:13
raining, yes, did someone play? And so on. So we look at the frequency table; this is just saying,
00:50:18
for the classes rainy, sunny and overcast, the total number of, you know, yes and no outcomes, and then we assign
00:50:23
the different probability distributions. The only assumptions here are the ones we read
00:50:27
straight off the data. Then we can form a likelihood table, just looking at the distribution of
00:50:31
the different weather conditions, rainy, overcast and sunny, against play yes or no, and things
00:50:36
like that as well. And essentially, when forming classifications, it is all done
00:50:40
by Bayes' rule: we get all of this information from a lookup
00:50:43
table drawn from the data, and from that we can start to infer things like: if it was sunny,
00:50:49
is 'the players will play' a correct statement? We can find that the likelihood of it being sunny and going out to
00:50:54
play is about sixty percent. We can form these probabilities and we can assign
00:50:59
the class label itself with them. We could also tweak exactly where we
00:51:02
want to make this decision based on prior knowledge as well, within naive Bayes.
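A hedged reconstruction of that calculation; the counts below are illustrative stand-ins for the frequency table rather than the exact numbers on the slide, chosen so that the sunny case works out to the sixty percent mentioned:

```python
# weather -> [count of play = yes, count of play = no]
counts = {
    "sunny":    [3, 2],
    "overcast": [4, 0],
    "rainy":    [2, 3],
}
n_yes = sum(yes for yes, _ in counts.values())
n_no = sum(no for _, no in counts.values())
total = n_yes + n_no

def p_play_given(weather):
    # Bayes' rule: P(yes | weather) proportional to P(weather | yes) * P(yes)
    p_yes = (counts[weather][0] / n_yes) * (n_yes / total)
    p_no = (counts[weather][1] / n_no) * (n_no / total)
    return p_yes / (p_yes + p_no)

print(p_play_given("sunny"))   # -> 0.6 with these illustrative counts
```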
00:51:07
So naive Bayes is relatively easy and fast for making predictions on different test sets
00:51:12
as well, and it can perform multi-class classification very, very well.
00:51:16
A big advantage of naive Bayes over a lot
00:51:20
of different algorithms is that it can actually handle categorical input variables itself,
00:51:24
so we do not actually have to have some numeric distribution within our feature space: we can start using, like,
00:51:30
words and different other forms of features that we might want to use.
00:51:34
The main disadvantage is the idea of zero frequency: what happens if we get a test
00:51:39
vector that has a category that was not observed in the training sample?
00:51:42
Say we had this overcast category here in the frequency table;
00:51:46
now, if we had no occurrences of it within our actual training samples
00:51:50
and it then comes up in our test samples, we do not really know what
00:51:54
to do with it. And we always have this big assumption of independent predictors, which
00:51:59
does not hold true in real-life data itself: everything is slightly correlated with everything
00:52:04
else, and we are making this very naive assumption, hence it is called a naive classifier.
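The zero-frequency problem is usually patched with a small correction such as add-one (Laplace) smoothing, which is one way of making the "assumptions about the distributions" mentioned earlier; a sketch with assumed variable names:

```python
def smoothed_likelihood(n_xy, n_y, n_values, alpha=1.0):
    """P(x | y) with add-one (Laplace) smoothing.

    n_xy: training samples of class y with feature value x
    n_y: training samples of class y
    n_values: number of distinct values the feature can take
    """
    return (n_xy + alpha) / (n_y + alpha * n_values)

# an unseen value no longer gets probability exactly zero
print(smoothed_likelihood(0, 9, 3))
```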
00:52:09
So the last model I'll cover in this particular family is the
00:52:16
Gaussian mixture model itself; you probably did a little with this within hidden Markov models.
00:52:20
A Gaussian mixture model is a convex combination of Gaussian probability density functions
00:52:27
that we actually fit to the data. It is a really compact way of expressing a
00:52:30
model, in that we express a feature distribution in
00:52:34
terms of some mixture weights, the means of the Gaussians and the covariances of the Gaussian components, so it is quite a
00:52:40
compact way of doing a generative model, and it works pretty
00:52:44
well a lot of the time. In this compact representation
00:52:47
we have some sort of mean vector, so we learn some mean distributions for each class; the
00:52:52
mean vector roughly characterises the shape of the feature space itself.
00:52:56
We have a covariance matrix, which characterises the variability within the feature space itself.
00:53:02
We also have the weighting, and this is essentially just saying how important the different clusters are, their different sizes:
00:53:09
small ones are not super important, and the weight reflects the amount of data essentially covered by each mixture component
00:53:14
within this sort of clustering. So we might have
00:53:17
one Gaussian, two Gaussians, three Gaussians coming up in
00:53:21
a single model, and we build a separate model for each class: one for the positive class, one for the negative class.
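In symbols, a GMM with M components models the class-conditional density as (standard notation, added editorially):

```latex
p(\mathbf{x} \mid \lambda) \;=\; \sum_{m=1}^{M} w_m\, \mathcal{N}\!\left(\mathbf{x};\, \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m\right),
\qquad \sum_{m=1}^{M} w_m = 1,
```

with one parameter set λ = {w_m, μ_m, Σ_m} trained per class.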
00:53:26
And so on. Classification then essentially just uses the likelihood function and Bayes' rule.
00:53:30
Training is generally done with the expectation
00:53:34
maximisation algorithm, or with maximum a posteriori (MAP) updates as well.
00:53:39
So what we really need to do is to maximise the GMM likelihood function
00:53:43
on the actual data set itself. We generally do not maximise the
00:53:47
likelihood function directly, though; we maximise the log-likelihood function,
00:53:51
and this is essentially a computational thing: it avoids underflow from
00:53:56
working with very, very small numbers, which is otherwise very easy to
00:53:59
hit, and it also makes our estimation easier: we go
00:54:02
from multiplications to summations, which just makes it a bit more computationally efficient. That is generally why we use the log-likelihood.
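A small sketch of that log-likelihood, assuming SciPy is available and that `w`, `mu` and `cov` hold the weights, means and covariances of one fitted mixture:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, w, mu, cov):
    # log p(X) = sum_n log sum_m w_m N(x_n; mu_m, cov_m), computed in the log
    # domain so that very small densities do not underflow
    log_terms = np.stack(
        [np.log(w[m]) + multivariate_normal.logpdf(X, mu[m], cov[m])
         for m in range(len(w))],
        axis=1,
    )
    return logsumexp(log_terms, axis=1).sum()
```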
00:54:08
So, at a very high level, what is happening within the EM
00:54:12
algorithm is essentially the same as what is happening in
00:54:16
k-means. We start with some initial estimate of what the parameters
00:54:19
might be: we could estimate them directly, randomly assign them,
00:54:23
or seed them with, say, k-means as a first guess of the clusters, and now I am
00:54:28
going to try to fit more finely tuned Gaussians on top of these.
00:54:31
We then compute the likelihood that each component produced each particular data point; this is essentially the expectation
00:54:36
step. We then compute weights based on this likelihood that each component produced the point.
00:54:41
We use these weights, together with the data, in the estimation, the maximisation step,
00:54:46
where we try to improve the likelihood estimates time and time again,
00:54:49
and we essentially do these steps over and over until we reach some sort of form of
00:54:54
convergence, or we say, okay, nice, stop, you have done enough steps; then it is off to classifying what we actually want to do.
00:55:00
So it is an iterative method: we start with two rough estimates of what we
00:55:03
think might be happening, and then, through
00:55:07
the successive EM iterations, we end up with a final model
00:55:10
roughly reflecting that we had two classes, A and B, here:
00:55:14
one with a slightly wider distribution, and we form one with a
00:55:18
narrower distribution here. So that is the final model itself.
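A bare-bones one-dimensional version of those two steps, with made-up initialisation choices and no safeguards against degenerate components, purely to make the E-step and M-step concrete:

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    w = np.full(k, 1.0 / k)                    # initial mixture weights
    mu = rng.choice(x, size=k, replace=False)  # rough initial means
    var = np.full(k, x.var())                  # shared initial variance
    for _ in range(n_iter):
        # E-step: responsibility of each Gaussian for each point
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = w * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances from the responsibilities
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var
```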
00:55:23
Classification is always done with Bayes' rule. So we have some test data on one
00:55:27
side, and on the other the training algorithm and the training sets;
00:55:32
from these we build the class models, so we form a
00:55:35
different Gaussian mixture model for each class of interest itself,
00:55:38
and then classification involves identifying the model that returns the
00:55:42
maximum conditional probability among the different classes; so,
00:55:45
essentially saying, okay, given the test data, which Gaussian mixture returns the maximum probability here?
00:55:51
Computing this directly can actually be quite difficult, so we flip it
00:55:55
around using Bayes' rule: essentially, instead of going from class
00:55:59
to Gaussian, we go from Gaussian to class, and we also have the prior distribution of each model itself.
00:56:04
Classification is then essentially just finding which model returns the most likely result,
00:56:10
using the extra prior term that we have here.
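Put together, classification with per-class GMMs can be sketched as below; scikit-learn is assumed, `X_train`, `y_train` and `X_test` are placeholder arrays, and four components per class is an arbitrary choice:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

classes = np.unique(y_train)
models, log_priors = {}, {}
for c in classes:
    Xc = X_train[y_train == c]
    models[c] = GaussianMixture(n_components=4, random_state=0).fit(Xc)
    log_priors[c] = np.log(len(Xc) / len(X_train))

# log p(x | class) + log P(class) for every class, then pick the largest
scores = np.stack(
    [models[c].score_samples(X_test) + log_priors[c] for c in classes], axis=1
)
y_pred = classes[scores.argmax(axis=1)]
```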
00:56:15
The advantages come down to this compact representation of the model
00:56:19
pdf, which can achieve quite a good level of
00:56:21
accuracy, and it is essentially quite easy to fit using EM, or using the MAP algorithm, which
00:56:26
I did not cover; the MAP algorithm is a more iterative update for the individual classes,
00:56:31
and it is used, for example, when deriving vectors and things like that. Disadvantages:
00:56:37
it is inflexible if we chose a really inappropriate model to begin with; it is not
00:56:41
going to give us the best choice, so we need quite
00:56:44
strong priors, some estimates of what we think the number of Gaussians and the distributions might be.
00:56:50
If the true model is really skewed or quite complicated,
00:56:54
and the distributions do not really have a Gaussian shape, it is not really
00:56:59
the best particular model that we can actually work with itself.
00:57:03
So the last model to briefly mention is hidden Markov models.
00:57:08
This is essentially now almost an extension of the Gaussian mixture
00:57:12
models: we want to map a sequence of observations onto a sequence of
00:57:17
labels, and we are essentially saying the state is hidden, and we can only look at the probabilistic variations that can be observed.
00:57:23
There are a lot of different parameters we can tune with a hidden Markov model compared with a Gaussian mixture.
00:57:27
The way hidden Markov models are usually introduced is: say I was sat
00:57:33
here in the lecture theatre, and what we want to do is classify what the weather
00:57:36
is outside. We cannot see the weather directly, so we have to take observations of what
00:57:41
might be happening. The way you might do that is to look at
00:57:45
how people come in: if they are just in a shirt, just in shorts, it is sunny, it is warm; if they are in raincoats, we can
00:57:51
infer that it might be raining. That is essentially what is happening, and it is a useful way to think about hidden Markov models.
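A toy numerical version of that picture, with invented probabilities: the hidden states are the weather, the observations are what people wear, and the forward algorithm scores an observation sequence under the model:

```python
import numpy as np

states = ["sunny", "rainy"]                 # hidden states we never observe
obs_index = {"shorts": 0, "raincoat": 1}    # what we actually see

pi = np.array([0.6, 0.4])                   # initial state probabilities
A = np.array([[0.7, 0.3],                   # state transition probabilities
              [0.4, 0.6]])
B = np.array([[0.8, 0.2],                   # emission probabilities
              [0.1, 0.9]])                  # rows: states, columns: observations

def forward(observations):
    # forward algorithm: P(observation sequence | model)
    o = [obs_index[s] for s in observations]
    alpha = pi * B[:, o[0]]
    for t in o[1:]:
        alpha = (alpha @ A) * B[:, t]
    return alpha.sum()

print(forward(["shorts", "shorts", "raincoat"]))
```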
00:57:57
The real advantage is modelling sequential data; the real disadvantage is probably the large number of parameters
00:58:03
that come up with hidden Markov models, and the assumption that we have enough data to actually
00:58:08
model them and learn the correct parameters.
00:58:12
So that brings an end to the lecture, but it is not quite
00:58:16
class dismissed yet, because I have got a little quiz online that I would like you to do.
00:58:21
So if you could all get out your phones and answer this little quiz, it just goes over some
00:58:26
of the main points and re-emphasises some of the main things we actually talked about this morning.
