Transcriptions

Note: this content has been automatically generated.
00:00:00
Thank you once again. So this presentation
00:00:05
will be a bit different from
00:00:06
yesterday's. It's more about things
00:00:11
that are happening at the research
00:00:13
level and not so much things that
00:00:16
people use to build products yet. Also,
00:00:22
it'll be a little bit more technical.
00:00:24
And I have a bit more time than
00:00:26
yesterday. So feel free to raise your
00:00:29
hand and ask questions in the middle. If
00:00:30
there are too many, I'll filter,
00:00:32
but let's try, so we don't need to
00:00:36
wait for the end to ask questions. Okay,
00:00:38
let's start with motivations. Really
00:00:45
it's about unsupervised learning. Why is that
00:00:49
important? Well, you have to realise
00:00:52
that all the great things that deep
00:00:54
learning has done in the last few years
00:00:56
are mostly due to supervised learning,
00:01:00
meaning that we need large
00:01:04
datasets that are labelled where humans
00:01:07
have told the machine what the right
00:01:09
answer should be. But that's not how
00:01:13
humans learn most of the time. And
00:01:16
think about how a child, like a two- or
00:01:19
three-year-old, figures out what we call
00:01:22
intuitive physics. She understands, you
00:01:25
know, gravity; she understands solids and
00:01:28
liquids and all kinds of
00:01:31
mechanical notions of course without
00:01:35
ever taking a class on
00:01:38
Newtonian physics. She got it by
00:01:43
observation. Her parents didn't tell her
00:01:45
how the world works in
00:01:49
terms of physics. She just
00:01:52
interacts with the world, observes, and
00:01:54
figures out causal explanations that
00:01:57
are sufficiently good that she can
00:01:59
control her environment and do all
00:02:02
kinds of things that robots can't do
00:02:04
yet. And so we'd like to have
00:02:09
that kind of ability for computers to
00:02:12
observe and interact with the world in
00:02:15
order to get better information,
00:02:17
and learn essentially without
00:02:20
supervision. Now of course when I talk
00:02:23
about unsupervised learning, you have to
00:02:24
understand that in the big scheme of
00:02:26
things. We need all the three types of
00:02:30
learning to reach AI: we need
00:02:31
supervised learning we need
00:02:33
unsupervised learning and we need
00:02:34
reinforcement learning. They just, you
00:02:36
know, cater to different niches, and
00:02:39
humans use all three as well.
00:02:41
So what one may wonder, one of the things
00:02:48
I'll talk about, is, you know, why is it
00:02:49
that unsupervised learning hasn't been
00:02:53
as successful. And I don't
00:02:58
seem to have all the answers for that
00:02:59
but I'll give you some
00:03:01
suggestions. I think that there are
00:03:04
computational and statistical
00:03:08
challenges that arise out of the
00:03:11
objective that we have in unsupervised
00:03:13
learning of really capturing the joint
00:03:15
distribution in some form maybe
00:03:18
implicitly, of many variables, whereas
00:03:21
when we do supervised learning
00:03:22
typically we only care about predicting
00:03:24
one thing, one number, one
00:03:26
category, and we're not trying to
00:03:29
get a joint distribution in a high
00:03:30
dimensional space. And that's really
00:03:33
what unsupervised learning is about. It
00:03:35
may not be explicit: like, if
00:03:36
you train an autoencoder, you might think I'm not,
00:03:38
you know, I'm just learning by
00:03:40
minimising the reconstruction error or
00:03:41
something. But really what you're
00:03:44
trying to do is to extract information
00:03:46
about the structure of the data
00:03:48
distribution in a high dimensional
00:03:50
space and that's fundamentally
00:03:52
difficult. And I don't know, maybe
00:03:54
it's gonna take us another fifty years
00:03:56
to crack this. But I really believe, and
00:04:00
others like me believe, that we
00:04:05
need to work hard on this and make
00:04:08
progress on this to even
00:04:11
approach human-level intelligence.
00:04:15
So now, from a practical point of view, why
00:04:17
would we want to do that? Well, a
00:04:19
really obvious answer is that there's a
00:04:22
lot of unlabelled data out there that
00:04:25
we would like our computers to learn from
00:04:28
and use that information to build
00:04:30
better models of the world. We can't go
00:04:34
on building specialised machines for
00:04:36
every new task, where you're gonna
00:04:37
need a lot of labelled data for each. I
00:04:39
mean we can and this is what we're
00:04:40
doing, but it's not gonna bring us human
00:04:43
level AI. It's not gonna be enough.
00:04:47
Here's another reason. When we do
00:04:51
unsupervised learning as I said
00:04:53
essentially in some sense we are
00:04:55
learning about the joint distribution
00:04:56
of things, so we should be able to
00:04:59
answer any new question about the data.
00:05:03
So think about it: I observe random variables
00:05:05
X, Y and Z, and I learn the joint
00:05:07
distribution of all three. Now I should
00:05:09
be able to answer a question like,
00:05:11
given X, what can I say about Y and Z?
00:05:13
Or given Y and Z, what can I say
00:05:16
about X? All of the questions like: I
00:05:19
know some aspects of reality,
00:05:21
what can I say about other aspects?
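As an illustration of the point about X, Y and Z, here is a minimal sketch, not taken from the talk, of how a single joint table can answer conditional queries in any direction. The three binary variables, the probability values and the helper name `conditional` are all assumptions made up for the example.

```python
from itertools import product

# Hypothetical joint distribution over three binary variables (X, Y, Z).
# The weights are invented for illustration; dividing by 20 makes them sum to 1.
NAMES = ("X", "Y", "Z")
weights = [4, 1, 2, 3, 1, 2, 3, 4]
joint = {xyz: w / 20 for xyz, w in zip(product([0, 1], repeat=3), weights)}


def conditional(joint, query, given):
    """Return P(query variables | given values) as a dict keyed by query values."""
    query_idx = [NAMES.index(name) for name in query]
    # Keep only the configurations that agree with the observed values.
    matching = {
        config: p
        for config, p in joint.items()
        if all(config[NAMES.index(name)] == value for name, value in given.items())
    }
    norm = sum(matching.values())
    result = {}
    for config, p in matching.items():
        key = tuple(config[i] for i in query_idx)
        result[key] = result.get(key, 0.0) + p / norm
    return result


# Given X, what can I say about Y and Z?
p_yz_given_x = conditional(joint, ("Y", "Z"), {"X": 1})
# Given Y and Z, what can I say about X?
p_x_given_yz = conditional(joint, ("X",), {"Y": 0, "Z": 1})
```

The same table serves both queries; a supervised predictor of Y given X would correspond to fixing one such query in advance.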
00:05:24
In unsupervised learning there's no
00:05:27
preference as to which question you're gonna
00:05:28
be asking. You can think of supervised
00:05:30
learning as a special case of
00:05:34
restricting yourself to only one
00:05:35
particular question, which is predict
00:05:38
Y given X. Another reason why unsupervised
00:05:44
learning can be practically useful, even
00:05:46
before we completely crack it, is that
00:05:48
it turns out to be very useful as a
00:05:50
regulariser. What that means is
00:05:52
that it can act as an adjunct to supervised
00:05:55
learning. So this is the semi-supervised
00:05:58
case: we can use unsupervised learning as
00:06:00
a way to help generalisation and the
00:06:03
reason it helps is that it
00:06:05
incorporates additional constraints
00:06:10
on the solution. The
00:06:12
constraint, or the prior, that we're
00:06:13
putting in is that the solutions we're
00:06:16
looking for are not just good at
00:06:17
predicting Y given X; somehow they
00:06:19
involve representations that are
00:06:22
also good at capturing something about
00:06:25
X, the input distribution.
00:06:27
You don't have to have that
00:06:29
constraint when you do pure supervised
00:06:30
learning but when you add that
00:06:32
constraint you can get better
00:06:33
generalisation. And that can be
00:06:36
useful as a regulariser by itself; it
00:06:38
could be useful in the transfer
00:06:40
setting where you wanna go to a new
00:06:41
task where you have very few labelled
00:06:43
examples or domain adaptation which is
00:06:45
kind of similar: it's not a new task,
00:06:47
it's a new type of data. Maybe
00:06:50
you go from, you know,
00:06:54
Quebec French to Swiss French, and you
00:06:57
have to adapt and you don't have a lot
00:06:59
of data. All right, so these are
00:07:04
good reasons. Another good reason, that
00:07:06
came out right at the beginning of the
00:07:07
deep learning revolution in two
00:07:09
thousand six is that it looks like we
00:07:11
can exploit unsupervised learning to
00:07:13
make the optimisation problem of deep
00:07:17
learning easier. And the reason is
00:07:20
that we can define sort of local
00:07:25
objective functions, like: each pair of
00:07:28
layers should be a good autoencoder,
00:07:31
should form a good
00:07:35
encoder-decoder pair. And that kind of
00:07:37
constraint is something that
00:07:39
induces a kind of training signal
00:07:41
locally: you don't need to backprop through
00:07:45
twenty layers to get that information.
00:07:48
So in the unsupervised
00:07:50
pre-training things we did from two
00:07:53
thousand six to about two thousand twelve,
00:07:55
it was a useful way to get
00:07:59
the training off the ground for deep
00:08:01
supervised nets. Later we found other
00:08:03
ways to get around this optimisation
00:08:05
difficulty, with the rectifier, but
00:08:08
it remains that there's an interesting
00:08:11
effect here that could be taken
00:08:12
advantage of. And then the last reason
00:08:15
why this is interesting is that even if
00:08:17
you're only doing pure supervised
00:08:19
learning, it happens sometimes that the
00:08:21
thing you wanna predict is not a single
00:08:24
simple class or a simple real value.
00:08:27
It's a composite object:
00:08:31
for example you're predicting a set,
00:08:35
you're predicting a data structure,
00:08:36
predicting a sentence, you're predicting
00:08:38
an image. So if you predict an
00:08:39
image, the output is a high-dimensional
00:08:42
object; if you predict a sentence, the
00:08:44
output is a high-dimensional object,
00:08:46
and these objects are composed of
00:08:48
simple things like pixels or words or
00:08:50
characters. And so they have a joint
00:08:52
distribution. Now of course it's a
00:08:54
conditional joint distribution: given
00:08:55
the input I want to predict the joint
00:08:57
distribution of a bunch of things like
00:08:58
words in the sentence or something like
00:09:00
that, or the structure of a molecule.
00:09:03
So all of these kinds of objects we may
00:09:04
be interested in predicting or saying
00:09:07
something about, given an input. That's
00:09:10
called structured output
00:09:12
learning. And there, essentially,
00:09:14
all the techniques that we have been
00:09:17
developing for unsupervised learning,
00:09:18
especially the probabilistic ones,
00:09:20
become useful. We just have the
00:09:23
unsupervised learning model as usual
00:09:25
except we condition it, meaning we have
00:09:27
an input that changes something in the
00:09:30
form of the joint distribution of the
00:09:31
outputs. All right, so these are very good
00:09:34
reasons to study unsupervised learning,
00:09:36
but the one that really makes
00:09:39
me wake up at night is that we really
00:09:42
want the machine to understand how the
00:09:45
world ticks, how the world works. And
00:09:47
unfortunately, if you step back,
00:09:54
you know behind all the hype and the
00:09:58
the excitement around deep learning and
00:10:00
machine learning in general what
00:10:02
happens very often is that the
00:10:04
models end up learning simple tricks,
00:10:08
they're like surface statistical
00:10:10
regularities, in order to solve the task.
00:10:13
And if you think about self-driving
00:10:17
cars, you would like those self-driving
00:10:19
cars to somehow not just
00:10:23
rely on surface statistical
00:10:25
regularities but make sense of the
00:10:28
causal relationships between the
00:10:30
objects, and of what-could-happen-if
00:10:34
scenarios even though they may not have
00:10:36
seen these scenarios during their
00:10:37
training phase. So how can that happen?
00:10:40
How do humans manage to do that?
00:10:43
Well, the idea, and I think this is a
00:10:46
hypothesis of course we don't really
00:10:48
know what's going on in our brains but
00:10:50
there's a lot of evidence that we
00:10:52
learn models of the
00:10:54
world that are causal. What
00:10:58
I mean by causal here is that there are
00:10:59
explanations of what's going on.
00:11:03
So I think the main job of our brain is
00:11:07
to figure out an explanation for
00:11:08
everything that we're seeing; that's the
00:11:10
unsupervised learning job. And
00:11:12
having an explanation means that you
00:11:14
can kind of simulate what
00:11:17
would happen if I change some of these
00:11:19
explanatory factors even though this
00:11:22
may be a situation that I have never
00:11:24
seen during training. To give an example,
00:11:27
fortunately I never had a car accident
00:11:31
that killed me. So how can I learn
00:11:34
to avoid the actions that
00:11:36
could have me killed in a
00:11:39
car accident? Well, supervised
00:11:42
learning is obviously not gonna work
00:11:44
even reinforcement learning is not
00:11:46
gonna work, because, you know, how many
00:11:48
times would I have to die in accidents
00:11:49
before I learned how to avoid that?
00:11:51
You see that there's a
00:11:53
problem. So how do we get around that?
00:11:56
Well, we build a mental model
00:11:59
of cars, of roads, of people, that allows us
00:12:02
to predict that if I do this
00:12:04
and that, then
00:12:07
something bad may happen, and
00:12:08
this is how it may happen, and if
00:12:11
instead I slightly change my
00:12:12
behaviour, I could end up alive.
00:12:16
So we are able to do that because we
00:12:19
have these kinds of explanatory models.
00:12:21
It's something that we don't know how
00:12:22
to do yet with machines, but this is
00:12:25
something we really need to do
00:12:26
otherwise, yeah, it's not gonna be,
00:12:30
you know it's gonna be a spongy
00:12:32
All right. So how do we possibly do that?
00:12:37
Well, there are many answers, but one of
00:12:40
them, the
00:12:42
reason why we got started into this
00:12:44
adventure of deep learning, is because we
00:12:47
thought that by learning these high
00:12:49
level representations we might be able to
00:12:51
discover high-level abstractions. What
00:12:54
that means is that these abstractions in
00:12:56
some sense are closer to the underlying
00:12:58
explanations, the underlying explanatory
00:13:00
factors. And what we would really like
00:13:03
is that these high level features that
00:13:05
we're learning really capture
00:13:09
the knowledge about what's going on.
00:13:11
And one way to think about this is that
00:13:13
the pixels we're seeing, the
00:13:16
sounds we're hearing, the words we're reading,
00:13:20
they were created by something by some
00:13:23
factors, by some agents. And maybe
00:13:27
the lighting and the the microphone
00:13:30
whatever factors came in together were
00:13:32
combined in order to produce what we
00:13:35
observe. And so what we want a machine
00:13:38
to do is to reverse engineer this to
00:13:40
figure out what are these factors and
00:13:42
separate them, disentangle them.
00:13:45
So I'll come back to this notion of
00:13:46
disentangling later, but this is
00:13:48
really, I find, a very inspiring notion.
00:13:51
I want first to separate the
00:13:57
notion of invariance from the notion of
00:13:58
disentangling. The notion of
00:14:00
invariance is one that has been very
00:14:03
commonly studied and
00:14:07
thought about in areas like speech
00:14:09
recognition or computer vision where we
00:14:11
wanna do supervised learning so we
00:14:13
wanna predict something definite like
00:14:15
the object category or the
00:14:16
phoneme. And we're trying to hand craft
00:14:20
features or maybe learn features that
00:14:23
are invariant to all the other factors
00:14:28
that we don't care about if I'm doing
00:14:29
speech recognition. I don't wanna know
00:14:31
who the speaker is I want my features
00:14:33
to be invariant to the speaker. I want my
00:14:36
features to be invariant to the type of
00:14:37
microphone I'm using. If I'm doing object
00:14:40
recognition, I would like my high
00:14:43
level features to be maybe invariant
00:14:45
to translation or something like that.
00:14:47
The problem with this is that, well, I
00:14:51
mean, this is good for supervised learning,
00:14:53
but when you're doing unsupervised
00:14:54
learning, well, you don't know which factors
00:14:57
are gonna be the ones that matter. I
00:14:58
wanna capture everything about the
00:14:59
distribution. I wanna know that
00:15:02
actually the underlying explanation of
00:15:04
the sound I'm hearing is both a
00:15:06
sequence of words and phonemes and the
00:15:09
identity of the speaker, where that person
00:15:11
is, whether he's sick, or something
00:15:13
like that. All these are explanations for
00:15:15
what I'm hearing and I would like the
00:15:17
representation I'm getting to have all
00:15:19
of that but I would like those factors
00:15:22
to be separated out so that I can now
00:15:24
just plug a linear classifier on top.
00:15:27
And I can pick out the phonemes if
00:15:28
that's what I want or I can pick out
00:15:30
the speaker identity if that's what I
00:15:31
want. That's the difference
00:15:33
between invariance and disentangling.
00:15:35
With invariance, we're trying to eliminate
00:15:39
from the signal from the features those
00:15:42
factors that we don't care about in
00:15:44
disentangling we don't want to
00:15:45
eliminate anything; we just wanna
00:15:46
separate out the different pieces that
00:15:49
that are the underlying explanations,
00:15:52
and if you're able to do that you're
00:15:54
essentially killing the curse of
00:15:55
dimensionality, because now if your
00:15:58
goal is to answer specific questions
00:16:00
about one of the factors, you
00:16:03
reduce the dimensionality from very
00:16:04
high to just those features that are
00:16:07
sensitive to that factor. Now, the thing
00:16:12
that we don't completely understand is
00:16:13
that when we
00:16:15
apply some of these unsupervised
00:16:16
learning algorithms, it looks like the
00:16:19
features we're getting are a bit more
00:16:23
disentangled than the original as we go
00:16:26
higher up. So something good is
00:16:29
happening. And
00:16:32
these are experiments that were done
00:16:34
and published in two thousand
00:16:36
nine and two thousand eleven, and I
00:16:39
suspect there are other papers more
00:16:41
recently, where we do a kind
00:16:47
of analysis of the features that
00:16:49
have been learned by unsupervised
00:16:51
learning algorithms like sparse
00:16:53
autoencoders, knowing some of the
00:16:57
factors. So I kind of
00:16:59
cheat and I know some of the underlying
00:17:01
factors; now I can test whether some of
00:17:03
the features become specialised more
00:17:06
towards some factor and less
00:17:08
sensitive to other factors. It's something
00:17:10
we can measure and somehow it seems to
00:17:13
happen magically. So why would that
00:17:16
happen? So here's a kind of,
00:17:19
it's a sketch of a theory of why
00:17:22
unsupervised learning can give rise to
00:17:27
the extraction of features that are
00:17:29
more disentangled than the original
00:17:31
data. And before I show you the
00:17:38
equation, let me show you this picture,
00:17:39
because pictures are so much better.
00:17:42
So imagine that this is the data you're
00:17:45
getting: you have a distribution which is
00:17:47
actually a mixture of three Gaussians.
00:17:50
You can't have simpler than that; well,
00:17:52
you could have a single Gaussian. But nobody
00:17:56
tells you which
00:18:01
Gaussian the particular sample you're
00:18:03
getting comes from. So you have unlabelled
00:18:04
data: you just have the X, and the Y
00:18:07
would be the Gaussian identity: is it
00:18:08
number one, number two, or number three?
00:18:10
But you only observe X. So if you
00:18:14
only observe X, what would be a good
00:18:15
model of the data? Well, the best
00:18:17
possible model of the data is the one
00:18:20
that actually spells out the density as
00:18:24
a mixture of three Gaussians. This
00:18:25
is in terms of log-likelihood,
00:18:28
or whatever you wanna use. It's very
00:18:30
likely that the best model of the data is
00:18:32
the one that actually discovers that
00:18:34
there is a latent variable Y which can
00:18:36
take the three integer values
00:18:39
one, two or three. I mean, you can name them
00:18:40
A, B, C if you want, and you
00:18:42
can relabel them, but the point is we
00:18:44
have these three categories that are
00:18:47
sort of implicit in the data. When we
00:18:50
do clustering, we're exploiting the
00:18:53
fact that there are natural clusters,
00:18:57
and we use clustering algorithms to
00:18:59
discover these clusters. And you can
00:19:01
think of these clusters as causes that
00:19:04
nobody told us about but we can
00:19:06
discover with a simple statistical
00:19:08
analysis: just k-means will
00:19:10
figure it out. So the
00:19:12
principle is that there are underlying
00:19:14
causes and the statistics of the data
00:19:17
can reveal them to us if we build a good
00:19:20
model of the data. The better the model
00:19:21
we have, the better we are able to
00:19:24
figure out those underlying causes.
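The mixture-of-three-Gaussians picture can be tried out with a minimal sketch, under assumptions that are not in the talk: the data is one-dimensional, the component means (-5, 0, 5) and standard deviation (0.5) are invented, and the helper `kmeans_1d` uses a simple quantile initialisation chosen only to keep the example deterministic. Plain k-means sees only the X values and still recovers one centre per hidden cause.

```python
import random

random.seed(0)

# Hypothetical ground truth: three well-separated Gaussians. The component
# identity Y is drawn when sampling but never stored, so only X is observed.
TRUE_MEANS = [-5.0, 0.0, 5.0]
data = [random.gauss(random.choice(TRUE_MEANS), 0.5) for _ in range(600)]


def kmeans_1d(points, k, iters=20):
    """Plain k-means on scalars; returns the k cluster centres, sorted."""
    ordered = sorted(points)
    # Initialise the centres at evenly spaced quantiles of the data.
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centre.
            nearest = min(range(k), key=lambda c: abs(p - centres[c]))
            clusters[nearest].append(p)
        # Move each centre to the mean of its assigned points.
        centres = [
            sum(members) / len(members) if members else centres[i]
            for i, members in enumerate(clusters)
        ]
    return sorted(centres)


centres = kmeans_1d(data, 3)
```

With clusters this well separated, each recovered centre should land close to one of the true means, which is the sense in which the statistics of X alone reveal the hidden variable Y.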
00:19:27
Now, why would that be useful for
00:19:29
supervised learning? That's where
00:19:30
this slide and the equation
00:19:32
becomes interesting. So let's think of
00:19:36
Y here as one of the factors that
00:19:40
explain X. And so let's
00:19:46
say that at the end of the day we
00:19:47
actually want to classify and predict
00:19:49
Y given X. This is gonna work, yeah. So
00:19:56
we could just train a normal neural net
00:19:58
that predicts Y directly from X, or we
00:20:01
could train a generative model that
00:20:06
captures P of X. And as I
00:20:13
tried to argue previously, the best
00:20:15
possible generative model here is actually
00:20:17
one that's written as a sum over the
00:20:20
Ys, and possibly over all the other
00:20:22
variables, call them H, where
00:20:25
given the causal factors we can
00:20:28
predict X. And the reason that
00:20:33
this is a better model than
00:20:35
this one is simply that this is how the
00:20:38
data was actually generated. So
00:20:40
the best model of the data is the one
00:20:41
that captures the truth, that's how it's
00:20:44
generated. The one that gives the best
00:20:45
predictions is the one that corresponds to the
00:20:47
truth. So even if we
00:20:53
don't observe Y, if we just
00:20:58
observe X, we can extract latent
00:21:03
variables: we try to
00:21:06
model P of X as P of X given H
00:21:10
times P of H, for example. So we
00:21:13
introduce latent variables H, and in the
00:21:16
best possible model, well, within H
00:21:19
should be Y, because Y is one of the
00:21:21
factors that explains X. And so if we
00:21:24
find good representations for P of X,
00:21:27
it's likely that these representations
00:21:30
will be useful to predict Y.
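For reference, the decomposition described in words here can be written out as follows. This is a reconstruction from the spoken description, not a copy of the slide: the symbols $x$, $y$, $h$ follow the talk, while the exact grouping of $y$ with the other latent factors $h$ is an assumption.

```latex
% Generative (causal) view: x is produced from the class y and other factors h.
p(x) \;=\; \sum_{y} \sum_{h} p(x \mid y, h)\, p(y, h)

% Simplified form used in the talk, with y regarded as one of the factors inside h:
p(x) \;=\; \sum_{h} p(x \mid h)\, p(h)

% Bayes' rule links the marginal p(x) to the predictor we ultimately care about:
p(y \mid x) \;=\; \frac{p(x \mid y)\, p(y)}{p(x)}
```

Read this way, a good model of $p(x)$ built from $p(x \mid h)\,p(h)$ already contains the ingredients of $p(y \mid x)$ whenever $y$ is among the factors in $h$.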
00:21:36
There is a nice paper at ICML two
00:21:41
thousand twelve by Janzing and
00:21:46
others from Bernhard Schölkopf's group at the
00:21:50
Max Planck Institute, where they show
00:21:53
that there's a huge difference between
00:21:56
the situation where X is the cause of
00:21:59
Y and where Y is the cause of X, in terms
00:22:01
of the ability of semi-supervised
00:22:03
learning to work. In other words, if Y
00:22:09
is the cause of X, then we can do
00:22:14
semi-supervised learning, and
00:22:15
learning about P of X actually becomes
00:22:18
useful, even though at the
00:22:23
end of the day we only care about P of
00:22:25
Y given X. Whereas if the causal
00:22:28
direction was reversed, then all the
00:22:31
semi-supervised learning would be
00:22:32
useless because in the case where it
00:22:35
was reversed, basically the joint,
00:22:37
that is, the joint
00:22:38
distribution P of Y and X, would
00:22:40
just be given by P of Y given X times P of
00:22:43
X, and so P of X would have nothing to do,
00:22:45
in its structure, with P of Y given
00:22:47
X. Whereas if it's the other way
00:22:49
around, if the right causal model is
00:22:53
to go from Y to X, then when we want to
00:22:55
learn P of Y given X, well, there
00:22:58
is information about P of Y given X
00:23:00
inside P of X because P of X is
00:23:02
decomposed like this. So they
00:23:06
push this argument much further but the
00:23:09
point is that there
00:23:13
is a deep connection between the
00:23:15
causality, the relation of
00:23:17
which is the cause of which, and
00:23:19
the success of unsupervised
00:23:21
learning in helping supervised learning. That's
00:23:23
the main message. All right, so I mentioned
00:23:28
that unsupervised learning is
00:23:33
difficult and this shows up very
00:23:35
clearly when you try to tackle
00:23:37
unsupervised learning using the arsenal
00:23:41
of mathematical and computational tools
00:23:44
from probability like graphical models
00:23:47
and models with latent variables.
00:23:50
So in principle introducing the latent
00:23:53
variables should help us, and it should
00:23:56
even help us avoid the curse of
00:23:58
dimensionality, because
00:24:02
we're modelling at the right level in
00:24:03
some sense. But the problem is that for
00:24:07
all of the approaches that
00:24:10
are really anchored in probability, in an
00:24:13
explicit probabilistic model, what we
00:24:15
find is that some of the computations
00:24:18
that are needed, either for
00:24:19
learning or using the model are just
00:24:21
intractable: they involve computing
00:24:24
integrals or sums over an exponential
00:24:27
number of things. So for example, in
00:24:32
typical directed models, exact
00:24:36
inference, in other words predicting the
00:24:37
latent variables given the input, is
00:24:41
intractable, even though
00:24:42
you're able to go in the other
00:24:43
direction, predicting X given H,
00:24:45
because that's how the model is
00:24:47
parameterised. Going backwards, which is
00:24:50
something we actually need to do both
00:24:51
for learning and potentially for using
00:24:53
the model, involves an
00:24:55
intractable sum. In other models, the
00:24:58
undirected models, there's
00:25:00
another issue, potentially in
00:25:02
addition to this one which is that
00:25:04
these models involve a normalisation
00:25:06
constant, which is intractable,
00:25:10
and its gradient is too. In other
00:25:12
words the probability is expressed as
00:25:14
some expression divided by
00:25:16
a normalisation constant, which we
00:25:17
usually write as Z, and that Z
00:25:19
is something we can't compute easily. And
00:25:21
of course we also need to compute
00:25:23
the gradient of that Z. So
00:25:25
it looks like it's hopeless. So
00:25:29
this has motivated
00:25:33
a lot of new things some of which I
00:25:34
will tell you about but let me start
00:25:38
with the ancestors of the
00:25:43
deep generative models, the energy-based
00:25:46
models, or Boltzmann machines, basically
00:25:49
the category of undirected graphical
00:25:50
models. So with undirected graphical
00:25:52
models basically you're expressing the
00:25:54
probability function, where X is the
00:25:57
random variable you're trying to model,
00:25:59
in terms of an energy. So this is just
00:26:02
a rewrite; there's not much of a
00:26:04
constraint in doing this, except that
00:26:05
we're saying that every
00:26:08
configuration gets a non zero
00:26:09
probability, because the energy
00:26:11
is gonna be finite for any X, and so
00:26:14
this means the probability is just greater than zero
00:26:16
for everything. But besides that, what it's
00:26:19
really saying is that instead of
00:26:21
parameterising the probability directly,
00:26:22
we're parameterising this guy, the
00:26:24
energy, and we're letting this Z
00:26:28
derive from it, so Z here is just the sum
00:26:30
over X, or the integral over X, of the
00:26:32
numerator. Okay, so if you have a
00:26:35
model of that type it turns out that
00:26:38
the log-likelihood gradient tells you to
00:26:40
update your parameters according to the
00:26:42
following very simple idea and
00:26:45
especially if you think about
00:26:47
stochastic gradient descent. So I'm given
00:26:48
an example X, let's call it X plus, and
00:26:52
this landscape that I'm showing here is
00:26:54
the energy landscape. So
00:26:55
remember that E to the minus energy is
00:26:58
probability, so when energy is low, probability is
00:27:00
high. And there's an exponential
00:27:02
relationship, which is hard to
00:27:06
visualise here, but when this
00:27:09
goes up very much, then the probability
00:27:10
goes exponentially fast to zero. All right,
00:27:13
so we're given an example X plus and
00:27:15
you have a current energy function, so
00:27:17
this is the curve; the Y axis is energy.
00:27:19
And what we wanna do with maximum
00:27:22
likelihood is make the probability of the
00:27:24
observed data high; that's what maximum
00:27:26
likelihood means. That means make the
00:27:28
energy of the observed configurations
00:27:31
low. So the ideal solution would be to
00:27:34
make every training example a peak, I
00:27:36
mean not a peak, a trough, like a
00:27:37
minimum of the energy. That would be the
00:27:41
ideal solution from the training point of
00:27:43
view; for generalisation it might not be. But
00:27:45
anyway what training consists in is
00:27:48
pushing down on the energy where the
00:27:50
examples are and pushing up everywhere
00:27:52
else, because if I just push down on the
00:27:54
training example, that is, on the energy for
00:27:56
the training example, that may not be
00:27:59
good. What I really want is the
00:28:02
relative energy to be small for
00:28:03
training examples. So here's an example where
00:28:05
the data points are these little
00:28:07
dots. And during training we're pushing
00:28:11
up everywhere else. And we're gonna get
00:28:13
a model that puts a low energy where
00:28:16
the data is. This is a good model,
00:28:18
and this is not as good a model.
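The push-down, push-up picture can be made concrete with a minimal sketch under strong simplifying assumptions that are not in the talk: X ranges over only eight discrete states, the energy is a plain table with one free number per state, and the normalisation constant is computed exactly, which is only feasible because the state space is tiny. The function names, learning rate and data are hypothetical. The update follows the maximum-likelihood gradient for this tabular case: lower the energy at the observed state, and raise the energy of every state in proportion to the probability the model currently gives it.

```python
import math


def probabilities(energy):
    """P(x) = exp(-E(x)) / Z over a finite set of states."""
    unnormalised = [math.exp(-e) for e in energy]
    z = sum(unnormalised)
    return [u / z for u in unnormalised]


def train(energy, data, lr=0.1, epochs=200):
    """Stochastic gradient ascent on log P(x_plus) for a tabular energy."""
    energy = list(energy)
    for _ in range(epochs):
        for x_plus in data:
            p = probabilities(energy)
            for x in range(len(energy)):
                positive = 1.0 if x == x_plus else 0.0
                # Push down at the data point (positive term) and push up
                # everywhere, weighted by the model's own probability p[x].
                energy[x] -= lr * (positive - p[x])
    return energy


# Hypothetical observations: state 2 is seen four times, state 5 twice.
data = [2, 2, 2, 5, 5, 2]
energy = train([0.0] * 8, data)
probs = probabilities(energy)
```

After training, the observed states sit in troughs of the energy table and carry most of the probability mass, while states the model never saw have been pushed up.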
00:28:24
You can get that just by doing
00:28:27
three lines of algebra, but there's
00:28:29
something kind of intuitive about
00:28:32
what's going on here. At the same time
00:28:33
as we're trying, at the
00:28:36
configuration given by the data, to push
00:28:38
down the energy, we're trying to push up
00:28:40
everywhere else, but not
00:28:45
with the same strength
00:28:46
everywhere else. The equation we're
00:28:48
getting tells us we wanna push up
00:28:50
especially in places where the energy
00:28:52
is low. So all those places that
00:28:56
get a high probability basically should
00:28:57
be pushed up. And we call these
00:29:02
negative examples, and these positive
00:29:04
examples. We're trying to make
00:29:05
positive examples more probable and
00:29:08
trying to make negative examples less
00:29:10
probable. And where do we get those
00:29:12
negative examples? Well, ideally these
00:29:14
negative examples come from the model
00:29:16
distribution itself. So once we
00:29:18
have an energy, we have a probability
00:29:20
function that corresponds to it by this
00:29:22
equation. And if we could sample from
00:29:24
this distribution we would get, you
00:29:26
know, many points here, a few here, a few
00:29:28
here. So we wanna push where we get
00:29:31
those samples up. That's what
00:29:34
the math tells us we should be doing to
00:29:35
maximise likelihood. This is what we see
00:29:39
in this equation: the derivative of
00:29:40
the log probability with respect to
00:29:42
parameters which are hidden inside the
00:29:44
energy function has two turns one which
00:29:48
we call the positive face term and the
00:29:50
other called the negative face turn.
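The two terms can be checked numerically on a toy model. This is only a sketch under an assumption that is not in the talk: the energy is a plain lookup table, `E(x) = theta[x]`, over four discrete states. The names `theta`, `grad_log_prob` and `finite_difference` are made up for illustration.

```python
import math

def log_prob(theta, x):
    """log p(x) for a model with energy E(x) = theta[x] over a finite set of states."""
    log_z = math.log(sum(math.exp(-e) for e in theta))
    return -theta[x] - log_z

def grad_log_prob(theta, x):
    """Gradient of log p(x) w.r.t. theta, written as positive plus negative phase."""
    z = sum(math.exp(-e) for e in theta)
    probs = [math.exp(-e) / z for e in theta]
    grad = []
    for k in range(len(theta)):
        d_energy_at_data = 1.0 if k == x else 0.0  # dE(x)/dtheta_k at the data point
        positive_phase = -d_energy_at_data         # pushes the energy of the data down
        # sum over x_tilde of p(x_tilde) * dE(x_tilde)/dtheta_k, which is p(k) here
        negative_phase = probs[k]                  # pushes energy up where p is high
        grad.append(positive_phase + negative_phase)
    return grad

def finite_difference(theta, x, k, eps=1e-6):
    """Numerical derivative of log p(x) w.r.t. theta[k], used as a check."""
    up = list(theta)
    up[k] += eps
    down = list(theta)
    down[k] -= eps
    return (log_prob(up, x) - log_prob(down, x)) / (2 * eps)

theta = [0.5, 1.5, 0.0, 2.0]  # hypothetical energies, one per state
x = 1                         # the observed data point
analytic = grad_log_prob(theta, x)
numeric = [finite_difference(theta, x, k) for k in range(len(theta))]
```

Following this gradient lowers `theta[x]` at the data point and raises the other energies in proportion to their current probability, which is the "push down here, push up where the energy is low" picture.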
00:29:52
And this one is saying, you know, change
00:29:54
parameters so that the energy of the X
00:29:57
becomes lower. Because we wanna maximise
00:30:00
this and we have a minus here, we
00:30:01
minimise the energy at this X.
00:30:06
And now you also have this term, which just
00:30:09
pushes up, so there's no negative here.
00:30:11
This one wants the energy to go up, and it does
00:30:13
that everywhere: a sum over all X tilde,
00:30:18
but weighted by P of X tilde. So for those
00:30:20
places where the model thinks that, you
00:30:24
know, they have a high probability, we
00:30:26
want to reduce their probability, we
00:30:28
want to increase their energy. This is
00:30:30
the case here in the second line where
00:30:33
the model involves not just the X but
00:30:36
also some latent variable H. So now the
00:30:38
energy function is defined in terms of
00:30:39
both X and H, and you could marginalise,
00:30:43
so sum over all the values of H, and
00:30:45
get another equation which looks like
00:30:47
the one we had before. And we call this
00:30:49
modified energy (marginalised energy
00:30:52
should be the right term, but physicists
00:30:53
call it free energy). And there is a
00:30:56
similar equation, except that we now
00:30:59
have to weight by those probabilities
00:31:02
of H given X in the two terms here.
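A small sketch of the marginalisation, assuming a made-up energy table over three values of X and a binary H (none of these numbers come from the talk). It checks that the free energy gives the same marginal over X as summing the joint, and that the derivative of the free energy with respect to one entry E(x, h) is the posterior weight P(h | x).

```python
import math

XS, HS = (0, 1, 2), (0, 1)
# Hypothetical joint energies E(x, h), for illustration only.
ENERGY = {(0, 0): 1.0, (0, 1): 0.2, (1, 0): 2.0,
          (1, 1): 1.5, (2, 0): 0.3, (2, 1): 0.9}

def free_energy(energy, x):
    """F(x) = -log sum_h exp(-E(x, h)): the energy with h marginalised out."""
    return -math.log(sum(math.exp(-energy[(x, h)]) for h in HS))

def posterior(energy, x):
    """P(h | x), the weights used when averaging energy gradients over h."""
    weights = [math.exp(-energy[(x, h)]) for h in HS]
    total = sum(weights)
    return [w / total for w in weights]

def marginal_from_joint(energy):
    """P(x) obtained by normalising the joint and summing over h."""
    z = sum(math.exp(-e) for e in energy.values())
    return [sum(math.exp(-energy[(x, h)]) for h in HS) / z for x in XS]

def marginal_from_free_energy(energy):
    """P(x) obtained from exp(-F(x)), same form as the model without latents."""
    unnorm = [math.exp(-free_energy(energy, x)) for x in XS]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def d_free_energy(energy, x, h, eps=1e-6):
    """Numerical derivative of F(x) with respect to the single entry E(x, h)."""
    up = dict(energy)
    up[(x, h)] += eps
    down = dict(energy)
    down[(x, h)] -= eps
    return (free_energy(up, x) - free_energy(down, x)) / (2 * eps)
```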
00:31:06
And this we call the posterior probability.
00:31:08
So you see that when you have
00:31:10
latent variables, you know, to learn we
00:31:11
need to sample or average over this
00:31:14
posterior probability of the latent
00:31:16
variables given X. And this can be
00:31:18
hard. So, yeah, I won't tell you much
00:31:26
about how we do this, but the ways we
00:31:28
know right now how to do it all
00:31:30
involve some kind of Monte Carlo
00:31:32
Markov chain. So Monte Carlo Markov
00:31:34
chains are just methods to sample from a
00:31:37
distribution when, you know, we don't
00:31:39
have any better method. So it's a kind
00:31:41
of general method for sampling from
00:31:42
a distribution, and it's an iterative
00:31:44
method: you never actually get a real
00:31:46
sample from the distribution, you
00:31:48
have to go, you know, many steps, and
00:31:50
asymptotically you hope that you get
00:31:52
a sample from the right distribution.
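As an illustration of such an iterative sampler, here is a minimal Metropolis chain on five discrete states arranged in a ring. The talk does not say which MCMC method is used; Metropolis, the energy values and the function name `metropolis` are all assumptions made for this sketch.

```python
import math
import random

def metropolis(energy, n_steps, seed=0):
    """Run a Metropolis chain over states 0..len(energy)-1 and return visit frequencies."""
    rng = random.Random(seed)
    n_states = len(energy)
    x = 0
    counts = [0] * n_states
    for _ in range(n_steps):
        # Small local move: step to a neighbouring state on the ring.
        proposal = (x + rng.choice((-1, 1))) % n_states
        # Accept with probability min(1, exp(E(x) - E(proposal))).
        if rng.random() < math.exp(min(0.0, energy[x] - energy[proposal])):
            x = proposal
        counts[x] += 1
    return [c / n_steps for c in counts]

energy = [0.0, 1.0, 0.5, 2.0, 1.0]  # hypothetical energies
z = sum(math.exp(-e) for e in energy)
exact = [math.exp(-e) / z for e in energy]
empirical = metropolis(energy, 200_000)
```

No single step is a draw from the target distribution; only the long-run visit frequencies approach `exact`.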
00:31:53
You may have heard about restricted
00:31:57
Boltzmann machines. So these are a
00:31:58
particular kind of undirected
00:32:01
graphical model that has a graph
00:32:07
structure like this, where there is no
00:32:09
relationship
00:32:11
between the X's when we know the H, and
00:32:13
vice versa. So the X's are
00:32:15
conditionally independent given the H,
00:32:18
and vice versa. So this forms what's
00:32:21
called a bipartite graph, where we
00:32:24
have connections, you know, going from
00:32:25
top to bottom everywhere but no
00:32:27
connections, no so-called lateral
00:32:29
connections, here or here. And with
00:32:32
those conditions it turns out that it's
00:32:34
actually easier to train these models
00:32:37
and I'm not gonna go into it. Again it's
00:32:39
using Monte Carlo Markov
00:32:41
chains, but somehow we are able
00:32:43
to do a decent job of training
00:32:46
these types of undirected graphical
00:32:47
models. And so they've been used as
00:32:49
building blocks, starting with the two
00:32:51
thousand six breakthroughs for
00:32:54
supervised learning, to train deeper
00:32:58
models. But that thread of research just
00:33:03
kind of died over the last few years.
00:33:05
And I have some thoughts
00:33:11
about why it didn't work as well as
00:33:14
we would have hoped. And part of it, I
00:33:17
believe, has to do with the fact
00:33:19
that we rely on these Monte Carlo
00:33:21
Markov chains in order to get those
00:33:23
samples. Let me try to explain what I
00:33:26
think is going on. So in order to get a
00:33:31
gradient on the parameters in order to
00:33:33
train the model we need to get samples
00:33:36
from the model. In other words, we have
00:33:38
to ask the model, you know, give me
00:33:39
examples of the things you believe in,
00:33:41
like which images you would generate.
00:33:43
And we do this by running a Markov
00:33:47
chain which starts at some
00:33:49
configuration and goes, you know,
00:33:51
left and right, randomly makes local
00:33:54
small moves. What's particular about
00:33:56
MCMC is that those moves are
00:33:59
typically both local and they want to
00:34:01
go to a place of high probability. So
00:34:04
at the end of the day you end up
00:34:05
walking near the modes of the distribution
00:34:07
and spending more time where
00:34:10
probability is higher. That's
00:34:11
the deal: as we
00:34:13
run these Markov chains we end up
00:34:14
spending more time where the probability
00:34:17
is higher, in fact proportionally,
00:34:18
exactly, to the probability. But
00:34:22
there's a problem. When
00:34:24
the model is kind of agnostic, initially,
00:34:27
your model puts sort of uniform
00:34:29
probability everywhere. And then it
00:34:31
gets to put more and more probability mass
00:34:34
around where the data is. And initially,
00:34:37
you know, these modes are kind of
00:34:40
smooth, and you can still travel
00:34:44
between those modes without having to
00:34:46
go through a zero probability region.
00:34:48
But as the model gets sharper, in other
00:34:50
words as it gets more
00:34:52
confident about which configurations are
00:34:54
probable and which are not, like the
00:34:57
things in between the modes (for example,
00:35:00
maybe this is one category and this is
00:35:01
another category and there shouldn't be
00:35:03
anything in between), then what happens
00:35:05
is that those Markov chains get trapped
00:35:08
around one mode and they can't
00:35:10
easily jump from one mode to another
00:35:11
mode. And what it means is that if we
00:35:15
start somewhere we're gonna stay around
00:35:16
that region and we can't visit the rest,
00:35:19
and we don't get really representative
00:35:20
samples. If we don't get representative
00:35:22
samples then our training suffers. So
00:35:25
those models are able to learn
00:35:27
distributions of some level of
00:35:30
complexity; if we try to learn more
00:35:32
complex distributions it just stalls.
00:35:34
We haven't been able to yet. Maybe
00:35:37
we'll find solutions to that, but for
00:35:38
now it remains an open problem as far
00:35:41
as I'm concerned. One glimmer of hope
00:35:45
comes from experiments that we ran a
00:35:48
few years ago, where we found that
00:35:50
although sampling in the input space
00:35:54
with these MCMC methods is hard, if
00:35:59
instead of running this Markov chain in
00:36:01
the raw input space, like pixels, we
00:36:05
first map the data to a high-level
00:36:07
representation (because, let's say, we've
00:36:09
trained a bunch of autoencoders or a
00:36:10
bunch of RBMs, so we now have a
00:36:13
way to map the input data to a
00:36:16
better representation, the kind of
00:36:18
representation we learn with neural
00:36:19
nets typically) and now if we run the
00:36:21
Markov chain in that space, it turns out
00:36:24
that it mixes much better between
00:36:26
the modes. So we've been trying to
00:36:28
understand that, and I have a picture
00:36:29
here, hopefully, which helps to
00:36:31
understand what is going on. So in
00:36:35
pixel space, really input space,
00:36:38
the data concentrates on some manifold.
00:36:40
Like here, this is a cartoon,
00:36:42
obviously. Let's say this is the
00:36:44
manifold of nines and this is
00:36:46
the manifold of threes. And these
00:36:47
manifolds are very, very thin, they occupy
00:36:49
very small volume and they're well
00:36:51
separated from each other, and so it's
00:36:53
hard to mix between two categories, for
00:36:55
example. But what happens is, as you
00:36:58
map the data to these higher
00:36:59
level (not higher dimensional, higher
00:37:01
level) spaces that are, you know,
00:37:04
learned somehow to capture the
00:37:07
distribution, like with autoencoders, the
00:37:12
relative volume occupied by the data in
00:37:16
that space is larger than in the
00:37:18
original space. And the different
00:37:20
manifolds get closer to each other. So
00:37:22
now it becomes easier to jump from one
00:37:24
to the other. And there's something
00:37:26
else that happens, which is that whereas the
00:37:29
manifolds in the original space are
00:37:33
highly curved and complicated, when you
00:37:36
go to these learned spaces, using
00:37:39
unsupervised learning, those manifolds become
00:37:42
flat. So to try to understand what I
00:37:44
mean by flat manifold think about a
00:37:47
curved manifold. So let's say the data
00:37:48
concentrates in input space on this
00:37:51
curved manifold. Now I take two examples,
00:37:55
like the image of a nine here and an
00:37:57
image of a three here, and I linearly
00:38:00
interpolate between them and I look at
00:38:01
points in between and try to
00:38:03
visualise what they look like. So this
00:38:04
is what we did: you have a nine here, you
00:38:06
have a three here. You do linear
00:38:08
interpolation in pixel space, and of course
00:38:10
what you get is the addition of,
00:38:12
you know, a three and a nine, which doesn't
00:38:13
look like either a three or a nine. Like,
00:38:16
take two random images, natural
00:38:18
images, add them up, and you get
00:38:19
something that doesn't look like a
00:38:20
natural image. So what it means is that
00:38:24
if I take two images and I interpolate,
00:38:26
the stuff in between is not
00:38:29
on the manifold, because the manifold is
00:38:32
not flat. If the manifold was flat, when
00:38:35
I do linear interpolation the things in
00:38:37
between would look like natural images. And
00:38:40
this is actually one of the tests that
00:38:42
we use with a new unsupervised feature
00:38:46
learning algorithm to see whether it
00:38:50
has done a good job or not of
00:38:52
unfolding the manifold: we take two images,
00:38:55
we map them to the representation space,
00:38:57
we do a linear interpolation and we
00:38:58
look at the things in between in
00:39:01
pixel space. Right, so we can go back and
00:39:03
forth between input space and
00:39:05
representation space. And so we can
00:39:07
interpolate in the
00:39:10
representation space, in the H space, and
00:39:12
then, you know, use the decoder to map back
00:39:17
to pixel space and visualise. So here
00:39:21
we see what happens when we do it by
00:39:24
looking at the first layer of a
00:39:26
stack of autoencoders, of
00:39:28
denoising autoencoders, here, and here the
00:39:30
second layer. And what we find is that
00:39:32
the higher we go, the better, you know,
00:39:35
flattening is happening. So what is
00:39:38
going on now is that after just the
00:39:40
second layer we can interpolate
00:39:42
between this nine and a three and
00:39:43
everything in between makes sense and
00:39:45
looks like a digit. And there's a
00:39:47
point here where it suddenly jumps very
00:39:49
fast from the nine to the three. Okay, this
00:39:52
guy is just on the border
00:39:53
of being, you know, between a three and a nine:
00:39:55
in just a few pixels you go from nine
00:39:57
to three, right. So it has really found a
00:39:59
way to make these manifolds very close
00:40:01
to each other. And the path now, on the
00:40:04
straight line, goes
00:40:06
exactly to the right place and never
00:40:08
goes through something that doesn't
00:40:11
look like a natural image. You guys have a
00:40:14
question about this? Yes. Well, so, okay, so
00:40:23
I have an image, I have an encoder
00:40:26
which maps it to a vector. I've got another
00:40:30
image, a three, I get another vector.
00:40:33
Two vectors, okay. Now I can take a
00:40:36
linear interpolation, so alpha times
00:40:38
the first one plus one minus alpha
00:40:39
times the second one, where alpha lies
00:40:41
between zero and one. So for
00:40:43
example half of this one plus half of
00:40:45
this one, that would be right in the
00:40:47
middle, okay. So that gives me another
00:40:49
vector. And then I map it back to
00:40:52
pixel space, because I have, you know, a
00:40:53
two-way mapping here from input to
00:40:55
representation and back. Yeah, because it
00:41:07
tells us whether the manifold is being
00:41:10
flattened or not. But why would that be
00:41:12
a good thing? Well, I think it's very
00:41:15
clear: if you have a flat
00:41:17
manifold you can now basically
00:41:21
combine existing examples, you know,
00:41:24
to predict, you know, what would
00:41:25
happen. So I'll show you an example
00:41:26
later of what you can do. Here's what you
00:41:36
can do: you take a man with glasses, you
00:41:40
subtract the vector for a man without
00:41:42
glasses, you add the vector for a woman
00:41:44
without glasses, and you get a woman
00:41:45
with glasses. Yeah, I mean, whatever
00:42:01
processing you wanna do, you really want
00:42:03
to do it in this linear space where you
00:42:04
can just do simple linear operations in
00:42:06
order to change things. Right, like, oh, I
00:42:10
decide that I don't wanna have
00:42:11
glasses, I just, you know, move in some
00:42:14
direction. Yeah. Yes, you're right, but
00:42:23
linear is the simplest thing we can think
00:42:25
of. And so here's another way to think
00:42:28
about why linear is good, I mean
00:42:30
why flat is good. Because let's say I
00:42:34
wanted to capture the distribution,
00:42:35
I wanted to do maximum likelihood, right. So I
00:42:37
wanna capture the distribution, and if
00:42:41
the distribution is like this, it's gonna be
00:42:43
difficult to model it. I'm gonna
00:42:44
need, like, if I do it with a Gaussian
00:42:47
mixture, like, many components to,
00:42:50
you know, go through this. If it's one
00:42:51
flat thing, one Gaussian, bam, I've got the
00:42:54
density at once. So think about it: once
00:43:05
you know the density of the data you
00:43:07
can answer any question about it in the
00:43:09
language of the variables that you've
00:43:10
defined. Yes, I'm not saying that the
00:43:23
world is a big Gaussian. But if we can,
00:43:26
you know, map a lot of it to simple
00:43:30
distributions then we can answer questions,
00:43:32
right. So yeah, you're right, maybe. For
00:43:34
example, let me give you an
00:43:36
example in the direction you're
00:43:38
talking about. If you actually have
00:43:41
multiple categories, presumably you're
00:43:43
not gonna get a single Gaussian
00:43:44
that captures all the categories, you
00:43:46
probably wanna have, like, a different,
00:43:48
you know, Gaussian for each category.
00:43:50
And so the right model
00:43:52
wouldn't be a single Gaussian, because
00:43:53
we want to somehow capture the fact
00:43:56
that we have these clusters. So yeah,
00:43:58
but the point is it's gonna be much
00:44:00
easier to model the data, answer
00:44:01
questions, reason, if we can flatten the
00:44:05
manifolds. But it is something that can
00:44:09
be argued. No, but it's not about the
00:44:29
structure of the space it's about the
00:44:30
structure of the distribution. If the
00:44:32
data has to be, you know, along that
00:44:35
subspace, if it has a
00:44:39
complicated shape, it's hard to capture
00:44:42
what's going on, it's hard to make
00:44:43
predictions, it's hard to reason about
00:44:45
it. If everything becomes linear
00:44:47
it's much easier to reason about.
00:44:49
That's all. Let me move on,
00:44:51
because I only have fifteen minutes
00:44:52
left and lots of things that I would like
00:44:56
to talk about, but I'll do it quickly. So I
00:44:58
mentioned autoencoders already, right, so
00:45:00
it's just basically a two-way mapping
00:45:02
from input space to representation space and
00:45:04
back. And we can learn them in various
00:45:06
ways. And we can have a probabilistic
00:45:09
version where the encoder is
00:45:11
actually a conditional distribution. So
00:45:14
it's not just a function, we actually
00:45:17
inject some kind of noise here and we
00:45:19
get a sample of H given any particular
00:45:21
X. And similarly the decoder can itself be a
00:45:25
conditional distribution, so given some
00:45:26
H from some distribution, which we call the
00:45:28
prior distribution, we are getting
00:45:31
X's from P of X given H. So
00:45:35
these two guys actually represent a
00:45:38
joint distribution over X and H, and
00:45:41
these two guys, in a different manner,
00:45:43
also correspond to a joint distribution.
00:45:45
So I mentioned a little earlier that
00:45:50
these MCMC methods and
00:45:52
classical ways of parametrizing
00:45:55
probability distributions kind of
00:45:57
hit a wall. So we explored other ways of
00:46:00
doing this, and the general theme of
00:46:03
this is: let's bypass all of these
00:46:06
normalisation constants and so on,
00:46:09
and learn generative black boxes. So
00:46:14
if we specify the problem of
00:46:17
unsupervised learning as: build a
00:46:18
machine that can generate, say, images
00:46:20
(which is something we can discuss, but
00:46:22
let's say we define it like this),
00:46:24
then let's just train a neural net
00:46:26
that, you know, takes in random numbers
00:46:28
and outputs images, right. We can of
00:46:30
course train a different kind of neural
00:46:32
net which may have, like, different
00:46:34
inputs, and then we can have, you know,
00:46:35
given for example some sentence, I would
00:46:37
like to generate an image that corresponds.
00:46:39
That's just a variation, right. Once
00:46:40
you're able to train
00:46:43
a neural net that does this kind of
00:46:44
thing then you can do all kinds of
00:46:45
other fun things. So that's one
00:46:47
variant, and I'll tell you about this,
00:46:49
called the generative adversarial
00:46:51
nets, and they are very hot these days.
00:46:54
Another variant, which for now has been
00:46:57
less explored, is that, alright, we're
00:47:00
not gonna generate in one go, we're
00:47:02
gonna generate through a sequence of
00:47:03
steps. It's gonna be like a recurrent
00:47:05
net. So it's gonna have a state. We
00:47:08
throw in some random numbers and then
00:47:09
at each point in the sequence we generate
00:47:12
a sample. And as we do more of these
00:47:14
steps the samples look nicer. So this
00:47:16
kind of imitates the Markov chain,
00:47:18
but now we're gonna learn the so-called
00:47:21
transition operator of the Markov chain,
00:47:23
the black box that goes from one state to the
00:47:25
next state, generates some samples, and takes in,
00:47:29
you know, some random numbers so that
00:47:31
we get a different thing each time. So
00:47:33
this is just a kind of stochastic
00:47:34
dynamical system that generates the
00:47:36
things we want, and I called these
00:47:39
things generative stochastic networks.
00:47:41
Alright, and then we can do all kinds of
00:47:44
math about these things and actually
00:47:46
show that you can train them. So this is
00:47:49
totally different from the
00:47:50
classical approach of undirected
00:47:52
graphical models. I'm gonna skip some things.
00:47:56
Let me tell you about the denoising
00:47:57
autoencoder, which is related to this
00:48:02
and of course to autoencoders in general. So
00:48:04
it's a particular kind of autoencoder
00:48:06
where (I think I had it, yeah, here) in the
00:48:11
denoising autoencoder what we do is
00:48:15
we minimise the reconstruction error,
00:48:17
but instead of giving the raw input
00:48:19
here we give a corrupted input. For
00:48:22
example we hide some of the inputs,
00:48:23
we set them to zero, or we add some
00:48:26
Gaussian noise, or whatever we want. We
00:48:27
can also inject noise here, but the
00:48:28
traditional thing is we inject noise
00:48:30
here. And the error we're minimising
00:48:33
here is like some kind of log
00:48:34
likelihood of reconstruction, so the
00:48:35
probability of the clean input given
00:48:39
the code. And that's
00:48:42
a denoising autoencoder, and it's
00:48:45
probably the one that's been best
00:48:46
studied mathematically, and we
00:48:47
understand better its probabilistic
00:48:49
interpretation. So here's a picture of
00:48:54
what's going on. Let's say the data is
00:48:55
concentrated on this manifold. So the
00:48:58
X's here are training points. So what we
00:49:00
do is we take a
00:49:01
training point, we corrupt it, and we get,
00:49:03
you know, something like this, like the
00:49:04
right thing, and then we ask the neural
00:49:06
net to go back and reconstruct
00:49:09
the original. Now of course it may not
00:49:11
be able to do it perfectly, because
00:49:12
maybe the original could've been here,
00:49:13
here, here. And so in general it's gonna
00:49:16
point right at the manifold if it
00:49:18
learns well. And so it learns this kind
00:49:20
of vector field which points towards
00:49:22
the data. And you can actually do
00:49:24
experiments on 2-D data. And so
00:49:27
let's say the data are these yellow
00:49:29
circles, and the, you know, autoencoder
00:49:30
learns these arrows. These arrows
00:49:32
correspond to, you know, where it wants
00:49:35
to go: if you start here it wants to go in
00:49:37
this direction. So the reconstruction,
00:49:39
you know, is pointing in this direction,
00:49:40
so this is proportional to
00:49:42
reconstruction minus input. And in fact
00:49:46
we can prove that if you train
00:49:49
this well, essentially where this
00:49:51
converges is that the reconstruction
00:49:53
minus the input, so the same thing
00:49:57
as in here, actually estimates
00:49:59
what's called the score, d log p(x)/dx, which points
00:50:02
towards the direction of the gradient of
00:50:05
the density, the direction in which the
00:50:06
density increases the most. Right, so if
00:50:08
you're sitting here, where you
00:50:11
wanna go to increase probability the
00:50:13
most is towards the manifold. This is
00:50:15
the gradient of the likelihood, right.
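A toy check of the "reconstruction minus input estimates the score" statement, under strong assumptions that are not in the talk: the data are one-dimensional Gaussian, the corruption is additive Gaussian noise, and the "autoencoder" is just a single least-squares slope. In that setting the quantity (reconstruction − input) / noise variance should match the score of the noise-smoothed density.

```python
import random

def fit_linear_denoiser(clean, noisy):
    """Least-squares slope a minimising sum of (a * noisy - clean)^2."""
    num = sum(n * c for n, c in zip(noisy, clean))
    den = sum(n * n for n in noisy)
    return num / den

rng = random.Random(0)
data_std, noise_std = 2.0, 0.5  # hypothetical settings
clean = [rng.gauss(0.0, data_std) for _ in range(50_000)]
noisy = [c + rng.gauss(0.0, noise_std) for c in clean]
slope = fit_linear_denoiser(clean, noisy)

def estimated_score(x):
    """(reconstruction - input) / noise variance for the fitted denoiser."""
    return (slope * x - x) / noise_std ** 2

def smoothed_score(x):
    """d/dx log N(x; 0, data_std^2 + noise_std^2), the noise-smoothed density."""
    return -x / (data_std ** 2 + noise_std ** 2)
```

The arrows point back towards the mode at zero from either side, the one-dimensional analogue of the vector field pointing towards the manifold.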
00:50:16
"'cause" there's a peak of probability
00:50:18
that should be here, and then probability
00:50:20
should go down fast as you move away,
00:50:22
in fact should be zero. But if you smooth a
00:50:24
bit you're gonna get this. So there's a
00:50:27
lot of papers that try to understand
00:50:30
this and also show that with these
00:50:34
denoising autoencoders you can
00:50:35
sample from them: you can define a
00:50:38
Markov chain that corresponds to
00:50:41
sampling from the model that has been learned.
00:50:43
So you can actually, once you've trained
00:50:45
the denoising autoencoder, just
00:50:47
apply the corruption, apply the
00:50:50
stochastic reconstruction (in other
00:50:51
words you're gonna sample from the output
00:50:54
distribution rather than have a
00:50:56
deterministic function), and then you do
00:50:58
it again and again. And this Markov
00:51:00
chain will converge to what the model
00:51:02
sees, where the model would put
00:51:04
probability mass. So in terms of this
00:51:06
picture what it means is that, you know,
00:51:08
if you follow this arrow and you add a
00:51:10
bit of noise, and follow the arrow and
00:51:11
add a bit of noise, you will
00:51:13
kind of move more or less in that
00:51:14
direction and then you'll start moving
00:51:16
around this thing, 'cause there are no
00:51:19
arrows going away from this, there are no
00:51:22
arrows going this way. But if
00:51:25
you step a little away it brings you back,
00:51:28
right. So there is a bit of noise, it
00:51:29
makes you move around like a random
00:51:30
walk, and you're gonna stay on that, you
00:51:32
know, random walk. Let me skip a few more
00:51:37
things. So there is another kind of
00:51:42
autoencoder with a probabilistic
00:51:43
interpretation that has really made it
00:51:45
big in the last few years, and it's
00:51:47
called the variational autoencoder.
00:51:50
And it's a very, very beautiful theory
00:51:56
that's behind this, and very simple
00:51:57
actually, where we think about two
00:52:04
distributions. There's a directed
00:52:08
model, which is supposedly the one that
00:52:09
we wanna train, which we can decompose
00:52:12
into the prior on the top level and
00:52:14
then conditionals (well, potentially;
00:52:16
usually there's only one stage actually,
00:52:18
you have X given H). And on the other
00:52:22
hand, so this is like the decoder path, and
00:52:24
we're gonna have an encoder path, and
00:52:26
the encoder goes exactly in the
00:52:27
other direction, but it's
00:52:29
stochastic and it has this Q
00:52:31
distribution, Q of H given X. So the X comes
00:52:34
from the data distribution, which by
00:52:36
convention I would like to write Q of X,
00:52:39
and this way this defines a joint Q of X and H,
00:52:43
and this defines a joint P of X and H,
00:52:45
and essentially the training objective
00:52:48
is to make these two distributions
00:52:50
match in the KL, Kullback-Leibler,
00:52:53
sense. And it turns out that this is
00:52:57
pretty much tractable; it
00:52:59
doesn't involve, like, running a
00:53:01
Markov chain. You can change the
00:53:03
parameters of both the encoder and the
00:53:05
decoder so that, you know, the
00:53:09
joint distributions that they
00:53:11
capture are as close to each other as
00:53:12
possible. In particular, if the joint of
00:53:15
this and the joint of this match well,
00:53:17
then in particular the marginals match, so
00:53:19
Q of X, which is the data distribution,
00:53:21
matches P of X, which is the marginal here.
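Why matching the joints forces the marginals to match can be seen from the chain rule of the KL divergence: KL(Q(x,h) ‖ P(x,h)) = KL(Q(x) ‖ P(x)) + E over Q(x) of KL(Q(h|x) ‖ P(h|x)), so the joint KL upper-bounds the marginal KL. The sketch below checks this on made-up two-valued distributions; all the numbers and names are hypothetical, and the direction of the KL (Q first) is an assumption.

```python
import math

def kl(q, p):
    """Kullback-Leibler divergence KL(q || p) between two discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Hypothetical toy numbers: x and h each take two values.
q_x = [0.7, 0.3]                        # data distribution Q(x)
q_h_given_x = [[0.9, 0.1], [0.2, 0.8]]  # encoder Q(h | x), indexed [x][h]
p_h = [0.5, 0.5]                        # prior P(h)
p_x_given_h = [[0.8, 0.2], [0.3, 0.7]]  # decoder P(x | h), indexed [h][x]

# Both joints are flattened in the same order: x outer, h inner.
q_joint = [q_x[x] * q_h_given_x[x][h] for x in range(2) for h in range(2)]
p_joint = [p_h[h] * p_x_given_h[h][x] for x in range(2) for h in range(2)]

# Marginal and posterior implied by the decoder side.
p_x = [sum(p_h[h] * p_x_given_h[h][x] for h in range(2)) for x in range(2)]
p_h_given_x = [[p_h[h] * p_x_given_h[h][x] / p_x[x] for h in range(2)]
               for x in range(2)]

joint_kl = kl(q_joint, p_joint)
marginal_kl = kl(q_x, p_x)
posterior_gap = sum(q_x[x] * kl(q_h_given_x[x], p_h_given_x[x]) for x in range(2))
```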
00:53:23
But you never need to express Q of X
00:53:23
directly. So this relies on what's
00:53:29
called a variational bound, in which
00:53:32
the log P of X, which is already
00:53:36
intractable, is bounded by a
00:53:39
tractable quantity which involves
00:53:40
sampling from Q and measuring the P
00:53:46
of X given H, something that we can
00:53:48
compute. I'm not gonna go into the
00:53:51
details because I only have a few
00:53:52
minutes left, and I'll skip a few things here.
00:53:56
So there's a recurrent variant
00:53:58
of this that has been proposed, called
00:53:59
DRAW, which will do fun things like
00:54:04
generate not in one go but generate
00:54:07
through a sequence of steps, for example
00:54:10
draw a three here by moving a little
00:54:12
cursor and changing its position and
00:54:15
size and drawing ink in the middle of
00:54:18
it. And, you know, it's basically gonna draw
00:54:21
the thing you want. So it works
00:54:23
really well for MNIST digits. You can
00:54:27
do it also on SVHN, that's
00:54:30
Street View House Numbers. These are
00:54:34
actually training examples from this
00:54:35
data set, and these are the kinds of
00:54:37
samples you're getting, so these are
00:54:38
really good. You know, before DRAW we
00:54:42
had no algorithm that could
00:54:46
draw things like this that look so
00:54:48
realistic. Now that's digits; the next
00:54:53
step was images, natural images like
00:54:56
ImageNet. So for this the
00:54:59
algorithm that really made it big is the
00:55:02
generative adversarial network that I
00:55:03
mentioned earlier. And it's based
00:55:07
on a very simple intuition. You're gonna
00:55:08
train two neural nets: one, which is
00:55:13
gonna be the one we want to use at the
00:55:14
end, the generator (and so as I said
00:55:16
before it's a black box that takes a
00:55:18
random vector and outputs a fake image, a
00:55:21
generated image). But we're also gonna
00:55:23
train a discriminator network, a
00:55:26
classifier. And you can
00:55:28
think about this discriminator as a
00:55:30
trained loss function. So normally the
00:55:33
loss function is something fixed. But
00:55:35
here, like in some reinforcement learning
00:55:37
setups, we are gonna learn a loss
00:55:39
function, and the loss function is
00:55:41
basically one that is trying to
00:55:44
discriminate between the fake images
00:55:46
generated by our model and the real
00:55:48
images coming from the training set. So
00:55:51
you know, this guy is just doing
00:55:52
normal classification. And the way we
00:55:54
train this guy is that the generator is
00:55:56
trying to fool the discriminator, in other
00:55:59
words it's trying to produce an output
00:56:02
that maximises the probability that
00:56:05
what it sees is classified as a real
00:56:08
image. And so we take the output
00:56:10
probability here and we just backprop
00:56:12
into the generator. So that's the basic
00:56:17
idea. So, you know, during training,
00:56:21
when we train the discriminator we show it
00:56:23
training examples: sometimes a real training
00:56:24
image, and, you know, we tell
00:56:27
the discriminator that it should output
00:56:29
one; and sometimes we instead give it the
00:56:34
output of the generator and we tell the
00:56:36
discriminator it should output zero. But
00:56:38
then the way we train the generator is
00:56:40
that we take the probability of being
00:56:43
a one that this one is producing when
00:56:45
the input comes from the generator and
00:56:48
we try to maximise it. So we're making
00:56:50
this guy produce the wrong answer,
00:56:53
trying to fool the discriminator. And
00:56:56
there's been a number of papers
00:56:58
including one famous one with Soumith,
00:57:02
where these kinds of models have been
00:57:05
used to generate images that were more
00:57:07
realistic than, you know, any of the
00:57:09
methods that were previously known
00:57:12
to generate images. So these are the
00:57:17
kinds of images that were generated. And
00:57:22
you could also in this case look at how
00:57:25
the image was generated,
00:57:27
going from low resolution to high
00:57:29
resolution, and sort of see how it's
00:57:31
filling in details. And then there was
00:57:35
another paper last year, not long
00:57:38
ago, maybe six months ago. If
00:57:41
you don't know yet, you know, on arXiv
00:57:43
this is the year and the month, and
00:57:45
then the numbers, you know, increase as
00:57:48
you put in more papers. So this is
00:57:52
just a variant of the GAN which uses
00:57:54
convolutions in a smart way. And
00:57:58
these guys are pretty difficult to
00:58:00
train, but when you succeed in training
00:58:02
them they can, you know, provide very
00:58:03
realistic images. These are the
00:58:05
kinds of images that were generated by
00:58:07
the model. Okay, so this is, you know, this
00:58:10
blew everybody's mind. And you
00:58:13
could play games like what I told you
00:58:15
before: you can work in the
00:58:18
representation space and do arithmetic
00:58:20
with those vectors, and do things
00:58:23
like I showed before, right. So the
00:58:25
kinds of things people have been doing with
00:58:26
words you can do with images. There
00:58:29
is a new paper coming from my
00:58:34
group (I'm not one of the authors) where
00:58:38
we combine some of the ideas from the
00:58:40
variational autoencoder and GAN. We have two
00:58:43
models: one that goes from input to a
00:58:46
latent space, one that goes from latent
00:58:48
space to input, so this is like, you know,
00:58:50
the encoder and the decoder. And we have
00:58:53
a GAN discriminator that looks at both
00:58:55
the input and the latent and tries to
00:58:57
figure out if it
00:59:00
comes from this guy or from this guy,
00:59:03
right. And these are the kinds of
00:59:06
images we're generating from this. It's
00:59:08
hard to, you know, quantify, unfortunately;
00:59:12
that's one of the problems with these
00:59:14
things. Okay, so I think I'm gonna
00:59:16
stop here I had an A Whole bunch of
00:59:19
other slides in my presentation that
00:59:21
I'll I'll make available where I talked
00:59:23
about a mural autoregressive models a
00:59:26
special case of which is the recurrent
00:59:29
nets which can be used to generate. So
00:59:32
you know we carried nets actually our
00:59:34
data models you can use them to
00:59:36
generate a sequence of of things. And
00:59:39
more recently this was used to generate
00:59:42
images as well, so this is the PixelRNN
00:59:44
paper, which was just presented
00:59:46
at the last ICML a couple of weeks ago
00:59:49
and got a best paper award. And they
00:59:51
also are able to generate pretty nice
00:59:53
images, and people are getting excited
00:59:54
about it. But basically you're just
00:59:56
generating one pixel at a time,
00:59:58
conditioned on the other pixels. I don't
01:00:00
really like the philosophy of this
01:00:01
because we've gotten rid of the latent
01:00:02
variables. But, well, it works, so, you
01:00:05
know, we're scientists and we have to
01:00:06
face reality and try to adjust: you
01:00:09
know, what is it that we were missing
01:00:11
from the other approaches that makes
01:00:12
this work quite well? So that's where
01:00:15
we are, and thank you for your attention.
01:00:17
more questions please no five minutes
01:00:39
it's just me talk about a I think the
01:00:48
menu for yes and I say it's somehow
01:00:52
related to this intended in the I was
01:00:54
related to what sorry this and finding
01:00:57
the factors that absolutely yes. We can
01:01:00
definitely see it I saw about take yeah
01:01:05
yes, right. It's like, if there is, in
01:01:10
this flattened manifold now, you can
01:01:12
think of it like there is a direction
01:01:15
corresponding to glasses, there's a
01:01:17
direction corresponding to male/female.
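As a rough illustration of attribute directions in a latent space, the sketch below builds toy latent codes and estimates a "glasses" direction as the difference between the mean code with the attribute and the mean code without it. The difference-of-means estimate, the toy data, and every name here are assumptions for illustration; the talk does not say how the directions were obtained.

```python
import numpy as np

rng = np.random.default_rng(0)
z_dim = 16

# Toy latent codes: the attribute is planted along axis 0 (illustration only).
true_dir = np.zeros(z_dim)
true_dir[0] = 1.0
z_without = rng.normal(scale=0.1, size=(50, z_dim))
z_with = rng.normal(scale=0.1, size=(50, z_dim)) + 2.0 * true_dir

# Estimate the attribute direction as a difference of mean codes.
glasses_dir = z_with.mean(axis=0) - z_without.mean(axis=0)

def add_attribute(z, direction, amount=1.0):
    """Move a latent code along a direction; negative amounts remove it."""
    return z + amount * direction

z_face = z_without[0]
z_edited = add_attribute(z_face, glasses_dir)                  # "add glasses"
z_removed = add_attribute(z_edited, glasses_dir, amount=-1.0)  # undo the edit
# A trained decoder (not shown) would map z_edited back to an image.
```

The same vector is added to any code, which is what "works in general" means here: the edit does not depend on the particular image.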
01:01:19
And then you can do arithmetic, you know,
01:01:21
kind of independently, you know, add more
01:01:23
or less of these things, whereas in the
01:01:25
pixel space there's not like a
01:01:27
direction in pixel space that, you know,
01:01:28
removes glasses or changes, you know,
01:01:31
gender. This is just not possible. I mean
01:01:33
it would work for a particular image but
01:01:35
not in general, whereas this would work
01:01:37
in general. So it has taken the image
01:01:40
manifold, which is really twisted
01:01:42
and complicated, into something flat where
01:01:43
directions have meaning. Yeah, that's
01:01:48
what we were aiming for. So I have a
01:01:54
question about the adversarial image
01:01:57
generation. Yes. So in the case where
01:02:00
you generate fake
01:02:03
images that are indistinguishable from
01:02:07
the real images... Well, that
01:02:09
never happens because we are not that good
01:02:10
yet. Yeah, so that's my question still: do
01:02:13
you have a like oh I was asking the
01:02:16
same question does seat yesterday. So
01:02:18
do we have, like... We're not able to make
01:02:20
the discriminator reach fifty percent
01:02:22
error; it always stays a bit better than
01:02:25
fifty, you know, sixty or something.
01:02:27
So the question is speculative: do we
01:02:30
have that guarantee that in the...
01:02:32
we will not be able to generate
01:02:35
something that looks indistinguishable
01:02:38
from the images? Well, if we do, we're
01:02:40
done. I mean, if the discriminator is
01:02:44
completely fooled, and we put as much
01:02:45
capacity as we can in it, that means
01:02:48
we're finished: we have a machine that
01:02:49
generates real images. Yeah, but this is
01:02:53
real with
01:02:55
respect to the discriminator network,
01:02:57
because if you bring... Sure. So the,
01:02:59
you know, the whole of statistical
01:03:01
machine learning is based on this
01:03:04
idea of nonparametric approaches,
01:03:09
where you say: let's imagine that the
01:03:12
amount of data grows and my capacity
01:03:15
grows accordingly. What would happen
01:03:18
in the limit? And here we can show what
01:03:20
happens in the limit: it's gonna learn the
01:03:22
distribution. Not whether it's gonna be
01:03:23
feasible from an optimisation point of
01:03:25
view. Also, there's something really
01:03:26
funny going on here, which is that in
01:03:29
normal machine learning we have a
01:03:30
single objective function; here we have
01:03:32
a funny, you know, game, where each of
01:03:36
these two guys optimises a different
01:03:38
objective function. So in theory there
01:03:42
is a solution to the game, but it's not
01:03:45
simply minimising an objective
01:03:46
function. But in the
01:03:49
paper you'll see a lot of theory
01:03:51
about what happens asymptotically, and
01:03:53
in principle it should learn the
01:03:55
distribution. Okay, thanks. Thank you very
01:04:05
much for the great talk. I have a
01:04:07
question about the manifold in the
01:04:09
representation space. Yeah. So you use, like,
01:04:11
visualisation to
01:04:13
understand, to see if the linear
01:04:14
interpolation looks like the images on
01:04:16
the manifold. Yes. Is there another
01:04:18
way to characterise this manifold,
01:04:20
like the shape or the volume, to use
01:04:22
another approach? I'm sure there are
01:04:24
many ways that we could use to figure
01:04:26
out what is going on. I think we're just
01:04:28
starting to play with those toys and
01:04:31
having a lot of fun. But there's
01:04:34
so much we don't understand. And
01:04:36
visualisation has been useful from the
01:04:37
beginning here, and I think, you know, it
01:04:40
could have even more of a role in the
01:04:41
future. So we're doing things
01:04:44
like, you know, generating a plane, you
01:04:48
know, interpolating in the latent
01:04:50
space and seeing what happens in input space,
01:04:52
but we could probably do, you know, more
01:04:55
to try to figure out what is going on.
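The interpolation mentioned in this answer can be sketched as follows, assuming a toy decoder with random weights in place of a trained generator. The names `interpolate` and `decode` and all dimensions are hypothetical.

```python
import numpy as np

def interpolate(z_a, z_b, steps=5):
    """Return `steps` points on the straight line from z_a to z_b, inclusive."""
    t = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - t) * z_a + t * z_b

rng = np.random.default_rng(0)
z_dim, x_dim = 4, 6
W = rng.normal(size=(z_dim, x_dim))  # stand-in decoder weights (untrained)

def decode(z):
    return np.tanh(z @ W)

z_a = rng.normal(size=z_dim)
z_b = rng.normal(size=z_dim)
path = interpolate(z_a, z_b, steps=5)  # points in latent space
images = decode(path)                  # one decoded output per step
```

With a trained model one would display the rows of `images` side by side and check whether the intermediate outputs still look like plausible data.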
01:04:57
Yeah, I'm wondering, when you do these
01:05:02
interpolations systematically, relative to the
01:05:04
dimensionality of those representations,
01:05:06
does it work better if you have a
01:05:07
compressed representation or an expanded one? So
01:05:12
it depends on the kinds of algorithms
01:05:14
you're using. The variational autoencoders,
01:05:17
they tend to compress in some sense,
01:05:20
like throw away dimensions, too much
01:05:23
actually, and it's a bug that we
01:05:24
understand. I mean, it's something we
01:05:25
don't like: it's doing it too much.
01:05:29
With things like, you know, denoising autoencoders,
01:05:30
you can have many more
01:05:32
dimensions; that's okay, it actually doesn't
01:05:34
hurt. For the GANs, usually we,
01:05:40
you know, we keep the representation
01:05:41
space pretty high dimensional but not
01:05:44
as high dimensional as the input,
01:05:45
because they're typically images and,
01:05:47
yeah, there's probably a lot of
01:05:48
redundancy. Yeah, you don't
01:05:52
want those spaces
01:05:54
to have too small a dimension. If
01:05:55
you go for like two or three dimensions
01:05:57
it just doesn't work that well. You
01:06:00
can get something like MNIST with
01:06:01
three dimensions; you can see
01:06:03
things that are reasonable. But it's
01:06:05
not nearly... I mean, you can't do natural
01:06:07
images, and even for MNIST it wouldn't be
01:06:09
as nice as if you have a hundred
01:06:11
dimensions. Maybe you should take one
01:06:14
more question, then for the coffee to do
01:06:18
the questions about it. So, behind you.
01:06:24
Yeah, my question is: how are we sure that
01:06:30
the generator network is not just
01:06:34
throwing out images from the training
01:06:37
set, right? Right, it's absolutely
01:06:39
a valid concern, and we can do
01:06:44
some things to try to make sure it's
01:06:46
not. So, for example, a typical thing that
01:06:48
we do is we find the nearest
01:06:52
neighbour in, you know, Euclidean distance in
01:06:54
the training set to a generated image. So
01:06:56
we generate an image, and then we
01:06:58
wanna check: is this just a copy of
01:07:00
a particular training
01:07:01
example? So if there was a very, you know,
01:07:05
similar nearest neighbour in the
01:07:06
training set to this generated image,
01:07:08
then we would know that the network has
01:07:10
just memorised this. So that's one trick.
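The nearest-neighbour check described here can be sketched as below, assuming images are stored as NumPy arrays of equal shape. The function name and the random toy data are hypothetical; a very small distance would suggest the generated image is a copy of a training example.

```python
import numpy as np

def nearest_training_example(generated, train_set):
    """Return (index, distance) of the training image closest in Euclidean distance."""
    flat_train = train_set.reshape(len(train_set), -1)
    diffs = flat_train - generated.reshape(1, -1)
    dists = np.sqrt((diffs ** 2).sum(axis=1))
    idx = int(np.argmin(dists))
    return idx, float(dists[idx])

rng = np.random.default_rng(0)
train = rng.random((100, 8, 8))  # toy "training set" of 8x8 images
copied = train[17].copy()        # simulates a memorised sample
novel = rng.random((8, 8))       # simulates a genuinely new sample

idx_c, d_c = nearest_training_example(copied, train)
idx_n, d_n = nearest_training_example(novel, train)
```

In practice one would also look at the retrieved neighbour next to the generated image, since pixel-space distance is only a crude notion of similarity.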
01:07:13
But it is not necessarily
01:07:16
satisfactory, because maybe it's, you
01:07:18
know, it's still learning something
01:07:21
like nearest neighbours, but maybe,
01:07:22
you know, higher dimensional, you know, in
01:07:24
a higher space. But yeah, it's
01:07:27
something that we could be concerned
01:07:29
about: is this overfitting in some sense,
01:07:31
and maybe that's why we have these nice
01:07:33
images? And I don't think we have a
01:07:35
fully satisfying answer. In the case of
01:07:37
the variational autoencoder we can
01:07:38
actually measure a bound on the log
01:07:41
likelihood, so there we can actually be
01:07:42
sure that it's not overfitting because
01:07:44
we have a quantitative measure of
01:07:47
the quality of the model through an
01:07:49
approximation of the log-likelihood. Okay.
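For reference, the bound on the log-likelihood mentioned here is usually written as the variational lower bound below. The notation is an assumption, since the transcript does not show the slides: q_phi(z | x) is the encoder, p_theta(x | z) the decoder, and p(z) the prior over latents.

```latex
\log p_\theta(x) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  \;-\; \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
```

Because the right-hand side can be estimated on held-out examples, comparing its value on training and test data gives a quantitative check for overfitting.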
01:07:54
so we can things you should in for the



Conference program

Deep Supervised Learning of Representations
Yoshua Bengio, University of Montreal, Canada
4 July 2016 · 2:01 p.m.
Hardware & software update from NVIDIA, Enabling Deep Learning
Alison B Lowndes, NVIDIA
4 July 2016 · 3:20 p.m.
Day 1 - Questions and Answers
Panel
4 July 2016 · 4:16 p.m.
Torch 1
Soumith Chintala, Facebook
5 July 2016 · 10:02 a.m.
Torch 2
Soumith Chintala, Facebook
5 July 2016 · 11:21 a.m.
Deep Generative Models
Yoshua Bengio, University of Montreal, Canada
5 July 2016 · 1:59 p.m.
Torch 3
Soumith Chintala, Facebook
5 July 2016 · 3:28 p.m.
Day 2 - Questions and Answers
Panel
5 July 2016 · 4:21 p.m.
TensorFlow 1
Mihaela Rosca, Google
6 July 2016 · 10 a.m.
TensorFlow 2
Mihaela Rosca, Google
6 July 2016 · 11:19 a.m.
TensorFlow 3 and Day 3 Questions and Answers session
Mihaela Rosca, Google
6 July 2016 · 3:21 p.m.
