Thank you. So this presentation will be a bit different from yesterday's. It's more about things that are happening at the research level, and not so much about things people use to build products yet. It will also be a little more technical. And I have a bit more time than yesterday, so feel free to raise your hand and ask questions in the middle; if there are too many I'll filter, but let's try it, so we don't need to wait for the end to ask questions.
Okay, let's start with motivations. This talk is really about unsupervised learning, so why is that important? Well, you have to realise that all the great things deep learning has done in the last few years are mostly due to supervised learning, meaning that we need large labelled datasets, where humans have told the machine what the right answer should be. But that's not how humans learn most of the time. Think about how a child, a two- or three-year-old, figures out what we call intuitive physics. She understands gravity, she understands solids and liquids and all kinds of mechanical notions, of course without ever taking a class on Newtonian physics. She got it by observation; her parents didn't tell her how the world works in terms of physics. She just interacts with the world, observes, and figures out causal explanations that are sufficiently good that she can control her environment and do all kinds of things that robots can't do yet. And so we'd like computers to have that kind of ability: to observe and interact with the world in order to get better information, and to learn essentially without supervision. Now, of course, when I talk about unsupervised learning, you have to understand that in the big scheme of things we need all three types of learning to reach AI: we need supervised learning, we need unsupervised learning, and we need reinforcement learning. They cater to different niches, and humans use all three as well.
So one thing one may wonder about is why unsupervised learning hasn't been as successful. I don't claim to have all the answers, but I'll give you some suggestions. I think there are computational and statistical challenges that arise out of the objective we have in unsupervised learning, which is really to capture the joint distribution of many variables, in some form, maybe implicitly. When we do supervised learning, typically we only care about predicting one thing, one number, one category, and we're not trying to get a joint distribution in a high-dimensional space. And that's really what unsupervised training is about. It may not be explicit: if you train an autoencoder you might say you're just learning by minimising reconstruction error or something, but really what you're trying to do is extract information about the structure of the data distribution in a high-dimensional space. That is fundamentally difficult, and maybe it's going to take us another fifty years to crack it, but I really believe, and others like me believe, that we need to work hard and make progress on this to even approach human-level intelligence.
Now, from a practical point of view, why would we want to do that? A really obvious answer is that there's a lot of unlabelled data out there that we'd like our computers to learn from, using that information to build better models of the world. We can't go on building specialised machines for every new task, each needing a lot of labelled data. I mean, we can, and this is what we're doing, but it's not going to bring us human-level AI; it's not going to be enough. Here's another reason. When we do unsupervised learning, as I said, in some sense we are learning about the joint distribution of things, and then we should be able to answer any new question about the data. Think about, say, three random variables X, Y and Z. If I learn the joint distribution of all three, I should be able to answer a question like: given X, what can I say about Y and Z? Or: given Y and Z, what can I say about X? All the questions of the form: I know some aspects of reality, what can I say about other aspects? In unsupervised learning there's no preference for which question you're going to ask. You can think of supervised learning as a special case where you restrict yourself to one particular question, which is predicting Y given X.
Another reason unsupervised learning can be practically useful, even before we completely crack it, is that it turns out to be very useful as a regulariser, as an adjunct to supervised learning. This is the semi-supervised case: we can use unsupervised learning as a way to help generalisation, and the reason it helps is that it incorporates additional constraints on the solution. The constraint, the prior we're putting in, is that the solutions we're looking for are not just good at predicting Y given X; somehow they involve representations that are also good at capturing something about the input distribution of X. You don't have to have that constraint when you do pure supervised learning, but when you add it you can get better generalisation. That can be useful as a regulariser by itself. It can also be useful in the transfer setting, where you want to go to a new task for which you have very few labelled examples, or in domain adaptation, which is similar: not a new task but a new type of data. Maybe you go from Quebec French to Swiss French, and you have to adapt without a lot of data.
Alright, so these are good reasons. Another good reason, which came out right at the beginning of the deep learning revolution in 2006, is that it looks like we can exploit unsupervised learning to make the optimisation problem of deep learning easier. The reason is that we can define local objective functions: each pair of layers should form a good autoencoder, a good encoder-decoder pair. That kind of constraint induces a training signal locally; you don't need to backprop through twenty layers to get that information. So in the unsupervised pretraining we did from 2006 to about 2012, it was a useful way to get the training off the ground for deep supervised nets. Later we found other ways around this optimisation difficulty, with rectifiers for example, but it remains that there's an interesting effect here that could be taken advantage of.
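As a sketch of that layer-local training signal, here is a minimal greedy layer-wise pretraining loop in NumPy: a stack of tied-weight sigmoid autoencoders, each trained only on the output of the layer below. The cross-entropy reconstruction loss (whose output delta is simply r minus x), the layer sizes, the learning rate, and the random data are all illustrative assumptions, not details from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_layer(X, n_hidden, lr=0.1, epochs=20):
    """Train one tied-weight sigmoid autoencoder on X by SGD,
    minimising cross-entropy reconstruction error (so the delta
    at the sigmoid output is simply r - x)."""
    n_visible = X.shape[1]
    W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
    b, c = np.zeros(n_hidden), np.zeros(n_visible)
    for _ in range(epochs):
        for x in X:
            h = sigmoid(x @ W + b)        # encode
            r = sigmoid(h @ W.T + c)      # decode with tied weights
            err = r - x                   # output delta
            dh = (err @ W) * h * (1 - h)  # backprop through one layer only
            W -= lr * (np.outer(x, dh) + np.outer(err, h))
            b -= lr * dh
            c -= lr * err
    return W, b

# Greedy stacking: each layer gets a purely local objective, so no
# gradient ever has to travel through the whole deep network.
X = rng.random((200, 64))                 # toy unlabelled data in [0, 1)
reps = X
for n_hidden in (32, 16):
    W, b = pretrain_layer(reps, n_hidden)
    reps = sigmoid(reps @ W + b)          # representation fed to next layer
```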
And the last reason this is interesting is that even if you're only doing pure supervised learning, it sometimes happens that the thing you want to predict is not a single simple class or a simple real value. It's a composite object: for example, you're predicting a set, a data structure, a sentence, an image. If you predict an image, the output is a high-dimensional object; if you predict a sentence, the output is a high-dimensional object; and these objects are composed of simple things, like pixels or words or characters, so they have a joint distribution. Of course it's a conditional joint distribution: given the input, I want to predict the joint distribution of a bunch of things, like the words in a sentence, or the structure of a molecule. Saying something about these kinds of objects given an input is called structured output learning, and essentially all the techniques we have been developing for unsupervised learning, especially the probabilistic ones, become useful there. We just take the unsupervised learning model as usual, except we condition it, meaning the input changes something in the form of the joint distribution over the outputs.
Alright, these are all very good reasons to study unsupervised learning, but the one that really keeps me up at night is that we want the machine to understand how the world works. Unfortunately, if you step back behind all the hype and excitement around deep learning and machine learning in general, what happens very often is that the models end up learning simple tricks, surface statistical regularities, in order to solve the task. If you think about self-driving cars, you would like those cars to not just rely on surface statistical regularities but to make sense of the causal relationships between objects, and of what could happen in "what if" scenarios, even scenarios they may not have seen during their training phase. So how can that happen? How do humans manage to do it? Well, I think, and this is a hypothesis of course, since we don't really know what's going on in our brains, but there's a lot of evidence that we learn models of the world that are causal. What I mean by causal here is that they are explanations of what's going on. I think the main job of our brain is to figure out an explanation for everything that we're seeing; that's unsupervised learning's job. And having an explanation means that you can simulate what would happen if you changed some of these explanatory factors, even though this may be a situation you have never seen during training. To give an example: fortunately, I never had a car accident that killed me. So how can I learn to avoid the actions that could have gotten me killed in a car accident? Supervised learning is obviously not going to work, and even reinforcement learning is not going to work, because how many times would I have to die in accidents before I learned how to avoid them? You see that there's a problem. So how do we get around it? We build a mental model of cars, of roads, of people, that allows us to predict that if I do this and that, then something bad may happen, and this is how it may happen, and if instead I change my behaviour a little bit, I could end up alive. We are able to do that because we have these kinds of explanatory models. It's something we don't know how to do yet with machines, but it's something we really need to do; otherwise it's not going to be enough.
So how do we possibly do that? There are many answers, but one of them, and the reason we got started on this deep learning adventure, is that we thought that by learning high-level representations we might be able to discover high-level abstractions. What that means is that these abstractions are, in some sense, closer to the underlying explanations, the underlying explanatory factors. What we would really like is that the high-level features we're learning really capture the knowledge about what's going on. One way to think about this is that the pixels we're seeing, the sound we're hearing, the words we're reading, were created by something: by some factors, by some agents. Maybe the lighting, the microphone, whatever factors came in, were combined to produce what we observe. What we want a machine to do is to reverse-engineer this: figure out what those factors are and separate them, disentangle them. I'll come back to this notion of disentangling later, but I find it a very inspiring notion.
First I want to separate the notion of invariance from the notion of disentangling. The notion of invariance has been very commonly studied and thought about in areas like speech recognition or computer vision, where we want to do supervised learning: we want to predict something definite, like the object category or the phoneme, and we try to hand-craft features, or maybe learn features, that are invariant to all the other factors we don't care about. If I'm doing speech recognition, I don't want to know who the speaker is; I want my features to be invariant to the speaker, and invariant to the type of microphone I'm using. If I'm doing object recognition, I would like my high-level features to be invariant to translation, or something like that. The problem is that, while this is good for supervised learning, when you're doing unsupervised learning you don't know which factors are going to be the ones that matter. I want to capture everything about the distribution. I want to know that, actually, the underlying explanation of the sound I'm hearing is both a sequence of words and phonemes, and the identity of the speaker, and where that person is, and whether he's sick: all of these are explanations of what I'm hearing, and I would like the representation I'm getting to contain all of them, but with the factors separated out, so that I can just plug a linear classifier on top and pick out the phonemes if that's what I want, or pick out the speaker identity if that's what I want. That's the difference between invariance and disentangling. With invariance, we're trying to eliminate from the features those factors we don't care about. With disentangling, we don't want to eliminate anything; we just want to separate out the different pieces that are the underlying explanations. And if you're able to do that, you're essentially killing the curse of dimensionality, because if your goal is then to answer a specific question about one of the factors, you've reduced the dimensionality from very high to just those features that are sensitive to that factor.
Now, the thing we don't completely understand is that when we apply some of these unsupervised learning algorithms, it looks like the features we get are a bit more disentangled than the original ones as we go higher up. So something good is happening. There are experiments, published in 2009 and 2011, and I suspect there are more recent papers too, where we do a kind of analysis of the features learned by unsupervised learning algorithms like sparse autoencoders, knowing some of the factors. So I cheat a little: I know some of the underlying factors, and I can test whether some of the features become specialised towards one factor and less sensitive to the others. That's something we can measure, and somehow it seems to happen magically. So why would that happen?
happen. So here's here's a a kind of
it's a sketch of a theory why
unsupervised learning can give rise to
the extraction of features that are
more disentangle then then the original
data and yeah before I show you the
easy question initially this picture
because for pictures are so much better
so imagine that this is the data you're
getting you have distribution which is
actually a mixture of three gaussians
you can't have simpler than that well
you have a single guy. Um but nobody
tells you that you know what what cost
in the the particular sample you're
getting comes from so you have a label
data you just have the X and the winds
would be the gaussian identity is it
the number one number two number three
but you only observe X right. So if you
only observe axe what would be a good
model of the data well the best
possible model of the data is the one
that actually spells out the density as
a mixture of three gaussians right this
is this is in terms of log likelihood
or or whatever you wanna use is very
likely that the best model the data is
the one that actually discovers that
there is a latent variable Y which can
take the three you know integer values
one two or three I mean you can in the
maybe see if you want but and and you
can read label them but the point is we
have these three categories that are
sort of a implicit and data when we
don't class train. We're exploiting the
fact that there are national clusters
and we use clustering algorithms to
discover these clusters and you can
think of these processes as causes that
nobody told us about but we can
discover with a simple statistical
analysis just you know K means will
figure it out right so you so so the
principle is that there are underlying
causes and the statistics of the data
can reveal them to us if we go a good
model of the data the better the model
we have the the better we are able to
figure out those underlying causes. Um
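As a minimal illustration of that principle (a latent cause recovered from X alone), here is a sketch using scikit-learn; the component means, sample count, and the choice of k-means are arbitrary assumptions for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Generate X as a mixture of three Gaussians. The component identity y
# is the hidden cause and is never shown to the learning algorithm.
means = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 6.0]])
y = rng.integers(0, 3, size=600)
X = means[y] + rng.normal(size=(600, 2))

# k-means sees only X, yet its clusters line up with the hidden y
# (up to a permutation of the labels).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```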
Now, why would that be useful for supervised learning? That's where the question becomes interesting. Let's say that Y here is one of the factors that explain X, and let's say that at the end of the day we actually want to classify: to predict Y given X. How is this going to work? We could just train a normal neural net that predicts Y directly from X, or we could train a generative model that captures P(X). As I tried to argue previously, the best possible generative model here is actually one that's written as a sum over the Y's, and possibly over other variables, which we call H, where, given the causal factors, we can predict X. The reason this is a better model than the direct one is simply that this is how the data was actually generated: the best model of the data, the one that gives the best predictions, is the one that corresponds to the truth. So even if we don't observe Y, if we just observe X, we can extract latent variables: we try to model P(X) as a sum over H of P(X given H) times P(H), for example, introducing latent variables H. In the best possible model, one of those H's should be Y, because Y is one of the factors that explains X. So if we find good representations for P(X), it's likely that these representations will be useful to predict Y.
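In symbols, this is just the mixture decomposition together with Bayes' rule, nothing specific to one model:

```latex
p(x) = \sum_{h} p(x \mid h)\, p(h),
\qquad
p(y \mid x) = \frac{p(x \mid y)\, p(y)}{\sum_{y'} p(x \mid y')\, p(y')}
```

So if one of the latent factors H that the model discovers coincides with Y, a good model of P(X) has already captured most of what P(Y given X) needs.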
There is a nice paper at ICML 2012 by Janzing and others from Bernhard Schölkopf's group at the Max Planck Institute, where they show that there's a huge difference between the situation where X is the cause of Y and the situation where Y is the cause of X, in terms of the ability of semi-supervised learning to work. In other words, if Y is the cause of X, then we can do semi-supervised learning, and learning about P(X) actually becomes useful, even though at the end of the day we only care about P(Y given X). Whereas if the causal direction is reversed, all the semi-supervised learning would be useless, because in the reversed case the joint distribution P(Y, X) would just be given by P(Y given X) times P(X), so the structure of P(X) would have nothing to do with P(Y given X). If instead the right causal model goes from Y to X, then when we want to learn P(Y given X), there is information about P(Y given X) inside P(X), because P(X) decomposes as a sum over Y of P(X given Y) times P(Y). They push this argument much further, but the main message is that there is a deep connection between causality, that is, which variable is the cause of which, and the success of unsupervised learning in helping supervised learning.
Alright. I mentioned that unsupervised learning is difficult, and this shows up very clearly when you try to tackle it using the arsenal of mathematical and computational tools from probability, like graphical models and models with latent variables. In principle, introducing latent variables should help us, and it should even help us avoid the curse of dimensionality, because we're modelling at the right level in some sense. But the problem is that for all the approaches that are really anchored in probability, in an explicit probabilistic model, some of the computations needed either for learning or for using the model are intractable: they involve integrals, or sums over an exponential number of configurations. For example, in typical directed models, exact inference, in other words predicting the latent variables given the input, is intractable. You are able to go in the other direction, predicting X given H, because that's how the model is parameterised, but going backwards, which is something we actually need to do both for learning and potentially for using the model, involves an intractable sum. In the other family, the undirected models, there's another issue, potentially in addition to this one: these models involve a normalisation constant which is intractable, and so is its gradient. In other words, the probability is expressed as some expression divided by a normalisation constant, which we usually write as Z, and that is not something we can compute easily; and of course we also need the gradient of that Z. So it looks like it's hopeless.
This has motivated a lot of new things, some of which I will tell you about, but let me start with the ancestors of these generative models: the energy-based models, Boltzmann machines, basically the category of undirected graphical models. With undirected graphical models, you express the probability function of X, the random variable you're trying to model, in terms of an energy. This is just a rewrite; there's not much gained by doing it, except that we're saying that every configuration gets a non-zero probability, because the energy is finite for any X, so the probability is greater than zero everywhere. But what it really says is that instead of parameterising the probability directly, we parameterise the energy, and we let Z derive from it: Z is just the sum over X, or the integral over X, of the numerator.
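In symbols, the energy-based parameterisation being described is:

```latex
p(x) = \frac{e^{-E(x)}}{Z},
\qquad
Z = \sum_{x'} e^{-E(x')}
\quad \text{(an integral over } x' \text{ in the continuous case)}
```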
Okay, so if you have a model of that type, it turns out that maximising the log-likelihood tells you to update your parameters according to the following very simple idea, especially if you think in terms of stochastic gradient descent. I'm given an example X, call it X-plus, and the landscape I'm showing here is the energy landscape. Remember that e to the minus energy gives the probability, so where the energy is low, the probability is high, and the relationship is exponential (which is hard to visualise here): when the energy goes up a lot, the probability goes to zero exponentially fast. So we're given an example X-plus, and we have a current energy function; this is the curve, and the Y axis is energy. What we want to do with maximum likelihood is make the probability of the observed data high, which means making the energy of the observed configurations low. The ideal solution would be to make every training example a trough, a minimum of the energy. That would be ideal from the training point of view; from the generalisation point of view maybe not. But in any case, what training consists of is pushing down on the energy where the examples are, and pushing up everywhere else. Because if I just push down on the training examples, the energy elsewhere may not be good; what I really want is for the relative energy to be small at the training examples. Here's an example where the data points are these little dots: during training we push down there and push up everywhere else, and we get a model that puts low energy where the data is. This is a good model, and this one is not as good a model.
You can get all this with just three lines of algebra, but there's something intuitive about what's going on. At the same time as we push down the energy at the configurations given by the data, we push up everywhere else, but not with the same strength everywhere: the equation we get tells us to push up especially in places where the energy is low. All those places that get high probability should be pushed up. We call these the negative examples, and the data points the positive examples: we try to make positive examples more probable and negative examples less probable. And where do we get the negative examples? Ideally, they come from the model distribution itself. Once we have an energy, we have a probability distribution that corresponds to it through this equation, and if we could sample from that distribution we would get, say, many points here and a few there, and we want to push up wherever those samples land. That's what the math tells us to do to maximise likelihood, and that's what we see in this equation: the derivative of the log-probability with respect to the parameters, which are hidden inside the energy function, has two terms, one we call the positive-phase term and the other the negative-phase term. The first says: change the parameters so that the energy of this X becomes lower (we want to maximise this, and there's a minus sign here, so we minimise the energy at this X). The other term pushes up: there's no minus sign, it wants the energy to go up everywhere, a sum over all X-tilde, but weighted by P(X-tilde). So in those places where the model thinks the probability is high, we want to reduce the probability, that is, increase the energy. The second line shows the case where the model involves not just the observed X but also some latent variable H. Now the energy function is defined in terms of both X and H, and you can marginalise, summing over all the values of H, and get another equation which looks like the one before. I'd call this a modified or marginalised energy, which should be the right term, but physicists call it the free energy. It's a similar equation, except that we now have to weight by the posterior probabilities of H given X in the two terms. So you see that when you have latent variables, to learn we need to sample from, or average over, this posterior probability of the latent variables given the input, and that can be hard.
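Written out, the two phases and the free energy described above take the standard form:

```latex
-\frac{\partial \log p(x)}{\partial \theta}
  = \underbrace{\frac{\partial E(x)}{\partial \theta}}_{\text{positive phase}}
  - \underbrace{\mathbb{E}_{\tilde{x} \sim p}\!\left[
      \frac{\partial E(\tilde{x})}{\partial \theta}\right]}_{\text{negative phase}},
\qquad
F(x) = -\log \sum_{h} e^{-E(x,h)}
```

With latent variables, the same identity holds with E replaced by the free energy F, and since the gradient of F(x) is the posterior-weighted sum of the gradients of E(x, h), the posterior P(H given X) appears as the weighting in both terms.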
I won't tell you much about how we do this, but the ways we know right now all involve some kind of Markov chain Monte Carlo. MCMC methods are just methods to sample from a distribution when we don't have any better option: a general-purpose, iterative way of sampling from a distribution. You never actually get a true sample from the distribution; you have to run many steps, and asymptotically you hope that you get samples from the right distribution.
You may have heard about restricted Boltzmann machines. These are a particular kind of undirected graphical model with a graph structure like this, where there is no connection between the X's when we know the H's, and vice versa: the X's are conditionally independent given the H's, and vice versa. This forms what's called a bipartite graph, where we have connections going from top to bottom everywhere, but no lateral connections here or here. With those conditions, it turns out to be easier to train these models. I'm not going to go into it; again, it uses Markov chains, but somehow we are able to do a decent job of training these types of undirected graphical models. RBMs were used as the building blocks, starting with the 2006 breakthroughs on unsupervised learning, to train deeper models, but that thread of research has kind of died over the last few years.
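For concreteness, here is a minimal sketch of RBM training in NumPy using one step of contrastive divergence (CD-1), the common practical approximation in which a single Gibbs step stands in for a true sample from the model; the sizes, learning rate, and data are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Binary RBM with energy E(x, h) = -x.b - h.c - x.W.h
n_vis, n_hid, lr = 64, 32, 0.05
W = rng.normal(0.0, 0.01, size=(n_vis, n_hid))
b, c = np.zeros(n_vis), np.zeros(n_hid)

X = (rng.random((500, n_vis)) > 0.5).astype(float)  # toy binary data

for epoch in range(10):
    for x in X:
        # Positive phase: clamp the data, infer the hidden units
        # (tractable thanks to the bipartite structure).
        ph = sigmoid(x @ W + c)
        h = (rng.random(n_hid) < ph).astype(float)
        # Negative phase: one Gibbs step back and forth produces
        # an approximate "model sample".
        pv = sigmoid(h @ W.T + b)
        v = (rng.random(n_vis) < pv).astype(float)
        qh = sigmoid(v @ W + c)
        # Push energy down at the data, up at the model sample.
        W += lr * (np.outer(x, ph) - np.outer(v, qh))
        b += lr * (x - v)
        c += lr * (ph - qh)
```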
I have some thoughts about why it didn't work as well as we had hoped, and part of it, I believe, has to do with the fact that we rely on these Markov chain Monte Carlo methods to get the samples. Let me try to explain what I think is going on. In order to get a gradient on the parameters, in order to train the model, we need samples from the model. In other words, we have to ask the model: give me examples of the things you believe in, the images you would generate. We do this by running a Markov chain, which starts at some configuration and randomly makes small local moves. What's particular about MCMC is that the moves are typically local, and they prefer to go towards places of high probability, so you end up walking near the modes of the distribution and spending more time where the probability is higher; in fact, proportionally exactly to the probability. But there's a problem. Initially the model is agnostic: it puts roughly uniform probability everywhere. Then it starts to put more and more probability mass around where the data is. At first these modes are smooth, and you can still travel between them without having to go through a zero-probability region. But as the model gets sharper, in other words as it gets more confident about which configurations are probable and which are not (like the regions between the modes: maybe this is one category, this is another, and there shouldn't be anything in between), the Markov chains get trapped around one mode. They can't easily jump from one mode to another, which means that if we start somewhere, we stay around that region and never visit the rest, so we don't get representative samples; and if we don't get representative samples, our training suffers. So those models can learn distributions up to some level of complexity, but if we try to learn more complex distributions, it just stalls. Maybe we'll find solutions to that, but for now it remains an open problem as far as I'm concerned.
One glimmer of hope comes from experiments we ran a few years ago, where we found that although sampling in input space with MCMC is hard, things change if, instead of running the Markov chain in the raw input space, like pixels, we first map the data to a high-level representation. Say we've trained a stack of autoencoders or a stack of RBMs, so we have a way to map the input data to a learned representation, typically learned without supervision. If we run the Markov chain in that space, it turns out that it mixes much better between the modes. We've been trying to understand this, and I have a picture here which hopefully helps. In pixel space, the raw input space, the data concentrates near manifolds, like here; this is a cartoon, obviously. This is the manifold of nines and this is the manifold of threes. These manifolds are very thin, they occupy very small volume, and they're well separated from each other, so it's hard to mix between two categories, for example. What happens as you map the data into these higher-level (not higher-dimensional) spaces, which have somehow learned to capture the distribution, as autoencoders do, is that the relative volume occupied by the data becomes larger than in the original space, and the different manifolds get closer to each other, so it becomes easier to jump from one to the other. Something else happens too: whereas the manifolds in the original space are highly curved and complicated, in these spaces learned by unsupervised learning, the manifolds become flat.
To understand what I mean by a flat manifold, think about a curved one. Say the data in input space concentrates on this curved manifold, and I take two examples, the image of a nine here and the image of a three here, and I linearly interpolate between them and look at the points in between, trying to visualise what they look like. This is what we did: you have a nine here, you have a three here, you do linear interpolation in pixel space, and of course what you get is the superposition of a nine and a three, which doesn't look like either. Take two random natural images, add them up, and you get something that doesn't look like an actual image. What that means is that if I take two images and interpolate, the stuff in between is not on the manifold, because the manifold is not flat. If the manifold were flat, the things in between under linear interpolation would look like natural images. This is actually one of the tests we use with a new unsupervised feature learning algorithm to see whether it has done a good job of unfolding or flattening: we take two images, map them to representation space, do a linear interpolation there, and look at the things in between back in pixel space. We can go back and forth between input space and representation space, so we can interpolate in the representation space, the H space, and then use the decoder to map back to pixel space and visualise. Here we see what happens when we do this at the first layer of a stack of denoising autoencoders, and here at the second layer. What we find is that the higher we go, the better the flattening. After just the second layer, we can interpolate between this nine and this three, and everything in between makes sense and looks like a digit. There's a point here where it suddenly jumps very fast from the nine to the three: this one is right at the border between three and nine, and in just a few pixels it goes from nine to three. So it has really found a way to bring these manifolds very close to each other, and the straight-line path now goes exactly to the right place, never passing through something that doesn't look like an actual image. Any questions about this?
[In answer to a question:] Okay, so I have an image, and I have an encoder which maps it to a vector. I take another image, a three, and I get another vector; two vectors. Now I can take a linear interpolation: alpha times the first one plus one minus alpha times the second one, where alpha lies between zero and one. For example, half of this one plus half of that one would be right in the middle. That gives me another vector, and then I map it back to pixel space, because I have a two-way mapping here, from input to representation and back. We do this because it tells us whether the manifold has been flattened or not.
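A minimal sketch of that procedure in NumPy; the `encode` and `decode` functions here are placeholder linear maps standing in for a trained encoder/decoder pair, just so the script runs end to end.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder linear encoder/decoder; in practice these would be the
# learned mappings of a trained (stacked or denoising) autoencoder.
W = rng.normal(size=(784, 100))
encode = lambda x: x @ W
decode = lambda h: h @ W.T

x_nine, x_three = rng.random(784), rng.random(784)  # stand-in images
h1, h2 = encode(x_nine), encode(x_three)

# Interpolate in representation space, decode each point back to
# pixel space, and inspect whether the in-betweens look like digits.
interpolated = []
for alpha in np.linspace(0.0, 1.0, 9):
    h = alpha * h1 + (1.0 - alpha) * h2
    interpolated.append(decode(h))      # e.g. reshape to 28x28 and plot
```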
But why would that be a good thing? Well, I think it's very clear: if you have a flat manifold, you can basically combine existing examples to predict what would happen. I'll show you an example later of what you can do. Here's one: you take a man with glasses, you subtract the vector for a man without glasses, you add the vector for a woman without glasses, and you get a woman with glasses. Whatever processing you want to do, you really want to do it in this linear space, where simple linear operations change things: I decide I don't want glasses, so I just move in some direction. [In answer to a remark:] Yes, you're right, but this is the simplest thing we can think of. And here's another way to think about why linear, why flat, is good. Let's say I want to capture the distribution, to do maximum likelihood. If the distribution looks like this, it's going to be difficult to model: with a Gaussian mixture I would need many components to follow it. If it's one flat thing, one Gaussian, bam, I've got the density. And once you know the density of the data, you can answer any question about it in the language of the variables that you've designed. [In answer to a question:] I'm not saying the world is one big Gaussian. But if we can map a lot of it to simple distributions, then we can answer questions. And you're right; for example, in the direction you're talking about, if you actually have multiple categories, presumably you're not going to get a single Gaussian that captures all of them. You'd probably want a different Gaussian for each category, so the right model wouldn't be a single Gaussian, because we want to capture the fact that there are clusters. But the point is, it's going to be much easier to model the data, answer questions, and reason, if we can flatten the manifolds. [In answer to a question:] No, it's not about the structure of the space; it's about the structure of the distribution. If the data has to lie along some subspace, if it has a complicated shape, it's hard to capture what's going on, hard to make predictions, hard to reason about it. If everything becomes linear, it's much easier to reason about. That's all. Let me move on, because I only have fifteen minutes left and lots of things I'd like to talk about, so I'll go quickly.
I mentioned autoencoders already: basically you just need two mappings, from input space to representation space and back, and we can learn them in various ways. We can also have a probabilistic version, where the encoder is actually a conditional distribution: it's not just a function, we inject some kind of noise, and we get a sample of H given a particular X. Similarly, the decoder can itself be a conditional distribution of X given an H that comes from some distribution, which we call the prior; then we get X's from P(X given H). So these two pieces together represent a joint distribution over X and H, and these two pieces, parameterised differently, also correspond to a joint distribution. I mentioned a little earlier that the MCMC methods and the classical ways of parameterising probability distributions have kind of hit a wall, so we explored other ways of doing this, and the general theme is: let's bypass all of these normalisation constants and learn generative black boxes.
If we specify the problem of unsupervised learning as "build a machine that can generate, say, images", which is a definition we can discuss, but say we define it like this, then let's just train a neural net that takes in random numbers and outputs images. We can of course train a different kind of neural net with different inputs: for example, given some sentence, generate an image that corresponds to it. That's just a variation; once you're able to train a neural net that does this kind of thing, you can do all kinds of other fun things. So that's one variant, and I'll tell you about these: they're called generative adversarial networks, and they are very hot these days. Another variant, which for now has been less explored, is: we're not going to generate in one go, we're going to generate through a sequence of steps, like a recurrent net. It has a state, we throw in some random numbers, and at each point in the sequence we generate a sample; as we do more steps, the samples look nicer. This imitates the Markov chain, but now we learn the transition operator of the Markov chain, the black box that goes from one state to the next and generates samples, taking in random numbers so that we get a different thing each time. So this is a kind of stochastic dynamical system that generates the things we want; I've called these generative stochastic networks. We can do all kinds of math about them and actually show that you can train them, and this is totally different from the classical approach of undirected graphical models. Let me skip some things and tell you about the denoising autoencoder, which is related to this and, of course, to autoencoders in general.
It's a particular kind of autoencoder where, as you see here, we minimise the reconstruction error, but instead of giving the raw input here, we give a corrupted input: for example, we hide some of the inputs by setting them to zero, or we add some Gaussian noise, or whatever we want. We could also inject noise here, but the traditional thing is to inject noise at the input. And the error we minimise is some kind of reconstruction log-likelihood: the probability of the clean input given the code. That's the denoising autoencoder, and it's probably the one that has been best studied mathematically and whose probabilistic interpretation is best understood.
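A minimal sketch of that training criterion in NumPy, assuming masking noise and a tied-weight sigmoid autoencoder with cross-entropy reconstruction loss (so the output delta is r minus the clean x); the sizes, the 30% corruption level, and the data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_in, n_hid, lr = 784, 256, 0.1
W = rng.normal(0.0, 0.01, size=(n_in, n_hid))
b, c = np.zeros(n_hid), np.zeros(n_in)

X = rng.random((100, n_in))                  # stand-in training images

for x in X:
    x_tilde = x * (rng.random(n_in) > 0.3)   # corrupt: hide 30% of inputs
    h = sigmoid(x_tilde @ W + b)             # encode the corrupted input
    r = sigmoid(h @ W.T + c)                 # reconstruct
    err = r - x                              # compare to the CLEAN input
    dh = (err @ W) * h * (1 - h)
    W -= lr * (np.outer(x_tilde, dh) + np.outer(err, h))
    b -= lr * dh
    c -= lr * err
```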
Here's a picture of what's going on. Say the data is concentrated on this manifold, and the X's here are training points. We take a training point, we corrupt it, and we get something like this red point; then we ask the neural net to go back and reconstruct the original. Of course it may not be able to do that perfectly, because the original could have been here, or here, or here, so in general it points right at the manifold, if it learns well. So it learns a kind of vector field which points towards the data. You can actually do the experiment: say the data are these yellow circles, and the autoencoder has learned these arrows. The arrows correspond to where it wants to go: if you start here, it goes in this direction, so the reconstruction points in this direction, and the arrow is proportional to the reconstruction minus the input. In fact, we can prove that if you train this well, then asymptotically, where this converges is that the reconstruction minus the input estimates what's called the score, d log p(x)/dx: the gradient of the density, the direction in which density increases the most. If you're sitting here, the way to increase probability the most is to move towards the manifold; that's the gradient of the likelihood. There should be a peak of probability here, and probability should drop fast as you move away, in fact be zero, but if you smooth it a bit, you get this.
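In symbols, the result being cited (from the literature connecting denoising autoencoders to score matching) is that, for small Gaussian corruption of variance sigma squared, the optimal reconstruction function r satisfies:

```latex
r(x) - x \;\approx\; \sigma^{2}\, \frac{\partial \log p(x)}{\partial x}
```

so the vector field of reconstruction displacements is proportional to the score of the data density.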
There are a lot of papers trying to understand this, and they also show that you can sample from these denoising autoencoders: you can define a Markov chain that corresponds, asymptotically, to sampling from the model. Once you've trained the denoising autoencoder, you just apply the corruption, then apply the stochastic reconstruction, in other words sample from the output distribution rather than taking a deterministic function, and then you do it again and again, and this Markov chain converges to where the model sees probability mass. In terms of this picture, what it means is that you follow this arrow, add a bit of noise, follow the arrow, add a bit of noise, and you move more or less in that direction, and then you start moving around this region: there are no arrows going away from it, and if you deviate, the arrows bring you back. The bit of noise makes you move around like a random walk, and you stay on that random walk around the data. Let me skip a few more things.
There is another kind of autoencoder with a probabilistic interpretation that has really made it big in the last few years. It's called the variational autoencoder, and there's a very beautiful, and actually very simple, theory behind it. We think about two distributions. There's a directed model, which is the one we actually want to train, and which we can decompose into a prior on the top-level variables and then conditionals; usually there's only one stage, so you have X given H. That's the decoder path. And then we have an encoder path, which goes in exactly the other direction, but which is stochastic: it has a distribution Q(H given X), where X comes from the data distribution, which by convention I like to write Q(X). Together these define a joint Q over X and H, while the decoder side defines a joint P over X and H, and essentially the training objective is to make these two joint distributions match in the KL (Kullback-Leibler) sense. It turns out this is pretty much tractable: it doesn't involve running a Markov chain, and you can change the parameters of both the encoder and the decoder so that the two joint distributions get as close to each other as possible. In particular, if the joint of this and the joint of that match well, then the marginals match too: Q(X), which is the data distribution, matches P(X), which is the marginal here. But you never need to express Q(X) directly. This relies on what's called a variational bound, in which log P(X), which is intractable, is bounded by a tractable quantity that involves sampling from Q and measuring P(X given H), something we can compute. I'm not going to go into the details, because I only have a few minutes left, and I'll skip a few things here.
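The variational bound in question is the standard evidence lower bound (ELBO):

```latex
\log p(x) \;\ge\;
\mathbb{E}_{q(h \mid x)}\!\left[\log p(x \mid h)\right]
- \mathrm{KL}\!\left(q(h \mid x)\,\|\,p(h)\right)
```

with equality when Q(H given X) equals the true posterior P(H given X); maximising the right-hand side over both encoder and decoder parameters is what makes the matching tractable.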
There are recurrent variants of this that have been proposed, such as DRAW, which do fun things like generating not in one go but through a sequence of steps: for example, drawing a three by moving a little cursor, changing its position and size, and putting down ink in the middle of it, and it basically draws the thing you want. It works really well for MNIST digits, and you can also do it on SVHN, the Street View House Numbers dataset: these are actual training examples from that dataset, and these are the kinds of samples you get. These are really good; before DRAW, we had no algorithm that could draw things like this that look so realistic.
Now, that's digits; the next step was natural images, like ImageNet. For this, the algorithm that really made it big is the generative adversarial network I mentioned earlier. It's based on a very simple intuition. You're going to train two neural nets. One is the one we want to use at the end, the generator; as I said before, it's a black box that takes a random vector and outputs a fake, generated image. But we also train a discriminator network, a classifier, and you can think of this discriminator as a trained loss function. Normally the loss function is something fixed, but here, as in some reinforcement learning setups, we learn the loss function, and it is basically one that tries to discriminate between the fake images generated by our model and the real images coming from the training set. So the discriminator is just doing normal classification, and the way we train the generator is that it tries to fool the discriminator: in other words, it tries to produce an output that maximises the probability that what the discriminator sees is classified as a real image. We take the output probability here and we just backprop into the generator. That's the basic idea.
During training, when we train the discriminator, we sometimes show it a real training image and tell it that it should output a one, and sometimes give it the output of the generator and tell it that it should output a zero. But when we train the generator, we take the probability of "real" that the discriminator produces when the input comes from the generator, and we try to maximise it. So we push the discriminator to produce the wrong answer; the generator is trying to fool it.
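A sketch of that two-player training loop, here in PyTorch on a toy problem where the "images" are just 1-D samples from a Gaussian; the architectures, learning rates, and data are illustrative assumptions, not the setup used in the papers.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data: "real images" are 1-D samples from N(4, 1).
# G maps noise to a sample; D outputs the probability "real".
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = 4.0 + torch.randn(64, 1)    # samples from the data
    fake = G(torch.randn(64, 8))       # samples from the generator

    # Discriminator: real -> 1, fake -> 0 (plain classification).
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(64, 1)) + \
             bce(D(fake.detach()), torch.zeros(64, 1))
    loss_d.backward()
    opt_d.step()

    # Generator: fool D by making it output "real" on fakes;
    # the gradient backprops through D into G.
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(64, 1))
    loss_g.backward()
    opt_g.step()
```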
There have been a number of papers, including some famous ones, where these kinds of models were used to generate images more realistic than those of any previously known method. These are the kinds of images that were generated, and in this case you can also look at how the image was generated, going from low resolution to high resolution, and sort of see how it fills in the details. Then there was another paper, about six months back (if you don't know yet, on arXiv the identifier starts with the year and the month, and then the numbers increase as more papers come in). It's a variant of the GAN which uses convolutions in a smart way. These models are pretty difficult to train, but when you succeed, they can produce very realistic images; these are the kinds of images generated by the model. This kind of blew everybody's mind.
And you can play games like what I told you before: you can work in the representation space and do arithmetic with those vectors, and do the things I showed earlier. The kinds of things people have been doing with word vectors, you can do with images. There's a new paper coming from my group (I'm not one of the authors) where we combine some of the ideas of the variational autoencoder and the GAN. We have two models, one that goes from input to latent space and one that goes from latent space to input, like the encoder and the decoder, and we have a discriminator that looks at both the input and the latent variable and tries to figure out whether the pair comes from this one or from that one. These are the kinds of images we generate with this. It's hard to quantify, unfortunately; that's one of the problems with these things.
Okay, I think I'm going to stop here. I had a whole bunch of other slides in my presentation, which I'll make available, where I talk about neural autoregressive models, a special case of which is the recurrent net, which can be used to generate: recurrent nets are actually generative models, and you can use them to generate a sequence of things. More recently this was used to generate images as well: this is the PixelRNN paper, which was just presented at the last ICML a couple of weeks ago and got a best paper award. They are also able to generate pretty nice images, and people are getting excited about it. But basically you're just generating one pixel at a time, conditioned on the previously generated pixels.
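The factorisation behind that one-pixel-at-a-time scheme is just the chain rule of probability over the pixels x_1 through x_n:

```latex
p(x_1, \dots, x_n) = \prod_{i=1}^{n} p\!\left(x_i \mid x_1, \dots, x_{i-1}\right)
```

Each conditional is produced by the network, and no latent variable H appears anywhere in this decomposition, which is exactly the philosophical objection raised next.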
I don't really like the philosophy of this, because we've gotten rid of the latent variables. But, well, it works; we're scientists, so we have to face reality, try to adjust, and ask what it is we were missing from the other approaches that makes this one work so well. So that's where we are. Thank you for your attention. More questions, please; we have about five minutes. [Question, partly inaudible, asking whether this is related to disentangling the underlying factors.] Absolutely, yes; we can definitely see that.
Right: it's as if, in this latent manifold, there is a direction corresponding to glasses and a direction corresponding to male versus female, and then you can do arithmetic, independently adding more or less of these things. Whereas in pixel space there's no direction that removes glasses or changes gender; that's just not possible. I mean, it might work for a particular image, but not in general, whereas this works in general. So it has taken the image manifold, which is really twisted and convoluted, into something flat, where directions have meaning. That's what we were aiming for.
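A sketch of that attribute arithmetic, reusing the same kind of placeholder linear encode/decode pair as in the earlier interpolation snippet; the three face vectors are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)

W = rng.normal(size=(784, 100))            # placeholder "trained" maps
encode = lambda x: x @ W
decode = lambda h: h @ W.T

# Stand-ins; in practice these are encodings of real face images.
man_glasses, man, woman = (rng.random(784) for _ in range(3))

# "man with glasses" - "man" + "woman" = "woman with glasses",
# computed as simple vector arithmetic in the flattened latent space.
h = encode(man_glasses) - encode(man) + encode(woman)
woman_with_glasses = decode(h)
```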
[Question:] I have a question about the adversarial image generation. In the case where you generate fake images that are indistinguishable from the real images... That never happens, because we are not that good yet. [Follow-up:] So, do we have... I was asking the same question yesterday: we're not able to make the discriminator reach fifty percent error; it always stays a bit better than fifty, you know, sixty or something. So the question is speculative: do we have a guarantee that in the limit we will be able to generate something that looks indistinguishable from real images? Well, if we do, we're done. If the discriminator is completely fooled, when we've put as much capacity into it as we can, that means we're finished: we have a machine that generates real images. [Follow-up: but that's "real" only with respect to that discriminator network.] Sure, but the whole of statistical machine learning is based on this kind of nonparametric reasoning, where you say: let's imagine that the amount of data grows and my capacity grows accordingly; what would happen in the limit? And here we can show what happens in the limit: it's going to learn the distribution. Whether it's going to be feasible from an optimisation point of view is another matter. There's also something really funny going on here: in normal machine learning we have a single objective function, whereas here we have a game, where each of the two players optimises a different objective function. In theory there is a solution to the game, but it's not simply the minimisation of one objective function. In the paper you'll see a lot of theory about what happens asymptotically, and in principle it should learn the distribution. Okay, thanks.
[Question:] Thank you very much for the great talk. I have a question about the manifolds in representation space. You use visualisation, checking whether linear interpolations look like real images, to study the manifolds. Is there another way to characterise them, like their shape or volume? I'm sure there are many approaches we could use to figure out what is going on. I think we're just starting to play with these theories and having a lot of fun, but there's so much we don't understand. Visualisation has been useful from the beginning here, and I think it could be even more useful in the future. So we're doing things like generating a plane, interpolating in the latent space and seeing what happens in input space, but we could probably do more to figure out what is going on.
[Question:] I'm wondering, when you do these interpolations, how they depend on the dimensionality of the representations. Does it work better with a compressed representation, or an expanded one? It depends on the kind of algorithm you're using. Variational autoencoders tend to compress, in some sense; they throw away dimensions, too much actually, and that's a flaw we're aware of, something we don't like: it does it too much. With things like denoising autoencoders, you can have many more dimensions; that's okay, it actually doesn't hurt. For the GANs, we usually keep the representation space pretty high-dimensional, but not as high-dimensional as the input, because the inputs are typically images and there's probably a lot of redundancy. You don't want these spaces to have too small a dimension: if you go to, say, two or three dimensions, it just doesn't work that well. You can get something like MNIST working with three dimensions, and you can see things that are reasonable, but you can't do natural images, and even for MNIST it wouldn't be as nice as if you had more dimensions.
[Moderator: maybe one more question before the coffee break.] [Question:] My question is: how do we know the generator network is not just reproducing images from the training set? Right, it's absolutely a valid concern, and we can do some things to try to make sure it's not. For example, a typical check is to take a generated image and find its nearest neighbour, in Euclidean distance, in the training set. We generate an image and then check: is this just a copy of a particular training example? If there were a very similar nearest neighbour in the training set, we would know that the network has just memorised it. So that's one trick.
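A sketch of that check in NumPy: for each generated sample, find the closest training image in Euclidean distance; the arrays here are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.random((10000, 784))     # stand-in for the training set
samples = rng.random((16, 784))      # stand-in for generated images

# Squared distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b,
# avoiding a huge (16, 10000, 784) intermediate array.
d2 = (samples ** 2).sum(1)[:, None] + (train ** 2).sum(1)[None, :] \
     - 2.0 * samples @ train.T
nearest = d2.argmin(axis=1)

# Near-zero distance suggests the "sample" is a memorised copy; in
# practice each sample is displayed beside its nearest neighbour.
for i, j in enumerate(nearest):
    print(f"sample {i}: nearest train image {j}, dist^2 = {d2[i, j]:.3f}")
```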
But it's not necessarily satisfactory, because maybe it's still learning something like nearest neighbours, just in a higher-level space. So yes, it's something we could be concerned about: is this overfitting in some sense, and is that why we get these nice images? I don't think we have a fully satisfying answer. In the case of the variational autoencoder, we can actually measure a bound on the log-likelihood, so there we can be sure that it's not overfitting, because we have a quantitative measure of the quality of the model through an approximation of the log-likelihood. Okay, thank you.
