Yoshua Bengio, University of Montreal, Canada

Tuesday, 5 July 2016 · 1:59 p.m. · 01h 08m 04s · 1,384 views

Thank you once more. So this presentation

will be a bit different from

yesterday's. It's more about things

that are happening at the research

level and not so much things that

people use to build products yet. Also

it'll be a little bit more technical.

And I have a bit more time than

yesterday, so feel free to raise your

hand and ask questions in the middle. If

there are too many I'll, you know, filter,

but let's try, so we don't need to

wait for the end to ask questions. Okay,

let's start with motivations. Really

it's about unsupervised learning. Why is that

important? Well, you have to realise

that all the great things that deep

learning has done in the last few years

are mostly due to supervised learning,

meaning that we need large

datasets that are labelled, where humans

have told the machine what the right

answer should be. But that's not how

humans learn most of the time.

Think about how a child, like a two- or

three-year-old, figures out what we call

intuitive physics. She understands, you

know, gravity, she understands solids

and liquids and all kinds of

mechanical notions, of course without

ever taking a class on

Newtonian physics. She got it by

observation. Her parents didn't tell her

how the world was, you know, going on in

terms of the physics. She just

interacts with the world, observes and

figures out causal explanations that

are sufficiently good that she can

control her environment and do all

kinds of things that robots can't do

yet, right? And so we'd like to have

that kind of ability for computers: to

observe and interact with the world in

order to get better information, and

learn essentially without

supervision. Now of course when I talk

about unsupervised learning, you have to

understand that in the big scheme of

things we need all three types of

learning to reach AI: we need

supervised learning, we need

unsupervised learning and we need

reinforcement learning. They just, you

know, cater to different niches, and

humans use all three as well.

So one of the things you may wonder, and

that I'll talk about, is why is it

that unsupervised learning hasn't been

as successful. I don't

seem to have all the answers for that,

but I'll give you some

suggestions. I think that there are

computational and statistical

challenges that arise out of the

objective that we have in unsupervised

learning, in really capturing the joint

distribution, in some form, maybe

implicitly, of many variables. Whereas

when we do supervised learning

typically we only care about predicting

one thing, you know, one number, one

category, and there we're not trying to

get a joint distribution in a

high-dimensional space. And that's really

what unsupervised learning is about. It

may not be explicit, like if

you train an auto-encoder you may think, "I'm not,

you know, I'm just learning by

minimising the reconstruction error or

something", but really what you're

trying to do is to extract information

about the structure of the data

distribution in a high-dimensional

space, and that's fundamentally

difficult. I don't know, maybe

it's gonna take us another fifty years

to crack this, but I really believe, and

others like me believe, that we

need to work hard on this and make

progress on this, you know, to even

approach human-level intelligence, right?

So now, from a practical point of view, why

would we want to do that? Well, a

really obvious answer is that there's a

lot of unlabelled data out there that

we would like our computers to learn from

and use that information to build

better models of the world. We can't go

on building specialised machines for

every new task, where you're gonna

need a lot of labelled data for each. I

mean, we can, and this is what we're

doing, but it's not gonna bring us

human-level AI. It's not gonna be enough.

Here's another reason. When we do

unsupervised learning, as I said,

essentially in some sense we are

learning about the joint distribution

of things. Then we should be able to

answer any new question about the data.

So think about it: I observe random variables

X, Y and Z and I learn the joint

distribution of all three. Now I should

be able to answer a question like: oh,

given X, what can I say about Y and Z?

Or given Y and Z, what can I say

about X? All of these are questions of the form: I

know some aspects of reality,

what can I say about other aspects?
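
As a minimal sketch of this point, assuming three binary variables and a made-up joint table (the numbers and the helper name `conditional` are illustrative, not from the talk), any conditional question can be answered from the joint by summing over the consistent configurations and renormalising:

```python
# Hypothetical joint distribution P(X, Y, Z) over three binary variables.
# The weights are made up for illustration; they are normalised below.
weights = {
    (0, 0, 0): 4, (0, 0, 1): 1, (0, 1, 0): 2, (0, 1, 1): 3,
    (1, 0, 0): 1, (1, 0, 1): 2, (1, 1, 0): 3, (1, 1, 1): 4,
}
total = sum(weights.values())
joint = {xyz: w / total for xyz, w in weights.items()}

NAMES = ("X", "Y", "Z")


def conditional(joint, query, evidence):
    """Return P(query | evidence) as a dict keyed by tuples of query values.

    query: tuple of variable names, e.g. ("Y", "Z")
    evidence: dict of observed values, e.g. {"X": 0}
    """
    q_idx = [NAMES.index(n) for n in query]
    table = {}
    for config, p in joint.items():
        # Keep only configurations consistent with the evidence.
        if all(config[NAMES.index(n)] == v for n, v in evidence.items()):
            key = tuple(config[i] for i in q_idx)
            table[key] = table.get(key, 0.0) + p
    norm = sum(table.values())  # this is P(evidence)
    return {k: v / norm for k, v in table.items()}


# "Given X, what can I say about Y and Z?"
p_yz_given_x0 = conditional(joint, ("Y", "Z"), {"X": 0})
# "Given Y and Z, what can I say about X?"
p_x_given_y1z0 = conditional(joint, ("X",), {"Y": 1, "Z": 0})
# The supervised special case: predict Y given X.
p_y_given_x1 = conditional(joint, ("Y",), {"X": 1})
```

With only eight configurations the table fits in memory; the difficulty discussed in the talk is that a real joint distribution over many variables cannot be stored or summed this way.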

In unsupervised learning there's no

preference as to which question you're gonna

be asking. You can think of supervised

learning as a special case of, you know,

restricting yourself to only one

particular question, which is: predict

Y given X. Another reason why unsupervised

learning can be practically useful, even

before we completely crack it, is that

it turns out to be very useful as a

regulariser. What that means is

that it can act as an adjunct to supervised

learning. So this is the semi-supervised

case: we can use unsupervised learning as

a way to help generalisation. And the

reason it helps is that it

incorporates additional constraints

on the solution. The

constraint, or the a priori, that we're

putting in is that the solutions we're

looking for are not just good at

predicting Y given X; somehow they

involve representations that are

also good at capturing something about

X, the input distribution, right? You

don't have to have that

constraint when you do pure supervised

learning, but when you add that

constraint you can get better

generalisation. And that can be

useful as a regulariser by itself. It

could be useful in the transfer

setting, where you wanna go to a new

task where you have very few labelled

examples, or domain adaptation, which is

kind of similar: it's not a new task,

it's a new, you know, type of data. Maybe

you go from, you know,

Quebec French to Swiss French, and you

have to adapt and you don't have a lot

of data. All right, so these are

good reasons. Another good reason, that

came out right at the beginning of the

deep learning revolution in two

thousand six, is that it looks like we

can exploit unsupervised learning to

make the optimisation problem of deep

learning easier. And the reason is

that we can define sort of local

objective functions, like: each pair of

layers should be a good auto-encoder,

should form a good

encoder-decoder pair. And that kind of

constraint is something, you know, that

induces a kind of training signal

locally. You don't need to backprop through

twenty layers to get that information.

So in the unsupervised

pre-training things we did from two

thousand six to about two thousand twelve,

it was a useful way to get

the training off the ground for deep

supervised nets. Later we found other

ways to go around this optimisation

difficulty, with the rectifier, but

it remains that there's an interesting

effect here that could be taken

advantage of. And then the last reason

why this is interesting is that even if

you're only doing pure supervised

learning, it happens sometimes that the

thing you wanna predict is not a single

simple class or a simple real value.

It's a composed object.

For example, you're predicting a set,

you're predicting a data structure,

predicting a sentence, you're predicting

an image, right? So if you predict an

image, the output is a high-dimensional

object; if you predict a sentence, the

output is a high-dimensional object. And

these objects are composed of

simple things like pixels or words or

characters, and so they have a joint

distribution. Now of course it's a

conditional joint distribution: given

the input, I want to predict the joint

distribution of a bunch of things, like

words in the sentence or something like

that, or the structure of a molecule.

So all of these kinds of objects we may

be interested in predicting, or saying

something about, given an input. That's

called, you know, structured output

learning, and essentially

all the techniques that we have been

developing for unsupervised learning,

especially the probabilistic ones,

become useful. We just have the

unsupervised learning model as usual,

except we condition it, meaning we have

the input that changes something in the

form of the joint distribution over the

outputs. All right, so these are very good

reasons to study unsupervised learning,

but the one that really, you know, makes

me wake up at night is that we really

want the machine to understand how the

world ticks, how the world works. And

unfortunately, if you step back,

you know, behind all the hype and

the excitement around deep learning and

machine learning in general, what

happens very often is that the

models end up learning simple tricks.

They exploit surface statistical

regularities in order to solve the task.

And if you think about self-driving

cars, you would like those self-driving

cars to somehow not just

rely on surface statistical

regularities but make sense of the

causal relationships between the

objects, and of the what-could-happen-if

scenarios, even though they may not have

seen these scenarios during their

training phase. So how can that happen?

How do humans manage to do that?

Well, the idea, and I think this is a

hypothesis of course, we don't really

know what's going on in our brains,

but there's a lot of evidence that

our brains learn models of the

world that are causal. What

I mean by causal here is that they are

explanations of what's going on.

So I think the main job of our brain is

to figure out an explanation for

everything that we're seeing. That's an

unsupervised learning job, right? And

having an explanation means that you

can kind of simulate, you know, what

would happen if I changed some of these

explanatory factors, even though this

may be a situation that I have never

seen during training. Let me give an example.

Fortunately I never had a car accident

that killed me. So how can I learn

about avoiding the actions that

could, you know, have me killed in a

car accident? Well, supervised

learning is obviously not gonna work.

Even reinforcement learning is not

gonna work because, you know, how many

times would I have to die in an accident

before I learned how to avoid that,

right? You see that there's a

problem. So how do we get around that?

Well, we build a mental model

of cars, of roads, of people, that allows

us to predict that if, you know, I do this

and that, you know,

something bad may happen, and

this is how it may happen, and if

instead I change my behaviour a little bit

I could, you know, end up alive.

So we are able to do that because we

have these kinds of explanatory models.

It's something that we don't know how

to do yet with machines, but this is

something we really need to do,

otherwise, yeah, it's not gonna be,

you know, it's gonna be disappointing.

All right. So how do we possibly do that?

Well, there are many answers, but one of

them, you know, the

reason why we got started in this

adventure of deep learning, is that we

thought that by learning these high-level

representations we might be able to

discover high-level abstractions. What

that means is that these abstractions in

some sense are closer to the underlying

explanations, the underlying explanatory

factors. And what we would really like

is that these high-level features that

we're learning really capture

the knowledge about what's going on.

And one way to think about this is that

the pixels we're seeing, the

sounds we're hearing, the words we're reading,

they were created by something, by some

factors, by some agents. And maybe

the lighting and the microphone,

whatever factors, came in together, were

combined in order to produce what we

observe. And so what we want a machine

to do is to reverse-engineer this, to

figure out what are these factors, and

separate them, right? Disentangle them.

So I'll come back to this notion of

disentangling later, but this is

really, I find, a very inspiring notion.

I want first to separate the

notion of invariance from the notion of

disentangling. The notion of

invariance is one that has been very,

you know, commonly studied and

thought about in areas like speech

recognition or computer vision, where we

wanna do supervised learning. So we

wanna predict something definite, like,

you know, the object category, the

phoneme, and we're trying to hand-craft

features, or maybe learn features, that

are invariant to all the other factors

that we don't care about. If I'm doing

speech recognition I don't wanna know

who the speaker is. I want my features

to be invariant to the speaker. I want my

features to be invariant to the type of

microphone I'm using. If I'm doing object

recognition I would like my high-level

features to be maybe invariant

to translation or something like that.

The problem with this is that, well, I

mean, this is good for supervised learning,

but when you're doing unsupervised

learning, well, who knows which factors

are gonna be the ones that matter? I

wanna capture everything about the

distribution. I wanna know that

actually the underlying explanation of

the sound I'm hearing is both a

sequence of words and phonemes and the

identity of the speaker, where that person

is, whether he's sick, or something.

All these are explanations for

what I'm hearing, and I would like the

representation I'm getting to have all

of that. But I would like those factors

to be separated out, so that I can now

just plug a linear classifier on top

and pick out the phonemes if

that's what I want, or pick out

the speaker identity if that's what I

want, right? That's the difference

between invariance and disentangling. With

invariance we're trying to eliminate

from the signal, from the features, those

factors that we don't care about. In

disentangling we don't want to

eliminate anything, we just wanna

separate out the different pieces that

are the underlying explanations. And

if you're able to do that, you're

essentially killing the curse of

dimensionality, because now if your

goal is to answer a specific

question about one of the factors, you

reduce the dimensionality from very

high to just those features that are

sensitive to that factor. Now the thing

that we don't completely understand is

that when we

apply some of these unsupervised

learning algorithms, it looks like the

features we're getting are a bit more

disentangled than the original data as we go

higher up. So something good is

happening. And

these are experiments that were done,

you know, and published in two thousand

nine and two thousand eleven, and I

suspect there are other papers more

recently, where we do a kind

of analysis of the features that

have been learned with unsupervised

learning algorithms like sparse

auto-encoders, knowing some of the

factors, right? So, you know, I kind of

cheat and I know some of the underlying

factors. Now I can test whether some of

the features become specialised more

towards some factor and less

sensitive to other factors. It's something

we can measure, and somehow it seems to

happen magically. So why would that

happen? So here's a kind of,

it's a sketch of a theory of why

unsupervised learning can give rise to

the extraction of features that are

more disentangled than the original

data. And before I show you the

equation, let me show you this picture,

because pictures are so much better.

So imagine that this is the data you're

getting. You have a distribution which is

actually a mixture of three Gaussians.

You can't have simpler than that; well,

you could have a single Gaussian. But nobody

tells you, you know, which Gaussian

the particular sample you're

getting comes from. So you have unlabelled

data: you just have the X, and the Y's

would be the Gaussian identity, is it

number one, number two, number three,

but you only observe X, right? So if you

only observe X, what would be a good

model of the data? Well, the best

possible model of the data is the one

that actually spells out the density as

a mixture of three Gaussians, right?

In terms of log-likelihood,

or whatever you wanna use, it's very

likely that the best model of the data is

the one that actually discovers that

there is a latent variable Y which can

take the three, you know, integer values

one, two or three. I mean, you can call them

A, B, C if you want, and you

can relabel them, but the point is we

have these three categories that are

sort of implicit in the data. When we

do clustering we're exploiting the

fact that there are natural clusters,

and we use clustering algorithms to

discover these clusters. And you can

think of these clusters as causes that

nobody told us about but that we can

discover with a simple statistical

analysis. Just, you know, K-means will

figure it out, right? So the

principle is that there are underlying

causes, and the statistics of the data

can reveal them to us if we have a good

model of the data. The better the model

we have, the better we are able to

figure out those underlying causes.
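
The three-Gaussian picture can be tried out with a small sketch. Everything here is an assumption for illustration: the data are one-dimensional, the means 0, 10 and 20 and the unit variance are made up, and `kmeans_1d` is a hypothetical helper with a simplified deterministic initialisation. The component identity Y is used only to generate the data and is never shown to the clustering step.

```python
import random

random.seed(0)

# Hypothetical 1-D data: a mixture of three Gaussians with unit variance.
# The labels (which Gaussian each point came from) are discarded.
true_means = [0.0, 10.0, 20.0]
data = [random.gauss(m, 1.0) for m in true_means for _ in range(100)]
random.shuffle(data)


def kmeans_1d(xs, k, iters=20):
    """Plain K-means in one dimension; returns the sorted cluster centres."""
    s = sorted(xs)
    # Deterministic initialisation at evenly spaced order statistics
    # (a simplification; practical implementations often use k-means++).
    centres = [s[(len(s) - 1) * i // (k - 1)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            # Assign each point to its nearest centre.
            j = min(range(k), key=lambda c: abs(x - centres[c]))
            groups[j].append(x)
        # Move each centre to the mean of its assigned points.
        centres = [sum(g) / len(g) if g else centres[i]
                   for i, g in enumerate(groups)]
    return sorted(centres)


centres = kmeans_1d(data, k=3)
print(centres)  # each centre should land near one generating mean
```

The recovered centres play the role of the latent variable Y: nobody supplied the labels, yet the statistics of X alone are enough to find the three causes, up to relabelling.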

Now why would that be useful for

supervised learning? That's where

this slide and this equation

become interesting. So let's think of

Y here as one of the factors that

explain X, all right? And let's

say that at the end of the day we

actually want to classify and predict

Y given X. So

we could just train a normal neural net

that predicts Y directly from X, or we

could train a generative model that

captures P of X, right? And as I

tried to argue previously, the best

possible generative model here is actually

one that's written as a sum over the

Y's, and possibly over all the other

variables, that I'll call H, where

given the causal factors we can

predict X. And the reason that

this is a better model than

this one is simply that this is how the

data was actually generated, right? So

the best model of the data is the one

that kind of tells the truth, that's how it's

generated. The one that gives the best

predictions is the one that corresponds to the

truth. So even if we

don't observe Y, okay, if we just

observe X, we can extract latent

variables. Like, we try to

model P of X as P of X given H

times P of H, for example. So we

introduce latent variables H, and in the

best possible model, well, within H

should be Y, because Y is one of the

factors that explains X. And so if we

find good representations for P of X,

it's likely that these representations

will be useful to predict Y, okay?
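
Written out, the argument above amounts to the following decomposition. This is a reconstruction from the spoken description, with $h$ standing for the remaining latent factors; the exact notation on the slide is not in the transcript.

```latex
% Generative model that "tells the truth": x is produced from its causes.
P(x) = \sum_{y} \sum_{h} P(x \mid y, h)\, P(y, h)

% Generic latent-variable form mentioned in the talk:
P(x) = \sum_{h} P(x \mid h)\, P(h)

% If y is among the factors, the classifier of interest follows by Bayes' rule:
P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)}
```

The last line is why modelling $P(x)$ can help: when $y$ is a cause of $x$, the terms $P(x \mid y)$ and $P(y)$ that define the classifier are the same terms that make up $P(x)$.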

There is a nice paper at ICML

two thousand twelve by Janzing and

others from Bernhard Schölkopf's group at the

Max Planck Institute where they show

that there's a huge difference between

the situation where X is the cause of

Y and where Y is the cause of X, in terms

of the ability of semi-supervised

learning to work. In other words, if Y

is the cause of X then we can do

semi-supervised learning, and

learning about P of X actually becomes

useful, even though at the

end of the day we only care about P of

Y given X. Whereas if the causal

direction was reversed, then all the

semi-supervised learning would be

useless, because in the case where it

was reversed, basically the

joint

distribution P of Y and X would

just be given by P of Y given X times P of

X, and so P of X would have nothing to do,

in its structure, with P of Y given

X. Whereas if it's the other way

around, if the right causal model is to

go from Y to X, then when we want to

learn P of Y given X, well, there

is information about P of Y given X

inside P of X, because P of X is

decomposed like this. So yeah, they

push this argument much further, but

there is a deep connection between

causality, you know,

which is the cause of which, and

the success of unsupervised

learning in helping supervised learning. That's

the main message. All right, so I mentioned

that unsupervised learning is

difficult, and this shows up very

clearly when you try to tackle

unsupervised learning using an arsenal

of mathematical and computational tools

from probability, like graphical models

and models with latent variables.

So in principle introducing the latent

variables should help us, and it should

help us to even avoid the curse of

dimensionality, because

we're modelling at the right level in

some sense. But the problem is that for

all of the approaches that

are really anchored in probability, in an

explicit probabilistic model, what we

find is that some of the computations

that are needed, either for

learning or for using the model, are just

intractable. They involve, you know, running

integrals or sums over an exponential

number of things. So for example, in

typical directed models, exact

inference, in other words predicting the

latent variables given the input, is

intractable. Even though

you're able to go in the other

direction, predicting X given H,

because that's how the model is

parameterised, going backwards, which is

something we actually need to do both

for learning and potentially for using

the model, involves an

intractable sum. In other models, the

undirected models, there's

another issue, potentially in

addition to this one, which is that

these models involve a normalisation

constant which is intractable, and

so is its gradient, right? In other

words, the probability is expressed as

some expression divided by a

normalisation constant, which we

usually write as Z, and that

is something we can't compute easily.

And of course we also need to compute

the gradient of that Z. So

it looks like it's hopeless. So

this has, you know, motivated

a lot of new things, some of which I

will tell you about. But let me start

with the ancestors of the

deep generative models: the energy-based

models, or Boltzmann machines, basically

the category of undirected graphical

models. With undirected graphical

models, basically you're expressing the

probability function, where X is the

random variable you're trying to model,

in terms of an energy. This is just

a rewrite; there's not much of a

constraint in doing this, except that

we're saying that every

configuration gets a non-zero

probability, because the energy, you know,

is gonna be finite for any X, and so

this means the probability is greater than zero

for everything. But besides that, what it's

really saying is that instead of

parameterising the probability directly,

we're parameterising this guy, the

energy, and we're letting this Z

derive from it. So Z here is just the sum

over X, or the integral over X, of

the numerator. Okay, so if you have a

model of that type, it turns out that

the log-likelihood gradient tells you to

update your parameters according to the

following very simple idea,

especially if you think about

stochastic gradient descent. So I'm given

an example X, let's call it X plus, and

this landscape that I'm showing here is

the energy landscape. So

remember, this E to the minus energy is the

probability, so when the energy is low the probability is

high, and there's an exponential

relationship, which is hard to

visualise here, but when the energy

goes up very much, then the probability

goes exponentially fast to zero. All right.
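
The relationship between energy and probability described here, P(x) = exp(-E(x)) / Z, can be sketched on a toy example. The four configurations and their energies below are made up for illustration; with so few states Z is trivial to compute, which is exactly what stops being true in high-dimensional spaces.

```python
import math

# Hypothetical energies for four discrete configurations (made-up numbers).
energy = {"a": 0.0, "b": 1.0, "c": 2.0, "d": 10.0}

# Normalisation constant Z: the sum of the numerator over all configurations.
Z = sum(math.exp(-e) for e in energy.values())

# P(x) = exp(-E(x)) / Z
prob = {x: math.exp(-e) / Z for x, e in energy.items()}

for x in energy:
    print(x, energy[x], prob[x])
```

Every configuration gets a strictly positive probability because its energy is finite, and raising the energy by one unit divides the probability by e, so configuration "d" ends up with almost no mass.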

So we're given an example X plus and

we have a current energy function. So

this is the curve; the Y axis is energy.

And what we wanna do with maximum likelihood

is make the probability of the

observed data high. That's what maximum

likelihood means. That means make the

energy of the observed configurations

low. So the ideal solution would be to

make every training example sit at a peak, I

mean, not a peak, a trough, like a

minimum of the energy. That would be the

ideal solution from the training point of

view; for generalisation it might not be. But

anyway, what training consists in is

pushing down on the energy where the

examples are and pushing up everywhere

else. Because if I just push down on

the energy for

the training example, that may not be

good. What I really want is, you know, the

relative energy to be small for

training examples. Here's an example where

the data points are these little

dots. During training we're pushing

up everywhere else, and we're gonna get

a model that puts low energy where

the data is. This is a good model, right,

and this is not as good a model.

You can get that just by doing

three lines of algebra, but there is

something kind of intuitive about

what's going on here. At the same time

as we're trying to push down the energy at the

configuration given by the data,

we're trying to push up

everywhere else, but not

with the same strength

everywhere else. The equation we're

getting tells us we wanna push up

especially in places where the energy

is low, right? So all those places that

get a high probability basically should

be pushed up. We call these

negative examples and these positive

examples. We're trying to make

positive examples more probable and

trying to make negative examples less

probable. And where do we get those

negative examples? Well, ideally these

negative examples come from the model

distribution itself, right? So once we

have an energy, we have a probability

distribution that corresponds to it by this

equation. And if we could sample from

this distribution we would get, like, you

know, many points here, a few here, a few

here. So we wanna push up where we get

those samples. That's what

the math tells us we should be doing to

maximise likelihood. This is what we see

in this equation: the derivative of

the log probability with respect to the

parameters, which are hidden inside the

energy function, has two terms, one which

we call the positive phase term and the

other called the negative phase term.

And this one is saying, you know, change the

parameters so that the energy at this X

becomes lower, because we wanna maximise

this and we have a minus here, so we

minimise the energy at this X. And

now you also have this term that just

pushes up, so there's no negative here.
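
The two terms can be checked on a tiny example. The sketch below assumes a tabular model in which each configuration has its own energy parameter, so `theta[i]` is E(x_i); this parameterisation, the numbers and the function names are illustrative choices, not something shown in the talk. In this setting the gradient of log P(x) is minus one at the observed configuration (positive phase) plus the model probability of every configuration (negative phase).

```python
import math

# Tabular energy-based model: one free parameter per configuration,
# theta[i] = E(x_i). This parameterisation is an assumption for illustration.
theta = [0.5, 1.5, 0.2, 2.0]


def log_prob(theta, x):
    """log P(x) = -E(x) - log Z, with Z summed over all configurations."""
    log_z = math.log(sum(math.exp(-e) for e in theta))
    return -theta[x] - log_z


def grad_log_prob(theta, x):
    """Gradient of log P(x) with respect to every energy parameter.

    Positive phase: -dE(x)/dtheta, which lowers the energy of the observed x.
    Negative phase: +sum over x~ of P(x~) dE(x~)/dtheta, which raises the
    energy everywhere, weighted by the model's own probabilities.
    """
    z = sum(math.exp(-e) for e in theta)
    probs = [math.exp(-e) / z for e in theta]
    positive = [-1.0 if i == x else 0.0 for i in range(len(theta))]
    negative = probs  # dE(x~)/dtheta_i is 1 when x~ == i, else 0
    return [p + n for p, n in zip(positive, negative)]


x_plus = 1  # index of the observed training example
grad = grad_log_prob(theta, x_plus)

# One step of gradient ascent on the log-likelihood.
lr = 0.1
new_theta = [t + lr * g for t, g in zip(theta, grad)]
```

After the step the energy at `x_plus` has gone down and the energy of every other configuration has gone up, most strongly where the model currently puts the most probability.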

This term wants to go up, and it pushes up

everywhere: it sums over all X tilde

but weighted by P of X tilde. So those

places where the model thinks that, you

know, there is a high probability, we

want to reduce their probability, we

want to increase their energy. This is

the case here; in the second line

the model involves not just the X but

also some latent variable H. So now the

energy function is defined in terms of

both X and H, and you could marginalise,

so sum over all the values of H, and

get another equation which looks like

the one we had before. I'd call this

modified energy; marginalised energy

should be the right term, but physicists

call it free energy. And there is a

similar equation, except that we now

have to weight by those probabilities

of H given X in the two terms here.
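
For reference, the latent-variable case described here can be written as follows. This is the standard form for energy-based models with hidden variables, reconstructed from the spoken description; the slide's own notation is not in the transcript.

```latex
% Joint energy over visible x and latent h
P(x, h) = \frac{e^{-E(x, h)}}{Z}

% Marginalising over h gives the free energy F(x)
P(x) = \sum_{h} P(x, h) = \frac{e^{-F(x)}}{Z},
\qquad F(x) = -\log \sum_{h} e^{-E(x, h)}

% Log-likelihood gradient: both terms are now expectations over h
\frac{\partial \log P(x)}{\partial \theta}
  = -\sum_{h} P(h \mid x)\, \frac{\partial E(x, h)}{\partial \theta}
    + \sum_{\tilde{x}, \tilde{h}} P(\tilde{x}, \tilde{h})\,
      \frac{\partial E(\tilde{x}, \tilde{h})}{\partial \theta}
```

The first term is the positive phase, averaged over $P(h \mid x)$ for the observed $x$; the second is the negative phase, averaged over the model's joint distribution.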

And these we call posterior probabilities.

So you see that when you have

latent variables, you know, to learn we

need to sample or average over this

posterior probability of the latent

variables given X. And this can be

hard. So I won't tell you much

about how we do this, but the ways we

know right now to do this all

involve some kind of Monte Carlo

Markov chain. So Monte Carlo Markov

chains are just methods to sample from a

distribution when, you know, we don't

have any better method. So it's a kind

of general method for sampling from a

distribution, and it's an iterative

method. You never actually get a real

sample from the distribution; you

have to go, you know, many steps, and

asymptotically you hope that you get

a sample from the right distribution.

You may have heard about restricted

Boltzmann machines. These are a

particular kind of undirected

graphical model that has a graph

structure like this, where there is

no relationship

between the X's when we know the H, and

vice versa. So the X's are

conditionally independent given the H

and vice versa. So this forms what's

called a bipartite graph, where we

have connections, you know, going from

top to bottom everywhere but no

connections, no so-called lateral

connections, here or here. And with

those conditions it turns out that it's

actually easier to train these models,

and I'm not gonna go into it; again it's

using so-called Monte Carlo Markov

chains, but somehow we are able

to do a decent job of training

these types of undirected graphical

models. And so the RBMs were used as

building blocks, starting with the two

thousand six breakthroughs, for

unsupervised learning to train deeper

models. But that thread of research just

kind of died over the last few years.
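
The bipartite structure of the restricted Boltzmann machine described above is what makes sampling convenient: given X, every hidden unit can be sampled independently, and vice versa. The sketch below assumes the usual binary RBM with energy E(x, h) = -b·x - c·h - xᵀWh; the sizes, the random weights and the function names are illustrative assumptions, and no training is performed.

```python
import math
import random

random.seed(0)

n_visible, n_hidden = 6, 3

# Hypothetical small random parameters; a trained model would learn these.
W = [[random.gauss(0.0, 0.1) for _ in range(n_hidden)]
     for _ in range(n_visible)]
b = [0.0] * n_visible  # visible biases
c = [0.0] * n_hidden   # hidden biases


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def p_h_given_x(x):
    """P(h_j = 1 | x) for each hidden unit, computed independently."""
    return [sigmoid(c[j] + sum(x[i] * W[i][j] for i in range(n_visible)))
            for j in range(n_hidden)]


def p_x_given_h(h):
    """P(x_i = 1 | h) for each visible unit, computed independently."""
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(n_hidden)))
            for i in range(n_visible)]


def sample(probs):
    """Draw one binary value per unit from its Bernoulli probability."""
    return [1 if random.random() < p else 0 for p in probs]


def gibbs_step(x):
    """One step of block Gibbs sampling: x -> h -> x'."""
    h = sample(p_h_given_x(x))
    return sample(p_x_given_h(h)), h


x = [1, 0, 1, 0, 1, 0]
for _ in range(10):
    x, h = gibbs_step(x)
```

Repeating `gibbs_step` is the Markov chain mentioned in the talk: each step is cheap because of the conditional independence, but many steps may be needed before the samples are representative of the model distribution.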

And I have some thoughts

about why it didn't work as well as

we would have hoped, and part of it, I

believe, has to do with the fact

that we rely on these Monte Carlo

Markov chains in order to get those

samples. Let me try to explain what I

think is going on. So in order to get a

gradient on the parameters, in order to

train the model, we need to get samples

from the model. In other words, we have

to ask the model, you know, give me

examples of the things you believe in,

like which images you would generate.

And we do this by running a Markov

chain which starts at some

configuration and, you know, randomly

makes local

small moves. What's particular about

MCMC is that those moves are

typically both local and they want to

go to a place of high probability. So

at the end of the day you end up

walking near the modes of the distribution

and spending more time where the

probability is higher.

The deal is that as we

run these Markov chains we end up

spending more time where the probability

is higher, in fact proportionally

exactly to the probability. But

there's a problem. When

the model is kind of agnostic, initially,

your model puts sort of uniform

probability everywhere. And then it

gets to put more probability mass

around where the data is, and initially,

you know, these modes are kind of

smooth and you can still travel

between those modes without having to

go through a zero-probability region.

But as the model gets sharper, in other

words it gets more

confident about which configurations are

probable and which are not, like the

things in between the modes, for example,

maybe this is one category and this is

another category and there shouldn't be

anything in between, then what happens

is that those Markov chains get trapped

around one mode and they can't

easily jump from one mode to another

mode. And what it means is that if we

start somewhere we're gonna stay around

that region, and we can't visit the rest,

and we don't get really representative

samples. If we don't get representative

samples, then our training suffers.
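
The trapping effect can be reproduced with a toy sampler. The sketch below is an assumption-laden illustration, not the sampler used for Boltzmann machines: it runs random-walk Metropolis on a one-dimensional double-well target p(x) ∝ exp(-beta · (x² - 1)²), where the made-up parameter `beta` controls how sharp the two modes are, and it counts how often the chain moves from one mode to the other.

```python
import math
import random


def count_mode_switches(beta, steps=5000, step_size=0.3, seed=0):
    """Random-walk Metropolis on p(x) proportional to exp(-beta*(x^2 - 1)^2).

    The target has two modes, at x = -1 and x = +1; a larger beta makes them
    sharper. Returns how many times the chain crosses from one side of zero
    to the other.
    """
    rng = random.Random(seed)

    def energy(x):
        return beta * (x * x - 1.0) ** 2

    x = 1.0  # start in the right-hand mode
    switches = 0
    for _ in range(steps):
        proposal = x + rng.gauss(0.0, step_size)  # small local move
        # Accept with probability min(1, p(proposal) / p(x)).
        delta = energy(proposal) - energy(x)
        if delta <= 0 or rng.random() < math.exp(-delta):
            if (proposal > 0) != (x > 0):
                switches += 1
            x = proposal
    return switches


smooth = count_mode_switches(beta=1.0)   # broad, overlapping modes
sharp = count_mode_switches(beta=50.0)   # sharp, well-separated modes
print(smooth, sharp)
```

With the smooth target the chain crosses between the modes regularly; with the sharp target the low-probability region in between acts as a barrier and the chain stays near the mode where it started, so its samples are not representative of the whole distribution.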

Those models are able to learn distributions up to some level of complexity; if we try to learn more complex distributions, it just stalls. We haven't been able to fix that yet. Maybe we'll find solutions, but for now it remains an open problem as far as I'm concerned.

One glimmer of hope comes from experiments that we ran a few years ago, where we found that although sampling in the input space with these MCMC methods is hard, things change if, instead of running the Markov chain in the raw input space, like pixels, we first map the data to a high-level representation. Let's say we've trained a bunch of autoencoders or a bunch of RBMs, so we now have a way to map the input data to a better representation, the kind of representation we typically learn. If we now run the Markov chain in that space, it turns out that it mixes much better between the modes.

We've been trying to understand that, and I have a picture here which hopefully helps to understand what is going on. In pixel space, the raw input space, the data concentrates on some manifold, like here. This is a cartoon, obviously. Let's say this is the manifold of nines and this is the manifold of threes. These manifolds are very, very thin, they occupy a very small volume, and they're well separated from each other, so it's hard to mix between two categories, for example. But what happens is that, as you map the data to these higher-level (not necessarily higher-dimensional) spaces that are learned somehow to capture the distribution, for example with autoencoders, the relative volume occupied by the data in that space is larger than in the original space, and the different manifolds get closer to each other.

Now it becomes easier to jump from one to the other. And something else happens: whereas the manifolds in the original space are highly curved and complicated, when you go to these spaces learned using unsupervised learning, those manifolds become flat.

To understand what I mean by a flat manifold, think about a curved manifold. Let's say the data concentrates in input space on this curved manifold. Now I take two examples, like the image of a nine here and the image of a three here, I linearly interpolate between them, and I look at the points in between and try to visualise what they look like. This is what we did. You have a nine here, you have a three here. You do linear interpolation in pixel space, and of course what you get is the addition of a three and a nine, which doesn't look like either a three or a nine. Likewise, take two random natural images, add them up, and you get something that doesn't look like a natural image. What it means is that if I take two images and I interpolate, the stuff in between is not on the manifold, because the manifold is not flat. If the manifold were flat, then when I do linear interpolation the things in between would look like natural images.

This is actually one of the tests that we use with a new unsupervised feature learning algorithm to see whether it has done a good job of unfolding the manifold. We take two images, we map them to the representation space, we do a linear interpolation, and we look at the things in between in pixel space. We can go back and forth between input space and representation space, so we can interpolate in the representation space, the H space, and then use the decoder to map back to pixel space and visualise.

Here we see what happens when we do it at the first layer of a stack of autoencoders, denoising autoencoders, and here at the second layer. What we find is that the higher we go, the better the flattening. After just the second layer we can interpolate between this nine and a three, and everything in between makes sense and looks like a digit. There's a point here where it suddenly jumps very fast from the nine to the three. This one is just on the border between three and nine, and changing just a few pixels makes it go from nine to three. So it has really found a way to make these manifolds very close to each other, and the path along the straight line goes exactly to the right place and never goes through something that doesn't look like a natural image. Do you have questions about this? Yes.

OK, so I have an image. I have an encoder which maps it to a vector. I take another image, a three, and I get another vector. Two vectors, OK? Now I can take a linear interpolation: alpha times the first one plus one minus alpha times the second one, where alpha lies between zero and one. So for example, half of this one plus half of this one would be right in the middle. That gives me another vector, and then I map it back to pixel space, because I have the two-way mapping here, from input to representation and back.
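The interpolation test can be written down concretely. The following is a toy sketch, not the stacked denoising autoencoders from the talk: the "data manifold" is assumed to be the unit circle in two dimensions, and `encode` and `decode` are hand-written, hypothetical stand-ins for a learned encoder and decoder that happen to flatten that manifold (the code is just the angle). It compares interpolating with alpha directly in input space against interpolating in the code space and decoding.

```python
import numpy as np

def encode(x):
    """Hypothetical encoder: maps a point on the unit circle to its angle (a flat 1-D code)."""
    return np.arctan2(x[..., 1], x[..., 0])

def decode(h):
    """Hypothetical decoder: maps an angle back to a point on the unit circle."""
    return np.stack([np.cos(h), np.sin(h)], axis=-1)

def distance_to_manifold(x):
    """How far a point is from the unit circle, our stand-in for 'looks like real data'."""
    return np.abs(np.linalg.norm(x, axis=-1) - 1.0)

x_a = decode(np.array(0.2))   # first example (think: the nine)
x_b = decode(np.array(2.5))   # second example (think: the three)
alphas = np.linspace(0.0, 1.0, 11)

# Interpolation in input ("pixel") space: alpha * x_a + (1 - alpha) * x_b
pixel_path = alphas[:, None] * x_a + (1.0 - alphas[:, None]) * x_b

# Interpolation in representation space, then decoding back to input space
h_a, h_b = encode(x_a), encode(x_b)
code_path = decode(alphas * h_a + (1.0 - alphas) * h_b)

print("max distance off the manifold, input-space path:", distance_to_manifold(pixel_path).max())
print("max distance off the manifold, code-space path: ", distance_to_manifold(code_path).max())
```

The input-space path cuts straight through the inside of the circle, far from any data, while every decoded point on the code-space path lies on the manifold.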

This interpolation test tells us whether the manifold has been flattened or not. But why would that be a good thing? Well, I think it's very clear: if you have a flat manifold, you can basically combine existing examples to predict what would happen. I'll show you an example later, but here's what you can do: you take the vector for a man with glasses, you subtract the vector for a man without glasses, you add the vector for a woman without glasses, and you get a woman with glasses. Whatever processing you want to do, you really want to do it in this linear space, where you can use simple linear operations in order to change things. Say I decide that I don't want to have glasses: I just move in some direction. Yes, you're right, but this is the simplest thing we can think of.

Here's another way to think about why linear is good, I mean why flat is good. Let's say I want to capture the distribution, I want to estimate the likelihood. If the distribution is like this, curved, it's going to be difficult to model. If I do it with a Gaussian mixture, I'm going to need many components to go through this. If it's one flat thing: one Gaussian, bam, I've got the density. And think about it: once you know the density of the data, you can answer any question about it in the language of the variables that you've defined.

Yes, I'm not saying that the world is a big Gaussian. But if we can map a lot of it to simple distributions, then we can answer questions. So yes, you're right. Let me give you an example in the direction you're talking about. If you actually have multiple categories, presumably you're not going to get a single Gaussian that captures all the categories. You probably want to have a different Gaussian for each category, so the right model wouldn't be a single Gaussian, because we want to somehow capture the fact that we have these clusters. But the point is that it's going to be much easier to model the data, answer questions, and reason if we can flatten the manifolds. It is something that can be argued.

No, it's not about the structure of the space, it's about the structure of the distribution. If the data has to lie along that subspace and it has a complicated shape, it's hard to capture what's going on, it's hard to make predictions, it's hard to reason about it. If everything becomes linear, it's much easier to reason about.

That's all. Let me move on, because I only have fifteen minutes left and lots of things that I would like to talk about, so I'll go quickly.

I mentioned autoencoders already. You basically have a two-way mapping, from input space to representation space and back, and we can learn it in various ways. We can have a probabilistic version where the encoder is actually a conditional distribution: it's not just a function, we actually inject some kind of noise here, and we get a sample of H given any particular X. Similarly, the decoder can itself be a conditional distribution: given some H drawn from some distribution, which we call the prior distribution, we generate X's from P of X given H. So these two actually represent a joint distribution over X and H, and these two, in a different manner, also correspond to a joint distribution.

I mentioned a little earlier that these MCMC methods, and the classical ways of parametrizing probability distributions, kind of hit a wall, so we explored other ways of doing this. The general theme is: let's bypass all of these normalisation constants and so on, and learn generative black boxes. If we specify the problem of unsupervised learning as "build a machine that can generate, say, images" (which is something we can discuss, but let's say we define it like this), then let's just train a neural net that takes in random numbers and outputs images. We can of course train a different kind of neural net which has different inputs: given, for example, some sentence, I would like to generate an image that corresponds to it. That's just a variation. Once you're able to train a neural net that does this kind of thing, you can do all kinds of other fun things.

So that's one variant, and I'll tell you about it: these are called generative adversarial networks, and they are very hot these days. Another variant, which for now has been less explored, is this: we're not going to generate in one go, we're going to generate through a sequence of steps. It's going to be like a recurrent net, so it's going to have a state. We throw in some random numbers, and at each point in the sequence we generate a sample. As we do more of these steps, the samples look nicer. This kind of imitates the Markov chain, but now we're going to learn the so-called transition operator of the Markov chain: the black box that goes from one state to the next state, generates some samples, and takes in some random numbers so that we get a different thing each time. So this is just a kind of stochastic dynamical system that generates the things we want, and I called these things generative stochastic networks. We can do all kinds of math about these things and actually show that you can train them. This is totally different from the classical approach of undirected graphical models. I'll skip some things.

Let me tell you about the denoising autoencoder, which is related to this and of course to autoencoders in general. It's a particular kind of autoencoder. In the denoising autoencoder, what we do is minimise the reconstruction error, but instead of giving the raw input here, we give a corrupted input. For example, we hide some of the inputs by setting them to zero, or we add some Gaussian noise, or whatever we want. We can also inject noise here, but the traditional thing is that we inject noise here. The error we're minimising is some kind of log-likelihood of reconstruction: the probability of the clean input given the code. That's the denoising autoencoder, and it's probably the one that's been best studied mathematically, the one where we best understand the probabilistic interpretation.

Here's a picture of what's going on. Let's say the data is concentrated on this manifold, and the X's here are training points. We take a training point, we corrupt it, and we get something like this. Then we ask the neural net to go back and reconstruct the original. Of course it may not be able to do it perfectly, because maybe the original could have been here, or here, or here. So in general it's going to point right at the manifold, if it learns well, and so it learns this kind of vector field which points towards the data.

You can actually do experiments on two-dimensional data. Let's say the data are these yellow circles, and the autoencoder learns these arrows. The arrows correspond to where it wants to go: if you start here, it must go in this direction, so the reconstruction is pointing in this direction. The arrow is proportional to reconstruction minus input. In fact we can prove that if you train this well, what it converges to is that the reconstruction minus the input, the same thing as the arrows here, actually estimates what's called the score, d log p(x) / dx: the gradient of the log-density, the direction in which the density increases the most. If you're sitting here, where you want to go to increase the probability the most is towards the manifold. This is the gradient of the likelihood. There's a peak of probability that should be here, and then the probability should go down fast as you move away; in fact it should be zero. But if you smooth a bit, you get this.
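The link between "reconstruction minus input" and the score can be checked numerically in a case where everything is known in closed form. This is a sketch under assumptions that are not in the talk: the data are one-dimensional Gaussian with standard deviation `s`, the corruption is additive Gaussian noise with standard deviation `sigma`, and the "denoising autoencoder" is reduced to a single least-squares weight `w`, which is enough here because the optimal denoiser for Gaussian data is linear. In this setting, reconstruction minus input, divided by the noise variance, should match the score of the noise-smoothed density.

```python
import numpy as np

rng = np.random.default_rng(0)
s, sigma = 2.0, 0.5                       # assumed data std and corruption std

x = s * rng.normal(size=200_000)          # clean training points
x_tilde = x + sigma * rng.normal(size=x.shape)   # corrupted inputs

# "Train" the denoiser: least-squares fit of r(x_tilde) = w * x_tilde to the clean x.
w = np.dot(x_tilde, x) / np.dot(x_tilde, x_tilde)

def reconstruction(v):
    """Learned denoising function (a single weight in this toy case)."""
    return w * v

def smoothed_score(v):
    """d/dv log N(v; 0, s^2 + sigma^2): score of the data density smoothed by the noise."""
    return -v / (s**2 + sigma**2)

probe = np.array([-3.0, -1.0, 0.5, 2.0])
estimated = (reconstruction(probe) - probe) / sigma**2   # reconstruction minus input, rescaled
exact = smoothed_score(probe)

print("estimated score:", estimated)
print("exact score:    ", exact)
```

The arrows (reconstruction minus input) point towards the high-density region at zero, and their size matches the analytic score up to sampling error.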

There are a lot of papers that try to understand this, and that also show that with these denoising autoencoders you can sample: you can define a Markov chain that corresponds to sampling from the model that has been learned. Once you've trained the denoising autoencoder, you just apply the corruption, then apply the stochastic reconstruction (in other words, you sample from the output distribution rather than using a deterministic function), and then you do it again and again. This Markov chain will converge to where the model puts probability mass.

In terms of this picture, what it means is that if you follow the arrow and add a bit of noise, then follow the arrow and add a bit of noise, you will move more or less in that direction, and then you start moving around this thing, because there are no arrows going away from it. If you stray away, it brings you back. So there is a bit of noise that makes you move around like a random walk, and the random walk stays near the manifold. Let me skip a few more things.
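The sampling chain (corrupt, sample a reconstruction, repeat) can be sketched in the same toy setting. The assumptions are mine, not the speaker's: the data are one-dimensional Gaussian with standard deviation `s`, the corruption adds Gaussian noise with standard deviation `sigma`, and instead of a trained network the stochastic reconstruction uses the exact conditional distribution of the clean value given the corrupted one, which is what a perfectly trained model would output in this case. The function name `dae_chain` is hypothetical.

```python
import numpy as np

def dae_chain(n_steps, s=2.0, sigma=0.5, x0=10.0, seed=0):
    """Alternate corruption and stochastic reconstruction, returning the visited states."""
    rng = np.random.default_rng(seed)
    # Exact P(x | x_tilde) for Gaussian data and Gaussian corruption:
    # mean = coef * x_tilde, std = post_std.
    coef = s**2 / (s**2 + sigma**2)
    post_std = np.sqrt(s**2 * sigma**2 / (s**2 + sigma**2))

    x = x0                                   # start far away from the data
    samples = np.empty(n_steps)
    for t in range(n_steps):
        x_tilde = x + sigma * rng.normal()               # apply the corruption
        x = coef * x_tilde + post_std * rng.normal()     # sample the reconstruction
        samples[t] = x
    return samples

samples = dae_chain(50_000)
burned = samples[5_000:]                     # discard the initial transient
print("chain mean:", burned.mean())
print("chain std: ", burned.std(), "(data std is 2.0)")
```

Starting from a point far from the data, the chain is pulled back and then wanders around the region where the data density is, with the same spread as the data.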

There is another kind of autoencoder with a probabilistic interpretation that has really made it big in the last few years: the variational autoencoder. There's a very beautiful theory behind it, and it's very simple, actually. We think about two distributions. There is a directed model, which is supposedly the one that we want to train, and which we can decompose into the prior at the top level and then conditionals (potentially several, but usually there's only one stage, so you have X given H). That's like the decoder path. On the other hand we have an encoder path. The encoder goes exactly in the other direction, but it's stochastic, and it has this Q distribution, Q of H given X. The X comes from the data distribution, which by convention I like to write Q of X. This way, this defines a joint Q of X and H, and this defines a joint P of X and H, and essentially the training objective is to make these two distributions match, in the KL (Kullback-Leibler) sense.

It turns out that this is pretty much tractable, in that it doesn't involve things like running a Markov chain. You can change the parameters of both the encoder and the decoder so that the joint distributions they capture are as close to each other as possible. In particular, if the joint of this and the joint of this match well, then the marginals match too: Q of X, which is the data distribution, matches P of X, which is the marginal here. But you never need to express Q of X directly. This relies on what's called a variational bound, in which log P of X, which is intractable, is bounded by a tractable quantity which involves sampling from Q and measuring P of X given H, something that we can compute. I'm not going to go into the details because I only have a few minutes left, and I'll skip a few things here.
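For readers who want the bound spelled out, here is the standard form of the variational bound in the notation used above (Q for the encoder side, P for the decoder side). The symbol $\mathcal{L}(x)$ for the bound is introduced here for convenience and is not from the talk.

```latex
% Variational lower bound on the intractable log-likelihood:
\log p(x) \;\ge\; \mathcal{L}(x)
  \;=\; \mathbb{E}_{q(h \mid x)}\!\left[ \log p(x \mid h) \right]
  \;-\; \mathrm{KL}\!\left( q(h \mid x) \,\|\, p(h) \right)

% Matching the two joints in the KL sense is the same objective,
% up to a term that does not depend on the parameters:
\mathrm{KL}\!\left( q(x, h) \,\|\, p(x, h) \right)
  \;=\; \mathbb{E}_{q(x)}\!\left[ \log q(x) \right]
  \;-\; \mathbb{E}_{q(x)}\!\left[ \mathcal{L}(x) \right]
```

The first term on the right of the second line is the (negative) entropy of the data distribution, a constant with respect to the encoder and decoder parameters, which is why Q of X never has to be expressed: minimising the KL between the joints amounts to maximising the average bound over training examples.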

There are some recurrent variants of this that have been proposed, called DRAW, which do fun things like generate not in one go but through a sequence of steps: for example, draw a three here by moving a little cursor, changing its position and size, and drawing ink in the middle of it, so that it basically draws the thing you want. It works really well for MNIST digits. You can do it also on SVHN, the Street View House Numbers. These are actually training examples from this dataset, and these are the kinds of samples you're getting. These are really good: before DRAW, we had no algorithm that could draw things like this that look so realistic.

Now, that's digits. The next step was natural images, like ImageNet. For this, the algorithm that really made it big is the generative adversarial network that I mentioned earlier. It's based on a very simple intuition. You're going to train two neural nets. One is the one we want to use at the end, the generator; as I said before, it's a black box that takes a random vector and outputs a fake image, a generated image. But we're also going to train a discriminator network, a classifier. You can think about this discriminator as a trained loss function. Normally the loss function is something fixed, but here, like in some reinforcement learning setups, we are going to learn a loss function, and the loss function is basically one that's trying to discriminate between the fake images generated by our model and the real images coming from the training set. So the discriminator is just doing normal classification. The way we train the generator is that it tries to fool the discriminator; in other words, it's trying to produce an output that maximises the probability that what the discriminator sees is classified as a real image. So we take the output probability here and we just backprop into the generator. That's the basic idea.

During training, when we train the discriminator, we sometimes show it a real training image and tell it "you should output one", and sometimes we give it the output of the generator and tell it "you should output zero". But then the way we train the generator is that we take the probability of being real that the discriminator produces when the input comes from the generator, and we try to maximise it. So we're making the discriminator produce the wrong answer, trying to fool it.
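The two training signals can be written out for a deliberately tiny case. Everything in this sketch is an assumption made for illustration: samples are scalars, the discriminator is a logistic classifier with hypothetical parameters `w` and `b` that are held fixed, and the generator just shifts its noise input by a learnable offset `theta`. Real adversarial training alternates updates of both networks by backpropagation; here only one hand-derived gradient step on the generator is shown.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def discriminator(x, w, b):
    """Probability that x is a real sample (logistic classifier on a scalar input)."""
    return sigmoid(w * x + b)

def generator(z, theta):
    """Toy generator: shifts the noise z by a learnable offset theta."""
    return z + theta

def discriminator_loss(real, fake, w, b):
    """Cross-entropy with target 1 for real samples and target 0 for generated ones."""
    return -(np.mean(np.log(discriminator(real, w, b)))
             + np.mean(np.log(1.0 - discriminator(fake, w, b))))

def generator_objective(z, theta, w, b):
    """Mean log-probability that the discriminator labels generated samples as real."""
    return np.mean(np.log(discriminator(generator(z, theta), w, b)))

def generator_gradient(z, theta, w, b):
    """d/dtheta of the objective: for a logistic D this is mean of w * (1 - D(G(z)))."""
    return np.mean(w * (1.0 - discriminator(generator(z, theta), w, b)))

rng = np.random.default_rng(0)
z = rng.normal(size=1000)
real = 4.0 + rng.normal(size=1000)        # real data sits around x = 4
w, b, theta = 1.0, -2.0, 0.0              # fixed toy discriminator, untrained generator

d_loss = discriminator_loss(real, generator(z, theta), w, b)
before = generator_objective(z, theta, w, b)
theta_new = theta + 0.5 * generator_gradient(z, theta, w, b)   # one gradient-ascent step
after = generator_objective(z, theta_new, w, b)

print("discriminator loss:", d_loss)
print("generator objective before/after:", before, after)
```

The generator step moves `theta` towards the region the discriminator considers real, which raises the probability of being classified as real, exactly the "fooling" signal described above.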

There have been a number of papers, including a famous one, where these kinds of models have been used to generate images that were more realistic than those from any of the previously known methods. These are the kinds of images that were generated, and in this case you can also look at how the image was generated, going from low resolution to high resolution, and sort of see how it's filling in details.

Then there was another paper not long ago, about six months ago. If you don't yet know how arXiv identifiers work: this is the year and the month, and then the numbers increase as more papers are put in. This is just a variant of the GAN which uses convolutions in a smart way. These models are pretty difficult to train, but when you succeed in training them they can produce very realistic images. These are the kinds of images that were generated by the model. This blew everybody's mind. And you can play games like what I told you before: you can work in the representation space and do arithmetic with those vectors, and do things like I showed you before. So the kinds of things people have been doing with words, you can do with images.

There is a new paper coming from my group (I'm not one of the authors) where we combine some of the ideas from the variational autoencoder and the GAN. We have two models, one that goes from input to a latent space and one that goes from latent space to input, so this is like the encoder and the decoder. And we have a GAN discriminator that looks at both the input and the latent and tries to figure out whether the pair comes from this one or from that one. These are the kinds of images we're generating from this. It's hard to quantify, unfortunately; that's one of the problems with these things.

OK, so I think I'm going to stop here. I had a whole bunch of other slides in my presentation, which I'll make available, where I talk about neural autoregressive models, a special case of which is recurrent nets, which can be used to generate. Recurrent nets actually are generative models: you can use them to generate a sequence of things. More recently this was used to generate images as well. This is the PixelRNN paper, which was just presented at the last ICML a couple of weeks ago and got a best paper award. They also are able to generate pretty nice images, and people are getting excited about it. But basically you're just generating one pixel at a time, conditioned on the other pixels. I don't really like the philosophy of this, because we've gotten rid of the latent variables. But it works, and we're scientists, so we have to face reality and try to figure out what it is that we were missing in the other approaches that makes this work quite well. So that's where we are, and thank you for your attention.

More questions, please; we have five minutes.

(Audience question, partly inaudible: you talked about flattening the manifold; is that somehow related to disentangling, to finding the factors?)

Absolutely, yes. We can definitely see it that way. In this flat manifold you can think of it like this: there is a direction corresponding to glasses, there's a direction corresponding to male or female, and then you can do arithmetic kind of independently, add more or less of these things. Whereas in pixel space there's no direction that removes glasses or changes gender; this is just not possible. I mean, it would work for a particular image but not in general, whereas this works in general. So it has taken the image manifold, which is really twisted and complicated, into something flat where directions have meaning. Yes, that's what we were aiming for.

I have a question about adversarial image generation. Yes. So in the case where you generate fake images that are indistinguishable from the real images...

That never happens, because we are not that good yet.

Yes, so that's my question.

I was asking the same question yesterday. We're not able to make the discriminator reach fifty percent error; it always stays a bit better than fifty, sixty or something.

So the question is speculative: do we have a guarantee that, in the limit, we will be able to generate something that looks indistinguishable from the real images?

Well, if we do, we're done. I mean, if the discriminator is completely fooled, and we put as much capacity as we can into it, that means we're finished: we have a machine that generates real images.

Yes, but that is real with respect to the discriminator network.

Sure. The whole of statistical estimation is based on this idea of nonparametric approaches, where you say: let's imagine that the amount of data grows and my capacity grows accordingly; what would happen in the limit? And here we can show what happens in the limit: it's going to learn the distribution. That doesn't say whether it's going to be feasible from an optimisation point of view. Also, there's something really funny going on here: in normal machine learning we have a single objective function, whereas here we have a funny game in which each of the two networks optimises a different objective function. So in theory there is a solution to the game, but it's not simply minimising an objective function. In the paper you'll see a lot of theory about what happens asymptotically, and in principle it should learn the distribution.

OK, thanks.

Thank you very much for the great talk. I have a question about the manifolds in the representation space. You use visualisation to see whether the linear interpolation looks like images on the manifold. Is there another way to characterise the manifold, like its shape or its volume, some other approach?

I'm sure there are many ways that we could use to figure out what is going on. I think we're just starting to play with these things and having a lot of fun, but there's so much we don't understand. Visualisation has been useful from the beginning here, and I think it could have an even bigger role in the future. We're doing things like generating and interpolating in the latent space and seeing what happens in input space, but we could probably do more to try to figure out what is going on.

Yes, I'm wondering, when you do these interpolations, about the relative dimensionality of those representations: does it work better if you have a compressed representation or not?

It depends on the kind of algorithm you're using. Variational autoencoders tend to compress in some sense, to throw away dimensions, too much actually; it's something we don't like, it's doing it too much. With things like denoising autoencoders you can have many more dimensions; that's OK, it actually doesn't hurt. For the GANs, we usually keep the representation space pretty high-dimensional, but not as high-dimensional as the input, because the inputs are typically images and there's probably a lot of redundancy. You don't want those spaces to have too small a dimension. If you go for two or three dimensions, it just doesn't work that well. You can get something for MNIST with three dimensions, you can see things that are reasonable, but you can't do natural images, and even for MNIST it wouldn't be as nice as if you have more dimensions.

Maybe we should take one more question and then break for coffee. Behind you.

Yes, my question is: how can we be sure that the generator network is not just reproducing images from the training set?

Right, it's absolutely a valid concern, and we can do some things to try to make sure it's not. For example, a typical thing that we do is find the nearest neighbour, in Euclidean distance, in the training set to a generated image. So we generate an image, and then we want to check: is this just a copy of a particular training example? If there were a very similar nearest neighbour in the training set to this generated image, then we would know that the network has just memorised it. So that's one trick.
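The nearest-neighbour check is simple enough to sketch directly. The data below are random arrays standing in for images, and `nearest_training_example` is a hypothetical helper name; what counts as "too close" is left to the reader, since the talk gives no threshold.

```python
import numpy as np

def nearest_training_example(generated, training_set):
    """Return (index, distance) of the training example closest to `generated`
    in Euclidean distance, treating each image as a flat vector."""
    flat_train = training_set.reshape(len(training_set), -1)
    diffs = flat_train - generated.reshape(1, -1)
    distances = np.linalg.norm(diffs, axis=1)
    idx = int(np.argmin(distances))
    return idx, float(distances[idx])

rng = np.random.default_rng(0)
training_set = rng.uniform(size=(500, 8, 8))      # stand-in for 500 training images

# A "generated" image that is almost a copy of training example 42 ...
copied = training_set[42] + 1e-3 * rng.normal(size=(8, 8))
# ... and one that is unrelated to any training example.
novel = rng.uniform(size=(8, 8))

idx_c, d_c = nearest_training_example(copied, training_set)
idx_n, d_n = nearest_training_example(novel, training_set)
print("near-copy:  nearest index", idx_c, "distance", d_c)
print("novel image: nearest index", idx_n, "distance", d_n)
```

A very small nearest-neighbour distance flags a memorised example; a distance comparable to typical distances between training images does not.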

But it's not necessarily satisfactory, because maybe it's still learning something like nearest neighbours, but in some higher-level space. So yes, it's something that we could be concerned about: is this overfitting in some sense, and is that maybe why we have these nice images? I don't think we have a fully satisfying answer. In the case of the variational autoencoder, we can actually measure a bound on the log-likelihood, so there we can actually be sure that it's not overfitting, because we have a quantitative measure of the quality of the model through an approximation of the log-likelihood. OK.

so we can things you should in for the
