Deep Generative Models

Transcriptions

Note: this content has been automatically generated.

00:00:00

Thank you once one so this presentation

00:00:05

will be a bit different from

00:00:06

yesterday's. Um it's more about things

00:00:11

that are happening at the research

00:00:13

level and not so much things that

00:00:16

people use to build products yet. Also,

00:00:22

it'll be a little bit more technical.

00:00:24

And I have a bit more time than

00:00:26

yesterday. So feel free to raise your

00:00:29

hand and ask questions in the middle. If

00:00:30

there are too many I'll, you know, filter,

00:00:32

but let's try, so we don't need to

00:00:36

wait for the end to ask questions okay

00:00:38

let's start with motivations really

00:00:45

it's about unsupervised learning. Why is that

00:00:49

important. Well you have to realise

00:00:52

that all the great things that deep

00:00:54

learning has done in the last few years

00:00:56

is mostly due to supervised learning

00:01:00

meaning that we need large

00:01:04

datasets that are labelled where humans

00:01:07

have told the machine what the right

00:01:09

answer should be but that's not how

00:01:13

humans learn most of the time. And

00:01:16

think about how a child, like a two or

00:01:19

three year old figures out what we call

00:01:22

intuitive physics. She understands, you

00:01:25

know gravity she understands solids and

00:01:28

and and liquids and all kinds of

00:01:31

mechanical notions of course without

00:01:35

ever taking a class you know on a

00:01:38

Newtonian physics. She got it by

00:01:43

observation her parents didn't tell her

00:01:45

how the world was you know going on in

00:01:49

terms of the physics. Um she just

00:01:52

interacts with the world observes and

00:01:54

figures out causal explanations that

00:01:57

are sufficiently good that she can

00:01:59

control her environment and do all

00:02:02

kinds of things that robots can't do

00:02:04

yet right. And so we'd like to have

00:02:09

that kind of ability for computers to

00:02:12

observe and interact with the world in

00:02:15

order to get better information. And

00:02:17

learn essentially without

00:02:20

supervision. Now of course when I talk

00:02:23

about unsupervised learning, you have to

00:02:24

understand that in the big scheme of

00:02:26

things. We need all the three types of

00:02:30

learning to reach AI. We need

00:02:31

supervised learning we need

00:02:33

unsupervised learning and we need

00:02:34

reinforcement learning they just you

00:02:36

know cater to different niches and and

00:02:39

humans use all three as well.

00:02:41

So one thing you may wonder, one of the things

00:02:48

I'll talk about, is, you know, why is it

00:02:49

that unsupervised learning hasn't been

00:02:53

as successful. And I don't

00:02:58

seem to have all the answers for that

00:02:59

but I'll give you some

00:03:01

suggestions I think that there are

00:03:04

computational and statistical

00:03:08

challenges that arise out of the

00:03:11

objective that we have in unsupervised

00:03:13

learning in really capturing the joint

00:03:15

distribution in some form maybe

00:03:18

implicitly, of many variables, whereas

00:03:21

when we do supervised learning

00:03:22

typically we only care about predicting,

00:03:24

you know, one thing, one number, one

00:03:26

category there and we're not trying to

00:03:29

get a joint distribution in a high

00:03:30

dimensional space. And that's really

00:03:33

what unsupervised learning is about. It

00:03:35

may not be explicit but really like if

00:03:36

you train an autoencoder you might say, well, I'm not,

00:03:38

you know, I'm just learning by

00:03:40

minimising the reconstruction error or

00:03:41

something. But really what you're

00:03:44

trying to do is to extract information

00:03:46

about the structure of the data

00:03:48

distribution in a high dimensional

00:03:50

space and that's fundamentally

00:03:52

difficult and and I don't know maybe

00:03:54

it's gonna take us another fifty years

00:03:56

to crack this, but I really believe, and

00:04:00

others like me believe, that we

00:04:05

need to work hard on this and make

00:04:08

progress on this you know to to even

00:04:11

approach human level intelligence right

00:04:15

So now, from a practical point of view, why

00:04:17

would we want to do that? Well, a

00:04:19

really obvious answer is that there's a

00:04:22

lot of unlabelled data out there that

00:04:25

we would like our computers to learn from

00:04:28

and use that information to build

00:04:30

better models of the world we can't go

00:04:34

on building specialised machines for

00:04:36

every new task where you're gonna

00:04:37

need a lot of labelled data for each I

00:04:39

mean we can and this is what we're

00:04:40

doing, but it's not gonna bring us human

00:04:43

level AI. It's not gonna be enough.

00:04:47

here's another reason when we do

00:04:51

unsupervised learning as I said

00:04:53

essentially in some sense we are

00:04:55

learning about the joint distribution

00:04:56

of things then we should be able to

00:04:59

answer any new question about the data.
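
Editor's illustration: a minimal sketch of the claim that a learned joint distribution can answer any conditional question. It assumes a small made-up probability table over three binary variables X, Y and Z; the table values and the helper name `conditional` are hypothetical and not from the talk.

```python
import numpy as np

# Hypothetical joint distribution over three binary variables (X, Y, Z),
# stored as a table joint[x, y, z]. The numbers are random, for illustration only.
rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))
joint /= joint.sum()  # normalise so the table sums to 1


def conditional(joint, given):
    """Distribution over the unobserved variables given some observed ones.

    `given` maps an axis index (0 for X, 1 for Y, 2 for Z) to its observed value.
    """
    index = tuple(given.get(axis, slice(None)) for axis in range(joint.ndim))
    sliced = joint[index]         # keep only entries consistent with the evidence
    return sliced / sliced.sum()  # renormalise


p_yz_given_x = conditional(joint, {0: 1})         # P(Y, Z | X=1), shape (2, 2)
p_x_given_yz = conditional(joint, {1: 0, 2: 1})   # P(X | Y=0, Z=1), shape (2,)
# The supervised question "predict Y given X" is one query among many:
p_y_given_x = conditional(joint, {0: 1}).sum(axis=1)  # P(Y | X=1), Z summed out
```

The same table serves every query; supervised learning corresponds to committing in advance to only one of them.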

00:05:03

So think about it: I have three random variables

00:05:05

X, Y and Z, and I learn the joint

00:05:07

distribution of all three now I should

00:05:09

be able to answer a question like oh

00:05:11

given X, what can I say about Y and Z,

00:05:13

or given Y and Z, what can I say

00:05:16

about X? All of the questions like, I

00:05:19

know I know some aspects of reality

00:05:21

what can I say about other aspects. So

00:05:24

in unsupervised learning there's no

00:05:27

preference as to which question you're gonna

00:05:28

be asking. You can think of supervised

00:05:30

learning as a special case of, you know,

00:05:34

restricting yourself to one

00:05:35

particular question, which is predict

00:05:38

Y given X. Another reason why unsupervised

00:05:44

learning can be practically useful even

00:05:46

before we completely crack it is that

00:05:48

it turns out to be very useful as a

00:05:50

regulariser. What that means is

00:05:52

that it can act as an adjunct to supervised

00:05:55

learning. So this is the semi-supervised

00:05:58

case. We can use unsupervised learning as

00:06:00

a way to help generalisation and the

00:06:03

reason it helps is that it

00:06:05

incorporates additional constraints

00:06:10

on the solution. So the

00:06:12

constraint, or the a priori, that we're

00:06:13

putting in is that the solutions we're

00:06:16

looking for are not just good at

00:06:17

predicting Y given X; somehow they

00:06:19

involve representations that are

00:06:22

also good at capturing something about

00:06:25

X, the input distribution, right?

00:06:27

You don't have to have that

00:06:29

constraint when you do pure supervised

00:06:30

learning but when you add that

00:06:32

constraint you can get better

00:06:33

generalisation. Um and that can be

00:06:36

useful as a regulariser by itself. It

00:06:38

could be useful in the transfer

00:06:40

setting where you wanna go to a new

00:06:41

task where you have very few labelled

00:06:43

examples or domain adaptation which is

00:06:45

kind of similar it's not a new task

00:06:47

it's a new, you know, type of data. Maybe

00:06:50

you go from, you know,

00:06:54

Quebec french to swiss french. And you

00:06:57

have to adapt and you don't have a lot

00:06:59

of data alright so that's these are

00:07:04

good reasons another good reason that

00:07:06

came out right at the beginning of the

00:07:07

deep learning revolution in two

00:07:09

thousand six is that it looks like we

00:07:11

can exploit unsupervised learning to

00:07:13

make the optimisation problem of deep

00:07:17

learning easier. And the reason is

00:07:20

that we can define sort of local

00:07:25

objective functions like each pair of

00:07:28

layers should be a good autoencoder,

00:07:31

should form a good pair, a good

00:07:35

encoder-decoder pair. And that kind of

00:07:37

constraint is something you know that

00:07:39

induces the kind of training signal

00:07:41

locally. You don't need to backprop through

00:07:45

twenty layers to get that information.

00:07:48

So in the unsupervised

00:07:50

pre-training things we did from two

00:07:53

thousand six to about two thousand twelve,

00:07:55

it was a useful way to get

00:07:59

the training off the ground for deep

00:08:01

supervised nets. Later we found other

00:08:03

ways to go around this optimisation

00:08:05

difficulty with with the rectifier but

00:08:08

it remains that there's an interesting

00:08:11

effect here that could be taken

00:08:12

advantage of and then the last reason

00:08:15

why this is interesting is that even if

00:08:17

you're only doing pure supervised

00:08:19

learning it happens sometimes that the

00:08:21

thing you wanna predict is not a single

00:08:24

simple class or a simple real value.

00:08:27

It's a composed object,

00:08:31

for example you're predicting a set,

00:08:35

you're predicting a data structure,

00:08:36

predicting a sentence, you're predicting

00:08:38

an image, right? So if you predict an

00:08:39

image the output is a high dimensional

00:08:42

object. If you predict a sentence, the

00:08:44

output is high dimensional object and

00:08:46

and these objects are composed of

00:08:48

simple things like pixels or words or

00:08:50

characters. And so they have a joint

00:08:52

distribution. Now of course it's a

00:08:54

conditional joint distribution, so given

00:08:55

the input I want to predict the joint

00:08:57

distribution of a bunch of things like

00:08:58

words in the sentence or something like

00:09:00

that, or the structure of a molecule.

00:09:03

So all of these kinds of objects we may

00:09:04

be interested in predicting or saying

00:09:07

something about, given an input. That's

00:09:10

called, you know, structured output

00:09:12

learning and and they're essentially

00:09:14

all the techniques that we have been

00:09:17

developing for unsupervised learning,

00:09:18

especially the probabilistic ones, they

00:09:20

become useful we just have the

00:09:23

unsupervised learning model as usual

00:09:25

except we condition it meaning we have

00:09:27

the input that changes something in the

00:09:30

form of the joint distribution over the

00:09:31

outputs. Alright, so these are very good

00:09:34

reason to study unsupervised learning

00:09:36

but the one that really you know makes

00:09:39

me wake up at night is that we really

00:09:42

want the machine to understand how the

00:09:45

world ticks, how the world works. And

00:09:47

unfortunately, if you step back,

00:09:54

you know behind all the hype and the

00:09:58

the excitement around deep learning and

00:10:00

machine learning in general what

00:10:02

happens very often is that the

00:10:04

models end up learning simple tricks,

00:10:08

they're like surface statistical

00:10:10

regularities in order to solve the task

00:10:13

and if you think about the self driving

00:10:17

cars, you would like those self-driving

00:10:19

you know cars to somehow not just

00:10:23

rely on surface statistical

00:10:25

regularities but can make sense of the

00:10:28

causal relationships between the

00:10:30

objects and and what could happen if

00:10:34

scenarios even though they may not have

00:10:36

seen these scenarios during their

00:10:37

training phase. So how can that happen?

00:10:40

How do humans manage to do that?

00:10:43

Well, the idea, and I think this is a

00:10:46

hypothesis of course we don't really

00:10:48

know what's going on in our brains but

00:10:50

but there's a lot of evidence that we

00:10:52

learn models of the

00:10:54

world that are causal that that's what

00:10:58

I mean by causal here is that they are

00:10:59

explanations of what's going on.

00:11:03

so I think the main job of our brain is

00:11:07

to figure out an explanation for

00:11:08

everything that we're seeing. That's an unsupervised

00:11:10

learning job, right? And

00:11:12

having an explanation means that you

00:11:14

can kind of simulate you know what

00:11:17

would happen if I change some of these

00:11:19

explanatory factors even though this

00:11:22

may be a situation that I have never

00:11:24

seen during training. To give an example,

00:11:27

fortunately I never had a car accident

00:11:31

that killed me. So how can I learn

00:11:34

to avoid the actions that

00:11:36

could, you know, have had me killed in a

00:11:39

car accident? Well, supervised

00:11:42

learning is obviously not gonna work

00:11:44

even even reinforcement learning is not

00:11:46

gonna work because you know how many

00:11:48

times would I have to die in an accident

00:11:49

before I learned how to avoid that,

00:11:51

right you you see that there's a

00:11:53

problem. So how do we get around that

00:11:56

Well, we build a mental model

00:11:59

of cars, of roads, of people, that allows us

00:12:02

to predict that if you know I do this

00:12:04

and that, you know, there is

00:12:07

something bad that may happen, and

00:12:08

this is how it may happen, and if

00:12:11

instead I change my behaviour a little bit

00:12:12

I could, you know, end up alive.

00:12:16

so we are able to do that because we

00:12:19

have these kinds of explanatory models

00:12:21

it's something that we don't know how

00:12:22

to do yet with machines, but this is

00:12:25

something we really need to do

00:12:26

otherwise yeah it's it's not gonna be

00:12:30

you know it's gonna be a spongy

00:12:32

alright. So how do we possibly do that

00:12:37

well there are many answers but one of

00:12:40

them, you know, the

00:12:42

reason why we got started into this

00:12:44

adventure of deep learning is because we

00:12:47

thought that by learning these high

00:12:49

level representations we might be able to

00:12:51

discover high level abstractions what

00:12:54

that means is these abstractions in

00:12:56

some sense are closer to the underlying

00:12:58

explanations, the underlying explanatory

00:13:00

factors. And what we would really like

00:13:03

is that these high level features that

00:13:05

we're learning really capture

00:13:09

the knowledge about what's going on.

00:13:11

And one way to think about this is that

00:13:13

the pixels we're seeing, the

00:13:16

sounds we're hearing, the words we're reading,

00:13:20

they were created by something by some

00:13:23

factors by by some agents. And maybe

00:13:27

the lighting and the the microphone

00:13:30

whatever factors came in together were

00:13:32

combined in order to produce what we

00:13:35

observe and so what we want a machine

00:13:38

to do is to reverse engineer this to

00:13:40

figure out what are these factors and

00:13:42

separate them right disentangle them.

00:13:45

So I'll come back to this notion of

00:13:46

disentangling later, but this is

00:13:48

really, I find, a very inspiring notion.

00:13:51

I want first to separate the

00:13:57

notion of invariance from the notion of

00:13:58

disentangling. The notion of

00:14:00

invariance is one that has been very

00:14:03

you know commonly studied and and

00:14:07

thought about in areas like speech

00:14:09

recognition or computer vision where we

00:14:11

wanna do supervised learning so we

00:14:13

wanna predict something definite like

00:14:15

you know the object category the

00:14:16

phoneme. And we're trying to hand craft

00:14:20

features or maybe learn features that

00:14:23

are invariant to all the other factors

00:14:28

that we don't care about if I'm doing

00:14:29

speech recognition. I don't wanna know

00:14:31

who the speaker is I want my features

00:14:33

to be invariant to the speaker. I want my

00:14:36

features to be invariant to the type of

00:14:37

microphone I'm using. If I'm doing object

00:14:40

recognition I I would like my high

00:14:43

level features to be maybe invariant

00:14:45

to translation or something like that.

00:14:47

Um the problem with this is that well I

00:14:51

mean, this is good for supervised learning,

00:14:53

but when you're doing unsupervised

00:14:54

learning, well, you know, which factors

00:14:57

are gonna be the ones that matter? I

00:14:58

wanna capture everything about the

00:14:59

distribution. I wanna know that

00:15:02

actually the underlying explanation of

00:15:04

the sound I'm hearing is both a

00:15:06

sequence of words and phonemes and the

00:15:09

identity of the speaker, where that person

00:15:11

is, whether he's sick or something

00:15:13

like that. All these are explanations for

00:15:15

what I'm hearing and I would like the

00:15:17

representation I'm getting to have all

00:15:19

of that but I would like those factors

00:15:22

to be separated out so that I can now

00:15:24

just plug a linear classifier on top.

00:15:27

And I can pick out the phonemes if

00:15:28

that's what I want or I can pick out

00:15:30

the speaker identity if that's what I

00:15:31

want right that's the difference

00:15:33

between invariance and disentangling. With

00:15:35

invariance we're trying to eliminate

00:15:39

from the signal from the features those

00:15:42

factors that we don't care about. In

00:15:44

disentangling we don't want to

00:15:45

eliminate anything; we just wanna

00:15:46

separate out the different pieces that

00:15:49

that are the underlying explanations. And

00:15:52

and if you're able to do that you're

00:15:54

essentially killing the curse of

00:15:55

dimensionality because now if if your

00:15:58

goal is to answer a specific

00:16:00

question about one of the factors, you

00:16:03

reduce the dimensionality from very

00:16:04

high to just those features that are

00:16:07

sensitive to that factor now the thing

00:16:12

that we don't completely understand is

00:16:13

that when we do some of these that

00:16:15

apply some of these unsupervised

00:16:16

learning algorithms, it looks like the

00:16:19

features we're getting are a bit more

00:16:23

disentangled than the original as we go

00:16:26

higher up. Um so something good is

00:16:29

happening. And and these these these

00:16:32

these are experiments that were done

00:16:34

you know, and published in two thousand

00:16:36

nine and two thousand eleven, and I

00:16:39

suspect there are other papers more

00:16:41

recently, where we do a kind

00:16:47

of analysis of the the features that

00:16:49

have been learned by unsupervised

00:16:51

learning algorithms like sparse

00:16:53

autoencoders, knowing some of the

00:16:57

factors right so you know I kind of

00:16:59

cheat and I know some of the underlying

00:17:01

factors now I can test whether some of

00:17:03

the features become specialised more

00:17:06

towards some factor and less

00:17:08

sensitive to other factors is something

00:17:10

we can measure and somehow it seems to

00:17:13

happen magically. So why would that

00:17:16

happen? So here's a kind of,

00:17:19

it's a sketch of a theory why

00:17:22

unsupervised learning can give rise to

00:17:27

the extraction of features that are

00:17:29

more disentangled than the original

00:17:31

data. And before I show you the

00:17:38

equation, let me show you this picture,

00:17:39

because pictures are so much better.

00:17:42

so imagine that this is the data you're

00:17:45

getting. You have a distribution which is

00:17:47

actually a mixture of three gaussians

00:17:50

You can't have simpler than that; well,

00:17:52

you could have a single Gaussian. But nobody

00:17:56

tells you, you know, which Gaussian

00:18:01

the particular sample you're

00:18:03

getting comes from. So you have unlabelled

00:18:04

data: you just have the X, and the Y's

00:18:07

would be the gaussian identity is it

00:18:08

the number one number two number three

00:18:10

but you only observe X right. So if you

00:18:14

only observe X, what would be a good

00:18:15

model of the data well the best

00:18:17

possible model of the data is the one

00:18:20

that actually spells out the density as

00:18:24

a mixture of three gaussians right this

00:18:25

is this is in terms of log likelihood

00:18:28

or whatever you wanna use. It's very

00:18:30

likely that the best model of the data is

00:18:32

the one that actually discovers that

00:18:34

there is a latent variable Y which can

00:18:36

take the three you know integer values

00:18:39

one, two or three. I mean, you can name them

00:18:40

A, B, C if you want, and you

00:18:42

can relabel them, but the point is we

00:18:44

have these three categories that are

00:18:47

sort of implicit in the data. When we

00:18:50

do clustering, we're exploiting the

00:18:53

fact that there are natural clusters,

00:18:57

and we use clustering algorithms to

00:18:59

discover these clusters and you can

00:19:01

think of these clusters as causes that

00:19:04

nobody told us about but we can

00:19:06

discover with a simple statistical

00:19:08

analysis just you know K means will

00:19:10

figure it out, right? So the

00:19:12

principle is that there are underlying

00:19:14

causes and the statistics of the data

00:19:17

can reveal them to us if we have a good

00:19:20

model of the data the better the model

00:19:21

we have the the better we are able to

00:19:24

figure out those underlying causes.
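
Editor's illustration: a minimal sketch of the clustering argument above, assuming a one-dimensional mixture of three well-separated Gaussians. The means, the spread, the sample size and the helper name `kmeans_1d` are assumptions made for the example; the talk only says that k-means would recover the clusters.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([-5.0, 0.0, 5.0])   # assumed component means
labels = rng.integers(0, 3, size=900)      # hidden cause Y, never shown to the learner
x = rng.normal(true_means[labels], 0.5)    # observed data X


def kmeans_1d(x, centers, n_iter=20):
    """Plain Lloyd iterations on 1-D data, starting from the given centers."""
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        assign = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        # Move every center to the mean of the points assigned to it.
        centers = np.array([x[assign == k].mean() for k in range(len(centers))])
    return centers, assign


# Deterministic initialisation at the minimum, median and maximum of the data.
init = np.array([x.min(), np.median(x), x.max()])
centers, assign = kmeans_1d(x, init)

# Fraction of points whose discovered cluster matches the hidden cause.
agreement = np.mean(assign == labels)
```

The learner only ever sees `x`; `labels` is used afterwards to check that the discovered clusters line up with the hidden causes.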

00:19:27

now why would that be useful for

00:19:29

supervised learning so that's where

00:19:30

this slide, and this equation,

00:19:32

becomes interesting. So let's think of

00:19:36

Y here as one of the factors that

00:19:40

explain X, all right? And so let's

00:19:46

say that at the end of the day we

00:19:47

actually want to classify and predict

00:19:49

Y given X. This is gonna work, yeah. So

00:19:56

we could just train a normal neural net

00:19:58

that predicts Y directly from X, or we

00:20:01

could train a generative model that

00:20:06

captures P of X, right? And as I

00:20:13

tried to argue previously, the best

00:20:15

possible generative model here is actually

00:20:17

one that's written as a sum over the

00:20:20

Y's and possibly over all the other

00:20:22

variables, let's call them H, where

00:20:25

given the causal factors we can

00:20:28

predict X. And the reason that

00:20:33

this is a better model than

00:20:35

this one is simply that this is how the

00:20:38

data was actually generated right so

00:20:40

the best model of the data is the one

00:20:41

that corresponds to the truth, that's how it's

00:20:44

generated. The one that gives the best

00:20:45

predictions is the one that corresponds to the

00:20:47

truth. So even if we

00:20:53

don't observe Y, okay, if we just

00:20:58

observe X, we can extract latent

00:21:03

variables. What we try to do is

00:21:06

model P of X as P of X given H

00:21:10

times P of H, for example. So we

00:21:13

introduce latent variables H, and in the

00:21:16

best possible model well within H

00:21:19

should be Y, because Y is one of the

00:21:21

factors that explains X and so if we

00:21:24

find good representations for P of X,

00:21:27

it's likely that these representations

00:21:30

will be useful to predict Y. Okay.
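
Editor's illustration: a minimal sketch of why a good generative model of X already contains a predictor of Y when Y is one of the causes of X. It assumes the same kind of three-Gaussian mixture as in the picture; the parameter values and the function name `posterior_y_given_x` are hypothetical.

```python
import numpy as np

# Assumed parameters of a generative model P(x) = sum_y P(x | y) P(y).
# In the setting of the talk these would be learned from unlabelled x alone.
means = np.array([-5.0, 0.0, 5.0])
sigma = 0.5
prior = np.array([1 / 3, 1 / 3, 1 / 3])


def posterior_y_given_x(x):
    """Bayes rule: P(y | x) is proportional to P(x | y) P(y)."""
    log_lik = -0.5 * ((x - means) / sigma) ** 2  # log P(x | y) up to a shared constant
    log_joint = log_lik + np.log(prior)
    log_joint -= log_joint.max()                 # subtract the max for numerical stability
    post = np.exp(log_joint)
    return post / post.sum()


post = posterior_y_given_x(4.2)          # a point close to the third component
post_mid = posterior_y_given_x(-2.5)     # a point halfway between the first two
```

No labelled pair (x, y) was needed to build this predictor; labels would only be needed to name the components.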

00:21:36

There is a nice paper at ICML two

00:21:41

thousand twelve by Janzing and

00:21:46

others from Bernhard Schölkopf's group at

00:21:50

the Max Planck Institute where they show

00:21:53

that there's a huge difference between

00:21:56

the situation where X is the cause of

00:21:59

Y and where Y is the cause of X, in terms

00:22:01

of the ability of semi-supervised

00:22:03

learning to work. In other words, if Y

00:22:09

is the cause of X, then we can do

00:22:14

semi-supervised learning, and

00:22:15

learning about P of X actually becomes

00:22:18

useful, even though at the

00:22:23

end of the day we only care about P of

00:22:25

Y given X. Whereas if the causal

00:22:28

direction was reversed then all the

00:22:31

semi-supervised learning would be

00:22:32

useless because in the case where it

00:22:35

was reversed, basically the

00:22:37

joint

00:22:38

distribution P of Y and X would

00:22:40

just be given by P of Y given X times P of

00:22:43

X, and so P of X would have nothing to do

00:22:45

structurally with P of Y given

00:22:47

X. Whereas if it's the other way

00:22:49

around. Um if the right causal model is

00:22:53

to go from Y to X, then when we want to

00:22:55

learn P of Y given X, well, there

00:22:58

is information about P of Y given X

00:23:00

inside P of X because P of X is

00:23:02

decomposed like this. So, yeah, they

00:23:06

push this argument much further but the

00:23:09

point is, there is a deep connection, there

00:23:13

is a deep connection between the

00:23:15

causality and the relation you know

00:23:17

which which is the cause of which and

00:23:19

the success of you know unsupervised

00:23:21

learning to help supervised learning. That's

00:23:23

the main message alright so I mentioned

00:23:28

that unsupervised learning is is

00:23:33

difficult and this shows up very

00:23:35

clearly when you tried to tackle

00:23:37

unsupervised learning using the arsenal

00:23:41

of mathematical and computational tools

00:23:44

from probability like graphical models

00:23:47

and models with latent variables.

00:23:50

So in principle introducing the latent

00:23:53

variables should help us, and it should

00:23:56

help us to even avoid the curse of

00:23:58

dimensionality. Um because because

00:24:02

we're modelling at the right level in

00:24:03

some sense. But the problem is that for

00:24:07

all of the approaches that

00:24:10

are really anchored in probability, in an

00:24:13

explicit probabilistic model, what we

00:24:15

find is that some of the computations

00:24:18

that are needed either for

00:24:19

learning or using the model are just

00:24:21

intractable. They involve, you know, running

00:24:24

integrals or sums over an exponential

00:24:27

number of things and so for example in

00:24:32

in typical directed models exact

00:24:36

inference, in other words predicting the

00:24:37

latent variables given the input, is

00:24:41

intractable even though you're going

00:24:42

you're able to go in the other

00:24:43

direction, predicting X given H,

00:24:45

because that's how the model is

00:24:47

parameterised going backwards which is

00:24:50

something we actually need to do both

00:24:51

for learning and potentially for using

00:24:53

the model, involves an

00:24:55

intractable sum. In other models, the

00:24:58

undirected models, there's

00:25:00

another issue with it potentially in

00:25:02

addition to this one which is that

00:25:04

these models involve a normalisation

00:25:06

constant, which is intractable, and

00:25:10

its gradient is too, right? In other

00:25:12

words the probability is expressed as

00:25:14

some expression divided by a

00:25:16

normalisation constant, which we

00:25:17

usually write as Z, and that Z

00:25:19

is something we can't compute easily. And

00:25:21

of course we also need to compute

00:25:23

the gradient of that Z. So it

00:25:25

looks like it's hopeless. So
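
Editor's illustration: a minimal sketch of why the normalisation constant Z is expensive. It assumes a Boltzmann-machine-style energy over n binary units, with hypothetical names `energy` and `partition_function`; computing Z exactly means summing over all 2**n configurations, which is only feasible for tiny n.

```python
import itertools
import numpy as np


def energy(x, W, b):
    """Energy of a binary configuration x (assumed Boltzmann-machine-style form)."""
    return -(x @ W @ x) / 2.0 - b @ x


def partition_function(W, b):
    """Exact Z by brute force: one term per configuration, 2**n terms in total."""
    n = len(b)
    total = 0.0
    for bits in itertools.product([0, 1], repeat=n):
        x = np.array(bits, dtype=float)
        total += np.exp(-energy(x, W, b))
    return total


# With all parameters at zero, every configuration has energy 0, so Z = 2**n.
n = 10
Z_flat = partition_function(np.zeros((n, n)), np.zeros(n))

# With biases only, the sum factorises, which gives an independent check.
b_small = np.array([0.5, -1.0, 2.0])
Z_bias = partition_function(np.zeros((3, 3)), b_small)
```

Each extra unit doubles the number of terms, so a model with a few hundred units is already far out of reach for this direct sum.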

00:25:29

this has, you know, motivated

00:25:33

a lot of new things some of which I

00:25:34

will tell you about but let me start

00:25:38

with the ancestors of the

00:25:43

deep generative models: the energy-based

00:25:46

models, the Boltzmann machines, basically

00:25:49

the category of undirected graphical

00:25:50

models. So with undirected graphical

00:25:52

models basically you're expressing the

00:25:54

probability function (X is the

00:25:57

random variable you're trying to model)

00:25:59

in terms of an energy. So this is just

00:26:02

a rewrite; there's not much of a

00:26:04

constraint by doing this, except that

00:26:05

we're saying that ah every

00:26:08

configuration gets a non zero

00:26:09

probability because energy you know

00:26:11

it's gonna be finite for any X and so

00:26:14

this means the probability is just greater than zero

00:26:16

for everything. But besides that, what it's

00:26:19

really saying is that instead of

00:26:21

parameterising the probability directly,

00:26:22

we're parameterising this guy, the

00:26:24

energy, and we're letting this Z

00:26:28

derive from it. So Z here is just the sum

00:26:30

over X, or the integral over X, of the

00:26:32

numerator. Okay, so if you have a

00:26:35

model of that type it turns out that

00:26:38

the log-likelihood gradient tells you to

00:26:40

update your parameters according to the

00:26:42

following very simple idea and

00:26:45

especially if you think about

00:26:47

stochastic gradient descent. So I'm given

00:26:48

an example X let's call it X plus and

00:26:52

this landscape that I'm showing here is

00:26:54

the energy landscape. So

00:26:55

remember, E to the minus energy is the

00:26:58

probability, so when energy is low, probability is

00:27:00

high, and there's an exponential

00:27:02

relationship. So yeah which is hard to

00:27:06

visualise here but ah when when this

00:27:09

goes up very much, then the probability

00:27:10

goes exponentially fast to zero. Alright,

00:27:13

so we're given an example X plus and

00:27:15

you have a current energy function, so

00:27:17

this is the curve the Y axis is energy

00:27:19

and what we wanna do with maximum likelihood

00:27:22

is we wanna make the probability of the

00:27:24

observed data high. That's what maximum

00:27:26

likelihood means. That means make the

00:27:28

energy of the observed configurations

00:27:31

low. So the ideal solution would be to

00:27:34

make every training example a peak, I

00:27:36

mean, not a peak, a trough, like a

00:27:37

minimum of the energy that would be the

00:27:41

ideal solution from the training point of

00:27:43

view; for generalisation it might not be. But

00:27:45

anyway what training consists in is

00:27:48

pushing down on the energy where the

00:27:50

examples are and pushing up everywhere

00:27:52

else. Because if I just push down on the

00:27:54

training example, on the energy for

00:27:56

the training example that may not be

00:27:59

good what I really want is you know the

00:28:02

relative energy to be small for

00:28:03

training examples. So here's an example where

00:28:05

the data points are these little

00:28:07

dots. And during training we're pushing

00:28:11

up everywhere else. And we're gonna get

00:28:13

a model that puts a low energy where

00:28:16

the data is this is a good model right

00:28:18

and this is not as good a model. So

00:28:24

yeah you can get that just by doing

00:28:27

three lines of algebra, but there is

00:28:29

something kind of intuitive about

00:28:32

what's going on here at the same time

00:28:33

as we're trying to push down at the

00:28:36

configuration given by the data push

00:28:38

down the energy we're trying to push up

00:28:40

everywhere everywhere else but not in

00:28:45

the same with the same strength

00:28:46

everywhere else we the equation we're

00:28:48

getting tells us we wanna push up

00:28:50

especially in places where the energy

00:28:52

is low, right? So all those places that

00:28:56

get a high probability basically should

00:28:57

be pushed up. We call these the

00:29:02

negative examples, and these the positive

00:29:04

examples. We're trying to make

00:29:05

positive examples more probable and

00:29:08

trying to make negative examples less

00:29:10

probable. And where do we get those

00:29:12

negative examples? Well, ideally these

00:29:14

negative examples come from the model

00:29:16

distribution itself, right? So once we

00:29:18

have an energy, we have a probability

00:29:20

function that corresponds to it by this

00:29:22

equation. And if we could sample from

00:29:24

this distribution we would get, you

00:29:26

know, many points here, a few here, a few

00:29:28

here. So we wanna push up where we get

00:29:31

those samples. That's what the

00:29:34

math tells us we should be doing to

00:29:35

maximise likelihood. This is what we see

00:29:39

in this equation: the derivative of

00:29:40

the log probability with respect to

00:29:42

the parameters, which are hidden inside the

00:29:44

energy function, has two terms, one which

00:29:48

we call the positive phase term and the

00:29:50

other called the negative phase term.
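
The equation on the slide is not captured in the transcript. A standard form that is consistent with the spoken description, assuming P(x) = e^{-E(x)} / Z and writing the parameters as \theta (a symbol chosen here for illustration), would be:

```latex
\frac{\partial \log P(x)}{\partial \theta}
  = \underbrace{-\frac{\partial E(x)}{\partial \theta}}_{\text{positive phase}}
  + \underbrace{\sum_{\tilde{x}} P(\tilde{x}) \,
      \frac{\partial E(\tilde{x})}{\partial \theta}}_{\text{negative phase}}
```

The second term comes from differentiating -\log Z, where Z = \sum_{\tilde{x}} e^{-E(\tilde{x})}.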

00:29:52

And this one is saying, you know, change

00:29:54

the parameters so that the energy at X

00:29:57

becomes lower. Because we wanna maximise

00:30:00

this and we have a minus here, we

00:30:01

minimise the energy at this X. And

00:30:06

now you also have this term, which just

00:30:09

pushes up, so there's no negative here.
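
The push-down and push-up behaviour can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the method from the talk: the energy is a lookup table with one free parameter per state, the negative phase is computed as an exact sum over four states rather than by sampling from the model, and `data_freq`, `steps` and `lr` are hypothetical values.

```python
import math

def model_probs(energy):
    """p(x) = exp(-E(x)) / Z over a small finite set of states."""
    lowest = min(energy)
    weights = [math.exp(-(e - lowest)) for e in energy]
    z = sum(weights)
    return [w / z for w in weights]

def train(data_freq, steps=2000, lr=0.5):
    """Gradient ascent on the average log-likelihood of a tabular energy."""
    energy = [0.0] * len(data_freq)  # one free parameter per state
    for _ in range(steps):
        p = model_probs(energy)
        for k in range(len(energy)):
            # Positive phase: lower the energy where the data is (-data_freq[k]).
            # Negative phase: raise the energy in proportion to the model
            # probability (+p[k]), so low-energy states get pushed up hardest.
            energy[k] += lr * (p[k] - data_freq[k])
    return energy

# Hypothetical empirical frequencies of four states in the training data.
data_freq = [0.6, 0.25, 0.1, 0.05]
energy = train(data_freq)
probs = model_probs(energy)
```

In this sketch the two pushes cancel exactly when the model distribution matches the data frequencies, which is where training stops moving the energies.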

00:30:11

This one wants to go up, and it does

00:30:13

that everywhere, so a sum over all X tilde,

00:30:18

but weighted by P of X tilde. So those

00:30:20

places where the model thinks that, you

00:30:24

know, they have a high probability, we

00:30:26

want to reduce their probability, we

00:30:28

want to increase their energy. This is

00:30:30

the case here in the second line where

00:30:33

the model involves not just X but

00:30:36

also some latent variable H, so now the

00:30:38

energy function is defined in terms of

00:30:39

both X and H. And you could marginalise,

00:30:43

so sum over all the values of H, and

00:30:45

get another equation which looks like

00:30:47

the one we had before. And we call this

00:30:49

modified energy, or marginalised energy

00:30:52

should be the right term, but physicists

00:30:53

call it free energy and that is a

00:30:56

Conference Program

59:34

Deep Supervised Learning of Representations
Yoshua Bengio, University of Montreal, Canada
July 4, 2016 · 2:01 p.m.

2368 views

55:38

Hardware & software update from NVIDIA, Enabling Deep Learning
Alison B Lowndes, NVIDIA
July 4, 2016 · 3:20 p.m.

427 views

01:01:02

Day 1 - Questions and Answers
Panel
July 4, 2016 · 4:16 p.m.

330 views

55:14

Torch 1
Soumith Chintala, Facebook
July 5, 2016 · 10:02 a.m.

815 views

55:57

Torch 2
Soumith Chintala, Facebook
July 5, 2016 · 11:21 a.m.

342 views

01:08:04

Deep Generative Models
Yoshua Bengio, University of Montreal, Canada
July 5, 2016 · 1:59 p.m.

2156 views

49:29

Torch 3
Soumith Chintala, Facebook
July 5, 2016 · 3:28 p.m.

275 views

52:43

Day 2 - Questions and Answers
Panel
July 5, 2016 · 4:21 p.m.

151 views

45:40

TensorFlow 1
Mihaela Rosca, Google
July 6, 2016 · 10 a.m.

2659 views

52:33

TensorFlow 2
Mihaela Rosca, Google
July 6, 2016 · 11:19 a.m.

1704 views

01:05:51

AMD's Open Compute and Open Source cross platform solutions for Machine Learning
Mauricio Breternitz, AMD
July 6, 2016 · 1:59 p.m.

1406 views

01:04:41

TensorFlow 3 and Day 3 Questions and Answers session
Mihaela Rosca, Google
July 6, 2016 · 3:21 p.m.

2250 views

Recommended talks

22:38

An Intrinsic Geometry of Manifold Learning Theory & Related Algorithms
Dr. Ke Sun, Uni Geneva
Oct. 17, 2013 · 11:21 a.m.

103 views

20:13

Abstract Reasoning
Marco Valentino, Idiap Research Institute
March 10, 2023 · 10:38 a.m.

120 views
