
Transcriptions

Note: this content has been automatically generated.
00:00:00
We'll start with a short word of welcome, and then we will have the first talk by Yoshua. — Okay, well, I will have to be very, very short; I promise, for once, as you were saying. Well, welcome to Switzerland, and to the sunny Valais in particular, the best place, and one of the less polluted ones. I hope that besides this great workshop you will have time to enjoy the region; it is really worth spending time here, hiking around or whatever you like. Anyway, I would like to thank first the organisers of this great workshop, and all the people who kindly accepted to come; it is a real pleasure for us.
00:00:54
this workshop there was two reasons to
00:00:55
have this workshop this year the first
00:00:58
one is that is one of the twenty fifth
00:01:01
anniversary events that lead up is
00:01:02
organising this year one among a few
00:01:06
orders many others that will have
00:01:08
basically every month and the second
00:01:11
one is that basically there is as we
00:01:13
all know and that's why we are all here
00:01:15
there is a big a revival whatever we
00:01:18
want to call it. Well or progress in
00:01:23
the mission on the ink and deep
00:01:26
learning and no network as opposed to
00:01:29
just superficial learning that people
00:01:31
were doing in the past. And obviously
00:01:34
it yeah but is built into a pretty
00:01:37
large institute of hundred twenty
00:01:39
people fifty or in the twenty startups
00:01:42
and many more to come in to the feud to
00:01:45
your work years around the around to
00:01:48
advance signal processing and machine
00:01:50
learning I dress to menu problems is
00:01:54
Dave a the speech processing computer
00:01:56
vision biometrics by you imaging
00:02:00
computer vision human behaviour
00:02:03
understanding and so on. So everything
00:02:06
we are doing sounds very fancy and very complicated, but what we like about it is that we are all sharing the same tools, which are basically tools coming from signal processing and machine learning. So any progress in this area, regarding software or regarding hardware, matters to all of us, and that is something that is unique to this workshop, I believe. I think it is one of the first times that I know of where we have people coming from the hardware side, who do not always agree with each other either, because we are talking about CPUs, GPUs and many other architectures, together with people from the software side. So this will be a great place to exchange ideas about the future and how we can help the community at large in areas like the ones I just mentioned. So again, thank you a lot
00:02:59
to the organisers who put this workshop together, and I wish you all a very good few days. Thank you. — Okay, so as was just said,
00:03:18
as you probably know, because you are here, there is a strong revival of machine learning and neural networks going on. I think we all agree that it is due to a mix of progress both on the theory side and on the engineering side, especially in software frameworks and hardware. So I think it is nice to have such fantastic speakers, because they span the range of interests of this workshop, from theory to hardware. I also think it is a nice event for the lab, because it fits well with what we do here, which is to sit at the interface between machine learning theory, engineering, and transfer to industry. So it is with great pleasure that we welcome our first speaker, Yoshua Bengio, who, as you obviously know, is at the heart of the machine learning and deep learning revival, who wrote two books on the subject, and whose work has been very impressive. Thanks. — Can you guys hear me well?
00:04:40
Yes? Good. So today I'll talk more about supervised learning, and I'm just going to scratch the surface; one hour is really not enough to do justice to this field. Tomorrow I'll talk more about unsupervised learning. So, as you know, cars are starting to drive themselves, we're starting to talk to our phones and they're starting to say something back, and computers are now able to beat the world champion at the game of Go, which was not expected to be something computers would be able to do for decades. And all this is
00:05:21
essentially because of the progress in machine learning, in an area called deep learning, which is essentially a renewal of neural nets. I think it's a lot more than these little things I mentioned: it's a whole new economic revolution that is coming, with progress in AI currently spearheaded by these techniques. And that's why so many companies are jumping into this; this slide is from about two years ago, and the field is much bigger and much more crowded now. So let me tell you about
00:05:54
deep learning, what it is. The general idea is that we want to learn from data, so these are machine learning algorithms. And what is particular is that we're going to learn representations of the data, and multiple levels of representation of the data. That's really what deep learning is about. And why would that be interesting? Because these multiple levels of representation are supposed to, and effectively seem to, capture different degrees of abstraction: as you go deeper, you tend to be able to capture more abstract concepts. And this has worked out really well. It started with speech recognition, object recognition and detection, and more recently there has been a lot of progress in natural language understanding and processing, machine translation and things like that. So this ability of
00:07:00
training neural nets that are deep, that have more than a couple of hidden layers, is something that really happened around 2006, thanks to funding from CIFAR, which is a Canadian organisation that makes pretty long bets on ambitious research projects; it included Geoff Hinton in Toronto, myself in Montreal and Yann LeCun in New York. That first breakthrough essentially allowed us to use unsupervised learning algorithms, which already existed, to bootstrap the training of a deep supervised neural net. Then another
00:07:39
really important advance, that I think many people are not aware of, is something that happened in 2011, when we found that we actually didn't need this unsupervised pre-training trick: if we just replaced the nonlinearity that people tended to use, the hyperbolic tangent or the sigmoid, with the rectifier, we were able to train very deep supervised networks. And just the year after that, our colleagues from Toronto used this trick, along with other new tricks like dropout, to achieve a really big breakthrough in performance on object recognition; I'll tell you more about that.
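Why the rectifier helps can be seen directly in the derivatives of the activation functions. A minimal Python sketch (an editorial illustration, not code from the talk):

```python
import math

def d_sigmoid(x):
    # Derivative of the sigmoid 1 / (1 + exp(-x)): s * (1 - s).
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def d_tanh(x):
    # Derivative of tanh: 1 - tanh(x)**2.
    return 1.0 - math.tanh(x) ** 2

def d_relu(x):
    # Derivative of the rectifier max(0, x): 1 for positive inputs, else 0.
    return 1.0 if x > 0 else 0.0

# For large inputs the saturating nonlinearities pass almost no gradient,
# while the rectifier still passes gradient 1 on its positive side.
grads_at_10 = (d_sigmoid(10.0), d_tanh(10.0), d_relu(10.0))
```

Multiplying many near-zero factors across layers is what makes deep sigmoid or tanh networks hard to train with plain backprop; the rectifier avoids that on its active side.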
00:08:22
So first of all, let me step back a little bit: what is AI about? AI is about building machines that can take good decisions. And for a computer, or even a human or an animal, to take good decisions, that entity needs knowledge; you get intelligence by having knowledge. This is of course well known, and for decades what happened in AI research is that we tried to give that knowledge to computers explicitly, by giving the computer rules and facts encoded in programs. But it failed, and it failed because a lot of the very important knowledge that we humans use to understand the world around us isn't something we can communicate in language or in programs; it's things we know but can't explain. This is essentially intuition. And that's where machine
00:09:20
learning comes in: we need to get that knowledge into computers, but we can't tell them exactly what that knowledge is, like how to recognise a face or a chair; instead we can show examples to the computer, and that's how computers have been able to learn that kind of intuitive knowledge. A good example of this is the game of Go I was telling you about before: you can ask an expert player why they made a particular move, and they will invent a story, but really the story is very incomplete, and students wouldn't be able to just use that story in order to play anywhere near as well as the master. So the expert player has this intuitive knowledge about what the right thing to do is, but can't really explain it.
00:10:05
However, what we can do is take games played by these high-level experts and bootstrap a neural net that learns to capture these intuitions implicitly; that is the power that machine learning draws from data. So another really important thing to understand about why machine learning is working so well these days is that it relies on optimisation: on defining what it is that you want the machine to learn as a function, like an error function, that we can just optimise. And the way we optimise it is actually incredibly simple compared to the very sophisticated things that have been done in optimisation: we just make very small changes, one example at a time, or a small batch of examples, what we call a minibatch, but that's a technical detail. We show one example at a time, and then we look at the error that the computer is making on that example, like "you were supposed to say car". And we make a very small change to the parameters inside the box that define the mapping from input to output, so that the net produces something slightly better. I'll tell you a lot more about backprop later, but that's the idea: compute what small change we can make to the neural net parameters so that the error will be slightly smaller next time, repeat that hundreds of millions of times, and the thing recognises cars, faces and desks and so on.
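The loop just described, show an example, nudge the parameters downhill, repeat, can be sketched in a few lines. This is an editorial toy (one parameter, invented data), not code from the talk:

```python
import random

# Stochastic gradient descent on a one-parameter model y_hat = w * x with
# squared error; one example at a time, a very small change each time.
random.seed(0)
data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]  # true parameter is 3.0

w = 0.0                            # initial guess
lr = 0.05                          # step size
for step in range(2000):
    x, y = random.choice(data)     # show one example at a time
    err = w * x - y                # error the model makes on this example
    grad = 2.0 * err * x           # d(err**2)/dw
    w -= lr * grad                 # the "very small change" to the parameter
```

After enough repetitions `w` approaches the value that makes the error small; scaled up to millions of parameters and examples, this is the whole training principle.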
00:11:39
And one of the first areas where this breakthrough of using deep nets really made a difference is speech recognition. It started around 2010, and what we see in the graph is what happened over the years on a particular benchmark; this is a cartoon, the real picture of course has lots of ups and downs. What we see is that in the nineties things were progressing quite well using HMMs, which were the standard of the day, and in the 2000s, somehow, even though we had more data and faster computers, performance didn't improve that much, until these deep neural nets started being used. That was a big drop in error rates, and over a matter of a few years the whole area of speech recognition turned to using these things. All the industrial systems now are based on these deep neural nets. And then,
00:12:35
lagging by a couple of years, something similar happened in computer vision. It started with object recognition: given an image, which object is present? Is there a dog, is there a chair, is there a person? And the task that really started this going is ImageNet, where you have a thousand categories, you're given an image, and you're supposed to say which of the categories is present. And in the last few years, from 2012 to 2015, not only did performance improve very fast thanks to these deep convolutional nets, but we essentially reached human-level performance on this task. To be fair, this is true for sort of nice images, and humans are still better when the recognition is harder, but the progress has been really amazing, and it is now almost more of an industrial concern to get these into products. Okay, I'm going to actually now show a video from my former colleagues, who started a company a few years ago and were recruited by NVIDIA, and who used these convolutional nets to train it. [Video plays.]
00:15:16
Right, so this and other things that deep learning is bringing are going to really change our world. Okay, but now, for the rest of my presentation, I'm going to go a bit more into the technical part. And I'm going to start by telling you about the workhorse of the progress we've had in the last years, which is just good old backprop, from the eighties or the late seventies depending on how you want to look at it. It is based on very, very simple ideas that are really important to understand, in order to even debug the things that you'll be playing with in the next few days. So remember, I said that we
00:15:59
want to compute the small changes that need to be made to those neural net parameters so that it performs slightly better next time. This just happens to be a gradient: the partial derivative of the error, the loss function we optimise, with respect to the parameters. So how are we going to compute these partial derivatives through this very complicated machine, this deep neural net? We're going to use the chain rule, which tells us how to compute derivatives through composition. So if x influences y through a function g, and y influences z through a function f, so that z is f composed with g of x, we can get the derivative of the final answer with respect to the input by just multiplying the two partial derivatives along the way: dz/dx = dz/dy * dy/dx. In our case, the x that we care about is going to be some parameter, and the z that we care about is going to be the error we're making on an example, the loss that we want to minimise.
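The single-path chain rule stated here can be checked numerically in a couple of lines (an editorial sketch with an arbitrary choice of f and g):

```python
import math

# z = f(g(x)) with g(x) = x**2 and f(y) = sin(y), so by the chain rule
# dz/dx = dz/dy * dy/dx = cos(x**2) * 2*x.
def dz_dx(x):
    y = x ** 2
    dy_dx = 2.0 * x
    dz_dy = math.cos(y)
    return dz_dy * dy_dx

# Compare with a centred finite difference at x0.
x0, eps = 0.7, 1e-6
numeric = (math.sin((x0 + eps) ** 2) - math.sin((x0 - eps) ** 2)) / (2 * eps)
```

The analytic and numerical values agree to high precision, which is also a handy way to debug hand-written gradients.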
00:17:00
Now, the thing that is great about backprop is that if the amount of computation you need to compute the loss, as a function of the example and the parameters, is on the order of N, say depending on the number of parameters, then computing the gradient, the derivative of the loss with respect to the parameters, is also on the order of N. So if you can compute something efficiently, you can also compute its gradient efficiently. So we start with
00:17:27
this x, which is a variable like a parameter, and we apply transformations; this is just a graphical view of what I told you, with these partial derivatives along the way. And calculus tells us that if I make a small change delta-x, it's going to become a small change delta-y, obtained by taking delta-x and multiplying it by the partial derivative dy/dx; and in the same way, delta-y is going to transform into delta-z by multiplying delta-y by dz/dy. So then if I take the delta-y and plug it into the other equation, I get that a small change delta-x becomes a delta-z obtained by multiplying by dz/dy times dy/dx. This is basically what happens; this is the chain rule, this is how it comes up. And that
00:18:10
was the simple scenario where x goes directly to z, but maybe there are different paths: x influences y1, and it also influences y2, for example; these might be two neurons in a layer, z is your loss, and x is some parameter. Now it turns out that the chain rule just changes a little bit: we add up the products along each of the paths. So we have these partial derivatives along the paths, and we do this one times this one, plus this one times this one. And of course we can generalise this to n paths, and we get this equation, which says essentially that for a node x, we look at the partial derivatives of the loss with respect to its successors, here y1 through yn, and multiply each by the partial derivative along the path, dy_i/dx, for each of these y_i.
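The multi-path version, summing over paths the products of partials, can be verified on a tiny example (editorial, with hypothetical intermediate functions):

```python
# x feeds two intermediates, y1 = 2*x and y2 = x**2, and z = y1 * y2.
# The multi-path chain rule sums dz/dy_i * dy_i/dx over both paths.
def dz_dx(x):
    y1, y2 = 2.0 * x, x ** 2
    dz_dy1, dz_dy2 = y2, y1        # partials of z = y1 * y2
    dy1_dx, dy2_dx = 2.0, 2.0 * x  # partials of each path out of x
    return dz_dy1 * dy1_dx + dz_dy2 * dy2_dx

# Since z = 2 * x**3 overall, the direct derivative is 6 * x**2.
```

Both routes give the same number, which is the point: per-path products, summed.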
00:19:07
Okay, so that's a very simple formula, and that's what you have at the heart of things like Torch or TensorFlow. And of course you can generalise this to an arbitrary graph of computation. What these packages do is create a data structure which represents a graph of computation, or flow graph, where each node corresponds to a result; usually those nodes won't be scalars but tensors, like matrices, vectors, or higher-order objects, but the principle is the same. Once we have that graph, we can either compute it forward, or we can compute derivatives in a recursive way, by saying: the derivative of the final loss with respect to some node x in the middle can be obtained recursively by looking at the already computed partial derivatives dz/dy_i, for each of the successors y_i of x in the graph, times the partial derivative along the arc, dy_i/dx: how this node influences the next node, for each of the next nodes, and how each of those influences the loss, which we have already computed recursively, because it has the same form, dz/d-something, where "something" is any of the nodes. Of course, to make that work we have to do it in the proper order: we first need to compute the derivatives with respect to the y's before we compute the derivatives with respect to x.
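The recursion just described, accumulate into each node the sum over its successors of the successor's gradient times the local partial, visiting nodes in reverse topological order, fits in a few dozen lines. A minimal editorial sketch of reverse-mode automatic differentiation on scalars (real packages do the same over tensors):

```python
# Each node stores its value, its parents, and the local partial derivative
# with respect to each parent. backward() then applies the graph chain rule.
class Node:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = list(parents)  # list of (parent_node, local_partial)
        self.grad = 0.0

def mul(a, b):
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])

def add(a, b):
    return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])

def backward(loss):
    loss.grad = 1.0
    order, seen = [], set()
    def topo(n):                      # depth-first topological ordering
        if id(n) in seen:
            return
        seen.add(id(n))
        for p, _ in n.parents:
            topo(p)
        order.append(n)
    topo(loss)
    for n in reversed(order):         # successors before predecessors
        for p, local in n.parents:
            p.grad += n.grad * local  # sum over arcs into each parent

# z = x*y + x, so dz/dx = y + 1 and dz/dy = x.
x, y = Node(3.0), Node(4.0)
z = add(mul(x, y), x)
backward(z)
```

Note how `x.grad` receives contributions from two paths (through the product and directly), exactly the multi-path formula.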
00:20:33
Okay, so that's essentially backprop. You can apply it to a multilayer network; for example, here is a simple architecture where we might output a vector of probabilities over categories, a typical thing we do, and our loss might be the so-called negative log-likelihood, which is minus the log of the probability given to the correct class. So one of these outputs corresponds to the correct class, and we just want it to be as high as possible, so we take minus the log of it and we minimise.
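The negative log-likelihood loss described here is essentially one line; a small editorial sketch:

```python
import math

# Negative log-likelihood: given a vector of output probabilities and the
# integer index of the correct class, take minus the log of that entry.
def nll(probs, target):
    return -math.log(probs[target])

probs = [0.1, 0.7, 0.2]
loss_confident = nll(probs, 1)  # correct class got 0.7 -> small loss
loss_wrong = nll(probs, 0)      # correct class got only 0.1 -> large loss
```

Pushing the probability of the correct class towards 1 drives this loss towards 0, which is exactly the "make this output as high as possible" objective.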
00:21:03
And of course that loss depends both on the outputs and on the correct answer y, because we use this correct answer, which is an integer here, to tell us which of the output probabilities we want to maximise. Once we have computed that loss, we can go backwards using the same principles I showed here. So now we can compute the derivative with respect to the output units, and then, applying this recursively, get the derivatives with respect to the previous layer, as well as with respect to the weights that go into that layer; and similarly we can go back one more step and get the derivative with respect to those weights. Right, so if I go again... yes, that's what we just explained. Alright, I'll make
00:22:05
my slides available so you can look at this more carefully. And once you understand that, you can apply this to a graph of any structure. You can of course generalise it to graphs that are dynamically constructed, like in the recurrent network: instead of having a fixed-size graph, the graph actually has the form of a chain like this, and depending on the number of inputs, the x's, the graph will be longer, to accommodate reading all of these inputs and computing some internal state, corresponding to neurons, that summarises everything that has been seen before, in a way that captures what's needed for whatever computation follows. You could also generalise to graphs that are trees. But again, what is particular with these recurrent and recursive architectures is that instead of having a fixed-structure graph, the graph is dynamically constructed depending on the particular data you have, like the length of the sequence or the tree that's built on top of a sentence. For this to make sense, what you need is that the same parameters are reused: you don't have a separate set of weights for each time step, you have the same weights used at different time steps, and so if I have a longer sequence I can just extend the graph; it's going to be the same parameters everywhere. And so we can generalise to different lengths, or to different trees in the case of recursive networks.
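The weight-sharing point can be made concrete with a toy recurrence (editorial sketch, arbitrary weights):

```python
import math

# An unrolled recurrent net: the SAME two weights are applied at every time
# step, so one function handles sequences of any length.
def rnn_state(inputs, w_in=0.5, w_rec=0.8):
    h = 0.0
    for x in inputs:                  # the graph grows with the sequence
        h = math.tanh(w_in * x + w_rec * h)
    return h                          # state summarising everything read

short = rnn_state([1.0, 2.0])
longer = rnn_state([1.0, 2.0, 3.0, 4.0])  # same parameters, longer chain
```

A longer input just extends the chain; no new parameters appear, which is what lets one trained model handle sequences of any length.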
00:23:32
Alright, so this was a very, very brief intro to backprop; you can read a lot more in my book, which is available online. And now I'm going to move to a little bit of the "why does it work", and this is going to be very brief and high level; you'll probably want to delve deeper by yourself to make more sense of it, but I'm going to try to give you the basic concepts. So the previous part was about the "how", in a very focused way; now let's try to see why it is working, what's new with these deep networks. So, in the
00:24:11
early days of AI, what people were trying to do was build a system that goes directly from input to output through hand-designed programs. All the knowledge, as I said at the beginning, was put into the machine directly from the brains of the experts, or from the brains of the programmer, into a program; maybe the program had an explicit set of facts and rules, but that's how it was done, and there was no learning. Then a lot of the work in classical machine learning in the eighties and nineties was based on introducing some learning, in particular starting from a lot of hand-designed features that were crafted based on knowledge of what the input is supposed to be and what kind of invariances we're looking for, and then transforming those features through a learned mapping, often just a linear mapping or a kernel machine, to produce the output we want. What happens
00:25:08
with neural nets is that we look inside this box and think of it as a composition of multiple transformations, and once you start thinking about this, you have something in the middle here, between two sets of transformations. You can call the first transformation "extracting features", but now the features are going to be learned. And the second thing here might be a linear classifier again. But the thing in the middle is something new: it's a representation that the computer has learned. So the really important concept in deep learning is that we're learning representations, and not just any kind of representations, as I'll try to explain a bit later. That's the really crucial thing, and then deep learning is just taking this idea of learning representations and saying: okay, we're going to learn multiple levels of representations. Here I just have two levels of representation, this one and this one. The output is also a representation, but its meaning is fixed; it depends on the semantics of what we're trying to predict. And of course the input representation also has a fixed meaning. Whereas the meaning of these intermediate representations is something discovered by the computer; it's something the computer makes up to do the job. So for
00:26:16
example, around 2009 our friends at Stanford looked into a deep neural net, trying to figure out what kinds of representations it was learning. This neural net was trained on images of faces, and they found that at the lowest level, the first layer, the units extracted edge detectors. This is not a new observation; it is something that has been shown with a lot of machine learning models, and it's a natural set of features for images: these edge detectors like oriented contrasts of a particular size and position, and so on. But the network kind of discovered all this by itself. More interesting is what you find at the second level: each little square in these pictures represents the kind of input that a particular unit in that neural net prefers. And you see that it has things like parts of the face, like eyes and noses, or other features that you can think of as compositions of some of the first-level detectors. So the units at one level compose together: a unit computes a nonlinear function of the lower-level outputs, so it takes the representations at one level and computes a new kind of representation, and here we can see that it seems to be discovering detectors for parts that can be combined to form a higher-level part or full objects, like these faces. So why is this idea
00:27:50
working? There's actually no free lunch anywhere in machine learning, and not in deep learning either. The reason why deep learning is working is that somehow this idea of composing pieces together, composing functions at multiple levels, is natural; it is something that is out there, that fits well with how the world is organised. So if you're trying to do image recognition, well, there's a kind of natural hierarchy of concepts, starting from pixels, to edges, to little texture motifs, and then parts and objects. If you're modelling text, then characters combine into words, words combine into word groups or phrases that go into clauses and sentences and stories; we don't really know what the right concepts should be higher up, but we can imagine that there are some high-level abstractions that make sense for the particular domain. In speech, of course, we go from the acoustic samples to some spectral features, to sounds, phones and phonemes, and words, and higher up, language models. So
00:28:55
essentially, all of these deep networks are obtained by taking the raw data and transforming it through feature extraction at different levels, each more abstract than the lower ones. Now, this is similar to what I showed you before, but this is for colour images, and so you see not just these edges but also some sort of lower-frequency edges where colours become important. And higher up you now see detectors that capture funny shapes that are made by composing those edges, and even higher up, these funny shapes actually start looking like parts of objects, and so on; maybe this is like the face of a bird or something. Now, when you really think
00:29:50
a lot about machine learning based on
00:29:52
the presentations you can start playing
00:29:54
all kinds of really interesting games
00:29:57
and I'll only give you a glimpse here with
00:29:59
this very old work from my brother
00:30:02
Samy and his collaborators at
00:30:05
Google, and Jason Weston, who I
00:30:08
think was not at Google, but they worked
00:30:10
together on the idea of learning
00:30:12
representations for both text, more
00:30:16
precisely short text queries like
00:30:18
Eiffel Tower, and for images, and making
00:30:22
those representations live in the same
00:30:23
space and so what's going on is that
00:30:27
there's gonna be a large
00:30:29
transformation. So you can think of a
00:30:31
neural net that takes the image and
00:30:34
outputs a hundred numbers, right. So
00:30:36
here for visualisation it's 2D, but really
00:30:38
initially these were a hundred
00:30:40
dimensional vector spaces today
00:30:41
actually they're like two thousand but
00:30:43
it's the same idea. Um and and so you
00:30:48
have one function that maps images to
00:30:50
that space and you have another
00:30:52
function that you also learn that maps
00:30:54
these queries, which you can think of as
00:30:56
symbols, into the same space.
00:30:59
Actually the mapping between
00:31:01
those symbols and a vector is
00:31:03
just like table lookup right so you
00:31:04
have a table for each of I don't know a
00:31:06
million queries that are most frequent
00:31:09
the table would tell you what is the
00:31:11
hundred-dimensional vector corresponding
00:31:12
to dolphin or to Eiffel Tower,
00:31:15
and nowadays they have more complicated
00:31:17
mappings which maybe include recurrent
00:31:18
nets and so on and can deal with a much
00:31:20
larger vocabulary and and things like
00:31:22
that but the idea is we learned these
00:31:25
two mappings so that the representation
00:31:30
for say the word dolphin is gonna be
00:31:32
close to the representation for an
00:31:34
image of a dolphin. And why is that
00:31:36
useful well you can imagine if your
00:31:38
search engine. And you want to
00:31:40
associate things that people type
00:31:43
queries and images one way or the other
00:31:46
this is very useful right. So you want
00:31:48
images here to map to a representation
00:31:50
that is kind of semantic, that has to do with
00:31:52
what this means. We don't care here so
00:31:54
much about the details of the you know
00:31:57
the light on the water but we care
00:31:59
about what's in there: it's, you know,
00:32:00
a dolphin, and it's in water that looks
00:32:03
like a pool. So these kinds of
00:32:05
information that you would like them to
00:32:06
be encoded in this more abstract
00:32:09
representation much more abstract than
00:32:10
the pixels in such a way that you can
00:32:12
also recover the concepts that go with
00:32:15
them, because you end up operating
00:32:15
in that space, right. So let me tell you a
00:32:25
little bit, at a high
00:32:27
level what I think are key ingredients
00:32:31
for machine learning to approach AI,
00:32:33
and number one is pretty obvious but I
00:32:38
think for many decades we may have
00:32:40
ignored it too much which is we need
00:32:42
lots and lots of data. But let's
00:32:45
try to step back: why do we need
00:32:47
that much data? We need that much
00:32:49
data, and people complain that neural
00:32:51
nets need huge amounts of data. Well, if we want a
00:32:55
machine to be intelligent to understand
00:32:57
the world around us. It's gonna need a
00:33:00
lot of data to get that
00:33:02
knowledge because it's mostly learning
00:33:04
all that information about the world
00:33:05
around us by
00:33:08
observing. And the world around us
00:33:11
is complicated. So you know to learn
00:33:14
some complex representation of what's
00:33:16
going on out there, we need lots and lots
00:33:18
of data and I think that right now the
00:33:20
amount of data we're using is still way
00:33:22
too small compared to what will be
00:33:24
needed to reach human-level AI. So
00:33:26
that's really ingredient number one, but
00:33:28
of course it's not enough to have all that
00:33:29
data if we can't build models that can
00:33:32
capture it, and so for that we need very
00:33:34
flexible models. We can't have,
00:33:38
you know, models like linear models
00:33:40
that don't have enough capacity. We need
00:33:42
the number of parameters to grow a you
00:33:45
know in proportion to the amount of
00:33:46
data so those those models have to be
00:33:48
big. They have to be big because they
00:33:50
basically store the data in a different
00:33:53
form which allows the computers to take
00:33:55
good decisions alright. Now if we have
00:33:58
these big models with a lot of
00:33:59
parameters of course we have to train
00:34:02
them and we have to run them and for
00:34:04
that we need enough computing power.
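To make that concrete, here is a rough back-of-the-envelope sketch; the layer sizes are invented for illustration, not taken from the talk:

```python
# Rough sketch: counting the parameters of fully connected networks.
# Layer sizes below are invented for illustration only.

def mlp_param_count(layer_sizes):
    """Weights plus biases for each pair of adjacent layers."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix + bias vector
    return total

small = mlp_param_count([784, 100, 10])         # a tiny early-1990s-scale net
big = mlp_param_count([784, 4096, 4096, 1000])  # a 2016-scale net

print(small)  # 79510
print(big)    # 24093672
```

Every one of those parameters has to be touched on every training example, which is why bigger models and bigger datasets immediately translate into a need for more computing power.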
00:34:06
And you know, one of the reasons why
00:34:09
neural nets weren't so successful before
00:34:11
is because we didn't train them on enough
00:34:12
data, we didn't have big enough models
00:34:14
and we didn't have enough computing
00:34:15
power to train and use them. Now, just
00:34:19
having the first two ingredients is not
00:34:20
enough either because you could have
00:34:24
something like maybe an efficiently
00:34:26
implemented kernel machine, and it
00:34:28
would be able in principle to deal with
00:34:29
all three of these ingredients. However
00:34:32
there's something else that is
00:34:33
important. Uh I don't know if you heard
00:34:36
about the curse of dimensionality
00:34:37
We're trying to learn
00:34:39
from very high-dimensional data, and
00:34:42
in principle, you know, if we don't
00:34:44
make any assumptions about the data.
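To put a number on this (the dimensions are an arbitrary illustration): even for binary inputs, the number of possible configurations dwarfs any conceivable dataset:

```python
# Number of possible configurations of a binary input of dimension d.
d = 784                  # e.g. a 28x28 black-and-white image (arbitrary)
n_configs = 2 ** d       # roughly 10**236 configurations
dataset = 10 ** 9        # a generously large dataset

# No dataset comes close to covering the input space.
print(n_configs > dataset ** 20)  # True
```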
00:34:46
It's essentially impossible to learn
00:34:48
from that data; there are just too many
00:34:49
configurations that are possible. And
00:34:52
the only way around that is to make
00:34:55
sufficiently powerful assumptions about
00:34:57
the data. There's something called
00:35:00
the no-free-lunch theorem, and it says
00:35:00
you do need to make these kinds of
00:35:02
assumptions otherwise you won't be able
00:35:04
to learn something really complex.
00:35:08
Okay, so let me tell you about those
00:35:10
assumptions that are being made
00:35:11
specifically in deep learning. And they
00:35:14
have to do with the curse of
00:35:15
dimensionality this exponential number
00:35:17
of configurations that we have to deal
00:35:19
with we can't learn a separate
00:35:21
parameter for each configuration the
00:35:22
input because the number of such
00:35:24
configurations is like is huge is is
00:35:26
much more than the might data we can
00:35:27
ever like you know it's like the more
00:35:29
than the number of atoms in the
00:35:30
universe. So we we can't just learned
00:35:33
by heart everything gives or we need
00:35:34
some form of station. And besides the
00:35:37
smooth this is something which has been
00:35:39
very a successful and powerful machine
00:35:41
running we need other assumptions in in
00:35:44
in these big units we're putting in two
00:35:48
crucial additional assumptions which
00:35:50
have to do with compositionality, in
00:35:53
different forms of it right so we
00:35:55
already know compositionality is very
00:35:57
powerful and humans use it all the time
00:35:59
that's how we, you know, grasp the
00:36:02
world around us: we compose concepts, and
00:36:05
human language is basically an exercise
00:36:07
in composing ideas and meanings,
00:36:09
right. So in those neural nets we have
00:36:12
two forms of compositionality, one
00:36:15
which happens even with a single-layer
00:36:18
neural net. So every time you have
00:36:20
what we call distributed
00:36:21
representations, think about one
00:36:23
layer of a neural net, each neuron,
00:36:25
each artificial unit, is detecting a
00:36:28
a feature a concept and and these these
00:36:31
detectors are not mutually exclusive, so
00:36:33
the number of configurations that we
00:36:34
can capture grows exponentially with
00:36:37
the number of units. So a single-layer
00:36:38
neural net is in some sense
00:36:40
exponentially powerful in what it can
00:36:42
represent. Um so this idea of learning
00:36:45
features is our first form of
00:36:47
compositionality: an object is
00:36:49
described by the composition of, you
00:36:51
know, which features are activated,
00:36:53
which attributes are present in
00:36:55
this particular image. And because we
00:36:57
have many attributes and they can be on
00:36:59
or off or maybe some some grey level we
00:37:02
can have a very rich description that's
00:37:04
exponentially rich in some sense.
00:37:06
And there's actually more to it than the
00:37:10
words that I'm saying that there's a
00:37:11
lot of math behind this showing that
00:37:14
the fact that you have these
00:37:16
representations really buys you
00:37:18
something exponential in statistical
00:37:21
sense. So that's the first form of
00:37:24
compositionality. There's a second
00:37:27
form which is the one you get when you
00:37:29
you have many layers one on top of the other,
00:37:31
where each layer computes something as a
00:37:34
function of the output of the previous
00:37:35
layer, and here you also get an
00:37:38
exponential gain. So again we have
00:37:40
theory showing that by having more
00:37:43
composition, you know, of
00:37:46
layers on top of layers, you
00:37:48
can represent functions that are
00:37:49
exponentially richer in some sense. Um
00:37:53
so that's the other key
00:37:55
ingredient. And and of course these are
00:37:58
assumptions about the world. There's no
00:37:59
free lunch as I said these things work
00:38:02
because they fit well with the world in
00:38:04
which we live. If it wasn't the case
00:38:06
that the world was conveniently
00:38:08
describable in a compositional way, these
00:38:10
neural nets would not be working as
00:38:11
well as they are. To give you an example
00:38:14
of this, think about a neural
00:38:19
net where at some level of representation
00:38:22
you have different units that detect
00:38:23
different kinds of attributes. So let's
00:38:25
say the input is an image
00:38:26
of a person. So you can imagine that
00:38:29
you could have a unit that recognises
00:38:30
that the person wears glasses you could
00:38:32
have a unit that recognises that the
00:38:34
person is female, you can have
00:38:36
another unit that recognises the
00:38:37
person's a child and so on you can
00:38:39
imagine like you know a thousand such
00:38:40
units. And now, why is that
00:38:44
interesting? Because imagine you were to
00:38:48
try to learn these detectors, right. So
00:38:51
you have these thousand detectors, okay,
00:38:53
so if you have these
00:38:59
n equals a thousand detectors, a thousand
00:39:01
features. And if you want to learn them
00:39:04
separately you would need something
00:39:06
like, say, K parameters, because
00:39:10
you know, what are the
00:39:11
characteristics of the person that
00:39:12
wears glasses, and you can
00:39:14
imagine training a separate neural net
00:39:15
even for each of them. But we will do
00:39:17
better than that by sharing the layers
00:39:19
but in the worst case you could imagine
00:39:20
that if I had n features and each of
00:39:22
them requires sort of order of K
00:39:23
parameters, then in total I need
00:39:25
order of n times K examples to learn
00:39:29
about all these features. Now let's
00:39:32
consider the alternative, where we don't
00:39:34
use this representation: we use an
00:39:36
old-school nonparametric approach
00:39:40
which says well we gonna consider all
00:39:43
of the possible configurations of image
00:39:44
the images of persons. So now,
00:39:48
what I'm gonna do, say I'm using an SVM
00:39:52
or something like that, or a nearest-neighbour
00:39:54
approach: I'm gonna have to use an
00:39:57
example for each of the configurations
00:39:59
that I want to learn about so how many
00:40:00
configurations of the input can I get?
00:40:02
Well, potentially exponential in D, where D
00:40:06
is the input dimension: the number of
00:40:09
ways to split the data into all the
00:40:12
configurations of, you know, has
00:40:13
glasses, doesn't have glasses,
00:40:15
she's a female, she's not, she's a child,
00:40:18
she's not a child. So all of this is
00:40:20
basically an exponentially large set.
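The counting argument can be written out directly; n and K here are placeholder values, not numbers from the talk:

```python
# Learning n attribute detectors independently costs on the order of
# n * K examples; enumerating every joint configuration of n binary
# attributes costs on the order of 2**n. n and K are illustrative.
n = 1000  # attribute detectors: wears glasses, is female, is a child, ...
K = 50    # examples/parameters per detector

compositional = n * K  # grows linearly in the number of attributes
enumerative = 2 ** n   # grows exponentially

print(compositional)            # 50000
print(enumerative > 10 ** 300)  # True: astronomically larger
```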
00:40:23
And the good news is, if we
00:40:26
represent the data with this
00:40:28
representation, we can do it with
00:40:31
something that grows nicely with the
00:40:33
complexity of the task
00:40:36
rather than exponentially. So yeah we
00:40:40
actually we have formal papers about
00:40:44
this, characterising the number of
00:40:52
linear pieces that a neural net can
00:40:53
capture that has a single hidden layer
00:40:56
with n hidden units, and
00:40:58
essentially it can represent things that
00:41:00
look complicated, where you might
00:41:02
think you need an exponential number of
00:41:04
parameters in order to learn, but
00:41:07
because of the
00:41:10
compositionality that we're assuming about
00:41:12
the world, you can. Essentially,
00:41:15
the reason this works is that we can
00:41:17
assume that we can learn independently
00:41:19
about wearing glasses or about being
00:41:21
female versus male or being a child or
00:41:24
not, right. That's why this
00:41:26
composition works. If the detector
00:41:29
for glasses needed to know whether you
00:41:32
were female or child and so on in order
00:41:34
to do the detection of glasses then it
00:41:36
wouldn't work you would need as many
00:41:37
parameters as if you were doing an SVM
00:41:39
or a nearest-neighbour method. The fact
00:41:43
that you can learn about these
00:41:44
attributes kind of separately from each
00:41:46
other without having to know all of the
00:41:47
configurations of the other attributes
00:41:49
is the reason why this is working, okay.
00:41:54
And something similar has
00:41:58
been shown for depth, where you find that,
00:42:01
you know, some functions can be
00:42:04
represented very efficiently with a
00:42:06
deep net but if you wanted to represent
00:42:08
those functions with a shallow network
00:42:10
you might need a huge number of units.
00:42:12
In other words, there are functions out
00:42:14
there which really are naturally
00:42:15
expressed as a composition of many levels
00:42:19
of nonlinear transformations and if you
00:42:20
try to capture those functions using a
00:42:24
not sufficiently deep network, say a
00:42:26
shallow network with a single layer or two
00:42:28
layers, and that's not enough, then
00:42:30
would need an
00:42:32
exponential number of units, so an
00:42:33
exponential number of
00:42:34
parameters, and an exponential
00:42:37
number of examples to learn those
00:42:38
functions properly. Okay, so this
00:42:44
is a bit of the theory. There's
00:42:44
another important thing that happened
00:42:46
recently which is maybe as important
00:42:50
for many years researchers in machine
00:42:54
learning thought that neural nets couldn't
00:42:56
be really practical and useful because
00:42:59
training them involves a non-convex
00:43:02
optimisation problem which could have
00:43:04
many local minima so what I mean by
00:43:06
this is: especially if you are in
00:43:08
low dimension, you think about functions,
00:43:11
you're trying to optimise a function like
00:43:13
the total error with respect to the
00:43:16
parameters, and it might be very
00:43:17
bad: if you just do stochastic gradient
00:43:19
descent or any kind of local descent
00:43:21
algorithm rather than global optimisation, you might
00:43:22
get stuck in these local minima and
00:43:25
that's not the case for things like
00:43:27
kernel machines. So the question is
00:43:31
you know, is this a real
00:43:32
problem? Well, it turns out that it's not
00:43:34
a real problem. Uh at least there's a
00:43:36
lot of evidence that this myth that
00:43:38
training neural nets is riddled with
00:43:42
bad local minima is really a myth.
00:43:45
And what we found actually is that
00:43:48
it's especially true as we go from tiny
00:43:51
networks to large networks so the
00:43:53
really interesting thing is that the
00:43:54
larger the network that easier it is to
00:43:56
optimise. So you know you can you can
00:44:01
find really bad cases of optimisation
00:44:03
on very small nets with very few units,
00:44:05
but when you go to millions of
00:44:08
parameters or hundreds of millions of
00:44:09
parameters, there is a sort of
00:44:11
statistical effect that's happening
00:44:13
that really makes the optimisation much
00:44:15
easier. And we studied this using
00:44:20
an analysis of critical points, which
00:44:24
are the places where the the network
00:44:26
has zero derivatives. And it turns out
00:44:29
that for the most part the kind of
00:44:33
critical points that you could
00:44:34
encounter during training a neural net
00:44:35
are saddle points, meaning that
00:44:38
you're not stuck there: there are directions where
00:44:40
it looks like a local minimum but in
00:44:42
other directions it's actually going down.
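The geometry can be seen on the textbook saddle f(x, y) = x squared minus y squared; this is just the shape, not a real neural-net loss. Gradient descent started slightly off the critical point slides away along the descending direction instead of getting stuck:

```python
# f(x, y) = x**2 - y**2 has a critical point at (0, 0): it looks like a
# minimum along x but goes down along y. Plain gradient descent started
# slightly off the saddle escapes instead of getting stuck.
def grad(x, y):
    return 2 * x, -2 * y  # df/dx, df/dy

x, y = 0.5, 1e-3  # start near the saddle, slightly off the y axis
lr = 0.1
for _ in range(100):
    gx, gy = grad(x, y)
    x -= lr * gx
    y -= lr * gy

print(abs(x) < 1e-6)  # True: the x direction converged to 0
print(abs(y) > 1.0)   # True: it escaped along the descending y direction
```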
00:44:44
And so, you know, gradient descent will
00:44:46
just keep going and not get
00:44:49
stuck in those saddle points.
00:44:53
Yeah, so there is a lot to say about
00:44:56
this, but I see time flying, so let me tell
00:44:59
you a little bit about where
00:45:01
we're going now. Right, the
00:45:05
beginnings of neural nets were really
00:45:07
about pattern recognition, about the
00:45:09
kinds of things I've told you, where you
00:45:11
recognise objects in images. And of
00:45:15
course AI is much more than
00:45:17
pattern recognition. But what's
00:45:19
interesting is that in the last few
00:45:21
years there's been really a lot of
00:45:23
progress in moving neural nets toward
00:45:25
something that's more like a high level
00:45:27
cognition. Um there's been a lot of
00:45:31
work about attention in particular in
00:45:33
my lab and and in many other places
00:45:35
now. I'll tell you about that. So
00:45:38
attention is essentially inspired,
00:45:41
of course, by what we know about
00:45:42
humans. Instead of considering the whole of an
00:45:46
input or a big set of numbers as one
00:45:50
homogeneous block. Um so for example if
00:45:53
you think about a layer that is looking
00:45:56
at the lower layer instead of looking
00:45:57
at everything the network learns to
00:46:00
focus on parts of the input, or a layer
00:46:05
learns to focus on part of
00:46:08
its input. Another direction
00:46:14
that's really that's very very
00:46:15
promising is to look at reasoning
00:46:18
problems where instead of going from
00:46:21
input to output in one step, you
00:46:23
actually have a sequence of steps and
00:46:25
the number of steps could vary. At
00:46:27
each step we combine pieces of evidence
00:46:30
you know to to come up with a
00:46:31
conclusion this is really what
00:46:33
reasoning is about you combine say
00:46:35
different observations with different
00:46:37
things you know about the world, and you
00:46:40
you know combine them to find an
00:46:42
answer. So I'll I'll tell you a little
00:46:45
bit about this. About a year and a
00:46:49
half ago this started with the simple
00:46:51
memory networks and neural Turing machines.
00:46:53
And then another direction which is
00:46:55
related to this is everything that has
00:46:57
to do with planning. And reinforcement
00:46:59
learning and this is been exemplified
00:47:01
by the work of deep mind which has been
00:47:04
acquired by Google couple of years ago.
00:47:06
And their work on playing atari games
00:47:09
and more recently on the alpha go the
00:47:11
system that I mentioned at the
00:47:12
beginning, which beat the world
00:47:14
champion. But it's much more than
00:47:16
playing games it's about learning to
00:47:19
take decisions. And being able to learn
00:47:22
in a context where you don't
00:47:23
necessarily have the ability to have
00:47:26
labels or supervised learning at every
00:47:28
step. And then more recently this
00:47:30
this kind of research we're combining
00:47:32
deep learning with with reinforcement
00:47:35
learning has gone into robotics. So the
00:47:37
whole field of robotics, led in
00:47:38
particular by Berkeley, is moving
00:47:40
towards the use of deep learning. Let me
00:47:44
say a few words about attention. So
00:47:46
imagine a sequence of feature
00:47:52
vectors so you think of each of these
00:47:53
points as a vector we've been using
00:47:56
this for machine translation so each of
00:47:58
those would be a feature vector
00:48:01
extracted corresponding to a particular
00:48:05
place in an input sentence. So it
00:48:09
may contain semantic attributes
00:48:11
corresponding to the word at that
00:48:13
position, as well as words in the
00:48:14
neighbourhood right so this is a
00:48:17
sequence of feature vectors but you
00:48:19
know it could be any kind of space. And
00:48:22
we're gonna produce another sequence of
00:48:23
feature vectors. But instead of using
00:48:26
the usual fully connected
00:48:29
approach, which is kind of a static graph,
00:48:32
we're gonna make the relationship
00:48:33
between the first sequence and
00:48:35
the second sequence something more
00:48:37
dynamic, using an attention mechanism.
00:48:39
So what's the idea of the attention
00:48:40
mechanism? The idea is that when we
00:48:43
need to produce this feature vector,
00:48:46
instead of looking at all of these guys
00:48:48
we're gonna choose a few of them, maybe
00:48:50
mainly this one, and we're gonna use
00:48:53
that feature, and maybe a few
00:48:55
others, to compute the feature at the
00:48:58
next level. Right, so we're gonna focus
00:49:00
on a few elements in the input sequence
00:49:03
this is the crucial thing and you can
00:49:05
do it using what's called soft
00:49:07
attention or stochastic hard attention. We
00:49:10
work mostly with soft attention but we
00:49:12
have a paper where we also use stochastic
00:49:13
hard attention. So the idea of soft
00:49:15
attention is that instead of taking a
00:49:19
yes no decision about which features we
00:49:21
gonna be looking at which element in
00:49:22
the set here we're gonna be looking at,
00:49:23
we compute some soft weights
00:49:27
that sum to one over all the elements
00:49:29
here in order to decide you know how
00:49:31
much attention we gonna give to each of
00:49:32
them. And those soft weights are gonna be
00:49:34
computed by a little attention neural
00:49:37
net, a little MLP here, that takes the
00:49:41
the contexts at the upper level here
00:49:44
and the features at the lower level and
00:49:46
basically decides if it's a good match, you
00:49:48
know should we use this guy as input
00:49:50
for the next one here, and it outputs a
00:49:52
score for each of the possible
00:49:54
positions, using
00:49:55
the corresponding input features.
00:49:57
So because these weights are just
00:50:01
part of a soft,
00:50:04
differentiable computation, you can
00:50:06
learn to put attention in the right
00:50:08
place and it does learn to do that. And
00:50:10
in fact it's thanks to this attention
00:50:12
mechanism that we reached the state of
00:50:13
the art in machine translation in
00:50:16
the last year, in two thousand fifteen.
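Here is a minimal sketch of soft attention: toy vectors, with a simple dot-product score standing in for the little MLP described above.

```python
import math

def soft_attention(query, keys, values):
    """Weighted sum of values; weights are a softmax over query-key scores."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]  # sum to one over the input positions
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return weights, context

# Three input positions; the query matches the second one best.
keys = [[1.0, 0.0], [0.0, 4.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, context = soft_attention([0.0, 1.0], keys, values)

print(abs(sum(weights) - 1.0) < 1e-9)  # True
print(weights[1] == max(weights))      # True: attends mostly to position 2
```

Because the weights come out of a softmax, everything stays differentiable, which is what lets the network learn where to attend by ordinary gradient descent.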
00:50:17
So yeah, we basically use the
00:50:23
architecture I showed before to process
00:50:28
input sentences extract those features
00:50:32
from them using a form of recurrent
00:50:35
net, actually a bidirectional
00:50:37
recurrent net. And then, let me
00:50:40
show you that picture again you can
00:50:42
think of it like we've extracted
00:50:44
semantic features from the whole
00:50:46
sentence or you think about even
00:50:48
reading a whole book, right, and each of
00:50:50
these is a feature vector extracted at each
00:50:51
position, each word, in the book
00:50:53
here. And now we can produce a word at
00:50:56
a time in the translated book. And so
00:50:59
each time we produce the next word in
00:51:01
the translated book we decide which
00:51:03
word, or which few words, in the
00:51:06
source book we should be looking
00:51:08
at. And this works quite well, in
00:51:11
comparison to a technique that had been
00:51:12
tried before, along
00:51:15
with our colleagues at Google, where
00:51:17
you read the whole book, you come up
00:51:19
with that kind of semantic
00:51:22
representation of the whole book and
00:51:23
then you feed that into another
00:51:25
recurrent net which produces the the
00:51:27
words in the translated book and that
00:51:29
doesn't work as well, because it's hard to
00:51:30
compress that much information into a
00:51:32
fixed-size vector. But by allowing
00:51:35
the network to decide, at each
00:51:38
point in producing the output sequence
00:51:40
where to look, it works very well,
00:51:42
and so we won a couple of the WMT
00:51:48
challenges, the yearly
00:51:49
competition for machine translation,
00:51:52
using these neural machine translation
00:51:53
systems. And more recently our
00:51:56
colleagues at Stanford have been
00:51:59
using this on other datasets and
00:52:01
benchmarks and obtained even stronger
00:52:03
improvements, and now there's
00:52:05
a whole cottage industry to improve
00:52:07
these neural machine translation
00:52:08
systems, and they're essentially
00:52:11
leading in machine translation
00:52:14
right now. One thing you can do
00:52:17
with attention that's quite cool as
00:52:19
well is combining the things we've done
00:52:23
in computer vision with things we've
00:52:26
learned with modelling language. So in
00:52:31
this work, what we've done is we trained
00:52:34
a neural net, a convolutional net, that
00:52:36
extracts features from the image, and
00:52:38
then we use an attention mechanism to
00:52:41
decide to produce one word at a time in
00:52:43
the sentence that's supposed to be a
00:52:44
description of the image so the
00:52:46
computer reads the image and produces a
00:52:47
sentence stochastically: it outputs a
00:52:50
probability for the next word and then
00:52:52
we sample that word and produce
00:52:54
a probability for the next word, and so on.
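The word-by-word sampling loop can be sketched like this; the conditional distributions are a made-up stand-in for what the trained network would output given the image and the history:

```python
import random

# Toy stand-in for the decoder: maps the previous word to a probability
# distribution over the next word. A real model conditions on the image
# and the whole history via a recurrent net.
next_word_dist = {
    "<start>": {"a": 1.0},
    "a": {"woman": 0.7, "dog": 0.3},
    "woman": {"is": 1.0},
    "dog": {"is": 1.0},
    "is": {"throwing": 0.6, "standing": 0.4},
    "throwing": {"<end>": 1.0},
    "standing": {"<end>": 1.0},
}

def sample_caption(rng):
    """Sample one word at a time until the end token is drawn."""
    word, caption = "<start>", []
    while word != "<end>":
        dist = next_word_dist[word]
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        if word != "<end>":
            caption.append(word)
    return caption

print(sample_caption(random.Random(0)))
```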
00:52:56
And so it sees this image and says a
00:52:57
woman is throwing a frisbee in the park
00:53:00
and it does it using attention, so each
00:53:01
time it produces a word in the output
00:53:03
sequence it chooses where to look in the
00:53:05
input so here when it says frisbee it's
00:53:07
looking in this region where there's a
00:53:09
frisbee. Um so just a few years ago
00:53:11
somebody would have told me well we're
00:53:12
gonna train a neural net that, you know,
00:53:14
looks at an image and produces a natural
00:53:15
language sentence that describes it I
00:53:18
would have said, nah, it's gonna take, you
00:53:19
know at least ten years. Uh but it's
00:53:22
there, and you know,
00:53:23
this is a more-than-a-year-old result,
00:53:26
and now, you know, people are
00:53:30
doing even better than that. So let
00:53:33
me show you more of these examples of
00:53:35
the computer looks at this and it says
00:53:37
a dog is standing on the hardwood floor
00:53:39
and when it says dog it's looking at the
00:53:41
face of the dog. It looks at this image
00:53:43
and it says a stop sign is on the road
00:53:45
with the mountain in the background.
00:53:46
And when it says stop sign,
00:53:49
you know where it's looking: it's
00:53:52
looking at the stop sign. Now let me show
00:53:55
you something that our colleagues at
00:53:57
Facebook did, using something
00:53:59
similar but now instead of producing a
00:54:03
sentence, it answers questions. Is
00:54:07
there a baby? Yeah. What is the man doing?
00:54:12
[inaudible] Is the baby sitting on his lap?
00:54:17
Yeah. Are they smiling? Yeah. Is there a
00:54:27
baby in the photo? Yeah. Where is the
00:54:32
baby standing? [inaudible] What is the baby
00:54:37
doing? [inaudible] What game is being
00:54:44
played? Soccer. Is someone kicking the ball?
00:54:49
Yeah. What colour is the ball? Yellow.
00:54:56
What is the dog playing? [inaudible] What colour
00:55:05
is the dog? Black. Is the dog wearing a
00:55:11
collar? Yeah. What is the cat sniffing?
00:55:17
[inaudible] Where is the cat? [inaudible] What
00:55:25
colour is the cat? Black and white. What
00:55:30
colour are the bananas? [inaudible] Okay, now you have
00:55:39
to beware: this is a demo made by
00:55:42
Facebook. So, I mean, I think this
00:55:47
is real but they probably selected
00:55:49
cases where it works better
00:55:52
nonetheless. This is really impressive
00:55:55
let me tell you a little bit about
00:55:59
what's behind the scenes in addition to
00:56:01
the mechanisms I've been telling you about,
00:56:02
and essentially it's using this
00:56:07
attention mechanism idea not just
00:56:11
to focus on a particular part of the
00:56:14
input but to focus on a particular part
00:56:19
of memory. So the idea here is to
00:56:25
separate the main computation which
00:56:27
would be done by recurrent network
00:56:28
typically from a memory which you can
00:56:33
think of like a computer memory where
00:56:35
you would have a vector at each address
00:56:38
and these vectors could be
00:56:40
long like think of these as the word
00:56:41
embeddings, so they might be like two
00:56:43
hundred dimensional something like
00:56:44
that. And now the recurrent net
00:56:49
can of course read from the external
00:56:51
world and produce outputs and answers.
00:56:54
But it can also do internal actions. So
00:56:58
the internal actions here would be
00:57:00
things like reading at a particular
00:57:02
place or writing at a particular place.
00:57:06
Now instead of taking a hard decision
00:57:08
about where to read and where to write.
00:57:11
And what to write it takes soft
00:57:14
versions of these decisions. So it
00:57:16
computes a score for each address,
00:57:21
and those scores, with the softmax,
00:57:23
would sum to one, telling it, really,
00:57:25
where it wants to read. And what it's
00:57:28
gonna do is, like we did for
00:57:30
the attention mechanism, it's gonna take those
00:57:32
weights and make a linear combination
00:57:36
of what it's reading. So we take the
00:57:38
contents everywhere weighted by those
00:57:41
scores that sum to one, you know, to
00:57:44
actually get the information from the
00:57:46
memory into the recurrent net. So it's
00:57:47
reading with a focus of attention on a
00:57:50
few places and you can do the same
00:57:52
thing for the writing. Yes, so you can
00:57:58
use these kinds of systems to do things
00:58:00
like read a little story, like: Sam
00:58:02
walks into the kitchen, Sam picks up an
00:58:04
apple, Sam walks into the bedroom and
00:58:06
drops the apple. And then the question: where
00:58:08
is the apple so the computer reads all
00:58:10
of these things including the question.
00:58:12
And it knows this is the question,
00:58:14
maybe because there is a special marker
00:58:16
in it, and it's supposed to answer, or something
00:58:17
like that or just like we had it in the
00:58:20
demo, except that in the demo, instead of
00:58:23
the text here, we had an image, but it's
00:58:25
exactly the same mechanism, alright.
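Here is a sketch of that soft read; the memory contents are toy values, and in a real memory network both the stored vectors and the addressing are learned:

```python
import math

def soft_read(key, memory):
    """Attention-weighted read: softmax over key/slot scores, then a
    weighted average of all slots instead of a hard table lookup."""
    scores = [sum(k * m for k, m in zip(key, slot)) for slot in memory]
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]  # one weight per memory address
    return [sum(w * slot[i] for w, slot in zip(weights, memory))
            for i in range(len(memory[0]))]

# Three memory slots; the key matches slot 1 most strongly.
memory = [[1.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 1.0]]
read = soft_read([0.0, 1.0, 0.0], memory)

print(read[1] > read[0] and read[1] > read[2])  # True: mostly slot 1
```

Because the read is a weighted average rather than a hard choice, the whole addressing step stays differentiable and can be trained end to end.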
00:58:31
So I'm gonna close here this is a
00:58:33
picture of my group in Montreal,
00:58:37
Montreal representing algorithms.
00:58:39
And we are always recruiting. Thank
00:58:44
you. So I guess it's time for the break
00:58:59
I'll be here of course for the panel later
00:59:02
so if you have questions, we can
00:59:04
answer them in the panel, and also tomorrow I'll
00:59:07
be giving another lecture and I'll
00:59:11
leave more time for questions during
00:59:13
the lecture, so, you know, you can
00:59:14
keep your questions a little bit for
00:59:17
later today or tomorrow; we can have the
00:59:19
questions then, so we have time.

Conference program

Deep Supervised Learning of Representations
Yoshua Bengio, University of Montreal, Canada
4 July 2016 · 2:01 p.m.
Hardware & software update from NVIDIA, Enabling Deep Learning
Alison B Lowndes, NVIDIA
4 July 2016 · 3:20 p.m.
Day 1 - Questions and Answers
Panel
4 July 2016 · 4:16 p.m.
Torch 1
Soumith Chintala, Facebook
5 July 2016 · 10:02 a.m.
Torch 2
Soumith Chintala, Facebook
5 July 2016 · 11:21 a.m.
Deep Generative Models
Yoshua Bengio, University of Montreal, Canada
5 July 2016 · 1:59 p.m.
Torch 3
Soumith Chintala, Facebook
5 July 2016 · 3:28 p.m.
Day 2 - Questions and Answers
Panel
5 July 2016 · 4:21 p.m.
TensorFlow 1
Mihaela Rosca, Google
6 July 2016 · 10 a.m.
TensorFlow 2
Mihaela Rosca, Google
6 July 2016 · 11:19 a.m.
TensorFlow 3 and Day 3 Questions and Answers session
Mihaela Rosca, Google
6 July 2016 · 3:21 p.m.
