
Transcriptions

Note: this content has been automatically generated.
00:00:00
We'll start with a short word of welcome, and then we will have the first talk by Yoshua. — Okay, well, I will have to be very, very short; I promise, for once, as you were saying. Well, welcome to Switzerland, and to the sunny Valais in particular, the best place, and one of the less polluted ones. I hope that besides this great workshop you will have time to enjoy the region; it is really worth spending time here, hiking around or whatever you like. Anyway, I would like to thank first the organisers of this great workshop, and all the people who kindly accepted to come; it is a real pleasure for us.
00:00:54
this workshop there was two reasons to
00:00:55
have this workshop this year the first
00:00:58
one is that is one of the twenty fifth
00:01:01
anniversary events that lead up is
00:01:02
organising this year one among a few
00:01:06
orders many others that will have
00:01:08
basically every month and the second
00:01:11
one is that basically there is as we
00:01:13
all know and that's why we are all here
00:01:15
there is a big a revival whatever we
00:01:18
want to call it. Well or progress in
00:01:23
the mission on the ink and deep
00:01:26
learning and no network as opposed to
00:01:29
just superficial learning that people
00:01:31
were doing in the past. And obviously
00:01:34
it yeah but is built into a pretty
00:01:37
large institute of hundred twenty
00:01:39
people fifty or in the twenty startups
00:01:42
and many more to come in to the feud to
00:01:45
your work years around the around to
00:01:48
advance signal processing and machine
00:01:50
learning I dress to menu problems is
00:01:54
Dave a the speech processing computer
00:01:56
vision biometrics by you imaging
00:02:00
computer vision human behaviour
00:02:03
understanding and so on. So everything
00:02:06
we are doing sounds very fancy and very complicated, but what we like about it is that we are all sharing the same tools, which are basically tools coming from signal processing and machine learning. So any progress in this area, regarding software or regarding hardware, matters to all of us, and that is something that is unique to this workshop, I believe. I think it is one of the first times that I know of where we have people coming from the hardware side, who do not always agree with each other either, because we are talking about CPUs, GPUs and many other architectures, together with people from the software side. So this will be a great place to exchange ideas about the future and how we can help the community at large in areas like the ones I just mentioned. So again, thank you a lot
00:02:59
to the organisers who put this workshop together, and I wish you all a very good few days. Thank you. — Okay, so as was just said,
00:03:18
as you probably know, because you are here, there is a strong revival of machine learning and neural networks going on. I think we all agree that it is due to a mix of progress both on the theory side and on the engineering side, especially in software frameworks and hardware. So I think it is nice to have such fantastic speakers, because they span the range of interests of this workshop, from theory to hardware. I also think it is a nice event for the lab, because it fits well with what we do here, which is to sit at the interface between machine learning theory, engineering, and transfer to industry. So it is with great pleasure that we welcome our first speaker, Yoshua Bengio, who, as you obviously know, is at the heart of the machine learning and deep learning revival, who wrote two books on the subject, and whose work has been very impressive. Thanks. — Can you guys hear me well?
00:04:40
Yes? Good. So today I'll talk more about supervised learning, and I'm just going to scratch the surface; one hour is really not enough to do justice to this field. Tomorrow I'll talk more about unsupervised learning. So, as you know, cars are starting to drive themselves, we're starting to talk to our phones and they're starting to say something back, and computers are now able to beat the world champion at the game of Go, which was not expected to be something computers would be able to do for decades. And all this is
00:05:21
essentially because of the progress in machine learning, in an area called deep learning, which is essentially a renewal of neural nets. I think it's a lot more than these little things I mentioned: it's a whole new economic revolution that is coming, with progress in AI currently spearheaded by these techniques. And that's why so many companies are jumping into this; this slide is from about two years ago, and the field is much bigger and much more crowded now. So let me tell you about
00:05:54
deep learning, what it is. The general idea is that we want to learn from data, so these are machine learning algorithms. And what is particular is that we're going to learn representations of the data, and multiple levels of representation of the data. That's really what deep learning is about. And why would that be interesting? Because these multiple levels of representation are supposed to, and effectively seem to, capture different degrees of abstraction: as you go deeper, you tend to be able to capture more abstract concepts. And this has worked out really well. It started with speech recognition, object recognition and detection, and more recently there has been a lot of progress in natural language understanding and processing, machine translation and things like that. So this ability of
00:07:00
training neural nets that are deep, that have more than a couple of hidden layers, is something that really happened around 2006, thanks to funding from CIFAR, which is a Canadian organisation that makes pretty long bets on ambitious research projects; it included Geoff Hinton in Toronto, myself in Montreal and Yann LeCun in New York. That first breakthrough essentially allowed us to use unsupervised learning algorithms, which already existed, to bootstrap the training of a deep supervised neural net. Then another
00:07:39
really important advance, that I think many people are not aware of, is something that happened in 2011, when we found that we actually didn't need this unsupervised pre-training trick: if we just replaced the nonlinearity that people tended to use, the hyperbolic tangent or the sigmoid, with the rectifier, we were able to train very deep supervised networks. And just the year after that, our colleagues from Toronto used this trick, along with other new tricks like dropout, to achieve a really big breakthrough in performance on object recognition; I'll tell you more about that.
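Why the rectifier helps can be seen directly in the derivatives of the activation functions. A minimal Python sketch (an editorial illustration, not code from the talk):

```python
import math

def d_sigmoid(x):
    # Derivative of the sigmoid 1 / (1 + exp(-x)): s * (1 - s).
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def d_tanh(x):
    # Derivative of tanh: 1 - tanh(x)**2.
    return 1.0 - math.tanh(x) ** 2

def d_relu(x):
    # Derivative of the rectifier max(0, x): 1 for positive inputs, else 0.
    return 1.0 if x > 0 else 0.0

# For large inputs the saturating nonlinearities pass almost no gradient,
# while the rectifier still passes gradient 1 on its positive side.
grads_at_10 = (d_sigmoid(10.0), d_tanh(10.0), d_relu(10.0))
```

Multiplying many near-zero factors across layers is what makes deep sigmoid or tanh networks hard to train with plain backprop; the rectifier avoids that on its active side.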
00:08:22
So first of all, let me step back a little bit: what is AI about? AI is about building machines that can take good decisions. And for a computer, or even a human or an animal, to take good decisions, that entity needs knowledge; you get intelligence by having knowledge. This is of course well known, and for decades what happened in AI research is that we tried to give that knowledge to computers explicitly, by giving the computer rules and facts encoded in programs. But it failed, and it failed because a lot of the very important knowledge that we humans use to understand the world around us isn't something we can communicate in language or in programs; it's things we know but can't explain. This is essentially intuition. And that's where machine
00:09:20
learning comes in: we need to get that knowledge into computers, but we can't tell them exactly what that knowledge is, like how to recognise a face or a chair; instead we can show examples to the computer, and that's how computers have been able to learn that kind of intuitive knowledge. A good example of this is the game of Go I was telling you about before: you can ask an expert player why they made a particular move, and they will invent a story, but really the story is very incomplete, and students wouldn't be able to just use that story in order to play anywhere near as well as the master. So the expert player has this intuitive knowledge about what the right thing to do is, but can't really explain it.
00:10:05
However, what we can do is take games played by these high-level experts and bootstrap a neural net that learns to capture these intuitions implicitly; that is the power that machine learning draws from data. So another really important thing to understand about why machine learning is working so well these days is that it relies on optimisation: on defining what it is that you want the machine to learn as a function, like an error function, that we can just optimise. And the way we optimise it is actually incredibly simple compared to the very sophisticated things that have been done in optimisation: we just make very small changes, one example at a time, or a small batch of examples, what we call a minibatch, but that's a technical detail. We show one example at a time, and then we look at the error that the computer is making on that example, like "you were supposed to say car". And we make a very small change to the parameters inside the box that define the mapping from input to output, so that the net produces something slightly better. I'll tell you a lot more about backprop later, but that's the idea: compute what small change we can make to the neural net parameters so that the error will be slightly smaller next time, repeat that hundreds of millions of times, and the thing recognises cars, faces and desks and so on.
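The loop just described, show an example, nudge the parameters downhill, repeat, can be sketched in a few lines. This is an editorial toy (one parameter, invented data), not code from the talk:

```python
import random

# Stochastic gradient descent on a one-parameter model y_hat = w * x with
# squared error; one example at a time, a very small change each time.
random.seed(0)
data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]  # true parameter is 3.0

w = 0.0                            # initial guess
lr = 0.05                          # step size
for step in range(2000):
    x, y = random.choice(data)     # show one example at a time
    err = w * x - y                # error the model makes on this example
    grad = 2.0 * err * x           # d(err**2)/dw
    w -= lr * grad                 # the "very small change" to the parameter
```

After enough repetitions `w` approaches the value that makes the error small; scaled up to millions of parameters and examples, this is the whole training principle.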
00:11:39
And one of the first areas where this breakthrough of using deep nets really made a difference is speech recognition. It started around 2010, and what we see in the graph is what happened over the years on a particular benchmark; this is a cartoon, the real picture of course has lots of ups and downs. What we see is that in the nineties things were progressing quite well using HMMs, which were the standard of the day, and in the 2000s, somehow, even though we had more data and faster computers, performance didn't improve that much, until these deep neural nets started being used. That was a big drop in error rates, and over a matter of a few years the whole area of speech recognition turned to using these things. All the industrial systems now are based on these deep neural nets. And then,
00:12:35
lagging by a couple of years, something similar happened in computer vision. It started with object recognition: given an image, which object is present? Is there a dog, is there a chair, is there a person? And the task that really started this going is ImageNet, where you have a thousand categories, you're given an image, and you're supposed to say which of the categories is present. And in the last few years, from 2012 to 2015, not only did performance improve very fast thanks to these deep convolutional nets, but we essentially reached human-level performance on this task. To be fair, this is true for sort of nice images, and humans are still better when the recognition is harder, but the progress has been really amazing, and it is now almost more of an industrial concern to get these into products. Okay, I'm going to actually now show a video from my former colleagues, who started a company a few years ago and were recruited by NVIDIA, and who used these convolutional nets to train it. [Video plays.]
00:15:16
Right, so this and other things that deep learning is bringing are going to really change our world. Okay, but now, for the rest of my presentation, I'm going to go a bit more into the technical part. And I'm going to start by telling you about the workhorse of the progress we've had in the last years, which is just good old backprop, from the eighties or the late seventies depending on how you want to look at it. It is based on very, very simple ideas that are really important to understand, in order to even debug the things that you'll be playing with in the next few days. So remember, I said that we
00:15:59
want to compute the small changes that need to be made to those neural net parameters so that it performs slightly better next time. This just happens to be a gradient: the partial derivative of the error, the loss function we optimise, with respect to the parameters. So how are we going to compute these partial derivatives through this very complicated machine, this deep neural net? We're going to use the chain rule, which tells us how to compute derivatives through composition. So if x influences y through a function g, and y influences z through a function f, so that z is f composed with g of x, we can get the derivative of the final answer with respect to the input by just multiplying the two partial derivatives along the way: dz/dx = dz/dy * dy/dx. In our case, the x that we care about is going to be some parameter, and the z that we care about is going to be the error we're making on an example, the loss that we want to minimise.
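The single-path chain rule stated here can be checked numerically in a couple of lines (an editorial sketch with an arbitrary choice of f and g):

```python
import math

# z = f(g(x)) with g(x) = x**2 and f(y) = sin(y), so by the chain rule
# dz/dx = dz/dy * dy/dx = cos(x**2) * 2*x.
def dz_dx(x):
    y = x ** 2
    dy_dx = 2.0 * x
    dz_dy = math.cos(y)
    return dz_dy * dy_dx

# Compare with a centred finite difference at x0.
x0, eps = 0.7, 1e-6
numeric = (math.sin((x0 + eps) ** 2) - math.sin((x0 - eps) ** 2)) / (2 * eps)
```

The analytic and numerical values agree to high precision, which is also a handy way to debug hand-written gradients.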
00:17:00
Now, the thing that is great about backprop is that if the amount of computation you need to compute the loss, as a function of the example and the parameters, is on the order of N, say depending on the number of parameters, then computing the gradient, the derivative of the loss with respect to the parameters, is also on the order of N. So if you can compute something efficiently, you can also compute its gradient efficiently. So we start with
00:17:27
this x, which is a variable like a parameter, and we apply transformations; this is just a graphical view of what I told you, with these partial derivatives along the way. And calculus tells us that if I make a small change delta-x, it's going to become a small change delta-y, obtained by taking delta-x and multiplying it by the partial derivative dy/dx; and in the same way, delta-y is going to transform into delta-z by multiplying delta-y by dz/dy. So then if I take the delta-y and plug it into the other equation, I get that a small change delta-x becomes a delta-z obtained by multiplying by dz/dy times dy/dx. This is basically what happens; this is the chain rule, this is how it comes up. And that
00:18:10
was the simple scenario where x goes directly to z, but maybe there are different paths: x influences y1, and it also influences y2, for example; these might be two neurons in a layer, z is your loss, and x is some parameter. Now it turns out that the chain rule just changes a little bit: we add up the products along each of the paths. So we have these partial derivatives along the paths, and we do this one times this one, plus this one times this one. And of course we can generalise this to n paths, and we get this equation, which says essentially that for a node x, we look at the partial derivatives of the loss with respect to its successors, here y1 through yn, and multiply each by the partial derivative along the path, dy_i/dx, for each of these y_i.
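The multi-path version, summing over paths the products of partials, can be verified on a tiny example (editorial, with hypothetical intermediate functions):

```python
# x feeds two intermediates, y1 = 2*x and y2 = x**2, and z = y1 * y2.
# The multi-path chain rule sums dz/dy_i * dy_i/dx over both paths.
def dz_dx(x):
    y1, y2 = 2.0 * x, x ** 2
    dz_dy1, dz_dy2 = y2, y1        # partials of z = y1 * y2
    dy1_dx, dy2_dx = 2.0, 2.0 * x  # partials of each path out of x
    return dz_dy1 * dy1_dx + dz_dy2 * dy2_dx

# Since z = 2 * x**3 overall, the direct derivative is 6 * x**2.
```

Both routes give the same number, which is the point: per-path products, summed.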
00:19:07
Okay, so that's a very simple formula, and that's what you have at the heart of things like Torch or TensorFlow. And of course you can generalise this to an arbitrary graph of computation. What these packages do is create a data structure which represents a graph of computation, or flow graph, where each node corresponds to a result; usually those nodes won't be scalars but tensors, like matrices, vectors, or higher-order objects, but the principle is the same. Once we have that graph, we can either compute it forward, or we can compute derivatives in a recursive way, by saying: the derivative of the final loss with respect to some node x in the middle can be obtained recursively by looking at the already computed partial derivatives dz/dy_i, for each of the successors y_i of x in the graph, times the partial derivative along the arc, dy_i/dx: how this node influences the next node, for each of the next nodes, and how each of those influences the loss, which we have already computed recursively, because it has the same form, dz/d-something, where "something" is any of the nodes. Of course, to make that work we have to do it in the proper order: we first need to compute the derivatives with respect to the y's before we compute the derivatives with respect to x.
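The recursion just described, accumulate into each node the sum over its successors of the successor's gradient times the local partial, visiting nodes in reverse topological order, fits in a few dozen lines. A minimal editorial sketch of reverse-mode automatic differentiation on scalars (real packages do the same over tensors):

```python
# Each node stores its value, its parents, and the local partial derivative
# with respect to each parent. backward() then applies the graph chain rule.
class Node:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = list(parents)  # list of (parent_node, local_partial)
        self.grad = 0.0

def mul(a, b):
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])

def add(a, b):
    return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])

def backward(loss):
    loss.grad = 1.0
    order, seen = [], set()
    def topo(n):                      # depth-first topological ordering
        if id(n) in seen:
            return
        seen.add(id(n))
        for p, _ in n.parents:
            topo(p)
        order.append(n)
    topo(loss)
    for n in reversed(order):         # successors before predecessors
        for p, local in n.parents:
            p.grad += n.grad * local  # sum over arcs into each parent

# z = x*y + x, so dz/dx = y + 1 and dz/dy = x.
x, y = Node(3.0), Node(4.0)
z = add(mul(x, y), x)
backward(z)
```

Note how `x.grad` receives contributions from two paths (through the product and directly), exactly the multi-path formula.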
00:20:33
Okay, so that's essentially backprop. You can apply it to a multilayer network; for example, here is a simple architecture where we might output a vector of probabilities over categories, a typical thing we do, and our loss might be the so-called negative log-likelihood, which is minus the log of the probability given to the correct class. So one of these outputs corresponds to the correct class, and we just want it to be as high as possible, so we take minus the log of it and we minimise.
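The negative log-likelihood loss described here is essentially one line; a small editorial sketch:

```python
import math

# Negative log-likelihood: given a vector of output probabilities and the
# integer index of the correct class, take minus the log of that entry.
def nll(probs, target):
    return -math.log(probs[target])

probs = [0.1, 0.7, 0.2]
loss_confident = nll(probs, 1)  # correct class got 0.7 -> small loss
loss_wrong = nll(probs, 0)      # correct class got only 0.1 -> large loss
```

Pushing the probability of the correct class towards 1 drives this loss towards 0, which is exactly the "make this output as high as possible" objective.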
00:21:03
And of course that loss depends both on the outputs and on the correct answer y, because we use this correct answer, which is an integer here, to tell us which of the output probabilities we want to maximise. Once we have computed that loss, we can go backwards using the same principles I showed here. So now we can compute the derivative with respect to the output units, and then, applying this recursively, get the derivatives with respect to the previous layer, as well as with respect to the weights that go into that layer; and similarly we can go back one more step and get the derivative with respect to those weights. Right, so if I go again... yes, that's what we just explained. Alright, I'll make
00:22:05
my slides available so you can look at this more carefully. And once you understand that, you can apply this to a graph of any structure. You can of course generalise it to graphs that are dynamically constructed, like in the recurrent network: instead of having a fixed-size graph, the graph actually has the form of a chain like this, and depending on the number of inputs, the x's, the graph will be longer, to accommodate reading all of these inputs and computing some internal state, corresponding to neurons, that summarises everything that has been seen before, in a way that captures what's needed for whatever computation follows. You could also generalise to graphs that are trees. But again, what is particular with these recurrent and recursive architectures is that instead of having a fixed-structure graph, the graph is dynamically constructed depending on the particular data you have, like the length of the sequence or the tree that's built on top of a sentence. For this to make sense, what you need is that the same parameters are reused: you don't have a separate set of weights for each time step, you have the same weights used at different time steps, and so if I have a longer sequence I can just extend the graph; it's going to be the same parameters everywhere. And so we can generalise to different lengths, or to different trees in the case of recursive networks.
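The weight-sharing point can be made concrete with a toy recurrence (editorial sketch, arbitrary weights):

```python
import math

# An unrolled recurrent net: the SAME two weights are applied at every time
# step, so one function handles sequences of any length.
def rnn_state(inputs, w_in=0.5, w_rec=0.8):
    h = 0.0
    for x in inputs:                  # the graph grows with the sequence
        h = math.tanh(w_in * x + w_rec * h)
    return h                          # state summarising everything read

short = rnn_state([1.0, 2.0])
longer = rnn_state([1.0, 2.0, 3.0, 4.0])  # same parameters, longer chain
```

A longer input just extends the chain; no new parameters appear, which is what lets one trained model handle sequences of any length.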
00:23:32
Alright, so this was a very, very brief intro to backprop; you can read a lot more in my book, which is available online. And now I'm going to move to a little bit of the "why does it work", and this is going to be very brief and high level; you'll probably want to delve deeper by yourself to make more sense of it, but I'm going to try to give you the basic concepts. So the previous part was about the "how", in a very focused way; now let's try to see why it is working, what's new with these deep networks. So, in the
00:24:11
early days of AI, what people were trying to do was build a system that goes directly from input to output through hand-designed programs. All the knowledge, as I said at the beginning, was put into the machine directly from the brains of the experts, or from the brains of the programmer, into a program; maybe the program had an explicit set of facts and rules, but that's how it was done, and there was no learning. Then a lot of the work in classical machine learning in the eighties and nineties was based on introducing some learning, in particular starting from a lot of hand-designed features that were crafted based on knowledge of what the input is supposed to be and what kind of invariances we're looking for, and then transforming those features through a learned mapping, often just a linear mapping or a kernel machine, to produce the output we want. What happens
00:25:08
with neural nets is that we look inside this box and think of it as a composition of multiple transformations, and once you start thinking about this, you have something in the middle here, between two sets of transformations. You can call the first transformation "extracting features", but now the features are going to be learned. And the second thing here might be a linear classifier again. But the thing in the middle is something new: it's a representation that the computer has learned. So the really important concept in deep learning is that we're learning representations, and not just any kind of representations, as I'll try to explain a bit later. That's the really crucial thing, and then deep learning is just taking this idea of learning representations and saying: okay, we're going to learn multiple levels of representations. Here I just have two levels of representation, this one and this one. The output is also a representation, but its meaning is fixed; it depends on the semantics of what we're trying to predict. And of course the input representation also has a fixed meaning. Whereas the meaning of these intermediate representations is something discovered by the computer; it's something the computer makes up to do the job. So for
00:26:16
example, around 2009 our friends at Stanford looked into a deep neural net, trying to figure out what kinds of representations it was learning. This neural net was trained on images of faces, and they found that at the lowest level, the first layer, the units extracted edge detectors. This is not a new observation; it is something that has been shown with a lot of machine learning models, and it's a natural set of features for images: these edge detectors like oriented contrasts of a particular size and position, and so on. But the network kind of discovered all this by itself. More interesting is what you find at the second level: each little square in these pictures represents the kind of input that a particular unit in that neural net prefers. And you see that it has things like parts of the face, like eyes and noses, or other features that you can think of as compositions of some of the first-level detectors. So the units at one level compose together: a unit computes a nonlinear function of the lower-level outputs, so it takes the representations at one level and computes a new kind of representation, and here we can see that it seems to be discovering detectors for parts that can be combined to form a higher-level part or full objects, like these faces. So why is this idea
00:27:50
working? There's actually no free lunch anywhere in machine learning, and not in deep learning either. The reason why deep learning is working is that somehow this idea of composing pieces together, composing functions at multiple levels, is natural; it is something that is out there, that fits well with how the world is organised. So if you're trying to do image recognition, well, there's a kind of natural hierarchy of concepts, starting from pixels, to edges, to little texture motifs, and then parts and objects. If you're modelling text, then characters combine into words, words combine into word groups or phrases that go into clauses and sentences and stories; we don't really know what the right concepts should be higher up, but we can imagine that there are some high-level abstractions that make sense for the particular domain. In speech, of course, we go from the acoustic samples to some spectral features, to sounds, phones and phonemes, and words, and higher up, language models. So
00:28:55
essentially, all of these deep networks are obtained by taking the raw data and transforming it through feature extraction at different levels, each more abstract than the lower ones. Now, this is similar to what I showed you before, but this is for colour images, and so you see not just these edges but also some sort of lower-frequency edges where colours become important. And higher up you now see detectors that capture funny shapes that are made by composing those edges, and even higher up, these funny shapes actually start looking like parts of objects, and so on; maybe this is like the face of a bird or something. Now, when you really think
00:29:50
a lot about machine learning based on
00:29:52
the presentations you can start playing
00:29:54
all kinds of really interesting games
00:29:57
and I'll only give you a glimpse here with
00:29:59
this very old work from my brother
00:30:02
Samy and his collaborators at
00:30:05
Google, and Jason Weston, who I
00:30:08
think was not at Google, but they worked
00:30:10
together on the idea of learning
00:30:12
representations for both text, more
00:30:16
precisely short text queries like
00:30:18
Eiffel Tower, and for images, and making
00:30:22
those representations live in the same
00:30:23
space and so what's going on is that
00:30:27
there's gonna be a large
00:30:29
transformation. So you can think of a
00:30:31
neural net that takes the image and
00:30:34
outputs a hundred numbers, right. So
00:30:36
here for visualisation it's 2D, but really
00:30:38
initially these were a hundred
00:30:40
dimensional vector spaces today
00:30:41
actually they're like two thousand but
00:30:43
it's the same idea. Um and and so you
00:30:48
have one function that maps images to
00:30:50
that space and you have another
00:30:52
function that you also learn that maps
00:30:54
these queries, which you can think of as
00:30:56
symbols, into the same space.
00:30:59
Actually the mapping between
00:31:01
those symbols and a vector is
00:31:03
just like table lookup right so you
00:31:04
have a table for each of I don't know a
00:31:06
million queries that are most frequent
00:31:09
the table would tell you what is the
00:31:11
hundred-dimensional vector corresponding
00:31:12
to dolphin or to Eiffel Tower,
00:31:15
and nowadays they have more complicated
00:31:17
mappings which maybe include recurrent
00:31:18
nets and so on and can deal with a much
00:31:20
larger vocabulary and and things like
00:31:22
that but the idea is we learned these
00:31:25
two mappings so that the representation
00:31:30
for say the word dolphin is gonna be
00:31:32
close to the representation for an
00:31:34
image of a dolphin. And why is that
00:31:36
useful well you can imagine if your
00:31:38
search engine. And you want to
00:31:40
associate things that people type
00:31:43
queries and images one way or the other
00:31:46
this is very useful right. So you want
00:31:48
images here to map to a representation
00:31:50
that is kind of semantic, that has to do with
00:31:52
what this means. We don't care here so
00:31:54
much about the details of the you know
00:31:57
the light on the water but we care
00:31:59
about what's in there: it's, you know,
00:32:00
a dolphin, and it's in water that looks
00:32:03
like a pool. So these kinds of
00:32:05
information that you would like them to
00:32:06
be encoded in this more abstract
00:32:09
representation much more abstract than
00:32:10
the pixels in such a way that you can
00:32:12
also recover the concepts that go with
00:32:15
them, because you end up operating
00:32:15
in that space, right. So let me tell you a
00:32:25
little bit, at a high
00:32:27
level what I think are key ingredients
00:32:31
for machine learning to approach AI,
00:32:33
and number one is pretty obvious but I
00:32:38
think for many decades we may have
00:32:40
ignored it too much which is we need
00:32:42
lots and lots of data. But let's
00:32:45
try to step back: why do we need
00:32:47
that much data? We need that much
00:32:49
data, and people complain that neural
00:32:51
nets need huge amounts of data. Well, if we want a
00:32:55
machine to be intelligent to understand
00:32:57
the world around us. It's gonna need a
00:33:00
lot of data to get that
00:33:02
knowledge because it's mostly learning
00:33:04
all that information about the world
00:33:05
around us by
00:33:08
observing. And the world around us
00:33:11
is complicated. So you know to learn
00:33:14
some complex representation of what's
00:33:16
going on out there, we need lots and lots
00:33:18
of data and I think that right now the
00:33:20
amount of data we're using is still way
00:33:22
too small compared to what will be
00:33:24
needed to reach human-level AI. So
00:33:26
that's really ingredient number one, but
00:33:28
of course it's not enough to have all that
00:33:29
data if we can't build models that can
00:33:32
capture it, and so for that we need very
00:33:34
flexible models. We can't have,
00:33:38
you know, models like linear models
00:33:40
that don't have enough capacity. We need
00:33:42
the number of parameters to grow a you
00:33:45
know in proportion to the amount of
00:33:46
data so those those models have to be
00:33:48
big. They have to be big because they
00:33:50
basically store the data in a different
00:33:53
form which allows the computers to take
00:33:55
good decisions alright. Now if we have
00:33:58
these big models with a lot of
00:33:59
parameters of course we have to train
00:34:02
them and we have to run them and for
00:34:04
that we need enough computing power.
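To make that concrete, here is a rough back-of-the-envelope sketch; the layer sizes are invented for illustration, not taken from the talk:

```python
# Rough sketch: counting the parameters of fully connected networks.
# Layer sizes below are invented for illustration only.

def mlp_param_count(layer_sizes):
    """Weights plus biases for each pair of adjacent layers."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix + bias vector
    return total

small = mlp_param_count([784, 100, 10])         # a tiny early-1990s-scale net
big = mlp_param_count([784, 4096, 4096, 1000])  # a 2016-scale net

print(small)  # 79510
print(big)    # 24093672
```

Every one of those parameters has to be touched on every training example, which is why bigger models and bigger datasets immediately translate into a need for more computing power.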
00:34:06
And you know, one of the reasons why
00:34:09
neural nets weren't so successful before
00:34:11
is because we didn't train them on enough
00:34:12
data, we didn't have big enough models
00:34:14
and we didn't have enough computing
00:34:15
power to train and use them. Now, just
00:34:19
having the first two ingredients is not
00:34:20
enough either because you could have
00:34:24
something like maybe an efficiently
00:34:26
implemented kernel machine, and it
00:34:28
would be able in principle to deal with
00:34:29
all three of these ingredients. However
00:34:32
there's something else that is
00:34:33
important. Uh I don't know if you heard
00:34:36
about the curse of dimensionality
00:34:37
We're trying to learn
00:34:39
from very high-dimensional data, and
00:34:42
in principle, you know, if we don't
00:34:44
make any assumptions about the data.
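To put a number on this (the dimensions are an arbitrary illustration): even for binary inputs, the number of possible configurations dwarfs any conceivable dataset:

```python
# Number of possible configurations of a binary input of dimension d.
d = 784                  # e.g. a 28x28 black-and-white image (arbitrary)
n_configs = 2 ** d       # roughly 10**236 configurations
dataset = 10 ** 9        # a generously large dataset

# No dataset comes close to covering the input space.
print(n_configs > dataset ** 20)  # True
```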
00:34:46
It's essentially impossible to learn
00:34:48
from that data; there are just too many
00:34:49
configurations that are possible. And
00:34:52
the only way around that is to make
00:34:55
sufficiently powerful assumptions about
00:34:57
the data. There's something called
00:35:00
the no-free-lunch theorem, and it says
00:35:00
you do need to make these kinds of
00:35:02
assumptions otherwise you won't be able
00:35:04
to learn something really complex.
00:35:08
Okay, so let me tell you about those
00:35:10
assumptions that are being made
00:35:11
specifically in deep learning. And they
00:35:14
have to do with the curse of
00:35:15
dimensionality this exponential number
00:35:17
of configurations that we have to deal
00:35:19
with we can't learn a separate
00:35:21
parameter for each configuration the
00:35:22
input because the number of such
00:35:24
configurations is like is huge is is
00:35:26
much more than the might data we can
00:35:27
ever like you know it's like the more
00:35:29
than the number of atoms in the
00:35:30
universe. So we we can't just learned
00:35:33
by heart everything gives or we need
00:35:34
some form of station. And besides the
00:35:37
smooth this is something which has been
00:35:39
very a successful and powerful machine
00:35:41
running we need other assumptions in in
00:35:44
in these big units we're putting in two
00:35:48
crucial additional assumptions which
00:35:50
have to do with compositionality, in
00:35:53
different forms of it right so we
00:35:55
already know compositionality is very
00:35:57
powerful and humans use it all the time
00:35:59
that's how we, you know, grasp the
00:36:02
world around us: we compose concepts, and
00:36:05
human language is basically an exercise
00:36:07
in composing ideas and meanings,
00:36:09
right. So in those neural nets we have
00:36:12
two forms of compositionality, one
00:36:15
which happens even with a single-layer
00:36:18
neural net. So every time you have
00:36:20
what we call distributed
00:36:21
representations, think about one
00:36:23
layer of a neural net, each neuron,
00:36:25
each artificial unit, is detecting a
00:36:28
a feature a concept and and these these
00:36:31
detectors are not mutually exclusive, so
00:36:33
the number of configurations that we
00:36:34
can capture grows exponentially with
00:36:37
the number of units. So a single-layer
00:36:38
neural net is in some sense
00:36:40
exponentially powerful in what it can
00:36:42
represent. Um so this idea of learning
00:36:45
features is our first form of
00:36:47
compositionality: an object is
00:36:49
described by the composition of, you
00:36:51
know, which features are activated,
00:36:53
which attributes are present in
00:36:55
this particular image. And because we
00:36:57
have many attributes and they can be on
00:36:59
or off or maybe some some grey level we
00:37:02
can have a very rich description that's
00:37:04
exponentially rich in some sense.
00:37:06
And there's actually more to it than the
00:37:10
words that I'm saying that there's a
00:37:11
lot of math behind this showing that
00:37:14
the fact that you have these
00:37:16
representations really buys you
00:37:18
something exponential in statistical
00:37:21
sense. So that's the first form of
00:37:24
compositionality. There's a second
00:37:27
form which is the one you get when you
00:37:29
you have many layers one on top of the other,
00:37:31
where each layer computes something as a
00:37:34
function of the output of the previous
00:37:35
layer, and here you also get an
00:37:38
exponential gain. So again we have
00:37:40
theory showing that by having more
00:37:43
composition, you know, of
00:37:46
layers on top of layers, you
00:37:48
can represent functions that are
00:37:49
exponentially richer in some sense. Um
00:37:53
so that's the other key
00:37:55
ingredient. And and of course these are
00:37:58
assumptions about the world. There's no
00:37:59
free lunch as I said these things work
00:38:02
because they fit well with the world in
00:38:04
which we live. If it wasn't the case
00:38:06
that the world was conveniently
00:38:08
describable in a compositional way, these
00:38:10
neural nets would not be working as
00:38:11
well as they are. To give you an example
00:38:14
of this, think about a neural
00:38:19
net where at some level of representation
00:38:22
you have different units that detect
00:38:23
different kinds of attributes. So let's
00:38:25
say the input is an image
00:38:26
of a person. So you can imagine that
00:38:29
you could have a unit that recognises
00:38:30
that the person wears glasses you could
00:38:32
have a unit that recognises that the
00:38:34
person is female, you can have
00:38:36
another unit that recognises the
00:38:37
person's a child and so on you can
00:38:39
imagine like you know a thousand such
00:38:40
units. And now, why is that
00:38:44
interesting? Because imagine you were to
00:38:48
try to learn these detectors, right. So
00:38:51
you have these thousand detectors, okay,
00:38:53
so if you have these
00:38:59
n equals a thousand detectors, a thousand
00:39:01
features. And if you want to learn them
00:39:04
separately you would need something
00:39:06
like, say, K parameters, because
00:39:10
you know, what are the
00:39:11
characteristics of the person that
00:39:12
wears glasses, and you can
00:39:14
imagine training a separate neural net
00:39:15
even for each of them. But we will do
00:39:17
better than that by sharing the layers
00:39:19
but in the worst case you could imagine
00:39:20
that if I had n features and each of
00:39:22
them requires sort of order of K
00:39:23
parameters, then in total I need
00:39:25
order of n times K examples to learn
00:39:29
about all these features. Now let's
00:39:32
consider the alternative, where we don't
00:39:34
use this representation: we use an
00:39:36
old-school nonparametric approach
00:39:40
which says well we gonna consider all
00:39:43
of the possible configurations of image
00:39:44
the images of persons. So now,
00:39:48
what I'm gonna do, say I'm using an SVM
00:39:52
or something like that, or a nearest-neighbour
00:39:54
approach: I'm gonna have to use an
00:39:57
example for each of the configurations
00:39:59
that I want to learn about so how many
00:40:00
configurations of the input can I get?
00:40:02
Well, potentially exponential in D, where D
00:40:06
is the input dimension: the number of
00:40:09
ways to split the data into all the
00:40:12
configurations of, you know, has
00:40:13
glasses, doesn't have glasses,
00:40:15
she's a female, she's not, she's a child,
00:40:18
she's not a child. So all of this is
00:40:20
basically an exponentially large set.
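The counting argument can be written out directly; n and K here are placeholder values, not numbers from the talk:

```python
# Learning n attribute detectors independently costs on the order of
# n * K examples; enumerating every joint configuration of n binary
# attributes costs on the order of 2**n. n and K are illustrative.
n = 1000  # attribute detectors: wears glasses, is female, is a child, ...
K = 50    # examples/parameters per detector

compositional = n * K  # grows linearly in the number of attributes
enumerative = 2 ** n   # grows exponentially

print(compositional)            # 50000
print(enumerative > 10 ** 300)  # True: astronomically larger
```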
00:40:23
And the good news is, if we
00:40:26
represent the data with this
00:40:28
representation, we can do it with
00:40:31
something that grows nicely with the
00:40:33
complexity of the task
00:40:36
rather than exponentially. So yeah we
00:40:40
actually we have formal papers about
00:40:44
this, characterising the number of
00:40:52
linear pieces that a neural net can
00:40:53
capture that has a single hidden layer
00:40:56
with n hidden units, and
00:40:58
essentially it can represent things that
00:41:00
look complicated, where you might
00:41:02
think you need an exponential number of
00:41:04
parameters in order to learn, but
00:41:07
because of the
00:41:10
compositionality that we're assuming about
00:41:12
the world, you can. Essentially,
00:41:15
the reason this works is that we can
00:41:17
assume that we can learn independently
00:41:19
about wearing glasses or about being
00:41:21
female versus male or being a child or
00:41:24
not, right. That's why this
00:41:26
composition works. If the detector
00:41:29
for glasses needed to know whether you
00:41:32
were female or child and so on in order
00:41:34
to do the detection of glasses then it
00:41:36
wouldn't work you would need as many
00:41:37
parameters as if you were doing an SVM
00:41:39
or a nearest-neighbour method. The fact
00:41:43
that you can learn about these
00:41:44
attributes kind of separately from each
00:41:46
other without having to know all of the
00:41:47
configurations of the other attributes
00:41:49
is the reason why this is working, okay.
00:41:54
And something similar has
00:41:58
been shown for depth, where you find that,
00:42:01
you know, some functions can be
00:42:04
represented very efficiently with a
00:42:06
deep net but if you wanted to represent
00:42:08
those functions with a shallow network
00:42:10
you might need a huge number of units.
00:42:12
In other words, there are functions out
00:42:14
there which really are naturally
00:42:15
expressed as a composition of many levels
00:42:19
of nonlinear transformations and if you
00:42:20
try to capture those functions using a
00:42:24
not sufficiently deep network, say a
00:42:26
shallow network with a single layer or two
00:42:28
layers, and that's not enough, then
00:42:30
would need an
00:42:32
exponential number of units, so an
00:42:33
exponential number of
00:42:34
parameters, and an exponential
00:42:37
number of examples to learn those
00:42:38
functions properly. Okay, so this
00:42:44
is a bit of the theory. There's
00:42:44
another important thing that happened
00:42:46
recently which is maybe as important
00:42:50
for many years researchers in machine
00:42:54
learning thought that neural nets couldn't
00:42:56
be really practical and useful because
00:42:59
training them involves a non-convex
00:43:02
optimisation problem which could have
00:43:04
many local minima so what I mean by
00:43:06
this is: especially if you are in
00:43:08
low dimension, you think about functions,
00:43:11
you're trying to optimise a function like
00:43:13
the total error with respect to the
00:43:16
parameters, and it might be very
00:43:17
bad: if you just do stochastic gradient
00:43:19
descent or any kind of local descent
00:43:21
algorithm rather than global optimisation, you might
00:43:22
get stuck in these local minima and
00:43:25
that's not the case for things like
00:43:27
kernel machines. So the question is
00:43:31
you know, is this a real
00:43:32
problem? Well, it turns out that it's not
00:43:34
a real problem. Uh at least there's a
00:43:36
lot of evidence that this myth that
00:43:38
training neural nets is riddled with
00:43:42
bad local minima is really a myth.
00:43:45
And what we found actually is that
00:43:48
it's especially true as we go from tiny
00:43:51
networks to large networks so the
00:43:53
really interesting thing is that the
00:43:54
larger the network that easier it is to
00:43:56
optimise. So you know you can you can
00:44:01
find really bad cases of optimisation
00:44:03
on very small nets with very few units,
00:44:05
but when you go to millions of
00:44:08
parameters or hundreds of millions of
00:44:09
parameters, there is a sort of
00:44:11
statistical effect that's happening
00:44:13
that really makes the optimisation much
00:44:15
easier. And we studied this using
00:44:20
an analysis of critical points, which
00:44:24
are the places where the the network
00:44:26
has zero derivatives. And it turns out
00:44:29
that for the most part the kind of
00:44:33
critical points that you could
00:44:34
encounter during training a neural net
00:44:35
are saddle points, meaning that
00:44:38
you're not stuck there: there are directions where
00:44:40
it looks like a local minimum but in
00:44:42
other directions it's actually going down.
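The geometry can be seen on the textbook saddle f(x, y) = x squared minus y squared; this is just the shape, not a real neural-net loss. Gradient descent started slightly off the critical point slides away along the descending direction instead of getting stuck:

```python
# f(x, y) = x**2 - y**2 has a critical point at (0, 0): it looks like a
# minimum along x but goes down along y. Plain gradient descent started
# slightly off the saddle escapes instead of getting stuck.
def grad(x, y):
    return 2 * x, -2 * y  # df/dx, df/dy

x, y = 0.5, 1e-3  # start near the saddle, slightly off the y axis
lr = 0.1
for _ in range(100):
    gx, gy = grad(x, y)
    x -= lr * gx
    y -= lr * gy

print(abs(x) < 1e-6)  # True: the x direction converged to 0
print(abs(y) > 1.0)   # True: it escaped along the descending y direction
```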
00:44:44
And so, you know, gradient descent will
00:44:46
just keep going and not get
00:44:49
stuck in those saddle points.
00:44:53
Yeah, so there is a lot to say about
00:44:56
this, but I see time flying, so let me tell
00:44:59
you a little bit about where
00:45:01
we're going now. Right, the
00:45:05
beginnings of neural nets were really
00:45:07
about pattern recognition, about the
00:45:09
kinds of things I've told you, where you
00:45:11
recognise objects in images. And of
00:45:15
course AI is much more than
00:45:17
pattern recognition. But what's
00:45:19
interesting is that in the last few
00:45:21
years there's been really a lot of
00:45:23
progress in moving neural nets toward
00:45:25
something that's more like a high level
00:45:27
cognition. Um there's been a lot of
00:45:31
work about attention in particular in
00:45:33
my lab and and in many other places
00:45:35
now. I'll tell you about that. So
00:45:38
attention is essentially inspired,
00:45:41
of course, by what we know about
00:45:42
humans. Instead of considering the whole of an
00:45:46
input or a big set of numbers as one
00:45:50
homogeneous block. Um so for example if
00:45:53
you think about a layer that is looking
00:45:56
at the lower layer instead of looking
00:45:57
at everything the network learns to
00:46:00
focus on parts of the input, or a layer
00:46:05
learns to focus on part of
00:46:08
its input. Another direction
00:46:14
that's really that's very very
00:46:15
promising is to look at reasoning
00:46:18
problems where instead of going from
00:46:21
input to output in one step, you
00:46:23
actually have a sequence of steps and
00:46:25
the number of steps could vary. At
00:46:27
each step we combine pieces of evidence
00:46:30
you know to to come up with a
00:46:31
conclusion this is really what
00:46:33
reasoning is about you combine say
00:46:35
different observations with different
00:46:37
things you know about the world, and you
00:46:40
you know combine them to find an
00:46:42
answer. So I'll I'll tell you a little
00:46:45
bit about this. About a year and a
00:46:49
half ago this started with the simple
00:46:51
memory networks and neural Turing machines.
00:46:53
And then another direction which is
00:46:55
related to this is everything that has
00:46:57
to do with planning. And reinforcement
00:46:59
learning and this is been exemplified
00:47:01
by the work of deep mind which has been
00:47:04
acquired by Google couple of years ago.
00:47:06
And their work on playing atari games
00:47:09
and more recently on the alpha go the
00:47:11
system that I mentioned at the
00:47:12
beginning, which beat the world
00:47:14
champion. But it's much more than
00:47:16
playing games it's about learning to
00:47:19
take decisions. And being able to learn
00:47:22
in a context where you don't
00:47:23
necessarily have the ability to have
00:47:26
labels or supervised learning at every
00:47:28
step. And then more recently this
00:47:30
this kind of research we're combining
00:47:32
deep learning with with reinforcement
00:47:35
learning has gone into robotics. So the
00:47:37
whole field of robotics, led in
00:47:38
particular by Berkeley, is moving
00:47:40
towards the use of deep learning. Let me
00:47:44
say a few words about attention. So
00:47:46
imagine a sequence of feature
00:47:52
vectors so you think of each of these
00:47:53
points as a vector we've been using
00:47:56
this for machine translation so each of
00:47:58
those would be a feature vector
00:48:01
extracted corresponding to a particular
00:48:05
place in an input sentence. So it
00:48:09
may contain semantic attributes
00:48:11
corresponding to the word at that
00:48:13
position, as well as words in the
00:48:14
neighbourhood right so this is a
00:48:17
sequence of feature vectors but you
00:48:19
know it could be any kind of space. And
00:48:22
we're gonna produce another sequence of
00:48:23
feature vectors. But instead of using
00:48:26
the usual fully connected
00:48:29
approach, which is kind of a static graph,
00:48:32
we're gonna make the relationship
00:48:33
between the first sequence and
00:48:35
the second sequence something more
00:48:37
dynamic, using an attention mechanism.
00:48:39
So what's the idea of the attention
00:48:40
mechanism? The idea is that when we
00:48:43
need to produce this feature vector,
00:48:46
instead of looking at all of these guys
00:48:48
we're gonna choose a few of them, maybe
00:48:50
mainly this one, and we're gonna use
00:48:53
that feature, and maybe a few
00:48:55
others, to compute the feature at the
00:48:58
next level. Right, so we're gonna focus
00:49:00
on a few elements in the input sequence
00:49:03
this is the crucial thing and you can
00:49:05
do it using what's called soft
00:49:07
attention or stochastic hard attention. We
00:49:10
work mostly with soft attention but we
00:49:12
have a paper where we also use stochastic
00:49:13
hard attention. So the idea of soft
00:49:15
attention is that instead of taking a
00:49:19
yes no decision about which features we
00:49:21
gonna be looking at which element in
00:49:22
the set here we're gonna be looking at,
00:49:23
we compute some soft weights
00:49:27
that sum to one over all the elements
00:49:29
here in order to decide you know how
00:49:31
much attention we gonna give to each of
00:49:32
them. And those soft weights are gonna be
00:49:34
computed by a little attention neural
00:49:37
net, a little MLP here, that takes the
00:49:41
the contexts at the upper level here
00:49:44
and the features at the lower level and
00:49:46
basically decides if it's a good match, you
00:49:48
know should we use this guy as input
00:49:50
for the next one here, and it outputs a
00:49:52
score for each of the possible
00:49:54
positions, using
00:49:55
the corresponding input features.
00:49:57
So because these weights are just
00:50:01
part of a soft,
00:50:04
differentiable computation, you can
00:50:06
learn to put attention in the right
00:50:08
place and it does learn to do that. And
00:50:10
in fact it's thanks to this attention
00:50:12
mechanism that we reached the state of
00:50:13
the art in machine translation in
00:50:16
the last year, in two thousand fifteen.
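Here is a minimal sketch of soft attention: toy vectors, with a simple dot-product score standing in for the little MLP described above.

```python
import math

def soft_attention(query, keys, values):
    """Weighted sum of values; weights are a softmax over query-key scores."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]  # sum to one over the input positions
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return weights, context

# Three input positions; the query matches the second one best.
keys = [[1.0, 0.0], [0.0, 4.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, context = soft_attention([0.0, 1.0], keys, values)

print(abs(sum(weights) - 1.0) < 1e-9)  # True
print(weights[1] == max(weights))      # True: attends mostly to position 2
```

Because the weights come out of a softmax, everything stays differentiable, which is what lets the network learn where to attend by ordinary gradient descent.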
00:50:17
So yeah, we basically use the
00:50:23
architecture I showed before to process
00:50:28
input sentences extract those features
00:50:32
from them using a form of recurrent
00:50:35
net, actually a bidirectional
00:50:37
recurrent net. And then, let me
00:50:40
show you that picture again you can
00:50:42
think of it like we've extracted
00:50:44
semantic features from the whole
00:50:46
sentence or you think about even
00:50:48
reading a whole book, right, and each of
00:50:50
these is a feature vector extracted at each
00:50:51
position, each word, in the book
00:50:53
here. And now we can produce a word at
00:50:56
a time in the translated book. And so
00:50:59
each time we produce the next word in
00:51:01
the translated book we decide which
00:51:03
word, or which few words, in the
00:51:06
source book we should be looking
00:51:08
at. And this works quite well, in
00:51:11
comparison to a technique that had been
00:51:12
tried before, along
00:51:15
with our colleagues at Google, where
00:51:17
you read the whole book, you come up
00:51:19
with that kind of semantic
00:51:22
representation of the whole book and
00:51:23
then you feed that into another
00:51:25
recurrent net which produces the the
00:51:27
words in the translated book and that
00:51:29
doesn't work as well, because it's hard to
00:51:30
compress that much information into a
00:51:32
fixed-size vector. But by allowing
00:51:35
the network to decide, at each
00:51:38
point in producing the output sequence
00:51:40
where to look, it works very well,
00:51:42
and so we won a couple of the WMT
00:51:48
challenges, the yearly
00:51:49
competition for machine translation,
00:51:52
using these neural machine translation
00:51:53
systems. And more recently our
00:51:56
colleagues at Stanford have been
00:51:59
using this on other datasets and
00:52:01
benchmarks and obtained even stronger
00:52:03
improvements, and now there's
00:52:05
a whole cottage industry to improve
00:52:07
these neural machine translation
00:52:08
systems, and they're essentially
00:52:11
leading in machine translation
00:52:14
right now. One thing you can do
00:52:17
with attention that's quite cool as
00:52:19
well is combining the things we've done
00:52:23
in computer vision with things we've
00:52:26
learned with modelling language. So in
00:52:31
this work, what we've done is we trained
00:52:34
a neural net, a convolutional net, that
00:52:36
extracts features from the image, and
00:52:38
then we use an attention mechanism to
00:52:41
decide to produce one word at a time in
00:52:43
the sentence that's supposed to be a
00:52:44
description of the image so the
00:52:46
computer reads the image and produces a
00:52:47
sentence stochastically: it outputs a
00:52:50
probability for the next word and then
00:52:52
we sample that word and produce
00:52:54
a probability for the next word, and so on.
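The word-by-word sampling loop can be sketched like this; the conditional distributions are a made-up stand-in for what the trained network would output given the image and the history:

```python
import random

# Toy stand-in for the decoder: maps the previous word to a probability
# distribution over the next word. A real model conditions on the image
# and the whole history via a recurrent net.
next_word_dist = {
    "<start>": {"a": 1.0},
    "a": {"woman": 0.7, "dog": 0.3},
    "woman": {"is": 1.0},
    "dog": {"is": 1.0},
    "is": {"throwing": 0.6, "standing": 0.4},
    "throwing": {"<end>": 1.0},
    "standing": {"<end>": 1.0},
}

def sample_caption(rng):
    """Sample one word at a time until the end token is drawn."""
    word, caption = "<start>", []
    while word != "<end>":
        dist = next_word_dist[word]
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        if word != "<end>":
            caption.append(word)
    return caption

print(sample_caption(random.Random(0)))
```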
00:52:56
And so it sees this image and says a
00:52:57
woman is throwing a frisbee in the park
00:53:00
and it does it using attention, so each
00:53:01
time it produces a word in the output
00:53:03
sequence it chooses where to look in the
00:53:05
input so here when it says frisbee it's
00:53:07
looking in this region where there's a
00:53:09
frisbee. Um so just a few years ago
00:53:11
somebody would have told me well we're
00:53:12
gonna train a neural net that, you know,
00:53:14
looks at an image and produces a natural
00:53:15
language sentence that describes it I
00:53:18
would have said, nah, it's gonna take, you
00:53:19
know at least ten years. Uh but it's
00:53:22
there, and you know,
00:53:23
this is a more-than-a-year-old result,
00:53:26
and now, you know, people are
00:53:30
doing even better than that. So let
00:53:33
me show you more of these examples of
00:53:35
the computer looks at this and it says
00:53:37
a dog is standing on the hardwood floor
00:53:39
and when it says dog it's looking at the
00:53:41
face of the dog. It looks at this image
00:53:43
and it says a stop sign is on the road
00:53:45
with the mountain in the background.
00:53:46
And when it says stop sign,
00:53:49
you know where it's looking: it's
00:53:52
looking at the stop sign. Now let me show
00:53:55
you something that our colleagues at
00:53:57
Facebook did, using something
00:53:59
similar but now instead of producing a
00:54:03
sentence, it answers questions. Is
00:54:07
there a baby? Yeah. What is the man doing?
00:54:12
[inaudible] Is the baby sitting on his lap?
00:54:17
Yeah. Are they smiling? Yeah. Is there a
00:54:27
baby in the photo? Yeah. Where is the
00:54:32
baby standing? [inaudible] What is the baby
00:54:37
doing? [inaudible] What game is being
00:54:44
played? Soccer. Is someone kicking the ball?
00:54:49
Yeah. What colour is the ball? Yellow.
00:54:56
What is the dog playing? [inaudible] What colour
00:55:05
is the dog? Black. Is the dog wearing a
00:55:11
collar? Yeah. What is the cat sniffing?
00:55:17
[inaudible] Where is the cat? [inaudible] What
00:55:25
colour is the cat? Black and white. What
00:55:30
colour are the bananas? [inaudible] Okay, now you have
00:55:39
to beware: this is a demo made by
00:55:42
Facebook. So, I mean, I think this
00:55:47
is real but they probably selected
00:55:49
cases where it works better
00:55:52
nonetheless. This is really impressive
00:55:55
let me tell you a little bit about
00:55:59
what's behind the scenes in addition to
00:56:01
the mechanisms I've been telling you about,
00:56:02
and essentially it's using this
00:56:07
attention mechanism idea not just
00:56:11
to focus on a particular part of the
00:56:14
input but to focus on a particular part
00:56:19
of memory. So the idea here is to
00:56:25
separate the main computation which
00:56:27
would be done by recurrent network
00:56:28
typically from a memory which you can
00:56:33
think of like a computer memory where
00:56:35
you would have a vector at each address
00:56:38
and these vectors could be
00:56:40
long like think of these as the word
00:56:41
embeddings, so they might be like two
00:56:43
hundred dimensional something like
00:56:44
that. And now the recurrent net
00:56:49
can of course read from the external
00:56:51
world and produce outputs and answers.
00:56:54
But it can also do internal actions. So
00:56:58
the internal actions here would be
00:57:00
things like reading at a particular
00:57:02
place or writing at a particular place.
00:57:06
Now instead of taking a hard decision
00:57:08
about where to read and where to write.
00:57:11
And what to write it takes soft
00:57:14
versions of these decisions. So it
00:57:16
computes a score for each address,
00:57:21
and those scores, with the softmax,
00:57:23
would sum to one, telling it, really,
00:57:25
where it wants to read. And what it's
00:57:28
gonna do is, like we did for
00:57:30
the attention mechanism, it's gonna take those
00:57:32
weights and make a linear combination
00:57:36
of what it's reading. So we take the
00:57:38
contents everywhere weighted by those
00:57:41
scores that sum to one, you know, to
00:57:44
actually get the information from the
00:57:46
memory into the recurrent net. So it's
00:57:47
reading with a focus of attention on a
00:57:50
few places and you can do the same
00:57:52
thing for the writing. Yes, so you can
00:57:58
use these kinds of systems to do things
00:58:00
like read a little story, like: Sam
00:58:02
walks into the kitchen, Sam picks up an
00:58:04
apple, Sam walks into the bedroom and
00:58:06
drops the apple. And then the question: where
00:58:08
is the apple so the computer reads all
00:58:10
of these things including the question.
00:58:12
And it knows this is the question,
00:58:14
maybe because there is a special marker
00:58:16
in it, and it's supposed to answer, or something
00:58:17
like that or just like we had it in the
00:58:20
demo, except that in the demo, instead of
00:58:23
the text here, we had an image, but it's
00:58:25
exactly the same mechanism, alright.
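Here is a sketch of that soft read; the memory contents are toy values, and in a real memory network both the stored vectors and the addressing are learned:

```python
import math

def soft_read(key, memory):
    """Attention-weighted read: softmax over key/slot scores, then a
    weighted average of all slots instead of a hard table lookup."""
    scores = [sum(k * m for k, m in zip(key, slot)) for slot in memory]
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]  # one weight per memory address
    return [sum(w * slot[i] for w, slot in zip(weights, memory))
            for i in range(len(memory[0]))]

# Three memory slots; the key matches slot 1 most strongly.
memory = [[1.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 1.0]]
read = soft_read([0.0, 1.0, 0.0], memory)

print(read[1] > read[0] and read[1] > read[2])  # True: mostly slot 1
```

Because the read is a weighted average rather than a hard choice, the whole addressing step stays differentiable and can be trained end to end.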
00:58:31
So I'm gonna close here this is a
00:58:33
picture of my group in Montreal,
00:58:37
Montreal representing algorithms.
00:58:39
And we are always recruiting. Thank
00:58:44
you. So I guess it's time for the break
00:58:59
I'll be here of course for the panel later
00:59:02
so if you have questions, we can
00:59:04
answer them in the panel, and also tomorrow I'll
00:59:07
be giving another lecture and I'll
00:59:11
leave more time for questions during
00:59:13
the lecture, so, you know, you can
00:59:14
keep your questions a little bit for
00:59:17
later today or tomorrow; we can have the
00:59:19
questions then, so we have time.

Conference program

Deep Supervised Learning of Representations
Yoshua Bengio, University of Montreal, Canada
4 July 2016 · 2:01 p.m.
Hardware & software update from NVIDIA, Enabling Deep Learning
Alison B Lowndes, NVIDIA
4 July 2016 · 3:20 p.m.
Day 1 - Questions and Answers
Panel
4 July 2016 · 4:16 p.m.
Torch 1
Soumith Chintala, Facebook
5 July 2016 · 10:02 a.m.
Torch 2
Soumith Chintala, Facebook
5 July 2016 · 11:21 a.m.
Deep Generative Models
Yoshua Bengio, University of Montreal, Canada
5 July 2016 · 1:59 p.m.
Torch 3
Soumith Chintala, Facebook
5 July 2016 · 3:28 p.m.
Day 2 - Questions and Answers
Panel
5 July 2016 · 4:21 p.m.
TensorFlow 1
Mihaela Rosca, Google
6 July 2016 · 10 a.m.
TensorFlow 2
Mihaela Rosca, Google
6 July 2016 · 11:19 a.m.
TensorFlow 3 and Day 3 Questions and Answers session
Mihaela Rosca, Google
6 July 2016 · 3:21 p.m.
