We'll start with a short word of welcome, and then we will have the first talk by Yoshua. So, first...

Okay, well, it has to be very, very short. I will be very short, I promise, for once, as you were saying. Well, welcome to Switzerland, and to the sunny Valais in particular, the best place, and one of the less polluted ones. I hope that besides this great workshop you will have time to enjoy the region; it is really worth spending time here, hiking around or whatever you like. Anyway, I would first like to thank the organisers of this great workshop, and all the people who kindly accepted to come and give lectures for us.

There were two reasons to have this workshop this year. The first one is that it is one of the twenty-fifth anniversary events that Idiap is organising this year, one among many others that we will have basically every month. The second one is that, as we all know, and that is why we are all here, there is a big revival, or whatever we want to call it, well, progress, in machine learning and deep learning and neural networks, as opposed to the shallow learning people were doing in the past. And Idiap has grown into a pretty large institute of about a hundred and twenty people, fifteen or twenty startups, and many more to come in the future, all working around advanced signal processing and machine learning applied to many problems: speech processing, computer vision, biometrics, bio-imaging, human behaviour understanding, and so on. Everything we are doing sounds very fancy and very complicated, but what we like about it is that we all share the same tools, which are basically tools coming from signal processing and machine learning, so any progress in this area, regarding software or regarding hardware, matters to all of us. And that is something that is unique to this workshop, I believe: I think it is one of the first times that I know of where we have people coming from the hardware side, who do not always agree with each other, because we are talking about CPUs, GPUs and many other architectures, together with people from the software side. So this will be a great place to exchange ideas about the future and about what we can offer the community at large in areas like the ones I just mentioned. So again, thank you a lot to everyone who organised this, and I will now hand over for the introduction of the first speaker.
Thank you. Okay, so, as was just said, and as you probably know because you are here, there is a strong revival of machine learning going on. I think we all agree that it is due to a mix of progress, both on the theory side and on the engineering side, especially software frameworks and hardware. So I think it is nice to have speakers from those different sides, because they cover aspects that are all of interest for machine learning. I also think it is a nice event to host here, because it fits very well with what we do, which is to work at the interface between machine learning and engineering, and to connect these concepts to industry.

So it is with great pleasure that I introduce our first speaker, Yoshua Bengio. Yoshua, as you obviously know, is one of the key figures of machine learning and of the deep learning revival. He wrote two books on the subject, the latest of which is in press with MIT Press.

Thanks. Can you guys hear me well?
Yes? Good. So today I'll talk about supervised learning, and I'm just going to scratch the surface: one hour is really not enough to do justice to this field. Tomorrow I'll talk more about unsupervised learning.

As you know, cars are starting to drive themselves, we're starting to talk to our phones and they're starting to say something back, and computers are now able to beat the world champion at the game of Go, which was not expected to be something computers would be able to do for decades. All of this is essentially due to progress in machine learning, in an area called deep learning, which is essentially a renewal of neural nets. I think it's a lot more than these little things I mentioned; it's a whole new economic revolution that is coming, with progress in AI currently spearheaded by these techniques. And that's why so many companies are jumping into this: this is a display from about two years ago, and it's much bigger and much more crowded now.

So let me tell you about deep learning, what it is. The general idea is that we want to learn from data, so these are machine learning algorithms. What is particular is that we are going to learn representations of the data, and we are going to learn multiple levels of representation of the data; that's really what deep learning is about. Why would that be interesting? Because these multiple levels of representation are supposed to, and effectively seem to, capture different degrees of abstraction: as you go deeper, you tend to be able to capture more abstract concepts. And this has worked out really well. It started with speech recognition, object recognition and detection, and more recently there has been a lot of progress in natural language understanding and processing, machine translation, and things like that.

This ability to train neural nets that are deep, that have more than a couple of hidden layers, is something that really happened around 2006, thanks to funding from CIFAR, a Canadian organisation that makes pretty long bets on ambitious research projects; it included Geoff Hinton in Toronto, my lab in Montreal and Yann LeCun's in New York. That first breakthrough essentially allowed us to use unsupervised learning algorithms, which already existed, to bootstrap the training of a deep supervised neural net. Then another advance, which I think is really important and which people are not always aware of, happened in 2011, when we found that we actually didn't need this unsupervised pre-training trick: if we just replace the nonlinearity that people tended to use, the hyperbolic tangent or the sigmoid, with the rectifier, we are able to train very deep supervised networks. And just the year after that, our colleagues from Toronto used this trick, along with other new tricks like dropout, to obtain a really big breakthrough in terms of performance on object recognition; I'll tell you more about that later.
all let me step back a little bit about
what a ideas about a I is about
building machines that can take
decisions the decisions. And for a
computer or even a human or animal to
take the decisions that entity needs
knowledge right you get intelligence by
having knowledge however so this is of
course well known and for decades what
what's happened in the I research is
we've tried to give that knowledge to
computers explicitly by you know giving
the computer rules and facts. And the
and P.'s the program but it failed and
it failed because a lot of the very
important knowledge that we happen and
be using you to understand the world
around this isn't something we can
communicate in in the language or in
programs it's things we know but we
can't explain this is essentially
intuition. And that's where machine
running comes because we need to get
that knowledge to computers we can't
tell them exactly what that knowledge
is like how to recognise the face or
chair but we can show examples of
computer and that's how computers have
been able to learn that kind of
intuitive knowledge you know a a good
example of this is the the gonna go I
was telling you before you can ask
expertly why did you make that play and
it will invent a story but really the
story is very incomplete and and
students wouldn't be able to just use
that story in order to plea as well as
the master by far. So the the the the
the expert player has this intuitive
knowledge about with the right thing to
do but we can't really explain it. Um
what are we can take games played by
these high level experts and bootstrap
a you on that that learns to capture
these intuitions implicitly I love this
machine learning power to the data yes
right. So another really important
thing to understand about you know why
is machine learning working in the
first place these days so well is that
it relies on optimisation relies on
defining what it is that you want the
machine to learn is a a function like
an error function that we can just
optimised and and the way we optimise
it is actually incredibly simple
compared to the very sophisticated you
know things have been done in the
position we just do a very small
changes the time we show one example at
a time or a small batch of example
recall many batch but that's technical
detail we should one example at a time
and then we we we look at the error
that the computer is making about
example like you're supposed to save
the car. Um and and we gonna make a
very small change of the parameters
inside the box that define what what is
the mapping from input output. So that
the that nothing produce something
slightly better I'll tell you a lot
more about backdrop later but that's
the idea of a good compute what is the
small change or we can do to the neural
net parameters. So that there will be
slightly smaller next time and we
repeat that hundreds of millions of
times and the thing recognises cards
spaces and desks and and so on. And one
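As a minimal sketch of that loop, here is stochastic gradient descent in NumPy on a deliberately tiny stand-in model (a linear predictor with squared error; the data, model size and step size are placeholders chosen only so the loop runs end to end, not what the talk uses):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: inputs X and targets y (a noisy linear mapping stands in for real data).
X = rng.normal(size=(1000, 10))
true_w = rng.normal(size=10)
y = X @ true_w + 0.1 * rng.normal(size=1000)

# Parameters of a deliberately simple model (a linear predictor).
w = np.zeros(10)
learning_rate = 0.01
batch_size = 32

for step in range(5000):
    # Pick a small "minibatch" of examples.
    idx = rng.integers(0, len(X), size=batch_size)
    xb, yb = X[idx], y[idx]

    # Measure the error the model makes on these examples...
    pred = xb @ w
    err = pred - yb                      # derivative of 0.5*err^2 w.r.t. pred

    # ...and make a very small change to the parameters that reduces it.
    grad_w = xb.T @ err / batch_size
    w -= learning_rate * grad_w

print(np.max(np.abs(w - true_w)))        # should be close to zero
```

Repeating that tiny update many times is all the "optimisation" that is going on; the deep nets discussed next just replace the linear predictor with a much bigger parameterised box.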
One of the first areas where this breakthrough of using deep nets really made a difference is speech recognition. It started around 2010, and what we see in the graph is what happened over the years on a particular benchmark. This is a cartoon; the real picture of course has lots of ups and downs. What we see is that in the nineties things were progressing quite well using HMMs, which were the standard of the day, and in the 2000s, even though we had more data and faster computers, performance somehow did not improve that much, until these deep neural nets started being used, and there was a big drop in error rates. Over a matter of a few years the whole area of speech recognition turned to using these things, and all the industrial systems are now based on these deep neural nets.

Then, lagging by a couple of years, something similar happened in computer vision. It started with object recognition, where, given an image, you have to say which objects are present: is there a dog, is there a chair, is there a person? The task that really got this going is ImageNet, where you have a thousand categories; you are given an image and you are supposed to say which of the categories is present. In the last few years, from 2012 to 2015, not only did the performance improve very fast thanks to these deep convolutional nets, but we essentially reached human-level performance on this task. That said, this is true for reasonably nice images, and humans are still better when the recognition is harder, but the progress has been really amazing, and getting these into products is now almost an industrial concern. I'm actually going to run a video from former colleagues who started a company a few years ago and have been recruited by NVIDIA, and they use these convolutional nets to train their system. [video plays] Right, so this, and the other things deep learning is bringing, are really going to change our world.

Okay, but now, for the rest of my presentation, I'm going to go a bit more into the technical part. And I'm going to start by telling you about the workhorse of the progress we've had in the last years, which is just good old backprop, going back to the eighties or the late seventies depending on how you want to look at it. It is based on very, very simple ideas that are really important to understand, in order to even debug the things that you will be playing with in the next few days.
So remember, I said that we want to compute the small changes that need to be made to those neural net parameters so that the network performs slightly better next time. This just happens to be a gradient: the partial derivative of the error, the loss function we want to optimise, with respect to the parameters. So how are we going to compute these partial derivatives through this very complicated machine, this deep neural net? We are going to use the good old chain rule, which tells us how to compute derivatives through composition. If x influences y through a function g, and y influences z through a function f, so that z = f(g(x)), we can get the derivative of the final answer with respect to the input by just multiplying the two partial derivatives along the way: dz/dx = (dz/dy)(dy/dx). In our case, the x we care about is going to be some parameter, and the z we care about is going to be the error we are making on an example, the loss we want to minimise.

Now, the thing that is great about backprop is that if the amount of computation you need to compute the loss, as a function of the example and the parameters, is on the order of N, where N depends on the number of parameters or something like that, then computing the gradient, the derivative of the loss with respect to all the parameters, is also O(N). So if you can compute something efficiently, you can also compute its gradient efficiently.

So we start with this x, which is a variable like a parameter, and we do transformations; this is just a graphical view of what I told you, and we have these partial derivatives along the way. The chain rule tells us that if I make a small change delta-x, it becomes a small change delta-y, obtained by taking delta-x and multiplying it by the partial derivative dy/dx; in the same way, a delta-y transforms into a delta-z by multiplying delta-y by dz/dy. So if I take a delta-x and plug it into this, I get that the small change delta-x becomes a delta-z by multiplying by dz/dy times dy/dx. This is basically what happens; this is the chain rule, this is how it comes up.

That was the simple scenario where x goes directly to z, but maybe there are different paths: x influences y1, and it also influences y2; for example, these might be two neurons in a neural net, z is your loss, and x is some parameter. Now it turns out that the chain rule just changes a little bit: we add the products along each of the paths. We have these partial derivatives along the paths, and we do this one times this one, plus this one times this one. Of course we can generalise this to n paths, and we get this equation, which says essentially that for a node x we look at the partial derivatives of the loss with respect to its successors y1 through yn, and multiply each by the partial derivative along the path: dz/dx = sum over i of (dz/dyi)(dyi/dx). So that's a very simple formula, and that's what you have at the heart of things like Torch or TensorFlow.

Of course you can generalise this to an arbitrary graph of computation. What these packages do is create a data structure which represents a graph of computation, or flow graph, where each node corresponds to a result; usually those nodes won't be scalars, they will actually be tensors, matrices, vectors or higher-order objects, but the principle is the same. Once we have built that graph, we can either compute it forward, or compute derivatives in a recursive way, by saying: the derivative of the final loss with respect to some node x in the middle can be obtained recursively by looking at the already computed partial derivatives dz/dyi for each of the successors yi of x in the graph, times the partial derivative along the arc, dyi/dx, that is, how this node influences each of the next nodes, and how each of those nodes influences the loss, which we have already computed recursively, because it is of the same form, dz/d-something, where the something is any of the nodes. Of course, to make that work we have to do it in the proper order: we first need to compute the derivatives with respect to the y's before we compute the derivative with respect to x. Okay, so that's essentially backprop.
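A small sketch of that recursion in plain Python, on a made-up two-path graph (a = x*y, b = sin x, z = a + b), so the sum-over-paths form of the chain rule is visible:

```python
import math

# Forward pass through a tiny flow graph:
#   a = x * y        (x influences z through a ...)
#   b = sin(x)       (... and through b: two paths, so the chain rule sums them)
#   z = a + b        (z plays the role of the loss)
x, y = 2.0, 3.0
a = x * y
b = math.sin(x)
z = a + b

# Backward pass: visit nodes in reverse order, each combining the
# already-computed dz/d(successor) with the local partial derivative.
dz_dz = 1.0
dz_da = dz_dz * 1.0                       # z = a + b  ->  dz/da = 1
dz_db = dz_dz * 1.0                       # z = a + b  ->  dz/db = 1
dz_dx = dz_da * y + dz_db * math.cos(x)   # sum over paths x->a->z and x->b->z
dz_dy = dz_da * x

# Check against the analytic derivatives.
assert abs(dz_dx - (y + math.cos(x))) < 1e-12
assert abs(dz_dy - x) < 1e-12
print(dz_dx, dz_dy)
```

Libraries like Torch or TensorFlow do exactly this bookkeeping automatically over graphs with millions of tensor-valued nodes.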
You can apply it to a multilayer network. For example, here is a simple architecture where we might output a vector of probabilities over categories, a typical thing we do, and our loss might be the so-called negative log-likelihood, which is minus the log of the probability given to the correct class: one of these outputs is the correct class, we want that probability to be as high as possible, so we take minus the log of it and we minimise that. Of course that loss depends both on the outputs and on the correct answer y, because we use this correct answer, which is an integer here, to tell us which of the output probabilities we want to maximise. Once we have computed that loss, we can go backwards using the same principles I just showed: we compute the derivative with respect to the output units, then, applying the recursion, we get the derivatives with respect to the previous layer as well as with respect to the weights going into that layer, and similarly we can go down again, backprop one more step, and get the derivative with respect to the weights below, and so on, just as I explained. I'll make my slides available so you can look at this more carefully.
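Here is a sketch of that forward and backward pass for a one-hidden-layer net with a softmax output and negative log-likelihood loss, in NumPy; the layer sizes, the rectifier nonlinearity and the random inputs are arbitrary choices made only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# One training example: input x and correct class index y.
x = rng.normal(size=4)
y = 2                                    # integer label of the correct class

# Parameters of a one-hidden-layer net (sizes are arbitrary).
W1, b1 = rng.normal(size=(5, 4)) * 0.1, np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)) * 0.1, np.zeros(3)

# Forward pass.
h_pre = W1 @ x + b1
h = np.maximum(0.0, h_pre)               # rectifier nonlinearity
logits = W2 @ h + b2
p = np.exp(logits - logits.max())
p /= p.sum()                             # output probabilities over classes
loss = -np.log(p[y])                     # negative log-likelihood of correct class

# Backward pass (back-propagation).
d_logits = p.copy()
d_logits[y] -= 1.0                       # d loss / d logits for softmax + NLL
dW2 = np.outer(d_logits, h)
db2 = d_logits
d_h = W2.T @ d_logits
d_h_pre = d_h * (h_pre > 0)              # gradient through the rectifier
dW1 = np.outer(d_h_pre, x)
db1 = d_h_pre

print(loss, dW1.shape, dW2.shape)
```

The gradients dW1, dW2, db1, db2 are exactly the "small changes" fed to the stochastic gradient descent loop shown earlier.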
Once you understand that, you can apply it to a graph of any structure. You can of course generalise to graphs that are dynamically constructed, as in a recurrent network: instead of having a fixed-size graph, the graph has the form of a chain, and depending on the number of inputs, the x's, the graph will be longer, to accommodate reading all of these inputs and computing some internal state, corresponding to your neurons, that summarises everything that has been seen before, in a way that captures what is needed for whatever computation follows. You could also generalise to graphs that are trees. What is particular with these recurrent and recursive architectures is that instead of having a fixed structure, the graph depends on the particular data you have, like the length of the sequence, or the tree that is built on top of a sentence: the graph is constructed dynamically. For this to make sense, what you need is that the same parameters are reused: we do not have a separate set of weights for each time step, we have the same weights used at the different time steps, so if I have a longer sequence I can just extend the graph and it will be the same parameters throughout. That is how we can generalise to different lengths, or to different trees in the case of recursive networks.
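A minimal sketch of that weight sharing in NumPy (a plain recurrent net with a tanh nonlinearity; the sizes are arbitrary): the same parameters are reused at every time step, so the same code handles sequences of any length.

```python
import numpy as np

rng = np.random.default_rng(0)

state_size, input_size = 8, 3
# One set of parameters, reused at every time step.
W = rng.normal(size=(state_size, state_size)) * 0.1
U = rng.normal(size=(state_size, input_size)) * 0.1
b = np.zeros(state_size)

def run_rnn(inputs):
    """Unroll the graph along the sequence; its length depends on the data."""
    h = np.zeros(state_size)             # internal state summarising the past
    for x_t in inputs:                   # same W, U, b at every step
        h = np.tanh(W @ h + U @ x_t + b)
    return h

short_seq = rng.normal(size=(5, input_size))
long_seq = rng.normal(size=(50, input_size))
print(run_rnn(short_seq).shape, run_rnn(long_seq).shape)  # same parameters, different lengths
```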
Alright, so this was a very, very brief intro to backprop; you can read a lot more in my book, which is available online. Now I'm going to move on to why it works, and this is going to be very brief and high level; you will probably want to delve deeper by yourself to make more sense of it, but I'll try to give you the basic concepts. The previous part was the "how", in a very focused way; now let's try to see why it works, and what is new with these deep networks.

In the early days of AI, what people were trying to do was build systems that go directly from input to output through hand-designed programs. All the knowledge, as I said at the beginning, was put into the machine directly from the brains of the experts, or from the brain of the programmer, into a program; maybe the program had an explicit set of facts and rules, but that is how it was done, and there was no learning. Then a lot of the work in classical machine learning in the eighties and nineties was based on introducing some learning, in particular starting from a lot of hand-designed features, crafted based on knowledge of what the input is supposed to be and what kind of invariances we are looking for, and then transforming those features through a learned mapping, often just a linear mapping or a kernel machine, that produces the output we want.

What happens with neural nets is that we look inside this box and think of it as a composition of multiple transformations, and once you start thinking about this, you have something in the middle, between two sets of transformations. You can call the first transformation "extracting features", but now the features are learned, and the second part might again be a linear classifier. But the thing in the middle is something new: it is a representation that the computer has learned. The really important concept in deep learning is that we are learning representations, and not just any kind of representations, as I'll try to explain a bit later; that's the really crucial thing. Deep learning is then just taking this idea of learning representations and saying: okay, we are going to learn multiple levels of representation. Here I just have two levels of representation, this one and this one. The output is in a sense also a representation, but its meaning is fixed, it depends on the semantics of what we are trying to predict; and of course you have a representation at the input, but its meaning is also fixed. The meaning of these intermediate representations, on the other hand, is something discovered by the computer, something the computer makes up in order to do the job.
So, for example, around 2009 our friends at Stanford looked into a deep neural net, trying to figure out what kinds of representations it was learning. This neural net was trained on images of faces, and they found that at the lowest level, the first layer, the units acted as edge detectors. That is not a new observation: it has been shown with many machine learning models, including neural nets, that the natural set of features for images is these edge detectors, which respond to oriented contrasts of a particular size and position, and so on; but the network kind of discovers them all by itself. More interesting is what you see if you look at the second level and at what those units like: each little square in these pictures represents the kind of input that a particular unit in the neural net prefers. You see that it has things like parts of the face, eyes and noses and other facial parts, which you can think of as compositions of some of the first-level detectors. So the units at one level compose together: a unit computes a nonlinear function of the lower-level outputs, so it takes the representations at one level and computes a new kind of representation, and here it seems to be discovering detectors for parts that can be combined to form higher-level parts, or full objects like these faces.

So why is this idea working? There is actually no free lunch anywhere in machine learning, and not in deep learning either. The reason deep learning is working is that somehow this idea of composing pieces together, composing functions at multiple levels, is natural: it is something that fits well with how the world is organised.
If you are trying to do image recognition, there is a kind of natural hierarchy of concepts, starting from pixels to edges to little texture motifs, and then parts and objects. If you are modelling text, characters combine into words, words combine into word groups or phrases, which go into clauses and sentences and stories; we don't really know what the right concepts should be higher up, but we can imagine that there are some high-level abstractions that make sense for the particular domain. In speech, of course, we go from the acoustic samples to some spectral features, to sounds, phones and phonemes, to words, and on to language models. Essentially, all of these deep networks are obtained by taking the raw data and transforming it through feature extraction at different levels, each one more abstract than the levels below.

Now, this is similar to what I showed you before, but for colour images, so you see not just these edges but also some lower-frequency edges where colour becomes important. Higher up you see detectors that capture funny shapes made by composing the lower-level ones, and even higher up these funny shapes actually start looking like parts of objects: maybe this one is part of some particular object, and maybe this is like the face of a bird or something.
Now, when you really start thinking about machine learning in terms of representations, you can play all kinds of really interesting games, and I'll only give you a glimpse here, with this fairly old work from my brother Samy and his collaborators at Google, among them Jason Weston, I think before he was at Google, but they worked together on the idea of learning representations for both text, more precisely short text queries like "Eiffel Tower", and for images, and making those representations live in the same space. So what is going on is that there is going to be a learned transformation. You can think of a neural net that takes the image and outputs, say, a hundred numbers; here it is visualised in 2D, but initially these were hundred-dimensional vector spaces, and today they are more like two thousand dimensions, but it is the same idea. So you have one function that maps images to that space, and you have another function, which you also learn, that maps these queries, which you can think of as symbols, into the same space. Actually, the mapping between those symbols and a vector is just a table lookup: you have a table for, say, the million most frequent queries, and the table tells you what the hundred-dimensional vector corresponding to "dolphin" or "Eiffel Tower" is. Nowadays they have more complicated mappings, which may include recurrent nets and so on and can deal with a much larger vocabulary, but the idea is that we learn these two mappings so that the representation for, say, the word "dolphin" is going to be close to the representation for an image of a dolphin.

Why is that useful? Well, you can imagine, if you are a search engine and you want to associate the queries people type with images, one way or the other, this is very useful. You want images to map to a representation that is kind of semantic, that has to do with what the image means: we don't care here so much about details like the light on the water, but we do care about what is in there; it is a dolphin, it is in water, it looks like a pool. These are the kinds of information that you would like to be encoded in this more abstract representation, much more abstract than the pixels, in such a way that you can also recover the concepts that go with an image, because they end up close to each other in that space.
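A rough sketch of the two mappings in NumPy; the linear image encoder, the sizes and the margin ranking loss below are assumptions made only to illustrate the idea of embedding queries and images in one space, not the actual system described in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

embed_dim = 100
vocab = {"dolphin": 0, "eiffel tower": 1, "frisbee": 2}

# Mapping 1: queries -> vectors is just a table lookup over frequent queries.
query_table = rng.normal(size=(len(vocab), embed_dim)) * 0.1

# Mapping 2: images -> vectors; stand-in for a learned convolutional net,
# here a single linear layer over (pretend) precomputed image features.
W_img = rng.normal(size=(embed_dim, 4096)) * 0.01

def embed_query(q):
    return query_table[vocab[q]]

def embed_image(img_features):
    return W_img @ img_features

# One possible training objective: push a matching query/image pair
# closer together than a mismatched pair, by at least some margin.
def ranking_loss(q, img_pos, img_neg, margin=1.0):
    s_pos = embed_query(q) @ embed_image(img_pos)
    s_neg = embed_query(q) @ embed_image(img_neg)
    return max(0.0, margin - s_pos + s_neg)

img_a, img_b = rng.normal(size=4096), rng.normal(size=4096)
print(ranking_loss("dolphin", img_a, img_b))
```

Both the table and the image encoder would be trained jointly by backprop through a loss of this kind, so that matching queries and images end up as neighbours in the shared space.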
So let me tell you, at a high level, what I think are the key ingredients for machine learning to approach AI. Number one is pretty obvious, but I think for many decades we ignored it too much: we need lots and lots of data. Let's step back: why do we need that much data? People complain that neural nets need a lot of data. Well, if we want a machine to be intelligent, to understand the world around us, it is going to need a lot of data to get that knowledge, because it is mostly learning all that information about the world by observing, and the world around us is complicated. So to learn some complex representation of what is going on out there, we need lots and lots of data, and I think that right now the amount of data we are using is still way too small compared to what will be needed to reach human-level AI. So that is really number one.

But of course it is not enough to have a lot of data if we cannot build models that can capture it, and for that we need very flexible models. We cannot have models like linear models that don't have enough capacity; we need the number of parameters to grow in proportion to the amount of data, so those models have to be big. They have to be big because they basically store the data in a different form, one which allows the computer to take good decisions.

Now, if we have these big models with a lot of parameters, of course we have to train them and we have to run them, and for that we need enough computing power. One of the reasons why neural nets were not so successful before is that we did not train them on enough data, we did not have big enough models, and we did not have enough computing power to train and use them.

But just having those first ingredients is not enough either, because you could have something like an efficiently implemented kernel machine, which in principle would let you deal with all three of them. There is something else that is important. I don't know if you have heard about the curse of dimensionality: we are trying to learn from very high-dimensional data, and in principle, if we don't make any assumptions about the data, it is essentially impossible to learn from it; there are just too many configurations that are possible. The only way around that is to make sufficiently powerful assumptions about the data. There is something called the no-free-lunch theorem, and it says you do need to make these kinds of assumptions, otherwise you will not be able to learn something really complex.
Okay, so let me tell you about the assumptions that are being made specifically in deep learning. They have to do with the curse of dimensionality, this exponential number of configurations we have to deal with. We cannot learn a separate parameter for each configuration of the input, because the number of such configurations is huge, much more than the amount of data we can ever have, more than the number of atoms in the universe. So we cannot just learn everything by heart; we need some form of generalisation. And besides smoothness, which has been a very successful and powerful assumption in machine learning, we need other assumptions. In these big networks we are putting in two crucial additional assumptions, which have to do with compositionality, in different forms. We already know compositionality is very powerful, and humans use it all the time; that is how we grasp the world around us, we compose concepts, and human language is basically an exercise in composing ideas and meanings.

So in those neural nets we have two forms of compositionality. The first one happens even with a single-layer neural net, whenever you have what we call a distributed representation. Think about a one-layer neural net: each neuron, each artificial unit, is detecting a feature, a concept, and these detectors are not mutually exclusive, so the number of configurations we can capture grows exponentially with the number of units. A single-layer neural net is in some sense exponentially powerful in what it can represent. This idea of learning features is our first form of compositionality: an object is described by the composition of which features are activated, which attributes are present in this particular image. Because we have many attributes, and each can be on or off, or maybe take some grey level, we can have a very rich description, exponentially rich in some sense. And there is more than the words I am saying: there is a lot of math behind this, showing how the fact that you have these distributed representations really buys you something exponential in a statistical sense.

So that is the first form of compositionality. The second form is the one you get when you have many layers, one on top of the other, where each layer computes something as a function of the output of the previous layer, and here you also get an exponential gain. Again, we have theory showing that by having more composition, layers on top of layers, you can represent functions that are exponentially richer in some sense. So that is the other key ingredient. And of course, these are assumptions about the world; there is no free lunch, as I said. These things work because they fit well with the world in which we live: if it were not the case that the world is conveniently described in a compositional way, these neural nets would not be working as well as they are.
To give you an example of this, think about a neural net where, at some level of representation, you have different units detecting different kinds of attributes. Let's say the input is an image of a person. You can imagine a unit that recognises that the person wears glasses, a unit that recognises that the person is female, another unit that recognises that the person is a child, and so on; you can imagine, say, a thousand such units. Now, why is that interesting? Imagine you were to try to learn these detectors. You have these n, say a thousand, detectors, a thousand features, and if you wanted to learn them separately, each would need something like k parameters, because you have to figure out what the characteristics are of a person who wears glasses; you could imagine training a separate neural net for each of them, although we will do better than that by sharing the lower layers. In the worst case, if I have n features and each of them requires on the order of k parameters, then in total I need on the order of n times k examples to learn about all these features.

Now consider the alternative, where we do not use a distributed representation, but a classical non-parametric approach, which says: we are going to consider all the possible configurations of images of persons. So now, if I am using an SVM or something like that, or a nearest-neighbour approach, I am going to need an example for each of the configurations I want to learn about. And how many configurations of the input can there be? Potentially n to the d, where d is the input dimension: the number of ways to split the data into all the configurations of "has glasses or doesn't", "is female or isn't", "is a child or isn't" is basically an exponentially large set. The good news is that if we represent the data with this distributed representation, we can get away with something that grows nicely with the complexity of the task, rather than exponentially. We actually have formal papers about this, characterising the number of linear pieces that a neural net with a single hidden layer of n hidden units can capture: it can represent things that look so complicated that you would think you need a number of parameters exponential in d in order to learn them. Because of the compositionality we are assuming about the world, the reason this works is that we can assume we can learn independently about wearing glasses, about being female versus male, about being a child or not; that is why this composition works. If the detector for glasses needed to know whether you were female, or a child, and so on, in order to detect the glasses, then it would not work: you would need as many parameters as if you were doing an SVM or a nearest-neighbour method. The fact that you can learn about these attributes separately from each other, without having to know all the configurations of the other attributes, is the reason why this is working.
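To put rough numbers on that argument (a back-of-the-envelope sketch; the values of n and k are made up, and counting 2^n on/off configurations of n binary attributes is just one way to make the exponential explicit):

```python
# n binary attributes (glasses, female, child, ...), each assumed learnable
# on its own with roughly k parameters / examples.
n, k = 1000, 100
distributed_cost = n * k        # grows linearly with the number of attributes

# A purely local, non-parametric learner would need to see (roughly) every
# distinguishable on/off configuration of those attributes separately.
local_cost = 2 ** n             # exponential in the number of attributes

print(distributed_cost)         # 100000
print(len(str(local_cost)))     # local_cost has about 302 digits
```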
Something similar has been shown for depth: you find that some functions can be represented very efficiently with a deep net, but if you wanted to represent those functions with a shallow network, you might need a huge number of units. In other words, there are functions out there which really are naturally expressed as a composition of many levels of nonlinear transformations, and if you try to capture those functions using a network that is not sufficiently deep, a shallow network with a single layer or two layers, which is not enough, then you would need an exponential number of units, so an exponential number of parameters, and an exponential number of examples to learn those functions properly. Okay, so that is a bit of the theory.

There is another theoretical development that happened recently which is maybe just as important. For many years, researchers in machine learning thought that neural nets could not really be practical and useful, because training them is a non-convex optimisation problem which could have many local minima. What I mean by this is that, especially if you think in low dimension about the function you are trying to optimise, the total error with respect to the parameters, it might be very bad if you just do stochastic gradient descent, or any kind of local descent algorithm rather than global optimisation: you might get stuck in these local minima, and that is not the case for things like kernel machines. So the question is: is this a real problem? Well, it turns out that it is not. At least, there is a lot of evidence that this myth, that training neural nets is riddled with bad local minima, is really a myth, and what we found is that this is especially true as we go from tiny networks to large networks. The really interesting thing is that the larger the network, the easier it is to optimise. You can find really bad cases of optimisation on very small nets, but when you go to millions of parameters, or hundreds of millions of parameters, there is a sort of statistical effect happening that makes the optimisation much easier. We studied this using an analysis of critical points, which are the places where the network has zero derivatives, and it turns out that, for the most part, the kinds of critical points you encounter during the training of a neural net are saddle points, meaning you are not stuck: in some directions it looks like a local minimum, but in other directions it is actually going down, and so gradient descent will just keep going and not get stuck at those saddle points. There would be a lot more to say about this, but I see the time, so let me tell you a little bit about where we are going now.
you a little bit about that of where
we're going now right the the
beginnings of neural nets were really
about having recognition about the
kinds of things have told you when you
recognise objects of images. And and of
course young is is much more than
having recognition. But but what's
interesting is that in the last few
years there's been really a lot of
progress in moving neural that stored
something that's more like a high level
cognition. Um there's been a lot of
work about attention in particular in
my lab and and in many other places
now. I'll tell you about about that so
attention is essentially it's in spite
of course about what we know about
humans a of considering the whole of an
input or a big set of numbers as one
homogeneous block. Um so for example if
you think about a layer that is looking
at the lower layer instead of looking
at everything the network learns to
focus on parts of the import or a layer
learns to focus on part of it's it's
it's important. Um another direction
that's really that's very very
promising is to look at reasoning
problems where instead of going from
input to output in once that you
actually have a sequence of steps and
the number of steps could very we at
each that we combine pieces of evidence
you know to to come up with a
conclusion this is really what
reasoning is about you combine say
different observations with different
things you know about the will and you
you know combine them to find an
answer. So I'll I'll tell you a little
bit about this but about the your and a
half ago this started with the simple
memory networks annual train machines.
And then another direction which is
related to this is everything that has
to do with planning. And reinforcement
learning and this is been exemplified
by the work of deep mind which has been
acquired by Google couple of years ago.
And their work on playing atari games
and more recently on the alpha go the
system that I mentioned at the
beginning that which a beep the will
champion but it's it's much more than
playing games it's about learning to
take decisions. And being able to learn
in a context where you wanna have ms
Elise provide the ability to have a
labels or supervised learning every
step. And and then more recently this
this kind of research we're combining
deep learning with with reinforcement
learning has gone into robotics. So the
whole field of robotics back in
particular by Berkeley is is moving
towards the use of departing and you
say a few words about attention. So
Imagine a sequence of feature vectors, so think of each of these points as a vector. We have been using this for machine translation, so each of those would be a feature vector extracted at a particular place in an input sentence: it may contain semantic attributes corresponding to the word at that position, as well as to the words in the neighbourhood. So this is a sequence of feature vectors, but it could be any kind of data. And we are going to produce another sequence of feature vectors; but instead of using the usual fully connected approach, which is a kind of static graph, we are going to make the relationship between the first sequence and the second sequence something more dynamic, using an attention mechanism.

So what is the idea of this attention mechanism? The idea is that when we need to produce this feature vector, instead of looking at all of the input vectors, we are going to choose a few of them, maybe mainly this one, and we will use that feature, and maybe a few others, to compute the feature at the next level. So we focus on a few elements in the input sequence; this is the crucial thing. You can do it using what is called soft attention, or stochastic hard attention; we work mostly with soft attention, but we have a paper that also uses stochastic hard attention. The idea of soft attention is that instead of taking a yes or no decision about which element in the set we are going to look at, we compute soft weights, which sum to one over all the elements, in order to decide how much attention we give to each of them. Those soft weights are computed by a little attention neural net, a little MLP, that takes the context at the upper level and the features at the lower level, and basically decides whether it is a good match: should we use this element as input for the next one? It outputs a score for each of the possible positions, using the corresponding input feature. And because these weights are just part of a soft, differentiable computation, you can learn to put the attention in the right place, and it does learn to do that.
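A minimal sketch of one soft-attention step in NumPy; the exact scoring MLP, the sizes and the use of a single context vector are simplifying assumptions, not the specific model used in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

T, feat_dim, ctx_dim, hid = 7, 16, 12, 10
annotations = rng.normal(size=(T, feat_dim))   # one feature vector per input position
context = rng.normal(size=ctx_dim)             # state of the upper level / decoder

# Small "attention MLP" scoring how well each input position matches the context.
W_a = rng.normal(size=(hid, feat_dim)) * 0.1
W_c = rng.normal(size=(hid, ctx_dim)) * 0.1
v = rng.normal(size=hid) * 0.1

scores = np.array([v @ np.tanh(W_a @ annotations[t] + W_c @ context) for t in range(T)])
weights = softmax(scores)                      # soft weights, sum to one over positions

# The next-level feature is a weighted combination of the input features, so the
# whole computation stays differentiable and the attention itself can be learned.
attended = weights @ annotations               # shape: (feat_dim,)
print(weights.round(3), attended.shape)
```

Because everything here is differentiable, backprop can adjust W_a, W_c and v so that the weights concentrate on the useful positions.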
In fact, it is thanks to this attention mechanism that we reached the state of the art in machine translation last year, in 2015. We basically use the architecture I showed before to process input sentences and extract those features from them, using a form of recurrent net, a bidirectional recurrent net. Let me show you that picture again: you can think of it as having extracted semantic features from the whole sentence, or even think about reading a whole book, with each of these being the features extracted at each position, at each word in the book. Now we can produce the translated book one word at a time, and each time we produce the next word in the translated book, we decide which word, or which few words, in the source book we should be looking at. This works quite well, in contrast to a technique that had been tried before, along with our colleagues at Google, where you read the whole book, you come up with a kind of semantic representation of the whole book, and then you feed that into another recurrent net which produces the words of the translated book. That does not work as well, because it is hard to compress that much information into a fixed-size vector. But by allowing the network to decide, at each point in producing the output sequence, where to look, it works perfectly well, and we won a couple of the WMT challenges, the yearly competition for machine translation, using these neural machine translation systems. More recently, our colleagues at Stanford have been using this on other datasets and benchmarks and obtained even stronger improvements, and now there is a whole cottage industry improving these neural machine translation systems; they are essentially the leading approach in machine translation right now.
One thing you can do with attention that is quite cool as well is to combine the things we have done in computer vision with the things we have learned about modelling language. In this work, we trained a neural net, a convolutional net, that extracts features from the image, and then we use an attention mechanism to decide where to look as we produce, one word at a time, a sentence that is supposed to be a description of the image. So the computer reads the image and produces a sentence stochastically: it outputs a probability for the next word, we sample that word, it produces a probability for the following word, and so on. It sees this image and says "a woman is throwing a frisbee in the park", and it does it using attention: each time it produces a word in the output sequence, it chooses where to look in the input, so here, when it says "frisbee", it is looking at the region where there is a frisbee.

Just a few years ago, if somebody had told me that we would train a neural net that looks at an image and produces a natural-language sentence describing it, I would have said no, that is going to take at least ten years. But it is there, this is more than a year-old result, and now people are doing even better than that. Let me show you more of these examples. The computer looks at this and says "a dog is standing on the hardwood floor", and when it says "dog" it is looking at the face of the dog. It looks at this image and says "a stop sign is on the road with a mountain in the background", and when it says "stop sign" it knows what a stop sign is; it is looking at the stop sign.

Now let me show you something that our colleagues at Facebook have done using something similar, except that instead of producing a sentence, the system answers questions: Is there a baby? Yes. What is the man doing? [unclear]. Is the baby sitting on his lap? Yes. Are they smiling? Yes. Is there a baby in the photo? Yes. Where is the baby standing? [unclear]. What is the baby doing? Brushing teeth. What game is being played? Soccer. Is someone kicking the ball? Yes. What colour is the ball? Yellow. What is the dog playing with? [unclear]. What colour is the dog? Black. Is the dog wearing a collar? Yes. What is the cat sniffing? [unclear]. Where is the cat? [unclear]. What colour is the cat? Black and white. What colour are the bananas? [unclear]. Okay, now you have to beware: this is a demo made by Facebook, so, I mean, I think it is real, but they probably selected cases where it works better. Nonetheless, this is really impressive.
Let me tell you a little bit about what is behind the scenes, in addition to the mechanisms I have been describing. Essentially, it is using this attention-mechanism idea not just to focus on a particular part of the input, but to focus on a particular part of a memory. The idea is to separate the main computation, which would typically be done by a recurrent network, from a memory, which you can think of like a computer memory: you have a vector at each address, and these vectors can be long; think of them as something like word embeddings, so they might be around two hundred dimensional, something like that. The recurrent net can of course read from the external world and produce outputs and answers, but it can also perform internal actions. The internal actions here are things like reading at a particular place, or writing at a particular place. Instead of taking a hard decision about where to read, where to write and what to write, it takes soft versions of these decisions: it computes a score for each address, and those scores, passed through a softmax, sum to one and say where it wants to read. Then, just as we did for the attention mechanism, it takes those weights and forms a linear combination of what it is reading: we take the contents everywhere, weighted by those scores that sum to one, to actually bring the information from the memory into the recurrent net. So it is reading with a focus of attention on a few places, and you can do the same thing for the writing.
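A minimal sketch of such a soft read, plus a simple additive soft write, over a memory of vectors, in NumPy; the dot-product addressing and the additive write are simplifying assumptions, one choice among several used in this family of models:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

n_slots, slot_dim = 128, 200
memory = rng.normal(size=(n_slots, slot_dim))   # one vector per memory address

# The controller (e.g. a recurrent net) emits a key saying what it wants to read.
read_key = rng.normal(size=slot_dim)

# Soft addressing: a score per address, softmaxed so the weights sum to one.
scores = memory @ read_key
weights = softmax(scores)

# The read value is a weighted combination of all slots, focused on a few of them.
read_value = weights @ memory                   # shape: (slot_dim,)

# A soft write works the same way: each slot is updated in proportion to its weight.
write_weights = softmax(memory @ rng.normal(size=slot_dim))   # where to write
new_content = rng.normal(size=slot_dim)                       # what to write
memory = memory + np.outer(write_weights, new_content)

print(read_value.shape, memory.shape)
```

Since both the read and the write are weighted sums, gradients flow through them, and the controller can learn where to read and write by backprop.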
So you can use these kinds of systems to do things like reading a little story: Sam walks into the kitchen, Sam picks up an apple, Sam walks into the bedroom and drops the apple; and then the question: where is the apple? The computer reads all of these things, including the question, and it knows this is the question, maybe because there is a special marker saying it is supposed to answer it, or something like that. It is just like what we had in the demo, except that in the demo, instead of the text, we had an image; but it is exactly the same mechanism.

So I'm going to close here. This is a picture of my group in Montreal, the Montreal Institute for Learning Algorithms, and we are always recruiting. Thank you.

I guess it is time for the break. I'll be here for the panel later, so if you have questions we can answer them at the panel; also, tomorrow I'll be giving another lecture, and I'll leave more time for questions during the lecture, so you can keep your questions for later today or tomorrow, and we can take them then, when we have more time.

Conference program

Deep Supervised Learning of Representations
Yoshua Bengio, University of Montreal, Canada
4 July 2016 · 2:01 p.m.
Hardware & software update from NVIDIA, Enabling Deep Learning
Alison B Lowndes, NVIDIA
4 July 2016 · 3:20 p.m.
Day 1 - Questions and Answers
Panel
4 July 2016 · 4:16 p.m.
Torch 1
Soumith Chintala, Facebook
5 July 2016 · 10:02 a.m.
Torch 2
Soumith Chintala, Facebook
5 July 2016 · 11:21 a.m.
Deep Generative Models
Yoshua Bengio, University of Montreal, Canada
5 July 2016 · 1:59 p.m.
Torch 3
Soumith Chintala, Facebook
5 July 2016 · 3:28 p.m.
Day 2 - Questions and Answers
Panel
5 July 2016 · 4:21 p.m.
TensorFlow 1
Mihaela Rosca, Google
6 July 2016 · 10:00 a.m.
TensorFlow 2
Mihaela Rosca, Google
6 July 2016 · 11:19 a.m.
