Welcome back after that really nice talk. I'm going to give my last
presentation on Torch, and this one might be a bit more interesting and
less dense. It's actually going to be related to Yoshua's talk: we're
going to do a complete example of generative adversarial networks in
Torch. It's a very short example, in total about a hundred and fifty
lines or so, so we'll cover that in the next few slides. Then we'll
have some examples of the autograd package, the stuff that it can do
that none of the other packages can. Then finally I will briefly go
through the documentation of torchnet, to give you an overview of what
torchnet is and what it brings to the table. And the last slide is
going to be the next steps: if you guys are interested in Torch, where
do you go from here after this tutorial. Okay, so
with that, to start: as Yoshua showed, adversarial networks have a
generator and a discriminator. There is some latent space that is the
input to the generator, so some random noise for example, and that
generates some sample, in most cases with GANs an image. That image,
along with real images, goes into the discriminator, and you have some
loss that classifies between real images and fake images. So far so
good, pretty simple: you have two distinct pieces, a generator and a
discriminator, and both of them are neural networks. The way the
optimisation works is that the generator is optimising itself to fool
the discriminator, so the discriminator's classification loss should be
really high when the generator is optimising itself, and the
discriminator is optimising itself to not get fooled by the generator,
so the classification loss should be really low. Then you compute the
gradients with respect to the discriminator. So for images it basically
looks like this: some random noise, as you see, is the input to the
generator, the generator produces some generation, it goes into the
discriminator, you have some probability of real or fake, and then you
have some log-likelihood cost where the ground truth is "real", so
you're maximising the discriminator's probability of real. And
separately, when you're optimising the discriminator, you have a
mini-batch of samples that are either real or fake, and then you are
purely discriminating according to whatever the ground truth is. So to
start
off: in the morning, if you remember, we discussed how you make neural
networks in Torch. You have the nn package, in which you can define nn
modules, and you have containers like Sequential, Concat and Parallel,
which can take in the nn modules and compose them in certain ways to
construct a neural net. In this case the generator and the
discriminator are going to be convnets of a few layers, and the loss
will be a binary cross entropy. So, to start off, some boilerplate: you
load your packages, and you define some options that you can read from
the command line; this is just an options parser. You create your data
loader, similar to the one I showed in the second lecture. The data
loader has two functions: one returns the size of the dataset, and the
other is a function called getBatch that will get a mini-batch of
samples. Internally the data loader has multi-threading, and it keeps
mini-batches ready for you as you need them. And next comes the
generator.
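
As a rough illustration of the data loader just described (a size function, a get-batch function, and background threading that keeps mini-batches ready), here is a minimal sketch. It is written in Python rather than the Lua used in the talk, and the class name `PrefetchingLoader`, its parameters and the random sampling strategy are all assumptions made for illustration, not the loader from the slides.

```python
import queue
import random
import threading


class PrefetchingLoader:
    """Toy loader: reports the dataset size and serves ready-made mini-batches."""

    def __init__(self, samples, batch_size, prefetch=4, seed=0):
        self.samples = list(samples)
        self.batch_size = batch_size
        self._rng = random.Random(seed)
        # Bounded queue: the worker blocks once `prefetch` batches are waiting.
        self._ready = queue.Queue(maxsize=prefetch)
        worker = threading.Thread(target=self._fill, daemon=True)
        worker.start()

    def _fill(self):
        # Background thread: keep preparing random mini-batches.
        while True:
            batch = self._rng.sample(self.samples, self.batch_size)
            self._ready.put(batch)

    def size(self):
        return len(self.samples)

    def get_batch(self):
        # Returns at once if a batch is already waiting in the queue.
        return self._ready.get()


loader = PrefetchingLoader(range(1000), batch_size=64)
batch = loader.get_batch()
print(loader.size(), len(batch))
```

The bounded queue is the design choice that matters here: the training loop never waits on data as long as the worker keeps up, and memory use stays capped at a few batches.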
So the generator itself is a convnet, and its input is the latent space
here. The generator is a Sequential. In the first layer you go from the
latent space, let's say nz, where nz is the number of dimensions in the
input space, let's say that's about a hundred, so you go from a hundred
dimensions to some variable ngf, which is configurable. You have ngf
and ndf, two variables that are configurable when you run it. They are
basically the number of filters in the first convolutional layer of the
generator, and the number of filters that you learn in the first
convolution layer of the discriminator. So you construct convolutional
layers appropriately, with batch normalisation and ReLU. This is how
the deep convolutional GANs are defined in the paper that Yoshua
actually showed interpolations from. You just stack them for a few
layers, and then the last layer you make a tanh, so that the output
layer is bounded between minus one and one, because the real images are
also coming from a distribution that is bounded between minus one and
one. Just to preserve that property, and to make sure that the
discriminator is not discriminating just based on the range of the
values, you give it a nonlinearity that has similar bounds. So that's a
very simple definition: within a few lines of code you define the
generator.
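
To make the point about matching ranges concrete, here is a small Python sketch (the talk's own code is Lua Torch). It assumes 8-bit pixel values and a scaling by 127.5, neither of which is stated in the talk; the helper names are hypothetical. It only shows that real data scaled this way and tanh outputs occupy the same interval, so the range alone gives the discriminator nothing to latch onto.

```python
import math


def to_tanh_range(pixel):
    """Map an 8-bit pixel value in [0, 255] to [-1, 1] (assumed preprocessing)."""
    return pixel / 127.5 - 1.0


def from_tanh_range(value):
    """Map a value in [-1, 1] back to an 8-bit pixel value."""
    return int(round((value + 1.0) * 127.5))


real_pixels = [0, 64, 128, 255]
scaled = [to_tanh_range(p) for p in real_pixels]

# tanh squashes any pre-activation into the same interval as `scaled`.
pre_activations = [-10.0, -0.5, 0.0, 0.5, 10.0]
generated = [math.tanh(x) for x in pre_activations]

print(scaled)
print(generated)
print([from_tanh_range(g) for g in generated])
```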
The discriminator takes as input the output of the generator, and it is
structured very similarly: a few convolutional layers, and instead of
ReLU you actually use a nonlinearity called leaky ReLU. The output
layer of this discriminator is a sigmoid. It goes from zero to one: if
it is zero, the discriminator thinks that the sample is fake, and if
it's one, it thinks the sample is real. And you just have a last
reshaping layer, because these layers were convolutional, so the output
will be three-dimensional, and to pass it into the cost function you
want it to be a single vector. So, very simple. The loss you define in
a single line: the loss is BCECriterion, which, if you look at the
documentation, is the binary cross entropy criterion. The optimisation
algorithm we use is Adam, so we define the learning rate and the beta
parameter, and it's also very simple there. And you create some buffers
and push them onto the GPU, so you preallocate some GPU buffers. You do
this because on the GPU, doing allocations and frees is fairly
expensive. Unless you write yourself a custom allocator, it's better to
just preallocate the buffers that you need, like the input buffer, the
noise buffer and the label buffer, and then reuse them again and again,
just for performance reasons. And so here
you see, if the GPU is used, you send all of them to the GPU using the
cuda call, as I showed in the second lecture. And finally, to use the
optim package, you flatten the parameters using these calls.
parametersD is now a vector that contains all the parameters of your
discriminator network in a single vector, so that can be passed into
any of the optim algorithms. parametersG is also a single vector, which
contains the flattened parameters of the generator. Similarly,
gradParametersD and gradParametersG are the gradients with respect to
each of the parameters; each is a vector of the same size as the
parameter vector. And if you recall, there's one more step to train
your neural network, and that is to define your closures for the optim
package. In the previous example I showed, the supervised classifier,
you only have a single closure that you have to define, because you're
just optimising to minimise the loss and that's about it.
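
The single-closure pattern for the supervised case can be sketched as follows. This is a Python illustration, not the Lua optim API: `sgd_step` is a hypothetical stand-in for an optimiser, and the toy linear-regression problem is invented for the example. The point is only the contract: the closure takes the flat parameter vector and returns the loss together with the gradient, and the optimiser calls it.

```python
def sgd_step(closure, params, learning_rate):
    """Hypothetical optimiser step: call the closure, then update in place."""
    loss, grads = closure(params)
    for i, g in enumerate(grads):
        params[i] -= learning_rate * g
    return loss


# Toy supervised problem: fit y = w * x + b to points on y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]


def closure(params):
    """Return the loss f(x) and the gradient df/dx at the flat vector `params`."""
    w, b = params
    loss, grad_w, grad_b = 0.0, 0.0, 0.0
    for x, y in data:
        err = (w * x + b) - y
        loss += err * err / len(data)
        grad_w += 2.0 * err * x / len(data)
        grad_b += 2.0 * err / len(data)
    return loss, [grad_w, grad_b]


params = [0.0, 0.0]  # flat parameter vector, like the flattened network weights
for step in range(2000):
    loss = sgd_step(closure, params, learning_rate=0.05)

print(params, loss)
```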
But in the adversarial network you actually have two optimisations that
alternate one after the other. The first one is that you optimise the
discriminator. So you create a closure to evaluate f(x) and df/dx of
the discriminator. If you recall, in the discriminator you want to
optimise for the real samples to be classified as real and the
generated samples to be classified as fake. But one slight subtlety
here, especially in the DCGAN architecture, is that you don't want the
batch normalisation statistics of the real and the fake samples to
interact, because the batch statistics of the real samples will
initially be different from the batch statistics of the fake samples.
That is one simple feature that is enough for the discriminator to get
really good, and then the adversarial game is never played. So what you
do here is you get a real batch of samples from your data loader, you
copy them onto the GPU, and you compute the gradients with respect to
this real mini-batch.
The label you fill in for this is the real label, because you're
optimising the discriminator, so everything is exactly as intended.
Next you want to optimise with respect to the fake samples. So you
create the uniform noise, you send that to your generator, and you get
the fake samples. You copy those: they're already on the GPU, but
they're in a different buffer, so you just copy them into this buffer,
and then you optimise with respect to the fake samples. Keep in mind
this line here, the label: the label for the real samples is real and
the label for the fake samples is fake, and in a second you'll see how
it changes for the generator. The total loss of your discriminator is
the sum of the losses on the real and the fake samples, and then you
return the D loss and the gradients with respect to the discriminator.
Next you define another
closure, the closure to evaluate the generator and the gradients with
respect to the generator. Over here you don't have a real part, you
don't have real samples. So what you do is you get some noise vector,
you pass it through the generator, and you get the fake samples. Now,
because you're optimising for the generator to fool the discriminator,
you actually fill these labels to be real labels, which means that the
discriminator is supposed to think that the samples are real; that's
what you're optimising for. Then you send these through the
discriminator, you compute the gradients with respect to the generator,
and finally you just return the loss and the gradient with respect to
the generator. And finally, your training loop here is very short. For
as many epochs as you define, and for the number of mini-batches per
epoch, you update the discriminator network, basically maximising the
log-likelihood of the discriminator with respect to real samples and
with respect to fake samples. Then you update the generator network,
where it's just maximising how badly the discriminator does. And then
after every epoch you save the models to disk to reuse them later.
That's about it: in about a hundred and fifty-eight lines you have
actually created your adversarial network, and you can train it.
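
The label handling in the two closures described above can be sketched in a few lines. This is a Python illustration with hand-picked discriminator outputs, not the Lua code from the slides; the function names and the example probabilities are assumptions. It shows the two facts from the talk: the discriminator's loss is the sum of a real-labelled term and a fake-labelled term, and the generator reuses the same binary cross entropy with the labels flipped to "real".

```python
import math

REAL, FAKE = 1.0, 0.0


def bce(prediction, label, eps=1e-12):
    """Binary cross entropy for one probability and one 0/1 label."""
    p = min(max(prediction, eps), 1.0 - eps)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def mean_bce(predictions, label):
    return sum(bce(p, label) for p in predictions) / len(predictions)


def discriminator_loss(d_on_real, d_on_fake):
    # Real batch labelled real, fake batch labelled fake; the losses are summed.
    return mean_bce(d_on_real, REAL) + mean_bce(d_on_fake, FAKE)


def generator_loss(d_on_fake):
    # Same fake samples, but the labels are flipped to "real".
    return mean_bce(d_on_fake, REAL)


# Hypothetical discriminator outputs (probabilities of "real").
confident_d = {"real": [0.9, 0.95], "fake": [0.1, 0.05]}
fooled_d = {"real": [0.5, 0.5], "fake": [0.5, 0.5]}

for name, out in (("confident", confident_d), ("fooled", fooled_d)):
    print(name,
          round(discriminator_loss(out["real"], out["fake"]), 3),
          round(generator_loss(out["fake"]), 3))
```

A confident discriminator has a low loss of its own and leaves the generator with a high one; when the discriminator outputs 0.5 everywhere, each term equals log 2.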
So what does it look like in terms of training, just out of interest?
This is how your generations look initially: just random noise. But
over time they start improving rapidly, and they get better and better.
You see it happens really quickly, going from random samples to actual
generations, but it still improves over a few epochs. It's still
improving, and usually within an epoch you will see whether the
adversarial game is playing itself out correctly or not. As Yoshua did
mention in his lecture, adversarial networks are very unstable methods,
so there are a few ways in which you can monitor whether they are
working or not. Looking at the samples visually is one of them, but it
doesn't scale at all. The other method is that you have several
generators and discriminators training in parallel, and then you
validate each generator against a discriminator it's not training
against, and you have a metric to monitor that. So these are all fake
images from the same latent vector, and over time these samples
improve. To show it again: initially the samples are pretty poor, and
then over time they start getting more and more compelling. At that
point the classification accuracy of the discriminator is usually
hovering around 0.5, that is, the discriminator cannot tell whether
these samples are real or fake. But still, if you look, the samples are
fairly weird. For some of the best models we got for the paper, we
tuned the architecture and the learning rate. If you just put something
together, these GANs go pretty far, but you do see some weirdness in
the generations, especially some wobbliness and so on. So
how do you generate new samples from the checkpoint that you just saved
to disk? It's very simple, it's a single line of code: you just create
a new noise vector, fill it with the noise you want, forward it through
the generator network, and you get images. There are different kinds of
noise; you can just fill it with some uniform noise, simple enough. But
you can also do slightly fancier things. In Yoshua's lecture he showed
these interpolations in latent space from one place to another. You can
do similar things where you take two points, in this case L and R,
which just mean left and right, and then you compute a linear
interpolation of points, with some granularity, between the left point
and the right point, and you copy those into the noise vectors that you
created. So basically, for each entry of the mini-batch of noise, you
find some intermediate point between left and right and put it in the
noise mini-batch, and then you give this noise mini-batch to the
network. That will generate images where each image in the mini-batch
is basically going from left to right, starting from point A and ending
at point B. If you just train this network for one or two epochs, the
generations aren't perfect. I generated these really quickly, but you
see that this is the generation from the point on the left, this is the
generation from the point on the right, and as you walk across, the
image slowly changes from one bedroom to another. These are supposed to
be bedrooms, but because I only trained them for one epoch they're not
very good. The images that Yoshua showed from the paper had been
trained for several epochs, and they are much nicer. The last thing
I want to show is the arithmetic demo, where you can do arithmetic in
latent space: you can choose certain images that have certain
properties in latent space, do some kind of A plus B minus C kind of
thing, and then do some generations. For that I will pull up a
terminal. There, that's a terminal, okay. So I just ran the arithmetic
generation script, and it printed out the network, and then it says:
"We will be doing vector arithmetic A minus B plus C. Choose three
images for A." The reason you need to do this arithmetic over a few
images is that doing it on a single image doesn't accurately capture
the property that you're trying to capture. So it gave me a few
choices. Unless you guys have a preference, I will pick some. So, A
minus B plus C. I'll try to pick someone smiling for A, and then we
will see what to do for B and then C. So maybe a woman smiling, let's
pick four, and another woman smiling, forty-eight, and the third one,
let's see, okay, this one, three, sort of almost smiling. Now for B,
let's try to remove, for example, the woman, and then maybe add a man.
Or you can remove the smile and add a neutral expression. Or, yeah,
let's try to remove the smiling woman and then add a man. I don't know,
I don't have lots of ideas, but if you have good ideas let me know.
Okay, so there is "woman smiling"; we remove the woman, and then it
will be "man smiling".
And then we can add sunglasses... a moustache? Okay, that's a good
idea, as long as we have enough generations of men with moustaches. So
we need to remove women that are neutral, because you still want to
keep the smile. Okay, sixty-three at the bottom, I see a very neutral
expression. If anyone spots something faster, let me know. Last one...
oh yeah, sixty-four, and one who is not smiling at all... okay, the
first one. Okay. Now, last, we have to do C, so now we have to add men
with moustaches, as someone requested. Does five qualify as a
moustache? No. Twenty-four, twenty-four has a moustache. I can't hear
you, did you say thirty-one? Thirty-one doesn't have a moustache.
Thirty-four, oh sorry, fifty-four. Okay, fifty-four. Let's see what
happens. So what's cool, you will see, is that it generated a man with
a moustache who is also smiling. And if you look at the code for this,
it's fairly simple; all you do is literally take the noise vectors
and... let's see, can I pull this up? Actually, I think the internet is
not working. Okay, I think we have that. So if you look at the code for
this, most of the code is simply choosing the images. And finally the
vector arithmetic itself is in this line here: you take the A average,
the B average and the C average, and then the final noise is the
average of A minus the average of B plus the average of C, and then you
give that to the net. The rest of the logic is just drawing the images
and stuff like that.
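
The two latent-space operations from this part of the talk, linear interpolation between a left and a right point and the "average of A minus average of B plus average of C" arithmetic, can be sketched as plain vector operations. This is Python rather than the Lua of the demo script; the helper names, the uniform range of [-1, 1] and the number of steps are assumptions. The resulting vectors are what would be forwarded through the generator.

```python
import random


def lerp(left, right, t):
    """Point a fraction t of the way from `left` to `right`."""
    return [(1.0 - t) * l + t * r for l, r in zip(left, right)]


def interpolate(left, right, steps):
    """Mini-batch of `steps` (at least 2) noise vectors walking left to right."""
    return [lerp(left, right, i / (steps - 1)) for i in range(steps)]


def average(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def arithmetic(group_a, group_b, group_c):
    """average(A) - average(B) + average(C), element by element."""
    a, b, c = average(group_a), average(group_b), average(group_c)
    return [x - y + z for x, y, z in zip(a, b, c)]


rng = random.Random(0)
nz = 100  # latent dimensions, "about a hundred" in the talk


def uniform_noise():
    return [rng.uniform(-1.0, 1.0) for _ in range(nz)]


left, right = uniform_noise(), uniform_noise()
noise_batch = interpolate(left, right, steps=8)

# Three chosen latent vectors per group, as in the demo.
group_a = [uniform_noise() for _ in range(3)]
group_b = [uniform_noise() for _ in range(3)]
group_c = [uniform_noise() for _ in range(3)]
final_noise = arithmetic(group_a, group_b, group_c)

print(len(noise_batch), len(final_noise))
```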
So it's extremely simple, and very natural to do as well. I need
internet, though, for my next... I'll figure it out. Okay, alright. So
that's GANs. We basically went through a full example of GANs, and we
looked at how to train them, how to generate samples, how to do
latent-space interpolations, and how to do latent-space arithmetic.
Next we go to autograd. I just want to briefly talk about this package.
Autograd, as I mentioned in the morning, is a package by Twitter that
computes your gradients with respect to a function in a very nice way
that is unique to autograd at the moment: it has a tape recorder. It
records all the operations you did in the forward phase, I mean while
computing the function, and in the backward phase it replays the tape
backwards. It has gradients defined for every single operation in Torch
and in Lua. So it works really well. So as a
basic example, you can have a function here of variables a, b and c.
Autograd always computes the gradient with respect to the first
variable of your function. It can either be a single variable or it can
be a table of variables, for example if you have a neural network with
multiple weight matrices. So you define the function, and then you call
autograd on the function, and that returns another function that is the
backward of the first function. It executes the first function, records
what happened, and then returns a function. Then, when you evaluate
that function at a particular point, you get the derivatives and the
value of the function itself. I just printed out the value there, and
the gradient with respect to a here is just one, because a doesn't have
any terms around it. Then, for more interesting examples: conditionals.
In autograd you can, for example, have a function like this, where at
runtime you can condition your function on some variables, and this is
something that you can't do in nn or other differentiation packages.
Here I say: if b is greater than c then return this function, if not
return this function. And depending on how I choose my b and c
variables, the gradients are appropriately changed
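
The tape-recorder idea and the conditional example can be illustrated with a very small reverse-mode differentiator. This is a Python sketch of the general technique, not torch-autograd's implementation or API: `Var`, `Tape` and `grad` are hypothetical names, only addition and multiplication are supported, and the function `f` is an invented example in the spirit of the one on the slide. As in the talk, the gradient is taken with respect to the first argument, and the returned function yields the gradient together with the value.

```python
import math


class Var:
    """A number that records every operation applied to it on a tape."""

    def __init__(self, value, tape, index):
        self.value, self.tape, self.index = value, tape, index

    def __add__(self, other):
        v, i = _parts(other)
        return self.tape.record(self.value + v, [(self.index, 1.0), (i, 1.0)])

    def __mul__(self, other):
        v, i = _parts(other)
        return self.tape.record(self.value * v, [(self.index, v), (i, self.value)])

    __radd__, __rmul__ = __add__, __mul__


class Tape:
    def __init__(self):
        self.ops = []  # ops[k] lists (input index, local derivative) for variable k

    def record(self, value, inputs=()):
        var = Var(value, self, len(self.ops))
        self.ops.append([(i, d) for i, d in inputs if i is not None])
        return var


def _parts(x):
    """(value, tape index) for a Var, (value, None) for a plain number."""
    return (x.value, x.index) if isinstance(x, Var) else (x, None)


def grad(fn):
    """Return a function giving (d fn / d first argument, value of fn)."""
    def wrapped(first, *rest):
        tape = Tape()
        x = tape.record(first)
        out = fn(x, *rest)                        # forward phase: ops are recorded
        adjoint = [0.0] * len(tape.ops)
        adjoint[out.index] = 1.0
        for k in reversed(range(len(tape.ops))):  # backward phase: replay the tape
            for i, local in tape.ops[k]:
                adjoint[i] += local * adjoint[k]
        return adjoint[x.index], out.value
    return wrapped


def f(a, b, c):
    # Ordinary runtime branching: only the branch actually taken is recorded.
    if b > c:
        return a * math.sin(b)
    return a + b * c


df = grad(f)
print(df(3.0, 2.0, 1.0))  # takes the b > c branch
print(df(3.0, 1.0, 2.0))  # takes the other branch
```

Because the tape only ever contains the operations that ran, the gradient follows whichever branch the forward phase took, with no special handling for the `if`.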
Next I want to... Yeah, probably. I don't know, actually; we could try.
I would expect that it records which path it took in the forward phase
and basically sticks to that path in the backward phase, because from a
code perspective that's exactly what it does. It records what you went
through: for example, if b is very close to c, but b is just slightly
greater than c, then it goes through this code path here and it
computes this function. So in the backward phase it computes the
gradient with respect to the same function. It's a good question. I
don't actually know if it can do anything fancier, because it is
basically, as I said, a tape-based differentiator. And next I want to
show an example of while loops. If you look here, I have the function f
there. Now the function has a dependency, in the while loop, on b being
greater than c, and as long as b is greater than c you keep computing
this function and adding it to your function value. Yeah, the backward
has to be defined. So the question was: if some of the operators in
this function are not nn modules but your own functions, then what
happens? If it is a Torch function, for example, the gradient is
defined for all Torch functions and all Lua operators. But if it's none
of those, then there is a small place in the autograd package itself
where you can add the gradient with respect to your function, and
there's no way around it, essentially. But your function can be a Lua
function. I'm talking about the case where you write your own custom
function in C, for example, and you call that, so the autograd package
has no way of knowing what you're doing there. So in those cases, for
an opaque function whose internals it can't inspect, it has to be told
what the backward is. So yeah, this is an example I wanted to show
where you can actually go backward through a while loop as well. And
the last thing, a
last feature that autograd has is that it actually has dependency
checks. Let's say you have a function where nothing in the function
actually depends on a. For example, in this case, if I make b less than
c, then the function is just the value zero; it's the constant
function. In that case it sees that there is nothing that depends on a,
so it just generates an error saying that the gradient has no
dependency on a. So autograd is not for every kind of research, I
think. It has clear advantages for certain kinds of research, if you
have dynamic graphs. Just as an example, say you want to backprop
through some algorithm like Viterbi. In the forward phase you take a
certain path, and in the backward phase you want to backprop through
the same path, just through the maximum path for example. In that case
every forward phase might give you a different dynamic graph, depending
on which directions you take, and in the backward phase you would want
to be able to compute this easily. Usually, in networks like graph
transformer networks, writing this gradient efficiently is fairly
complicated.
And if you want to, for example, do research and prototype these things
out, autograd is actually a very nice package. You don't have to use
autograd independently by itself; you can actually use autograd in
conjunction with nn. You can make an autograd module out of an autograd
function, and that can be plugged into an nn network, an nn container
for example, and it plays really well with the rest of the network that
you already have. So yeah, to summarise: it does really well on dynamic
graphs, it has auto-differentiated nn modules, and it's tape-based
automatic differentiation.
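
Two of the behaviours covered above, differentiating through a while loop and the dependency check, can be sketched with the same kind of toy tape. As before, this is a Python illustration of the technique and not torch-autograd's code: `Var`, `Tape`, `grad`, the error message and the function `f` are all assumptions. The loop simply records one multiply and one add per iteration, and a result that never touched the first argument is rejected.

```python
class Var:
    """A number that records every operation applied to it on a tape."""

    def __init__(self, value, tape, index):
        self.value, self.tape, self.index = value, tape, index

    def __add__(self, other):
        v, i = _parts(other)
        return self.tape.record(self.value + v, [(self.index, 1.0), (i, 1.0)])

    def __mul__(self, other):
        v, i = _parts(other)
        return self.tape.record(self.value * v, [(self.index, v), (i, self.value)])

    __radd__, __rmul__ = __add__, __mul__


class Tape:
    def __init__(self):
        self.ops = []  # ops[k] lists (input index, local derivative) for variable k

    def record(self, value, inputs=()):
        var = Var(value, self, len(self.ops))
        self.ops.append([(i, d) for i, d in inputs if i is not None])
        return var


def _parts(x):
    return (x.value, x.index) if isinstance(x, Var) else (x, None)


def grad(fn):
    """Return a function giving (d fn / d first argument, value of fn)."""
    def wrapped(first, *rest):
        tape = Tape()
        x = tape.record(first)
        out = fn(x, *rest)
        if not isinstance(out, Var):
            # Dependency check: nothing recorded on the tape led to the output.
            raise ValueError("output has no dependency on the first argument")
        adjoint = [0.0] * len(tape.ops)
        adjoint[out.index] = 1.0
        for k in reversed(range(len(tape.ops))):
            for i, local in tape.ops[k]:
                adjoint[i] += local * adjoint[k]
        return adjoint[x.index], out.value
    return wrapped


def f(a, b, c):
    total = 0.0
    while b > c:              # the number of iterations is decided at runtime
        total = total + a * b
        b = b - 1.0
    return total


df = grad(f)
print(df(2.0, 3.0, 0.0))      # three iterations: total = a * (3 + 2 + 1)
try:
    df(2.0, 0.0, 3.0)         # loop never runs, f is the constant zero
except ValueError as err:
    print("error:", err)
```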
And it is only about thirty percent slower than doing the regular nn.
This is the cost that you pay with autograd, but it's just a small
constant factor, thirty percent, and given that this whole thing is
done dynamically, that is actually pretty reasonable. So that's
autograd. To conclude, the last thing I wanted to show is torchnet, as
soon as I can figure out... okay, my internet is back. So, going to
torchnet: I did not have enough time to make slides for torchnet, but I
want to take you guys through the documentation itself, to show what
patterns of computation torchnet captures. Torchnet actually has four
parts. Just to recall, torchnet is a framework released by Facebook
recently that makes it much easier for you to do common operations
like, for example, data loading, training and testing, and doing some
kind of logging and so on. So torchnet has four modules. The first is
dataset, which is basically all the multi-threaded data loading from
images, text and video. All this is abstracted away into nice datasets
for you; as long as your data is in a certain format, it will do the
most efficient data loading for you. The dataset package also has data
augmentation modules that can be chained together. For example, you can
create a dataset, the standard image dataset, and then plug it into a
crop image dataset that will randomly crop images, and then you can
chain it to a batch dataset. Now you have a batch dataset of a crop
image dataset of an image dataset, so you can compose these things, and
it runs like a pipeline. Then you can finally put that inside a
parallel dataset, which makes the whole thing multi-threaded. So you
can compose these things very nicely, and you can reuse a lot of this
functionality across many of your experiments, so the dataset part of
torchnet is actually very, very powerful. And the
engine is basically a way to abstract away the training loop, the part
that I said we as researchers would want to write again and again. But
there are cases, for example when you build production pipelines, where
it's the same network and the same thing training daily or weekly on
your data, and you might want to abstract that away, because you're no
longer doing specific research on that part of the code. So there are
engines there, like an SGD engine, or an EASGD engine, which is elastic
averaging SGD. There are different kinds of engines there that make it
really easy for you to plug in the network and the optimiser, and it
will take care of the rest of the details. Meters and loggers are
basically just useful for more structured checkpointing. You can log
the output of your training scripts, like the current loss and so on,
to JSON files, or to plots, or to other things. And meters are
basically... there are accuracy meters that measure the raw accuracy in
classification, there's a loss meter that prints out the raw loss of
your current iteration, and so on. So have a look at torchnet if you
have already used Torch and you think you can structure your code a bit
better. Especially on the dataset side it's very elegant and very well
written, and it's basically been written and rewritten over several
months at Facebook. So that's torchnet.
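
The composition pattern described for the dataset module, where each wrapper takes a dataset and is itself a dataset, together with a simple meter, can be sketched like this. It is a Python analogue written for illustration, not torchnet's Lua API: the class names, the `size`/`get` interface and the fake one-dimensional "images" are assumptions. The parallel, multi-threaded wrapper mentioned in the talk is left out.

```python
import random


class ListDataset:
    """Base dataset: indexable samples with a known size."""
    def __init__(self, samples):
        self.samples = list(samples)
    def size(self):
        return len(self.samples)
    def get(self, index):
        return self.samples[index]


class TransformDataset:
    """Wraps a dataset and applies a function to every sample."""
    def __init__(self, dataset, transform):
        self.dataset, self.transform = dataset, transform
    def size(self):
        return self.dataset.size()
    def get(self, index):
        return self.transform(self.dataset.get(index))


class BatchDataset:
    """Wraps a dataset and groups consecutive samples into mini-batches."""
    def __init__(self, dataset, batch_size):
        self.dataset, self.batch_size = dataset, batch_size
    def size(self):
        return -(-self.dataset.size() // self.batch_size)  # ceiling division
    def get(self, index):
        start = index * self.batch_size
        stop = min(start + self.batch_size, self.dataset.size())
        return [self.dataset.get(i) for i in range(start, stop)]


class AverageValueMeter:
    """Keeps a running average of whatever values are added to it."""
    def __init__(self):
        self.total, self.count = 0.0, 0
    def add(self, value):
        self.total += value
        self.count += 1
    def value(self):
        return self.total / self.count


rng = random.Random(0)


def random_crop(row, width=4):
    offset = rng.randrange(len(row) - width + 1)
    return row[offset:offset + width]


# Ten fake one-dimensional "images" of eight pixels each.
images = ListDataset([list(range(i, i + 8)) for i in range(10)])
# A batch dataset of a crop dataset of an image dataset.
pipeline = BatchDataset(TransformDataset(images, random_crop), batch_size=4)

meter = AverageValueMeter()
for b in range(pipeline.size()):
    meter.add(len(pipeline.get(b)))

print(pipeline.size(), meter.value())
```

Because every wrapper exposes the same two methods, the stages can be reordered or reused across experiments without touching the training code.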
And that's pretty close to the conclusion of my talk, and I'll take the
rest of the time for questions. The last slide I have is what to do
next. The next steps, if you guys are interested in Torch: you can go
to torch.ch. There's a getting-started button on how to install Torch,
and there's also a tutorials page that gives you a few pointers on how
to get going with learning Torch, and there is also plenty of
documentation. Apart from that, just for this specific tutorial, three
notebooks were written, basically three tutorials: one to take you
through the autograd package, another to showcase how to do multi-GPU
training, and the last one takes a pre-trained residual network,
extracts features from it and does something with them, I don't
remember exactly what. But if you go to that URL... is it public? Okay.
If you go to that URL you will see all three notebooks, and you can
either execute them on your personal computer through the iTorch
notebook interface, or you can just read through them and copy-paste
code into your own scripts, for example. So thanks a lot for being
patient and listening to the whole day of courses. I'm actually
surprised that there are so many people left in the room. And yeah, I
thank all of you for coming, and if you have questions, feel free to
ask. No, no questions specifically? Oh well, maybe we can just move to
the, yeah, the questions and answers. Okay, except I see that Yoshua
has disappeared. Thanks again. Should we have a break? A fifteen-minute
break? Well, nobody will give

Conference program

Deep Supervised Learning of Representations
Yoshua Bengio, University of Montreal, Canada
4 July 2016 · 2:01 p.m.
Hardware & software update from NVIDIA, Enabling Deep Learning
Alison B Lowndes, NVIDIA
4 July 2016 · 3:20 p.m.
Day 1 - Questions and Answers
4 July 2016 · 4:16 p.m.
Torch 1
Soumith Chintala, Facebook
5 July 2016 · 10:02 a.m.
Torch 2
Soumith Chintala, Facebook
5 July 2016 · 11:21 a.m.
Deep Generative Models
Yoshua Bengio, University of Montreal, Canada
5 July 2016 · 1:59 p.m.
Torch 3
Soumith Chintala, Facebook
5 July 2016 · 3:28 p.m.
Day 2 - Questions and Answers
5 July 2016 · 4:21 p.m.
TensorFlow 1
Mihaela Rosca, Google
6 July 2016 · 10 a.m.
TensorFlow 2
Mihaela Rosca, Google
6 July 2016 · 11:19 a.m.
