Note: this content has been automatically generated.
00:00:00
Welcome back after that
00:00:36
really nice talk. I'm going to give my
00:00:42
last presentation on Torch, and this
00:00:46
might be a bit more interesting and
00:00:48
less dense. It's going to actually be
00:00:53
related to Yoshua's talk, where
00:00:56
we're going to do a complete example of
00:01:00
generative adversarial networks in
00:01:03
Torch. It's a very short example, in
00:01:06
total about a hundred and fifty lines
00:01:08
or so, so we'll cover that in the next
00:01:10
few slides. And then we'll
00:01:13
have some examples about
00:01:15
the autograd package, the stuff that
00:01:18
it can do that none of the other
00:01:21
packages can. And then finally I will
00:01:24
briefly go through the documentation of
00:01:27
torchnet, to basically give you an
00:01:31
overview of what torchnet is and
00:01:34
what it brings to the table. And the
00:01:37
last slide is going to be the next
00:01:41
steps: if you guys are interested in
00:01:43
Torch, after this
00:01:46
tutorial, where do you go from here? Okay, so
00:01:53
with that, to start: as Yoshua
00:01:58
showed, adversarial networks have a
00:02:03
generator and a discriminator. There
00:02:06
is some latent space that is the input
00:02:09
to the generator, so some random
00:02:12
noise for example. And that generates
00:02:15
some sample, in most cases for GANs
00:02:19
an image. And that image, along with
00:02:24
real images, goes into the
00:02:27
discriminator. And you have some loss
00:02:30
that classifies between real images and
00:02:33
fake images. So far so good,
00:02:36
pretty simple: you have two distinct
00:02:38
pieces, a generator and a discriminator,
00:02:41
and both of them are neural networks. And
00:02:43
the way the optimisation works is the
00:02:47
generator is optimising itself to fool
00:02:50
the discriminator, so the classification
00:02:54
loss should be really high when the
00:02:57
generator is optimising itself, and the
00:03:02
discriminator is optimising itself to
00:03:04
not get fooled by the generator, so the
00:03:07
classification loss should be really
00:03:08
low. Then you compute the gradients with
00:03:12
respect to the discriminator. So for
00:03:16
images it basically looks like this: some random
00:03:19
noise, as you see, is the input of the
00:03:22
generator; the generator
00:03:24
produces some generation that is sent to the
00:03:28
discriminator; you have some probability
00:03:30
of real or fake; and then you have
00:03:33
some log-likelihood cost where the ground
00:03:37
truth is real, so you're maximising
00:03:41
the discriminator's probability
00:03:43
of real. And separately, when you're
00:03:47
optimising the discriminator, you
00:03:50
have a mini-batch of samples of either
00:03:53
real or fake, and then you are purely
00:03:57
discriminating them according to
00:04:01
whatever the ground truth is. So to start
00:04:07
off: in the morning, if you recall or
00:04:11
if you remember, we discussed how
00:04:16
you make neural networks in Torch.
00:04:18
You have the nn package, in which you
00:04:23
can define nn modules. And you
00:04:26
have containers like Sequential, Concat
00:04:29
and Parallel, which can take in the
00:04:32
nn modules and compose them in
00:04:35
certain ways to construct a neural net.
00:04:37
So in this case the generator and
00:04:42
discriminator are going to be convnets
00:04:44
of a few layers. And the loss will be a
00:04:50
binary cross-entropy. So to start off,
00:04:55
some boilerplate: you load your packages,
00:04:57
you define some options that you can
00:04:59
read from the command line (this is just an
00:05:02
options parser), and you create your data
00:05:05
loader, similar to the one I showed in
00:05:09
the second lecture. And the data loader has
00:05:12
two functions: it returns the size
00:05:15
of the dataset, and also a function
00:05:17
called getBatch that will get a mini-batch
00:05:20
of samples. And the data loader internally
00:05:25
has multithreading, and it keeps
00:05:28
mini-batches ready for you as you need
00:05:31
them. And next comes the generator.
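The data loader interface described above (a size function, a get-batch function, and background threads that keep mini-batches ready) can be sketched in plain Python. This is only an illustration, not the Lua loader from the talk: the class name `PrefetchingLoader`, the `prefetch` parameter and the random sampling policy are all assumptions made for the example.

```python
import queue
import random
import threading


class PrefetchingLoader:
    """Toy loader: reports the dataset size and hands out mini-batches
    that a background thread prepares ahead of time (hypothetical names)."""

    def __init__(self, samples, batch_size, prefetch=4, seed=0):
        self.samples = list(samples)
        self.batch_size = batch_size
        self._rng = random.Random(seed)
        # Bounded queue: at most `prefetch` batches are kept ready.
        self._queue = queue.Queue(maxsize=prefetch)
        worker = threading.Thread(target=self._fill, daemon=True)
        worker.start()

    def _fill(self):
        # Runs in the background; put() blocks whenever the queue is full.
        while True:
            batch = self._rng.sample(self.samples, self.batch_size)
            self._queue.put(batch)

    def size(self):
        return len(self.samples)

    def get_batch(self):
        # Returns a batch that was (usually) prepared before it was requested.
        return self._queue.get()


loader = PrefetchingLoader(range(1000), batch_size=8)
batch = loader.get_batch()
print(loader.size(), len(batch))
```

The training loop only ever calls `size()` and `get_batch()`, so the threading stays hidden behind those two calls.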
00:05:36
So the generator itself is a convnet.
00:05:39
Its input is the latent space here. So the
00:05:42
generator is a Sequential, and it takes as
00:05:44
input the latent space. So in the first
00:05:47
layer you go from the latent space, say
00:05:50
nz, where nz is the number of dimensions in
00:05:54
the input space. Let's say that's about
00:05:56
a hundred: you go from a hundred
00:05:58
dimensions to some variable ngf,
00:06:03
which is configurable. So
00:06:07
you have ngf and ndf, two variables that
00:06:10
are configurable when you run it.
00:06:13
They are basically the number of filters
00:06:16
in the first convolutional layer, and
00:06:19
the number of filters that you
00:06:21
learn in the first
00:06:25
convolution layer for the discriminator.
00:06:27
So you construct convolutional layers
00:06:31
appropriately, with batch normalisation
00:06:34
and ReLU. This is how the deep
00:06:37
convolutional GANs are defined in the
00:06:39
paper that Yoshua actually showed
00:06:41
interpolations from. And you just stack
00:06:46
them for a few layers, and then the last
00:06:49
layer you make a Tanh, so that the
00:06:52
output layer is bounded between minus one
00:06:54
and one, because the
00:06:57
real images are also coming from a
00:06:59
distribution that is bounded between
00:07:01
minus one and one. This is just to preserve the
00:07:03
property and to make sure that the
00:07:05
discriminator is not
00:07:10
discriminating just based on the range
00:07:12
of the values: you give it a nonlinearity
00:07:16
that has similar bounds. So that's a
00:07:19
very simple definition: within a few
00:07:22
lines of code you define the generator.
00:07:24
The discriminator takes as input the
00:07:29
output of the generator. And it is also
00:07:32
very similarly structured: a few
00:07:35
convolutional layers, but instead of ReLU
00:07:37
you actually use a nonlinearity called
00:07:39
LeakyReLU. And the output layer
00:07:43
of this discriminator is a sigmoid. It
00:07:47
goes from zero to one, and if it is zero
00:07:51
you expect that the discriminator
00:07:55
thinks that the sample is
00:07:57
fake. And if it's one, the discriminator thinks
00:08:00
it's real. And you just have a last
00:08:03
reshaping here because these layers
00:08:07
are all convolutional, so the output will be
00:08:10
three-dimensional, and just to pass it
00:08:13
into the cost function you want it to
00:08:15
be a single vector. So very simple. And
00:08:21
the loss you define in a single line:
00:08:24
the loss is BCECriterion, which,
00:08:28
if you look at the documentation, is the
00:08:30
binary cross-entropy criterion. And the
00:08:34
optimisation algorithm we use is Adam,
00:08:38
and so we define the learning rate and
00:08:41
the beta parameter, and it's also
00:08:47
very simple there. And you create
00:08:50
some buffers to
00:08:56
push onto the GPU; you
00:08:58
preallocate some GPU buffers. You do
00:09:00
this because on the GPU doing
00:09:03
allocations and frees is fairly
00:09:06
expensive. Unless you write yourself a
00:09:08
custom allocator, it's better to just preallocate
00:09:11
the buffers that you need, like
00:09:13
the input buffer, the noise
00:09:16
buffer and the label buffer, and then
00:09:18
reuse them again and again, just for
00:09:21
performance reasons. And so here
00:09:24
you see, if the GPU is used, you
00:09:27
move all of them to the GPU using
00:09:31
the :cuda() call, as I showed in
00:09:34
the second lecture. And finally, to use
00:09:38
the optim package you flatten the
00:09:40
parameters using these calls. parametersD
00:09:44
is now a vector that contains all the
00:09:47
parameters of your discriminator
00:09:50
network in a single vector, and so that
00:09:52
can be passed into any of the optim
00:09:55
algorithms. And parametersG is also a
00:09:58
single vector that contains the
00:09:59
flattened parameters of the generator.
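To make the idea of flattened parameters concrete, here is a minimal sketch in plain Python, assuming each layer's weights are just a list of floats. The helper names `flatten` and `unflatten` are hypothetical. Note one difference: this sketch copies values, whereas the flattened vector described in the talk refers to the network's own parameters.

```python
def flatten(per_layer):
    """Concatenate per-layer lists into one flat list, remembering sizes."""
    flat, sizes = [], []
    for layer in per_layer:
        flat.extend(layer)
        sizes.append(len(layer))
    return flat, sizes


def unflatten(flat, sizes):
    """Split a flat list back into per-layer lists."""
    out, start = [], 0
    for n in sizes:
        out.append(flat[start:start + n])
        start += n
    return out


layer_weights = [[0.5, -0.2], [0.1], [1.0, 2.0, 3.0]]
layer_grads = [[0.1, 0.1], [0.2], [0.0, -1.0, 1.0]]

params, sizes = flatten(layer_weights)
grads, _ = flatten(layer_grads)

# A generic optimiser only needs the two flat vectors, here one SGD step.
learning_rate = 0.5
params = [p - learning_rate * g for p, g in zip(params, grads)]
updated = unflatten(params, sizes)
print(updated)
```

Because the optimiser sees a single vector, the same update code works for any network shape.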
00:10:01
And similarly gradParametersD and
00:10:04
gradParametersG are the
00:10:08
gradients with respect to each of the
00:10:11
parameters; each is a vector of the
00:10:13
same size as the parameter vector. And
00:10:19
if you recall, there's one more step to
00:10:22
train your neural network, and that is to
00:10:25
define your closures for the
00:10:30
optim package. Now for the adversarial
00:10:33
network: so, in the
00:10:36
previous example I showed, in the
00:10:38
supervised classifier, you only have a
00:10:41
single closure that you have to define,
00:10:44
because you're just optimising to
00:10:46
minimise the loss and that's about it.
00:10:48
But in the adversarial network you
00:10:49
actually have two optimisations that are
00:10:52
alternating one after the other. The
00:10:54
first one is you optimise the
00:10:56
discriminator. So you create a closure
00:10:59
to evaluate f(x) and df/dx
00:11:03
of the discriminator. And if you
00:11:07
recall, in the discriminator you want to
00:11:09
optimise for the real samples to
00:11:15
be classified as real and
00:11:20
the generated samples to be classified
00:11:22
as fake. But then one slight
00:11:27
subtlety here, especially in the DCGAN
00:11:29
architecture, is that you don't
00:11:31
want the batch normalisation of the
00:11:34
real and the fake to interact, because
00:11:36
the batch statistics of the real samples
00:11:39
will initially be different from the
00:11:42
batch statistics of the fake samples.
00:11:45
And that's one simple feature that's
00:11:47
enough for the discriminator to get
00:11:49
really good, and the adversarial game
00:11:51
is never played. So what you do here is
00:11:56
you get a real batch of samples from
00:11:58
your data loader, you copy them onto the
00:12:01
GPU, and you compute the gradients with
00:12:05
respect to this real sample mini-batch.
00:12:09
The label you fill for this is the
00:12:12
real label, because you're optimising the
00:12:15
discriminator; everything is exactly as
00:12:17
intended. Next you want to optimise with
00:12:21
respect to the fake samples. So you
00:12:23
create the uniform noise, you send that
00:12:26
to your generator, you get the fake
00:12:28
samples, and you copy those. They're already
00:12:32
on the GPU but they're in a
00:12:34
different buffer, so you just copy them
00:12:35
into this buffer, and then you
00:12:40
optimise with respect to the fake
00:12:42
samples. Keep in mind this line
00:12:47
here, the label: the label for the real
00:12:49
samples is real, and the label for the
00:12:51
fake samples is fake. In a second
00:12:53
you'll see how it changes for the generator.
00:12:55
And the total
00:12:59
loss of your discriminator is the sum
00:13:01
of the losses of the real and the fake,
00:13:04
and then you return the D loss and the
00:13:06
gradients with respect to the
00:13:08
discriminator. Next you define another
00:13:12
closure, the closure to evaluate the
00:13:14
generator and the gradients with respect
00:13:17
to the generator. Over here you don't
00:13:22
have a real part; you have no real samples.
00:13:26
So what you do is you get some noise
00:13:31
vector, you pass it through the
00:13:36
generator, you get the fake samples, and
00:13:38
now, because you're optimising for the
00:13:40
generator to fool the discriminator, you
00:13:43
actually fill these labels to be real
00:13:47
labels, which means that the
00:13:49
discriminator is supposed to think that
00:13:51
it's real. So that's what you're
00:13:53
optimising for. And then you send these
00:13:56
through the discriminator, and then you
00:13:58
compute the gradients with respect to
00:14:01
the generator. Finally, you just
00:14:07
return the loss and the gradients with respect to the
00:14:10
generator. And finally, your
00:14:16
training loop here is very short:
00:14:21
for as many epochs as you define,
00:14:24
for the number of mini-batches per
00:14:28
epoch, you update the discriminator
00:14:30
network, basically maximising the log
00:14:32
likelihood of the discriminator
00:14:37
with respect to real samples and with
00:14:42
respect to fake samples. And then you
00:14:44
update the generator network, where
00:14:48
it's just optimising for the discriminator
00:14:50
to do badly. And then after every
00:14:55
epoch you save the models to disk
00:14:59
to reuse them later. That's about it: in
00:15:03
about a hundred and fifty-eight lines you
00:15:05
actually created your adversarial
00:15:08
network, and you can train it.
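The only difference between the two closures described above is how the labels are filled. A minimal sketch of the two losses in plain Python, assuming the discriminator outputs are already available as probabilities; the function names are hypothetical and this is not the Lua code from the slides:

```python
import math

REAL, FAKE = 1, 0


def bce(prob, label):
    """Binary cross-entropy for one probability in (0, 1) and a 0/1 label."""
    eps = 1e-12  # clamp to avoid log(0)
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def discriminator_loss(d_on_real, d_on_fake):
    # Discriminator step: real samples labelled REAL, generated ones FAKE.
    real_term = sum(bce(p, REAL) for p in d_on_real) / len(d_on_real)
    fake_term = sum(bce(p, FAKE) for p in d_on_fake) / len(d_on_fake)
    return real_term + fake_term


def generator_loss(d_on_fake):
    # Generator step: the same fake samples, but labelled REAL, so the loss
    # is low only when the discriminator is fooled.
    return sum(bce(p, REAL) for p in d_on_fake) / len(d_on_fake)


d_real = [0.9, 0.8]   # discriminator outputs on a real mini-batch
d_fake = [0.2, 0.1]   # discriminator outputs on a generated mini-batch
print(discriminator_loss(d_real, d_fake), generator_loss(d_fake))
```

With these example numbers the discriminator loss is small and the generator loss is large, which is the situation the generator update then tries to reverse.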
00:15:12
So what does it look like in terms of
00:15:15
training, just out of interest?
00:15:20
This is how your generations initially
00:15:22
look: just random noise. But over
00:15:25
time they start improving
00:15:29
rapidly, and they start getting better
00:15:34
and better. And as you see, it happens
00:15:37
really quickly, going from random
00:15:39
samples to actual generations. But it
00:15:43
still improves over a few epochs, and
00:15:47
it's still improving. And usually
00:15:52
within an epoch you will see whether the
00:15:55
adversarial game plays itself correctly
00:15:58
or not. And as Yoshua did
00:16:01
mention in his lecture, these are very
00:16:03
unstable methods, adversarial networks. So
00:16:07
there are a few ways in which you can
00:16:09
monitor whether they are working or
00:16:13
not. Looking at the samples visually
00:16:17
is one of them, but it doesn't scale.
00:16:21
One of the other methods is that you
00:16:24
have several generators and
00:16:25
discriminators training in parallel, and
00:16:28
then you validate each generator
00:16:31
against a discriminator it's not
00:16:32
training against, and you have a metric
00:16:35
to monitor that. So yeah, these
00:16:40
are all fake images from the same
00:16:43
latent vector; over time these
00:16:46
samples improve. To show it again:
00:16:50
initially the samples are pretty poor,
00:16:53
and then over time they
00:16:56
start getting more and more compelling. And
00:16:59
at that time the
00:17:04
discriminator loss is usually hovering
00:17:07
around... the classification accuracy of
00:17:10
the discriminator is hovering around point
00:17:12
five, that is, the discriminator
00:17:16
cannot tell whether these samples are
00:17:20
real or fake. But still, as you see,
00:17:24
the samples are fairly weird. For some
00:17:28
of the best models we got for the paper
00:17:31
we tuned them, the architecture and
00:17:34
the learning rate; but if you just
00:17:36
put together something, these GANs go
00:17:39
pretty far, but you do see some
00:17:43
weirdness in the generations, especially
00:17:45
some wobbliness and so on. So
00:17:50
how do you generate new samples
00:17:52
from the checkpoint that you just
00:17:55
saved to disk? It's very simple, it's a
00:17:58
single line of code: you just create
00:18:01
some new noise vector, whatever
00:18:03
noise you want, and then you forward it
00:18:06
through the generator network and you get
00:18:09
images. There are different kinds of
00:18:12
noise: you can just fill it with some
00:18:15
uniform noise, simple enough. But you
00:18:19
can also do slightly fancier
00:18:22
things. In
00:18:28
Yoshua's lecture he showed these
00:18:30
interpolations in latent space from
00:18:33
one place to another. So you can do
00:18:36
similar things, where you have
00:18:40
some point and another point. You
00:18:44
take two points, in this case
00:18:47
L and R, which just mean left and right,
00:18:50
and then you compute a linear
00:18:54
interpolation of points, with some
00:18:57
granularity, between the left
00:19:00
point and the right point. And you copy
00:19:02
those into the noise vectors that
00:19:07
you created. So basically
00:19:11
you go over each entry of the
00:19:16
mini-batch of noise, and then you find some
00:19:20
intermediate point between left and
00:19:22
right, you put that in the noise
00:19:25
mini-batch, and then you give this noise
00:19:27
mini-batch to the network. And that
00:19:32
will generate images where each image in
00:19:34
this mini-batch is basically going
00:19:39
from left to right, starting from point
00:19:41
A and ending at point B. And if you just
00:19:48
train this network for like one or two
00:19:50
epochs the generations aren't
00:19:52
perfect. But you still see it.
00:19:57
I just generated them
00:20:01
really quickly, but you see that this is
00:20:04
the generation from a point to the left,
00:20:07
this is a generation from a point to the right.
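The interpolation step described above can be sketched in plain Python. This is an illustrative stand-in for the Lua code on the slide: the names `noise_l`, `noise_r` and `interpolate` are assumptions, and the generator network itself is left out.

```python
import random


def interpolate(left, right, steps):
    """Return `steps` vectors going linearly from `left` to `right` (steps >= 2)."""
    batch = []
    for i in range(steps):
        t = i / (steps - 1)  # 0.0 at the left point, 1.0 at the right point
        batch.append([(1 - t) * l + t * r for l, r in zip(left, right)])
    return batch


rng = random.Random(0)
nz = 100  # latent dimensionality, as in the talk
noise_l = [rng.uniform(-1, 1) for _ in range(nz)]
noise_r = [rng.uniform(-1, 1) for _ in range(nz)]

# One mini-batch of noise: each row is one intermediate point, so forwarding
# the batch through a trained generator would give one image per step.
noise_batch = interpolate(noise_l, noise_r, steps=8)
print(len(noise_batch), len(noise_batch[0]))
```

The `steps` argument plays the role of the granularity mentioned in the talk: more steps give a smoother walk between the two images.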
00:20:09
And as you're walking
00:20:15
through, the image slowly changes over time
00:20:19
to go from one bedroom to another.
00:20:22
These are supposed to be bedrooms, but
00:20:24
because I only trained them for one epoch
00:20:26
they're not very good. But the images
00:20:30
that Yoshua showed from the paper
00:20:31
have been trained for several
00:20:35
epochs, and they're much nicer. The last thing
00:20:40
I want to show is the arithmetic demo,
00:20:47
where you can basically choose certain
00:20:51
images and do arithmetic in latent
00:20:53
space. You can choose certain images that have
00:20:56
certain properties in latent space and do
00:20:58
some kind of A plus B minus C, that
00:21:00
kind of thing, and then do some
00:21:03
generations. And for that I will pull
00:21:08
up a terminal. That's the terminal,
00:21:12
okay. So I just ran the arithmetic
00:21:23
generation script, and it just printed
00:21:26
out the network. And then it says: we
00:21:30
will be doing vector arithmetic A minus B
00:21:33
plus C; choose three images for A. The
00:21:38
reason you need to do this
00:21:41
arithmetic over a few images is because
00:21:44
doing it over a single image doesn't
00:21:47
accurately capture the property
00:21:49
that you're trying to capture. So it
00:21:52
gave me a few choices.
00:21:58
Unless you guys have a preference I
00:22:02
will pick some. So A minus B plus C:
00:22:06
I'll try to pick someone
00:22:12
smiling for A, and then we will see what
00:22:15
to do for B and then C. So maybe a
00:22:21
woman smiling, let's pick four, and
00:22:29
another woman smiling, forty-eight, and
00:22:37
the third one, let's see... okay, this, three,
00:22:51
sort of almost smiling. And then now for
00:22:55
B let's try to remove, for example, the
00:23:01
woman and then maybe add a man. Or you can
00:23:05
remove smiling and add a neutral
00:23:07
expression. Or, yeah, let's try to keep
00:23:13
the smile: let's try to remove the
00:23:16
smiling woman and then add a man. I
00:23:23
don't know, I don't have lots of ideas, but
00:23:25
we'll see. If you
00:23:28
have good ideas, let me know. Okay, so
00:23:34
there is woman smiling; so we remove the
00:23:36
woman, and then it'll be man smiling.
00:23:40
And then we can add sunglasses...
00:23:42
moustache? Okay, that's a good idea, as
00:23:45
long as we have enough generations of
00:23:49
moustached men. So we need to remove
00:23:54
women that are neutral, because you
00:23:58
want to still keep the smile. Okay,
00:24:04
sixty-three at the bottom, a very neutral
00:24:07
expression. If anyone spots something faster,
00:24:14
let me know. Last one... oh yeah, sixty-four,
00:24:24
and one not smiling at all. Okay, first
00:24:35
one. Okay, now the last one: we have to
00:24:43
choose C, so now we have to add men with
00:24:48
moustaches, as someone requested.
00:24:56
Yeah, and that's... five qualifies for the
00:25:01
moustache? Not twenty-four... twenty-four
00:25:03
has a moustache? I can't hear you, if you
00:25:13
said that. Thirty-one? Thirty-one doesn't
00:25:20
have a moustache. Thirty-four? Oh sorry,
00:25:24
fifty-four. Oh my god. Fifty-four.
00:25:32
Let's see what happens. So what
00:25:38
you will see is that it generated a man
00:25:45
with a moustache who is also
00:25:47
smiling. And if you look at the code for
00:25:52
this it's fairly simple:
00:25:55
all you do is literally take the noise
00:25:58
vector and... let's see, can I pull
00:26:05
this up? Actually I think the internet is not
00:26:08
working. Okay, I think we have all that.
00:26:14
So if you look at the code for this,
00:26:19
most of the code is
00:26:25
simply choosing the images. And
00:26:28
finally the vector arithmetic itself is
00:26:33
in here, in this line: you take the A
00:26:37
average, you take the B average and the C
00:26:40
average, and then the final
00:26:42
noise is the average of A minus the average
00:26:45
of B plus the average of C, and then you give
00:26:47
that to the net. And then
00:26:53
the rest of the logic is just
00:26:54
drawing the images and stuff like that.
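The arithmetic step described above (average each group of latent vectors, then compute A minus B plus C) fits in a few lines. A minimal sketch in plain Python with made-up two-dimensional vectors; the function names are hypothetical and the generator call is omitted:

```python
def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def latent_arithmetic(a_vecs, b_vecs, c_vecs):
    """Return avg(A) - avg(B) + avg(C), the noise vector fed to the generator."""
    a = mean_vector(a_vecs)
    b = mean_vector(b_vecs)
    c = mean_vector(c_vecs)
    return [x - y + z for x, y, z in zip(a, b, c)]


# Toy latent vectors standing in for the three images picked for each group.
group_a = [[1.0, 2.0], [3.0, 4.0], [2.0, 3.0]]
group_b = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
group_c = [[0.0, 10.0], [0.0, 10.0], [0.0, 10.0]]

final_noise = latent_arithmetic(group_a, group_b, group_c)
print(final_noise)
```

Averaging over several picks per group is what the talk motivates: a single image does not capture the property cleanly.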
00:26:57
So extremely simple and very natural
00:27:02
to do as well. I need internet, though,
00:27:06
for my next part... I'll figure it out. Okay,
00:27:12
alright. So that's GANs: we
00:27:17
basically went through a full
00:27:19
example of GANs, and we looked at how
00:27:23
to train them, how to generate, how to do
00:27:26
latent space interpolations, how to do
00:27:29
latent space arithmetic. Next we
00:27:34
go to autograd. I want to just
00:27:37
briefly talk about this package. autograd,
00:27:40
as I mentioned in the morning, is a
00:27:43
package released by Twitter that
00:27:48
computes your gradients with respect to
00:27:50
a function in a very nice way that's
00:27:53
unique to autograd at the moment, which
00:27:56
is that it has a tape recorder: it records
00:28:00
all the operations you did in
00:28:03
the forward phase, while you compute the
00:28:05
function. And in the backward phase it
00:28:08
replays the tape backwards, and then for
00:28:11
each of the operations it has
00:28:15
gradients defined, for every single
00:28:17
operation in torch and in
00:28:23
Lua. So it works really well. So as a
00:28:27
basic example you can have a function
00:28:30
here of variables a, b and c.
00:28:34
autograd always computes the
00:28:38
gradient with respect to the first
00:28:39
variable of your function. It can
00:28:42
either be a single variable or it can
00:28:44
be a table of variables, for example if
00:28:47
you have a neural network with
00:28:48
multiple weight matrices. So you
00:28:53
define the function, and then you call
00:28:56
autograd on the function, and that
00:28:58
returns another function that is the
00:29:00
backward of the first function. So it
00:29:04
executes the first function, records
00:29:07
what happened, and then returns a
00:29:09
function. And then when you evaluate
00:29:12
the derivative function
00:29:16
at a particular point, you
00:29:21
get the derivatives and the
00:29:26
value of the function itself. And I
00:29:29
just printed out the value there, and
00:29:31
the gradient with respect to a here is
00:29:34
just one, because a doesn't have any
00:29:38
terms around it. And then,
00:29:45
for more interesting examples,
00:29:47
conditionals: in autograd
00:29:51
you can for example have a
00:29:53
function like this, where at
00:29:58
runtime you
00:30:02
can condition on some variables in your
00:30:03
function, and this is something that
00:30:05
you can't do in
00:30:08
other differentiation
00:30:10
packages. Here I say: if b is greater
00:30:13
than c, then return this function; if
00:30:17
not, return this function. And depending
00:30:20
on how I choose my b and c variables,
00:30:23
the gradients are appropriately changed.
00:30:27
Next I want to... yeah? Probably. I
00:30:46
don't know, actually; we could try. I
00:30:51
would expect that it records
00:30:54
which path it took in the forward phase and
00:30:58
basically sticks to that
00:31:00
path in the backward phase, because
00:31:05
from a code perspective that's
00:31:07
exactly what it does: it records that
00:31:09
you went through, for example, if b is
00:31:12
close to c but b is just slightly
00:31:15
greater than c, then it goes through
00:31:18
this code path here and it computes
00:31:20
this function. So in the backward phase it
00:31:22
computes with respect to the same
00:31:25
function. And I think that's... yeah, it's a
00:31:30
good question. I don't actually know
00:31:34
if it can do anything fancier, because
00:31:36
it is basically, as I said, a
00:31:38
tape-based differentiator. And next I want to
00:31:42
show an example of while loops. So if you
00:31:47
see here, I have a function
00:31:49
there. Now the
00:31:54
function has a dependency, in the
00:31:58
while loop, on b being greater than c,
00:32:02
and as long as b is greater than c you
00:32:07
keep computing this function and
00:32:10
adding it to your function value. Yeah?
00:32:21
Backward defined? So the question
00:32:35
that was asked is: if some of
00:32:42
these operators defined in this
00:32:43
function are not nn modules but
00:32:48
your own functions, then what happens.
00:32:51
If it is a torch function,
00:32:55
for example, the gradient is defined for
00:32:58
all torch functions and all Lua
00:33:01
operators, but if it's none of those
00:33:04
then there is a small place in the
00:33:06
autograd package itself where you
00:33:09
can add the gradient with respect to
00:33:12
your function, and there's no way
00:33:14
around it, essentially. But your
00:33:17
function can be a Lua function. I'm
00:33:20
talking about if you
00:33:23
write your own custom function in C, for
00:33:24
example, and you call that: the
00:33:27
autograd package has no way of
00:33:29
knowing what you're doing there. So
00:33:33
yeah, in those cases, for an opaque
00:33:36
function whose internals it can't
00:33:38
inspect, it has to be told what the
00:33:42
backward is. So yeah, this is an
00:33:48
example I wanted to show, where you
00:33:50
can actually backprop through a while
00:33:53
loop as well. So, the last thing, a
00:33:58
last feature that autograd has is that
00:34:01
it actually has
00:34:03
dependency checks. So let's say you
00:34:06
have a function where nothing in
00:34:09
this function actually depends on a.
00:34:11
For example, in this case, if I
00:34:15
make b lesser than c, then the
00:34:18
function is just the value zero; it's
00:34:20
a constant function. And in that case
00:34:23
it sees that there is nothing that
00:34:27
depends on a, so it just generates an
00:34:30
error that the gradient
00:34:36
has no dependency on a. So autograd
00:34:41
is not for every kind of research, I
00:34:43
think, but it has its clear
00:34:45
advantages for certain kinds of
00:34:46
research. If you have dynamic graphs,
00:34:50
just as an example, if you
00:34:54
want to backprop through some algorithm
00:34:58
like Viterbi, for example: in
00:35:00
the forward phase you take a certain path,
00:35:02
and in the backward phase you want to
00:35:04
basically backprop through the same path,
00:35:07
just through the maximum path, for
00:35:10
example. So in that case every forward
00:35:14
phase might give you a different dynamic
00:35:15
graph, depending on which directions you
00:35:17
take, and in the backward phase you
00:35:19
would want to be able to compute this
00:35:22
easily. Usually in networks like graph
00:35:26
transformer networks, writing this gradient
00:35:30
efficiently is fairly complicated.
00:35:33
And if you want to, for example, do
00:35:34
research and prototype these things out,
00:35:37
autograd is actually a very nice package.
00:35:39
And you don't have to use autograd
00:35:41
independently by itself: you can
00:35:43
actually use autograd in conjunction
00:35:45
with nn. So you can make an
00:35:47
autograd module out of an autograd
00:35:50
function, and that can be plugged into
00:35:55
an nn network, like an
00:35:58
nn container for example, and it
00:36:00
plays really well with the rest of
00:36:02
the network that you already have. So
00:36:07
yeah, to summarise: it does really well
00:36:09
on dynamic graphs, it has
00:36:11
auto-differentiated nn modules, it's
00:36:13
tape-based automatic differentiation.
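To illustrate what tape-based differentiation means, here is a toy reverse-mode differentiator in plain Python. It is a sketch of the general technique only, not the autograd package's implementation or API: `Var`, `grad_of`, `branchy` and `loopy` are hypothetical names, only `+` and `*` are supported, and the dependency check mentioned in the talk is left out. As in the talk, the returned function gives the gradient with respect to the first argument together with the function value.

```python
class Var:
    """A number that records every operation applied to it on a shared tape."""

    def __init__(self, value, tape):
        self.value = value
        self.tape = tape
        self.grad = 0.0

    def _record(self, value, parents):
        out = Var(value, self.tape)
        # Each tape entry: (output, [(input, d output / d input), ...]).
        self.tape.append((out, parents))
        return out

    def __add__(self, other):
        return self._record(self.value + other.value,
                            [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return self._record(self.value * other.value,
                            [(self, other.value), (other, self.value)])


def grad_of(fn):
    """Wrap fn so it returns (d fn / d first argument, value of fn)."""
    def wrapped(*args):
        tape = []
        variables = [Var(float(a), tape) for a in args]
        out = fn(*variables)            # forward phase: operations get recorded
        out.grad = 1.0
        for node, parents in reversed(tape):   # backward phase: replay the tape
            for parent, local in parents:
                parent.grad += local * node.grad
        return variables[0].grad, out.value
    return wrapped


def branchy(a, b, c):
    # Ordinary control flow decides which operations end up on the tape.
    if b.value > c.value:
        return a * b
    return a * c


def loopy(a, b, c):
    # The number of recorded operations depends on the runtime values.
    total, steps = a, b.value
    while steps > c.value:
        total = total + a * b
        steps -= 1.0
    return total


print(grad_of(branchy)(2.0, 3.0, 1.0))  # (3.0, 6.0): took the a * b branch
print(grad_of(branchy)(2.0, 1.0, 5.0))  # (5.0, 10.0): took the a * c branch
print(grad_of(loopy)(2.0, 3.0, 1.0))    # (7.0, 14.0): loop body ran twice
```

Because only the operations that actually ran are on the tape, the backward pass automatically follows the same branch and the same number of loop iterations as the forward pass.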
00:36:16
And it is only about thirty percent
00:36:20
slower than doing the regular nn,
00:36:24
and this is the cost that comes
00:36:27
with autograd. But it's just a
00:36:30
small constant factor; thirty percent,
00:36:32
for the fact that this whole thing is
00:36:35
done dynamically, is actually pretty
00:36:37
reasonable. So that's autograd, to
00:36:42
conclude that. And the last thing I
00:36:44
wanted to show is torchnet, as soon as
00:36:47
I can figure out... okay, my internet is
00:36:50
back. So, going to torchnet: I did
00:36:57
not have enough time to write slides
00:37:00
for torchnet, but through the documentation itself I
00:37:04
want to take you guys through what
00:37:08
patterns of computation torchnet
00:37:10
captures. torchnet actually has
00:37:14
four parts. Just to recall, torchnet
00:37:17
is a framework released by Facebook
00:37:19
recently. It makes it much easier
00:37:23
for you to do common computations,
00:37:27
like for example data loading and
00:37:30
training and testing, and doing some
00:37:34
kind of logging and so on. So torchnet
00:37:38
has four modules. Dataset, which
00:37:40
is basically all the multithreaded data
00:37:44
loading, loading from images, text,
00:37:47
video; all this is abstracted away as
00:37:49
nice datasets for you. As long as
00:37:53
your data is in a certain
00:37:54
format, it will do the most efficient
00:37:57
data loading for you. And the dataset
00:37:59
package also has data
00:38:04
augmentation modules that can be
00:38:06
chained together. So for example you
00:38:09
can create a dataset, the standard image
00:38:13
dataset, and then plug it into a
00:38:18
crop image dataset that will randomly
00:38:20
crop images, and then you can chain
00:38:23
it to a batch dataset. Now you have a
00:38:26
batch dataset of a crop image dataset
00:38:28
of an image dataset, so you can compose
00:38:30
these things, and it runs like
00:38:32
a pipeline. And then you can
00:38:37
finally put that inside a parallel dataset,
00:38:40
which makes the whole thing
00:38:42
multithreaded. So you can compose these
00:38:44
things very nicely; you can reuse a lot
00:38:47
of this functionality
00:38:49
across many of your experiments. And so
00:38:52
the dataset part of torchnet is
00:38:56
actually very, very powerful. And the
00:38:59
engine is basically a way to
00:39:04
abstract away the training loop, the
00:39:07
part that I said we
00:39:11
as researchers want to write again and
00:39:13
again. But there are cases, for example
00:39:16
when you build production pipelines, where
00:39:19
it's the same network and the same thing
00:39:21
training daily or weekly on your data,
00:39:25
where you might want to abstract that away
00:39:28
because you're no longer doing
00:39:30
specific research in that part of the
00:39:31
code. So there are engines there, like the
00:39:36
SGD engine or the EASGD engine, which is
00:39:38
elastic averaging SGD. There are
00:39:41
different kinds of engines there that
00:39:44
make it really easy for
00:39:48
you to plug in the network and the
00:39:50
optimiser, and it'll take care of the
00:39:52
rest of the details. Meters and
00:39:55
loggers are basically just useful for
00:39:58
more structured
00:40:00
checkpointing. You can log to
00:40:02
JSON files, you can log to just
00:40:04
a plain log; you can log the output of
00:40:06
your training scripts, like the current
00:40:09
loss and so on, to JSON files or to
00:40:12
plots or to other things. And meters:
00:40:16
basically there are accuracy meters
00:40:19
that measure the raw accuracy in
00:40:22
classification, there's a loss
00:40:24
meter that prints out the raw loss of
00:40:27
your current iteration, and so on. So
00:40:30
have a look at torchnet if you
00:40:33
already use Torch and you
00:40:37
think you can structure your
00:40:40
code a bit better. Especially on the
00:40:42
dataset side it's very elegant and very
00:40:45
well written, and it's basically been
00:40:48
written and rewritten over several
00:40:50
months at Facebook. So that's torchnet.
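
Editor's sketch: the dataset chaining and meter ideas described above can be illustrated in a few lines of plain Python. This is not the torchnet API (torchnet is a Lua library); every class name here (`ListDataset`, `RandomCropDataset`, `BatchDataset`, `AverageValueMeter`) is a hypothetical stand-in, "images" are just lists of numbers, and the multi-threaded parallel wrapper mentioned in the talk is left out.

```python
import random


class ListDataset:
    """Hypothetical base dataset: wraps an in-memory list of samples."""

    def __init__(self, samples):
        self.samples = list(samples)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


class RandomCropDataset:
    """Wraps another dataset and returns a random contiguous crop of each sample."""

    def __init__(self, dataset, crop_size, seed=0):
        self.dataset = dataset
        self.crop_size = crop_size
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        sample = self.dataset[idx]
        start = self.rng.randint(0, len(sample) - self.crop_size)
        return sample[start:start + self.crop_size]


class BatchDataset:
    """Wraps another dataset and groups consecutive samples into batches."""

    def __init__(self, dataset, batch_size):
        self.dataset = dataset
        self.batch_size = batch_size

    def __len__(self):
        # Round up so the last, possibly smaller, batch is kept.
        return (len(self.dataset) + self.batch_size - 1) // self.batch_size

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        start = idx * self.batch_size
        stop = min(start + self.batch_size, len(self.dataset))
        return [self.dataset[i] for i in range(start, stop)]


class AverageValueMeter:
    """Hypothetical meter: keeps a running average of the values added to it."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def add(self, value):
        self.total += value
        self.count += 1

    def value(self):
        return self.total / self.count if self.count else float("nan")


# A batch dataset of a crop dataset of a base dataset, composed like a pipeline.
images = ListDataset([list(range(10)) for _ in range(5)])
pipeline = BatchDataset(RandomCropDataset(images, crop_size=4), batch_size=2)
batches = [pipeline[i] for i in range(len(pipeline))]

# Track a per-batch statistic (here simply the batch size) with the meter.
meter = AverageValueMeter()
for batch in batches:
    meter.add(len(batch))
```

Each wrapper only relies on the wrapped object having a length and integer indexing, which is what lets the stages be reordered or reused across experiments.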
00:40:56
And that's pretty close to the
00:40:58
conclusion of my talk, and I'll take the
00:41:02
rest of the time for questions. The last
00:41:04
slide I have is what to do next.
00:41:13
The next steps, if you guys are
00:41:14
interested in Torch: you can go to
00:41:19
torch.ch; there's a getting
00:41:21
started button on how to install
00:41:24
Torch, and there's also a tutorials
00:41:27
page that gives you a few
00:41:31
pointers on how to
00:41:35
go about learning Torch. There's also
00:41:38
plenty of documentation. Apart from
00:41:41
that [unclear], just for this
00:41:43
specific tutorial, I wrote three iTorch
00:41:46
notebooks, basically three tutorials:
00:41:48
one to take you through the autograd
00:41:52
package, another to showcase how to
00:41:58
do multi-GPU training, and the last one
00:42:01
is to take a pre-trained residual
00:42:04
network and extract features from that
00:42:08
and do something with that; I don't
00:42:11
remember what it did. But if you go
00:42:14
to that URL... is it public? It's okay? Okay,
00:42:17
if you go to that URL you will see
00:42:21
all three notebooks and you can either
00:42:25
execute them on your personal computer
00:42:28
through the iTorch notebook
00:42:30
interface, or you can basically
00:42:35
just read through them and
00:42:39
copy-paste code into your own
00:42:40
scripts, for example. So thanks a lot for
00:42:46
being patient and listening to the
00:42:49
whole day of Torch. I'm actually
00:42:51
surprised that there are so many people
00:42:52
left in the room. And yeah, I thank all
00:42:56
of you for coming, and if you have
00:42:58
questions, feel free to ask. No, no questions
00:43:17
specifically? Oh well, maybe we can just
00:43:19
move to the panel. Yeah, these are the
00:43:21
questions and... okay, except I
00:43:23
see that Yoshua has disappeared.
00:43:26
Thanks again. Should we
00:47:38
have a break? Fifteen minutes
00:47:41
break, shouldn't we? Well, nobody will give



Conference program

Deep Supervised Learning of Representations
Yoshua Bengio, University of Montreal, Canada
4 July 2016 · 2:01 p.m.
Hardware & software update from NVIDIA, Enabling Deep Learning
Alison B Lowndes, NVIDIA
4 July 2016 · 3:20 p.m.
Day 1 - Questions and Answers
Panel
4 July 2016 · 4:16 p.m.
Torch 1
Soumith Chintala, Facebook
5 July 2016 · 10:02 a.m.
Torch 2
Soumith Chintala, Facebook
5 July 2016 · 11:21 a.m.
Deep Generative Models
Yoshua Bengio, University of Montreal, Canada
5 July 2016 · 1:59 p.m.
Torch 3
Soumith Chintala, Facebook
5 July 2016 · 3:28 p.m.
Day 2 - Questions and Answers
Panel
5 July 2016 · 4:21 p.m.
TensorFlow 1
Mihaela Rosca, Google
6 July 2016 · 10 a.m.
TensorFlow 2
Mihaela Rosca, Google
6 July 2016 · 11:19 a.m.
