Note: this content has been automatically generated.
00:00:00
[Announcements about the lunch break and the tables; mostly inaudible.]
00:00:22
Okay, now comes the more boring
00:00:29
talk. It's a bit more hands-on,
00:00:34
it's a deeper dive, and I don't expect
00:00:37
it to be as interesting for most of
00:00:40
you, so I won't be surprised if I see some
00:00:44
sleeping people. All right, so in this
00:00:49
talk we're going to talk about two things.
00:00:52
The first is how tensors and
00:00:55
storages work in Torch. And then
00:00:59
I'll also talk about how we use the
00:01:03
neural network packages.
00:01:08
This talk basically
00:01:10
will get you to a place where you kind
00:01:15
of know how most people use
00:01:18
Torch. And then in the next
00:01:21
lecture I will
00:01:25
talk about more experimental stuff
00:01:29
that's a bit more interesting. So think
00:01:32
of this as the middle part, which gets
00:01:35
really boring. Yeah, the slides will be
00:01:41
available online; I'll send
00:01:45
them around. Okay, so
00:01:49
coming to tensors and storages in
00:01:53
Torch: tensors are, as I said before,
00:01:57
n-dimensional arrays. They are row-major
00:02:01
in memory. So basically what that
00:02:04
means is, let's say there's a
00:02:06
two-dimensional tensor that's illustrated
00:02:08
here in the middle. It has rows
00:02:15
of data, and if you actually look at how
00:02:20
it's represented in system memory,
00:02:23
because memory is linear,
00:02:25
the first row is laid
00:02:28
out as is, and then the second row is laid
00:02:31
out, and the third and the fourth.
00:02:33
This is called row-major, and if you do it
00:02:36
column-wise instead it's called
00:02:37
column-major. I think MATLAB
00:02:43
is column-major, and NumPy is row-major
00:02:47
like Torch, which is nice because
00:02:51
the layout and, to a large part, the
00:02:55
format that Torch and NumPy follow
00:02:57
are very similar. So if you want to share
00:03:01
a Torch tensor with
00:03:05
a NumPy array or vice versa, you
00:03:08
don't actually have to do any memory
00:03:09
copy. So here in this example the tensor
00:03:19
here has four rows and six
00:03:23
columns. So the size of the tensor is
00:03:25
four by six. And
00:03:28
there's another thing that we
00:03:31
commonly use when we do any kind of
00:03:35
n-d arrays: it's called the stride.
00:03:38
What the stride is: in a given dimension,
00:03:43
to access the next element, how
00:03:47
many elements forward
00:03:52
in memory do you have to go? So as
00:03:55
an example, the orange block here, the
00:04:00
first row, first column, is at
00:04:04
memory location zero. Now to go to the
00:04:07
first yellow block, how many
00:04:11
memory locations do I have to move,
00:04:14
how much do I have to
00:04:16
go? I have to go six
00:04:18
elements forward, and that's basically
00:04:20
the stride. So the stride here is six
00:04:24
in the first dimension and one in the
00:04:26
second dimension. It's one in the
00:04:28
second dimension because if I want to
00:04:31
access the next column, going
00:04:34
from the first column to the second column for
00:04:36
example, I just have to go one
00:04:40
forward in the actual memory that you
00:04:43
see. So why do we have the size and
00:04:49
stride? This actually gives us a very
00:04:52
powerful way of indexing sub-tensors,
00:04:57
basically choosing parts of tensors and
00:05:00
still operating on them without doing
00:05:04
expensive operations like memory
00:05:06
copies. So as an example here, let's
00:05:10
say I have a select operation where
00:05:14
I'm selecting, in the first
00:05:17
dimension, the third element. What that
00:05:20
means is that in the first
00:05:22
dimension, that is over the rows, I want the
00:05:24
third row. That operation will have
00:05:28
to give me this third row. And to
00:05:32
do this operation I don't actually have
00:05:34
to create a new memory copy. The only
00:05:36
operation I have to do here is
00:05:39
create just another tensor structure
00:05:42
that changes the size and the stride
00:05:44
and the storage offset, that is, where
00:05:47
my tensor starts from
00:05:49
in memory, and it can still map to the
00:05:52
same underlying storage that
00:05:54
this tensor points to. So that also
00:05:58
would mean that if I change anything in
00:06:02
this sub-tensor, the values will also
00:06:07
change in the original tensor. I've
00:06:12
illustrated this here by showing
00:06:17
where this sub-tensor actually
00:06:19
points in memory, and those are the
00:06:22
red memory locations.
00:06:26
Also, as you see, the
00:06:30
offset here now is thirteen, which is,
00:06:33
starting from the initial storage
00:06:34
location, how much further I
00:06:37
have to go to start
00:06:42
this particular
00:06:44
sub-tensor. So let's now do this column-wise. If
00:06:49
we do this column-wise we can still
00:06:51
select a sub-tensor. What
00:06:54
you notice here, for the row, is that the storage
00:06:58
is still contiguous in memory, which
00:06:59
means that in the tensor that you
00:07:02
just created by doing the select
00:07:04
operation here, every element in the
00:07:08
storage is right next to each other, and this
00:07:10
is called a contiguous tensor. And
00:07:14
that doesn't hold if you, for example, do
00:07:17
column-wise selection. So for the
00:07:20
column-wise selection, all of my columns here
00:07:23
are numbered, and let's say I select, in the
00:07:26
second dimension, which is over columns,
00:07:29
the third column. So I ask for this
00:07:33
particular column. And that would
00:07:35
give me this particular tensor. But if
00:07:38
you see how it's actually mapped into
00:07:41
the raw memory, the elements
00:07:46
are not contiguous in memory. But
00:07:49
you can still construct a tensor that
00:07:52
maps to this particular sub-tensor
00:07:54
by changing the size and the stride.
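
The size, stride and offset bookkeeping described above can be sketched in a few lines. This is plain Python rather than Torch's Lua API, and the helper names (`flat_index`, `select`, `elements`) are made up for illustration. Offsets here are 0-based, whereas Torch's storage offset is 1-based, so the 12 computed below for the third row corresponds to the 13 shown on the slide.

```python
# Plain-Python model of a 4x6 row-major tensor: a flat storage plus
# (size, stride, offset) metadata. All names here are illustrative.

ROWS, COLS = 4, 6
storage = list(range(ROWS * COLS))   # flat memory: 0, 1, ..., 23
size = (ROWS, COLS)
stride = (COLS, 1)                   # 6 in the first dimension, 1 in the second


def flat_index(offset, stride, index):
    """Map a multi-dimensional index to a position in the flat storage."""
    return offset + sum(i * s for i, s in zip(index, stride))


def select(size, stride, offset, dim, index):
    """Return (size, stride, offset) of the view that fixes `index` in `dim`."""
    new_size = size[:dim] + size[dim + 1:]
    new_stride = stride[:dim] + stride[dim + 1:]
    return new_size, new_stride, offset + index * stride[dim]


def elements(storage, size, stride, offset):
    """Read a 1-D view out of the shared storage (no data is copied into the view)."""
    return [storage[flat_index(offset, stride, (i,))] for i in range(size[0])]


# Third row (0-based index 2): 6 elements, stride 1, contiguous in memory.
row_view = select(size, stride, 0, dim=0, index=2)
third_row = elements(storage, *row_view)

# Third column (0-based index 2): 4 elements, stride 6, not contiguous.
col_view = select(size, stride, 0, dim=1, index=2)
third_col = elements(storage, *col_view)

print(row_view, third_row)
print(col_view, third_col)
```

Only the three small tuples differ between the two views; the storage list is the same object in both cases.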
00:07:57
So the stride here is six, which means
00:08:00
that going from this orange
00:08:03
block to the yellow block
00:08:05
I have to go six locations
00:08:09
forward and I will get my next element
00:08:11
in the tensor in this dimension. And
00:08:14
the offset just points to the fact that
00:08:17
this particular tensor starts from the
00:08:19
third element of the storage. So you start
00:08:22
with that element, and then for every
00:08:24
element you want next you go six blocks
00:08:27
forward. This is actually a very simple
00:08:32
way of mapping things: the fact that
00:08:35
you have a tensor and a storage,
00:08:37
that tensors map to storages, and that you can
00:08:39
extract sub-tensors. But it's
00:08:42
extremely powerful. You can, for example,
00:08:45
ask for the first three channels of
00:08:50
your image when there are, say, a hundred and
00:08:52
twenty-eight channels, and operate on
00:08:54
that separately without having to do
00:08:56
memory copies and so on. So that is
00:09:01
how tensors and storages work
00:09:06
at a low level. Now
00:09:11
looking at some syntax. In Torch,
00:09:15
if you want to load a package you
00:09:18
call the
00:09:21
require function. So here we are just
00:09:24
requiring torch. The semicolon is
00:09:26
optional. If you use IPython
00:09:29
notebooks, Torch actually has support
00:09:32
for IPython notebooks:
00:09:35
you have an iTorch kernel that you
00:09:39
can use, which means that you can use
00:09:43
IPython, if you're familiar with that,
00:09:46
and you can use it as you always use
00:09:48
it; it has inline help, autocomplete
00:09:50
and so on. So you load torch; the
00:09:55
semicolon here means that in IPython
00:09:58
notebooks it wouldn't print out the
00:10:00
result, and that's all it does. So
00:10:05
to create a tensor, you have
00:10:09
the syntax for it as part of the torch
00:10:11
package. You have several types of
00:10:13
tensors: DoubleTensor, FloatTensor, ByteTensor,
00:10:16
LongTensor and so on. And
00:10:20
you create a tensor of size four times
00:10:24
six; it's just a matrix, a four by
00:10:26
six matrix. And tensors by default are
00:10:29
not initialised with any default values,
00:10:34
so, as is standard, the
00:10:38
uninitialised memory might contain
00:10:41
all kinds of weird values. So let's
00:10:43
fill it up with uniform noise. You can
00:10:46
do that with this call. This colon
00:10:49
here is just Lua syntax for "I
00:10:53
want to operate on this variable";
00:10:57
it's equivalent to saying a.uniform(a).
00:11:00
So you will keep seeing this colon
00:11:03
operator. It's just like calling a
00:11:06
method of a class. So you call uniform
00:11:11
here, which fills the tensor with
00:11:13
uniform noise between zero
00:11:15
and one. If you actually pass
00:11:17
arguments then you can actually
00:11:20
change the lower and upper bounds. You
00:11:22
can print the tensor; it will print to the
00:11:24
screen in a nice format. And that's
00:11:29
basically the same tensor that we
00:11:31
wanted to create. The tensor will have
00:11:33
an underlying storage that you can also
00:11:35
access using the :storage() call, and
00:11:40
that will return the underlying
00:11:42
storage, so that you can directly manipulate
00:11:45
that, for example. And the other
00:11:50
operation we did previously was select,
00:11:53
and similarly there's a select call
00:11:55
here: the dimension and the element. And
00:11:59
as you see, when you print out the sub-tensor
00:12:03
here, it has the same values
00:12:06
as the third row here. And
00:12:10
to illustrate that the underlying
00:12:12
storage is shared, I show that if you
00:12:16
fill b with some value, let's say
00:12:18
with the value three, the
00:12:22
original tensor a, in row three, also
00:12:26
changes values. And this is a
00:12:28
pretty important detail to remember
00:12:31
when you're working with tensors and sub-tensors.
00:12:33
Oh, you don't get a column,
00:12:41
you get a single vector when you
00:12:45
select a row. It's one-dimensional,
00:12:49
so the print there is just showing
00:12:53
it column-wise, but that's about it.
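
The shared-storage behaviour can be mimicked with a small stand-in. This is a plain-Python sketch, not Torch code: the `View` class is hypothetical, the tensor is filled with zeros rather than uniform noise so the result is easy to read, and offsets are 0-based.

```python
class View:
    """A 1-D view: shared flat storage plus size, stride and 0-based offset."""

    def __init__(self, storage, size, stride, offset):
        self.storage = storage
        self.size = size
        self.stride = stride
        self.offset = offset

    def fill(self, value):
        # Write through the view straight into the shared storage.
        for i in range(self.size):
            self.storage[self.offset + i * self.stride] = value

    def tolist(self):
        return [self.storage[self.offset + i * self.stride]
                for i in range(self.size)]


ROWS, COLS = 4, 6
storage = [0.0] * (ROWS * COLS)      # backing memory of the 4x6 tensor "a"
b = View(storage, size=COLS, stride=1, offset=2 * COLS)   # third row of "a"
b.fill(3.0)

# Re-read "a" row by row from the same storage: its third row changed too.
a_rows = [storage[r * COLS:(r + 1) * COLS] for r in range(ROWS)]
for row in a_rows:
    print(row)
```

Because `b` never owned any data of its own, filling it is indistinguishable from filling the third row of the original.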
00:12:55
Okay, those are the basics of tensors.
00:13:04
I obviously won't cover the whole
00:13:07
tensor library, because it has more
00:13:10
than a hundred and fifty tensor
00:13:11
functions. You have linear
00:13:16
algebra operations, convolutions, a lot of
00:13:18
BLAS calls, tensor manipulation
00:13:22
operations like narrowing, indexing,
00:13:25
masking, selecting, scatter, gather, logical
00:13:28
operators and so on. And it's fully
00:13:32
documented: all the functions in
00:13:33
Torch are nicely documented with examples,
00:13:37
which is what you expect from a nice library, I
00:13:39
guess. And you also have inline help,
00:13:41
both in iTorch as well as in the
00:13:44
regular Torch interpreter. You can ask
00:13:48
for the help by typing a question mark and
00:13:50
the function you're interested in, and
00:13:52
it will give you the help
00:13:55
inline, with examples most of the time.
00:13:59
Coming to the next part: I talked
00:14:06
earlier about the GPU support in
00:14:08
Torch. It's extremely seamless;
00:14:12
it's exactly like using the CPU
00:14:14
package, except that
00:14:17
instead of using torch.FloatTensor
00:14:21
or torch.DoubleTensor one uses
00:14:24
a new tensor type,
00:14:26
torch.CudaTensor. And torch.CudaTensor is
00:14:29
a float tensor that sits on the GPU.
00:14:32
It also has all the math operations
00:14:34
defined on it; you can use it exactly
00:14:36
how you use the CPU tensor. And for
00:14:40
most of the operations, because
00:14:42
tensor operations on GPUs are usually
00:14:44
faster, almost all the operations that
00:14:47
you try to do are faster. And
00:14:52
the tensor size
00:14:55
you create is only limited by
00:14:57
the amount of GPU memory you have,
00:15:00
which is usually much smaller than the
00:15:04
amount of CPU memory you have on most
00:15:06
systems. Okay, so that's a basic overview
00:15:12
of Torch tensors. I didn't go
00:15:16
into a lot of detail because,
00:15:18
once you get through the basics,
00:15:21
it's mostly syntax. Basically, you look
00:15:23
at how you use NumPy or MATLAB
00:15:26
matrices, for example; you just look at
00:15:28
the functions you're interested in and
00:15:29
then you would use that. Next I want to
00:15:36
talk about training neural networks. So
00:15:40
with neural networks, the way you train them,
00:15:44
there are a lot of moving parts when you
00:15:46
train neural networks. I
00:15:50
just created a figure; some or
00:15:56
most of the use cases can be mapped
00:15:59
into, for example, such a structure. You
00:16:03
have most modern datasets, and by modern
00:16:08
I mean large datasets that don't fit
00:16:11
in memory any more. You have, I'd say,
00:16:15
sixteen gigs or sixty-four gigs or, you
00:16:18
know, two fifty-six gigs of CPU memory,
00:16:21
and you can no longer load your datasets
00:16:23
into memory like you load
00:16:25
MNIST, even though, you know, research still
00:16:27
carries on on MNIST. For example
00:16:31
ImageNet, the ImageNet thousand-class
00:16:34
dataset, is one point two
00:16:35
terabytes, and for us at
00:16:41
Facebook we consider that a small dataset.
00:16:43
So usually you load these datasets from
00:16:47
some kind of disk, either hard drives or
00:16:51
SSDs, or from some network file
00:16:55
system. And then you have a data loader
00:16:58
that loads this data. Basically
00:17:02
you can ask for a mini-batch of samples,
00:17:05
and it would on the fly load this mini-batch
00:17:08
of samples, preprocess them, augment
00:17:11
them, do all kinds of colour jitter and
00:17:14
cropping and all that, and then it will
00:17:17
send that into some queue, where your neural
00:17:21
network trainer can pull the mini-
00:17:25
batches off of that queue and then train
00:17:27
your neural network with the cost
00:17:29
function that you specified and the
00:17:31
optimisation algorithm that you specified,
00:17:33
like SGD or Adam or RMSProp, for
00:17:36
example. And usually these are multi-
00:17:41
threaded or multi-process. So the data
00:17:44
loader sits in a separate thread or
00:17:46
process, and your main thread
00:17:52
computes the neural network
00:17:56
process. And there are
00:18:03
other varieties of this as well. For
00:18:06
example, if you are
00:18:08
learning embeddings, you have much
00:18:13
smaller neural networks, and these are
00:18:17
not that much faster on the GPU
00:18:19
than on the CPU. So one common
00:18:23
way to train these things is via
00:18:25
Hogwild, and in Hogwild you would have
00:18:29
multiple of these neural networks
00:18:33
replicated, sharing the same parameters.
00:18:36
And they're trained simultaneously, in
00:18:38
parallel, asynchronously, with
00:18:43
no kind of synchronisation, and
00:18:45
that's Hogwild. In the
00:18:47
most common scenario you just have a
00:18:49
single thread and a single neural
00:18:52
network that you're training. I'm going to
00:18:54
cover that first. Okay, so coming to how
00:19:05
these structures
00:19:08
actually map to Torch packages:
00:19:13
the data loader, especially having
00:19:16
multiple data threads that have
00:19:19
callbacks into the main thread once they're
00:19:21
finished and so on, is
00:19:23
covered by the threads package. We will
00:19:26
go over that briefly. And you have the
00:19:31
trainer. In Torch itself there is no
00:19:36
notion of a trainer; the researcher
00:19:39
just writes their own training loop. This
00:19:42
is not common, but not uncommon either.
00:19:46
I mean, in Theano for example everyone
00:19:49
just writes their own training loop, but in
00:19:51
Caffe you would have a trainer that
00:19:55
takes your protobuf of the neural
00:19:57
network and the solver, and it would
00:20:00
run the whole thing. So Torch
00:20:04
tries to remain very raw:
00:20:07
the researcher, in like fifteen or twenty
00:20:09
lines of code, writes their own training
00:20:12
loop, and that gives them the
00:20:13
flexibility to change it in weird ways
00:20:15
if needed. And in the third lecture we
00:20:18
will see how this kind of flexibility
00:20:20
would be very useful, for example when
00:20:22
you're training adversarial networks.
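
As a rough picture of that structure, here is a loader thread feeding a queue and a hand-written training loop pulling from it. It uses only the Python standard library, not Torch's threads package, and the toy dataset, the one-parameter model and every name in it are assumptions made for the sketch.

```python
import queue
import threading

# Toy dataset for y = 3 * x, standing in for data that would live on disk.
DATA = [(float(x), 3.0 * x) for x in range(1, 9)]
BATCH_SIZE = 4
EPOCHS = 50


def data_loader(batches):
    """Runs in its own thread: builds mini-batches and pushes them to a queue."""
    for _ in range(EPOCHS):
        for start in range(0, len(DATA), BATCH_SIZE):
            batches.put(DATA[start:start + BATCH_SIZE])
    batches.put(None)  # sentinel: no more data


batches = queue.Queue(maxsize=8)
threading.Thread(target=data_loader, args=(batches,), daemon=True).start()

# The "trainer" is just a loop in the main thread.
w = 0.0                # single parameter of the model y_hat = w * x
learning_rate = 0.01
while True:
    batch = batches.get()
    if batch is None:
        break
    # Gradient of the mean squared error with respect to w.
    grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
    w -= learning_rate * grad

print(round(w, 3))
```

Since the loop is ordinary code, swapping the update rule or alternating between two models only means editing a few lines, which is the kind of flexibility the talk refers to.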
00:20:24
And the neural network and the cost
00:20:30
function are covered by the nn
00:20:32
package, and we will go over it briefly.
00:20:35
And the optimiser is covered by the
00:20:37
optim package. We will also go over
00:20:39
that, where we have a platter of
00:20:44
gradient-based optimisation algorithms.
00:20:47
Next, starting with the nn
00:20:50
package: I'll start with the nn package and
00:20:52
then go to optim and then lastly
00:20:54
threads, because data loading is boring.
00:20:57
So the nn package, as I said in
00:21:03
the last lecture, as I
00:21:05
briefly touched upon: in Torch the
00:21:09
neural network package has this
00:21:11
notion of building your neural networks
00:21:13
as stacks of Lego blocks in various
00:21:17
structures. So what we have is what we call
00:21:25
containers, and we have modules.
00:21:29
Containers are these various structures
00:21:35
that stack your
00:21:39
modules in different ways; I have a
00:21:41
visualisation coming up for that in a
00:21:42
second. And modules are
00:21:45
basically the actual computations
00:21:50
that you want. For example, in this
00:21:52
case, a convolution over spatial
00:21:58
images, in 2D, with three input feature
00:22:03
maps, sixteen output feature maps and a
00:22:05
five by five kernel, is added to
00:22:08
this Sequential there, so that it's
00:22:10
stacked, and the tanh activation is
00:22:15
added right after that. Now
00:22:17
the input comes through the
00:22:20
convolution, and the output of the
00:22:22
convolution goes to the tanh that was
00:22:24
added, and this goes into this max
00:22:26
pooling and so on. So
00:22:28
the Sequential container is basically just
00:22:31
a linear container that passes the
00:22:33
input through all of these modules
00:22:36
and then gives you the output of the
00:22:37
last layer of this container. In this
00:22:42
example, this short example
00:22:46
implements this particular neural
00:22:48
network using these few lines
00:22:53
of code. And one thing: if you're
00:23:01
familiar with other packages and you come
00:23:03
to Torch, one thing to keep in mind is
00:23:05
that we implement the softmax as
00:23:11
LogSoftMax. We do have a SoftMax there,
00:23:14
but as most people would know, the
00:23:18
gradient computations for softmax
00:23:22
are unstable. So what most packages do
00:23:26
is they call the layer softmax and
00:23:28
they give the output as the softmax
00:23:30
but they compute the gradients of the
00:23:33
log-softmax. We actually wanted to be
00:23:36
more transparent from the beginning, so
00:23:38
you actually have a layer called SoftMax; if
00:23:41
you want to shoot yourself in the foot, go
00:23:43
for it. But you have a layer called
00:23:46
LogSoftMax that does the right thing. We almost
00:23:49
always use LogSoftMax, so
00:23:52
if you looked at the basic
00:23:54
examples and tutorials you would
00:23:57
actually pretty much just use that.
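
One way to see the kind of numerical trouble involved is to compare a literal log-of-softmax with the usual log-sum-exp formulation. This is a general illustration in plain Python, not Torch's implementation of either layer, and the function names are made up for the sketch.

```python
import math


def naive_log_softmax(scores):
    """log(softmax(x)) computed literally; exp() overflows for large scores."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [math.log(e / total) for e in exps]


def stable_log_softmax(scores):
    """log-softmax via the log-sum-exp trick: subtract the max before exp()."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]


scores = [1000.0, 1001.0, 1002.0]

try:
    naive_log_softmax(scores)
    naive_ok = True
except OverflowError:
    naive_ok = False     # math.exp(1000.0) does not fit in a float

stable = stable_log_softmax(scores)
print(naive_ok, stable)
```

Working in log space keeps every intermediate value in a representable range, which is one reason to prefer a log-softmax output when the loss consumes log-probabilities anyway.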
00:24:00
So each of these modules in the
00:24:05
nn package has a common
00:24:07
interface that it has to define.
00:24:10
I mean,
00:24:12
these three
00:24:15
functions are essential functions that
00:24:17
they have to define, and they can have custom
00:24:20
functions that they use in other ways.
00:24:24
So the three functions are,
00:24:28
there's a typo here, updateOutput,
00:24:32
updateGradInput and accGradParameters.
00:24:35
updateOutput computes the output
00:24:40
given the input, so y = f(x):
00:24:43
it does f(x) and it gives out
00:24:48
y. And updateGradInput computes
00:24:55
dy by dx, basically the gradient with
00:25:02
respect to the input, the gradient of
00:25:06
the output with respect to the
00:25:07
input. And then there is
00:25:14
accGradParameters. So
00:25:17
the module that you define can be
00:25:19
parametric or non-parametric. Max
00:25:22
pooling, for example, is a non-parametric
00:25:26
module; it doesn't have any parameters
00:25:28
that it learns. But convolution, for
00:25:31
example, in convolutional networks,
00:25:35
tries to change its filters in a
00:25:38
way that improves the loss of
00:25:40
your network, and this is a parametric
00:25:43
module. So it has a set of weights
00:25:45
and biases that are defined inside it.
00:25:47
So for a layer like max pooling you
00:25:50
don't need to define
00:25:52
accGradParameters, so you can just leave it as
00:25:54
is, but if you do have parameters in
00:25:57
your module you would want to
00:26:00
define this function, which computes the
00:26:03
gradients with respect to the
00:26:05
parameters that you have in your module.
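
The three-function interface can be illustrated with two tiny modules. This is a plain-Python sketch: the snake_case method names are hypothetical analogues of updateOutput, updateGradInput and accGradParameters, and the signatures are simplified assumptions rather than the nn package's actual ones.

```python
import math


class Tanh:
    """Non-parametric module: forward pass and input gradient only."""

    def update_output(self, x):
        self.output = [math.tanh(v) for v in x]
        return self.output

    def update_grad_input(self, x, grad_output):
        # d tanh(x)/dx = 1 - tanh(x)^2, chained with the incoming gradient.
        # Relies on update_output having been called first.
        return [g * (1.0 - o * o) for g, o in zip(grad_output, self.output)]

    def acc_grad_parameters(self, x, grad_output):
        pass  # nothing to learn


class Affine:
    """Parametric module y = w * x + b with a scalar weight and bias."""

    def __init__(self, w, b):
        self.w, self.b = w, b
        self.grad_w, self.grad_b = 0.0, 0.0

    def update_output(self, x):
        return [self.w * v + self.b for v in x]

    def update_grad_input(self, x, grad_output):
        return [self.w * g for g in grad_output]

    def acc_grad_parameters(self, x, grad_output):
        # Accumulate (not overwrite) the gradients of the parameters.
        self.grad_w += sum(g * v for g, v in zip(grad_output, x))
        self.grad_b += sum(grad_output)


layer = Affine(w=2.0, b=1.0)
x = [1.0, 2.0, 3.0]
y = layer.update_output(x)
grad_y = [1.0, 1.0, 1.0]
grad_x = layer.update_grad_input(x, grad_y)
layer.acc_grad_parameters(x, grad_y)
print(y, grad_x, layer.grad_w, layer.grad_b)
```

The non-parametric `Tanh` leaves the third function empty, which mirrors the point about max pooling above.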
00:26:07
And also part of the nn package
00:26:12
are loss functions, like mean square
00:26:15
loss or negative log-likelihood loss
00:26:17
or margin loss and so on. And the
00:26:20
loss functions have a similar
00:26:23
interface. They have an updateOutput
00:26:27
and an updateGradInput. Since loss
00:26:29
functions are not parametric, you don't
00:26:34
have the accGradParameters there. Now
00:26:39
coming to the containers. The nn
00:26:42
package has several containers.
00:26:45
The most common, which you saw earlier, is
00:26:48
Sequential: the modules in white
00:26:51
there take the input, a feature input
00:26:54
that is of four channels, and it
00:26:57
just sends the input through and then
00:27:00
spits the output out. And then
00:27:04
there's the Concat container, which
00:27:08
takes the input and then sends the
00:27:12
same input to these two pipes that
00:27:19
it has. Concat basically
00:27:22
creates a pipe for each of them. So what
00:27:24
this structure
00:27:27
actually is: this whole thing is a
00:27:29
Concat which has two Sequentials in it,
00:27:33
and the Sequentials themselves have
00:27:35
four layers. So the Concat module here
00:27:39
has two pipes, and
00:27:41
the same input
00:27:44
goes through each of these Concat pipes, and
00:27:47
then it has separate outputs that are
00:27:49
then concatenated together to give you
00:27:52
a single output. And then there's also
00:27:56
the Parallel container: let's
00:28:00
say you're given an input of two
00:28:01
channels and you
00:28:04
have two Sequentials that are added
00:28:09
to it, two pipes. It gives each of these
00:28:13
channels to each of these separate
00:28:15
pipes, and then it gets outputs that it
00:28:17
concatenates together and sends to the
00:28:20
next layer. So as you've seen already
00:28:23
in these cases, a container can have
00:28:27
other containers inside it; you can
00:28:29
basically compose these things in a
00:28:31
very natural way. You can
00:28:36
compose complicated networks like ResNet
00:28:39
or GoogLeNet just using these
00:28:43
three. You can actually, just using
00:28:47
Sequential... yeah, just using these three
00:28:49
containers you can actually create the
00:28:52
weird structure that GoogLeNet is,
00:28:55
in very few lines of code.
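
The data flow of the three containers can be mimicked on plain lists. This is a Python sketch of the idea only: the class names echo the containers from the talk, but the real ones operate on tensors along chosen dimensions, and `Fn` is a made-up leaf module.

```python
class Fn:
    """Leaf module that applies a function to every element."""

    def __init__(self, f):
        self.f = f

    def forward(self, x):
        return [self.f(v) for v in x]


class Sequential:
    """Pass the input through each module in turn."""

    def __init__(self, *modules):
        self.modules = list(modules)

    def forward(self, x):
        for m in self.modules:
            x = m.forward(x)
        return x


class Concat:
    """Send the same input down every branch; concatenate the outputs."""

    def __init__(self, *branches):
        self.branches = list(branches)

    def forward(self, x):
        out = []
        for b in self.branches:
            out.extend(b.forward(x))
        return out


class Parallel:
    """Send the i-th input slice to the i-th branch; concatenate the outputs."""

    def __init__(self, *branches):
        self.branches = list(branches)

    def forward(self, xs):
        out = []
        for b, x in zip(self.branches, xs):
            out.extend(b.forward(x))
        return out


double = Fn(lambda v: 2 * v)
inc = Fn(lambda v: v + 1)

seq_out = Sequential(double, inc).forward([1, 2])
concat_out = Concat(Sequential(double), Sequential(double, inc)).forward([1, 2])
parallel_out = Parallel(double, inc).forward([[1, 2], [10, 20]])
# Containers nest: a Concat used as one stage of a Sequential.
nested_out = Sequential(Concat(double, inc), double).forward([1, 2])
print(seq_out, concat_out, parallel_out, nested_out)
```

Every container exposes the same `forward` as a leaf module, which is what makes nesting them possible.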
00:29:00
So getting to the CUDA backend of
00:29:08
the neural network package: as I
00:29:10
showed earlier, using CUDA for
00:29:13
Torch tensors was very, very natural;
00:29:16
you had to change one line.
00:29:20
Similarly, the nn package is also
00:29:23
equally natural to use. If you have a
00:29:26
model that you define, to actually
00:29:29
transfer the model to CUDA all you
00:29:31
have to do is call :cuda() on the
00:29:33
model, and it automatically now sits on
00:29:35
the GPU, and it expects inputs to itself
00:29:39
to be CUDA tensors that also
00:29:42
sit on the GPU. And now this model's
00:29:44
forward, which computes the updateOutput,
00:29:48
is done on the GPU. So it's very easy to use,
00:29:55
very natural to use; you never feel
00:29:58
like you're doing something special for
00:29:59
the GPU. Next comes the nngraph
00:30:04
package. I only have one slide
00:30:06
on this, because
00:30:08
there's not much to the nngraph
00:30:11
package, but it's very, very powerful. The
00:30:14
nngraph package introduces
00:30:17
composing neural networks in a
00:30:20
different way: instead of composing them
00:30:22
in terms of containers and modules, all
00:30:25
you have to do is chain modules one
00:30:27
after the other. So an example is
00:30:31
probably the best way to showcase this.
00:30:33
In this example, let's say you have
00:30:36
an nngraph
00:30:41
where you want to create a two-layer
00:30:44
MLP with a tanh nonlinearity.
00:30:50
What you do is you create some dummy
00:30:52
input layer, just as a best practice;
00:30:55
this is the actual layer,
00:30:57
nn.Identity, open-close bracket.
00:31:00
nngraph basically overloads
00:31:05
the call operator, so the second
00:31:07
bracket tells you what it
00:31:09
is connected to. So the input here is
00:31:12
not connected to anything else; it is
00:31:14
the first layer in your neural
00:31:18
network, so it just has an empty
00:31:20
bracket. Now coming to the next part:
00:31:23
you create the first hidden layer, where
00:31:27
you have a tanh layer that is
00:31:31
connected to a linear layer that is
00:31:35
connected to input, which is this
00:31:36
identity layer. And this whole thing is
00:31:39
created in one shot, and this is the
00:31:44
first hidden layer. And then you create
00:31:47
the next linear layer, which connects to
00:31:52
h1 here, which is the first hidden
00:31:54
layer, and that gives you the output.
00:31:57
And then when you want to create your
00:31:59
nngraph, you just define the
00:32:02
input and the output modules
00:32:06
that you want your nngraph to map
00:32:08
to. So you create what you call an
00:32:12
nn.gModule, short for graph
00:32:15
module, where in the first set of
00:32:18
parentheses you give all the inputs to
00:32:21
your neural network, and in the second
00:32:23
set of parentheses you give all
00:32:26
the outputs you want from the neural
00:32:28
network, and the MLP is created. And you
00:32:32
use it exactly like how you used the
00:32:35
previous nn modules. It has the
00:32:38
same interface; everything is the same.
00:32:40
All it does is it basically
00:32:46
looks at what's connected to what and
00:32:48
it just creates a computation graph
00:32:51
there. And there's not much else to
00:32:57
nngraph, to be honest. I mean,
00:32:59
it has some useful things, like you can
00:33:02
actually, using graphviz,
00:33:06
create a visualisation of your
00:33:08
graph, where if you have a very complex
00:33:11
graph it would be useful to see
00:33:14
what's going on in the graph. So you
00:33:17
create an SVG file that shows the
00:33:21
structure of your graph, with descriptions
00:33:23
of each layer in your graph and how
00:33:25
they're connected. And you also have
00:33:30
a mode where if you have an error at
00:33:34
runtime in your neural network,
00:33:37
nngraph can automatically spit out
00:33:41
an SVG file with the whole graph
00:33:43
structure, and mark with a
00:33:48
different colour the node where the
00:33:53
error occurred, in case you want to
00:33:55
visually see which of your neural
00:33:57
network modules actually failed,
00:34:00
like had some runtime error. Apart from
00:34:03
that, nngraph is very basic, but
00:34:04
very useful to create complicated
00:34:08
things like weird LSTMs or other
00:34:12
recurrent modules. Then I come to the
00:34:17
optim package. The optim package is
00:34:21
written in a way where it knows nothing
00:34:24
about neural networks. optim
00:34:26
basically just has a bunch of
00:34:28
optimisation algorithms, including
00:34:31
non-gradient-based algorithms
00:34:35
like line search algorithms. And it
00:34:38
basically wants a function y = f
00:34:41
of x and w, where w are the parameters of
00:34:47
your system and x is the input. Actually
00:34:50
it doesn't even care about x, the input; it
00:34:53
just wants y = f of w. So
00:34:56
in this small example here I'm just
00:34:58
showing the interface that optim
00:35:00
takes. You can have a config that
00:35:02
defines all the parameters of your
00:35:04
optimisation. And for each of your
00:35:08
training samples you create a function
00:35:11
that does f
00:35:17
of x. And then you can pass that
00:35:21
to optim.sgd, in this example
00:35:24
where you're doing stochastic gradient
00:35:26
descent. You pass the
00:35:29
function that computes f of x, you
00:35:31
pass in x, which is the parameters of
00:35:33
your system that you're trying to
00:35:35
optimise, and the configuration, and
00:35:38
optim.sgd will run on this
00:35:41
function. It's
00:35:44
decoupled from neural
00:35:47
networks for a very good reason: we wanted
00:35:49
to write the optim package
00:35:52
to be very generic, like a black-box
00:35:55
optimiser that you can just plug into
00:35:59
other places as well. The optim package has
00:36:03
a wide range of algorithms implemented:
00:36:06
your standard stochastic gradient
00:36:09
descent, averaged SGD, L-BFGS, conjugate
00:36:12
gradients, Adadelta, Adamax, RMSProp,
00:36:16
FISTA with line search (this is an
00:36:20
interesting one), Nesterov's SGD,
00:36:24
Rprop, and most recently CMA-ES. I
00:36:28
haven't figured out
00:36:32
what the full form is, but some guy
00:36:36
contributed this very recently. My
00:36:42
favourites here are SGD and Adam and
00:36:47
RMSProp; they're kind of nice.
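
The optimiser interface described above (a function to evaluate, a parameter vector and a config) can be sketched as follows. This is plain Python, not the optim package itself: the `sgd` and `feval` functions below are written for the illustration, and the objective is an arbitrary quadratic chosen so the answer is obvious.

```python
def sgd(feval, x, config):
    """One plain SGD step, x <- x - learningRate * df/dx, updating x in place."""
    loss, grad = feval(x)
    lr = config.get("learningRate", 1e-3)
    for i in range(len(x)):
        x[i] -= lr * grad[i]
    return x, loss


def feval(x):
    """Return f(x) and df/dx for f(x) = sum((x_i - 1)^2)."""
    loss = sum((v - 1.0) ** 2 for v in x)
    grad = [2.0 * (v - 1.0) for v in x]
    return loss, grad


x = [5.0, -3.0]                    # the parameters being optimised
config = {"learningRate": 0.1}
for _ in range(100):
    x, loss = sgd(feval, x, config)

print(x, loss)
```

The optimiser never looks inside `feval`; whether it wraps a neural network or a quadratic makes no difference to it.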
00:36:50
everything else is I only used in
00:36:52
passing. So how does the opt in package
00:36:57
work for neural networks itself? So in
00:37:00
the nn package we have a
00:37:04
powerful function called getParameters.
00:37:08
What it does is: your network has
00:37:12
several modules, and
00:37:14
each of them can be parameterised. Let's
00:37:17
say you had three convolutional layers;
00:37:18
each of them has their own parameters
00:37:21
that map to separate memory regions,
00:37:24
since they call their own mallocs;
00:37:26
they're just sitting in different parts
00:37:27
of memory. What getParameters
00:37:31
does is, when you call it, it maps all
00:37:35
of the parameters, all the
00:37:38
parameters of your current neural
00:37:41
network, onto a single contiguous
00:37:44
storage. And then it remaps the tensors
00:37:48
of each of these layers onto that
00:37:51
storage using the offsets and the
00:37:54
strides. And what that would give you
00:37:57
is a single vector that you can pass to
00:38:04
your optimisation package,
00:38:07
and the
00:38:10
optimisation
00:38:16
package doesn't have to know whether
00:38:17
it's a neural network or anything
00:38:20
else. It just wants a vector of
00:38:22
parameters that it wants to optimise.
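The remapping just described can be sketched in plain Python rather than Torch's Lua. This is only an illustration under stated assumptions, not the nn implementation: the function name flatten_parameters and the layer sizes are hypothetical, and a standard-library array with memoryview slices stands in for a single storage and the per-layer tensors that view it at different offsets.

```python
from array import array


def flatten_parameters(layer_sizes):
    """Allocate one contiguous buffer and return it with one view per layer."""
    storage = array("d", [0.0] * sum(layer_sizes))  # single contiguous block of doubles
    flat = memoryview(storage)                      # the single vector given to the optimiser
    views, offset = [], 0
    for size in layer_sizes:
        views.append(flat[offset:offset + size])    # a view at an offset, not a copy
        offset += size
    return flat, views


# Hypothetical network with three layers holding 4, 2 and 3 parameters.
flat, layers = flatten_parameters([4, 2, 3])

layers[1][0] = 7.0   # a layer writes to its own parameters...
flat[8] = -1.0       # ...and the optimiser writes to the flat vector
```

Because every per-layer view aliases the same buffer, code that only ever sees the flat vector still updates each layer's weights in place, and a layer's own writes are visible through the flat vector.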
00:38:24
So the nn package has this call,
00:38:28
getParameters, that will do that for
00:38:29
you: it will remap all your parameters
00:38:32
to a single vector that you can
00:38:34
then pass into the optim package. This
00:38:37
is probably the only hairy detail in the
00:38:41
whole nn and optim thing, but it's a
00:38:44
very important detail, and several
00:38:46
people have shot themselves in the foot
00:38:48
in the past using this. Okay, now let's
00:38:56
actually look at how this example, the
00:38:59
same example I've given, is gonna map to
00:39:03
a neural network. So you want to define
00:39:07
this function f of x, where x are
00:39:10
the input, the parameters of your
00:39:13
network. And you want to compute the
00:39:16
neural network gradients df/dx, which
00:39:22
are the gradients with respect to the
00:39:23
weights, and return them, and then the
00:39:26
SGD step is done after that. So in the
00:39:31
example here, let me call my
00:39:34
function feval. That basically computes
00:39:37
f of x, w. It selects
00:39:43
a training example, it loads the training
00:39:46
example. Let's say you select the training
00:39:49
example at random, right? Over here
00:39:52
it actually cycles to the next
00:39:55
training example in this table
00:39:58
called data, and the inputs are
00:40:08
in sample[1] and the targets in
00:40:12
sample[2]. Inputs are the input to your
00:40:14
neural network; targets are what you
00:40:15
want it to be, or what your
00:40:18
loss function expects to compute the
00:40:22
loss. So we first zero the gradients
00:40:25
with respect to your weights, because if you
00:40:28
had a previous optimisation step
00:40:31
the gradients are sitting there
00:40:33
accumulated already. So you just zero
00:40:36
the gradients. And the gradients
00:40:40
are accumulated in all of our neural
00:40:42
networks to accommodate batch methods
00:40:45
when you can't
00:40:48
compute the batch in one shot, so
00:40:50
you just compute a large batch sample
00:40:53
by sample and the gradients get
00:40:54
accumulated there. And this is very
00:40:57
useful when you're doing memory-hungry
00:41:02
methods of optimisation especially.
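The reason for zeroing can be seen in a small plain-Python sketch. It assumes a one-weight model with a squared-error loss; sample_gradient and backward are hypothetical names chosen for illustration and are not the nn API.

```python
def sample_gradient(w, x, y):
    """Gradient of 0.5 * (w * x - y) ** 2 with respect to the weight w."""
    return (w * x - y) * x


def backward(grad_w, w, samples):
    """Accumulate (add) each sample's gradient into the one-element buffer grad_w."""
    for x, y in samples:
        grad_w[0] += sample_gradient(w, x, y)


w = 1.0
batch = [(1.0, 2.0), (2.0, 2.0), (3.0, 6.0)]

# Whole batch in one shot.
one_shot = [0.0]
backward(one_shot, w, batch)

# Same batch fed in two pieces: the buffer keeps accumulating between calls.
piecewise = [0.0]
backward(piecewise, w, batch[:2])
backward(piecewise, w, batch[2:])

# A following step without zeroing still contains the previous step's gradients.
stale = list(piecewise)
backward(stale, w, batch)

# Zeroing first leaves only the gradient of the new step.
fresh = list(piecewise)
fresh[0] = 0.0
backward(fresh, w, batch)
```

Feeding the batch piece by piece gives the same accumulated gradient as feeding it at once, which is what makes sample-by-sample batches possible; the price is that the buffer has to be reset explicitly before each new step.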
00:41:08
What you do here, after you reset the
00:41:10
gradients, is you call criterion:forward
00:41:14
(criterion is your loss function) on
00:41:16
model:forward(inputs). model:forward(inputs),
00:41:19
it returns the output.
00:41:22
And the loss function takes the
00:41:24
output of your neural network and the
00:41:25
target that you wanted it to be; it
00:41:28
computes a loss. And then you call
00:41:31
model:backward(inputs, the
00:41:36
gradients). So the model's backward, which
00:41:39
computes both updateGradInput and
00:41:42
accGradParameters in one shot,
00:41:44
takes the inputs and the gradients with
00:41:47
respect to the output. So the gradients
00:41:49
with respect to the output are given
00:41:51
by the backward call of your loss
00:41:53
function, and those are passed in as the
00:41:55
second parameter, inputs as the first
00:41:57
parameter. That
00:42:02
basically will accumulate into dl_dx the
00:42:06
gradients with respect to the weights,
00:42:08
and then you return the
00:42:11
loss that you computed, and dl_dx,
00:42:15
which is the gradient with respect to the
00:42:18
weights. And then that closure that you
00:42:22
just defined is called feval. So
00:42:26
you define your SGD parameters, in this
00:42:28
case because you're doing SGD:
00:42:30
learning rate, learning rate decay, which
00:42:33
is how much the learning rate has to drop
00:42:35
off per sample, weight decay, momentum.
00:42:38
And for all the epochs in your training
00:42:42
loop, for all the mini-batches you
00:42:47
have, I guess, you just call optim.sgd
00:42:51
of that function, and the x, which
00:42:56
is the parameters of your network, and
00:43:00
the SGD configuration parameters,
00:43:03
which specify the learning rate and so on.
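The whole pattern (a closure that returns the loss and the gradient, a configuration table, and an optimiser that knows nothing about networks) can be sketched in plain Python. This is an illustration under assumptions, not the optim package itself: the sgd function below is a simplified stand-in, the learning-rate decay schedule is assumed, the config keys are hypothetical names, and the "network" is a toy linear model invented for the example.

```python
def sgd(feval, x, config, state):
    """One simplified SGD step; x is a flat list of parameters, updated in place."""
    loss, grad = feval(x)
    evals = state.setdefault("evals", 0)
    velocity = state.setdefault("velocity", [0.0] * len(x))
    # Assumed schedule: the learning rate shrinks with the number of evaluations.
    lr = config["learning_rate"] / (1.0 + evals * config["learning_rate_decay"])
    for i in range(len(x)):
        g = grad[i] + config["weight_decay"] * x[i]
        velocity[i] = config["momentum"] * velocity[i] + g
        x[i] -= lr * velocity[i]
    state["evals"] = evals + 1
    return x, loss


# Toy "network": output = w * input + b, with both parameters in one flat list.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # targets follow 2 * input + 1
params = [0.0, 0.0]        # [w, b]
grad_params = [0.0, 0.0]   # gradient buffer with the same layout as params
position = [0]             # index of the next training sample


def feval(x):
    """Closure handed to the optimiser: returns the loss and its gradient at x."""
    inputs, target = data[position[0] % len(data)]   # cycle to the next sample
    position[0] += 1
    grad_params[0] = grad_params[1] = 0.0            # zero the accumulated gradients
    output = x[0] * inputs + x[1]                    # model forward
    loss = 0.5 * (output - target) ** 2              # criterion forward
    grad_output = output - target                    # criterion backward
    grad_params[0] += grad_output * inputs           # model backward accumulates
    grad_params[1] += grad_output
    return loss, grad_params


config = {"learning_rate": 0.05, "learning_rate_decay": 1e-4,
          "weight_decay": 0.0, "momentum": 0.5}
state = {}
epoch_losses = []
for epoch in range(200):
    total = 0.0
    for _ in data:
        _, loss = sgd(feval, params, config, state)
        total += loss
    epoch_losses.append(total)   # accumulate the loss to watch it go down
```

The optimiser only ever touches the flat list and the closure, so swapping in a different algorithm would mean replacing the sgd function and leaving feval untouched.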
00:43:07
That's it. The return value here,
00:43:11
one of them, is the loss. And you
00:43:18
just accumulate that and print it out to
00:43:20
make sure that your model is going down
00:43:22
in loss. If your model goes up in loss,
00:43:24
that's not nice. That's the optim
00:43:29
package. It might be a little dense. But
00:43:32
it's really, really powerful. And if you
00:43:35
don't understand that, at the end of
00:43:38
lecture three we will be pointing
00:43:40
you guys to links to three notebooks
00:43:44
that you can go home and work on in
00:43:47
your own time. They will have comments;
00:43:49
they will take you through the basic examples
00:43:51
of how to do things. And lastly, the
00:43:58
threads package.
00:44:02
So don't laugh at the next slide;
00:44:07
there's a small detail there that's
00:44:09
funny. It's mostly an accident. So we
00:44:14
created the threads package, and at
00:44:15
some point I was writing example code
00:44:18
for myself on how to do data loading
00:44:22
using the threads package. And I
00:44:25
called those data thread pools donkeys.
00:44:30
And then I open-sourced it; I
00:44:35
never actually looked into why I
00:44:37
called it donkeys. But many people in
00:44:41
the Torch community actually call data
00:44:43
loading threads donkeys. So the
00:44:49
examples here are just screenshots
00:44:51
from my example, so they might have a
00:44:55
variable called donkeys and, like, you
00:44:57
know. So basically the way that the
00:45:00
threads package works is it creates
00:45:02
thread pools. You can submit arbitrary
00:45:06
functions to this thread pool, and that
00:45:09
function will get executed in one
00:45:11
of the threads in that thread pool. And you
00:45:15
can also specify a return callback that
00:45:19
executes in the main thread once the
00:45:21
thread finishes its computation. So
00:45:27
the way you create these threads is
00:45:28
actually very simple: you just ask for
00:45:34
as many threads as you want. You have some
00:45:37
initialisation functions that have to
00:45:38
be run when the thread is initialised;
00:45:41
this can be like loading the functions
00:45:43
that you will call later, and so on.
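The pool behaviour described so far (jobs run in worker threads, return callbacks run in the main thread, and an initialisation function runs once per worker) can be sketched in plain Python. This is an illustration under assumptions and not the Torch threads package: the class ThreadPool and its methods add_job and synchronize are hypothetical names, and error handling is left out.

```python
import queue
import threading


class ThreadPool:
    """Workers run jobs; callbacks run in the thread that calls synchronize()."""

    def __init__(self, n_threads, init=None):
        self._jobs = queue.Queue()
        self._done = queue.Queue()
        self.pending = 0
        for index in range(n_threads):
            worker = threading.Thread(target=self._work, args=(index, init), daemon=True)
            worker.start()

    def _work(self, index, init):
        if init is not None:
            init(index)                        # one-time, per-thread initialisation
        while True:
            job, callback = self._jobs.get()
            # Run the job off the main thread. A job that raises would stall
            # synchronize(); this sketch does not handle that case.
            self._done.put((callback, job()))

    def add_job(self, job, callback=None):
        self.pending += 1
        self._jobs.put((job, callback))

    def synchronize(self):
        """Wait for all submitted jobs, running each callback in the calling thread."""
        while self.pending > 0:
            callback, result = self._done.get()
            self.pending -= 1
            if callback is not None:
                callback(result)


main_thread = threading.current_thread()
batches = []
callback_threads = []


def load_batch(i):
    return [i * 10 + k for k in range(3)]      # stand-in for real data loading


def on_batch(batch):
    callback_threads.append(threading.current_thread())
    batches.append(batch)


pool = ThreadPool(2, init=lambda index: None)  # init is where per-thread setup would go
for i in range(4):
    pool.add_job(lambda i=i: load_batch(i), on_batch)
pool.synchronize()
```

Running the callbacks in the caller's thread keeps everything that touches the model on one thread, while the workers only produce data; the order in which batches arrive is not guaranteed.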
00:45:48
There is a mode called shared serialize
00:45:50
which is very powerful in the threads
00:45:53
package. What this does is it shares all
00:45:59
the tensors between threads, between
00:46:02
the main thread and all the worker
00:46:04
threads. And this is really
00:46:08
powerful because, one, when you're
00:46:10
returning tensors from your
00:46:14
thread pool to your main thread you
00:46:15
don't have to do a memcpy. It's all very
00:46:18
seamless; you don't have to do
00:46:19
serialisation or deserialisation. And
00:46:22
if you want to do Hogwild training you
00:46:25
can basically just create a thread pool,
00:46:27
pass in your neural network to each of
00:46:28
your threads, and the
00:46:32
network will automatically be shared
00:46:34
among all threads. And you can write
00:46:36
your training inside your thread, and
00:46:40
it'll be asynchronous Hogwild, and
00:46:42
it's very fast, with, like, zero overhead:
00:46:45
you don't have to collect the parameters
00:46:48
to a parameter server, do the update and
00:46:50
send them back, and so on. This is
00:46:54
creating the threads, and the slide
00:46:56
here showcases how you use the threads.
00:47:00
There's one function that's the most
00:47:02
important: it's called addjob. The addjob
00:47:05
function takes an arbitrary closure
00:47:09
that you can define. You just
00:47:13
define a computation that you
00:47:15
want to do, and that's the first
00:47:19
argument. And the second argument is a
00:47:23
callback that is run in the main thread
00:47:25
once you finish doing this computation
00:47:29
in the thread, in the data
00:47:33
thread. And in the main thread, in this
00:47:36
case for example, what I did was: in
00:47:41
the data thread it's basically loading a
00:47:43
training sample of batch
00:47:46
size, and that function
00:47:49
here returns inputs and labels. And it
00:47:52
returns inputs and labels to the
00:47:54
main thread. And in the main thread the
00:47:59
closure that I defined separately, the
00:48:01
function: you have the inputs and the
00:48:04
labels that are sitting on the CPU that
00:48:07
come in, and these are just some data
00:48:11
logging: how much time it's taking to
00:48:14
load the data and so on. And
00:48:17
here inputsCPU and labelsCPU are sitting on
00:48:20
the CPU; they're FloatTensors. The inputs and
00:48:23
labels here are CudaTensors. I copy
00:48:27
over the contents of the FloatTensors
00:48:28
over to the CudaTensors to transfer
00:48:30
them to the GPU. I define my feval,
00:48:34
which is: zero the gradients with
00:48:37
respect to the parameters, forward the...
00:48:39
that is, forward the inputs to the model,
00:48:42
get the outputs, forward the outputs and
00:48:45
the labels to the loss function, get the
00:48:47
error, and then compute the gradients
00:48:50
with respect to the outputs of the neural
00:48:52
network, pass them into the neural network
00:48:55
itself, and then return the
00:48:58
parameters of the neural network and the
00:49:00
loss. And this is defined elsewhere: when
00:49:04
you call getParameters on your neural
00:49:07
network, the parameters and the
00:49:09
gradParameters are returned. And then finally
00:49:12
I call optim.sgd. Here you can
00:49:14
replace SGD with your favourite
00:49:16
algorithm for optimisation. So that's
00:49:21
it, and with this basically you went through,
00:49:25
piece by piece, a complete training loop
00:49:28
for almost all cases of neural networks,
00:49:33
except, like, you know, if you do weird
00:49:35
stuff like adversarial training or
00:49:38
something. For almost all
00:49:40
supervised cases at least. So you
00:49:48
went through all the examples, and the
00:49:50
last slide there was basically what
00:49:53
we're gonna cover in the next lecture
00:49:54
after Yoshua Bengio's deep
00:49:56
generative models. We will do a complete
00:49:59
example. It's only a hundred and sixty
00:50:02
lines-ish, so don't feel intimidated by
00:50:05
a complete example. We will do a
00:50:07
complete example of using nn, optim
00:50:09
and threads for image generation. We
00:50:13
will briefly look at the autograd
00:50:15
package and how to use it. And
00:50:21
then I will finally talk about torchnet,
00:50:23
which is a small helper framework
00:50:28
that is sitting on top of torch, nn
00:50:30
and all these things, to abstract
00:50:34
away common patterns that you do, like,
00:50:37
for example, data loading, data
00:50:39
augmentation. And all this stuff is
00:50:43
basically code that you copy-paste from
00:50:45
one script to another. And torchnet
00:50:47
kind of implements these for you in a
00:50:51
nice way, and I'll briefly talk about
00:50:54
that. That's the end of this session, and
00:50:58
if you have questions I'll take them
00:51:00
now. I did promise you this lecture was
00:51:15
gonna be boring. There's a question there?
00:51:20
No? Okay. So I just wanted to know how
00:51:34
easy it is to use layers you
00:51:38
created through nngraph
00:51:41
in, I mean, either Sequential or
00:51:43
Parallel or Concat. So it's extremely
00:51:46
natural. When you create the gModule in
00:51:50
the nngraph package... so once you
00:51:53
create... wait, no, it's the next... okay,
00:51:57
once you create the gModule here,
00:52:00
this thing now can be added; it's a
00:52:02
standard layer. It can be added to
00:52:05
containers and so on. Okay, when you say
00:52:08
it's a standard layer, are the
00:52:12
parameters distinct if you add it
00:52:16
multiple times, or the parameters... If
00:52:18
you add the same
00:52:21
thing multiple times, the parameters are
00:52:23
not distinct. Okay. But also the states are
00:52:27
not distinct, and usually you wouldn't
00:52:28
wanna do that. What you can do is
00:52:30
you can call clone on this thing and
00:52:33
that will create a replica. Okay, thank
00:52:36
you. All right, and thanks for the talk. Um,
00:52:42
what do you normally do for testing and
00:52:44
debugging? So before, I used to just
00:52:51
use this debugger called MobDebug;
00:52:55
it's open source, it's installable
00:52:58
via the package manager. These days
00:53:01
I'm using the fb.debugger package that
00:53:05
is also open source. And
00:53:09
usually the fb.debugger package has a
00:53:11
mode where if you hit an error it will
00:53:13
automatically go into the debugger.
00:53:16
And you can then, like, see what's
00:53:18
wrong, for example. And that's very
00:53:21
useful. And for testing, I just write
00:53:24
unit tests for, like, all of my layers
00:53:27
as well. Okay, thanks, thanks a lot.
00:53:45
So, I mean, I'm a Theano user, and what I
00:53:48
find cool is the transparency between
00:53:51
the CPU and GPU: you can just run
00:53:56
my code on CPU, just check for
00:53:58
dimensions and stuff, and then if you have a
00:54:00
heavy workload, ship it to the server
00:54:02
or anything like this, and it works. Or do you
00:54:05
just have a bunch of if statements?
00:54:08
You don't have to do an if statement.
00:54:10
You can write one single function
00:54:12
called cast that will, like, cast your whole
00:54:17
neural network to the GPU. Or we have
00:54:23
another function called torch.
00:54:27
setdefaulttensortype, so you
00:54:34
can set the default tensor type that
00:54:35
you do operations in, and you can first
00:54:39
check it on CPU by setting the default
00:54:41
tensor type to, for example, FloatTensor.
00:54:43
And then you can switch it to GPU and
00:54:46
then, like, all the declarations that you
00:54:48
do with torch.Tensor, like, by default
00:54:51
in the neural network, they get created
00:54:54
with the GPU tensor. Thanks.
00:54:58
But what about memory, managing the
00:55:07
transfer from the CPU to GPU or... I
00:55:09
mean, is that just left to the user?
00:55:11
Yes. Okay, as I showed in the
00:55:14
last example... yeah, here it is. Yeah,
00:55:21
yeah, so as I showed here, the CPU
00:55:24
tensors are not automatically... like, if
00:55:26
you give a CPU tensor as an input to
00:55:28
your GPU module, for example, it will
00:55:30
just error. You have to transfer them
00:55:35
yourself to the GPU. There's no
00:55:37
ambiguity there; there are no, like, debugging
00:55:39
issues there. Okay, so if there are no more
00:55:48
questions, we're already done. So again

Conference program

Deep Supervised Learning of Representations
Yoshua Bengio, University of Montreal, Canada
4 July 2016 · 2:01 p.m.
Hardware & software update from NVIDIA, Enabling Deep Learning
Alison B Lowndes, NVIDIA
4 July 2016 · 3:20 p.m.
Day 1 - Questions and Answers
Panel
4 July 2016 · 4:16 p.m.
Torch 1
Soumith Chintala, Facebook
5 July 2016 · 10:02 a.m.
Torch 2
Soumith Chintala, Facebook
5 July 2016 · 11:21 a.m.
Deep Generative Models
Yoshua Bengio, University of Montreal, Canada
5 July 2016 · 1:59 p.m.
Torch 3
Soumith Chintala, Facebook
5 July 2016 · 3:28 p.m.
Day 2 - Questions and Answers
Panel
5 July 2016 · 4:21 p.m.
TensorFlow 1
Mihaela Rosca, Google
6 July 2016 · 10 a.m.
TensorFlow 2
Mihaela Rosca, Google
6 July 2016 · 11:19 a.m.
