Note: this content has been automatically generated.
00:33:50
parallelism and what to do here. And then if you want to do the training, you actually do that with the supervisor-managed session. So this is the different part: as expected with any distributed computation, you need someone to manage all of this, and that is the supervisor. It gives you a managed session where you can tell it "run this training operation" and it does everything for you. So this is a more or less complete example (if you fill in this part) of how to do the data parallelism, so it's not that difficult. And you can actually do this... I think there's a function, SyncReplicasOptimizer if I remember correctly, that you can use, and it will do the training with synchronous data parallelism without these potential issues where the gradients get updated by different workers and then it's no longer the same gradient descent that we're used to. Now, on to another really interesting part that TensorFlow provides, which is TensorBoard. So we
00:35:14
are very much aware that most of the computation that you have to do with this stuff, well, it's complex, especially with deep neural networks; it can even be confusing at times. And you want to be able to plot how your accuracy changes on the test set, how your accuracy changes on the training set; you want to be able to see how your weights are changing, how your biases are changing, and whatever other hyperparameters you might have that might change during training. For example, you might use a decaying learning rate, and you might want to see: where is my learning rate right now? So for debugging, for optimisation, and for better understanding, we have included a set of tools that help you with this, and they are commonly called TensorBoard; there's a variety of tools here. And setting up TensorBoard for seeing the summaries that you want to see is actually very easy. A common goal of this talk is to show you that TensorFlow does a lot of things, it does them well, and it's easy to set them up. So
00:36:23
also here with the summary writer: in order to monitor in TensorBoard how your training is doing, you have to create a summary writer, for which you just have to specify where to save the logs, and then you have to say "this is a measure that I care about". So, for example: the learning rate at this time step had this value. And then, when you look at your TensorBoard instance, you'll actually be able to see how this changes. So it's not too difficult, and of course you can also combine multiple measures if you care about multiple; it's very flexible, so you can define what you care about. And I'll show you a quick demo of how to use this
00:37:13
... the internet? Sorry about that; it seems like the laptop switched from the network that I want to be on to another network... what happened now... sorry. Well, okay, I will have to save this for later, maybe for the other talk, because it seems like the internet is not working at the moment. I actually have a live example of this in the code, but yeah, it's a pity that I can't actually just create a whole TensorBoard demo... please don't connect... no... yeah... oh, okay, now we also know what kind of format... okay, perfect, okay, and we're back. So this is
00:39:05
how the tool actually looks. You usually launch it on one of your machines, and you should be able to look at it. You can see here that there are multiple tabs. You can see your events: these are the things that you told TensorBoard you want to look at. So, for example, this is running an MNIST training.
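The pattern the speaker describes, create a writer pointed at a log directory and then record named values with a step, can be sketched framework-free; the `ScalarLogger` class and CSV format below are illustrative stand-ins, not TensorFlow's actual summary API:

```python
import csv
import os
import tempfile

class ScalarLogger:
    """Toy stand-in for a summary writer: records (tag, step, value) rows."""
    def __init__(self, logdir):
        os.makedirs(logdir, exist_ok=True)
        self.path = os.path.join(logdir, "scalars.csv")
        self._fh = open(self.path, "w", newline="")
        self._writer = csv.writer(self._fh)
        self._writer.writerow(["tag", "step", "value"])

    def log_scalar(self, tag, step, value):
        # One row per recorded measure, so a dashboard can plot value vs. step.
        self._writer.writerow([tag, step, value])

    def close(self):
        self._fh.close()

# Usage: log a decaying learning rate during a fake training loop.
logdir = tempfile.mkdtemp()
logger = ScalarLogger(logdir)
for step in range(5):
    lr = 0.1 * (0.5 ** step)  # decaying learning rate
    logger.log_scalar("learning_rate", step, lr)
logger.close()
```

A plotting tool then only needs to group rows by tag and draw value against step, which is essentially what the TensorBoard scalar view does with its own log format.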
00:39:24
And you can see how the test accuracy is going up, while the training accuracy is fluctuating a little bit, and these are the... as time goes by, this is what you want to see during your training; this is looking good. For your loss function, which in this case is the cross-entropy loss, you want to be able to see how it fluctuates during training as well; this one you want to go down. You also want to be able to plot other things: I was saying that you might want to plot the learning rate, and you can also look at dropout. As expected, dropout here is one for inference, but you can put it to whatever you want; you can put it to, say, 0.9 for training. You can also look at how your biases are changing during training, and how your weights are changing. So here, for example, you can monitor them, and if you have exploding weights (this can be a problem when training recurrent neural networks) you can see how this evolves during your training. And this is very customisable: with the code that I showed, you can add whatever summary you want to see. You can also see some of the images that the network is looking at; these are for test, these are for training, these are MNIST digits. You can also look at the graph.
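The graph tab also reflects how fetching works: evaluating one node only computes that node's ancestors in the dependency graph. A toy evaluator illustrating that pruning idea (the node names and formulas are made up for illustration; this is not TensorFlow's implementation):

```python
# Each node maps to (inputs, function). Evaluating a node only ever
# touches its ancestors, so siblings off the fetched path are skipped.
graph = {
    "x":             ([], lambda: 2.0),
    "w":             ([], lambda: 3.0),
    "logits":        (["x", "w"], lambda x, w: x * w),
    "cross_entropy": (["logits"], lambda z: -z),          # placeholder formula
    "accuracy":      (["logits"], lambda z: float(z > 0)),
}

computed = []  # record which nodes actually ran

def fetch(name, cache=None):
    """Evaluate `name`, computing only the ancestors it depends on."""
    if cache is None:
        cache = {}
    if name not in cache:
        inputs, fn = graph[name]
        args = [fetch(dep, cache) for dep in inputs]
        computed.append(name)
        cache[name] = fn(*args)
    return cache[name]

result = fetch("cross_entropy")
# "accuracy" shares the "logits" ancestor but is never computed.
```

Fetching `cross_entropy` walks only `x`, `w`, and `logits`; `accuracy` hangs off the same `logits` node but stays untouched, which mirrors the behaviour described next.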
00:40:51
So this is very important if you want to look at your computational graph in detail. It also works at different levels: if you expand this graph it's pretty big, but what you probably care about initially is that you have an input, you have a layer, you have some dropout, another layer, then the cross-entropy loss and the accuracy. So here, again, to relate it to what I was saying before about fetching: you could want to see two things, the cross entropy or the accuracy, and because the accuracy is built as an extra node, when you ask only for the cross entropy it won't compute the accuracy, because it looks and sees that it is not on the path it needs to compute to get the cross entropy. But you might also want to look at it like this: this is a very nice high-level overview, and if you have very deep networks you might just want to look at this. But if you want to look at each layer in particular, to see what it looks like, you can click here. And you see that this
00:41:47
expands, and you see what the actual layer is made up of: these are the weights, these are the biases, and, as you expect, the usual Wx plus b operation; then you apply the activation function, which in this case is the rectifier, and this is the output of the layer, and this gets fed into the cross-entropy loss. And you can look at each of these in detail, so you can look at the accuracy, and you can expand absolutely each of them, to get an idea of what graph we have actually created. And there's more: so
00:42:29
you might also want to see the activations of neurons as training goes by. Maybe you want a sparse activation, so you can monitor that: you can monitor the activations, and you can also monitor the pre-activations, that is, the values before they get fed into the nonlinearity. So here, this is again as time goes by; in the other cases as well, you want to see how your summaries change over time. So, in the code snippet that I showed you, when you add the summary you give it a name, so you create a measure that says "this is the decaying learning rate", and you also specify the time step: "I recorded this measure at evaluation step number eight hundred, and the learning rate was 0.3". And, as you expect, the activations, because this is a ReLU network, are always above zero, but the pre-activations don't have to be always above zero. So you can double-check a little bit what your pre-activations look like, what your weights look like, and so on. This is very, very useful, firstly when you get started with deep learning, but not only then: when you want to actually understand what's happening with your training. And to make this flexible black box, well, it's less of a black box if you can actually see what's happening. And this is it for this talk; I'll take questions now. Let me see... uh-huh, yes?
00:44:46
Q: Very nice. You said you can actually take NumPy arrays directly for inputting data to your model, to your graph. Does TensorFlow automatically do this transfer between the computer's memory and the GPU, or do we need to do something special?
A: No, it does that for you. Yeah, so as I said, if you want to run things on the GPU, you just have to specify "run this on the GPU", and then it will do that for you, so you don't have to try to do this stuff yourself. So when I say it's a one-line change from the CPU to the GPU, I mean it: that is all you actually have to change.
Q: Hi, I was wondering: Tensor-
00:45:31
Board is great when you want to see how your implementation is doing, but what about compilation of the graph? For example, in Theano I found the compilation errors to be rather uninformative, and I was wondering how TensorFlow does there.
A: Yeah, because you have this compilation step, I agree that sometimes errors can be confusing, but I think TensorFlow does a relatively good job at trying to explain where the problem is. So yes, you do have this issue, and it's not trivial, but it's relatively good.
00:46:10
Q: Thanks, it was a great talk. I have a question about loss functions: is there any package? Because I have, for example, a particular logistic loss; does TensorFlow support these, or do you just search for it?
A: So the ones in the neural network examples today are just supported. You just search TensorFlow for the specific loss that you want; it's probably implemented, and if it's not, you can ask Google for it to be implemented. These losses are already there in the contents of TensorFlow.
Q: Right, so there's no separate package?
A: Maybe, I don't know this by heart; maybe there is. I mean, you might just have to change an import path for the cross entropy, but they're definitely there.
Q: And what about regularisation? Is there L2 norm and so on?
A: Yes.
00:47:21
Q: Is TensorBoard extensible? What I mean is: if I want to create my own visualisation, can I do that? And what is it programmed in? It wasn't clear to me; is it Python as well?
A: So, at what level do you want to extend? If you want to extend what you see on the board itself, you can definitely do that in your own Python code. If you want to extend TensorBoard itself, you'd probably have to look into it; I'm not exactly sure, so I think you can, but if you can't, you can just ask for it, or send a pull request to have it integrated. Because the code is there, the answer is you definitely can; the question is how much effort it is to do it, because the code is all there.
Q: So there are no specific extension points, you mean, for my own visualisations?
A: The answer is: I don't know. Q: And from
00:48:22
your experience, how much did you need to go into C++ and develop custom functions and custom code in C++, and how much was TensorFlow able to deliver without that?
A: So I will talk about extending TensorFlow, if you want to add your own operations, in the next talk. But so far I've never needed it. So, from my stance, I never needed it: if I wanted to do something in Python, I've done it in Python. I've never had this issue of "oh, I can't do this part in Python, I have to go to C++".
Q: Of course. So, are there
00:49:13
overheads attached to the Tensor...
A: Sorry, I could not hear that.
Q: Overheads attached to that monitoring, that logging?
A: Oh, is there a computational overhead? Well, a little bit, but it's just logging. Honestly, it depends on what you're adding: if you're adding something that's very expensive to compute, if you want to log something that's very expensive to compute, then yes, there's overhead, because you have to compute that. But if you want to log something very simple, such as "I want to see how my learning rate decays" or something like that, then no, not really.
Q: In the
00:49:55
distributed TensorFlow setting, are there any fault-tolerance features?
A: Sorry, I can't hear you.
Q: In the distributed TensorFlow setting, are there any features for fault tolerance, for dealing with the failure of a worker? I understand there is checkpointing, and we can restart from a particular serialised version of the model, but can it do some supervision, or fancy things like, say, speculative execution, when a particular job in a synchronous model runs very slowly on one particular worker, or something like that?
A: So I think right now the support is for checkpointing and restoring from that, but I'm not aware of anything fancier.
Q: So basically, if one of my workers fails, I have to make do with however many executors, workers, are still left in the cluster?
A: Yes; although I think it will restart for you, if you still have a worker available, because that assumes the checkpoint is still available, right? It depends: if your hard drive fails, or wherever you are saving the checkpoint, then it probably has to continue without it. But overall I think that if you can restore the model from the checkpoints, it'll probably spawn another instance in your available cluster to deal with that.
Q: Can we
00:51:23
use TensorFlow with other languages, like, for example, R?
A: No. So right now the support for TensorFlow is in Python and C++. You can add other front ends, it's not too difficult, but there's no other support at the moment for languages like R.
Q: Maybe one question related to this, or maybe we'll talk about this in your next talk, but what about loading models that were spawned from other frameworks, like from Torch or from Caffe?
A: So I don't think there's anything official, but some people have worked on this. Because, again, this is the nice thing about being in an open-source community: if you have these questions, other people have had them as well, so they're looking into this and there are some solutions. The question was also whether it is doable: given that people have done it, people have done it. Any other questions? No? Okay. So it
00:16:51
knows to differentiate with respect to W and b. So we have a little bit here that says: you need to compute the gradients of the cost with respect to W and b. It knows that because it looks (oh sorry, wrong direction), it looks at how we defined W, and it knows W from the Wx plus b, and that these are variables; and when I want to minimise the cost, which depends only on x, W, b and some labels that will be given to me, it knows that what it needs to minimise with respect to is W and b. So you don't have to specify the variables yourself in this case. So here you see there's no specification of W and b: you just define the cost as shown, and then you minimise it with respect to the variables that need to be optimised. Now, in order to be able to do this training, you need some data. And you
00:17:50
can feed this data directly with the text-line reader or other readers, and you can also use anything that's supported by NumPy, because the feed dictionaries can be NumPy arrays. So if you want to read your data with your favourite NumPy reader that you are already using, you can just do that, and that works. Another thing that you can do: you might want to prefetch examples, and fetch them while you feed data to the network at the same time, so that you do things as parallel as possible. You don't want the training to stop while you fetch data, and if fetching the data is expensive, because you fetch it from disk probably, you want to fetch things at the same time. You can do this with TensorFlow queues, which asynchronously allow you to read mini-batches of data. This is very important, especially for big datasets. Now that you've trained your
00:18:50
model, you know that it does well on the training data, but how well does it do on unseen data? So you need to evaluate; you need to be able to do inference. And this is actually very similar to what we've seen with the training. So the training is on the left side of the screen, and you see that it's the same as we had before, but we have to do it a hundred times. This is something very important: the optimiser does not return you something that's already optimised; it returns an operation that you have to run as many times as you decide. But here, when we do the evaluation on the right side, you don't want to run the optimiser any more, because you have your data and you just say "please give me the cost with just this data". So no more training: the training is done, the parameters W and b have been changed already during training, and you just want to find out the cost using the learned parameters. And now what I will do: I actually have an IPython notebook that goes through some of these, and we'll train these together, so that we see, firstly, how long it takes in terms of code, and how easy it is. So, first of all, what I will do is clear the output of my kernel, so that you know I'm not cheating, and let's see how this works.
00:20:16
Okay, so now you see the output is cleared, and I'm going to start running one thing at a time and describe what the code does. First, the imports; we don't need to go through that. Secondly, I'm actually generating the data, because I'm not actually a professor and I don't have students to torture about what grade they actually got and how much they studied for it. But I had a simple heuristic: if I were a professor, maybe if you don't study at all you should get around twenty percent, bad luck for you, but afterwards it should just scale. So you have a bias of twenty regardless of how much you study, and then it grows with four times the number of hours that you study. And what this cell does is just generate the data and then plot it using matplotlib; so again, here, nicely, I can use my favourite plotting framework that I'm already used to, because we are in Python. And this is what the data looks like; I added some normal noise, because otherwise everything would be on the line. And with this example, because we know how I generated the data, we know that in the end, when we learn W and b, I want my line to be as close to the blue line as possible, because that is what I cheated in. Usually, when you do learning, you don't know what your actual solution is, but in this case, just for the pedagogical example, we know that what we want to get in the end is the blue line. Now, getting to the interesting part: we have the data, and this is the part that defines the computational graph. This is exactly what we looked at in the slides: we have two placeholders, the inputs and the labels; we have the weights, which have to be initialised, and I initialise them with a random normal; and the biases are initialised to zero. And the output, the y, is the matrix multiplication between the input and the weights, plus the biases. I define the cost exactly like in the slides, as the distance between what we predict and what we know to be the truth; this is what you usually do in supervised learning. I use this learning rate because I thought it's a good one, so why not, and then I define the gradient descent optimiser to minimise this cost. So this is the graph definition part, and I need to run that. And then, as we expect, we have to initialise the session, and for interactive notebooks there is a specific kind of session that is very helpful, which is called the interactive session. You usually don't use this outside of the terminal, when you use the interpreter, or this kind of notebook, where it's just a bit more helpful. Once we have the session, I initialise the variables; this is something all TensorFlow programs have to do. And I'm just going to run that.
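For readers following along without the notebook, the same exercise can be sketched in plain NumPy: generate grades from hours studied with true parameters W = 4 and b = 20, fit them by gradient descent, and compare against the closed-form least-squares solution. The learning rate, iteration count, and noise level here are my own choices, not necessarily the ones used in the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: grade = 20 + 4 * hours + Gaussian noise.
hours = rng.uniform(0, 20, size=200)
grades = 20.0 + 4.0 * hours + rng.normal(0.0, 2.0, size=200)

# Gradient descent on the mean squared error of y = w * x + b,
# mirroring what the TensorFlow optimiser does iteratively.
w, b = rng.normal(), 0.0
lr = 0.003
for _ in range(20000):
    pred = w * hours + b
    err = pred - grades
    w -= lr * 2.0 * np.mean(err * hours)  # d(MSE)/dw
    b -= lr * 2.0 * np.mean(err)          # d(MSE)/db

# Closed-form least squares for comparison, as in the end of the demo.
X = np.column_stack([hours, np.ones_like(hours)])
(w_cf, b_cf), *_ = np.linalg.lstsq(X, grades, rcond=None)
```

Because the cost is convex, the iterative solution converges to the same answer as the closed-form one, and both land near the true (4, 20) up to the noise, which is the point the speaker makes with the 4.1 / 17.9 versus 4 / 19.4 numbers.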
00:23:25
And now let's look a little bit at what the variables look like without learning. So I'm not doing any learning yet, right: I've just defined the computational graph, and I have not run the update operation. Because I have not run the update operation, the weights look random and the biases are zero, as I initialised them, and this is what the predictions look like: the biggest grade is twenty-one. That's not great; this is not what we want. I want my students to be able to pass, for example. And let's also actually plot this, to see that it's very far from the blue line; and of course the cost without training is huge. But that brings us to the training. The training is two lines: it's a for-loop in which you run the update operation a thousand times with the data that I generated. So this is the training data that I generated; this one contains the random noise, to make the problem more difficult. And I just run the update operation a thousand times; I'm just going to execute that. So this is the part that does the training, right: we run the update operation. So now we expect the cost to be smaller, we expect the weights and the biases to be updated, and we expect to see that we have predicted a line very close to the blue line. So let's look at this: the cost is now way smaller. Look, this was the cost without training, and, as expected, the cost with training is much smaller. Now let's look at the values of the weights and the biases. Because I cheated and I actually created the data, I know that the ideal value for the weight should be four, and for the bias it should be twenty. So we're not quite there; we are at 4.1 and 17.9, but maybe if I ran it a thousand more times, or if I changed my learning rate, I would get to something better. That's not really the point; the point here is just to see how to use TensorFlow for this. And let's also look at how the lines look, plotted again using matplotlib: they're almost indistinguishable here. So we can see that we actually learned pretty well what we wanted to learn. And this is the entire program; it actually fits in one screen. So this does the entire training, from defining the graph to initialising the session, and the training is just running the update operation, in two lines. This is just to exemplify a complete TensorFlow program that runs something very simple; in a bit we will actually look at how to do this with neural networks too. And just for completeness, and for fun, I wanted to show you the closed-form solution, because, you know, for linear regression you don't actually have to do it iteratively; I did it like this to show how the TensorFlow concepts work. And I'm going to use SciPy to see what the values are if we use the closed-form solution; remember, what we want to see is four and twenty. With SciPy we do a little bit better than my iterative algorithm: we are at 4 and 19.4, so pretty close to the 4 and 20 that I started with. That doesn't mean that we cannot improve; it's just to show that this is also pretty simple, and you shouldn't actually do linear regression iteratively like this. Let's go back to the slides.
00:26:59
Yeah. And now I want to discuss a little bit some other concepts that are out there, which I think are very important, also for someone who is trying to figure out what TensorFlow should be used for. So, first of all, you want to be able to use the GPU, and this is a one-line change. In order to use the GPU, you have to tell TensorFlow: please run this part of the graph on the GPU. And if you have multiple GPUs, you can say "this is GPU 2", and obviously you can make this a flag in your Python program, and you can seamlessly switch from the CPU to the GPU; you don't have to do anything else apart from changing this line. And this is exactly the example that we had before, the exact same code that defines the variables and the placeholders, only that this time, when I do session dot run (actually, here I forgot to specify x, but it's the same computation), this will actually run on the GPU, only with this extra line. And something worth mentioning is that if you have set up TensorFlow with the GPU, it will actually do that by default. So you don't even have to worry about this: if you have a GPU on your machine, and you have set TensorFlow up with the GPU, then this will just work. But how about if you have multiple GPUs? Is that much more difficult? Well, not really. That's the nice thing, I think, about TensorFlow: it makes it very easy for you to run this computation on two GPUs.
00:28:46
So if you have multiple GPUs, you can say: I'm going to loop over the GPUs that I have, and I'm going to place different parts of the computational graph on different devices. So here, let's assume again that I have the computation where I want to do a times b twice, and then add the results together. I actually say, for these devices: compute this part of the graph for us.
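The placement pattern here, two independent matrix products computed in parallel and then summed on the host, can be mimicked framework-free with threads; this is purely an analogy for the dataflow, not how TensorFlow actually schedules GPU kernels:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(1)
a = rng.normal(size=(64, 64))
b = rng.normal(size=(64, 64))

def matmul_on_device(x, y, device):
    # Stand-in for "with tf.device(device): tf.matmul(x, y)".
    return device, x @ y

# Launch the two independent products "on" separate devices in parallel...
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(matmul_on_device, a, b, f"/gpu:{i}") for i in range(2)]
    results = [f.result()[1] for f in futures]

# ...then combine the partial results on the "CPU", as in the talk's example.
total = results[0] + results[1]
```

The key structural point survives the analogy: the two products have no dependency on each other, so they can run concurrently, and only the final addition needs both results.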
00:29:15
And then compare compared amount of
00:29:17
multiplication on the on the GPU on
00:29:20
each of that you'd be use and then I
00:29:22
can pure dietician on on the CPU so
00:29:25
this is also not and not very very
00:29:29
difficult in case you want to to switch
00:29:31
like this between different keep you
00:29:33
cards and the CPU now a bit about
00:29:38
Now, a bit about distributed training, which you might also be interested in: how to use TensorFlow in this setting. I mentioned before, in the first talk, data parallelism and model parallelism; what's the actual support? Let's look at some concrete examples to see how easy it actually is to set up. The model for distributed training, when you look at what you would have to deal with, is a cluster. This is, as expected, a set of tasks that participate in the distributed execution of the graph; this is normal terminology. You have one server per task, and each server has a master and a worker: the master creates the sessions to be able to execute operations on the graph, and the worker executes the operations in the graph. So this is the terminology that you would have to use if you want to work with distributed training.
Now, how do you tell TensorFlow: this is my cluster? With something that looks pretty much like this: these are the workers that I want to use, and these are the parameter servers. This doesn't look ideal if you have multiple workers — if you have a cluster of a hundred machines it can get tedious. Is there a better way to do this? It's not really supported right now, but we are looking into it, and if you have a favourite tool for cluster management, just file an open-source feature request saying "I want to use this with TensorFlow" and we will definitely look into it. But for now, this is how you do it: you specify these are the workers, and these are the parameter servers.
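A cluster definition of the kind shown on the slide might look like this — the `host:port` strings are invented, but `tf.train.ClusterSpec` is the real API:

```python
import tensorflow as tf

# Two workers and one parameter server, listed explicitly by host:port.
cluster = tf.train.ClusterSpec({
    "worker": ["worker0.example.com:2222",
               "worker1.example.com:2222"],
    "ps": ["ps0.example.com:2222"],
})
```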
Now, if you first want to just share the graph across devices — this, again, is the model parallelism that we discussed — it works exactly like with the GPU cards: you say this part should be here, this part should be here, and this part should be here, and then I initialise the session and run the train operation. So it's very easy, but you have to decide here what part goes where, and this is where you have to take into account what I discussed in the first talk: is it worth it? What's the best way to do this for my particular model? Am I just wasting so much bandwidth sending things over the network that it's not really worth it? This is the way to do it, but definitely take the trade-offs into account.
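Sharing the graph across a cluster's devices looks just like the GPU case, only the device strings name jobs and tasks. A hypothetical split (shapes and names are illustrative; 1.x-style API via `tf.compat.v1`):

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

# Keep the parameters on the parameter server, the maths on a worker.
with tf.device("/job:ps/task:0"):
    weights = tf.Variable(tf.zeros([784, 10]))

with tf.device("/job:worker/task:0"):
    x = tf.placeholder(tf.float32, [None, 784])
    logits = tf.matmul(x, weights)     # placed on worker 0
```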
Now, if you want to do data parallelism — if you want to have multiple workers that each process different examples in parallel — you can do that, and it's very easily supported, both asynchronous and synchronous data parallelism. Here I will show the asynchronous example. These slides are a bit denser than the other slides, but please bear with me: it's not that difficult to actually start something on multiple machines. First, you have to create a cluster, as I showed before, so you have to give the hosts for the parameter servers and for the workers; then you have to create a server for that cluster.
And a supervisor: the supervisor is a utility that allows you to work very well with this distributed training. It does a lot of nice things for you, like checkpointing your model and so on, so it's very much recommended that when you do distributed training, you work with the supervisor. Now that you have this set up, how do you actually create the model to do asynchronous data parallelism?
In your with statement, you have to specify the replica device setter with the cluster, and then you create the model. This is just creating the computational graph, as I discussed before: it's whatever model you want, with an optimiser, and you minimise the loss. And then, once you have created this, you tell TensorFlow that you actually want to train.
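Putting the pieces together, the asynchronous data-parallel graph construction might be sketched like this: the replica device setter automatically puts variables on the parameter servers and the other ops on the worker replica (hosts are invented; 1.x-style API via `tf.compat.v1`):

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

cluster = tf.train.ClusterSpec({"ps": ["ps0:2222"],
                                "worker": ["worker0:2222", "worker1:2222"]})

# Variables land on /job:ps, the remaining ops on the worker replica.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    x = tf.placeholder(tf.float32, [None, 1])
    y_ = tf.placeholder(tf.float32, [None, 1])
    w = tf.Variable(tf.zeros([1, 1]))
    b = tf.Variable(tf.zeros([1]))
    cost = tf.reduce_mean(tf.square(tf.matmul(x, w) + b - y_))
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

print(w.device)   # the variable was placed on /job:ps/task:0
```

Each worker would then create a `tf.train.Server` for its task and, as the talk describes next, drive `train_op` through a `tf.train.Supervisor` managed session, which handles initialisation and checkpointing.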
So now we'll hear more about TensorFlow, and there will be more on this this afternoon.
Okay, let's get going. I have this slide to commemorate the end of your break — not the start of your break — because you're all so excited to hear more. So this second talk is a hands-on introduction, mainly about how to use TensorFlow for something very simple, because initially I just want to get across what the workflow with TensorFlow is, with this basic example. Then we will look at how to use some of the visualisation tools and how to run your computation on the GPU or on multiple GPUs, and then we will have the break.
This first example we'll go through is linear regression. I will use a concrete example here to make things very easy to understand. So assume I'm a professor, and I want to fit a line between how much my students study and the grade they actually get on my exam. Because if students don't study at all and still get a hundred percent, well, that's not a good exam; but if it scales exponentially — if students have to study a hundred more hours to get one extra point — then that's not good either. So I want to make sure that there is a roughly linear relationship between the number of hours they study and the grade that they get. And let's assume I have this data — obviously you should take this with a grain of salt, since it's what students answer when their professor asks them "how much did you study for my exam?" — but let's just assume that we have this data, and we want to fit a line through it: we want to learn the slope and the intercept that best fit the data that we see.
got that we that we see and let's first
00:02:13
look at the principal program that that
00:02:14
something very simple they just compute
00:02:16
this W times X plus B plus one. So this
00:02:21
is kind of how it looks like we will go
00:02:22
into detail of some of the concepts
00:02:24
here. But you already noticed that
00:02:26
looks a bit different then what you
00:02:28
would expect if you are familiar with a
00:02:31
python already so if you use a new by
00:02:35
this is the difference between the new
00:02:37
my problem a program and that answer
00:02:39
for program. So the new type a program
00:02:41
just defines that answer is yeah easily
00:02:44
doesn't have the notion of variable it
00:02:46
doesn't have the notion of constant and
00:02:47
placeholder and then just does the
00:02:50
computation the computation part well
00:02:52
the the definition of the computation
00:02:54
part looks very similar just the matrix
00:02:56
multiplication and the addition. So
00:03:00
this is just to give you an intuition
00:03:02
of how things before at the first
00:03:04
glance. And just just to stress here a
00:03:08
little bit noble mutation monograph is
00:03:11
really crucial when you deal with
00:03:13
several. So whenever you're creating
00:03:15
something you should kind of have in
00:03:16
mind this idea of what's the
00:03:17
computational graph that I'm creating.
00:03:20
And for this very simple expression for
00:03:22
this very simple expression. This is
00:03:25
how the raffle look like. So the reason
00:03:27
this called answer flow is because that
00:03:29
answers afro into the graph. So that
00:03:33
dancers are always the ages that go
00:03:37
into what an operation. And then this
00:03:41
output and cancer again so W times X is
00:03:44
another ten so that it added to be that
00:03:47
it added to one obviously for the
00:03:49
learning example. So for the linear
00:03:51
regression we don't need the one I had
00:03:52
to the ones that we see a constant
00:03:54
example but this will be some some by
00:03:56
the bias in the in the learning. So
00:03:58
always think when you when you deal
00:04:01
with this what what my graph looking
00:04:03
like in Lucy also in this talk that you
00:04:06
can actually visualise your graph was a
00:04:09
very easily with already available
00:04:11
tools. Now, let's look at each of the concepts TensorFlow has, one at a time. The simplest is the constant: something that you don't want to change during training. So 1 is a constant, and you can also define constants explicitly, like this — very straightforward, as you would expect. Now, most tensors in TensorFlow are transient: you don't actually get a handle on them, they don't have a name, and you can't really look at them; for example, this one here doesn't have a name. But the variables, especially in learning, are very important, because we want to be able to change them — that's why we're doing the learning in the first place: we want to change the W and the b so that they fit our data. Here it's a simple linear regression model, but maybe you have a very complicated neural network; the idea is the same: variables are the parameters of the model, and variables are special tensors on which you actually have a handle. Always note that you have to provide the initial value of the variable to TensorFlow, so that it knows how to initialise it. And once you have a session — we'll see a bit later what the session is — you have to initialise all variables; this is actually required.
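A variable sketch matching the running example (names are illustrative; 1.x-style API via `tf.compat.v1`):

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

# The initial value is mandatory: TensorFlow must know how to initialise it.
W = tf.Variable(tf.ones([1, 2]), name="weights")
b = tf.Variable(tf.ones([1, 1]), name="bias")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())   # required before use
    print(sess.run(W))    # [[1. 1.]]
```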
Now, about another concept: placeholders. What are placeholders? Placeholders keep a place in the computational graph for data or information that you don't have right now — for example, the user data, the input data, the training examples: you don't know them when you create the graph. You will want to feed, say, one example at a time, or one batch at a time, and so on; you don't know this when you start creating the graph, but you want to tell TensorFlow: hey, I will have this at some point, and I promise I will give it to you when I actually ask something of you. Currently we're not asking anything of TensorFlow — we're just building the graph — so it doesn't need to know the value of this placeholder. So this says to TensorFlow: this is something I really care about, but I don't have the value yet; I will tell it to you later.
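In the running example the placeholder is just a typed, shaped slot (1.x-style API via `tf.compat.v1`); the promised value arrives later through `feed_dict`:

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

# Reserves a place in the graph for a 2x1 float input we don't have yet.
x = tf.placeholder(tf.float32, shape=[2, 1])
```

At `session.run` time the promise is kept with `feed_dict={x: [[3.0], [4.0]]}`.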
This part of the code from our initial example defines the computational graph. Nothing actually gets executed here; y is not computed here — it just creates the graph. So y doesn't have a value, and it cannot have a value, because we didn't say what x is: since x does not have a value — it's just a placeholder — we actually cannot compute y. This part just builds the graph. This is very important: most TensorFlow programs start with this; you define your computational graph without actually
executing it. In order to execute operations on the graph, you have to initialise a session, and with a session you can do operations on the computational graph: you can feed data into the graph — this relates to the placeholders I was talking about — and you can fetch data from the graph. Why would you fetch data from the graph? That's why we're doing this in the first place: we want to do inference; we want to be able to ask our model what grade a student will get, given that they studied ten hours.
And this is what it actually looks like in code. I colour-coded things here; I hope you can see the colours. Again, this is the usual structure of a TensorFlow program: after you have created the graph, you initialise the session with this very simple with statement; then — remember, I said you actually have to initialise all variables explicitly — you use session.run to feed data into the graph and to fetch data from the graph. So now, what are these arguments? The y tells the session what I actually want to compute, because you might have a number of things in your model that you could compute out of the graph, but your ask is: I want the value of y. And in order to get the value of y, you need to provide a value for x, because we have not told TensorFlow before what value of x it should use — and this is what the feed dict does here. It is a dictionary from placeholder to the value of the placeholder; in this case it's a simple list, and it corresponds to x. We have to do this because we defined the
placeholder here. Now, what computation does this code actually do? Well, it does what we told it: W times x plus b plus 1. W and b are variables, and we initialised them to one; x is [3, 4], because that corresponds to this here; and the output is 9.
So this is a very simple end-to-end program. It does not do any learning yet, but it is very important to have in mind when you think about how to work with TensorFlow: you have created the graph; then you want to do something with the graph, and for that you need a session. You run things with the session: you tell the session what you want to get out of it, and you give it all the information it needs — because it can't give you the output if it doesn't have all the information — and then you get the output out of it.
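The whole feed/fetch cycle just described, end to end (same toy numbers as the slides; 1.x-style API via `tf.compat.v1`):

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

# 1. Create the graph: y = W·x + b + 1.
W = tf.Variable(tf.ones([1, 2]))
b = tf.Variable(tf.ones([1, 1]))
x = tf.placeholder(tf.float32, [2, 1])
y = tf.matmul(W, x) + b + 1.0

# 2. Initialise a session, feed x, fetch y.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(y, feed_dict={x: [[3.0], [4.0]]})

print(out)   # [[9.]]
```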
What TensorFlow actually does under the hood when you make this session.run call is that it replaces this with a feed node, and it adds a fetch node here. The fetch node is used to figure out which nodes in the graph actually need to be computed to get this output — because, again, I might not need to compute the whole graph, all the nodes and all the operations. We want to optimise this; we want to run as little computation as possible. So this is what actually happens under the hood when you do this.
Again, just to recap this very important example program: you always create the computational graph, you initialise the session, and then you feed and fetch data from the graph. Even super-complicated neural network programs do more or less this.
Going back to the entire program and colour-coding it: you create the computational graph — this part does not execute anything; it does not compute y, it just creates the graph — you initialise the session, you feed the data into the graph, and you fetch the data out of the graph and print it, or do whatever you want with it; presumably you want it,
since you asked for it. But this was a very simple example; in practice, we actually want to learn W and b. We initialised them and told TensorFlow they are variables, so we can learn them — but we need a way to change them. The simplest way to change things is with an assign operation. This update-weights operation here tells TensorFlow: assign to W the value zero. But this does not run the update operation — with this, we are still in the building-the-computational-graph phase. So if I create a session and I initially run W.eval() — this is the same as session.run(W), but it does not need any more information, because W is just a variable here — then printing this is going to give [1, 1], because we have not actually run the update operation. So this part defines the operation but does not run it; once we run it, then we see that W is actually equal to [0, 0], as you would expect. This is a bit different from what you're usually used to in programming, where when you write a command it actually gets executed; here it creates the graph. This is very important to have in mind.
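The assign behaviour just described, as runnable code (1.x-style API via `tf.compat.v1`); note the first print still shows the initial value, because the assign op has only been defined, not run:

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

W = tf.Variable(tf.ones([2]))
update_weights = tf.assign(W, tf.zeros([2]))   # defines the op, runs nothing

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(W.eval())            # [1. 1.] -- the assign has not run yet
    sess.run(update_weights)   # now actually run the assign op
    print(W.eval())            # [0. 0.]
```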
But what we actually want to do is not to change W from [1, 1] to [0, 0]; we actually want to learn the slope and the intercept of the line in this example, so that we see how the number of studied hours corresponds to the grade that the student gets in the exam. And how do we do this? We have some data that we collected before, and we want to minimise the distance between what our model predicts and what we know to be the truth — and this is not only the case for linear regression; it's the case for supervised learning in general. So we choose a distance, and we minimise the distance between what we have predicted and what we know. This is how it looks: this is what we predict using our parameters — the prediction always uses the parameters — and this is what we know to be the truth, and what we want to minimise is the distance between them. And of course we want to do this over all our data points, because you don't want to be biased: you want to do this over your entire training set.
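Written out, the objective described above — with \(x_i\) the studied hours, \(y_i\) the observed grade, and \(N\) training points — is the least-squares problem:

```latex
\min_{W,\, b} \;\; \sum_{i=1}^{N} \bigl( (W x_i + b) - y_i \bigr)^2
```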
So how does this look in TensorFlow? Now you need another placeholder. Initially we had the placeholder for the x, but in supervised learning you don't only have the examples: you also have the associated labels that are given for these examples. Those are also a placeholder, because you don't know them beforehand; when you train, you give an example and its label, and that is the placeholder here. y is computed exactly as before, W times x plus b; you compute the distance as the difference between the two; you take the squared distance, because you don't want the sign to matter; and this is the final cost that we want to minimise. So this is an optimisation problem, with a sum over all the
training samples. Now, in TensorFlow there is support for a lot of optimisers, and they're very easy to use. You're probably very familiar with the gradient descent optimiser, which simply takes steps in the direction of steepest descent, but you can also use some of the other optimisers, such as Momentum, AdaGrad, and Adam. And I said it's one line — it's actually one line.
This is the line here. The first line actually computes the cost — this is still part of creating the computational graph; we are not yet at the part where we create the session. We're creating the part of the computational graph corresponding to the cost, and the part corresponding to the update operation. And the way the update is computed: you tell TensorFlow to use the gradient descent optimiser with this learning rate — which you choose, because maybe you tried a couple of them, or you have a hunch that 0.01 will work well today — and to minimise the cost that you have defined. So the cost here is part of the computational graph; this is also part of the computational graph, and with it we have defined the op that we will actually get to run. Because here we are not updating the variables yet; we are just defining the operation that will allow us to update the variables.
Now, how do we actually run this? This is very similar to the example before, when we were not doing training — only now we actually want to take steps with this train step that we defined as part of the computational graph, a step of the gradient descent optimiser, and I run it a hundred times with different data. So again, as I said, here we have two placeholders: the placeholder for the examples and the placeholder for the labels associated with them. And this actually does the training; so this actually knows that it needs to
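The training loop described in this part of the talk — graph, cost, gradient descent step, and a hundred iterations — can be sketched end to end on made-up hours/grades data (the true slope 2 and intercept 1 are invented for illustration; 1.x-style API via `tf.compat.v1`):

```python
import numpy as np
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

# Invented "hours studied -> grade" data around the line y = 2x + 1.
np.random.seed(0)
xs = (np.random.rand(100, 1) * 10.0).astype(np.float32)
ys = (2.0 * xs + 1.0 + 0.1 * np.random.randn(100, 1)).astype(np.float32)

x = tf.placeholder(tf.float32, [None, 1])    # examples
y_ = tf.placeholder(tf.float32, [None, 1])   # labels
W = tf.Variable(tf.zeros([1, 1]))
b = tf.Variable(tf.zeros([1]))
y = tf.matmul(x, W) + b                      # the model's prediction

cost = tf.reduce_mean(tf.square(y - y_))     # squared distance to the truth
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):                     # a hundred gradient steps
        sess.run(train_step, feed_dict={x: xs, y_: ys})
    w_val, final_cost = sess.run([W, cost], feed_dict={x: xs, y_: ys})
```

After a hundred steps the learned slope should already be close to the true slope; the intercept converges more slowly, which is the usual behaviour of plain gradient descent on this kind of quadratic.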

Conference program

Deep Supervised Learning of Representations
Yoshua Bengio, University of Montreal, Canada
4 July 2016 · 2:01 p.m.
Hardware & software update from NVIDIA, Enabling Deep Learning
Alison B Lowndes, NVIDIA
4 July 2016 · 3:20 p.m.
Day 1 - Questions and Answers
Panel
4 July 2016 · 4:16 p.m.
Torch 1
Soumith Chintala, Facebook
5 July 2016 · 10:02 a.m.
Torch 2
Soumith Chintala, Facebook
5 July 2016 · 11:21 a.m.
Deep Generative Models
Yoshua Bengio, University of Montreal, Canada
5 July 2016 · 1:59 p.m.
Torch 3
Soumith Chintala, Facebook
5 July 2016 · 3:28 p.m.
Day 2 - Questions and Answers
Panel
5 July 2016 · 4:21 p.m.
TensorFlow 1
Mihaela Rosca, Google
6 July 2016 · 10 a.m.
TensorFlow 2
Mihaela Rosca, Google
6 July 2016 · 11:19 a.m.
