Okay, let's get going. I have this slide to commemorate the end of your break, not the start of your break, because you're all so excited to hear more. This second talk is a hands-on introduction, mainly about how to use TensorFlow on something very simple, because initially I just want to get across what the workflow with TensorFlow is, using this basic example. Then we will look at how to use some of the visualisation tools, and how to run your computation on the CPU and on multiple GPUs, and then we will have a break. So this first
example is linear regression. I will use a concrete example here, so that things are very easily understandable. Assume I'm a professor, and I want to fit a line between how much my students studied and the grade they actually get on my exam. Because if students don't study at all and they get a hundred percent, well, that's not a good example; but if it scales exponentially, if students have to study a hundred more hours to get an extra point, then that's not good either. So I want to make sure there is a linear relationship between the number of hours they study and the grade that they get. And let's assume I have this data. Obviously you should take this with a grain of salt; it's what students answer when their professor asks them, "how much do you study for my exam?". But let's just assume that we have the data, and we want to fit a line through it: we want to learn the slope and the intercept that best fit the data that we see. Let's first look at a TensorFlow program that does something very simple; it just computes W times x plus b plus one. So this
is roughly how it looks; we will go into detail on some of the concepts here. But you already notice that it looks a bit different from what you would expect if you're already familiar with Python. If you use NumPy, this is the difference between the NumPy program and the TensorFlow program. The NumPy program just defines the data directly: it doesn't have the notion of a variable, it doesn't have the notions of a constant or a placeholder, and it just does the computation. The definition of the computation part, though, looks very similar: just the matrix multiplication and the addition.
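To make the NumPy side of that comparison concrete, here is a minimal sketch (the values W = [1, 1], b = 1, x = [3, 4] are the ones used in the running example later in the talk):

```python
import numpy as np

# Plain NumPy: no variables, constants, or placeholders --
# the values exist immediately and the computation runs immediately.
W = np.array([1.0, 1.0])
b = 1.0
x = np.array([3.0, 4.0])

y = np.dot(W, x) + b + 1.0
print(y)  # 9.0
```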
This is just to give you an intuition of how things look at first glance. And just to stress this a little: the notion of a computational graph is really crucial when you deal with TensorFlow. Whenever you're creating something, you should have in mind the idea of what computational graph you are creating. For this very simple expression, this is how the graph looks. The reason it's called TensorFlow is that tensors flow through the graph: tensors are always the edges that go into an operation, and the output is a tensor again. So W times x is another tensor, then it gets added to b, then it gets added to one. Obviously, for the linear regression we don't need the one; I added the one as a constant example, but b will be the bias in the learning. So always think, when you deal with this: what is my graph looking like? You will also see in this talk that you can actually visualise your graph very easily with already available tools. Now let's look at each of these
concepts that TensorFlow has, one at a time. The simplest is the constant: that's something that, during training, you don't want to change. So here, one is a constant, and you can also define things like this. Very straightforward, as you would expect. Now, most tensors in TensorFlow are transient: you don't actually get a handle on them, they don't have a name, and you can't really look at them. For example, this one here doesn't have a name. But the variables, especially in learning, are very important, because we want to change them; that's why we're doing the learning in the first place. We want to change the W and the b so that they fit our data. In this case it's the very simple linear regression model; maybe you have a very complicated neural network instead, but the idea is the same: variables are the parameters of the model, and variables are special tensors on which you actually have a handle. And always note that you have to provide the initial value of the variable to TensorFlow, so that it knows how to initialise it. Once you have a session (we will see later what a session is), you have to initialise all variables; this is actually required. Now, about
another concept: placeholders. What are placeholders? Placeholders keep a place in the computational graph for data, or for information, that you don't have right now. For example, the input data, the training examples: you don't know them when you create the graph. You will want to feed, say, one example at a time, or one batch at a time, and so on. So you don't know this when you start building the graph, but you want to tell TensorFlow: "hey, I will have this at some point, and I promise I will give it to you when I ask something of you." Because currently we're not asking anything of TensorFlow; we're just building the graph, so it doesn't need to know the value of this placeholder. We will tell it later. It says to TensorFlow: this is something I really care about, but I don't have the value yet, and I'll tell it to you later. So this part of the
code from our initial example defines the computational graph. Nothing actually gets executed here; y is not computed here, it just creates the graph. So y doesn't have a value, and it cannot have a value, because we haven't said what x is. Since x does not have a value, it's just a placeholder, we actually cannot compute y; this part just builds the graph. This is very important: most TensorFlow programs will start like this. You define your computational graph without actually executing it. In order to execute operations on the graph, you have to initialise a session, and with a session you can do operations on the computational graph. You can feed data into the graph, which relates to the placeholders I was talking about, and you can fetch data from the graph. Why would you fetch data from the graph? Well, that's why we're doing this in the first place, right? We want to do inference; we want to be able to ask our model: what's the grade that a student will get, given that they studied ten hours? And this is what it actually looks like in code. So I
colour-coded things here; I hope you can see the colours. Again, this is the usual structure of a TensorFlow program. After you have created the graph, you initialise the session with this very simple with statement. Then, remember, I said you actually have to initialise all variables explicitly. And then you use session.run to feed data into the graph and to fetch data from the graph. So what are these arguments? The y tells the session what I actually want to compute, because in a possibly big model you might have a number of things you could compute out of the graph, but your task is: I want the value of y. And in order to get the value of y, you need to provide a value for x, because we have not told TensorFlow what value of x it should use; this is what the feed dict does here. It is a dictionary from placeholder to the value of the placeholder; in this case it's a simple list, and of course we have to do this because we defined the placeholder here. Now, what computation does this code actually do? Well, it does what we told it: W times x plus b plus one. W and b are variables, and we initialised them to [1, 1] and 1; x is [3, 4], because it corresponds to this feed here; and the output is nine. So this is a very simple end-to-end program that does not do any learning yet, but it is very important to have this in mind when you think about how to deal with TensorFlow. You have created the graph, then you want to do something with the graph, and for that you need a session. You run things with the session: you tell the session what you want to get out of it, along with all the information it needs, because it can't give you anything if it doesn't have all the information, and then you get the output out of it. And what TensorFlow actually does under the hood, when you make this session.run call, is it replaces the placeholder with a feed node and it adds a fetch node here. The fetch node is used to figure out which nodes in the graph actually need to be computed to get this output, because I might not need to compute the whole graph, all the nodes and all the operations. We want to optimise this; we want to run as little computation as possible. So this is what actually happens under the hood when you do this.
Now, again, just to recap this very important example program: you always create the computational graph, you initialise the session, and then you feed and fetch data from the graph. Even for super complicated neural network programs, this is more or less what you do. Going back to the entire program and colour-coding it: you create the computational graph (this part does not execute anything; it doesn't compute y, it just creates the graph), you initialise the session, you feed the data into the graph and fetch the data out of the graph, and print it or do whatever you want with it, because presumably you wanted it, since you asked for it.
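Put together, the no-learning example from the slides can be sketched like this. I'm writing it against the session-style API via `tf.compat.v1`, so it also runs under TensorFlow 2; the exact variable names are my choices, not the slide's:

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # use the graph/session workflow

# 1. Build the computational graph (nothing is executed here).
x = tf.compat.v1.placeholder(tf.float32, shape=(2,))  # value promised later
W = tf.Variable([1.0, 1.0])                           # variables: handles we can change
b = tf.Variable(1.0)
y = tf.reduce_sum(W * x) + b + 1.0                    # W.x + b + 1

# 2. Create a session, initialise all variables, then feed and fetch.
with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    out = sess.run(y, feed_dict={x: [3.0, 4.0]})      # fetch y, feed x
print(out)  # 9.0
```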
But this was a very simple example; in practice we actually want to learn W and b. Now, we initialised them and we told TensorFlow they are variables, so we can learn them, but we have to be able to change them. The simplest way to change things is with an assign operation. So here, this update_weights operation tells TensorFlow: assign to W the value zero. But this does not run the update operation; we're still in the building-the-computational-graph phase. So if I create a session and I initially run W.eval() (this is the same as session.run(W), but it does not need any more information, because W is just a variable), then printing this is going to give [1, 1], because we have not actually run the update operation. This part only defines the operation; it does not run it. Once we run it, we will see that W has actually been updated to [0, 0], as you would expect. This is a bit different from what you're usually used to in programming, where when you write some command it actually gets executed; here it just creates the graph. This is very important to have in mind.
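A minimal sketch of that define-then-run distinction, again in the session-style `tf.compat.v1` API (the name `update_weights` follows the talk; the rest is my assumption):

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

W = tf.Variable([1.0, 1.0])
# This only ADDS an assign operation to the graph; nothing runs yet.
update_weights = tf.compat.v1.assign(W, [0.0, 0.0])

with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    before = sess.run(W)        # still [1., 1.]: the op was defined, not run
    sess.run(update_weights)    # now actually run the assign operation
    after = sess.run(W)         # [0., 0.]
```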
But what we actually want to do is not to change W from [1, 1] to [0, 0]; we actually want to learn the slope and the intercept of the line in this example, so that we see how the number of studied hours corresponds to the grade that the student gets in the exam. And how do we do this? We have some data that we collected before, and we want to minimise the distance between what our model predicts and what we know to be the truth. This is not only the case for linear regression; it's the case for supervised learning in general. So we choose a distance, and we want to minimise the distance between what we have predicted and what we know. This is how it looks: this is what we want to minimise, this is what we predict using the parameters (the prediction always uses the parameters), and this is what we know to be the truth. What we want to minimise is this distance, and of course we want to do this over all our data points, because, for example, you don't want to introduce a bias; you want to do this over your entire training set.
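As a small sketch of that objective in pure NumPy, with a hypothetical toy dataset (for the linear model the distance used here is the squared difference, as in the talk):

```python
import numpy as np

def cost(w, b, x, y_true):
    """Sum over the training set of squared distances between
    the model's predictions and the known labels."""
    y_pred = w * x + b
    return np.sum((y_pred - y_true) ** 2)

x = np.array([2.0, 5.0, 10.0])     # hours studied (toy data)
y_true = 4.0 * x + 20.0            # grades, if the true line were y = 4x + 20

print(cost(4.0, 20.0, x, y_true))  # 0.0: the true parameters have zero cost
print(cost(0.0, 0.0, x, y_true) > 0)  # any other line pays a positive cost
```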
So how does this look in TensorFlow? Now you need another placeholder. Initially we had the placeholder for the x, but in supervised learning you don't only have the examples, you also have the associated labels that are given for these examples, and those are also a placeholder, because you don't know them beforehand. When you train, you will give an example and its label, and that will be this placeholder here. The y is computed exactly as before, W times x plus b; you compute the distance as the difference between the two; you take the squared distance, because you don't want the sign to matter; and that is the final cost that we will want to minimise. So this is an optimisation problem, with a sum over all the training samples. Now, in TensorFlow there is support for a lot of optimisers, and they're very easy to use. You are probably very familiar with the gradient descent optimiser, which simply takes steps in the direction of steepest descent, but you can also use some of the other optimisers, such as momentum, Adagrad or Adam. And I said it's one line; it's actually one line, and this is the line here. The first line actually computes the cost, so this is still part of creating the computational graph: we are not yet at the part where we create the session. We're creating the computational graph corresponding to the cost, and corresponding to the update operation. And the way the update is computed is: you tell TensorFlow, use the gradient descent optimiser with this learning rate, which you choose (because maybe you try a couple of them, or you have a hunch that 0.01 will work well today), and minimise the cost that you have defined. So the cost here is part of the computational graph, and this is also part of the computational graph: it defines the op that we will actually get to run. Here we are not updating the variables yet; we're just defining the operation that will allow us to update the variables.
Now, how do we actually run this? This is very similar to the example before, when we were not doing training, only now we actually want to run this step, the step that we defined as part of the computational graph to be a step of the gradient descent optimiser, and I run it a hundred times with different data. Again, as I said, here we have two placeholders: the placeholder for the examples, and the placeholder for the labels associated with them. And this actually does the training. TensorFlow actually knows that it needs to solve for W and b; we don't have to tell it, "hey, you need to compute the gradient of the cost with respect to W and b". It knows that because it looks (sorry, wrong direction) at how we defined things: y is W times x plus b, and these are variables. So when I want to minimise a cost that depends only on W, b and some labels that will be given to me, it knows what it needs to minimise with respect to, namely W and b. You don't have to specify the variables yourself in this case: here you see there is no specification of W and b. You just define the cost as shown, and then you minimise it with respect to the variables that need to be optimised.
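What the gradient descent optimiser's update step does can be sketched by hand in NumPy, on a toy version of the same linear regression (the learning rate and iteration count are my arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 20, 200)                    # hours studied
y_true = 4 * x + 20 + rng.normal(0, 1, 200)    # grades: slope 4, bias 20, noise

w, b = 0.0, 0.0
learning_rate = 0.001
for _ in range(50_000):                        # each pass = one "update" step
    err = (w * x + b) - y_true                 # prediction minus truth
    w -= learning_rate * 2 * np.mean(err * x)  # gradient of mean squared error
    b -= learning_rate * 2 * np.mean(err)

print(round(w, 1), round(b, 1))                # close to the true 4 and 20
```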
Now, in order to be able to do this training, you need some data, and you can feed this data directly with text-line readers or other readers. You can also use anything that is supported by NumPy, because these feed dictionaries can be NumPy arrays. So if you want to read your data with your favourite NumPy reader that you're already using, you can just do that, and that's it. Another thing you can do: you might want to fetch examples and feed data to the network at the same time, so that things happen in parallel. You don't want the training to stop while you fetch data, especially if fetching is expensive, because you fetch it from disk; you probably want to fetch things at the same time. You can do this in TensorFlow with queues, which asynchronously allow you to read in batches of data. This is very important, especially for big datasets. Now that you've trained your
model, you know that it does well on the training data, but how well does it do on unseen data? You need to evaluate; you need to be able to do inference. And this is actually very similar to what we've seen with the training. The training is on the left side of the screen, and you see that it's the same as we had before, but we have to do it a hundred times. This is something very important: the optimiser does not return you something that is already optimised; it returns an operation that you have to run as many times as you decide. But here, when we do the evaluation on the right side, you don't want to run the update any more, because you have your data and you just say: please give me the cost with just this data. By this point the training is done; the parameters W and b have already been changed during training, so you just want to find out the cost using the learned parameters. And now what I will actually do: I have an IPython notebook that goes through some of these, and we will run it together, so that we see, firstly, how short it is in terms of code, and how easy it is. So first of all, I will clear the output of my kernel, so that you know I'm not cheating, and let's see how this works.
Okay, so now you see the output is cleared, and I'm going to start running one thing at a time and describe what the code does. First, imports; no need to go through that. Secondly, I'm actually generating the data, because I'm not actually a professor, and I don't have students to torture about what grade they actually got and how much they studied for it. But I have a simple heuristic: if I were a professor, maybe if you don't study at all you should get around twenty percent, but afterwards it should just scale linearly. So that is my idea: you have a bias of twenty regardless of how much you study, and then it grows with four times the number of hours that you study. This cell here really just generates the data, and then plots it using matplotlib. So again, nicely, I can use my favourite plotting framework from Python that I'm already used to, because we are in Python. And this is what the data looks like; I added some normal noise, because otherwise everything would be on the line.
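A sketch of that kind of data generation in NumPy (the sample size, noise scale, and range of hours are my guesses, not the notebook's actual numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
hours = rng.uniform(0, 20, size=100)                  # hours studied per student
grades = 4 * hours + 20 + rng.normal(0, 3, size=100)  # bias 20, slope 4, noise
```

Fitting a line to this data should recover roughly slope 4 and intercept 20.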
Because we know the data and we know how I generated it, we know that in the end, when we learn W and b, I want my line to be as close to the blue line as possible, because that's what I cheated with. Usually, when you do learning, you don't know what your actual solution is, but in this case, just as a pedagogical example, we know that what we want to get in the end is the blue line. Now, getting to the interesting part: we have the data, and this is the part that defines the computational graph. This is exactly what we looked at in the slides. We have two placeholders, the inputs and the labels; we have the weights, which have to be initialised (I initialise them with a random normal, and the biases are initialised to zero); and the output, the y, is the matrix multiplication between the inputs and the weights, plus the biases. I define the cost exactly like in the slides: the distance between what we predict and what we know should be the truth; this is always what you do in supervised learning. I use this learning rate because I thought it was a good idea, so why not, and then I define the gradient descent optimiser to minimise this cost. So this is the graph definition part, and I need to run that. Then, as you expect, we have to initialise the session. For interactive notebooks there is a specific kind of session that is very helpful, called an interactive session. You usually don't use this outside of settings like this one: in the terminal, when you use the interpreter, or in this kind of notebook, it's just a bit more helpful. Once we have the session, I initialise the variables; this is something all TensorFlow programs have to do. I'm just going to run that.
And now let's look at what the variables look like without learning. I'm not doing any learning yet, right? I just defined the computational graph; I have not run the update operation. So the weights look random and the biases are zero, as I initialised them. And this is what the predictions look like: the biggest grade is twenty-one. That's not great; this is not what we want. I want my students to be able to pass, for example. Let's also actually plot this, to see that it's very far from the blue line, and of course the cost without training is huge. But that relates to the training, right? The training is two lines: it's a for loop in which you run the update operation a thousand times with the data that I generated, the data that contains the random noise to make the problem more difficult. And I just run the update operation a thousand times; I'm just going to execute that. So this is the part that does the training: we run the update operation. So now
we expect the cost to be smaller, we expect the weights and the biases to have been updated, and we expect to see that our predicted line is very close to the true line. So let's look at this: the cost is now way smaller. Look, this was the cost without training, and, as expected, the cost with training is much smaller. Now let's look at the values of the weights and the biases. Because I cheated and actually created the data, I know that the ideal value for the weight should be four, and for the bias, twenty. We're not quite there: we are at 4.1 and 17.9. Maybe if I used a different optimiser, or changed my learning rate, I would get something better, but that's not really the point here; the point is just to see how to use TensorFlow for this. Let's also look at how the lines look, plotted again using matplotlib. They're almost indistinguishable here, so we can see that we actually learned pretty well what we wanted to learn. And this is the entire program; it actually fits in one screen. This does the entire training, from defining the graph to initialising the session, and the training is just running the update operation, in two lines. So this is just to exemplify a complete TensorFlow program that runs something very simple; in a bit we will actually look at how to do this with neural networks too. And, just for completion and
for fun, I wanted to show you the closed-form solution, because, you know, for linear regression you don't actually have to do it iteratively; I did it like this here to show the conceptual workflow. I'm going to use SciPy to see what the values are if we use the closed-form solution, and remember, what we want to see is four and twenty. With SciPy we do a little bit better than my iterative algorithm: we are at four and 19.4, so pretty close to the four and twenty that we started from. That doesn't mean the iterative version cannot be improved; this is just to show that this is also pretty simple, and that you shouldn't really do linear regression iteratively like this in practice.
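The closed-form least-squares fit can be sketched with NumPy as well (SciPy's `scipy.linalg.lstsq` is nearly identical; the data generation here is the same hypothetical setup as before, not the notebook's actual data):

```python
import numpy as np

rng = np.random.default_rng(0)
hours = rng.uniform(0, 20, 100)
grades = 4 * hours + 20 + rng.normal(0, 3, 100)

# Design matrix [hours, 1]; least squares solves for [slope, intercept].
A = np.column_stack([hours, np.ones_like(hours)])
(slope, intercept), *_ = np.linalg.lstsq(A, grades, rcond=None)
print(round(slope, 1), round(intercept, 1))  # roughly 4 and 20
```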
Let's go back to the slides. Now I want to discuss a little bit some other concepts that are out there, which I think are very important for anyone trying to figure out whether they should use TensorFlow or not. So, first of all, you want to be able to use the GPU, and this is a one-line change. In order to use the GPU, you tell TensorFlow: please run this part of the graph on the GPU. And if you have multiple GPUs, you can say, this one is GPU 2. Obviously you can make this a flag in your Python program, and you can seamlessly switch from the CPU to the GPU; we don't have to change anything else apart from this line. This is exactly the example we had before, the exact same code that defines the variables and the placeholder (only, this time, when I do session.run, I actually forgot to specify x here, but it's the same computation); this will actually run on the GPU, with only this extra line. And something worth mentioning: if you have set up TensorFlow with the GPU, it will actually use it by default. So you don't even have to worry about this: if you have a GPU on your machine and you have set TensorFlow up with the GPU, then this will just work. But how about if you have
multiple GPUs? Is that much more difficult? Well, not really. That's the nice thing about TensorFlow: it makes it very easy for you to run this computation on two GPUs. If you have multiple GPUs, you can say: I'm going to loop over the GPUs that I have, and I'm going to place different parts of the computational graph on different devices. So here, let's assume again that the computation I want to do is A times B, twice, and then add the results together. I can say, for these devices, compute this part of the graph: I compute each matrix multiplication on one of the GPUs, and then I can do the addition on the CPU. So this is also not very difficult, in case you want to switch like this between different GPU cards and the CPU.
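A sketch of that placement pattern with `tf.device`, in the session-style `tf.compat.v1` API. I pin everything to `/cpu:0` so the sketch runs on any machine; the talk's version would use `/gpu:0` and `/gpu:1` for the two products:

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[2.0, 0.0], [0.0, 2.0]])

# On a multi-GPU box these two blocks would use "/gpu:0" and "/gpu:1".
with tf.device("/cpu:0"):
    prod1 = tf.matmul(a, b)
with tf.device("/cpu:0"):
    prod2 = tf.matmul(a, b)
with tf.device("/cpu:0"):          # the addition goes on the CPU
    total = prod1 + prod2

with tf.compat.v1.Session() as sess:
    result = sess.run(total)
```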
Now, a bit about distributed training, which you might also be interested in using TensorFlow for. I mentioned model parallelism and data parallelism in the first talk; now let's look at what the actual support is, and at some concrete examples, to see how easy it actually is to set up. The terminology for distributed training, when you look at what you would have to deal with, is: a cluster, which, as expected, is a set of tasks that participate in the distributed execution of the graph (this is standard terminology), and servers, where you have one server per task, and each server has a master and a worker. The master creates the sessions, to be able to execute things on the graph, and the workers execute the operations in the graph. So this is the terminology you would have to use if you want to work with distributed training. Now, how do you tell TensorFlow, "this is my cluster"? It looks pretty much like this: these are the workers that I want to use, and these are the parameter servers. This is not ideal if you have many workers: if you have a cluster of one hundred, it can get tedious, and there is no better way to do this really supported right now, though we are looking into it. If you have a favourite tool for cluster management, just file an open-source request saying "I want to use this with TensorFlow", and we will definitely look into it. But for now, this is how you do it: you specify, these are the workers, and these are the parameter servers.
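The cluster definition is essentially a map from job names to lists of host:port addresses; this is the structure that `tf.train.ClusterSpec` is built from (the hostnames here are placeholders for illustration, not real machines):

```python
# Hypothetical addresses: two worker tasks and one parameter-server task.
cluster_spec = {
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    "ps": ["ps0.example.com:2222"],
}

# In TensorFlow this dictionary is what you would pass to
# tf.train.ClusterSpec(cluster_spec).
```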
Now, if you just want to share the graph across devices, this goes back to the model parallelism from the first talk. It's exactly like with the GPU cards: you say, this part should be here, this part should be here, and this part should be here, and then I initialise the session and we run the train operation. So it's very, very easy, but you have to decide here what part goes where, and this is where you have to take into account what I discussed in the first talk: is it worth it, what's the best way to do this for my particular model, and am I perhaps wasting so much bandwidth sending things over the network that it's not really worth it? This is the way to do it, but definitely take the trade-offs into account here. Now, if you want to do
data parallelism, if you want to have multiple replicas that process examples in parallel, you can do that, and it's very easily supported, both asynchronous and synchronous data parallelism. Here I have the asynchronous example. These slides are a bit denser than the other slides, but please bear with me; it's not that difficult for actually starting something on multiple machines. First, you have to create a cluster, as I showed before: you have to give the hosts for the parameter servers and for the workers. Then you have to create a server for that cluster, and a supervisor. The supervisor is a utility that allows you to work very well with this distributed training: it does a lot of nice things for you, like checkpointing your model and so on, so it's very much recommended that when you do distributed training, you work with the supervisor. So now that you have this set up, how do you actually create the model, and how do you do asynchronous data parallelism? In your with statement, you have to specify the replica device setter, and the cluster, and then you create the model. And this is just creating the computational graph, like I discussed before: it's whatever model you want, with this optimiser, and you minimise this loss. Once you have created this, TensorFlow knows what to do, and that you actually want data parallelism, and what to do here. And then, when you want to do the training, you actually do that with the supervisor's managed session; this is the different part. As expected with any distributed computation, you need someone to manage all of this, and that is the supervisor. It gives you a managed session, where you can tell it, run this training operation, and it does everything for you. So this is a more or less complete example (if you fill in this model part) of how to do data parallelism; it's not that difficult. And you can actually also do this for synchronous data parallelism: I think there's a function, if I remember correctly, something like a sync-replicas optimiser, that you can use, and that will do the training with synchronous data parallelism, without the potential issue that the gradients get updated by different workers and then it's no longer the same as the gradient descent that we're used to. Now let's move on to another really interesting part that TensorFlow provides, which is TensorBoard. We
are very much aware that most of the computation you do with TensorFlow is complex, specifically when you build neural networks; it can even be confusing at times. You want to be able to plot how your accuracy changes on the test set and on the training set; you want to be able to see how your weights are changing, how your biases are changing, or whatever other parameters you might have that change during training. For example, you might use a decaying learning rate, and you might want to see: where is my learning rate right now? For debugging, for optimisation, and for better understanding, we have included a set of tools that help you with this, and they are commonly called TensorBoard; there's a whole variety of tools here. And setting up TensorBoard, for seeing the summaries that you want to see, is actually very easy. A common goal of this talk is to show you that TensorFlow does a lot of things, does them well, and makes it easy to set them up. So
Also here with the summaries: in order to monitor in TensorBoard how your training is doing, you have to create a summary writer, for which you just specify where to save the logs, and then you say "this is a measure that I care about", for example "the learning rate at this time step has this value". Then, when you look at your TensorBoard instance, you are able to see how this value changes. So it's not too difficult, and of course you can combine multiple measures if you care about more than one; it's very flexible, you define what you care about. I'll show you a quick demo of how to use this.
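A sketch of the summary-writer setup just described, assuming the TF1-style API (the talk-era names were `tf.train.SummaryWriter` and `tf.scalar_summary`; the later `tf.summary` names are used here, and `logs/lr_demo` is a placeholder path):

```python
# Sketch: logging a scalar (a decaying learning rate) so that
# TensorBoard can plot it over time. TF1-style API.
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

global_step = tf.train.get_or_create_global_step()
learning_rate = tf.train.exponential_decay(
    0.3, global_step, decay_steps=100, decay_rate=0.96)

# "This is a measure that I care about."
lr_summary = tf.summary.scalar("learning_rate", learning_rate)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # The writer just needs to know where to save the logs.
    writer = tf.summary.FileWriter("logs/lr_demo", sess.graph)
    bump = tf.assign_add(global_step, 1)
    for _ in range(5):
        summary, step = sess.run([lr_summary, global_step])
        # "At evaluation step `step`, the learning rate was ..."
        writer.add_summary(summary, global_step=step)
        sess.run(bump)
    writer.close()
```

Pointing `tensorboard --logdir logs/lr_demo` at the log directory then shows the curve.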
No internet, sorry about that. It seems like it switched from the network that I want to be on to another network. What happened now? Sorry. Okay, I will have to save this for later, maybe for the other talk, because it seems like the internet is not working at the moment. I do actually have a live example of this in the code, but it's a pity that I can't just create a TensorBoard with you now. Please don't connect... okay, now we also know what kind of... okay, perfect, and we're back. So this is
how the tool actually looks. Usually you launch it on one of your machines, and then you are able to look at it in a browser. You can see here that there are multiple tabs. You can see your events: these are the things that you told TensorFlow you want to look at. For example, this is running an MNIST training, and you can see how the test accuracy is going up while the training accuracy is fluctuating a little bit; this axis is time going by. This is what you want to see during training; this is looking good. For your loss function, which in this case is the cross-entropy loss, you also want to see how it fluctuates during training; this one you want to go down. You can also plot other things: I was saying that you might want to plot the learning rate, and you can also look at dropout. As expected, dropout here is one for inference, but you can set it to whatever you want, for example zero point nine, for training. You can also look at how your biases are changing during training, and how your weights are changing. Here, for example, you can monitor norms, and if you have exploding weights, which can be a problem when training recurrent neural networks, you can see how this evolves during your training. This is very customisable: with the code that I showed, you can add whatever summary you want to see. You can also see some of the images that the network is looking at, these for test, these for training; these are MNIST digits. You can also look at the graph.
So this is very important if you want to look at your computational graph in detail. It also works at different levels: if you expand this graph it's pretty big, but what you probably care about initially is that you have an input, you have a layer, you have some dropout, then another layer, then the cross-entropy loss and the accuracy. Here, again, this relates to what I was saying before about fetching. You could say that you want to see two things, the cross-entropy and the accuracy, but because the accuracy is an extra node, when you ask only for the cross-entropy it won't compute the accuracy: TensorFlow looks at the graph and sees that the accuracy is not on the path it needs to compute to get the cross-entropy.
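This pruning can be observed empirically: in the sketch below, two side-effecting ops stand in for the cross-entropy and accuracy nodes, and fetching only one of them shows that the other never runs (TF1-style API; the counter is of course not part of a real model):

```python
# Sketch: only ops on the path to the fetched tensor execute.
# We use tf.py_func side effects to observe which ops ran.
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Record which ops actually execute.
ran = {"cross_entropy": 0, "accuracy": 0}

def tracked(name):
    def fn():
        ran[name] += 1
        return np.float32(0.0)
    return tf.py_func(fn, [], tf.float32)

cross_entropy = tracked("cross_entropy")
accuracy = tracked("accuracy")

with tf.Session() as sess:
    sess.run(cross_entropy)  # fetch only the loss

# Only the fetched op ran; the accuracy op was pruned away:
# ran == {"cross_entropy": 1, "accuracy": 0}
```
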
But you might also want to look at this high-level view: if you have very deep networks, this is a very nice high-level overview, and you might just want to look at that. If you want to look at each layer in particular, to see what it looks like, you can click here, and you see that this expands and shows what the layer is actually made up of: these are the weights, these are the biases, then, as you expect, the usual WX plus B operation, and then you apply the activation function, which in this case is the rectifier, and this is the output of the layer, which gets fed into the cross-entropy loss. You can look at each of these in detail, so you can look at the accuracy as well, and you can expand absolutely each of them to get an idea of what graph we have actually created.
And there's more: you might also want to see the activations of the neurons as training goes by. Maybe you want sparse activations; then you can monitor that, you can monitor the activations. You can also monitor the pre-activations, that is, the values before they go into the nonlinearity. Here again the axis is time: as in the other cases, you want to see how your summaries change over time. In the code snippet that I showed you, when you add the summary, you give it a name, so you create a measure that says "this is the decaying learning rate", and you also specify the time step: "when I recorded this measure at evaluation step number eight hundred, the learning rate was zero point three". And, as you expect, because this is a ReLU network, the activations are always above zero, but the pre-activations don't have to be. So you can double-check a little bit what your pre-activations look like, what your weights look like, and so on.
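A sketch of histogram summaries for weights, pre-activations and ReLU activations, assuming the TF1-style API (`logs/hist_demo` is a placeholder path):

```python
# Sketch: histogram summaries so TensorBoard can show the
# distributions of weights and activations over training time.
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

x = tf.random.normal([32, 10])
w = tf.Variable(tf.random.normal([10, 5]), name="w")
b = tf.Variable(tf.zeros([5]), name="b")

pre_activation = tf.matmul(x, w) + b          # can be negative
activation = tf.nn.relu(pre_activation)       # always >= 0

tf.summary.histogram("weights", w)
tf.summary.histogram("pre_activations", pre_activation)
tf.summary.histogram("activations", activation)
merged = tf.summary.merge_all()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    writer = tf.summary.FileWriter("logs/hist_demo")
    for step in range(3):
        writer.add_summary(sess.run(merged), global_step=step)
    writer.close()
```
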
This is very, very useful, firstly when you are getting started with deep learning, but not only then: when you want to actually understand what's happening with your training, and to make this flexible black box less of a black box, you can actually see what's happening inside. And this is it for this talk; I will take questions now. Let me see.
Very nice. You said you can take NumPy arrays directly for inputting data into your model, into the graph. Does TensorFlow automatically handle the transfer between CPU memory and the GPU, or do we need to do something special?

No, it does that for you. As I said, if you want to run things on the GPU, you just have to specify "run this on the GPU", and then it will do that for you, so you don't have to do these transfers yourself. When I say it's more work to change from the GPU to the CPU, I mean only if you actually want to change where it runs.
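A sketch of this, assuming the TF1-style API: you pin the ops to a device with `tf.device`, feed a NumPy array, and TensorFlow handles the host-to-device copies itself (soft placement lets the same code run on a CPU-only machine):

```python
# Sketch: explicit device placement. TensorFlow copies data
# between host and GPU memory for you; you only pin ops.
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

with tf.device("/gpu:0"):  # "run this on the GPU"
    a = tf.placeholder(tf.float32, shape=[2, 2])
    b = tf.matmul(a, a)

# allow_soft_placement falls back to CPU if no GPU is present.
config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    # Feeding a NumPy array; the host->device transfer is implicit.
    result = sess.run(b, feed_dict={a: np.eye(2, dtype=np.float32)})
```
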
Hi, I was wondering: TensorBoard is great when you want to see how your implementation does, but what about compilation of the graphs? For example, in Theano I found the compilation errors to be rather uninformative, and I was wondering how TensorFlow does that.

Yeah, because you have this compilation step, I agree that sometimes errors can be confusing, but I think TensorFlow does a relatively good job at trying to explain where the problem is. So I think, yes, it's not trivial, but it's relatively good.
Next question: I was at your first talk, and I have a question about choosing loss functions. Is there any package for them? Because I need, for example, the logistic loss. Does TensorFlow support it, or do you just search for it?

So the losses I used in the neural network examples today are just supported: you just search TensorFlow for the specific loss that you want, and it's probably implemented; if it's not, you can ask Google for it to be implemented. These losses are already there in TensorFlow.

Right, so there's no separate package?

Not that I know of; maybe there is. You might just have to change an import, from importing one module to importing another one that contains the loss, but they're definitely there.
And what about regularisation? Is there an L2 norm and others as well?

Yes, there is.
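As an illustration of both answers, a sketch combining a built-in loss with L2 regularisation, assuming the TF1-style API (`tf.nn.softmax_cross_entropy_with_logits` and `tf.nn.l2_loss`); the `0.01` weight is an arbitrary choice:

```python
# Sketch: a built-in loss plus L2 regularisation (TF1-style API).
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

logits = tf.constant([[2.0, 0.5], [0.1, 3.0]])
labels = tf.constant([[1.0, 0.0], [0.0, 1.0]])
w = tf.constant([[0.5, -0.5]])

data_loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=labels,
                                            logits=logits))
# l2_loss computes sum(w**2) / 2; 0.01 is an arbitrary weight.
reg_loss = 0.01 * tf.nn.l2_loss(w)
total_loss = data_loss + reg_loss

with tf.Session() as sess:
    total = sess.run(total_loss)
```
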
TensorBoard, is it extensible? I mean, if I want to create my own visualisation, can I do that? And what is it programmed in? It wasn't clear to me; is it Python as well?

So, at what level do you want to extend it? If you want to extend what you see on the board itself, you can definitely do that from your own Python code. If you want to extend TensorBoard itself, you'd probably have to look into it; I'm not exactly sure. I think you can, but if you can't, you can just ask for it, or send a pull request to get it integrated. Because the code is there, the answer is you definitely can; the question is how much effort it is to do it.

So you mean there's no specific support for your own visualisations?

The answer is: I don't know whether this is the correct answer, but I'm not aware of any.
But from your experience, how much did you need to go into C++ and develop custom functions and custom code in C++, and how much was TensorFlow able to deliver before that, in your case?

I will talk about extending TensorFlow, if you want to add your own operations, in the next talk. But so far I've never needed it. So from my stance: I never needed it; if I wanted to do something in Python, I've done it in Python. I've never had this issue of "oh, I can't do this in Python, I have to go to C++".
So, is there any overhead attached to the TensorBoard... Sorry, I could not hear that. Is there computational overhead attached to that monitoring? Well, a little bit, but it depends on what you're adding. If you're adding something that's very expensive to compute, if you want to log something that's very expensive to compute, then yes, there's overhead, because you have to compute that. But if you want to log something very simple, such as how my learning rate decays, then no, not really.
In the distributed TensorFlow setting, are there any... Sorry, I can't hear you. In the distributed TensorFlow setting, are there any new features for fault tolerance, dealing with the failure of a worker, restarting a worker? I understand there is checkpointing, and we can restart from a particular serialised version of the model, but can it do some supervision, or fancy things like speculative execution, when a particular job in a synchronous model runs very slowly on one particular worker, something like that?

I think right now the support is for checkpointing and restoring from that, but I'm not aware of anything fancier.

So basically, if one of my workers dies, I have to wait, with however many executors, workers, are still left in the cluster?

I think it will restart it for you, if you still have workers available, and if the checkpoint is still available. It depends: if your hard drive fails, or wherever you are saving the checkpoint, then it probably has to continue without it. But overall, I think that if it can restore the model from the checkpoints, it will probably spin up another instance in your available cluster to deal with that.
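The checkpoint/restore mechanism being discussed can be sketched with `tf.train.Saver`, assuming the TF1-style API; the second session plays the role of a restarted worker (`ckpt/model` is a placeholder path):

```python
# Sketch: checkpointing and restoring model state with
# tf.train.Saver (TF1-style API).
import os
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

w = tf.Variable(0.0, name="w")
bump = tf.assign_add(w, 1.0)
saver = tf.train.Saver()

os.makedirs("ckpt", exist_ok=True)  # Saver needs the parent dir
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(3):
        sess.run(bump)
    saver.save(sess, "ckpt/model")  # serialise the model state

# Simulate a restarted worker: a fresh session restores the state.
with tf.Session() as sess:
    saver.restore(sess, "ckpt/model")
    restored = sess.run(w)

print(restored)  # 3.0
```
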
Can we use TensorFlow with other languages, for example R?

No. Right now the support is for Python and C++. You can add other front ends, it's not too difficult, but there's no support at the moment for languages like R.
Maybe one question related to this, or maybe we'll talk about this in your next talk: what about loading models that were exported from other frameworks, like Theano or Torch?

I don't think there's anything official, but some people have worked on this. Again, this is the nice thing about being an open-source community: if you have these questions, other people have had them as well, so they're looking into this and there are some solutions. The question would also be whether it's doable at all, and given that people have done it: yes, people have done it. Any other questions? No? Okay.

Conference program

Deep Supervised Learning of Representations
Yoshua Bengio, University of Montreal, Canada
4 July 2016 · 2:01 p.m.
Hardware & software update from NVIDIA, Enabling Deep Learning
Alison B Lowndes, NVIDIA
4 July 2016 · 3:20 p.m.
Day 1 - Questions and Answers
Panel
4 July 2016 · 4:16 p.m.
Torch 1
Soumith Chintala, Facebook
5 July 2016 · 10:02 a.m.
Torch 2
Soumith Chintala, Facebook
5 July 2016 · 11:21 a.m.
Deep Generative Models
Yoshua Bengio, University of Montreal, Canada
5 July 2016 · 1:59 p.m.
Torch 3
Soumith Chintala, Facebook
5 July 2016 · 3:28 p.m.
Day 2 - Questions and Answers
Panel
5 July 2016 · 4:21 p.m.
TensorFlow 1
Mihaela Rosca, Google
6 July 2016 · 10 a.m.
TensorFlow 2
Mihaela Rosca, Google
6 July 2016 · 11:19 a.m.
