TensorFlow 1

Transcriptions

Note: this content has been automatically generated.

00:00:00

Uh oh okay so we will start for this

00:00:12

served on the last day. So yesterday

00:00:15

was George onto based on soft also for

00:00:19

for I I don't know if you are if you

00:00:21

are very obvious a situation regarding

00:00:24

framework so it also frees Mcgill I'm

00:00:28

working for depending on on the on the

00:00:32

same as assume you say that it's it's

00:00:35

interesting because it does some

00:00:36

strains yes that does does not have on

00:00:38

remote to some strengths and

00:00:40

weaknesses. So to present to the cease

00:00:43

to do we have a right up on the usually

00:00:46

you know I mean I don't what's gone on

00:00:48

the so she will tell you If you can ask

00:00:50

questions on the oh it's Great so far

00:00:56

as a technical test: can everybody hear

00:00:58

me and can everybody hear me well okay

00:01:02

perfect. So if at any point there's a

00:01:04

technical difficulty please let me know

00:01:06

because I I can't hear what you hear

00:01:09

and then let's get started so about

00:01:12

questions feel free to interrupt me at

00:01:14

any time if there's a question that

00:01:16

you think is very relevant to the

00:01:18

slide. And we will have time for

00:01:21

questions at the end of each talk and

00:01:22

there's also the panel at the end of

00:01:24

the day but it's really important that

00:01:26

you get your questions answered because

00:01:27

that's why we're here. So yeah but part

00:01:36

of my microphone just filled out Phil

00:01:41

works. Okay let's just let's just

00:01:45

continue like this. So, a little bit

00:01:47

about me: I'm a software engineer in

00:01:49

Google Research in this area, and

00:01:51

I've been at Google for around two years

00:01:55

now, after I graduated from Imperial

00:01:56

College London. And today I wanna talk

00:02:00

to you a bit about TensorFlow, and

00:02:02

specifically about the trade off yeah

00:02:14

test test test yes okay perfect. So

00:02:34

I'm gonna talk specifically about

00:02:37

TensorFlow, which is the deep learning

00:02:39

framework built by Google and the talks

00:02:43

are gonna be structured a little bit

00:02:46

differently so the first talk goes

00:02:48

about the core principles behind

00:02:50

TensorFlow, and specifically what we want

00:02:52

from a deep learning framework and how

00:02:54

TensorFlow actually meets a lot of

00:02:55

these requirements. Then in the

00:02:58

second talk we're actually gonna go through

00:03:00

a concrete example of how to use

00:03:02

TensorFlow for something relatively simple,

00:03:05

linear regression, but we're also gonna

00:03:06

look at some really nice things that

00:03:09

TensorFlow gives you, such as distributed

00:03:12

training, how to use the GPU, and how

00:03:14

to use some of the nice visualisation

00:03:16

tools that we have. And then in the third

00:03:18

talk we're gonna focus specifically

00:03:20

on deep learning and neural networks,

00:03:22

and state of the art models and

00:03:24

community contributions and so on. So

00:03:28

firstly, what is TensorFlow? TensorFlow

00:03:31

is software for

00:03:33

general machine learning, but it's great

00:03:35

for deep learning in particular. And

00:03:38

it's open source on GitHub, you can

00:03:40

check it out. It was first released

00:03:42

in November two thousand fifteen with a

00:03:44

very flexible license, so you can

00:03:46

use it as you please. Now I'm gonna show

00:03:51

the official video first, so that

00:03:55

you get a very high-level overview

00:03:57

of what's one drummers on international

00:04:01

remote each For deepening over the last

00:04:07

few years. It was initially a research

00:04:09

project we've since collaborated with

00:04:11

about fifty different teams to put

00:04:13

these systems in real products across a

00:04:15

really wide spectrum of work today it's

00:04:18

used heavily in our speech recognition

00:04:20

systems, in the new Photos product, Gmail,

00:04:24

or if you need to come that experience

00:04:27

in the that entrance. So TensorFlow

00:04:29

is a machine learning library that you

00:04:31

stick a school for the blind implying

00:04:34

to a lot of different is doing both

00:04:36

artificial intelligence research and

00:04:38

deploying production models. They're

00:04:40

really powerful at doing various kinds

00:04:43

of perceptual and language

00:04:44

understanding. These models are able to

00:04:47

actually make it so computers actually

00:04:49

see actually able to understand what is

00:04:52

in an image when you're looking at what

00:04:53

is in a short video clip in that

00:04:55

enables all kinds of powerful product features.

00:04:58

Machine learning is the secret sauce for the

00:04:59

products of tomorrow. It no longer makes

00:05:01

sense to have separate tools for researchers in

00:05:03

machine learning and people who are

00:05:04

developing real products. There should

00:05:06

really be one set of tools that

00:05:07

researchers can use to try out the

00:05:09

crazy ideas and if those ideas work

00:05:12

they can move them directly into

00:05:13

products without having to rewrite. On

00:05:15

the research side a list you then you

00:05:18

understanding to existing problems

00:05:20

advance acidity are existing problems

00:05:22

understand the problems where it's for

00:05:26

an engineering side because it it is

00:05:28

insights from the research. And use

00:05:30

them to individual products in front of

00:05:33

features they wanted us for quite

00:05:36

of TensorFlow is to allow

00:05:37

collaboration and communication between

00:05:39

researchers. It allows a researcher in

00:05:41

one location to develop an idea next

00:05:43

alright and then just send code that

00:05:45

someone else can use on the other side

00:05:46

of the document a lot easier for yeah

00:05:51

I'm not going to have this as an open

00:05:55

source to really hopes that instruments

00:05:58

that effort up. So they expect

00:06:00

developers to be able to do a lot more

00:06:02

than they can do today we think we have

00:06:04

the bass machine learning of

00:06:05

destruction and and mediocrity share.

00:06:08

And that's we wanted oh oh so I guess

00:06:21

that gives a very high-level overview

00:06:23

of what the aim of TensorFlow is, and

00:06:26

it already touches upon a lot of the

00:06:28

discussion yesterday in the panel about

00:06:30

what frameworks are good for research,

00:06:31

what frameworks are good for

00:06:33

development, and so on. And what I really

00:06:36

want to stress, and we will look at this

00:06:37

in more detail in this talk, is that

00:06:39

TensorFlow aims to be a tool for everyone.

00:06:42

So the aim is to bridge this gap

00:06:44

between researchers and developers and

00:06:47

data scientists and so on. So this

00:06:50

talk focuses on two things. First, why

00:06:53

does Google care about machine learning,

00:06:54

you might wonder. Then the second thing

00:06:57

is what makes TensorFlow a good machine

00:06:58

learning framework, and we're gonna go

00:07:00

through this, and the second

00:07:02

one especially, in detail. So firstly,

00:07:05

why does Google care about machine

00:07:07

learning and specifically deep

00:07:09

learning? Well, deep learning has this

00:07:11

really nice promise of universal

00:07:14

machine learning right. So the idea is

00:07:16

that you can use a similar set of

00:07:18

algorithms to do speech recognition

00:07:20

query understanding, text to

00:07:24

speech. And whatever else you might

00:07:26

want to do, and you don't have to deal

00:07:28

with feature selection yourself. So that

00:07:30

is a very nice promise. And the

00:07:33

advantage of deep learning is that

00:07:34

apart from giving this promise, it

00:07:36

actually works. Because giving the

00:07:37

promise, that's a very nice promise, but

00:07:40

if it doesn't work better than the

00:07:41

alternatives then it wouldn't be that

00:07:44

that useful. So the nice thing about

00:07:46

deep learning is that it's currently state

00:07:48

of the art in speech recognition, image

00:07:50

recognition, machine translation, and a

00:07:52

lot of other applications. At Google

00:07:56

we've seen a very big growth in the use

00:07:59

of deep learning from very little in

00:08:03

the beginning of two thousand and

00:08:04

twelve to more than two thousand

00:08:08

directories containing a model

00:08:10

description file in the Google source

00:08:14

code repository so that's a lot of code

00:08:17

and a lot of models. Where are they used?

00:08:20

Well, in a lot of products, it turns out.

00:08:23

so I hope you can see some of your

00:08:24

favourite products here I can so for

00:08:28

all those: Google Keyboard, Inbox, Gmail,

00:08:30

Drive, YouTube. All these products now are

00:08:34

better because of machine learning and

00:08:39

you also probably know about AlphaGo,

00:08:41

which achieved this

00:08:43

breakthrough in AI that was thought to

00:08:45

not be possible at the moment, but just

00:08:47

in a couple of years. And this is

00:08:50

another showcase of how Google is

00:08:53

interested to push the field forward

00:08:56

Now, going to products: some of the

00:09:00

products that I mentioned before in the

00:09:01

product slide, I'm gonna go a bit

00:09:03

into detail into some of them. So you

00:09:06

might know Inbox, that's an email client

00:09:08

provided by Google. And in November two

00:09:10

thousand fifteen it launched this

00:09:12

feature called smart reply. So the idea

00:09:16

is very simple: when you send me an

00:09:18

email saying, do you wanna go for dinner

00:09:21

tomorrow, I have a couple of very

00:09:24

likely answers that I might give: sure,

00:09:27

why not; or how about today; or how

00:09:31

about lunch instead. And you can see how

00:09:34

machine learning is a really good fit

00:09:35

for this task because from a lot of

00:09:37

examples you can kind of learn what the

00:09:39

possible answers are to an incoming

00:09:41

email. And Inbox's smart reply

00:09:44

feature has been very well

00:09:46

received. Well, it was initially launched

00:09:48

as an April Fools' joke. But it also has

00:09:51

matured a lot, because in February two

00:09:54

thousand sixteen and restored up more

00:09:56

than ten percent of mobile Inbox

00:09:58

replies used smart reply, which

00:10:01

makes sense, because if I'm on the go

00:10:03

trying to catch the tram I don't want

00:10:05

to start typing I just press a button

00:10:07

and it does that for me. So that's

00:10:09

really great now another product that

00:10:13

makes a lot of sense to use machine

00:10:16

learning is Google Play Music. So depending

00:10:19

on your listening history, and given

00:10:23

certain types of music that you like, it

00:10:26

can recommend you playlists, it can

00:10:29

recommend you other channels that are

00:10:31

similar, for example, and so on. One of

00:10:35

the new additions, or relatively

00:10:39

new additions, is the ability to

00:10:41

query your photos, to query

00:10:45

photos according to some string in Google

00:10:48

Photos. So if you're like me, then

00:10:50

probably you take literally thousands

00:10:51

of photos, and when you come home and you

00:10:53

wanna show your parents, hey, actually

00:10:55

there were these really nice cherry blossoms, it

00:10:57

would take you hours to

00:10:59

scroll through that, right? Well, with this you

00:11:01

can just say, hey, these are the cherry

00:11:03

blossom photos, because I just queried them

00:11:06

in Google Photos. So that also

00:11:07

saves you a lot of time. And if you're

00:11:11

travelling you might be travelling to a

00:11:13

country where you don't really know the

00:11:15

language more descriptive and of the

00:11:19

inhabitants of the country. So you

00:11:21

might want to use this translate

00:11:24

feature which allows you to take a

00:11:25

picture of a particular sign, for

00:11:28

example, and translate it. So this combines

00:11:31

computer vision and translation to

00:11:34

provide a better user experience. Now,

00:11:38

this was a bit about why Google cares about

00:11:40

machine learning and how our products

00:11:42

become better due to it. And now let's

00:11:44

talk a bit about TensorFlow, which is

00:11:45

the engine that powers a lot of these

00:11:47

features. So firstly, why build

00:11:51

TensorFlow in the first place? Google

00:11:53

had another deep learning system,

00:11:55

it was called DistBelief, and it was

00:11:57

really great for scalability and

00:12:00

production training, but it was not as

00:12:02

flexible as researchers would have

00:12:05

wanted. So it was, again, a bit of this

00:12:08

trade-off between research and

00:12:09

production kind of things. And having

00:12:12

this understanding of already trying

00:12:14

something the first time really allowed

00:12:16

us to simplify the problem and to learn

00:12:19

from previous mistakes. And in order to

00:12:24

realise what we want from a machine

00:12:26

learning system and from a deep

00:12:27

learning system we have to think about

00:12:29

who uses the machine learning system, right?

00:12:31

different use cases have different

00:12:33

requirements. So as you probably can

00:12:36

imagine, researchers, developers and

00:12:39

data scientists want to use such a

00:12:41

framework. And they all have different

00:12:45

goals in mind. So researchers want to

00:12:48

quickly iterate: they want to

00:12:50

specify their new crazy idea and be

00:12:53

able to see if it works or not, on

00:12:55

MNIST or something much bigger

00:12:57

such as ImageNet. Developers want to take

00:13:00

these ideas and quickly put them into

00:13:02

products without having to wait for one

00:13:04

year without having to port some code

00:13:06

written in a research system. And

00:13:09

data scientists, they just want to

00:13:10

tweak these ideas that

00:13:13

researchers have on their own data sets

00:13:16

to get a maximum performance. So with

00:13:20

this in mind. This is what we think

00:13:22

that one would want from the research

00:13:24

system. So firstly, ease of

00:13:26

expression for a lot of crazy ideas

00:13:29

that you might have. Scalability: you

00:13:31

want to be able to run your experiments

00:13:33

pretty quickly. Portability: this is

00:13:36

especially important for developers, you

00:13:37

want to be able to run on a variety of

00:13:39

platforms quite easily.

00:13:42

Reproducibility, so that researchers

00:13:44

around the world can collaborate, they

00:13:46

can share code, they can share models.

00:13:49

And production readiness, so again

00:13:51

the idea of going really quickly

00:13:53

from research to real products. And I'm

00:13:55

gonna iterate through all of these I'm

00:13:57

gonna take them one at a time. And

00:13:59

actually show you how TensorFlow

00:14:02

meets each of these. So let's

00:14:06

start with ease of expression. So

00:14:08

the architecture behind TensorFlow: it

00:14:11

is very flexible. And the core idea is

00:14:14

the computational graph,

00:14:16

something that is similar to other

00:14:19

frameworks. And the idea is that

00:14:21

you always specify your computation as

00:14:24

a directed acyclic graph, and you then

00:14:26

add optimisation on top of that. And the

00:14:31

general procedure when you work with

00:14:32

TensorFlow is to define a graph in

00:14:34

a high-level language, so you don't

00:14:36

want to deal necessarily with memory

00:14:38

management and so on, you just want

00:14:39

to be able to say, hey, this is how

00:14:42

my model should look like. The

00:14:44

graph is then compiled and optimised.
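
A minimal sketch of this define-then-run procedure, written in plain Python rather than with the TensorFlow API. The names `Node`, `constant`, `add`, `mul` and `run` are hypothetical and only illustrate the idea: the computation is first described as a directed acyclic graph, and only afterwards is it executed.

```python
class Node:
    """One operation in the graph: a function plus the nodes it reads from."""

    def __init__(self, op, inputs=(), name=""):
        self.op = op
        self.inputs = tuple(inputs)
        self.name = name


def constant(value, name=""):
    # A node with no inputs that always yields the same value.
    return Node(lambda: value, name=name)


def add(a, b, name=""):
    return Node(lambda x, y: x + y, (a, b), name)


def mul(a, b, name=""):
    return Node(lambda x, y: x * y, (a, b), name)


def run(target):
    """Evaluate `target`, computing each dependency once before its consumers."""
    cache = {}

    def visit(node):
        if node not in cache:
            args = [visit(dep) for dep in node.inputs]
            cache[node] = node.op(*args)
        return cache[node]

    return visit(target)


# Step 1: describe y = w * x + b. Nothing is computed at this point.
w = constant(3.0, "w")
x = constant(2.0, "x")
b = constant(1.0, "b")
y = add(mul(w, x), b, "y")

# Step 2: execute the graph.
result = run(y)
print(result)  # 7.0
```

Because the whole graph exists before anything runs, a real system has room to optimise it and to decide which device executes which part; this toy version simply evaluates everything in one process.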

00:14:46

And then it is executed, in parts (you

00:14:49

might not want to execute the entire graph) or

00:14:51

fully, on the available devices,

00:14:54

which might be CPU, GPU, or whatever other

00:14:57

device you might want to use. So the

00:15:00

core of TensorFlow, the core execution

00:15:03

system, is in C++, and this

00:15:05

allows it to be very efficient and

00:15:08

very good in terms of speed. But

00:15:13

you have different front ends

00:15:15

in terms of how you want to

00:15:17

specify the computation, so you can

00:15:19

specify your computational graph in

00:15:21

Python or C++ today, but if you

00:15:23

really like Java you can actually add

00:15:25

another front end easily. Another

00:15:32

important point when talking about

00:15:33

ease of expression is interfaces.

00:15:35

Again, different people want

00:15:37

different kinds of interfaces. If I'm a

00:15:39

researcher I want to be able to specify

00:15:42

my models down to the matrix

00:15:43

multiplication level, right? I want to be

00:15:45

able to say, this tensor gets multiplied

00:15:48

together with this tensor, and I

00:15:49

apply this operation on them. But

00:15:52

you might also want to be able to use

00:15:54

higher-level APIs: you don't want to

00:15:56

go to the matrix multiplication level

00:15:58

for a CNN or for a deep neural network,

00:16:01

because you might reuse the same code

00:16:03

again and again, and so on.

00:16:05

TensorFlow allows you to go both ways,

00:16:07

depending on your use case,

00:16:08

which is very useful. Now, about

00:16:13

scalability. So you

00:16:17

probably are already aware of this:

00:16:19

if you run experiments and they

00:16:20

take a lot of time, this can

00:16:23

easily become cumbersome. So if an experiment

00:16:25

takes a couple of minutes or hours, this

00:16:26

is pretty great, right? I can start my

00:16:28

experiment, I can easily see, I have

00:16:30

this feedback loop: my idea is

00:16:32

good, I'm gonna pursue this; or my idea

00:16:34

doesn't really work; or I have a bug and

00:16:36

I want to debug it a little bit.

00:16:39

And this is great for research, right?

00:16:41

You get this good feedback loop. If

00:16:44

experiments take a couple of days,

00:16:47

that's tolerable. At this point you

00:16:49

probably already start trying multiple

00:16:51

ideas in parallel and trying to see how

00:16:54

each of them works out, because you have

00:16:55

to wait a couple of days. Now if you go

00:16:58

to weeks, then you can see how

00:17:01

progress stalls, right? You can only

00:17:03

try your best ideas out. Now if things

00:17:08

take more than a month, probably

00:17:09

it's not really worth trying. So how

00:17:13

does TensorFlow allow you to run

00:17:15

experiments quickly? Well, you can use

00:17:17

GPUs, you can use multiple cores and

00:17:19

multiple GPU cards, and you can

00:17:21

also distribute training on multiple

00:17:23

machines. So if you have a cluster in your lab,

00:17:26

you can use that to

00:17:28

decrease your experiment time. Now,

00:17:32

when you want to distribute computation

00:17:35

you always have to take into account

00:17:36

communication overhead, right? So if I

00:17:40

distribute computation on two machines but

00:17:42

I only get a ten percent speed

00:17:43

improvement, that's not really great,

00:17:45

because I'm using a lot of

00:17:46

computational power for very little

00:17:49

gain. And in TensorFlow there are

00:17:52

two solutions in particular that I'm

00:17:54

gonna name here that are used to avoid

00:17:57

this communication overhead. And the

00:17:59

first one is to exploit model

00:18:01

parallelism. So especially for

00:18:04

deep learning there are a couple of models that are

00:18:06

pretty good for that. And the second is

00:18:08

to exploit data parallelism, because our

00:18:11

training sets can be split: you can

00:18:14

split them into parts and use them at the same

00:18:16

time. So let's look at each of these. So

00:18:20

for model parallelism, how do you do that?

00:18:24

Well, you can use instruction

00:18:27

parallelism on a single core; this is

00:18:29

pretty good, it's pretty much free. When

00:18:31

you do this across cores you

00:18:33

have to use thread parallelism,

00:18:35

which is almost free unless you have to

00:18:38

go across sockets. And across devices,

00:18:41

so if you go in between multiple

00:18:44

GPUs, you are often limited by PCIe

00:18:47

bandwidth, and across machines you are

00:18:50

very often limited by network

00:18:52

bandwidth or latency. And with this in mind

00:18:58

let's look at how model parallelism

00:19:00

actually works for a network like a

00:19:01

convolutional neural network. So the idea

00:19:04

behind convolutional neural networks is

00:19:06

that you have this image that goes through

00:19:08

layers: this is the input layer,

00:19:12

layer one, layer two, and then you get

00:19:13

this final representation of the

00:19:15

image at the end. And you have these

00:19:17

kernels, also called local receptive

00:19:20

fields, that get applied to each part of

00:19:23

the image, each patch, so you can see them

00:19:24

moving around like this. And the way you

00:19:29

can split this model onto multiple

00:19:33

machines or multiple cores is by

00:19:35

partitioning it so that parts of each layer that

00:19:39

are gonna communicate a lot with each other

00:19:41

are on the same machine. So you want

00:19:43

to avoid putting part of the model on

00:19:45

one machine and another part of the model on

00:19:47

another machine if these two parts have

00:19:48

to communicate all the time, because we

00:19:50

end up with this overhead. But in this

00:19:52

case, if we do it like this, we

00:19:55

minimise the network traffic, because

00:19:58

when we compute the values of the

00:20:00

neurons at this layer we more or less

00:20:03

always look at the ones in the same

00:20:05

partition, on the same machine, apart

00:20:07

from these ones at the boundaries. So,

00:20:09

because of having a

00:20:11

convolutional network, you can't

00:20:13

completely avoid it, but you can

00:20:14

minimise it. Now, data parallelism. So the

00:20:22

difference is that we use are also

00:20:24

getting might be and usually we use

00:20:27

batches anyway. And the idea behind data

00:20:29

parallelism is: how about we

00:20:31

train on multiple batches at the same time,

00:20:34

or have examples seen by different model

00:20:36

replicas at the same time? So the

00:20:38

idea here is that I don't have one

00:20:40

model, I have multiple model replicas

00:20:44

that copy the parameters, and they

00:20:46

each do their specific computation and

00:20:49

then they tell a parameter server, which

00:20:50

keeps kind of the gold standard for

00:20:54

what the parameter values should be, how

00:20:57

to update the parameters. So this is

00:21:00

kind of how it looks like. So as I said,

00:21:02

you have multiple model replicas,

00:21:04

you have the data that goes

00:21:06

here in parallel, each replica sees

00:21:08

different examples, and then when the

00:21:11

model replica has

00:21:14

finished computing its update, so for example in

00:21:16

the case of neural networks the gradient

00:21:19

of the loss function, it sends this

00:21:21

update to the parameter server and says,

00:21:23

okay, please update the parameters. The parameter

00:21:25

server updates the parameters, and so do all

00:21:30

the other replicas. Now when you

00:21:34

think about this picture, you have

00:21:37

to understand that there are two ways

00:21:38

to do these updates, right? This replica

00:21:41

tells the parameter server to update

00:21:43

the parameters, this replica tells the

00:21:46

parameter server to update the

00:21:47

parameters, and so on. But you can also

00:21:50

combine these updates into one single

00:21:53

update, right? If we want to be as close

00:21:55

to the original algorithm as possible,

00:21:58

we want to do this update

00:22:01

synchronously: so we wait for all

00:22:03

replicas to finish, we combine the

00:22:05

updates together, and we apply them only

00:22:07

once to the parameter server. So this

00:22:10

is how it looks like: this model replica has

00:22:13

computed its update, this model replica

00:22:16

has computed its update, this model

00:22:17

replica has computed its update, they

00:22:19

get combined together, and then only

00:22:22

one update gets sent to the

00:22:24

parameter server. So this is actually

00:22:27

equivalent to having an N times

00:22:29

larger batch size. So the computation

00:22:31

that you do, the training that you

00:22:33

do, is exactly as if you have one

00:22:35

model with an N times larger batch size.

00:22:38

The pro is that you have no gradient

00:22:40

staleness, so the model replicas are not

00:22:43

operating on stale

00:22:45

gradients. But the con of this

00:22:48

approach is that if one machine fails

00:22:50

then you have to recover, and the

00:22:52

other machines have to wait for

00:22:54

this one to recover. You can also do

00:22:57

asynchronous updates. So here the

00:23:00

difference is this part: each model

00:23:02

replica can update the parameter

00:23:05

server. But as you can imagine, here

00:23:08

the problem is that one model replica

00:23:11

might send an update with respect to

00:23:13

some parameters that are no longer

00:23:14

there, because some other replica has

00:23:16

modified them. So it's not really the

00:23:18

same as what we usually do, but the

00:23:21

pro is that it's relatively fault tolerant,

00:23:23

and in practice it works: if you

00:23:26

don't push it with too many replicas,

00:23:28

it works. But for both kinds of

00:23:34

updates, both synchronous and

00:23:36

asynchronous, you really want the model

00:23:38

computation to be large enough so that

00:23:42

it's worth sending the parameters over

00:23:43

the network. So we

00:23:47

saw here that each model replica

00:23:49

has a copy of the parameters, so the

00:23:51

parameter server is sending the

00:23:53

parameters over the network to the model

00:23:55

replicas every time there is an

00:23:57

update. So if you send these parameters

00:23:59

all the time, and if there are a lot of

00:24:01

them, you waste a lot of time by

00:24:04

just sending the parameters over the

00:24:05

network. So the idea is to strike this

00:24:07

balance between the computation that

00:24:09

the network does with one set of

00:24:10

parameters without needing an update

00:24:13

versus the number of updates that you

00:24:15

do. And given this, the speed-ups depend

00:24:20

on the kind of model. So for very dense

00:24:21

models you can get a ten to forty times

00:24:24

speed-up from fifty replicas, and sparse

00:24:26

models, which have fewer parameters, support

00:24:28

many more replicas, even up to one

00:24:30

thousand. And in terms of models,

00:24:33

certain models reuse each parameter

00:24:36

many times. So for example convolutional

00:24:38

networks apply the kernel, or the local

00:24:41

receptive field, on all possible patches

00:24:44

of the image, right? So that means that

00:24:46

the same parameters are used a

00:24:48

lot before you need an update. So that

00:24:50

makes them good candidates for data

00:24:52

parallelism; same for recurrent models.
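
The earlier claim, that synchronous data parallelism trains exactly like one model with an N times larger batch, can be checked on a toy problem. The sketch below is plain Python, not TensorFlow; the one-parameter linear model, the function names `gradient` and `synchronous_step`, and the data are all assumptions made for illustration. Each replica computes a gradient on its own equally sized shard using the same parameter value, the gradients are averaged, and a single update is applied.

```python
def gradient(w, batch):
    """Mean gradient of the squared error (w * x - y) ** 2 over a batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def synchronous_step(w, shards, learning_rate):
    """One synchronous update: wait for every replica, then apply one combined update."""
    # Every replica uses the same copy of w, so no gradient is stale.
    replica_grads = [gradient(w, shard) for shard in shards]
    combined = sum(replica_grads) / len(replica_grads)
    return w - learning_rate * combined


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # points on y = 2x
shards = [data[:2], data[2:]]  # two replicas, two examples each

w_replicas = synchronous_step(0.0, shards, learning_rate=0.01)
w_big_batch = 0.0 - 0.01 * gradient(0.0, data)  # one model, batch twice as large

print(abs(w_replicas - w_big_batch) < 1e-12)  # True
```

The equality relies on all shards having the same size. An asynchronous variant would let each replica apply its own update as soon as it finishes, so later gradients could be computed against parameter values that have already changed.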

00:24:55

So recurrent models are

00:24:58

very much used for sequences, that's what

00:25:00

they're built for. So if I want to do,

00:25:01

for example, some language modelling and

00:25:03

I feed into the network "Mihaela is

00:25:05

giving a talk about TensorFlow",

00:25:07

either one word at a time, so "Mihaela",

00:25:10

"is", "giving", "a", "talk", or one character

00:25:13

at a time, the model uses the same

00:25:16

parameters for each input until I'm

00:25:18

done with this sentence. So that makes

00:25:20

them good candidates for this kind of

00:25:22

data parallelism, because they do a

00:25:24

lot of computation before they need

00:25:25

an update. So now let's look at some numbers

00:25:30

to see how this helps. So this

00:25:33

plots ImageNet Inception

00:25:36

synchronous training. So you probably

00:25:39

know from yesterday what the Inception

00:25:43

model is: it's a very big architecture

00:25:46

that is here trained on

00:25:48

ImageNet, and you can see the time in

00:25:52

hours versus the obtained precision on

00:25:56

one GPU, ten GPUs and fifty GPUs. So if

00:26:04

we look at one GPU versus fifty GPUs

00:26:07

and we fix our precision, so I say that

00:26:10

if my model has zero point five

00:26:12

precision at one, I can

00:26:15

go home for the day, I've done my work

00:26:17

and I'm happy to go home. Well, if I use

00:26:19

one GPU then I have to stay for

00:26:22

three days; if I use fifty GPUs, in

00:26:26

two point six hours I can already go

00:26:28

home. So that's a very big difference, a

00:26:30

thirty times difference. So notice that

00:26:33

it's not linear, right? I increase

00:26:35

the number of GPUs fifty times but I

00:26:38

only get thirty times, so there's

00:26:39

still an overhead. But I still get a

00:26:43

massive, massive improvement here. Now

00:26:45

if we look at ten GPUs versus fifty

00:26:47

GPUs at different accuracy levels, so

00:26:50

at zero point six and zero point six

00:26:52

five, you will also see around a four

00:26:55

times speed-up by going from ten GPUs

00:26:58

to fifty GPUs. So this is pretty

00:27:00

good. And this is kind of how the graph

00:27:04

looks like for when you increase

00:27:06

the number of workers: as we

00:27:10

increase the number of workers, how many

00:27:11

more examples per second the model

00:27:14

can see. So you see that if you use a

00:27:17

hundred workers you get a fifty-six times

00:27:18

speed-up versus if you use one; if you

00:27:21

use sixteen workers you get a fifteen times

00:27:24

speed-up versus if you use one. So

00:27:26

here again it's not linear, you don't

00:27:28

get a hundred times speed-up if you use

00:27:29

a hundred; it's clear that there is some

00:27:31

overhead, but you can still speed

00:27:33

things up considerably. And so data

00:27:38

parallelism is not only great in theory

00:27:41

and so on, it's actually very

00:27:43

important for us too. This ImageNet

00:27:44

Inception training actually uses these

00:27:46

fifty GPUs; smart reply, I talked about

00:27:50

this feature of Inbox a bit

00:27:52

earlier, uses sixteen replicas

00:27:54

to train the model, each with

00:27:56

multiple GPUs. And the state of the

00:27:59

art language model on the One Billion Word

00:28:01

Benchmark uses both data and model

00:28:04

parallelism on thirty-two GPUs. So this

00:28:06

is actually very much used in practice.

00:28:08

now this talks a lot about multiple

00:28:14

devices, but how about single-device

00:28:17

TensorFlow performance? I put this here

00:28:19

because it's related to my

00:28:21

previous slides. So when TensorFlow was

00:28:23

initially released in November two thousand

00:28:25

fifteen it definitely had some speed

00:28:28

issues. But it has improved and it

00:28:32

continues to improve so you can see

00:28:34

here this this number one these and

00:28:38

numbers it's getting quite good but

00:28:40

there's still definitely a lot of

00:28:42

work to do in this respect. Now about

00:28:47

portability. So as I said before it's

00:28:52

very important that you have a machine

00:28:54

learning framework that runs on a

00:28:56

variety of platforms because that also

00:28:58

decreases the time between researchers

00:29:01

coming up with ideas and the time you

00:29:03

want to productionize a

00:29:04

model and also saves a lot of developer

00:29:07

time because they don't have to port

00:29:08

code from one architecture to the

00:29:12

other. So TensorFlow works on CPUs,

00:29:16

GPUs, phones, distributed

00:29:20

systems and even custom

00:29:22

ML hardware, so it's very very

00:29:23

flexible in that way and if you're

00:29:26

interested in how to do this there is a

00:29:28

lot of tutorials out there on how to

00:29:31

use TensorFlow on both Android

00:29:34

and iOS. So here are some screen

00:29:36

shots of how to use the ImageNet

00:29:39

already trained models, so you don't have

00:29:40

to train your own model to do image

00:29:43

recognition on Android, and this is on

00:29:47

iOS. So if you want to see that in

00:29:51

the picture there's some ice cream

00:29:53

and chocolate sauce then you can build

00:29:55

an app that can show you that. Now,

00:29:59

how about reproducibility? So TensorFlow

00:30:05

is open source, as I said, with the

00:30:08

flexible Apache two point zero

00:30:09

license. And this is very important for

00:30:12

us because we think that this really

00:30:14

helps push machine learning research

00:30:16

forward, because researchers can now

00:30:18

publish code for new algorithms in

00:30:21

TensorFlow, they can create repositories

00:30:23

for trained models. And that really

00:30:25

also makes research papers reproducible

00:30:27

How about the external adoption of TensorFlow?

00:30:33

So if we look at GitHub, most

00:30:35

of the deep learning frameworks that

00:30:37

we are familiar with are on

00:30:39

GitHub. TensorFlow has twenty seven

00:30:42

thousand stars or did when I created

00:30:44

the slides, and ten thousand forks, so

00:30:49

it's much more popular than the

00:30:52

other frameworks in terms of GitHub

00:30:54

stars and forks. And this is even

00:30:56

though it was launched only in November

00:30:58

two thousand fifteen also in terms of

00:31:03

external adoption in seventy two hours

00:31:06

after launch there were more than fifty

00:31:07

thousand installs. And more than five

00:31:10

hundred thousand since November two

00:31:11

thousand fifteen. And despite it being

00:31:14

launched only in November two thousand

00:31:17

fifteen, it was the most forked repository of

00:31:20

two thousand fifteen on GitHub. So

00:31:22

we think that's pretty good. Another strong

00:31:27

point of TensorFlow is the tutorials

00:31:30

and documentation. So it's very hard to

00:31:33

start with any framework especially if

00:31:35

you're a beginner with machine learning

00:31:36

so you don't know much about machine

00:31:38

learning or deep learning in particular: you

00:31:40

also have to learn the framework, and you also

00:31:42

have to learn how to deal with machine

00:31:43

learning. And I think TensorFlow has

00:31:46

a really wide variety of tutorials out

00:31:49

there. And it caters to both needs. So

00:31:52

if you already are very much familiar

00:31:55

with deep learning you can follow the

00:31:58

expert MNIST tutorial, which skips a

00:32:01

lot of the deep learning details and

00:32:03

just goes into, hey, this is how you

00:32:04

use TensorFlow. Or you can use the intro

00:32:07

MNIST tutorial, which goes a lot into the

00:32:11

details of how the model actually

00:32:16

works. And of course if you

00:32:19

want to find out even more about how

00:32:22

the internals of TensorFlow work, there

00:32:24

is an excellent white paper released in

00:32:27

two thousand fifteen that talks a lot

00:32:29

about the internal computation engine

00:32:31

and even the optimisations

00:32:34

performed by TensorFlow and so on. I

00:32:36

definitely recommend it. Now about

00:32:47

production readiness. So it's very

00:32:51

important these days especially with

00:32:54

deep learning advancing so fast, to be

00:32:56

able to integrate these new models in

00:32:58

and these breakthroughs into products

00:33:01

to actually make them available and

00:33:02

useful to people that use their phones

00:33:05

or their laptops every day. And with TensorFlow

00:33:08

it's actually very easy to train models

00:33:10

in Python, so this is ideal, very high

00:33:12

level. And then developers can use this

00:33:15

in C plus plus in order to serve in

00:33:17

production. C plus plus of course is very

00:33:19

very efficient and much better for

00:33:21

production code. And then, because you

00:33:25

can use the same models, developers

00:33:27

don't have to train models themselves

00:33:30

they can just use the ones that the

00:33:33

researchers trained. And as a

00:33:36

concrete example going back to smart

00:33:39

Reply in Inbox: in four months it was

00:33:42

taken from a deep learning research

00:33:46

project to a launched

00:33:50

product that you can all use on

00:33:52

your phone now. So definitely having

00:33:55

this short iteration cycle and having

00:33:58

the same tool used by everyone helps a

00:34:01

lot with moving much faster. So

00:34:06

in conclusion for this first part, I

00:34:08

think I was a bit because I was talking

00:34:10

machine learning is definitely changing

00:34:15

the world. It is changing how we use our

00:34:18

phones how we use our computers how we

00:34:20

think about what problems we can solve

00:34:22

or not. A lot of problems that we

00:34:24

thought we would not be able to solve right

00:34:28

now are becoming easier to be

00:34:32

cracked. And the nice part is that you

00:34:35

can be part of it so when you think

00:34:36

about solving a problem, you should

00:34:39

actually think should I use machine

00:34:41

learning for this? Can I use machine

00:34:43

learning for this? And there's a lot of

00:34:45

tools out there, including TensorFlow,

00:34:46

that are free, that have a lot of

00:34:48

tutorials and a lot of documentation.

00:34:51

And they can really help you

00:34:53

get started with this. And just I think

00:34:56

this is the take-home message

00:34:57

especially for those who have not

00:35:01

gone already into this mindset: it's very

00:35:04

easy to get started and it is very easy to

00:35:06

make an impact these days with all

00:35:07

these available tools. So with that

00:35:10

I will take questions if you have

00:35:12

some, and then we'll continue

00:35:15

with the second talk. Okay, yeah, thank

00:35:39

you for the talk. So I wanted to know,

00:35:42

is there any other framework

00:35:48

that is not open source that you are using at

00:35:50

Google that was not open sourced? I think

00:35:55

that's a difficult question, I'd rather

00:35:56

not comment I would just say that it's

00:36:00

yeah yes so you see that things might

00:36:06

take longer to get out there because

00:36:08

there are very high standards to make

00:36:11

things open source so for example the

00:36:12

distributed training was not in the

00:36:14

first open source release, but it got

00:36:15

there now right. So that's that's what

00:36:17

I can say: things are getting out.

00:36:25

thanks for the talk it's not a question

00:36:39

about the internals of Google are there

00:36:41

any projects where you tried it and then you

00:36:44

decided not to use TensorFlow? Again,

00:36:49

I'd rather not comment but I don't

00:36:52

think TensorFlow has any specific

00:36:56

limitations so people are definitely

00:37:00

it's it's not like it's an interesting

00:37:02

thing that there are problems with it

00:37:03

is definitely very much used and it's

00:37:04

made if there are problems I'm sure

00:37:06

that people are gonna fix it. So I'd be

00:37:09

very surprised but again. It's always

00:37:12

asking and asking about open source

00:37:14

maybe one question: is there a

00:37:24

lot of contributions from external

00:37:27

people? So there are plenty of

00:37:29

contributions for and me Reading cards

00:37:32

and we will go through this

00:37:33

also later. So both to the core

00:37:36

repository there are plenty of external

00:37:38

contributions. And also feature

00:37:41

requests: the idea is, if you want

00:37:43

something, don't just assume, ah, it's not

00:37:47

there I'm gonna try later just ask for

00:37:49

it. And for example, for distributed

00:37:52

TensorFlow, the way to specify your cluster for

00:37:55

the distributed computation, and I'll

00:37:57

talk about that in the second talk. Um

00:38:00

it's a bit cumbersome today, so

00:38:03

we are actually asking people what you

00:38:05

want to see right so it's not only that

00:38:08

of course we accept contributions and

00:38:10

if you look at the GitHub repository

00:38:13

there are actually a lot of very interesting

00:38:14

ones and people are collaborating even

00:38:16

meeting together just we want to do

00:38:19

this, not as a formal contributor, but

00:38:20

just, we're gonna meet and code together,

00:38:22

then send the patch, and the patches get

00:38:24

integrated with the repository so

00:38:26

definitely. And do you know the algorithms

00:38:34

for gradient descent that are used in

00:38:36

the distributed version of the

00:38:38

synchronous one and the asynchronous one?

00:38:40

I mean I have ideas about suspects

00:38:42

maybe a more useful the synchronous and

00:38:44

downpour for the asynchronous but so I

00:38:47

mean you can specify the optimiser that

00:38:49

you want to use; it's just that how the weight

00:38:52

updates will get applied is different.

00:38:55

So that but our burden it's not that

00:38:58

when you choose to do data parallelism

00:39:02

it will fix the algorithm for

00:39:04

you. Because when you build the

00:39:05

computational graph you can specify the

00:39:07

optimiser. And it's just that how the

00:39:09

updates get applied to the parameter

00:39:12

server that changes between the yeah

00:39:17

but there's a there's a constraint on

00:39:18

that, depending on whether you have a

00:39:20

parameter server versus

00:39:22

direct executor-to-executor communication,

00:39:24

there's a limitation of which kind of

00:39:26

distributed algorithm you can actually

00:39:28

apply to get the stochastic gradient

00:39:31

right. So downpour for example is

00:39:34

famous for the fact that it's not only

00:39:35

asynchronous but that executors can

00:39:37

communicate between themselves which

00:39:39

brings a downpour to some kind of to

00:39:42

some H is in in in some cases where

00:39:46

you're great search get to so you have

00:39:51

a centralised parameter server

00:39:53

executors talking to it without talking

00:39:56

to each other and that's you have to

00:39:58

specify get to visit mice yeah so

00:40:03

actually go back or yeah it's cool. So

00:40:10

I think it's less about what were yeah

00:40:13

so it's less about the optimiser

00:40:14

because the optimiser will just it's

00:40:16
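
The point made in the answer above, that the optimiser is chosen when the graph is built and that only the way updates reach the parameter server differs between the synchronous and asynchronous modes, can be illustrated with a toy simulation. This is a minimal sketch in plain Python, not the TensorFlow API: every function name is hypothetical, the "parameter server" is a single float, and the asynchronous mode is modelled simply as workers applying gradients one after another that were computed on a stale copy of the parameter.

```python
# Toy, framework-free simulation; every name here is hypothetical.

def gradient(w, shard):
    """Mean gradient of 0.5 * (w - x)**2 over one worker's data shard."""
    return sum(w - x for x in shard) / len(shard)

def sgd_step(w, grad, lr):
    """The optimiser: plain SGD, identical in both training modes."""
    return w - lr * grad

def train_sync(shards, lr, rounds, w=0.0):
    """Synchronous: wait for all workers, average, apply one update."""
    for _ in range(rounds):
        grads = [gradient(w, shard) for shard in shards]
        w = sgd_step(w, sum(grads) / len(grads), lr)
    return w

def train_async(shards, lr, rounds, w=0.0):
    """Asynchronous: each worker's gradient is applied as soon as it arrives."""
    for _ in range(rounds):
        snapshot = w  # all workers read the parameter at the same moment
        for shard in shards:
            grad = gradient(snapshot, shard)  # computed on a stale value
            w = sgd_step(w, grad, lr)         # applied immediately
    return w

shards = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
w_sync = train_sync(shards, lr=0.1, rounds=200)
w_async = train_async(shards, lr=0.1, rounds=200)
print(w_sync, w_async)
```

Both loops call the same `sgd_step`; only the schedule of updates changes. On this convex toy problem both reach the mean of the data, whereas in real training stale gradients can affect convergence.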

Conference Program

59:34

Deep Supervised Learning of Representations
Yoshua Bengio, University of Montreal, Canada
July 4, 2016 · 2:01 p.m.

2369 views

55:38

Hardware & software update from NVIDIA, Enabling Deep Learning
Alison B Lowndes, NVIDIA
July 4, 2016 · 3:20 p.m.

427 views

01:01:02

Day 1 - Questions and Answers
Panel
July 4, 2016 · 4:16 p.m.

331 views

55:14

Torch 1
Soumith Chintala, Facebook
July 5, 2016 · 10:02 a.m.

815 views

55:57

Torch 2
Soumith Chintala, Facebook
July 5, 2016 · 11:21 a.m.

342 views

01:08:04

Deep Generative Models
Yoshua Bengio, University of Montreal, Canada
July 5, 2016 · 1:59 p.m.

2156 views

49:29

Torch 3
Soumith Chintala, Facebook
July 5, 2016 · 3:28 p.m.

275 views

52:43

Day 2 - Questions and Answers
Panel
July 5, 2016 · 4:21 p.m.

151 views

45:40

TensorFlow 1
Mihaela Rosca, Google
July 6, 2016 · 10 a.m.

2659 views

52:33

TensorFlow 2
Mihaela Rosca, Google
July 6, 2016 · 11:19 a.m.

1705 views

01:05:51

AMD's Open Compute and Open Source cross platform solutions for Machine Learning
Mauricio Breternitz, AMD
July 6, 2016 · 1:59 p.m.

1406 views

01:04:41

TensorFlow 3 and Day 3 Questions and Answers session
Mihaela Rosca, Google
July 6, 2016 · 3:21 p.m.

2251 views
