Note: this content has been automatically generated.
Okay, so let's start; first a couple of words of introduction. This talk will be given by our speaker from AMD, a senior engineering fellow, who is currently working on exascale computing systems and the acceleration of cloud workloads, in particular on heterogeneous machines running with CPUs and GPUs. The talk will be about tools for machine learning with AMD products. Thank you.
Thank you. Can everybody hear me okay? Alright. So, it is my pleasure to be here; thank you for the invitation. As was said, I want to discuss what has happened in this space with AMD, try to give you some background and, to be honest, some of the support work that we have going on right now. I am part of AMD Research, so I wanted to give you a feel for this.
These are some of the locations we have throughout the US. A lot of our research work is also funded by the US government, by the Department of Energy. We actually have a project of about fifty million dollars to look at exascale computer systems. Exascale means reaching ten to the eighteenth floating point operations per second, and our power budget is around twenty megawatts. We are looking at exascale supercomputing systems that will be on the order of sixty to a hundred thousand nodes, using new-generation CPUs and GPUs, in the 2021-2022 timeframe, and this requires enhancements in the architecture, in the programming models, in the power efficiency, and so on.
In addition, that research has ramifications that apply to our next-generation products; as I said, it is affecting all of our software stacks, and I want to make that connection for you. I intend to take questions at the end, but if there is a pressing question, please do not hesitate to interrupt me, and I will try to respond.
This is the outline for my talk. I will specifically focus on what we have been doing in support of machine learning. We have an overall vision that the software we are doing is not necessarily only for machine learning but also for other markets, so it necessarily needs solid foundations, and I will describe to you the kinds of hardware features that we see right now.
At this conference I am also keeping an eye towards identifying what are the right features to bring into the hardware, to keep enhancing it, specifically because this is a field that is rapidly moving. There is a lot of innovation happening, so we need to play that dance between committing things to hardware and identifying the key building blocks, such that we can create efficient hardware and efficient software support.
As I said, I will cover some of the foundations that we have, some of our approach, and our open source libraries; I will comment a bit on what we have been doing in open source libraries, and I will get specific about what we have today for machine learning. Of course, this is a rapidly moving field, the terminology itself is moving, and we are really playing catch-up in certain areas, but hopefully we will continue improving, specifically in alignment with our vision, which is about being open and very collaborative.
On the overall vision: you saw the diagram I had in my previous slide about moving from small systems, laptops and desktops, up to even supercomputers, as mentioned in the past. The idea is to have computers, supercomputers, that have high performance and are heterogeneous. There are certain classes of code that map well to the GPU; for those, using the GPU for compute can be quite efficient, for example data-parallel computations; whereas when you have a lot of branching, sometimes the CPU does better. I will make that connection a little bit later.
We do have this vision of having a node of heterogeneous processing that is able to execute CPU code when it makes sense to do so; this is, in a sense, one of the features that we see: being able to apply the best hardware organisation to the problem at hand.
My second bullet here mentions HSA, the Heterogeneous System Architecture, which is a programming model and a standard that was actually initiated by AMD but is not controlled by us; ARM is part of it. It is a foundation that proposes a programming model to make programming the GPU easier. I hope to be able to explain that a little bit more, though it is not the key topic of my talk here.
The key proposition is that you would be able to write C++ programs that are directly executable, using the best platform for the execution; as I said earlier, that could be either the CPU or the GPU, and the Heterogeneous System Architecture does not really presuppose only those: it could be an FPGA or something of that sort in the future.
And, as I said, our approach of being open would make it perfectly feasible for someone, especially researchers, to continue building on that. There is one item that I neglected to mention about AMD Research, and I want to make this point here: as I said, we are spread out throughout the US, and a lot of our work is done in collaboration with many universities; we are also hosting postdocs and PhD interns.
So this talk is also a pitch: hopefully we will find interesting postdoc candidates willing to work with us and extend the research in collaboration.
As I said, the third bullet that I want to describe to you is the approach; let me beat on this point a little bit. The idea is to provide open source software anchored on open standards, in the sense that it supports many of the key programming interfaces: either through an ISA like x86 (you may like it or not, but it is a standard instruction set architecture), or through hardware interfaces such as PCIe; a new standard called CCIX is also coming along nicely. Hopefully this is also supported by open source compilers, which in turn enables development even for domain-specific languages and environments.
As I said earlier, the philosophy that is permeating our products is the idea of adhering to open standards, and hopefully this will stimulate both collaboration and innovation. We have based a lot of our hardware and software on this Heterogeneous System Architecture, which in turn benefits from a number of academic and industrial members, and even the intellectual property rules are such that they enable collaboration. The HSA Foundation is actually independent of AMD; you can find the specification available on its website.
I have listed a number of links here; hopefully someone interested can follow some of those links and get access to the extra information. More specifically, HSA supports open standards like C++14, and, as I said, some of our work is also done in support of exascale computing, which means huge supercomputing systems, so PGAS (partitioned global address space) languages and emerging standards; and of course it is able to support memory models such as those of .NET and Java. It actually turns out that the memory semantics of HSA are formally described, so someone with an interest in that can study it and define new ways of connecting to the architecture.
As I said, one of the key ideas behind HSA is this parity: the CPU and the GPU both have equal access to memory, so we can even think about passing a pointer from the CPU to the GPU, and the GPU is able to dereference that pointer and get to the data, without the previous need for doing explicit copies and so on. The coherence is taken care of by the hardware, so hopefully this enables programmability, and in some situations you can actually get better performance. One of the paths to performance would be to do dynamic scheduling, in which you choose to execute pieces of code on either the CPU or the GPU.
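To make that pointer-passing idea concrete, here is a minimal sketch in HIP-style C++. This is an illustration added for clarity, not code from the talk; it assumes an HSA-style system with shared virtual memory and uses a pinned host allocation so it also works on discrete GPUs.

    #include <hip/hip_runtime.h>
    #include <cstdio>

    // The GPU dereferences the very same pointer the CPU filled in.
    __global__ void scale(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main() {
        const int n = 1024;
        float* data = nullptr;
        // Host-visible, GPU-accessible allocation: no explicit copies needed.
        // On an HSA-compliant APU a plain malloc'd pointer would work as well.
        hipHostMalloc((void**)&data, n * sizeof(float), hipHostMallocDefault);
        for (int i = 0; i < n; ++i) data[i] = 1.0f;
        hipLaunchKernelGGL(scale, dim3(n / 256), dim3(256), 0, 0, data, n);
        hipDeviceSynchronize();            // wait for the GPU to finish
        printf("data[0] = %f\n", data[0]); // the CPU reads the result directly
        hipHostFree(data);
        return 0;
    }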
And that gets us to programming models. I will be diving into some of those a little bit later, but one of them is our C++ compiler. Also, as I said earlier, OpenMP is important in certain spaces, and it is definitely important in exascale high-performance computing; we do have a version of an open source compiler that supports it. All of that is supported by an LLVM-based, optimising compiler back end.
Specifically (I will mention this again later, but let me repeat it a little to try to make the point): if you use the LLVM back end, you actually benefit from much of the optimisation work that researchers worldwide are contributing in the open source compiler arena. You also get access to the back end, the part of the compiler that is actually doing the code generation, so you can look and see what kind of code is generated; and if you do not like it, this actually gives you ways to make your own modifications, or even contribute them back to make it better. And, as I said, Clang is actually the de facto standard front end for many different compilers.
Our kernel-level software is built on top of open source Linux, and all of our patches are upstream. An important piece of information here is the device driver that allows you to have a coherent view of virtual memory: as I mentioned earlier, when you pass a pointer between the CPU and the GPU, this is a single virtual memory address space, so you do not have to worry about dealing with the physical memory yourself. Most leading Linux distributions will be enabled this way. As I said earlier (and I will be repeating this a lot, because I want to drive this point home), all of our stack is open source, including what traditionally used to be very secret, which would be the kernel mode drivers.
Traditionally, especially in GPUs, the instruction set of the GPU was not public, and there were good reasons for that, good business reasons. One would be compatibility: a hardware designer designing the next-generation Intel x86 CPU cannot freely invent a new instruction. Well, people actually try to do that, but it is very complicated, because if you decide you do not want to keep, or want to change, a given x86 instruction, it breaks a lot of code that is out there, so you have a huge burden of backwards compatibility. GPU designers, on the other hand, traditionally had that degree of freedom, because the ISA was hidden behind the driver, so you could make it whatever you liked: games traditionally communicate with the driver, which, as it is loaded, and many times at dispatch, would compile the code to the native ISA; so the hardware designers had a lot of freedom to invent new instructions all the time. In our case, AMD has published the instruction set for the GPU, called GCN, and so we do take on a burden of backwards compatibility. How do we hope to navigate this? Well, when we change things, you will have to recompile your binaries, which is not so bad, after all.
As I said, we have an OpenCL compiler, a C++ compiler, and also support for Python; these are also open source. The back ends are our contributions, and they are also upstream. Now, in certain situations we support a few compilers which do not belong to us; unfortunately, at the moment those are closed source: we have some OpenMP compilers that are available for our chips, but they are not open source like everything else.
So, as I said, the key vision would be having open source software, and I keep mentioning this website specifically: traditionally, the GPU is used for gaming, and there is a lot of hardware invested into creating high quality pixels, creating textures and so on; but on the compute side there are now emerging applications of the GPU itself. We have a website called GPUOpen.com, and I am tempted to say almost a hundred percent of our software is available as open source through there.
As I mentioned, hopefully coupling this with open standards will enable us to basically increase the collaboration and increase the creativity; to be honest, all of us working together will be better than our limited resources trying to provide everything for everybody. So that is one of the intentions.
This is actually not one of my slides; it is from our teams describing what we call Radeon Open Compute. Our marketing guys finally coined "ROC": Radeon is the brand for our GPUs, and, as I said, the idea is to have this open compute platform, hence the name ROCm; there is a lot of "rocking" wordplay in this presentation. The idea is to create a path for high performance computing. As I said, what is funny about our research is that it is aimed at the exascale computing market; however, we actually want to be able to deliver value to our customers, to other classes of customers, from desktop systems up, as we have been doing in this space. And, as I said earlier, this should be a foundation for development, for experimentation and for discovery. So that is the philosophy behind ROCm, the Radeon Open Compute platform; we have the open source runtime, which is called ROCr, and you will see that coming up in the middle of the stack.
So, in a sense, I have described to you our overall vision and how we are navigating this space. The decisions are dictated by several pragmatic issues, and also by the need to collaborate, because, as I said, we are not the largest company in the world, and it is beneficial to draw on the energy of the many researchers, such as you, who would be interested in collaborating and creating new ideas. Part of the support for this is the kind of hardware that we have, and here I should mention an interesting memory development.
To understand it, the notion of high bandwidth memory is key. As you know, GPUs are really hungry for a lot of data access, and AMD was the first company to commercialise (not the first to develop, but the first to commercialise) high bandwidth memory. Describing it simply, the idea is that you place the memory layers on top of the GPU; to be honest, it is not always on top, sometimes they sit next to each other. By doing this in silicon, instead of having one packaged device and another packaged device with wires going from one to the other through some standard memory interface like DDR, where you are restricted in the number of wires, you are able to have thousands and thousands of wires. So, for example, you have a number of memory dies, or memory layers, here, and then you connect that to the GPU die using, say, four thousand wires, and you can get a lot more data moving very quickly.
This has been in our products for a while already, and, as I said, it provides much higher memory bandwidth; it has been adopted throughout the industry, and there is now a new, second-generation high bandwidth memory standard upcoming, which AMD is again participating in. The key idea behind this was to provide both an increase in bandwidth and better bandwidth per watt: if you have to traverse long distances with your signal, there is a lot of energy, in terms of power, that you consume.
Remember what I mentioned at the beginning of the talk: to achieve one exaflop of execution under twenty megawatts, if you do the math, you need something like a tenfold to hundredfold improvement in energy efficiency to be able to execute under such a constraint. The idea here is that the electrons do not have to travel very far. Bill Dally keeps making this point: it sometimes costs you more energy to transport the data from here to there than to actually do the floating point operation, so energy efficiency is the key consideration for this.
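As a quick sanity check on those numbers (my arithmetic, not a figure from the talk): an exaflop under twenty megawatts means

    10^18 FLOP/s / (20 x 10^6 W) = 5 x 10^10 FLOP/s per watt = 50 GFLOPS/W

whereas a 2016-era GPU delivering, say, 8 teraflops at around 250 watts sits near 32 GFLOPS/W at peak, and sustained application efficiency is far lower; hence the need for a large improvement in energy per operation.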
So, at this point of my talk, I have given you an idea of the foundations, the hardware structures, that we see as beneficial for computation. As I said, I am also keeping a particular eye towards deep learning, which will also benefit from the increased memory bandwidth, from the ability to address larger memory spaces, and so on.
For completeness, these are products that are available today. This was the first one, the initial product: it came out with four gigabytes of high bandwidth memory, 512 gigabytes per second of memory bandwidth, and multiple teraflops of single precision. Then we have the Pro Duo, which in a sense couples two of these together: it has eight gigabytes, four on each side, that can be accessed at high bandwidth, at around thirteen teraflops. And the third one here, which used to be called Polaris, was announced, I think, a week or so ago: eight gigabytes, 256 gigabytes per second of memory bandwidth, 150 watts, and, I think, the first one to break the two-hundred-dollar price point. We had a discussion here about how many gigaflops you are going to be using to compute neural networks, and whether you should discuss that in your papers; hopefully, by making GPUs cheaper, we can make that available to everybody. So, again, the theme here is to enable further experimentation on this.
I also wanted to highlight what we mentioned earlier: the APU, the accelerated processing unit. Here, on the same silicon die, in this particular product (actually, the laptop I am presenting on is one of those), you have four x86 CPU cores and a GPU that are sharing memory; they are equal citizens in terms of accessing the memory. You can see that they are able to access the same memory, and of course the GPU usually demands much more bandwidth than the CPU; we have mechanisms to make sure execution proceeds properly.
The idea would be to accommodate both the CPU computation and the data-parallel computation, and hopefully to seamlessly switch between the two with the lowest possible overhead: being able to access memory from both the CPU and the GPU, doing everything in user mode. As I said earlier, previous implementations, for example with a discrete GPU, need to invoke a device driver; if you are software folks, you realise that when you do that you are making a kernel transition to privileged mode, so in a sense you are stalling the execution pipeline until you get your data there; you try to minimise those transitions.
This is just to give you some notion of what is going on: it shows the bandwidth benefit that you get from using high bandwidth memory as you increase the block size of the transfers. On the left you have transfers of memory from the host to the GPU; on the right, going the other way. As you can see, our previous-generation drivers had much slower memory bandwidth, and a direct-attached GPU was able to achieve higher memory bandwidth. Again, I am trying to make the point that when you need to access much more data, this enhanced memory bandwidth is going to be crucial for achieving efficient and higher performance on the GPU.
Similarly, this is like a roofline chart, showing how much single-precision gigaflops you are able to achieve with products that are available today; hopefully in the future we will be able to achieve even higher performance, but this gives you an idea of where we are today. This is a roofline curve, as you see; this type of chart was popularised by the folks at Berkeley. In a sense, you chart the arithmetic intensity, the number of floating point operations per byte that you perform: on the left side you are bottlenecked by memory bandwidth, and as you keep going you hit your saturation point, where you are limited by compute capacity; that is where you see the flat peak of the roofline.
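For reference, the roofline bound being described can be written compactly; the numbers below are my own illustration using the Fiji-class figures quoted earlier, not the slide's:

    attainable FLOP/s = min( peak FLOP/s , arithmetic intensity (FLOP/byte) x bandwidth (byte/s) )

At 512 GB/s and roughly 8 TFLOPS of peak, the ridge point sits near 8x10^12 / 512x10^9, about 16 FLOP/byte; a kernel below that intensity is bandwidth-bound, which is exactly why high bandwidth memory lifts the whole left slope of the curve.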
Anyway, I hope that so far I was able to give you an idea of what we see as being the foundations in the hardware support. From those foundations, I will now try to describe the software stack that we have. When we are considering a deep neural network, we construct it from the mathematical aspects, but it is very important, especially to get performance, that you understand the mathematical aspects, how they are expressed in software, and then how that software is actually mapped onto the hardware. In this case, I want to show you what we have that would help you achieve that task.
As I said earlier, one more note about the HSA Foundation: the initial focus was actually on the APU; this is the name for having the CPU and the GPU pieced together. Recently, the same technology has been extended to also support discrete GPUs, and we have a lot of ongoing effort there: in that case we have separate dies, and we have high performance interconnects, plus high performance hardware support, to enable transferring data between the CPU and the GPU at the user level, hopefully with minimal software overhead.
The second portion of these components would be related to the Radeon Open Compute platform, which right now is the thing that keeps this going: the HSA runtime as it was originally specified, which means the device driver that you have today is fully HSA compliant. HSA is separate from us: it has a specific specification of what a compliant device is, which means supporting the memory model, supporting a number of instructions, and so on, and this is fully compliant. However, it has a few other extensions; one I should mention is peer-to-peer: in this case you are able to transfer memory between GPUs directly.
One interesting point I would like to make about HSA: as I said, CPU and GPU are equal citizens. Traditionally we see the CPU as being the master and the GPU as the slave: you write code in C and then you dispatch some code for execution on the GPU. With HSA, you are actually able to have the GPU dispatch code to run on the CPU, and vice versa; of course, you need to be careful about what you are doing, because with sixty-four GPU threads spawning work you can quickly overrun the CPU. The other key extension, peer-to-peer, matters in the case where you have a huge network: say you have a system with hundreds of nodes; you will then be able to transfer memory, again hopefully at user level, between those.
Specifically, when you are doing distributed SGD, when you get to the all-to-all communication phase where you are passing around parameters, this is a very key feature for achieving a lot of performance; that phase can be one of your bottlenecks, especially, as other speakers described earlier, when you are doing synchronous SGD, where at a given point everybody has to communicate their parameters to everybody.
I have already mentioned some of our compilers, the Python work, and LLVM; I will mention a little later HIP, which is a tool for portability, an approach to enable CUDA developers to take advantage of AMD GPUs. As I said earlier, the memory of the discrete GPUs is also exposed to the programmer through normal programming language constructs. At the bottom here, just to show you how rapidly this is developing: last November, at the Supercomputing conference in Austin (some of you may have been there), we did the preview release, and the second release came out this spring; the point is that this is very active, ongoing development.
So do not be afraid to download the latest drivers, and do not be afraid to contribute to enhancing those drivers and so on; this is a very active field of development right now. As I said, on the GPUOpen website most of our source code is available, either on GitHub or on Bitbucket. In addition to supporting compute, we also have graphics libraries that are open source; but for this audience I should mention our compute libraries: the BLAS libraries, FFT, random number generators, and I will also touch a little bit on our sparse matrix operations.
Again, I want to make sure you get the notion that most of our software is actually found in one place: from GPUOpen.com you can go and get access to most of the software right there. One of the key pieces of that software is what we call HCC, the Heterogeneous Compute Compiler; we call it single-source C++ for the GPU, and it also has support for OpenMP extensions. I am not sure how many of you are OpenMP programmers, but in OpenMP you write your C or Fortran (or similar) code as usual, and you annotate specific loops with pragmas.
You tell the compiler: this is parallel. The compiler then generates parallel code, for example for a multi-core CPU. The latest OpenMP standard, 4.5 right now, has accelerator directives, so you can specify that you want this loop, or these particular variables, to be located on the CPU or on the GPU, and the compiler is free to do whatever it finds best; it can even ignore all those hints, in which case you still get a perfectly fine sequential program (a small sketch of these directives follows).
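Here is a minimal sketch of what those directives look like; this is my example, not the speaker's slide, the loop and array names are illustrative, and a compiler without offload support will simply run it as a serial loop.

    #include <cstdio>

    int main() {
        const int n = 1 << 20;
        static float a[1 << 20], b[1 << 20];
        for (int i = 0; i < n; ++i) b[i] = (float)i;

        // OpenMP 4.x accelerator directives: map the arrays onto the device
        // and distribute the loop across its teams/threads. The compiler is
        // free to honour or ignore this; ignoring it leaves a valid program.
        #pragma omp target teams distribute parallel for map(to: b) map(from: a)
        for (int i = 0; i < n; ++i)
            a[i] = 2.0f * b[i];

        printf("a[10] = %f\n", a[10]);
        return 0;
    }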
The notion for us here, specifically, is that with the description of the parallel loop and of the arrays, you tell the compiler what is parallel and what data accesses you are performing within the loop; the compiler can then analyse the access expressions and decide what the best data layout is. Someone mentioned in an earlier talk all these strided accesses and so on; because, again, remember that on the GPU we have multiple threads running simultaneously that execute the same instruction. So, when you execute one memory load, it is highly beneficial to the GPU if every one of those threads accesses, for example, consecutive memory locations. We call those coalesced memory accesses: everything goes to the GPU in one transaction, and you get the highest bandwidth in that case, because in one access you get two hundred and fifty-six bytes. It is also possible to have each one of those threads go to different memory locations, and your code still works fine; what the GPU does in that situation is, supposing you get sixty-four different addresses, the memory controller takes over and breaks it into sixty-four accesses, and meanwhile the GPU scheduler, which is very quick at switching contexts (typically on a GPU you have thousands of ready execution threads), switches over to something else. However, the coalesced situation I mentioned first is the preferred one: we want either our compilers or the programmers to create code that accesses memory in the preferred way.
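Two tiny kernels to make the coalescing point concrete; this is my illustration in HIP-style C++, and the 256-byte figure matches 64 threads each loading a 4-byte float.

    #include <hip/hip_runtime.h>

    // Coalesced: thread i touches element i, so one 64-thread wavefront reads
    // 64 consecutive floats (256 bytes) in a single memory transaction.
    __global__ void copy_coalesced(float* out, const float* in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: thread i touches element i*stride, so the same wavefront hits
    // 64 scattered addresses; the memory controller splits this into many
    // transactions while the scheduler switches to other ready threads.
    __global__ void copy_strided(float* out, const float* in, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n) out[i * stride] = in[i * stride];
    }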
You can already perceive a little bit of a tension here: you have to understand the math behind your networks, you need to understand the tensor operations, and then you also need to think about how your data is laid out in memory, so that the GPU accesses it consecutively; hopefully, as I said, we are making it such that our compiler is able to assist you in those tasks.
I have described OpenMP so far; on the C++ side, C++17 has the notion of the parallel STL: you can annotate a loop and say this is a parallel for_each (I have an example coming up). We at AMD are working within the C++ language standards process to propose such extensions; again, the vision is that the programmer is able to tell the compiler "this is a parallel loop, and this is the data that I am touching", and then the compiler is able to do the optimisations that I described for you.
By the way, we are building on existing infrastructure: Clang already has some features for these optimisations, and this continues to be developed within the community. If you are interested, you can also annotate your code further; for example, you can place some data into different memory categories, which is germane to the GPU organisation.
In addition to the high bandwidth memory that I described, we have what we call local memory, or shared memory (the naming between AMD and NVIDIA differs a little bit here). It is a very fast, high bandwidth memory that is available to one execution block, to use the CUDA nomenclature. That memory is super highly ported, and it actually turns out to be one of the hardware capabilities that enables us to do the sparse matrix operations really fast; this is the kind of thing I am actively thinking about, trying to see what other features would help as well.
Additionally, if you are a programmer who is interested in getting the last bit of performance, it is possible to annotate, in the source code, where you want data allocated, to help the compiler; there is an active discussion on this in our community right now, and we even have a number of blogs in which people tell us whether or not we are doing a good job progressing with this. As I said, the compiler available right now supports single-source C++ and is already open sourced; it supports the parallel STL and language extensions, and also OpenMP.
In this particular case, the compilation is single source: you have a single file in which you can be writing your CPU code, and then you annotate a parallel for_each to have GPU code, and you do not really have to concern yourself with transferring data, because you can refer to global variables within your code without all the ritual of making sure the copies have finished by the time you need to access the data, and so on. Someone asked earlier: basically, the compiler generates the whole code, for the CPU and for the GPU, and the code that goes to the GPU is embedded into the same ELF file. In the HSA Foundation, by the way, we have proposed a binary format, so you can use your standard tools to analyse the ELF if you care to; at the very least, it allows motivated programmers to drill down and optimise. You can even control data proximity; as someone mentioned in another talk, sometimes when you are able to make your code fit into the GPU cache you get very good performance, so if you can do that, do it.
And, as I said, GPU kernels can be launched asynchronously. In HSA we basically architected it so that work is described by a plain C structure: in that structure you describe what work you want the GPU to do, what the address of the function to execute is, and what the arguments are. You fill in that structure and place it in a queue in memory; the hardware is watching the queue, sees that there is work to do, and then fetches the work. Different GPU models have different capabilities: some have eight such queues looking for work to do.
From the user's perspective, all you do is create these packets and dispatch the work to the GPU; everything else happens automatically, so in a sense you can start work on the GPU entirely from user mode. The HSA setting itself also provides a signalling mechanism, for the CPU, for example, to be informed, if you care, about the completion of a given kernel. You can even create dependency chains, which is a good thing: for example, remember those dataflow graphs; you can almost translate those directly. You create a dependency chain where you say: when this one finishes executing, I want the second one to be able to dispatch, and when the second terminates, the third one goes, and so on. It is possible for you to create such a structure and have the hardware take it over, only letting you know when everything has executed; in some environments this can be a significant enhancement.
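To give a flavour of that C structure, here is a heavily abridged sketch against the public HSA runtime API; this is my reconstruction, not code from the talk, error handling is omitted, and obtaining kernel_object and kernarg_address from a loaded code object is elided.

    #include <hsa/hsa.h>
    #include <cstring>

    // Assumes hsa_init() has run, 'agent' is a GPU agent, and kernel_object /
    // kernarg_address came from a finalized code object.
    void dispatch(hsa_agent_t agent, uint64_t kernel_object, void* kernarg_address) {
        hsa_queue_t* q;
        hsa_queue_create(agent, 4096, HSA_QUEUE_TYPE_SINGLE,
                         nullptr, nullptr, UINT32_MAX, UINT32_MAX, &q);

        // Claim a slot and fill in the dispatch packet: plain memory writes.
        uint64_t idx = hsa_queue_add_write_index_relaxed(q, 1);
        hsa_kernel_dispatch_packet_t* p =
            (hsa_kernel_dispatch_packet_t*)q->base_address + (idx & (q->size - 1));
        memset(p, 0, sizeof(*p));
        p->setup            = 1 << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
        p->workgroup_size_x = 256;  p->workgroup_size_y = 1;  p->workgroup_size_z = 1;
        p->grid_size_x      = 256 * 64;  p->grid_size_y = 1;  p->grid_size_z = 1;
        p->kernel_object    = kernel_object;    // what to run
        p->kernarg_address  = kernarg_address;  // its arguments

        hsa_signal_t done;                      // GPU decrements on completion
        hsa_signal_create(1, 0, nullptr, &done);
        p->completion_signal = done;

        // Publish the packet header last, then ring the doorbell: the hardware
        // itself picks up the work, with no kernel-mode transition involved.
        __atomic_store_n((uint16_t*)&p->header,
                         (uint16_t)(HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE),
                         __ATOMIC_RELEASE);
        hsa_signal_store_relaxed(q->doorbell_signal, idx);

        // Block until the kernel has executed.
        hsa_signal_wait_acquire(done, HSA_SIGNAL_CONDITION_LT, 1,
                                UINT64_MAX, HSA_WAIT_STATE_BLOCKED);
        hsa_signal_destroy(done);
        hsa_queue_destroy(q);
    }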
This is just a very quick example here (let me see how I am doing on time... yes, we need to speed up). Basically, the key point is this: as I said, we write a parallel for_each construct and, as I said a little earlier, I declared the bounds of my array, so I am able to specify, in perfectly standard C++ code, "go do that", and it will be compiled and run on the GPU. As I said, I would mostly first validate this code, and profile it if I wanted to get more performance out of it.
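The slide's code is not reproduced in this transcript, but a minimal standard-C++ version of the pattern being described looks like this; it is my sketch using the C++17 parallel algorithms, and HCC's own parallel for_each is analogous but AMD-specific.

    #include <algorithm>
    #include <execution>
    #include <vector>

    int main() {
        std::vector<float> x(1 << 20, 1.0f);
        // The execution policy declares the iterations independent, so the
        // implementation may run them on whatever parallel hardware exists:
        // a multi-core CPU, or, with a compiler like HCC, the GPU.
        std::for_each(std::execution::par_unseq, x.begin(), x.end(),
                      [](float& v) { v = v * 2.0f + 1.0f; });
        return 0;
    }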
One thing I should mention here is called HIP, the Heterogeneous-compute Interface for Portability. The idea is to enable CUDA programmers to move to a common C++ programming environment: we provide a tool that converts your CUDA code to a standard notation, and then that notation can be processed either by NVCC, which is the NVIDIA compiler, or by our compiler, HCC; in a sense, it simplifies porting. This tool is evolving, and it is very easy to see what it does: on the left you have your original CUDA code, with a cudaMemcpy and so on, and on the right side you see a hipMemcpy, and the block of code that you had on the left is declared essentially the same way on the right. As you can see, the goal here was to simplify the porting of code. The tool is not perfect: once you have converted to HIP, you can compile, for example, with HCC and run the same code on either an NVIDIA or an AMD GPU; the tool converts, I would say, about eighty to ninety percent of your code, and there are some constructs that require manual intervention. So this is, in a sense, a tool to help portability.
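As an illustration of what that left/right slide is showing, here is my own example, not the slide's exact code; the hipify tool mostly renames runtime calls, and the kernel syntax carries over.

    #include <hip/hip_runtime.h>

    // CUDA on the left, HIP on the right:
    //   cudaMalloc(&d, bytes);           ->  hipMalloc(&d, bytes);
    //   cudaMemcpy(d, h, bytes,          ->  hipMemcpy(d, h, bytes,
    //       cudaMemcpyHostToDevice);             hipMemcpyHostToDevice);
    //   kernel<<<grid, block>>>(d, n);   ->  hipLaunchKernelGGL(kernel, grid,
    //                                            block, 0, 0, d, n);

    __global__ void add_one(float* d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // unchanged from CUDA
        if (i < n) d[i] += 1.0f;
    }

    int main() {
        const int n = 1024;
        const size_t bytes = n * sizeof(float);
        float h[1024] = {0};
        float* d = nullptr;
        hipMalloc((void**)&d, bytes);
        hipMemcpy(d, h, bytes, hipMemcpyHostToDevice);
        hipLaunchKernelGGL(add_one, dim3(4), dim3(256), 0, 0, d, n);
        hipMemcpy(h, d, bytes, hipMemcpyDeviceToHost);
        hipFree(d);
        return 0;
    }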
At this point I should also mention the gpucc compiler from Google, which is also capable of compiling CUDA directly, generating LLVM intermediate language, which you could then retarget towards both classes of GPUs.
For completeness: we have also been collaborating with Continuum Analytics to provide a high performance Python. The idea is to provide accelerated classes, with a number of features to support efficient execution, such as, as I said earlier, asynchronous kernel execution, and use of our shared memory (again, the shared memory is the fast memory on the GPU), even making that implicit; there is research based on this. So the idea is to give Python programmers a way of targeting our GPUs very efficiently.
Finally, I get to the additional layers of this stack. What I described so far were more of the software foundations: I talked about some of the libraries, and about some of the compiler and runtime work. On top of that, as I said, there is a number of machine learning and neural network framework supports. In addition to what I am listing here, things like atomistic simulation, especially in exascale computing, are also very important to us, as are areas like energy and signal processing, and also big data and graph analytics; I want to comment on that a little bit later. Just to say here again: all those libraries are open source as well.
We have the BLAS libraries, the FFT, a sparse matrix library, random number generators. I am not sure if this is familiar here, but the US Department of Energy has frameworks such as Kokkos, which enable you to do high performance computing while expressing those computations at a high level: it is a heavily templated library, where you write your code against those templates and, through template metaprogramming, the generated code actually ends up quite efficient across a number of platforms, because one of the goals is to enable high performance while preserving portability; and sometimes these two goals are at odds with each other.
I should also mention Charm++, one of the emerging PGAS-style languages I alluded to in the beginning, and HPX, one of those asynchronous execution frameworks; these are especially important in highly distributed environments. There is a lot right here; we have a menu of ongoing work, and some of it might be familiar to you; these are the areas of work.
Specifically related to neural networks: we do have support for Torch 7 and Caffe, which I will mention in passing, and we do have a machine learning library. I confess it is in early development right now; the goal is to have optimised convolution libraries for neural networks. What we have so far is already open source, but it is not at peak performance yet, and there is work going on to provide better, more intensive versions. For this audience specifically: we would definitely welcome your contributions to make those libraries better and more interesting.
Related to this, we are also working in the OpenVX community, an open standard framework with, specifically, graph optimiser features, which we also envision as a foundation towards supporting machine learning. Let me go through this part quickly: we have a number of open source libraries, and all of this information is publicly available on the developer website.
Basically, a number of our compute libraries are available in open source; I am tempted to say all of them, but I would say the vast majority are available. As I said, we are starting with the BLAS libraries and the sparse matrix operations. I mentioned earlier HIP, the compute interface for portability; the current version is already available out there, and, as I said, this is rapidly evolving; we have been releasing code quite rapidly, so as this evolves you can always get the latest one.
A previous project was Bolt C++, which is a template library, compatible with the STL, that enables you to basically target your code to the GPU. The next one is my favourite, so I should mention it: we have been talking about C and C++ and so on, but there is also a huge amount of work in Java, including our own work on Aparapi. A friend of mine developed it, and I agree the name is horrible: it stands for "A PARallel API".
It is a way in which you can express your data-parallel loops in Java, and then, at run time, we take the Java bytecodes and convert those into code for the GPU. You know the way Java works: it is a dynamically compiled language, so you first translate your Java code into bytecodes. With Aparapi you write your Java code normally, but inheriting from one special class; at run time, our runtime takes over and inspects the bytecode, and in case everything is okay, it will try to generate code that runs on the GPU (traditionally, what it generates under the covers is OpenCL); if that fails, it reverts back to multi-threaded execution, or even to sequential execution.
As I will show a little bit later, we have used this particular feature to accelerate other frameworks, in particular Hadoop and, lately, Spark, which are very important big data frameworks; most of them are built on top of the JVM, the Java virtual machine. By the way, that Supercomputing conference was the time when we introduced the notion of making our drivers open source, of making everything fully transparent; it was called the Boltzmann Initiative, covering the drivers for our video cards. I am not going to spend a lot of time on the details here.
But, as I said, we do have a fully complete set of BLAS routines and random number generators. I just want to highlight in particular our sparse matrix library, and I have a couple of points on it coming up later, because, as I said, I keep thinking about what kinds of things make code run faster on the GPU; it turns out the shared memory, the fact that the shared memory is able, in our case, to serve many memory accesses simultaneously, was very key to enabling high performance on sparse matrix operations.
Actually, the secret is not really in what was done in this library; it is conceptually simple. When we are processing a sparse matrix, we prefetch the row description, which we can do using the high memory bandwidth, into this highly ported memory, and from there we dispatch multiple threads; because that memory supports multiple simultaneous accesses, all of them can proceed in parallel. This is basically the key to the performance of this library, which works well on both AMD and NVIDIA GPUs.
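A simplified sketch of that prefetch-into-LDS idea for CSR sparse matrix-vector multiply; this is my reconstruction of the general technique in HIP-style C++, not the actual library kernel, and it assumes the rows given to a workgroup hold at most WG nonzeros (the streaming case).

    #include <hip/hip_runtime.h>

    #define WG 256  // workgroup size, and the number of nonzeros staged at once

    // y = A*x for one stretch of CSR rows handled by this workgroup.
    __global__ void spmv_csr_lds(const int* row_ptr, const int* col_idx,
                                 const float* val, const float* x,
                                 float* y, int first_row, int last_row) {
        // Stage a block of nonzeros in LDS ("shared" in CUDA terms): one wide,
        // coalesced burst from HBM instead of many scattered reads later.
        __shared__ float lds_val[WG];
        __shared__ int   lds_col[WG];

        int base = row_ptr[first_row];
        int nnz  = row_ptr[last_row] - base;   // assumed <= WG here
        for (int k = threadIdx.x; k < nnz; k += blockDim.x) {
            lds_val[k] = val[base + k];
            lds_col[k] = col_idx[base + k];
        }
        __syncthreads();

        // Each thread reduces one row out of LDS; because the LDS is highly
        // ported, these scattered per-row reads can proceed simultaneously.
        int row = first_row + threadIdx.x;
        if (row < last_row) {
            float sum = 0.0f;
            for (int k = row_ptr[row] - base; k < row_ptr[row + 1] - base; ++k)
                sum += lds_val[k] * x[lds_col[k]];
            y[row] = sum;
        }
    }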
For completeness, I should also mention our FFT library, and we have a few other frameworks here that are also supported; some of the people here might be using those libraries.
Now I want to talk a little more specifically about what we have for machine learning. In a sense, there are two threads to this work: we contribute to some of the leading frameworks, Torch and Caffe, that have been discussed right here, and some of this work is done in collaboration with universities, as I already mentioned, across several of the platforms; also, if you are into image processing, we have some libraries that target that.
So I will describe specifically some of the libraries related to machine learning that we have. There is an HCC Caffe version, which is the original Caffe ported using our C++ compiler, and we have run a number of models to test it; this is the place where you can download it, experiment with it, and use it to target our GPUs. Similarly with Torch: the same concept; the CUDA-dependent portions are enabled using our C++ compilers; again, you are able to download it and experiment as much as you want.
Okay, I am going to need to run fast, but I wanted to cover that because now I will make a change of topic. I appreciated what Nicolas said this morning: so far we have been thinking about what happens within a computer; however, there are also these huge data centres from Google, Facebook and so on, and, as was the point made earlier, you may have a cluster with six thousand nodes or more. So highly distributed computing is not a foreign area for us to be thinking about.
Before that, let me just introduce you to the notion of MapReduce, which was popularised by Google and is a highly interesting way of doing parallel computation. For those not familiar with it, let me describe a simple problem: suppose you want to estimate pi. A quick way is: you pick a number of random points inside the unit square and count how many of those fall inside the quarter circle. The idea behind MapReduce is that the system, the framework, does everything for you; it is a very easy way of doing parallel programming: you write the map method, you write the reduce method, and then the system decides how to scale and run those things in parallel. Basically, your map functions run; in this example, each one generates a random point and decides whether or not it is inside, outputting a zero or a one; and in this case my reduce function basically just counts the ones. The system does the rest for you.
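A compact single-machine rendering of that example (my sketch in C++; a real MapReduce framework would distribute the map calls and the reduction across nodes):

    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    // Map: draw one random point in the unit square; emit 1 if it falls
    // inside the quarter circle, 0 otherwise.
    int map_one(std::mt19937& gen) {
        std::uniform_real_distribution<double> u(0.0, 1.0);
        double x = u(gen), y = u(gen);
        return (x * x + y * y <= 1.0) ? 1 : 0;
    }

    int main() {
        const int n = 1'000'000;
        std::mt19937 gen(42);
        std::vector<int> hits(n);
        for (int& h : hits) h = map_one(gen);                        // "map"
        long total = std::accumulate(hits.begin(), hits.end(), 0L);  // "reduce"
        printf("pi ~ %f\n", 4.0 * total / n);  // area ratio times 4
        return 0;
    }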
I use this to motivate, for example, one very popular framework for big data: Hadoop, which is an open source implementation of the MapReduce framework. When you use it, it runs on top of several Java virtual machines; you do not have to create a file system yourself; it replicates your data on different nodes in case a node fails; it does it all for you. To program it, you write the map method and the reduce method. Now, how do you accelerate that on the GPU? Let me quickly show you how we did it. Our target was a cluster of APUs, that is, a number of those CPU-plus-GPU nodes connected via some kind of network, and we used what I call a two-level approach; I used to call this "map-reduce-reduce". The idea, if you think about it: each one of our nodes has a number of CPU cores and a number of GPU compute units, so I have a parallel system inside the node, and then I have a number of those nodes. So we break the problem: we do map-reduce, slicing the problem across the nodes, and then, within each node, I have parallelism, so I can run map-reduce inside the node; then I do a further reduce using the network.
That is the example we have right now; we actually published this work, in collaboration with Rice University, and the code is open source. We did get some nice speedups relative to prior frameworks in the big data space. This is going towards, beyond the single node, being able to run these programs using highly distributed frameworks, hopefully on hundreds or thousands of nodes. Hadoop itself is becoming, in a sense, the older guy; the newer development is Spark.
Again, in ongoing collaboration with Rice University, and using a similar approach: both Hadoop and Apache Spark run on top of the Java virtual machine; Hadoop is programmed in Java, Apache Spark is written in Scala, and in the end both produce bytecode. So we were able to use that Aparapi framework I showed you a little bit earlier. This is what we have been doing right now: we basically harvest the bytecode; in this case we recognise the map and reduce constructs in the Spark framework, convert those, in this case, to OpenCL, and run them on the GPU. This is also available in open source, and we believe it will also be quite interesting later on; there has been quite a number of developments, with people contributing to that effort.
Now, about our sparse library: I believe this is probably one of the fastest implementations of sparse matrix multiply right now. We have a paper that was presented at IWOCL last month, where we showed that, against a vendor-optimised library (one written with knowledge of the actual hardware tricks), this library, by making good use of the shared memory, the LDS as we call it, to stage portions of the matrices, runs about two to two and a half times faster; again, it depends on your sparsity pattern and so on.
This led to an interesting observation. There is work that was presented last month at ISCA, out of Bill Dally's group at Stanford, called EIE, the Efficient Inference Engine. One of the key factors there was applying a lot of compression to the neural network itself: you have many, many weights coming into a node, and for some of the weights you can find that they do not really affect the output, so you drop them; in the end, this in a sense converts the problem to a sparse matrix problem.
Well, that is helped by having a fast sparse matrix multiply, which we are in the process of developing right now, and I wanted to share the ideas we have so far. As I mentioned, our machine learning libraries are actually still evolving, in contrast to mature libraries such as cuDNN, and we see an opportunity right here: just by applying our sparse matrix work, we are able to catch up, so to speak. The paper presents some numbers; there are too many numbers for you to read here, but basically they describe runs using dense matrix libraries and sparse matrix libraries, and with the numbers that you get from our own libraries, we achieve parity with, for example, cuDNN. So these are some of the ideas that are going on, just a summary for you, to show the direction that things are heading.
And, as I said when showing the Radeon Open Compute slide, what we end up with overall, hopefully, will be "rocking", producing efficient code. That is the end of my talk; I will be glad to take any questions in the time that we have. Thank you very much.
Question: AMD GPUs so far only come with four gigabytes of high bandwidth memory, and for deep learning that feels like a real limitation. Why haven't you released larger-memory GPUs? It seems like twelve gigabytes is becoming essential for the models people are designing these days, at least, and the need is only growing.
Answer: Actually, a great question; I have to think about how much I can say (I have the same problem you had before). Well, there were some hardware limitations; remember, this was the very first HBM product, so don't expect the number to stay the same, that is as much as I can say. Though one of the devices that I showed you here goes to eight gigabytes; in a sense that is two GPUs together, and because of the connection that we have you are able to access the full eight gigabytes. And the older generation, Hawaii (I forgot what the product name is), that one is able to go to thirty-two gigabytes, but that is with GDDR5 memory.
Question: Can you hear me? You showed that AMD has produced high-quality GPUs, for example one with about fourteen teraflops, which is pretty good compared to six point one for the competition. So the question is: why is the market captured by NVIDIA? Is it because of price, or the software frameworks, or is there some other advantage? Which GPUs do most people actually use?

Answer: The lady from NVIDIA, I believe she has left, so no bodily harm will come to the researchers; this is all my guess, speaking frankly.
That is a good question. I would say you probably have to also look at how much support is available; there could be many, many reasons. Specifically, traditionally there has been an emphasis on the graphics side of the GPU, so that could have been a factor. I mean, I do not know why someone would buy one versus the other; at a given point in time it is definitely not price, or not necessarily: Polaris, which was announced last week, is the first GPU to break the two-hundred-dollar price point at that gaming performance. Sorry, I know this is not a satisfying answer; you would have to ask the marketing guys to actually get that answer.
Question: I have been using AMD GPUs for image processing and they perform excellently; but for learning models, I have looked around to see if I can run any of the deep learning frameworks, like TensorFlow or Torch, on an AMD GPU, and it turns out that we still cannot do that. So, are there any plans to fix this issue?
Answer: Yes. As I described, we do have versions of some of the frameworks, like Torch and Caffe; those are ports of ours, and for Caffe there are also some OpenCL ports available out there. And yes, there are plans; we are definitely working on that within AMD, and hopefully this talk also inspires some of you to contribute to that effort; this is definitely on our radar.
Question: We just saw that there are quite a few libraries being open sourced, specifically ones compatible with both your hardware and NVIDIA hardware; but the real question is: how long do we have to wait for fully featured drivers that are open source, for AMD or NVIDIA?

Answer: Actually, the AMD drivers are open source today; you can download them. The drivers are open source. (Thank you.)
As I said, "rocking": Radeon Open Compute, and ROCr is actually the open compute driver runtime. And, just for completeness, we are also enabling this in HCC: as I said, we published the instruction set of the GPU, so if you really, really care about performance, and have lots and lots of time, you can now write code in assembly; I actually hope someone will take the time to do that, because that is, for example, how cuDNN achieves its ninety percent efficiency. As I said, our versions of the libraries are growing in efficiency, and we measure that by what fraction of peak flops we get; we have made very good progress so far, but we have more to go.
Okay, so if there are no more questions, maybe we can thank the speaker again before the break.
