Note: this content has been automatically generated.
00:00:00
Okay — so let's start. This talk will be given by our speaker from AMD, who is an engineering fellow there, currently working on exascale computing systems and on the acceleration of cloud workloads and machine learning on heterogeneous machines running with CPUs and GPUs. With that, the talk is yours — thank you.
00:01:24
Thank you. Can everybody hear me okay? Alright. So, it's my pleasure to be here — thank you for the invitation. As was said, I want to discuss what has been happening in this space at AMD and try to give you some background and, to be honest, some of the efforts that we have going on right now — I am part of a research team, so I wanted to give you a feel for that.
00:01:48
These are some of the locations we have throughout the US. A lot of our actual research work is also funded by the US government, by the Department of Energy. We have a project, about fifty million dollars, looking at exascale computer systems — and that means reaching ten to the eighteenth floating point operations per second. Our power budget there is twenty megawatts. For exascale supercomputing systems we're looking at something on the order of sixty to a hundred thousand nodes, again using new-generation CPUs and GPUs from the years twenty twenty-one and twenty-two, and this requires advances in the architecture, in the programming models, in the power efficiency, and so on.
00:02:35
And actually, in addition to that research, this has further ramifications that apply to next-generation AMD products; as I said, it is affecting all of our software stacks, and I want to make that connection for you. I intend to take questions at the end, but if there's a pressing question, please don't hesitate to interrupt me and I will respond.
00:03:03
This is the outline for my talk. I'll try to focus specifically on what we have been doing in support of machine learning. We have an overall vision that what we are doing in our software is not necessarily only for machine learning but also for other markets, so we need solid foundations, and I'll try to describe the kinds of hardware features that we see right now. I am at this conference also with an eye towards identifying the right features to bring into hardware and to keep enhancing it, specifically because this is a field that is rapidly moving. There's a lot of innovation happening, so we need to strike that balance between committing things to hardware and really identifying the key building blocks, such that we can create efficient hardware and efficient software support.
00:03:59
As I said, I'll cover some of the foundations that we have, and also some of our approach to open source libraries — I'll comment a little on what we have been doing there — and specifically what we have to date in machine learning. Of course, this field is rapidly moving, and deep learning is moving with it; we're really playing catch-up in certain areas, but hopefully we'll continue making progress, and specifically in alignment with our vision, which is about being open and very collaborative.
00:04:36
On our overall vision: you saw the diagram I had on my previous slide, about moving from small systems — laptops, desktops — all the way up to supercomputers, as I mentioned. The idea is to have computers — and supercomputers — that have high performance and are heterogeneous. There are certain classes of code that map well to the GPU — for data-parallel computations the GPU can be quite efficient — versus cases where you have a lot of branching, where CPUs are sometimes better; I'll make the connection to machine learning a little bit later. So we have this vision of heterogeneous processing nodes that are able to execute CPU code when it makes sense to do so; in a sense, the feature we are after is being able to apply the best hardware organisation to the problem at hand.
00:05:37
My second bullet here mentions HSA, a programming model and a standard that was initiated by AMD but is not ours alone — Qualcomm is part of it, ARM is part of it — it is a foundation that's proposing a programming model to make programming the GPU easier. I hope to be able to explain that a little more; it's not the key topic of my talk here. But HSA can easily support this vision: you would be able to write C++ programs that are directly executable — using the HSA platform for the execution — on what could be, as I said earlier, either a CPU or a GPU. The Heterogeneous System Architecture doesn't really presuppose only those; it could be an FPGA or something of that sort in the future. And our approach of being open makes it perfectly feasible for someone — especially researchers — to contribute to that as well.
00:06:39
There is one item that I neglected to mention about AMD Research, and I want to make this point: as I said, we're spread throughout the US, and we do work in collaboration with many universities. A lot of our work is done via hosting postdocs and PhD interns. So this talk is also a pitch — hopefully we will find interesting postdoc candidates who are willing to work with us and extend the research in collaboration. The third bullet that I want to describe — and I should expand on this point a little — is the approach: the idea is to provide open source software that adheres to open standards, in the sense that it supports many of the key programming interfaces. That includes the ISA — like x86: you may like it or not, but it's a standard instruction set architecture — and also the hardware interfaces, such as PCIe; a new standard is also coming along nicely there. Hopefully all of this is supported by open source compilers, which also enables experimentation, even for domain-specific languages and new data types — for example, for machine learning.
00:08:00
As I said earlier, the philosophy permeating our products is the idea of adhering to open standards, and hopefully this stimulates both collaboration and innovation. We have based a lot of our hardware and software on this Heterogeneous System Architecture, which again benefits a number of industry members, and even the intellectual-property rules are set up to enable collaboration. By the way, the HSA Foundation is actually independent of AMD — it is its own foundation — and you can find the specification available on its website. I have listed a number of links here; hopefully anyone interested can follow them afterwards and get access to the extra information. More specifically, HSA supports newer open standards, like the upcoming C++14.
00:09:02
And actually, as I said, some of our work is also done in support of exascale computing — huge supercomputing systems — so also PGAS, the traditional partitioned global address space languages, and emerging standards, which of course have to be able to support memory models such as those of .NET and Java. It actually turns out that the memory semantics of HSA are formally described, so someone with an interest in that can study them and define new ways of connecting to the architecture.
00:09:38
Because, as I said, one of the key ideas behind HSA is this: the CPU and the GPU both have equal access to memory. So we can even think about passing a pointer from the CPU to the GPU, and the GPU is able to dereference that pointer and get to the data, without the previous need for doing memory copies and so on. The coherence is taken care of by the hardware, so hopefully this enables both programmability and, in some situations, actually better performance. One of the paths to performance would be dynamic scheduling — the interplay of choosing to execute pieces of code on either the CPU or the GPU. I list some of the programming models here, and I'll actually dive into some of those a little later, but one of them is our C++ compiler.
00:10:34
Also, as I said earlier, OpenMP is important in certain spaces, and it is definitely important in exascale high-performance computing; we do have a version of an open source compiler that supports it. All of that is supported by an LLVM back end, an optimising compiler. I'll mention this again later, but let me repeat it a little to make the point: if you use the LLVM back end, you benefit from much of the optimisation work that researchers worldwide are contributing in the open source compiler arena. You also get access to the back end — the part of the compiler that's actually doing the code generation — so you can look and see what kind of code is generated, and, if you like, it gives you ways to make your own modifications if you believe you can make it better. And as I said, Clang is actually the de facto standard front end for many compilers.
00:11:40
Our software is built on top of open source Linux, and all of our patches are upstream. A key piece here is the device driver that allows you to have a coherent view of virtual memory — what I mentioned earlier: when you pass a pointer between the CPU and the GPU, this is really a virtual memory address space; you don't have to worry about dealing with the physical memory at all. Most newer Linux distributions will have this enabled. As I said earlier — and I will be repeating this a lot because I want to drive the point home — all of our stack is open source, including what traditionally used to be very secret, which would be the kernel-mode drivers.
00:12:30
Traditionally, especially in GPUs, the ISA — the instruction set of the GPU — was not public, and there were good reasons for that, good business reasons. One would be compatibility: a hardware designer designing the next-generation Intel x86 CPU cannot simply invent a new instruction — well, they actually try to, and they do add them — but it is very complicated, because there are millions of PCs out there. If they decide not to implement, say, a given add instruction in x86, that breaks a lot of existing code, so you have a huge burden of backwards compatibility. The GPU designers, on the other hand, had a large degree of freedom, because traditionally the ISA was hidden behind the driver. You can make it whatever you like: games — traditionally GPUs were used by games — communicate their code to the driver, which is dynamically loaded; many times, sometimes at dispatch, the driver would compile the code to the native ISA. So the hardware designers had a lot of freedom; they would invent instructions all the time, and so on.
00:13:42
In this case, AMD has published the instruction set for its GPUs, called GCN, and again we now have a burden of backwards compatibility. We hope to navigate this well: when we change things we will let you know, and you may need to rebuild, which is not so bad after all. As I said, our OpenCL compiler, our C++ compiler, and also our support for Python are open source, and the back ends — our contributions — are also upstream. Now, in certain situations we ship compilers that do not belong to us: the Fortran ones, at the moment, are closed source, and we have some OpenMP compilers that are available for our chips but are not open source like everything else.
00:14:40
So, as I said, the key vision would be having open source software, and I keep mentioning this website specifically because, traditionally, the GPU is used for gaming, and there's a lot of hard work invested in creating high-quality pixels, creating textures, and so on. But on the compute side there are emerging applications of the GPU itself, and we have a website called GPUOpen.com where almost — I'm tempted to say almost a hundred percent of — our software is available, open source, through there. As I mentioned, we want to couple this with open standards, and hopefully this will enable us to basically increase the collaboration and increase the creativity. To be honest, all of us working together will do better than the limited resources we have trying to provide everything for everybody.
00:15:42
So that is one of the engines here. Actually, this is not one of my slides; it is from our team, describing what we call Radeon Open Compute. Our marketing guys defined this, and they used the idea of the "ROC" — quoted "rock". Radeon is the brand for our GPUs, and, as I said, the idea is having this open compute platform, with a name that provocatively plays on being "ready to rock". The idea is to create a path for high performance computing. What is interesting about our research is that it is aimed at the computing market; however, we want to be able to deliver value to our customers — to other classes of customers, from gaming to systems — as we have been doing in this space. And as I said earlier, this should be enough of a foundation for development, for experimentation, and for discovery. So that is the philosophy behind this Radeon Open Compute platform; we have the open source runtime, which is called ROCr, and you see it coming up in the middle of the diagram.
00:16:57
So, in a sense, I've described to you our overall vision for how we're navigating this space. The decisions are dictated by several pragmatic issues, and also by the need to collaborate, because, as I said, we are not the largest company in the world; it is beneficial to draw on the energy of the many researchers — such as you — who would be interested in collaborating and creating new ideas. Part of the support for this is the kind of hardware that we have, so let me mention this, which is an interesting development.
00:17:34
To really understand it, the notion of high bandwidth memory is key; as you know, GPUs are really hungry for a lot of memory accesses. AMD was the first company to commercialise — not to develop, but the first to commercialise — high bandwidth memory. Describing it simply, the idea is to place the memory layers on top of the GPU — to be honest, it's not always on top; sometimes they are next to each other. The point is that by staying in silicon — instead of having one packaged device and another package, with wires going from one to the other through some standard memory interface like DDR, where you are restricted — you are able to have thousands and thousands of wires. So, for example, you have a number of memory dies, memory layers, here, and then you can connect that data to the GPU using, say, four thousand wires, and you can get a lot more data through, very quickly.
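The wide-interface argument above is easy to put in numbers. A minimal sketch, with illustrative pin counts and per-pin rates (the 1 Gbit/s HBM pin speed and 7 Gbit/s GDDR5-style pin speed are assumptions for the comparison, not figures from the talk):

```python
def bus_bandwidth_gbs(width_bits, gbit_per_pin):
    """Peak bandwidth in GB/s for a memory interface of the given width."""
    return width_bits * gbit_per_pin / 8

# First-generation HBM style: four stacks of 1024 wires each, driven at
# a modest ~1 Gbit/s per pin -- slow pins, but thousands of them.
hbm = bus_bandwidth_gbs(4 * 1024, 1.0)   # 512.0 GB/s

# A GDDR5-style interface: only 256 wires, driven much faster.
gddr5 = bus_bandwidth_gbs(256, 7.0)      # 224.0 GB/s
```

So even with far slower pins, the sheer wire count of an in-silicon interface more than doubles the aggregate bandwidth in this sketch.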
00:18:55
This has been in an AMD product for a while already, and as I said, it provides much higher memory bandwidth; the approach has been adopted throughout the industry. There is now an upcoming second-generation high bandwidth memory standard — again, AMD is participating in it — and the key idea behind it was to provide both increased bandwidth and better bandwidth per watt. If you have to traverse long distances with your signal, there is an energy cost in terms of the power that you consume. Remember the numbers at the beginning of my talk: to achieve one exaflop of execution under twenty megawatts — if you do the math, you need something like a hundred-fold improvement in energy efficiency between today's systems and that target. It is that severe a constraint.
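The "do the math" step can be made explicit. A small sketch; the 0.5 GFLOP/s-per-watt baseline is an illustrative assumption about a contemporary system, not a measured figure from the talk:

```python
# Target: 1 exaflop/s within a 20 MW power budget.
target_flops = 1e18
power_budget_w = 20e6

required = target_flops / power_budget_w   # FLOP/s per watt
print(required / 1e9, "GFLOP/s per watt required")   # 50.0

# Assumed baseline efficiency of ~0.5 GFLOP/s per watt:
baseline = 0.5e9
print(required / baseline, "x improvement needed")   # 100.0
```

Fifty GFLOP/s per watt is the hard requirement the 20 MW budget imposes; the "hundred-fold" figure then follows directly from where the assumed baseline sits.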
00:19:51
The idea here is that the electrons do not have to travel very far. There is a colleague who keeps making this point: sometimes it costs you more energy to move a bit from here to here than to actually do the floating point operation on it, so energy efficiency is the key consideration. So, at this point of my talk, I've given you the idea of the foundation — the hardware structures that we see as beneficial for computation. And as I said, I'm also keeping a particular eye towards deep learning, which would also benefit from the increased memory bandwidth, from being able to address larger memory spaces, and so on.
00:20:36
For completeness, these are products that are available today. This was the first one — the initial product. It came out with four gigabytes of high bandwidth memory, about five hundred gigabytes per second of memory bandwidth, and several teraflops of single precision in a single processor. Then we have the dual-GPU approach, which in a sense couples two of these together; it has eight gigabytes of memory that can be accessed at high bandwidth, at around thirteen teraflops. And the third one here — this used to be called Polaris and was announced, I think, a week or so ago: eight gigabytes, two hundred and fifty-six gigabytes per second of memory bandwidth, a hundred and fifty watts — and this is, I think, the first one to break the two-hundred-dollar price point. We had a discussion earlier about how many gigaflops you are going to use to compute neural networks, and whether you should report that in your papers; hopefully, by making GPUs cheaper, we can make that available to everybody. So again, the theme here is providing the tools that will enable further experimentation.
00:21:53
I also wanted to highlight what we mentioned earlier — the so-called accelerated processing unit, the APU. Here you have, on the same silicon die — in this particular product; actually, the laptop I'm presenting on is one of those — four x86 CPU cores and a GPU that are sharing memory. They are equal citizens in terms of accessing the memory: you can see in this diagram that they are able to access the same memory, although of course the GPU usually has much more bandwidth available than the CPU, and again we have mechanisms to make sure execution proceeds properly.
00:22:37
The idea, as I said, is to be able to accommodate both the CPU computation and the data-parallel computation, and hopefully to switch seamlessly between them with the lowest possible overhead — being able to access memory from the CPU and the GPU while doing everything in user mode. As I said earlier, in previous implementations — for example with a discrete GPU — you need to invoke the device driver, and if you are software people you realise that doing so means making a kernel transition to privileged mode. In a sense you are disrupting the execution pipeline until your data gets there, so you try to minimise those transitions.
00:23:22
This is just to give you some notion of what is going on — it shows the benefit that you get from using high bandwidth memory. What you see here is, in essence, what happens as you increase the block size of the transfers: on the left you have transfers of memory from the host to the GPU, and on the right you have transfers going the other way around. As you can see, our previous-generation drivers had much slower memory bandwidth, while a direct-attached GPU was able to achieve higher memory bandwidth. Again, I'm trying to make the point that when you need to access much more data, this enhanced memory bandwidth is going to be crucial for achieving efficient, higher performance on the GPU.
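The block-size effect in that chart can be captured with a toy cost model: each transfer pays a fixed software/launch overhead plus the time to move the bytes over the link. The 10 GB/s link speed and 10 µs overhead are illustrative assumptions, not the measured numbers behind the slide:

```python
def effective_bandwidth(bytes_moved, peak_gbs, overhead_us):
    """Achieved GB/s when each transfer pays a fixed per-transfer overhead."""
    transfer_s = overhead_us * 1e-6 + bytes_moved / (peak_gbs * 1e9)
    return bytes_moved / transfer_s / 1e9

# Illustrative: 10 GB/s link, 10 microseconds of per-transfer overhead.
for size in (4 * 1024, 1024**2, 64 * 1024**2):
    print(size, round(effective_bandwidth(size, 10.0, 10.0), 2))
    # roughly 0.39, 9.13, 9.99 GB/s respectively
```

Small blocks are dominated by the fixed overhead and see a tiny fraction of the link's peak, which is exactly why the curves in the slide only approach peak bandwidth at large block sizes — and why cutting the overhead (user-mode access, no driver transition) matters so much.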
00:24:19
Similarly, this is a roofline chart showing how many single-precision gigaflops you are able to achieve — and by the way, this is for products that are available today; hopefully in the future we will be able to achieve even higher performance, but it gives you an idea of where you are right now. This is the roofline curve — the work comes from Berkeley. In essence, you vary the arithmetic intensity — the number of floating point operations per byte you are doing — and as you move along it, on the low-intensity side you are bottlenecked by memory bandwidth; then, as intensity grows, you become bottlenecked by compute capacity — you hit the saturation point, and that is where you see the flat peak of the roofline.
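The roofline just described reduces to a single `min` of two terms. A minimal sketch, using illustrative peak numbers (8000 GFLOP/s and 512 GB/s are assumptions chosen to resemble the class of GPU discussed, not the specific product on the slide):

```python
def roofline_gflops(intensity, peak_gflops, peak_bw_gbs):
    """Attainable GFLOP/s at a given arithmetic intensity (FLOPs per byte)."""
    return min(peak_gflops, peak_bw_gbs * intensity)

# Illustrative GPU: 8000 GFLOP/s peak compute, 512 GB/s memory bandwidth.
# The "ridge point" where the roof flattens is peak / bandwidth:
ridge = 8000 / 512                       # ~15.6 FLOPs per byte
print(roofline_gflops(1, 8000, 512))     # bandwidth-bound: 512
print(roofline_gflops(100, 8000, 512))   # compute-bound: 8000
```

Kernels left of the ridge point are limited by memory bandwidth — which is why high bandwidth memory moves the sloped part of the roof up — while kernels right of it are limited by raw compute.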
00:25:11
So anyway, I hope that so far I was able to give you an idea of what we see as the foundations in the hardware. Building on those foundations, I'll try to describe the software stack that we have. We are considering deep neural networks here, constructed from their mathematical aspects; but it is very important, especially to get performance, that you understand the mathematical aspects, how they are expressed in software, and then how that software is actually mapped onto the hardware. In this case I want to show you what we have that would help you achieve that task.
00:26:00
As I said earlier about the HSA foundation: the initial focus was actually on the APU — the name for having the CPU and GPU pieced together — and more recently the same technology has been extended to also support discrete GPUs; we have a lot of ongoing effort there. In that case we do have separate memories and separate dies, and we have high performance interconnects — and high performance hardware support — to enable transferring data between the CPU and GPU at the user level, hopefully with minimal software overhead. The second portion of the components here relates to the open compute platform. One of the key things is that the HSA runtime was specified such that the device driver you have today is fully HSA compliant. As I said, HSA is separate from AMD; it has a specification of what a compliant device is — which means supporting the memory model, supporting a number of instructions, and so on — and our stack is fully compliant.
00:27:16
It does, however, have a few other extensions; let me mention this one especially: peer-to-peer. In this case you are able to transfer memory between GPU and GPU directly. Actually, one interesting point I'd like to make about HSA: as I said, CPU and GPU are equal citizens. Traditionally we see the CPU as being the master and the GPU the slave — you write code in C and then you dispatch some code for execution on the GPU — but with HSA you are actually able to have the GPU dispatch code to run on the CPU, and vice versa. Of course, you need to be careful about what you are doing, because if you are running sixty-four threads apiece on the GPU, you can quickly overrun the CPU, and so on. Another key use of peer-to-peer is the case in which you have a huge network — a system with, say, hundreds of nodes — so you are able to transfer memory, again hopefully at user level, between those nodes. Specifically, when you are doing distributed SGD, when you get to the all-to-all communication phase where you are passing the parameters around, this is a very key feature for achieving performance, because that exchange can be one of your bottlenecks — especially, as other speakers described earlier, when you are doing synchronous SGD, where at some point everybody has to communicate their parameters to everybody.
00:28:47
Already, you know, some other compilers — Python [tooling] and LLVM — [build on this], and I'll mention a little bit later HIP; HIP is the idea for portability, in that it's an approach to enable CUDA developers to be able to take advantage of AMD GPUs. As I said earlier, the memory — the discrete memory on the GPUs — is also exposed to the programmer through the normal programming language constructs. And at the bottom here, just to show you how this is developing rapidly: last November, at the Supercomputing conference that was in Austin — maybe someone was there — you know, there were the preview releases, and further releases since; the point is that this is actually a very active, ongoing development. So don't be afraid to download the latest drivers, and don't be afraid to contribute to enhancing those drivers and so on — this is a very active field of development right now. And as I said, on the GPUOpen website most of our source code is available, either on GitHub or on Bitbucket.
00:30:04
As I said, in addition to supporting compute, we also have a number of libraries that are open source. But for this audience, let me also mention some of our compute libraries, which are the source for BLAS libraries, FFT, random number generators; and I'll touch a little bit on our sparse matrix operations. But again, I want to make sure that you get the notion that most of our software is actually now found in one place: from GPUOpen you can go and get access to most of the software right there.
00:30:38
And one of the key pieces of that software is what we call HCC, the heterogeneous C++ compiler; we call it single-source C++ for the GPU, and it actually has support for OpenMP extensions. I'm not sure how many of you are OpenMP programmers, but in OpenMP you write your C or Fortran or similar code, and you annotate specific loops with pragmas. So you can tell the compiler, "this is parallel," and then the compiler generates parallel code, for example for a multicore CPU. The latest OpenMP standard, 4.5, actually now has some accelerator annotations, with which you can specify, "I want this loop, or these particular variables, to be located on the CPU or on the GPU," and then the compiler is free to do whatever it finds best. It can even ignore all those annotations, in which case you get a perfectly fine sequential program.
00:31:44
The notion for us here, specifically, is that with OpenMP the description of [the loops] over your arrays tells the compiler, for example, "this is parallel" and "this is the data access pattern that we are performing within the loop," and then the compiler — if you're a compiler person, think affine expression analysis — can decide what is the best data layout. And then you get — as someone mentioned in an earlier talk, I assume there was a mention of all these strided accesses and so on — because, again, remember that on the GPU we have multiple threads running simultaneously that execute the same instruction. So when you do one memory load, it is highly beneficial to the GPU that every one of those threads accesses, for example, consecutive memory locations. We call those coalesced memory accesses, and you have the highest bandwidth in that case, because with one access you get two hundred and fifty-six bytes. It is also possible to have each one of those threads go to different memory locations, and your code works fine.
00:32:53
What the GPU does in that situation is this: suppose you get sixty-four different addresses; the memory controller takes over and turns it into sixty-four accesses. Meanwhile, the GPU scheduler, which is very quick at switching context — typically on a GPU you have thousands of ready execution threads — switches over to something else. However, the situation I mentioned earlier is the preferred one: we want either our compilers or our programmers to create code that accesses the memory in the preferred way. And you can already perceive a little bit of a tension here, because then you have to understand the math behind your networks, you need to understand your tensors, and then you also need to think about how your data is being laid out in memory so that the GPU accesses it consecutively, hopefully. As I said, we are making it such that we want our compiler to be able to assist you in those tasks.
00:34:02
So far I described OpenMP; C++, as of C++17, has the notion of the parallel STL, where you can annotate a loop and say "this is a parallel for_each" — I have an example coming up. And actually, we have been working with the C++ language standards [body] to propose this as extensions: again, the vision is that the programmer will be able to tell the compiler, "okay, this is parallel, this is the data that I'm touching," and then the compiler will be able to do the optimisations that I described for you.
00:34:37
And by the way, we're building on existing infrastructure right now: Clang and LLVM already have some features for these optimisations, and again, this is continuing to develop within the community. But if you are interested, you can also annotate your code by hand: for example, you can place some data into different memory categories. Let me say a word about the GPU organisation: in addition to the high-bandwidth memory I described, we have what we call local memory, or shared memory — the naming between [vendors] differs a little bit — but it is a very fast, high-bandwidth memory that is available to one execution unit, a "workgroup" using the OpenCL nomenclature. And that memory is super highly ported memory; it actually turns out to be one of the architectural features that enables us to do the sparse matrix operations really fast. And this is the kind of thing that I'm actually actively thinking about, to try to see what other things you can do that help as well.
00:35:47
Additionally, if you're a programmer who is interested in getting the last bit of performance, it is possible to annotate the source code to help the compiler in allocating data in this memory; this is an active discussion right now. We also have a community — we even have a number of blogs — in which people tell us whether we're doing a good job or not as we progress with this. As I said, right now the compiler is available: it supports single-source ISO C++, it is already open sourced, and as I said it supports the parallel STL in the language and also supports OpenMP. And as I said, in this particular case the compilation is single source: in a single file you can be writing your CPU code, then annotate a parallel for_each to have GPU code, and not really concern yourself with transferring data, because you can refer to global variables within your code without having all the [bookkeeping] you usually need to make sure the copies have finished by the time you need to access the data, and so on.
00:36:56
So, someone asked earlier: basically, the compiler generates the host code for the CPU and the code that goes to the GPU, and [bundles] both into the same ELF file. And again, HSA is foundational here: we have proposed ELF binary formats, so that you can use your standard tools to analyse the ELF if you care to; it at least allows motivated programmers to drill down and optimise. So you can even now control data placement — as someone mentioned in an earlier talk, sometimes when you are able to make your code fit into the GPU cache you really get very good performance, so if you can, you do that. And as I said, GPU kernels can be launched asynchronously.
00:37:44
With HSA, basically, we architected it so that [a dispatch] is just a C structure, in the sense that a C structure describes what work you want the GPU to do: in there you describe what is the address of the method — which function to execute — and what are the arguments. You fill in that structure and you place it in a queue in memory; the hardware is looking at the queue, and then it sees, "hey, there's work to do," and then it fetches the work to do. And by the way, different models of GPU have different capabilities — some have eight such queues looking for work to do. So, from the user's perspective, all you do is create these packets and dispatch the work to the GPU; everything else happens automatically.
00:38:31
So in a sense you can start work on the GPU [directly]. And again, the HSA setting itself provides a signalling mechanism, for the CPU, for example, to be informed — if you care — about the execution of a given kernel. And you can actually even create dependency chains, which would be a good thing for that: for example, remember those dataflow graphs — you can almost translate those directly. You create a dependency chain where you say, "after this one executes, I want the second one to be able to dispatch; and when the second terminates, the third one [runs]," and so on. It is possible for you to create such a structure and have the hardware take over, and it only lets you know when everything has executed. Again, in some environments this can be a significant enhancement.
00:39:21
And this is actually just a very quick example here to show — actually, let me see how I'm doing on time... yes, we need to speed up. Basically, the key point here is, as I said, that we are writing a parallel for_each construct, and, as I said a little bit earlier, I declared the bounds of my array, so I am able to specify [the computation] in perfectly standard C++ code that can be compiled and run on the GPU. And, as I said, I would first validate this code [for correctness] before I wanted to get more performance out of it.
00:40:02
The next thing I'll mention here is called HIP, the Heterogeneous-compute Interface for Portability. The idea is to enable CUDA programmers to move to a common C++ programming environment. And the idea is that we created a [tool]: as I just said, you convert your CUDA code to a standard notation, and then that notation can be processed either by HCC, which is our compiler, or by the standard CUDA compiler, like NVCC; so in a sense it is just simplified porting. This tool is evolving, and actually, as I just said, it's very easy to see: on the left I have the original CUDA code here — we have a cudaMemcpy and so on — and if you look at the right side, you see again a hipMemcpy, and then you have the block and [grid] declared on the right. So as you can see, the goal here was to simplify the porting of code. And this tool is not perfect: you use the converter, as I said, to convert to HIP, compile for example with HCC, and then run the same code on either [vendor's] GPUs. It converts, I would say, about eighty to ninety percent of your code; there are some constructs that would require manual [adaptation]. So this is, in a sense, the path to portability.
00:41:35
And at this point I should also mention the [gpucc] compiler from Google, which is also capable of compiling CUDA directly, generating LLVM intermediate language, which you could then also target towards both classes of GPUs, for completeness.
00:41:56
We have also been collaborating with [Continuum] Analytics to provide high-performance Python. So the idea is actually to be able to provide accelerated classes, and it provides a number of features to support efficient execution, such as, as I said earlier, kernel execution using our shared memory — again, the shared memory is the fast memory on the GPUs — and again making even [the data movement] implicit; there is research based on this. So the idea is to provide Python programmers with a way of targeting GPUs very efficiently.
00:42:33
And then, finally, I get to the additional [layers], according to the stack we have, and a number of solutions. What I described so far were more of the software foundations: I talked about some of the libraries and also some of the compiler and runtime. And then, as I just said, there is a number of machine learning and neural network framework supports. In addition to those, there is the setting of atomic simulation — especially as exascale computing is also very important to us — as well as the energy [sector] and signal processing that we're looking at, and also big data and graph [analytics], and I actually want to be able to comment on that a little bit coming up later.
00:43:16
Just to say here, again, all those libraries are open source as well: we have BLAS libraries, the FFT, we have a sparse matrix library, random number generators. I'm not sure whether this is familiar here, but the US Department of Energy has a framework called Kokkos, which enables you to do high-performance computing while expressing those computations at a high level. It is a heavily templated library: you write your code, and by evaluating those templates — using template metaprogramming — you do actually get the generated code to be quite efficient across a number of platforms, because one of its goals is enabling high performance while preserving portability, and sometimes these two goals are at odds with each other. To mention Charm++: it is one of the emerging [parallel] languages that I mentioned at the beginning, and HPX, which is one of those [asynchronous] execution frameworks. And these are actually especially important, again, in a highly distributed environment.
00:44:28
A lot are [listed] right here — we have many right now, and some might be familiar to you — in the various areas of work. And then, as I said, specifically related to neural networks: we do have support for Torch and Caffe, which I'll mention in passing, and we do have a machine learning library. And I confess, this is in very early development right now; the goal is to have optimised convolution libraries for neural networks. And again, I believe what we have so far is already open source, but it is not at the peak. And then there's actually work going on to provide better and more intensive versions of this; and, specifically for this audience, we would definitely welcome their contributions to make those libraries better and more interesting.
00:45:23
Related to this, we also have work going on in the [OpenVX] community — again, an open source framework, specifically with big graph optimiser features — also envisioned to be a foundation towards supporting machine learning.
00:45:42
And again, let me actually go through this maybe quickly. As you can see, this is a number of open source libraries; the thing about it is that all of this information is publicly available, also [linked] on the developer website. So basically, a number of compute libraries are available in open source — I'm tempted to say all of them, but I would say the vast majority are available. As I said, we're starting with the BLAS libraries and, again, the sparse matrix operations. And I mentioned earlier already HIP, the compute interface for portability; the current version that we have is already available out there, and as I said, this is rapidly evolving — as you saw before, we have been releasing code quite rapidly, so as this evolves you can always get the latest one. A previous project was Bolt C++, which is a template library, compatible with the STL, that would enable you to basically target your code to the GPUs.
00:46:48
The next one is my favourite, and I'll mention this [again] shortly. So far we've been talking about C and C++ and so on, but there's also a huge amount of work in Java, including our work on Aparapi. A friend of mine developed it; the name — it's a horrible name — comes from "a parallel API". It's a way in which you can express your data-parallel [loops] in Java, and then at runtime we take the Java bytecodes and convert those into code for the GPU. If you're familiar with the way Java works, it is a dynamically compiled language, so you first translate Java code to generate those bytecodes. And in Aparapi you write your Java code normally, but you [inherit] from one special class; at runtime, our runtime takes over and inspects the bytecode, and in case everything is okay, it will try to generate code that runs on the GPU — traditionally, what it is generating under the covers is OpenCL — and if it fails, it reverts back to either [thread-pool] execution or even to sequential execution. As I'll show in a little bit, we have used this particular feature to accelerate other frameworks — in particular Hadoop and Apache Spark, which are very, very important big data frameworks, and most of them are built on top of the JVM, the Java Virtual Machine.
00:48:24
And actually, again, at that Supercomputing [conference] — that was the time at which we introduced the notion of making our drivers open source, making everything fully transparent. It was called the Boltzmann Initiative, [covering] all the drivers for the video cards. And again — I'm not going to spend a lot of time on the details here — but, as I said, we do have fully [functional] compute: BLAS routines, random number generators.
00:48:58
I just want to highlight in particular the sparse matrix [library] — and I have a couple of follow-ups on it coming up later — because, as I said, I'm also thinking about what kinds of things make code run faster on a GPU. And it turns out the shared memory — the fact that the shared memory is able to support [many] memory accesses simultaneously — was very, very key to enabling high performance on sparse matrix [operations], and actually the secret is not [complicated]. What was done in this library is conceptually simple: when we are processing a sparse matrix, we actually prefetch it — fetching the whole description — which we can do using the high memory bandwidth. We prefetch it into this highly ported memory, and then from there we dispatch multiple threads. And because the memory supports multiple simultaneous accesses, all of them can proceed in parallel, and this is basically the key to the performance of this library, which works well on both NVIDIA and AMD.
00:50:01
For completeness, I should mention our FFT library, and we have a few other frameworks here that are also supported; again, some of the people here might be using those libraries.
00:50:13
Now I want to talk a little bit specifically about what we have going on for machine learning. There are, in a sense, two threads to this work. We do contribute to some of the leading frameworks — to Caffe and Torch, which have been discussed right here; some of this work is about adding GPU [acceleration], and there is also our collaboration on the multicore [side], as I already mentioned, on some of the platforms. Also, if you are into image processing, we do have some [libraries] — there were talks about that. So I'll describe specifically some of the libraries related to machine learning that we have: an HCC Caffe version, which is a [port] of the original Caffe utilising our C++ compiler, with a number of models to test. And again, this is the place where you could download and experiment with it, and again you could use it to target our GPUs. Similarly with Torch — a similar concept: again, the [CUDA]-dependent portions are enabled using our C++ compilers; again, you're able to download it and experiment with it as much as you can.
00:51:27
Okay — I know I need to run fast, but I wanted to do this because I'll now make a change [of topic] here. And actually — I appreciated what [Nicolas] said this morning — so far we have been thinking about what happens within a node, within a computer. However, there are also these huge data centres from Google, Facebook and so on, and, as was said, the [norm] there is that you have a cluster with six thousand nodes and beyond. So highly distributed computing is a key area for us to be thinking about. But before that, let me just introduce you to the notion of MapReduce, which was popularised by Google, and is a highly interesting way of doing parallel computation.
00:52:08
On the notion of MapReduce, for those that are not familiar, let me describe a simple problem here. Suppose you want to estimate pi; a quick way is, you pick a number of random points inside the unit square and count how many of those fall inside [the quarter circle]. And the idea behind MapReduce is that the system — the framework — does everything for you; it is a very easy way of doing parallel programming: you write the map method, you write the reduce method, and then the system decides how to scale and runs those things in parallel. Basically, you run your map functions, which basically generate [values] — like in this case, in this example here, I generate a random point and make a decision whether or not it is inside, and output a zero or a one. And in this case my reduce function is basically just counting the number of ones; the system does the rest for you.
00:53:03
And this is used to motivate, for example, one very popular framework for, again, big data: this is Hadoop, which is an open source implementation of the MapReduce framework. And when you use it, it deploys to several Java Virtual Machines; you have a [distributed] file system where the data is replicated and [partitioned] across different nodes — it does it all for you. To program it, you write the map method and the reduce method. Now, how do you accelerate that on the GPU? Let me actually quickly show you how we did it.
00:53:34
Our target was a cluster of APUs — which is a number of those single-node CPU-GPU units together, connected via some kind of network — and we are using what we call a two-level [scheme]; I used to call this "map-reduce-reduce", in a sense. The idea, if you think about it: each one of our nodes has a number of CPU cores and a number of GPU cores, so I have a parallel system internal [to the node], and then I have a number of those nodes. So what we did is that we actually break the problem [in two] when we do MapReduce: we slice the problem across the nodes, and then within the node I have parallelism, so I can actually run MapReduce within a node; and then I do a further reduce using the network. And that is the example that we have right now.
00:54:20
that that we have right now and
00:54:22
actually publish is work in
00:54:24
collaboration with the rice university
00:54:27
again the code is open source. And he
00:54:30
did have a sermon nice speed at a
00:54:33
national prior to members of the and
00:54:35
then another framework and the big deal
00:54:37
let's call model and we were able to
00:54:39
show some ice speed ups again. This is
00:54:42
going towards and it'll be on the
00:54:44
single note we want to be able to run
00:54:46
you know this programs using in all
00:54:49
highly this we did a framework working
00:54:52
on using hopefully hundreds of
00:54:53
thousands of nodes though how do itself
00:54:57
Hadoop itself, though, is becoming, in a sense, already the older guy; the newer development is Spark, and again we are working in collaboration with Rice University, using a similar [approach]. Actually, both Hadoop and Apache Spark run on top of the Java Virtual Machine: Hadoop is programmed in Java, and Apache Spark is [written largely in Scala], and both produce bytecode. So we were able to use that Aparapi framework I showed you a little bit earlier, and this is what we have been doing right now: we basically harvest the bytecode — in this case we recognise the map and reduce constructs in the Spark framework — and convert those, in this case, to OpenCL, and then run on the GPU. And this is also out there in open source, and actually we believe this will also be quite interesting later on; there has been quite a number of developments, with people feeding a number of [contributions] into that effort.
00:56:06
Now, on the sparse [matrix work]: actually, just to mention, I believe that this is probably one of the fastest implementations of sparse matrix[-vector] multiply right now. We actually do have a paper that was presented at a [workshop] last month, where we show that it [outperforms] the vendor-optimised library — we are [beating] the actual tricks that were done in the optimised [version]. And actually, as I just said, the key to this is basically making good use of the shared memory — the LDS, as we call it. In this particular library, for the matrices we tried, it is about two to two-and-a-half times faster; again, it depends on your sparsity and so on.
00:56:51
And actually this led to an interesting observation. This is work that was presented last month; it came out of Bill Dally's group at Stanford, where they defined EIE, an efficient inference engine for deep learning.
00:57:07
One of the key factors was applying heavy compression to the neural network itself: you have many, many weights coming into a node, and when one of the weights is small and you find it does not really affect the output, you can drop it. In essence this converts the problem into a sparse matrix problem.
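The pruning idea can be sketched in a few lines. This is a hedged illustration of magnitude pruning (drop small weights, store the survivors in CSR form), not the actual EIE pipeline, which additionally quantises and retrains the network:

```python
def prune_to_csr(weights, threshold):
    """Drop weights with |w| < threshold; return CSR (row_ptr, col_idx, vals)."""
    row_ptr, col_idx, vals = [0], [], []
    for row in weights:
        for j, w in enumerate(row):
            if abs(w) >= threshold:  # keep only significant weights
                col_idx.append(j)
                vals.append(w)
        row_ptr.append(len(vals))
    return row_ptr, col_idx, vals

# Two of the four weights survive: the dense layer is now a sparse-matrix problem.
print(prune_to_csr([[0.9, 0.01], [0.02, -0.8]], 0.1))  # prints ([0, 1, 2], [0, 1], [0.9, -0.8])
```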
00:57:28
So then the idea is to apply our fast sparse matrix multiply there. We are in the process of developing this right now, but I want to share the ideas and what we have achieved so far. As I showed here, our machine learning libraries are still evolving, in contrast to mature libraries such as cuDNN, and we actually see an opportunity right here: just by exploiting sparsity in this way, we are able to catch up.
00:58:07
The paper presents some numbers here; there are too many for you to read, but basically it describes the cases of using dense matrix libraries versus sparse matrix libraries, and when you switch to the numbers we get with our own libraries, we achieve parity with, for example, cuDNN.
00:58:32
These are some of the ideas that are in flight; I wanted to summarise them for you so you can see the direction things are going. And as I said when I was talking about Radeon Open Compute, what we hope to end up with overall is ROCm producing efficient code. That is the end of my talk, and I would be glad to take any questions in the time that we have. Thank you very much.
00:59:09
Questions? I have one question: the AMD GPUs you showed only come with four gigabytes of memory each, which is fairly small for this area. Why haven't you released larger-memory GPUs? Twelve gigabytes seems to be essential for deep learning these days, at least, and it is only going up. Great question.
01:00:00
I have to think about how much I can say; I have the same problem the earlier speaker had. Well, there were some hardware limitations at first: remember, this was the very first HBM product, so do not expect that number to stay the same. I can say this much, though: one of the devices I showed you here pairs two GPUs together, and because of the connection between them you are able to access all eight gigabytes.
01:00:30
And the older generation, I forgot the code name, Hawaii I believe, is able to go to thirty-two gigabytes, but that is GDDR5 memory.
01:00:39
We also have time for a couple more questions. Can you hear me? Yes. You showed that AMD has produced high-quality GPUs; for example, one of them reaches about fourteen teraflops, which is pretty good compared to the six point one of the competition. So the question is: why is the market captured by NVIDIA? Is it because of price, or the software frameworks, or is there some other advantage that makes NVIDIA GPUs the most used?
01:01:43
The lady from NVIDIA, I believe she left, otherwise she might do me bodily harm. So this is all my guess; I am not really speaking for AMD here.
01:01:56
That is a good question. I would say you probably also really have to look at how much software support there is; there could be many, many reasons. Specifically, there has traditionally been an emphasis on graphics in our GPUs, so that could have been a factor.
01:02:23
I mean, I do not know why someone would buy one versus the other. At a given point in time it is not necessarily price: the product that was announced last week was the first GPU to break the two-hundred-dollar price point for that class of gaming performance. Sorry, I know this is not an exact answer; you would have to ask the marketing guys at both companies to actually get that answer.
01:02:51
So, I have been using AMD GPUs for image processing and they perform excellently, but every few months I look around to see whether I can run any of the deep learning frameworks, like TensorFlow or Torch, on an AMD GPU, and it turns out that we still cannot do that. Are there any plans to fix this issue?
01:03:32
Yes, actually. As I described, we do have versions of some of the frameworks, like Torch and Caffe; for Caffe there are some OpenCL ports available out there. And yes, there are plans: we are definitely working on that within AMD, and hopefully a talk like this also inspires some of you to contribute to that effort. This is definitely on our radar.
01:03:59
We just saw that there are quite a few libraries being open-sourced, but specifically for hardware that is compatible with, or tied to, NVIDIA hardware. The real question is: how long do we have to wait for fully featured open-source drivers, from NVIDIA or AMD?
01:04:38
Actually, the AMD drivers are open source today; you can download them. The drivers are open source? Yes. Thank you. As I said, ROCm is Radeon Open Compute, and the ROCm driver is actually an open-source compute driver.
01:04:56
And just for completeness: we are also enabling HCC, and as I said, we published the instruction set of the GPU. So if you really, really care about performance and have lots and lots of time, you can now write code in assembly, and I actually hope someone will take the time to do that, because that is, for example, how they achieve over ninety percent efficiency in cuDNN.
01:05:22
As I said, our versions of the libraries are growing in efficiency, and we measure that by the fraction of peak FLOPS that we achieve. We have made very good progress so far, but we have more to go.
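The efficiency metric mentioned here is simple arithmetic: achieved FLOPS divided by the device's peak. A small sketch, where the 20 ms timing and the 8.6 TFLOPS peak are hypothetical illustration values, not AMD measurements:

```python
# Hedged sketch: efficiency as achieved FLOPS over peak FLOPS.
# All concrete numbers below are hypothetical illustration values.

def flops_efficiency(flop_count, seconds, peak_tflops):
    """Fraction of the device's peak throughput actually achieved."""
    achieved_tflops = flop_count / seconds / 1e12
    return achieved_tflops / peak_tflops

# A dense n x n matrix multiply performs about 2*n^3 floating-point operations.
n = 4096
eff = flops_efficiency(2 * n ** 3, 0.020, 8.6)  # pretend: 20 ms on an 8.6 TFLOPS part
print(f"{eff:.0%} of peak")  # prints: 80% of peak
```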
01:05:40
Okay, if there are no more questions, maybe we can thank the speaker again.
