Note: this content has been automatically generated.
Okay, so let's start; first a couple of words of introduction. This talk will be given by our speaker from AMD, a senior engineering fellow, who is currently working on exascale computing systems and the acceleration of cloud workloads, in particular on heterogeneous machines running with CPUs and GPUs. The talk will be about tools for machine learning with AMD products. Thank you.
Thank you. Can everybody hear me okay? Alright. So, it is my pleasure to be here; thank you for the invitation. As was said, I want to discuss what has happened in this space with AMD, try to give you some background and, to be honest, some of the support work that we have going on right now. I am part of AMD Research, so I wanted to give you a feel for this.
These are some of the locations we have throughout the US. A lot of our research work is also funded by the US government, by the Department of Energy. We actually have a project of about fifty million dollars to look at exascale computer systems. Exascale means reaching ten to the eighteenth floating point operations per second, and our power budget is around twenty megawatts. We are looking at exascale supercomputing systems that will be on the order of sixty to a hundred thousand nodes, using new-generation CPUs and GPUs, in the 2021-2022 timeframe, and this requires enhancements in the architecture, in the programming models, in the power efficiency, and so on.
In addition, that research has ramifications that apply to our next-generation products; as I said, it is affecting all of our software stacks, and I want to make that connection for you. I intend to take questions at the end, but if there is a pressing question, please do not hesitate to interrupt me, and I will try to respond.
This is the outline for my talk. I will specifically focus on what we have been doing in support of machine learning. We have an overall vision that the software we are doing is not necessarily only for machine learning but also for other markets, so it necessarily needs solid foundations, and I will describe to you the kinds of hardware features that we see right now.
At this conference I am also keeping an eye towards identifying what are the right features to bring into the hardware, to keep enhancing it, specifically because this is a field that is rapidly moving. There is a lot of innovation happening, so we need to play that dance between committing things to hardware and identifying the key building blocks, such that we can create efficient hardware and efficient software support.
As I said, I will cover some of the foundations that we have, some of our approach, and our open source libraries; I will comment a bit on what we have been doing in open source libraries, and I will get specific about what we have today for machine learning. Of course, this is a rapidly moving field, the terminology itself is moving, and we are really playing catch-up in certain areas, but hopefully we will continue improving, specifically in alignment with our vision, which is about being open and very collaborative.
On the overall vision: you saw the diagram I had in my previous slide about moving from small systems, laptops and desktops, up to even supercomputers, as mentioned in the past. The idea is to have computers, supercomputers, that have high performance and are heterogeneous. There are certain classes of code that map well to the GPU; for those, using the GPU for compute can be quite efficient, for example data-parallel computations; whereas when you have a lot of branching, sometimes the CPU does better. I will make that connection a little bit later.
We do have this vision of having a node of heterogeneous processing that is able to execute CPU code when it makes sense to do so; this is, in a sense, one of the features that we see: being able to apply the best hardware organisation to the problem at hand.
My second bullet here mentions HSA, the Heterogeneous System Architecture, which is a programming model and a standard that was actually initiated by AMD but is not controlled by us; ARM is part of it. It is a foundation that proposes a programming model to make programming the GPU easier. I hope to be able to explain that a little bit more, though it is not the key topic of my talk here.
The key proposition is that you would be able to write C++ programs that are directly executable, using the best platform for the execution; as I said earlier, that could be either the CPU or the GPU, and the Heterogeneous System Architecture does not really presuppose only those: it could be an FPGA or something of that sort in the future.
And, as I said, our approach of being open would make it perfectly feasible for someone, especially researchers, to continue building on that. There is one item that I neglected to mention about AMD Research, and I want to make this point here: as I said, we are spread out throughout the US, and a lot of our work is done in collaboration with many universities; we are also hosting postdocs and PhD interns.
So this talk is also a pitch: hopefully we will find interesting postdoc candidates willing to work with us and extend the research in collaboration.
As I said, the third bullet that I want to describe to you is the approach; let me beat on this point a little bit. The idea is to provide open source software anchored on open standards, in the sense that it supports many of the key programming interfaces: either through an ISA like x86 (you may like it or not, but it is a standard instruction set architecture), or through hardware interfaces such as PCIe; a new standard called CCIX is also coming along nicely. Hopefully this is also supported by open source compilers, which in turn enables development even for domain-specific languages and environments.
As I said earlier, the philosophy that is permeating our products is the idea of adhering to open standards, and hopefully this will stimulate both collaboration and innovation. We have based a lot of our hardware and software on this Heterogeneous System Architecture, which in turn benefits from a number of academic and industrial members, and even the intellectual property rules are such that they enable collaboration. The HSA Foundation is actually independent of AMD; you can find the specification available on its website.
I have listed a number of links here; hopefully someone interested can follow some of those links and get access to the extra information. More specifically, HSA supports open standards like C++14, and, as I said, some of our work is also done in support of exascale computing, which means huge supercomputing systems, so PGAS (partitioned global address space) languages and emerging standards; and of course it is able to support memory models such as those of .NET and Java. It actually turns out that the memory semantics of HSA are formally described, so someone with an interest in that can study it and define new ways of connecting to the architecture.
As I said, one of the key ideas behind HSA is this parity: the CPU and the GPU both have equal access to memory, so we can even think about passing a pointer from the CPU to the GPU, and the GPU is able to dereference that pointer and get to the data, without the previous need for doing explicit copies and so on. The coherence is taken care of by the hardware, so hopefully this enables programmability, and in some situations you can actually get better performance. One of the paths to performance would be to do dynamic scheduling, in which you choose to execute pieces of code on either the CPU or the GPU.
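To make that pointer-passing idea concrete, here is a minimal sketch in HIP-style C++. This is an illustration added for clarity, not code from the talk; it assumes an HSA-style system with shared virtual memory and uses a pinned host allocation so it also works on discrete GPUs.

    #include <hip/hip_runtime.h>
    #include <cstdio>

    // The GPU dereferences the very same pointer the CPU filled in.
    __global__ void scale(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main() {
        const int n = 1024;
        float* data = nullptr;
        // Host-visible, GPU-accessible allocation: no explicit copies needed.
        // On an HSA-compliant APU a plain malloc'd pointer would work as well.
        hipHostMalloc((void**)&data, n * sizeof(float), hipHostMallocDefault);
        for (int i = 0; i < n; ++i) data[i] = 1.0f;
        hipLaunchKernelGGL(scale, dim3(n / 256), dim3(256), 0, 0, data, n);
        hipDeviceSynchronize();            // wait for the GPU to finish
        printf("data[0] = %f\n", data[0]); // the CPU reads the result directly
        hipHostFree(data);
        return 0;
    }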
And that gets us to programming models. I will be diving into some of those a little bit later, but one of them is our C++ compiler. Also, as I said earlier, OpenMP is important in certain spaces, and it is definitely important in exascale high-performance computing; we do have a version of an open source compiler that supports it. All of that is supported by an LLVM-based, optimising compiler back end.
Specifically (I will mention this again later, but let me repeat it a little to try to make the point): if you use the LLVM back end, you actually benefit from much of the optimisation work that researchers worldwide are contributing in the open source compiler arena. You also get access to the back end, the part of the compiler that is actually doing the code generation, so you can look and see what kind of code is generated; and if you do not like it, this actually gives you ways to make your own modifications, or even contribute them back to make it better. And, as I said, Clang is actually the de facto standard front end for many different compilers.
Our kernel-level software is built on top of open source Linux, and all of our patches are upstream. An important piece of information here is the device driver that allows you to have a coherent view of virtual memory: as I mentioned earlier, when you pass a pointer between the CPU and the GPU, this is a single virtual memory address space, so you do not have to worry about dealing with the physical memory yourself. Most leading Linux distributions will be enabled this way. As I said earlier (and I will be repeating this a lot, because I want to drive this point home), all of our stack is open source, including what traditionally used to be very secret, which would be the kernel mode drivers.
Traditionally, especially in GPUs, the instruction set of the GPU was not public, and there were good reasons for that, good business reasons. One would be compatibility: a hardware designer designing the next-generation Intel x86 CPU cannot freely invent a new instruction. Well, people actually try to do that, but it is very complicated, because if you decide you do not want to keep, or want to change, a given x86 instruction, it breaks a lot of code that is out there, so you have a huge burden of backwards compatibility. GPU designers, on the other hand, traditionally had that degree of freedom, because the ISA was hidden behind the driver, so you could make it whatever you liked: games traditionally communicate with the driver, which, as it is loaded, and many times at dispatch, would compile the code to the native ISA; so the hardware designers had a lot of freedom to invent new instructions all the time. In our case, AMD has published the instruction set for the GPU, called GCN, and so we do take on a burden of backwards compatibility. How do we hope to navigate this? Well, when we change things, you will have to recompile your binaries, which is not so bad, after all.
As I said, we have an OpenCL compiler, a C++ compiler, and also support for Python; these are also open source. The back ends are our contributions, and they are also upstream. Now, in certain situations we support a few compilers which do not belong to us; unfortunately, at the moment those are closed source: we have some OpenMP compilers that are available for our chips, but they are not open source like everything else.
So, as I said, the key vision would be having open source software, and I keep mentioning this website specifically: traditionally, the GPU is used for gaming, and there is a lot of hardware invested into creating high quality pixels, creating textures and so on; but on the compute side there are now emerging applications of the GPU itself. We have a website called GPUOpen.com, and I am tempted to say almost a hundred percent of our software is available as open source through there.
As I mentioned, hopefully coupling this with open standards will enable us to basically increase the collaboration and increase the creativity; to be honest, all of us working together will be better than our limited resources trying to provide everything for everybody. So that is one of the intentions.
This is actually not one of my slides; it is from our teams describing what we call Radeon Open Compute. Our marketing guys finally coined "ROC": Radeon is the brand for our GPUs, and, as I said, the idea is to have this open compute platform, hence the name ROCm; there is a lot of "rocking" wordplay in this presentation. The idea is to create a path for high performance computing. As I said, what is funny about our research is that it is aimed at the exascale computing market; however, we actually want to be able to deliver value to our customers, to other classes of customers, from desktop systems up, as we have been doing in this space. And, as I said earlier, this should be a foundation for development, for experimentation and for discovery. So that is the philosophy behind ROCm, the Radeon Open Compute platform; we have the open source runtime, which is called ROCr, and you will see that coming up in the middle of the stack.
So, in a sense, I have described to you our overall vision and how we are navigating this space. The decisions are dictated by several pragmatic issues, and also by the need to collaborate, because, as I said, we are not the largest company in the world, and it is beneficial to draw on the energy of the many researchers, such as you, who would be interested in collaborating and creating new ideas. Part of the support for this is the kind of hardware that we have, and here I should mention an interesting memory development.
To understand it, the notion of high bandwidth memory is key. As you know, GPUs are really hungry for a lot of data access, and AMD was the first company to commercialise (not the first to develop, but the first to commercialise) high bandwidth memory. Describing it simply, the idea is that you place the memory layers on top of the GPU; to be honest, it is not always on top, sometimes they sit next to each other. By doing this in silicon, instead of having one packaged device and another packaged device with wires going from one to the other through some standard memory interface like DDR, where you are restricted in the number of wires, you are able to have thousands and thousands of wires. So, for example, you have a number of memory dies, or memory layers, here, and then you connect that to the GPU die using, say, four thousand wires, and you can get a lot more data moving very quickly.
This has been in our products for a while already, and, as I said, it provides much higher memory bandwidth; it has been adopted throughout the industry, and there is now a new, second-generation high bandwidth memory standard upcoming, which AMD is again participating in. The key idea behind this was to provide both an increase in bandwidth and better bandwidth per watt: if you have to traverse long distances with your signal, there is a lot of energy, in terms of power, that you consume.
Remember what I mentioned at the beginning of the talk: to achieve one exaflop of execution under twenty megawatts, if you do the math, you need something like a tenfold to hundredfold improvement in energy efficiency to be able to execute under such a constraint. The idea here is that the electrons do not have to travel very far. Bill Dally keeps making this point: it sometimes costs you more energy to transport the data from here to there than to actually do the floating point operation, so energy efficiency is the key consideration for this.
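As a quick sanity check on those numbers (my arithmetic, not a figure from the talk): an exaflop under twenty megawatts means

    10^18 FLOP/s / (20 x 10^6 W) = 5 x 10^10 FLOP/s per watt = 50 GFLOPS/W

whereas a 2016-era GPU delivering, say, 8 teraflops at around 250 watts sits near 32 GFLOPS/W at peak, and sustained application efficiency is far lower; hence the need for a large improvement in energy per operation.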
So, at this point of my talk, I have given you an idea of the foundations, the hardware structures, that we see as beneficial for computation. As I said, I am also keeping a particular eye towards deep learning, which will also benefit from the increased memory bandwidth, from the ability to address larger memory spaces, and so on.
For completeness, these are products that are available today. This was the first one, the initial product: it came out with four gigabytes of high bandwidth memory, 512 gigabytes per second of memory bandwidth, and multiple teraflops of single precision. Then we have the Pro Duo, which in a sense couples two of these together: it has eight gigabytes, four on each side, that can be accessed at high bandwidth, at around thirteen teraflops. And the third one here, which used to be called Polaris, was announced, I think, a week or so ago: eight gigabytes, 256 gigabytes per second of memory bandwidth, 150 watts, and, I think, the first one to break the two-hundred-dollar price point. We had a discussion here about how many gigaflops you are going to be using to compute neural networks, and whether you should discuss that in your papers; hopefully, by making GPUs cheaper, we can make that available to everybody. So, again, the theme here is to enable further experimentation on this.
I also wanted to highlight what we mentioned earlier: the APU, the accelerated processing unit. Here, on the same silicon die, in this particular product (actually, the laptop I am presenting on is one of those), you have four x86 CPU cores and a GPU that are sharing memory; they are equal citizens in terms of accessing the memory. You can see that they are able to access the same memory, and of course the GPU usually demands much more bandwidth than the CPU; we have mechanisms to make sure execution proceeds properly.
The idea would be to accommodate both the CPU computation and the data-parallel computation, and hopefully to seamlessly switch between the two with the lowest possible overhead: being able to access memory from both the CPU and the GPU, doing everything in user mode. As I said earlier, previous implementations, for example with a discrete GPU, need to invoke a device driver; if you are software folks, you realise that when you do that you are making a kernel transition to privileged mode, so in a sense you are stalling the execution pipeline until you get your data there; you try to minimise those transitions.
This is just to give you some notion of what is going on: it shows the bandwidth benefit that you get from using high bandwidth memory as you increase the block size of the transfers. On the left you have transfers of memory from the host to the GPU; on the right, going the other way. As you can see, our previous-generation drivers had much slower memory bandwidth, and a direct-attached GPU was able to achieve higher memory bandwidth. Again, I am trying to make the point that when you need to access much more data, this enhanced memory bandwidth is going to be crucial for achieving efficient and higher performance on the GPU.
Similarly, this is like a roofline chart, showing how much single-precision gigaflops you are able to achieve with products that are available today; hopefully in the future we will be able to achieve even higher performance, but this gives you an idea of where we are today. This is a roofline curve, as you see; this type of chart was popularised by the folks at Berkeley. In a sense, you chart the arithmetic intensity, the number of floating point operations per byte that you perform: on the left side you are bottlenecked by memory bandwidth, and as you keep going you hit your saturation point, where you are limited by compute capacity; that is where you see the flat peak of the roofline.
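For reference, the roofline bound being described can be written compactly; the numbers below are my own illustration using the Fiji-class figures quoted earlier, not the slide's:

    attainable FLOP/s = min( peak FLOP/s , arithmetic intensity (FLOP/byte) x bandwidth (byte/s) )

At 512 GB/s and roughly 8 TFLOPS of peak, the ridge point sits near 8x10^12 / 512x10^9, about 16 FLOP/byte; a kernel below that intensity is bandwidth-bound, which is exactly why high bandwidth memory lifts the whole left slope of the curve.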
Anyway, I hope that so far I was able to give you an idea of what we see as being the foundations in the hardware support. From those foundations, I will now try to describe the software stack that we have. When we are considering a deep neural network, we construct it from the mathematical aspects, but it is very important, especially to get performance, that you understand the mathematical aspects, how they are expressed in software, and then how that software is actually mapped onto the hardware. In this case, I want to show you what we have that would help you achieve that task.
As I said earlier, one more note about the HSA Foundation: the initial focus was actually on the APU; this is the name for having the CPU and the GPU pieced together. Recently, the same technology has been extended to also support discrete GPUs, and we have a lot of ongoing effort there: in that case we have separate dies, and we have high performance interconnects, plus high performance hardware support, to enable transferring data between the CPU and the GPU at the user level, hopefully with minimal software overhead.
The second portion of these components would be related to the Radeon Open Compute platform, which right now is the thing that keeps this going: the HSA runtime as it was originally specified, which means the device driver that you have today is fully HSA compliant. HSA is separate from us: it has a specific specification of what a compliant device is, which means supporting the memory model, supporting a number of instructions, and so on, and this is fully compliant. However, it has a few other extensions; one I should mention is peer-to-peer: in this case you are able to transfer memory between GPUs directly.
One interesting point I would like to make about HSA: as I said, CPU and GPU are equal citizens. Traditionally we see the CPU as being the master and the GPU as the slave: you write code in C and then you dispatch some code for execution on the GPU. With HSA, you are actually able to have the GPU dispatch code to run on the CPU, and vice versa; of course, you need to be careful about what you are doing, because with sixty-four GPU threads spawning work you can quickly overrun the CPU. The other key extension, peer-to-peer, matters in the case where you have a huge network: say you have a system with hundreds of nodes; you will then be able to transfer memory, again hopefully at user level, between those.
Specifically, when you are doing distributed SGD, when you get to the all-to-all communication phase where you are passing around parameters, this is a very key feature for achieving a lot of performance; that phase can be one of your bottlenecks, especially, as other speakers described earlier, when you are doing synchronous SGD, where at a given point everybody has to communicate their parameters to everybody.
I have already mentioned some of our compilers, the Python work, and LLVM; I will mention a little later HIP, which is a tool for portability, an approach to enable CUDA developers to take advantage of AMD GPUs. As I said earlier, the memory of the discrete GPUs is also exposed to the programmer through normal programming language constructs. At the bottom here, just to show you how rapidly this is developing: last November, at the Supercomputing conference in Austin (some of you may have been there), we did the preview release, and the second release came out this spring; the point is that this is very active, ongoing development.
So do not be afraid to download the latest drivers, and do not be afraid to contribute to enhancing those drivers and so on; this is a very active field of development right now. As I said, on the GPUOpen website most of our source code is available, either on GitHub or on Bitbucket. In addition to supporting compute, we also have graphics libraries that are open source; but for this audience I should mention our compute libraries: the BLAS libraries, FFT, random number generators, and I will also touch a little bit on our sparse matrix operations.
Again, I want to make sure you get the notion that most of our software is actually found in one place: from GPUOpen.com you can go and get access to most of the software right there. One of the key pieces of that software is what we call HCC, the Heterogeneous Compute Compiler; we call it single-source C++ for the GPU, and it also has support for OpenMP extensions. I am not sure how many of you are OpenMP programmers, but in OpenMP you write your C or Fortran (or similar) code as usual, and you annotate specific loops with pragmas.
You tell the compiler: this is parallel. The compiler then generates parallel code, for example for a multi-core CPU. The latest OpenMP standard, 4.5 right now, has accelerator directives, so you can specify that you want this loop, or these particular variables, to be located on the CPU or on the GPU, and the compiler is free to do whatever it finds best; it can even ignore all those hints, in which case you still get a perfectly fine sequential program (a small sketch of these directives follows).
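Here is a minimal sketch of what those directives look like; this is my example, not the speaker's slide, the loop and array names are illustrative, and a compiler without offload support will simply run it as a serial loop.

    #include <cstdio>

    int main() {
        const int n = 1 << 20;
        static float a[1 << 20], b[1 << 20];
        for (int i = 0; i < n; ++i) b[i] = (float)i;

        // OpenMP 4.x accelerator directives: map the arrays onto the device
        // and distribute the loop across its teams/threads. The compiler is
        // free to honour or ignore this; ignoring it leaves a valid program.
        #pragma omp target teams distribute parallel for map(to: b) map(from: a)
        for (int i = 0; i < n; ++i)
            a[i] = 2.0f * b[i];

        printf("a[10] = %f\n", a[10]);
        return 0;
    }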
The notion for us here, specifically, is that with the description of the parallel loop and of the arrays, you tell the compiler what is parallel and what data accesses you are performing within the loop; the compiler can then analyse the access expressions and decide what the best data layout is. Someone mentioned in an earlier talk all these strided accesses and so on; because, again, remember that on the GPU we have multiple threads running simultaneously that execute the same instruction. So, when you execute one memory load, it is highly beneficial to the GPU if every one of those threads accesses, for example, consecutive memory locations. We call those coalesced memory accesses: everything goes to the GPU in one transaction, and you get the highest bandwidth in that case, because in one access you get two hundred and fifty-six bytes. It is also possible to have each one of those threads go to different memory locations, and your code still works fine; what the GPU does in that situation is, supposing you get sixty-four different addresses, the memory controller takes over and breaks it into sixty-four accesses, and meanwhile the GPU scheduler, which is very quick at switching contexts (typically on a GPU you have thousands of ready execution threads), switches over to something else. However, the coalesced situation I mentioned first is the preferred one: we want either our compilers or the programmers to create code that accesses memory in the preferred way.
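Two tiny kernels to make the coalescing point concrete; this is my illustration in HIP-style C++, and the 256-byte figure matches 64 threads each loading a 4-byte float.

    #include <hip/hip_runtime.h>

    // Coalesced: thread i touches element i, so one 64-thread wavefront reads
    // 64 consecutive floats (256 bytes) in a single memory transaction.
    __global__ void copy_coalesced(float* out, const float* in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: thread i touches element i*stride, so the same wavefront hits
    // 64 scattered addresses; the memory controller splits this into many
    // transactions while the scheduler switches to other ready threads.
    __global__ void copy_strided(float* out, const float* in, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n) out[i * stride] = in[i * stride];
    }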
You can already perceive a little bit of a tension here: you have to understand the math behind your networks, you need to understand the tensor operations, and then you also need to think about how your data is laid out in memory, so that the GPU accesses it consecutively; hopefully, as I said, we are making it such that our compiler is able to assist you in those tasks.
I have described OpenMP so far; on the C++ side, C++17 has the notion of the parallel STL: you can annotate a loop and say this is a parallel for_each (I have an example coming up). We at AMD are working within the C++ language standards process to propose such extensions; again, the vision is that the programmer is able to tell the compiler "this is a parallel loop, and this is the data that I am touching", and then the compiler is able to do the optimisations that I described for you.
By the way, we are building on existing infrastructure: Clang already has some features for these optimisations, and this continues to be developed within the community. If you are interested, you can also annotate your code further; for example, you can place some data into different memory categories, which is germane to the GPU organisation.
In addition to the high bandwidth memory that I described, we have what we call local memory, or shared memory (the naming between AMD and NVIDIA differs a little bit here). It is a very fast, high bandwidth memory that is available to one execution block, to use the CUDA nomenclature. That memory is super highly ported, and it actually turns out to be one of the hardware capabilities that enables us to do the sparse matrix operations really fast; this is the kind of thing I am actively thinking about, trying to see what other features would help as well.
Additionally, if you are a programmer who is interested in getting the last bit of performance, it is possible to annotate, in the source code, where you want data allocated, to help the compiler; there is an active discussion on this in our community right now, and we even have a number of blogs in which people tell us whether or not we are doing a good job progressing with this. As I said, the compiler available right now supports single-source C++ and is already open sourced; it supports the parallel STL and language extensions, and also OpenMP.
In this particular case, the compilation is single source: you have a single file in which you can be writing your CPU code, and then you annotate a parallel for_each to have GPU code, and you do not really have to concern yourself with transferring data, because you can refer to global variables within your code without all the ritual of making sure the copies have finished by the time you need to access the data, and so on. Someone asked earlier: basically, the compiler generates the whole code, for the CPU and for the GPU, and the code that goes to the GPU is embedded into the same ELF file. In the HSA Foundation, by the way, we have proposed a binary format, so you can use your standard tools to analyse the ELF if you care to; at the very least, it allows motivated programmers to drill down and optimise. You can even control data proximity; as someone mentioned in another talk, sometimes when you are able to make your code fit into the GPU cache you get very good performance, so if you can do that, do it.
And, as I said, GPU kernels can be launched asynchronously. In HSA we basically architected it so that work is described by a plain C structure: in that structure you describe what work you want the GPU to do, what the address of the function to execute is, and what the arguments are. You fill in that structure and place it in a queue in memory; the hardware is watching the queue, sees that there is work to do, and then fetches the work. Different GPU models have different capabilities: some have eight such queues looking for work to do.
From the user's perspective, all you do is create these packets and dispatch the work to the GPU; everything else happens automatically, so in a sense you can start work on the GPU entirely from user mode. The HSA setting itself also provides a signalling mechanism, for the CPU, for example, to be informed, if you care, about the completion of a given kernel. You can even create dependency chains, which is a good thing: for example, remember those dataflow graphs; you can almost translate those directly. You create a dependency chain where you say: when this one finishes executing, I want the second one to be able to dispatch, and when the second terminates, the third one goes, and so on. It is possible for you to create such a structure and have the hardware take it over, only letting you know when everything has executed; in some environments this can be a significant enhancement.
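To give a flavour of that C structure, here is a heavily abridged sketch against the public HSA runtime API; this is my reconstruction, not code from the talk, error handling is omitted, and obtaining kernel_object and kernarg_address from a loaded code object is elided.

    #include <hsa/hsa.h>
    #include <cstring>

    // Assumes hsa_init() has run, 'agent' is a GPU agent, and kernel_object /
    // kernarg_address came from a finalized code object.
    void dispatch(hsa_agent_t agent, uint64_t kernel_object, void* kernarg_address) {
        hsa_queue_t* q;
        hsa_queue_create(agent, 4096, HSA_QUEUE_TYPE_SINGLE,
                         nullptr, nullptr, UINT32_MAX, UINT32_MAX, &q);

        // Claim a slot and fill in the dispatch packet: plain memory writes.
        uint64_t idx = hsa_queue_add_write_index_relaxed(q, 1);
        hsa_kernel_dispatch_packet_t* p =
            (hsa_kernel_dispatch_packet_t*)q->base_address + (idx & (q->size - 1));
        memset(p, 0, sizeof(*p));
        p->setup            = 1 << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
        p->workgroup_size_x = 256;  p->workgroup_size_y = 1;  p->workgroup_size_z = 1;
        p->grid_size_x      = 256 * 64;  p->grid_size_y = 1;  p->grid_size_z = 1;
        p->kernel_object    = kernel_object;    // what to run
        p->kernarg_address  = kernarg_address;  // its arguments

        hsa_signal_t done;                      // GPU decrements on completion
        hsa_signal_create(1, 0, nullptr, &done);
        p->completion_signal = done;

        // Publish the packet header last, then ring the doorbell: the hardware
        // itself picks up the work, with no kernel-mode transition involved.
        __atomic_store_n((uint16_t*)&p->header,
                         (uint16_t)(HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE),
                         __ATOMIC_RELEASE);
        hsa_signal_store_relaxed(q->doorbell_signal, idx);

        // Block until the kernel has executed.
        hsa_signal_wait_acquire(done, HSA_SIGNAL_CONDITION_LT, 1,
                                UINT64_MAX, HSA_WAIT_STATE_BLOCKED);
        hsa_signal_destroy(done);
        hsa_queue_destroy(q);
    }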
This is just a very quick example here (let me see how I am doing on time... yes, we need to speed up). Basically, the key point is this: as I said, we write a parallel for_each construct and, as I said a little earlier, I declared the bounds of my array, so I am able to specify, in perfectly standard C++ code, "go do that", and it will be compiled and run on the GPU. As I said, I would mostly first validate this code, and profile it if I wanted to get more performance out of it.
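The slide's code is not reproduced in this transcript, but a minimal standard-C++ version of the pattern being described looks like this; it is my sketch using the C++17 parallel algorithms, and HCC's own parallel for_each is analogous but AMD-specific.

    #include <algorithm>
    #include <execution>
    #include <vector>

    int main() {
        std::vector<float> x(1 << 20, 1.0f);
        // The execution policy declares the iterations independent, so the
        // implementation may run them on whatever parallel hardware exists:
        // a multi-core CPU, or, with a compiler like HCC, the GPU.
        std::for_each(std::execution::par_unseq, x.begin(), x.end(),
                      [](float& v) { v = v * 2.0f + 1.0f; });
        return 0;
    }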
One thing I should mention here is called HIP, the Heterogeneous-compute Interface for Portability. The idea is to enable CUDA programmers to move to a common C++ programming environment: we provide a tool that converts your CUDA code to a standard notation, and then that notation can be processed either by NVCC, which is the NVIDIA compiler, or by our compiler, HCC; in a sense, it simplifies porting. This tool is evolving, and it is very easy to see what it does: on the left you have your original CUDA code, with a cudaMemcpy and so on, and on the right side you see a hipMemcpy, and the block of code that you had on the left is declared essentially the same way on the right. As you can see, the goal here was to simplify the porting of code. The tool is not perfect: once you have converted to HIP, you can compile, for example, with HCC and run the same code on either an NVIDIA or an AMD GPU; the tool converts, I would say, about eighty to ninety percent of your code, and there are some constructs that require manual intervention. So this is, in a sense, a tool to help portability.
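As an illustration of what that left/right slide is showing, here is my own example, not the slide's exact code; the hipify tool mostly renames runtime calls, and the kernel syntax carries over.

    #include <hip/hip_runtime.h>

    // CUDA on the left, HIP on the right:
    //   cudaMalloc(&d, bytes);           ->  hipMalloc(&d, bytes);
    //   cudaMemcpy(d, h, bytes,          ->  hipMemcpy(d, h, bytes,
    //       cudaMemcpyHostToDevice);             hipMemcpyHostToDevice);
    //   kernel<<<grid, block>>>(d, n);   ->  hipLaunchKernelGGL(kernel, grid,
    //                                            block, 0, 0, d, n);

    __global__ void add_one(float* d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // unchanged from CUDA
        if (i < n) d[i] += 1.0f;
    }

    int main() {
        const int n = 1024;
        const size_t bytes = n * sizeof(float);
        float h[1024] = {0};
        float* d = nullptr;
        hipMalloc((void**)&d, bytes);
        hipMemcpy(d, h, bytes, hipMemcpyHostToDevice);
        hipLaunchKernelGGL(add_one, dim3(4), dim3(256), 0, 0, d, n);
        hipMemcpy(h, d, bytes, hipMemcpyDeviceToHost);
        hipFree(d);
        return 0;
    }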
At this point I should also mention the gpucc compiler from Google, which is also capable of compiling CUDA directly, generating LLVM intermediate language, which you could then retarget towards both classes of GPUs.
For completeness: we have also been collaborating with Continuum Analytics to provide a high performance Python. The idea is to provide accelerated classes, with a number of features to support efficient execution, such as, as I said earlier, asynchronous kernel execution, and use of our shared memory (again, the shared memory is the fast memory on the GPU), even making that implicit; there is research based on this. So the idea is to give Python programmers a way of targeting our GPUs very efficiently.
Finally, I get to the additional layers of this stack. What I described so far were more of the software foundations: I talked about some of the libraries, and about some of the compiler and runtime work. On top of that, as I said, there is a number of machine learning and neural network framework supports. In addition to what I am listing here, things like atomistic simulation, especially in exascale computing, are also very important to us, as are areas like energy and signal processing, and also big data and graph analytics; I want to comment on that a little bit later. Just to say here again: all those libraries are open source as well.
We have the BLAS libraries, the FFT, a sparse matrix library, random number generators. I am not sure if this is familiar here, but the US Department of Energy has frameworks such as Kokkos, which enable you to do high performance computing while expressing those computations at a high level: it is a heavily templated library, where you write your code against those templates and, through template metaprogramming, the generated code actually ends up quite efficient across a number of platforms, because one of the goals is to enable high performance while preserving portability; and sometimes these two goals are at odds with each other.
I should also mention Charm++, one of the emerging PGAS-style languages I alluded to in the beginning, and HPX, one of those asynchronous execution frameworks; these are especially important in highly distributed environments. There is a lot right here; we have a menu of ongoing work, and some of it might be familiar to you; these are the areas of work.
Specifically related to neural networks: we do have support for Torch 7 and Caffe, which I will mention in passing, and we do have a machine learning library. I confess it is in early development right now; the goal is to have optimised convolution libraries for neural networks. What we have so far is already open source, but it is not at peak performance yet, and there is work going on to provide better, more intensive versions. For this audience specifically: we would definitely welcome your contributions to make those libraries better and more interesting.
Related to this, we are also working in the OpenVX community, an open standard framework with, specifically, graph optimiser features, which we also envision as a foundation towards supporting machine learning. Let me go through this part quickly: we have a number of open source libraries, and all of this information is publicly available on the developer website.
Basically, a number of our compute libraries are available in open source; I am tempted to say all of them, but I would say the vast majority are available. As I said, we are starting with the BLAS libraries and the sparse matrix operations. I mentioned earlier HIP, the compute interface for portability; the current version is already available out there, and, as I said, this is rapidly evolving; we have been releasing code quite rapidly, so as this evolves you can always get the latest one.
A previous project was Bolt C++, which is a template library, compatible with the STL, that enables you to basically target your code to the GPU. The next one is my favourite, so I should mention it: we have been talking about C and C++ and so on, but there is also a huge amount of work in Java, including our own work on Aparapi. A friend of mine developed it, and I agree the name is horrible: it stands for "A PARallel API".
It is a way in which you can express your data-parallel loops in Java, and then, at run time, we take the Java bytecodes and convert those into code for the GPU. You know the way Java works: it is a dynamically compiled language, so you first translate your Java code into bytecodes. With Aparapi you write your Java code normally, but inheriting from one special class; at run time, our runtime takes over and inspects the bytecode, and in case everything is okay, it will try to generate code that runs on the GPU (traditionally, what it generates under the covers is OpenCL); if that fails, it reverts back to multi-threaded execution, or even to sequential execution.
As I will show a little bit later, we have used this particular feature to accelerate other frameworks, in particular Hadoop and, lately, Spark, which are very important big data frameworks; most of them are built on top of the JVM, the Java virtual machine. By the way, that Supercomputing conference was the time when we introduced the notion of making our drivers open source, of making everything fully transparent; it was called the Boltzmann Initiative, covering the drivers for our video cards. I am not going to spend a lot of time on the details here.
But, as I said, we do have a fully complete set of BLAS routines and random number generators. I just want to highlight in particular our sparse matrix library, and I have a couple of points on it coming up later, because, as I said, I keep thinking about what kinds of things make code run faster on the GPU; it turns out the shared memory, the fact that the shared memory is able, in our case, to serve many memory accesses simultaneously, was very key to enabling high performance on sparse matrix operations.
Actually, the secret is not really in what was done in this library; it is conceptually simple. When we are processing a sparse matrix, we prefetch the row description, which we can do using the high memory bandwidth, into this highly ported memory, and from there we dispatch multiple threads; because that memory supports multiple simultaneous accesses, all of them can proceed in parallel. This is basically the key to the performance of this library, which works well on both AMD and NVIDIA GPUs.
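A simplified sketch of that prefetch-into-LDS idea for CSR sparse matrix-vector multiply; this is my reconstruction of the general technique in HIP-style C++, not the actual library kernel, and it assumes the rows given to a workgroup hold at most WG nonzeros (the streaming case).

    #include <hip/hip_runtime.h>

    #define WG 256  // workgroup size, and the number of nonzeros staged at once

    // y = A*x for one stretch of CSR rows handled by this workgroup.
    __global__ void spmv_csr_lds(const int* row_ptr, const int* col_idx,
                                 const float* val, const float* x,
                                 float* y, int first_row, int last_row) {
        // Stage a block of nonzeros in LDS ("shared" in CUDA terms): one wide,
        // coalesced burst from HBM instead of many scattered reads later.
        __shared__ float lds_val[WG];
        __shared__ int   lds_col[WG];

        int base = row_ptr[first_row];
        int nnz  = row_ptr[last_row] - base;   // assumed <= WG here
        for (int k = threadIdx.x; k < nnz; k += blockDim.x) {
            lds_val[k] = val[base + k];
            lds_col[k] = col_idx[base + k];
        }
        __syncthreads();

        // Each thread reduces one row out of LDS; because the LDS is highly
        // ported, these scattered per-row reads can proceed simultaneously.
        int row = first_row + threadIdx.x;
        if (row < last_row) {
            float sum = 0.0f;
            for (int k = row_ptr[row] - base; k < row_ptr[row + 1] - base; ++k)
                sum += lds_val[k] * x[lds_col[k]];
            y[row] = sum;
        }
    }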
For completeness, I should also mention our FFT library, and we have a few other frameworks here that are also supported; some of the people here might be using those libraries.
Now I want to talk a little more specifically about what we have for machine learning. In a sense, there are two threads to this work: we contribute to some of the leading frameworks, Torch and Caffe, that have been discussed right here, and some of this work is done in collaboration with universities, as I already mentioned, across several of the platforms; also, if you are into image processing, we have some libraries that target that.
So I will describe specifically some of the libraries related to machine learning that we have. There is an HCC Caffe version, which is the original Caffe ported using our C++ compiler, and we have run a number of models to test it; this is the place where you can download it, experiment with it, and use it to target our GPUs. Similarly with Torch: the same concept; the CUDA-dependent portions are enabled using our C++ compilers; again, you are able to download it and experiment as much as you want.
Okay, I am going to need to run fast, but I wanted to cover that because now I will make a change of topic. I appreciated what Nicolas said this morning: so far we have been thinking about what happens within a computer; however, there are also these huge data centres from Google, Facebook and so on, and, as was the point made earlier, you may have a cluster with six thousand nodes or more. So highly distributed computing is not a foreign area for us to be thinking about.
Before that, let me just introduce you to the notion of MapReduce, which was popularised by Google and is a highly interesting way of doing parallel computation. For those not familiar with it, let me describe a simple problem: suppose you want to estimate pi. A quick way is: you pick a number of random points inside the unit square and count how many of those fall inside the quarter circle. The idea behind MapReduce is that the system, the framework, does everything for you; it is a very easy way of doing parallel programming: you write the map method, you write the reduce method, and then the system decides how to scale and run those things in parallel. Basically, your map functions run; in this example, each one generates a random point and decides whether or not it is inside, outputting a zero or a one; and in this case my reduce function basically just counts the ones. The system does the rest for you.
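A compact single-machine rendering of that example (my sketch in C++; a real MapReduce framework would distribute the map calls and the reduction across nodes):

    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    // Map: draw one random point in the unit square; emit 1 if it falls
    // inside the quarter circle, 0 otherwise.
    int map_one(std::mt19937& gen) {
        std::uniform_real_distribution<double> u(0.0, 1.0);
        double x = u(gen), y = u(gen);
        return (x * x + y * y <= 1.0) ? 1 : 0;
    }

    int main() {
        const int n = 1'000'000;
        std::mt19937 gen(42);
        std::vector<int> hits(n);
        for (int& h : hits) h = map_one(gen);                        // "map"
        long total = std::accumulate(hits.begin(), hits.end(), 0L);  // "reduce"
        printf("pi ~ %f\n", 4.0 * total / n);  // area ratio times 4
        return 0;
    }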
I use this to motivate, for example, one very popular framework for big data: Hadoop, which is an open source implementation of the MapReduce framework. When you use it, it runs on top of several Java virtual machines; you do not have to create a file system yourself; it replicates your data on different nodes in case a node fails; it does it all for you. To program it, you write the map method and the reduce method. Now, how do you accelerate that on the GPU? Let me quickly show you how we did it. Our target was a cluster of APUs, that is, a number of those CPU-plus-GPU nodes connected via some kind of network, and we used what I call a two-level approach; I used to call this "map-reduce-reduce". The idea, if you think about it: each one of our nodes has a number of CPU cores and a number of GPU compute units, so I have a parallel system inside the node, and then I have a number of those nodes. So we break the problem: we do map-reduce, slicing the problem across the nodes, and then, within each node, I have parallelism, so I can run map-reduce inside the node; then I do a further reduce using the network.
That is the example we have right now; we actually published this work, in collaboration with Rice University, and the code is open source. We did get some nice speedups relative to prior frameworks in the big data space. This is going towards, beyond the single node, being able to run these programs using highly distributed frameworks, hopefully on hundreds or thousands of nodes. Hadoop itself is becoming, in a sense, the older guy; the newer development is Spark.
Again, in ongoing collaboration with Rice University, and using a similar approach: both Hadoop and Apache Spark run on top of the Java virtual machine; Hadoop is programmed in Java, Apache Spark is written in Scala, and in the end both produce bytecode. So we were able to use that Aparapi framework I showed you a little bit earlier. This is what we have been doing right now: we basically harvest the bytecode; in this case we recognise the map and reduce constructs in the Spark framework, convert those, in this case, to OpenCL, and run them on the GPU. This is also available in open source, and we believe it will also be quite interesting later on; there has been quite a number of developments, with people contributing to that effort.
Now, about our sparse library: I believe this is probably one of the fastest implementations of sparse matrix multiply right now. We have a paper that was presented at IWOCL last month, where we showed that, against a vendor-optimised library (one written with knowledge of the actual hardware tricks), this library, by making good use of the shared memory, the LDS as we call it, to stage portions of the matrices, runs about two to two and a half times faster; again, it depends on your sparsity pattern and so on.
This led to an interesting observation. There is work that was presented last month at ISCA, out of Bill Dally's group at Stanford, called EIE, the Efficient Inference Engine. One of the key factors there was applying a lot of compression to the neural network itself: you have many, many weights coming into a node, and for some of the weights you can find that they do not really affect the output, so you drop them; in the end, this in a sense converts the problem to a sparse matrix problem.
Well, that is helped by having a fast sparse matrix multiply, which we are in the process of developing right now, and I wanted to share the ideas we have so far. As I mentioned, our machine learning libraries are actually still evolving, in contrast to mature libraries such as cuDNN, and we see an opportunity right here: just by applying our sparse matrix work, we are able to catch up, so to speak. The paper presents some numbers; there are too many numbers for you to read here, but basically they describe runs using dense matrix libraries and sparse matrix libraries, and with the numbers that you get from our own libraries, we achieve parity with, for example, cuDNN. So these are some of the ideas that are going on, just a summary for you, to show the direction that things are heading.
And, as I said when showing the Radeon Open Compute slide, what we end up with overall, hopefully, will be "rocking", producing efficient code. That is the end of my talk; I will be glad to take any questions in the time that we have. Thank you very much.
Question: AMD GPUs so far only come with four gigabytes of high bandwidth memory, and for deep learning that feels like a real limitation. Why haven't you released larger-memory GPUs? It seems like twelve gigabytes is becoming essential for the models people are designing these days, at least, and the need is only growing.
Answer: Actually, a great question; I have to think about how much I can say (I have the same problem you had before). Well, there were some hardware limitations; remember, this was the very first HBM product, so don't expect the number to stay the same, that is as much as I can say. Though one of the devices that I showed you here goes to eight gigabytes; in a sense that is two GPUs together, and because of the connection that we have you are able to access the full eight gigabytes. And the older generation, Hawaii (I forgot what the product name is), that one is able to go to thirty-two gigabytes, but that is with GDDR5 memory.
Question: Can you hear me? You showed that AMD has produced high-quality GPUs, for example one with about fourteen teraflops, which is pretty good compared to six point one for the competition. So the question is: why is the market captured by NVIDIA? Is it because of price, or the software frameworks, or is there some other advantage? Which GPUs do most people actually use?

Answer: The lady from NVIDIA, I believe she has left, so no bodily harm will come to the researchers; this is all my guess, speaking frankly.
That is a good question. I would say you probably have to also look at how much support is available; there could be many, many reasons. Specifically, traditionally there has been an emphasis on the graphics side of the GPU, so that could have been a factor. I mean, I do not know why someone would buy one versus the other; at a given point in time it is definitely not price, or not necessarily: Polaris, which was announced last week, is the first GPU to break the two-hundred-dollar price point at that gaming performance. Sorry, I know this is not a satisfying answer; you would have to ask the marketing guys to actually get that answer.
Question: I have been using AMD GPUs for image processing and they perform excellently; but for learning models, I have looked around to see if I can run any of the deep learning frameworks, like TensorFlow or Torch, on an AMD GPU, and it turns out that we still cannot do that. So, are there any plans to fix this issue?
Answer: Yes. As I described, we do have versions of some of the frameworks, like Torch and Caffe; those are ports of ours, and for Caffe there are also some OpenCL ports available out there. And yes, there are plans; we are definitely working on that within AMD, and hopefully this talk also inspires some of you to contribute to that effort; this is definitely on our radar.
Question: We just saw that there are quite a few libraries being open sourced, specifically ones compatible with both your hardware and NVIDIA hardware; but the real question is: how long do we have to wait for fully featured drivers that are open source, for AMD or NVIDIA?

Answer: Actually, the AMD drivers are open source today; you can download them. The drivers are open source. (Thank you.)
As I said, "rocking": Radeon Open Compute, and ROCr is actually the open compute driver runtime. And, just for completeness, we are also enabling this in HCC: as I said, we published the instruction set of the GPU, so if you really, really care about performance, and have lots and lots of time, you can now write code in assembly; I actually hope someone will take the time to do that, because that is, for example, how cuDNN achieves its ninety percent efficiency. As I said, our versions of the libraries are growing in efficiency, and we measure that by what fraction of peak flops we get; we have made very good progress so far, but we have more to go.
Okay, so if there are no more questions, maybe we can thank the speaker again before the break.
