Transcriptions

Note: this content has been automatically generated.
00:00:01
[Introductory remarks from the host; the automatic transcription of this passage is unintelligible.]
00:00:46
So, from our discussions, you are clearly very knowledgeable about the practicalities of this.
00:00:55
So from that perspective I am very grateful to have you here. Thanks again for joining us.
00:01:03
Thank you.
00:01:10
Okay, yeah. First off, I'd like to thank you, Andre, and the
00:01:13
rest of the team for organising this extremely relevant and timely event.
00:01:18
I think we've all learned quite a lot today, and
00:01:21
this is a very exciting technology to be understanding more about these days.
00:01:27
So, as was mentioned, I'm here to talk about the
00:01:31
infrastructure that's necessary to build these large language models that everyone has now heard about.
00:01:38
As we've heard in the earlier talks, these sorts
00:01:42
of technologies have incredible potential for revolutionising the way we do
00:01:46
AI, and they already power applications like translation, text generation,
00:01:52
search, and even other sequence modelling tasks such as
00:01:57
predicting time series, and even
00:02:00
other modalities that we did
00:02:05
not discuss today; for example, you can represent vision in the form of tokens that you can pass into these models.
00:02:12
So these architectures are fundamentally not restricted to just
00:02:16
natural language, so there's massive potential there.
00:02:19
But the compute requirements to train these models are presently already large
00:02:25
and constantly growing.
00:02:33
[Brief interruption due to an issue with the Zoom connection.]
00:02:41
Yes, thank you. Sorry for the interruption. Right.
00:02:51
So, as I was mentioning, the compute requirements
00:02:55
necessary to train these models presently are large and constantly growing.
00:03:00
You might have already read about the massive amounts of compute necessary to train
00:03:05
models like ChatGPT; you might have seen numbers in the order
00:03:09
of millions of dollars and months of training on very large clusters.
00:03:14
But what you might not have heard about so much is that
00:03:18
oftentimes the amount of compute and operational preparation necessary for these runs
00:03:24
is equivalent to the amount of time that you actually spend on the run itself.
00:03:29
So it's double in terms of what you might actually see as the numbers
00:03:35
reported in the media. And setting up these
00:03:39
runs is also operationally quite complicated. So
00:03:44
what we're thinking is that there has to be a better way to train these models without so much
00:03:50
overhead in terms of operational capabilities and software engineering expertise.
00:03:56
That's kind of what Cerebras is about. So, to put this in context:
00:04:10
if you look at this chart, what you see is the size of the model that
00:04:15
was popular in 2018 and was considered state of the art, which is BERT Base.
00:04:21
It was a hundred and ten million parameters, and it was considered to be a fairly large model
00:04:25
at that time. And what you also see on the chart
00:04:28
is GPT-3, a hundred and seventy-five billion parameter model,
00:04:32
and the time span between these two models is only two years. So
00:04:38
it's a thousand times more compute in just two years; that's three orders of magnitude
00:04:44
in just two years. That's the scaling that has happened in such a short span of time.
00:04:51
And the amount of effort that you need in order to be able to take advantage
00:04:56
of these models is quite significant. The typical way that you train these models is on
00:05:02
large distributed clusters with hundreds or thousands of GPUs. These clusters are difficult to set up,
00:05:09
and orchestrating training on these clusters requires large teams of engineers,
00:05:14
MLOps and ML expertise, and just plain babysitting. If anyone hasn't had a chance to read
00:05:20
some of the logs of training these large models that were open-sourced by, for example, Meta, you can read that
00:05:26
these clusters fail in the middle of your training,
00:05:29
they sometimes run into numerical issues, and the teams change
00:05:33
optimisers, they change hyperparameters on the fly; there is a lot of voodoo that goes on in
00:05:39
the process of training these, and in addition, setting all of this up is a task in and of itself.
00:05:47
Now, what you see on this particular chart is that as you throw additional
00:05:53
compute at the problem... So, what we see on this chart,
00:05:58
on the x-axis, is the number of GPUs that you're throwing at the problem, and on the y-axis is the speedup that you would expect to see.
00:06:04
The orange line indicates linear speedup; that is, if you add
00:06:08
one additional GPU, or double the amount of GPUs that you
00:06:11
throw at the problem, you would expect to see twice as much performance. But
00:06:14
that's not what you see in reality: the more compute you throw at the problem,
00:06:18
the more it produces diminishing returns for each additional
00:06:23
GPU.
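(Not from the talk: a minimal sketch of why adding GPUs gives diminishing returns, using an Amdahl-style model in which an assumed fixed fraction of each training step is spent on communication and does not parallelise. The 5% overhead figure is an assumption for illustration only.)

```python
# Illustrative only: a simple Amdahl-style model of sub-linear GPU scaling.
# The 5% non-parallelisable (communication) fraction is an assumed value,
# not a number from the talk.

def ideal_speedup(n_gpus: int) -> float:
    """Linear scaling: doubling the GPU count doubles the throughput."""
    return float(n_gpus)

def observed_speedup(n_gpus: int, comm_fraction: float = 0.05) -> float:
    """Amdahl's law: the communication/synchronisation share of each step
    does not parallelise, so returns diminish as GPUs are added."""
    return 1.0 / (comm_fraction + (1.0 - comm_fraction) / n_gpus)

for n in (1, 2, 4, 8, 16, 32, 64, 128):
    print(f"{n:>4} GPUs: ideal {ideal_speedup(n):6.1f}x, modelled {observed_speedup(n):6.1f}x")
```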
00:06:28
So at Cerebras we looked at this, and we asked: what could we do to fundamentally redesign
00:06:31
the way that we deal with this? So, Cerebras
00:06:38
was founded in 2016 with the goal of rethinking deep learning hardware from first principles.
00:06:43
The idea here was that the
00:06:48
GPU is what powers most deep learning architectures now, but the GPU was not created
00:06:55
from the ground up for deep learning; it was a happy coincidence that the
00:06:59
architecture of the GPU proved to be beneficial for deep learning.
00:07:04
So we asked ourselves: okay, what can we do if we
00:07:06
were to build an architecture from the ground up that works well for deep learning? So Cerebras is
00:07:11
three hundred and fifty plus engineers distributed across hardware, software and
00:07:15
machine learning, on three continents: North America, Asia and Europe.
00:07:21
And what we've actually been able to produce is
00:07:26
the Cerebras Wafer-Scale Engine. It's the largest chip
00:07:30
that was ever made: eight hundred and fifty thousand cores
00:07:34
on a single chip, with 2.6 trillion transistors
00:07:39
and forty gigabytes of memory that is accessible within one clock cycle from the compute.
00:07:45
Contrary to GPUs, where the memory sits a bit
00:07:49
further away from the compute, so that you have to pay some communication overhead,
00:07:54
on the Wafer-Scale Engine it's accessible within
00:07:57
one clock cycle. So with all of these cores integrated on a
00:08:02
single piece of silicon, we get some mind-boggling performance figures; the memory and fabric bandwidths are
00:08:08
really incredibly high in terms of what the chip can actually sustain. And
00:08:16
you wouldn't put a race car engine in an economy car chassis,
00:08:19
so the CS-2 is the system that powers the entire Wafer-Scale Engine.
00:08:24
It's designed from the ground up to power, cool and extract the maximal performance from the wafer,
00:08:29
and it contains hardware that allows for 1.2 terabits per second of communication to and from the system.
00:08:36
It consumes twenty kilowatts of power, so it's really a monster of a machine,
00:08:42
but it's all in a package that fits in a standard data centre rack.
00:08:48
And so, that's all good, and
00:08:52
this already offers quite a bit of compute, but
00:08:55
for the models that we train these days, we don't necessarily
00:09:00
have enough compute even on a single CS-2 to
00:09:04
train the large models that we want. So what we have recently
00:09:11
unveiled is the Wafer-Scale Cluster, which essentially is a cluster of
00:09:16
CS-2s that work together to provide massive amounts of compute
00:09:22
in a way that is transparent to the end user. So, staying true to the goal of Cerebras, the
00:09:29
Wafer-Scale Cluster essentially offers all of this massive compute out of
00:09:33
the box, in a way that's transparent to the end user.
00:09:37
And to take advantage of all of this hardware, all the user needs to do is essentially log in and
00:09:43
use a simple Python-based interface that specifies what you want to run and how
00:09:48
many CS-2s you want to take advantage of. All of the difficult work of distributing your workload,
00:09:54
handling the overhead in terms of communication and coordination between the different
00:09:59
wafers, all of this is completely hidden from you.
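(Not from the talk: a hypothetical sketch of what such a Python-based run specification could look like. The class, function and parameter names below are invented, not the actual Cerebras interface; the point is only that the user states what to run and on how many systems, and the distribution details stay hidden.)

```python
# Hypothetical sketch of a "describe the run, pick the number of systems"
# interface. The names (RunConfig, launch_training, num_systems, ...) are
# invented for illustration and are not the real Cerebras API.
from dataclasses import dataclass

@dataclass
class RunConfig:
    model_name: str      # e.g. a GPT-style model configuration
    num_systems: int     # how many CS-2s to run on
    dataset_path: str    # path to the tokenised training data
    max_steps: int       # how long to train

def launch_training(cfg: RunConfig) -> None:
    # In the real workflow the platform would shard the model, coordinate the
    # wafers and stream the data; here we only record the user's intent.
    print(f"Launching {cfg.model_name} on {cfg.num_systems} system(s) "
          f"for {cfg.max_steps} steps with data from {cfg.dataset_path}")

launch_training(RunConfig("gpt-style-1.3b", num_systems=4,
                          dataset_path="/data/tokens", max_steps=10_000))
```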
00:10:06
And in contrast to what we've seen earlier with distributed training
00:10:12
across clusters of GPUs, what we've seen with the Cerebras Wafer-Scale Cluster is that
00:10:18
putting together multiple CS-2s in the Wafer-Scale
00:10:21
Cluster, we see linear performance scaling. This means that
00:10:24
if you have a job that takes a hundred hours with a single CS-2, putting more CS-2s
00:10:30
on the job will reduce the time necessary to complete the job in a linear fashion.
00:10:38
You don't end up in the situation where you get diminishing
00:10:42
returns as you scale. These are actual compute numbers that we
00:10:46
tested, for models that range from two hundred and fifty million parameters all the
00:10:50
way to twenty-five billion parameters, and we see linear scaling.
00:10:54
And we are in the process of conducting tests with much larger clusters
00:10:59
as well, but initial results are quite promising in terms of what we're able to achieve.
00:11:08
Something else that is quite unique, that our
00:11:12
hardware offers, is the ability to work with long sequence lengths. So,
00:11:20
the ChatGPT interface that everyone
00:11:24
works with has a memory window of roughly four thousand tokens.
00:11:27
That means that if you generate the first token,
00:11:32
if you put some information in your first token, and then you generate
00:11:36
an additional three thousand nine hundred and ninety-nine tokens, and then you ask your model
00:11:40
at the next position to remember that very first token, it will not be within the context window.
00:11:46
So this means that your model is limited in terms of how much memory it can work with, and
00:11:52
with the CS-2 we can actually scale all the way up to fifty thousand
00:11:56
tokens. So, in all of the earlier talks today we heard about how
00:12:01
large language models have the problem of being able to
00:12:05
hallucinate; they have challenges, you know, retrieving factual information.
00:12:11
So another pattern in which you can use these
00:12:15
large language models, if you have an expanded context, is to
00:12:19
actually feed in factual information and use the underlying language model as a
00:12:23
sort of reasoning engine that reasons over those facts. This is currently not
00:12:27
really possible because of the context window limitations: you really want to use all of that
00:12:32
window for working with the data that you want to generate. But if you have an expanded context,
00:12:38
you can feed in additional information and then perform reasoning with the remaining amount of space that you have.
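(Not from the talk: a minimal sketch of the pattern just described, packing retrieved factual snippets into a fixed context window while reserving space for the model's reasoning. The helper names and token counts are assumptions for illustration.)

```python
# Sketch of the "feed facts in, reason with what is left" pattern.
# Token counting is approximated by whitespace splitting for illustration;
# a real system would use the model's own tokenizer.

def count_tokens(text: str) -> int:
    return len(text.split())

def build_prompt(question: str, facts: list[str],
                 context_window: int = 4000, reserved_for_answer: int = 500) -> str:
    """Pack as many supporting facts as fit, leaving room for the question
    and for the model's generated reasoning."""
    budget = context_window - reserved_for_answer - count_tokens(question)
    included = []
    for fact in facts:
        cost = count_tokens(fact)
        if cost > budget:
            break                      # the window is full
        included.append(fact)
        budget -= cost
    return "\n".join(included) + "\n\nQuestion: " + question

# With a 4,000-token window most of the budget goes to the facts themselves;
# a 50,000-token window lets you pack far more evidence before reasoning starts.
prompt = build_prompt("Summarise the key finding.", ["fact one ...", "fact two ..."])
```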
00:12:46
So I want to switch gears a little bit and talk a little bit about training from scratch.
00:12:53
We heard in the earlier talks about the different paradigms:
00:12:58
training, using pre-trained models, fine-tuning, using zero-shot and few-shot prompting, and so on.
00:13:04
And the current explosion of large language models has opened up a
00:13:09
lot of models that are already pre-trained, so using these models is an excellent fit,
00:13:14
provided those models fit the use cases that you want to work with. This means that if
00:13:20
you want to work with a language that is very commonly represented in these models, say English,
00:13:26
there are lots of open models that let you do that. But if you want to work with a
00:13:31
language that's not as commonly represented, so for example my mother
00:13:35
tongue is Malayalam, and there are a few million speakers of it, but
00:13:40
it's not a language that is commonly represented in these data
00:13:43
sets, so then you might be a little bit out of luck.
00:13:48
And similarly, you need models that are capable of the target task: we heard about
00:13:53
working with, for example, drug discovery, we heard about working with biologically
00:13:59
inspired data; for all of these things, some of the models are more capable than others at these tasks.
00:14:05
And if you have data that fits within those paradigms, you can actually take advantage of them.
00:14:11
But if not, you have to train from scratch, and this can be
00:14:18
achieved in as little time as half a day for a 1.3 billion parameter model,
00:14:22
going up to around fifty days for a twenty billion parameter
00:14:26
model. Now, the other paradigm is fine-tuning. So if
00:14:35
you have a model that works for your task out of the box, you can still actually expect to achieve
00:14:41
significant performance improvements if you fine-tune the model for your particular use cases.
00:14:47
So this improves performance, or you can use a much smaller
00:14:51
model which achieves the exact same performance. That means,
00:14:55
that has very strong implications if you want to use the model for inference.
00:15:00
A much larger model is dramatically more challenging to use
00:15:05
in terms of inference, as well as being more expensive,
00:15:08
and if you can use a much smaller model, you can get away
00:15:11
with using a lot less compute in order to be able to run inference.
00:15:16
In some tests it seemed that you can easily scale down
00:15:20
a model to a quarter of its size just using fine-tuning, and outperform a model
00:15:27
that is four times as large. And this is without even using other methods like distillation or
00:15:33
quantisation; this is really with just training. And yeah, so
00:15:41
here we see that, depending on the number of tokens that you have for your particular task,
00:15:46
you can train a six billion parameter model on ten billion tokens, which is usually far beyond
00:15:52
most fine-tuning data requirements, in as little as seventeen hours.
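(Not from the talk: a back-of-the-envelope sanity check on figures like these, using the common 6·N·D FLOPs approximation for transformer training; the sustained throughput assumed below is an illustrative value chosen to reproduce the quoted order of magnitude, not a Cerebras specification.)

```python
# Rough check on "6B parameters on 10B tokens in ~17 hours" using the common
# FLOPs ≈ 6 * params * tokens approximation. The sustained throughput figure
# is an illustrative assumption, not a quoted Cerebras number.

def training_hours(params: float, tokens: float, sustained_flops_per_s: float) -> float:
    total_flops = 6.0 * params * tokens
    return total_flops / sustained_flops_per_s / 3600.0

hours = training_hours(params=6e9, tokens=10e9, sustained_flops_per_s=6e15)
print(f"~{hours:.0f} hours")   # ~17 hours at the assumed sustained rate
```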
00:16:02
I'd like to finish up with a couple of use cases
00:16:05
that we've seen at Cerebras that have been quite successful.
00:16:09
So this is a particular
00:16:14
example of a collaboration that we've had with researchers at Argonne National Labs,
00:16:19
who wanted to study the COVID genome. They wanted to create genome-scale large language models
00:16:25
which use all of the available genome context in
00:16:29
order to understand the nature of COVID variants, how
00:16:35
virulent they might be, and what sort of mutations could potentially arise.
00:16:39
So they ran some experiments comparing a cluster of CS-2s
00:16:44
against a cluster of A100s, and we can see that for
00:16:49
the smaller model, that's two hundred and fifty million parameters,
00:16:52
a cluster of sixteen CS-2s is still faster than a cluster of five hundred A100s.
00:16:58
And what's even more interesting is that they wanted to expand the context window: they wanted to use
00:17:03
a window of ten thousand tokens to be able to better model the domain-specific task that they wanted to do,
00:17:10
and this was not something that they could actually handle, even though they have quite a lot of in-house expertise and massive
00:17:16
GPU clusters. So in this case we were able to scale all the way up
00:17:19
to twenty-five billion parameters. And the last use case that I would like to
00:17:28
show is another collaboration that we had, with GlaxoSmithKline,
00:17:33
where they wanted to use the CS-2 to create their own domain-specific language models, and here
00:17:40
what they observed is that they got a significant speedup using a
00:17:44
single CS-2 compared to a sixteen-node GPU cluster.
00:17:48
For them, what was really important was the ability to iterate quickly
00:17:52
and explore the problem space in a way that's not typically possible when you train these
00:17:56
models. If it takes you a week to go over your data set a single time,
00:18:00
there's a limit to the number of experiments that you can run, but if you can do it in half a
00:18:04
day, that completely changes the equation of how you interact.

Conference Program

The Evolution of Large Language Models that led to ChatGPT (Andre Freitas, Idiap)
Andre Freitas, Idiap Research Institute
March 10, 2023 · 8:34 a.m.
Understanding Transformers
James Henderson, Idiap Research Institute
March 10, 2023 · 8:46 a.m.
Inference using Large Language Models (Andre Freitas, Idiap)
Andre Freitas, Idiap Research Institute
March 10, 2023 · 9:19 a.m.
Q&A
Andre Freitas, Idiap Research Institute
March 10, 2023 · 9:45 a.m.
ChatGPT for Digital Marketing
Floris Keijser, N98 Digital Marketing
March 10, 2023 · 9:58 a.m.
Biomedical Inference & Large Language Models
Oskar Wysocki, University of Manchester
March 10, 2023 · 10:19 a.m.
Abstract Reasoning
Marco Valentino, Idiap Research Institute
March 10, 2023 · 10:38 a.m.
Q&A
Andre Freitas, Idiap Research Institute
March 10, 2023 · 10:58 a.m.
Round Table: Risks & Broader Societal Impact (Legal, Educational and Labor)
Lonneke van der Plas, Idiap Research Institute
March 10, 2023 · 2:07 p.m.
