Transcriptions

Note: this content has been automatically generated.
00:00:01
[Introductory remarks from the host; the automatic transcription of this passage is unintelligible.]
00:00:46
So, from our discussions, you are clearly very knowledgeable about the practicalities of this.
00:00:55
So from that perspective I am very grateful to have you here. Thanks again for joining us.
00:01:03
Thank you.
00:01:10
Okay, yeah. First off, I'd like to thank you, Andre, and the
00:01:13
rest of the team for organising this extremely relevant and timely event.
00:01:18
I think we've all learned quite a lot today, and
00:01:21
this is a very exciting technology to be understanding more about these days.
00:01:27
So, as was mentioned, I'm here to talk about the
00:01:31
infrastructure that's necessary to build these large language models that everyone has now heard about.
00:01:38
As we've heard in the earlier talks, these sorts
00:01:42
of technologies have incredible potential for revolutionising the way we do
00:01:46
AI, and they already power applications like translation, text generation,
00:01:52
search, and even other sequence modelling tasks such as
00:01:57
predicting time series, and even
00:02:00
other modalities that we did
00:02:05
not discuss today; for example, you can represent vision in the form of tokens that you can pass into these models.
00:02:12
So these architectures are fundamentally not restricted to just
00:02:16
natural language, so there's massive potential there.
00:02:19
But the compute requirements to train these models are presently already large
00:02:25
and constantly growing.
00:02:33
[Brief interruption due to an issue with the Zoom connection.]
00:02:41
Yes, thank you. Sorry for the interruption. Right.
00:02:51
So, as I was mentioning, the compute requirements
00:02:55
necessary to train these models presently are large and constantly growing.
00:03:00
You might have already read about the massive amounts of compute necessary to train
00:03:05
models like ChatGPT; you might have seen numbers in the order
00:03:09
of millions of dollars and months of training on very large clusters.
00:03:14
But what you might not have heard about so much is that
00:03:18
oftentimes the amount of compute and operational preparation necessary for these runs
00:03:24
is equivalent to the amount of time that you actually spend on the run itself.
00:03:29
So it's double in terms of what you might actually see as the numbers
00:03:35
reported in the media. And setting up these
00:03:39
runs is also operationally quite complicated. So
00:03:44
what we're thinking is that there has to be a better way to train these models without so much
00:03:50
overhead in terms of operational capabilities and software engineering expertise.
00:03:56
That's kind of what Cerebras is about. So, to put this in context:
00:04:10
if you look at this chart, what you see is the size of the model that
00:04:15
was popular in 2018 and was considered state of the art, which is BERT Base.
00:04:21
It was a hundred and ten million parameters, and it was considered to be a fairly large model
00:04:25
at that time. And what you also see on the chart
00:04:28
is GPT-3, a hundred and seventy-five billion parameter model,
00:04:32
and the time span between these two models is only two years. So
00:04:38
it's a thousand times more compute in just two years; that's three orders of magnitude
00:04:44
in just two years. That's the scaling that has happened in such a short span of time.
00:04:51
And the amount of effort that you need in order to be able to take advantage
00:04:56
of these models is quite significant. The typical way that you train these models is on
00:05:02
large distributed clusters with hundreds or thousands of GPUs. These clusters are difficult to set up,
00:05:09
and orchestrating training on these clusters requires large teams of engineers,
00:05:14
MLOps and ML expertise, and just plain babysitting. If anyone hasn't had a chance to read
00:05:20
some of the logs of training these large models that were open-sourced by, for example, Meta, you can read that
00:05:26
these clusters fail in the middle of your training,
00:05:29
they sometimes run into numerical issues, and the teams change
00:05:33
optimisers, they change hyperparameters on the fly; there is a lot of voodoo that goes on in
00:05:39
the process of training these, and in addition, setting all of this up is a task in and of itself.
00:05:47
Now, what you see on this particular chart is that as you throw additional
00:05:53
compute at the problem... So, what we see on this chart,
00:05:58
on the x-axis, is the number of GPUs that you're throwing at the problem, and on the y-axis is the speedup that you would expect to see.
00:06:04
The orange line indicates linear speedup; that is, if you add
00:06:08
one additional GPU, or double the amount of GPUs that you
00:06:11
throw at the problem, you would expect to see twice as much performance. But
00:06:14
that's not what you see in reality: the more compute you throw at the problem,
00:06:18
the more it produces diminishing returns for each additional
00:06:23
GPU.
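(Not from the talk: a minimal sketch of why adding GPUs gives diminishing returns, using an Amdahl-style model in which an assumed fixed fraction of each training step is spent on communication and does not parallelise. The 5% overhead figure is an assumption for illustration only.)

```python
# Illustrative only: a simple Amdahl-style model of sub-linear GPU scaling.
# The 5% non-parallelisable (communication) fraction is an assumed value,
# not a number from the talk.

def ideal_speedup(n_gpus: int) -> float:
    """Linear scaling: doubling the GPU count doubles the throughput."""
    return float(n_gpus)

def observed_speedup(n_gpus: int, comm_fraction: float = 0.05) -> float:
    """Amdahl's law: the communication/synchronisation share of each step
    does not parallelise, so returns diminish as GPUs are added."""
    return 1.0 / (comm_fraction + (1.0 - comm_fraction) / n_gpus)

for n in (1, 2, 4, 8, 16, 32, 64, 128):
    print(f"{n:>4} GPUs: ideal {ideal_speedup(n):6.1f}x, modelled {observed_speedup(n):6.1f}x")
```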
00:06:28
So at Cerebras we looked at this, and we asked: what could we do to fundamentally redesign
00:06:31
the way that we deal with this? So, Cerebras
00:06:38
was founded in 2016 with the goal of rethinking deep learning hardware from first principles.
00:06:43
The idea here was that the
00:06:48
GPU is what powers most deep learning architectures now, but the GPU was not created
00:06:55
from the ground up for deep learning; it was a happy coincidence that the
00:06:59
architecture of the GPU proved to be beneficial for deep learning.
00:07:04
So we asked ourselves: okay, what can we do if we
00:07:06
were to build an architecture from the ground up that works well for deep learning? So Cerebras is
00:07:11
three hundred and fifty plus engineers distributed across hardware, software and
00:07:15
machine learning, on three continents: North America, Asia and Europe.
00:07:21
And what we've actually been able to produce is
00:07:26
the Cerebras Wafer-Scale Engine. It's the largest chip
00:07:30
that was ever made: eight hundred and fifty thousand cores
00:07:34
on a single chip, with 2.6 trillion transistors
00:07:39
and forty gigabytes of memory that is accessible within one clock cycle from the compute.
00:07:45
Contrary to GPUs, where the memory sits a bit
00:07:49
further away from the compute, so that you have to pay some communication overhead,
00:07:54
on the Wafer-Scale Engine it's accessible within
00:07:57
one clock cycle. So with all of these cores integrated on a
00:08:02
single piece of silicon, we get some mind-boggling performance figures; the memory and fabric bandwidths are
00:08:08
really incredibly high in terms of what the chip can actually sustain. And
00:08:16
you wouldn't put a race car engine in an economy car chassis,
00:08:19
so the CS-2 is the system that powers the entire Wafer-Scale Engine.
00:08:24
It's designed from the ground up to power, cool and extract the maximal performance from the wafer,
00:08:29
and it contains hardware that allows for 1.2 terabits per second of communication to and from the system.
00:08:36
It consumes twenty kilowatts of power, so it's really a monster of a machine,
00:08:42
but it's all in a package that fits in a standard data centre rack.
00:08:48
And so, that's all good, and
00:08:52
this already offers quite a bit of compute, but
00:08:55
for the models that we train these days, we don't necessarily
00:09:00
have enough compute even on a single CS-2 to
00:09:04
train the large models that we want. So what we have recently
00:09:11
unveiled is the Wafer-Scale Cluster, which essentially is a cluster of
00:09:16
CS-2s that work together to provide massive amounts of compute
00:09:22
in a way that is transparent to the end user. So, staying true to the goal of Cerebras, the
00:09:29
Wafer-Scale Cluster essentially offers all of this massive compute out of
00:09:33
the box, in a way that's transparent to the end user.
00:09:37
And to take advantage of all of this hardware, all the user needs to do is essentially log in and
00:09:43
use a simple Python-based interface that specifies what you want to run and how
00:09:48
many CS-2s you want to take advantage of. All of the difficult work of distributing your workload,
00:09:54
handling the overhead in terms of communication and coordination between the different
00:09:59
wafers, all of this is completely hidden from you.
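(Not from the talk: a hypothetical sketch of what such a Python-based run specification could look like. The class, function and parameter names below are invented, not the actual Cerebras interface; the point is only that the user states what to run and on how many systems, and the distribution details stay hidden.)

```python
# Hypothetical sketch of a "describe the run, pick the number of systems"
# interface. The names (RunConfig, launch_training, num_systems, ...) are
# invented for illustration and are not the real Cerebras API.
from dataclasses import dataclass

@dataclass
class RunConfig:
    model_name: str      # e.g. a GPT-style model configuration
    num_systems: int     # how many CS-2s to run on
    dataset_path: str    # path to the tokenised training data
    max_steps: int       # how long to train

def launch_training(cfg: RunConfig) -> None:
    # In the real workflow the platform would shard the model, coordinate the
    # wafers and stream the data; here we only record the user's intent.
    print(f"Launching {cfg.model_name} on {cfg.num_systems} system(s) "
          f"for {cfg.max_steps} steps with data from {cfg.dataset_path}")

launch_training(RunConfig("gpt-style-1.3b", num_systems=4,
                          dataset_path="/data/tokens", max_steps=10_000))
```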
00:10:06
And in contrast to what we've seen earlier with distributed training
00:10:12
across clusters of GPUs, what we've seen with the Cerebras Wafer-Scale Cluster is that
00:10:18
putting together multiple CS-2s in the Wafer-Scale
00:10:21
Cluster, we see linear performance scaling. This means that
00:10:24
if you have a job that takes a hundred hours with a single CS-2, putting more CS-2s
00:10:30
on the job will reduce the time necessary to complete the job in a linear fashion.
00:10:38
You don't end up in the situation where you get diminishing
00:10:42
returns as you scale. These are actual compute numbers that we
00:10:46
tested, for models that range from two hundred and fifty million parameters all the
00:10:50
way to twenty-five billion parameters, and we see linear scaling.
00:10:54
And we are in the process of conducting tests with much larger clusters
00:10:59
as well, but initial results are quite promising in terms of what we're able to achieve.
00:11:08
Something else that is quite unique, that our
00:11:12
hardware offers, is the ability to work with long sequence lengths. So,
00:11:20
the ChatGPT interface that everyone
00:11:24
works with has a memory window of roughly four thousand tokens.
00:11:27
That means that if you generate the first token,
00:11:32
if you put some information in your first token, and then you generate
00:11:36
an additional three thousand nine hundred and ninety-nine tokens, and then you ask your model
00:11:40
at the next position to remember that very first token, it will not be within the context window.
00:11:46
So this means that your model is limited in terms of how much memory it can work with, and
00:11:52
with the CS-2 we can actually scale all the way up to fifty thousand
00:11:56
tokens. So, in all of the earlier talks today we heard about how
00:12:01
large language models have the problem of being able to
00:12:05
hallucinate; they have challenges, you know, retrieving factual information.
00:12:11
So another pattern in which you can use these
00:12:15
large language models, if you have an expanded context, is to
00:12:19
actually feed in factual information and use the underlying language model as a
00:12:23
sort of reasoning engine that reasons over those facts. This is currently not
00:12:27
really possible because of the context window limitations: you really want to use all of that
00:12:32
window for working with the data that you want to generate. But if you have an expanded context,
00:12:38
you can feed in additional information and then perform reasoning with the remaining amount of space that you have.
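(Not from the talk: a minimal sketch of the pattern just described, packing retrieved factual snippets into a fixed context window while reserving space for the model's reasoning. The helper names and token counts are assumptions for illustration.)

```python
# Sketch of the "feed facts in, reason with what is left" pattern.
# Token counting is approximated by whitespace splitting for illustration;
# a real system would use the model's own tokenizer.

def count_tokens(text: str) -> int:
    return len(text.split())

def build_prompt(question: str, facts: list[str],
                 context_window: int = 4000, reserved_for_answer: int = 500) -> str:
    """Pack as many supporting facts as fit, leaving room for the question
    and for the model's generated reasoning."""
    budget = context_window - reserved_for_answer - count_tokens(question)
    included = []
    for fact in facts:
        cost = count_tokens(fact)
        if cost > budget:
            break                      # the window is full
        included.append(fact)
        budget -= cost
    return "\n".join(included) + "\n\nQuestion: " + question

# With a 4,000-token window most of the budget goes to the facts themselves;
# a 50,000-token window lets you pack far more evidence before reasoning starts.
prompt = build_prompt("Summarise the key finding.", ["fact one ...", "fact two ..."])
```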
00:12:46
So I want to switch gears a little bit and talk a little bit about training from scratch.
00:12:53
We heard in the earlier talks about the different paradigms:
00:12:58
training, using pre-trained models, fine-tuning, using zero-shot and few-shot prompting, and so on.
00:13:04
And the current explosion of large language models has opened up a
00:13:09
lot of models that are already pre-trained, so using these models is an excellent fit,
00:13:14
provided those models fit the use cases that you want to work with. This means that if
00:13:20
you want to work with a language that is very commonly represented in these models, say English,
00:13:26
there are lots of open models that let you do that. But if you want to work with a
00:13:31
language that's not as commonly represented, so for example my mother
00:13:35
tongue is Malayalam, and there are a few million speakers of it, but
00:13:40
it's not a language that is commonly represented in these data
00:13:43
sets, so then you might be a little bit out of luck.
00:13:48
And similarly, you need models that are capable of the target task: we heard about
00:13:53
working with, for example, drug discovery, we heard about working with biologically
00:13:59
inspired data; for all of these things, some of the models are more capable than others at these tasks.
00:14:05
And if you have data that fits within those paradigms, you can actually take advantage of them.
00:14:11
But if not, you have to train from scratch, and this can be
00:14:18
achieved in as little time as half a day for a 1.3 billion parameter model,
00:14:22
going up to around fifty days for a twenty billion parameter
00:14:26
model. Now, the other paradigm is fine-tuning. So if
00:14:35
you have a model that works for your task out of the box, you can still actually expect to achieve
00:14:41
significant performance improvements if you fine-tune the model for your particular use cases.
00:14:47
So this improves performance, or you can use a much smaller
00:14:51
model which achieves the exact same performance. That means,
00:14:55
that has very strong implications if you want to use the model for inference.
00:15:00
A much larger model is dramatically more challenging to use
00:15:05
in terms of inference, as well as being more expensive,
00:15:08
and if you can use a much smaller model, you can get away
00:15:11
with using a lot less compute in order to be able to run inference.
00:15:16
In some tests it seemed that you can easily scale down
00:15:20
a model to a quarter of its size just using fine-tuning, and outperform a model
00:15:27
that is four times as large. And this is without even using other methods like distillation or
00:15:33
quantisation; this is really with just training. And yeah, so
00:15:41
here we see that, depending on the number of tokens that you have for your particular task,
00:15:46
you can train a six billion parameter model on ten billion tokens, which is usually far beyond
00:15:52
most fine-tuning data requirements, in as little as seventeen hours.
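(Not from the talk: a back-of-the-envelope sanity check on figures like these, using the common 6·N·D FLOPs approximation for transformer training; the sustained throughput assumed below is an illustrative value chosen to reproduce the quoted order of magnitude, not a Cerebras specification.)

```python
# Rough check on "6B parameters on 10B tokens in ~17 hours" using the common
# FLOPs ≈ 6 * params * tokens approximation. The sustained throughput figure
# is an illustrative assumption, not a quoted Cerebras number.

def training_hours(params: float, tokens: float, sustained_flops_per_s: float) -> float:
    total_flops = 6.0 * params * tokens
    return total_flops / sustained_flops_per_s / 3600.0

hours = training_hours(params=6e9, tokens=10e9, sustained_flops_per_s=6e15)
print(f"~{hours:.0f} hours")   # ~17 hours at the assumed sustained rate
```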
00:16:02
I'd like to finish up with a couple of use cases
00:16:05
that we've seen at Cerebras that have been quite successful.
00:16:09
So this is a particular
00:16:14
example of a collaboration that we've had with researchers at Argonne National Labs,
00:16:19
who wanted to study the COVID genome. They wanted to create genome-scale large language models
00:16:25
which use all of the available genome context in
00:16:29
order to understand the nature of COVID variants, how
00:16:35
virulent they might be, and what sort of mutations could potentially arise.
00:16:39
So they ran some experiments comparing a cluster of CS-2s
00:16:44
against a cluster of A100s, and we can see that for
00:16:49
the smaller model, that's two hundred and fifty million parameters,
00:16:52
a cluster of sixteen CS-2s is still faster than a cluster of five hundred A100s.
00:16:58
And what's even more interesting is that they wanted to expand the context window: they wanted to use
00:17:03
a window of ten thousand tokens to be able to better model the domain-specific task that they wanted to do,
00:17:10
and this was not something that they could actually handle, even though they have quite a lot of in-house expertise and massive
00:17:16
GPU clusters. So in this case we were able to scale all the way up
00:17:19
to twenty-five billion parameters. And the last use case that I would like to
00:17:28
show is another collaboration that we had, with GlaxoSmithKline,
00:17:33
where they wanted to use the CS-2 to create their own domain-specific language models, and here
00:17:40
what they observed is that they got a significant speedup using a
00:17:44
single CS-2 compared to a sixteen-node GPU cluster.
00:17:48
For them, what was really important was the ability to iterate quickly
00:17:52
and explore the problem space in a way that's not typically possible when you train these
00:17:56
models. If it takes you a week to go over your data set a single time,
00:18:00
there's a limit to the number of experiments that you can run, but if you can do it in half a
00:18:04
day, that completely changes the equation of how you interact.

Conference Program

The Evolution of Large Language Models that led to ChatGPT (Andre Freitas, Idiap)
Andre Freitas, Idiap Research Institute
March 10, 2023 · 8:34 a.m.
Understanding Transformers
James Henderson, Idiap Research Institute
March 10, 2023 · 8:46 a.m.
Inference using Large Language Models (Andre Freitas, Idiap)
Andre Freitas, Idiap Research Institute
March 10, 2023 · 9:19 a.m.
Q&A
Andre Freitas, Idiap Research Institute
March 10, 2023 · 9:45 a.m.
ChatGPT for Digital Marketing
Floris Keijser, N98 Digital Marketing
March 10, 2023 · 9:58 a.m.
Biomedical Inference & Large Language Models
Oskar Wysocki, University of Manchester
March 10, 2023 · 10:19 a.m.
Abstract Reasoning
Marco Valentino, Idiap Research Institute
March 10, 2023 · 10:38 a.m.
Q&A
Andre Freitas, Idiap Research Institute
March 10, 2023 · 10:58 a.m.
Round Table: Risks & Broader Societal Impact (Legal, Educational and Labor)
Lonneke van der Plas, Idiap Research Institute
March 10, 2023 · 2:07 p.m.
