Transcriptions

Note: this content has been automatically generated.
00:00:01
Okay, thanks, Andre. I'm the head of the Natural Language Understanding group, and it's true, I spend all my time thinking about transformers. So this is the technical part: we'll start with attention, as in "Attention Is All You Need", then go on to transformers and a bit about pretraining transformers. Some of the slides are taken from the Stanford NLP course, which is a very good course if you're interested in the technical details.
00:00:38
Okay, attention. The basic problem with text is that a text can be very long, and you need to somehow be able to access all the information that's in the text. We can't just compress a whole text into a single vector and then condition on that vector, because that's just too much information. But luckily we don't normally want to look at the whole text: we want to look at some little bit for one decision and some other part for another decision. So we can solve this problem by having every part of the text get a different vector, so that the number of vectors grows with the size of the text, and then for a given question you look at one vector to answer one question and another vector to answer another question. So you need to have this alignment between what you want to know at the moment and which part of the text you want to look at, and that alignment is what attention gets you: a learned soft alignment between what you want to know and what part of the text you want to look at.
00:02:03
So it looks basically like this. The different parts of your text, the words, end up as vectors, and then you have some vector representing your current state, what you need to look at. You compute a similarity, just checking "is this the one I want to look at?", and you get a score. You take all those scores and put them through a normalised exponential, a softmax, to give you a distribution over these vectors. This distribution tells you what you want to look at, what you want to pay attention to. Now you do a weighted average: you take these vectors, multiply them by these weights and sum them together, and you get a vector which in this case is basically a copy of this one, or if it was a different state I would get a copy of a different vector, but it can also be a kind of smoothed version of multiple vectors that are all summed together. You get your resulting vector, and then you can just condition on that. So you are conditioning on an individual vector, but it's a vector that's specific to the question you want to ask at that moment.
00:03:26
And so, just to give you a bit of the details, the basic idea is that you have these vectors you want to look at, so it's a set of vectors, and you have some state, and you compute a score, which is a dot product: you take the state and multiply it by a matrix to get a query vector; you take the individual vector and multiply it by another matrix to get the key vector; the dot product gives you a score; softmax gives you a normalised weight; and you then take a weighted average of the vectors, each mapped into a value vector. So it's query, key, value, weighted average, and that gives you the result of the attention function. So with attention you take the set and you end up with a function from vectors to vectors. That's the attention function, and everything is based on that.
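To make that query-key-value recipe concrete, here is a minimal sketch in NumPy; the array shapes and the names W_q, W_k, W_v are illustrative assumptions, not the exact parameterisation of any particular model.

    import numpy as np

    def softmax(x):
        # numerically stable softmax over the last axis
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    def attention(state, X, W_q, W_k, W_v):
        """Single-query attention over a set of vectors X (one row per token)."""
        q = state @ W_q            # query vector for the current state
        K = X @ W_k                # one key vector per token
        V = X @ W_v                # one value vector per token
        scores = K @ q             # dot-product similarity, one score per token
        weights = softmax(scores)  # normalised attention distribution
        return weights @ V         # weighted average of the value vectors

    # toy example: 5 tokens with 8-dimensional vectors
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 8))
    state = rng.normal(size=8)
    W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
    print(attention(state, X, W_q, W_k, W_v).shape)  # (8,)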
00:04:40
Okay, self-attention. A crucial extension of this idea is that if you have a text and you want to encode the whole text, then every position needs to compute its own vector, but each one of those vectors needs to look at all the other vectors. That's self-attention: every word, every part of the text, is looking at every other part of the text. And you do this in multiple layers, so that information can propagate around the text: everything gets contextualised with respect to its neighbouring words, and then contextualised with respect to those contextualised representations, and so on, often for many layers.
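Building on the sketch above, self-attention just runs that same query-key-value computation for every position at once, so every token attends to every token. A minimal, hedged version, leaving out the scaling, masking, heads and other details discussed later:

    import numpy as np

    def self_attention(X, W_q, W_k, W_v):
        """Every row of X (one row per token) attends to every row of X."""
        Q, K, V = X @ W_q, X @ W_k, X @ W_v   # one query, key and value per position
        scores = Q @ K.T                       # similarity of every query with every key
        scores = scores - scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys, per query
        return weights @ V                     # each position gets its own weighted average

    def encode(X, layers):
        # stacking several self-attention layers lets information propagate around the text
        for W_q, W_k, W_v in layers:
            X = self_attention(X, W_q, W_k, W_v)
        return X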
00:05:36
Okay, so: attention is just learning a soft alignment over a sequence of tokens. This lets you deal with variable-size inputs, but those inputs are represented as a set, because attention doesn't really know what position those vectors are at; it decides what to look at just by looking at the content of the individual vector, nothing else. It's extremely effective, that's why we're all here, and self-attention is also extremely effective as a very general way to encode a text. And that last point, using self-attention as a general method for encoding text, is exactly the idea behind transformers. So let's talk about transformers.
00:06:37
Transformers are multiple layers of self-attention. Every token (here each token is a word, so we'll just refer to them as tokens) has a vector associated with it, and there's a different vector at every level. Across the multiple levels, self-attention is used to propagate information: each level uses self-attention to look at the level below, which is depicted by these lines, and the different colours are different kinds of self-attention. That gives you a kind of learned structure over the set. And then, other than attention, everything else is just a computation that's being done independently at every position, and that makes it extremely efficient to run on a GPU, because you're just doing the same computation lots and lots of times in parallel. That was a crucial factor for the success of transformers.
00:07:50
There are some problems with just using self-attention. The first is that it's representing these tokens as a set, when in fact it's a sequence: somehow we need to tell the model what the sequence is. The answer is that we just have a set of position embeddings that we add to the information about the word: we add in the fact that this is this particular word and it's at position five, this one is at position six. Those are typically learned, but they could also be hard-coded.
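A hedged sketch of that idea, assuming a learned position-embedding table added to the token embeddings (the table sizes are made up for illustration):

    import numpy as np

    vocab_size, max_len, d_model = 1000, 128, 64   # illustrative sizes only
    rng = np.random.default_rng(0)
    token_emb = rng.normal(scale=0.02, size=(vocab_size, d_model))   # learned word embeddings
    pos_emb = rng.normal(scale=0.02, size=(max_len, d_model))        # learned position embeddings

    def embed(token_ids):
        """Input to the first transformer layer: word identity plus position."""
        positions = np.arange(len(token_ids))
        return token_emb[token_ids] + pos_emb[positions]

    X = embed(np.array([5, 42, 7, 7]))   # the same word id 7 gets different vectors at positions 2 and 3
    print(X.shape)                        # (4, 64)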
00:08:32
The second problem is that, as we saw in those formulas, everything that's going on in attention is linear; it's just a dot product and a weighted average. So stacking these things on top of each other just gives you linear functions of linear functions, which is not powerful enough: we know from neural networks that we need nonlinearities. The solution is to add nonlinearities independently at every position. So in between every self-attention layer there is a little multilayer perceptron, a feed-forward neural network, that takes this vector and maps it into a new vector that gets used for the next layer of self-attention. So you alternate between a nonlinearity applied independently at every position and attention that transfers information across the positions.
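For example, a minimal position-wise feed-forward block of the kind described here might look like this (the ReLU nonlinearity and the 4x hidden width are common choices, assumed for illustration):

    import numpy as np

    def feed_forward(X, W1, b1, W2, b2):
        """Applied to each position's vector independently: expand, nonlinearity, project back."""
        hidden = np.maximum(0.0, X @ W1 + b1)   # ReLU provides the nonlinearity that attention lacks
        return hidden @ W2 + b2

    d_model, d_ff = 64, 256                      # illustrative sizes (d_ff is often ~4 * d_model)
    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(scale=0.02, size=(d_model, d_ff)), np.zeros(d_ff)
    W2, b2 = rng.normal(scale=0.02, size=(d_ff, d_model)), np.zeros(d_model)

    X = rng.normal(size=(10, d_model))           # 10 positions
    Y = feed_forward(X, W1, b1, W2, b2)          # same shape out, each row transformed separately
    print(Y.shape)                                # (10, 64)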
00:09:33
The third problem is kind of a technical issue, but it's important. If you're doing a left-to-right language model like GPT, you want to predict the next word, and you don't want the prediction of that next word to look at the next word, to condition on the next word, because that would be too easy. So we need to somehow block the computation that predicts one word from looking at itself and at future words, because it's not supposed to be able to do that: if it's a left-to-right language model, you're only supposed to condition on the left context. That's done basically by going in and hacking the attention function: you just set the score in the attention to a large negative number, and then it ends up being zero when you do the softmax. So we just zero out the attention, and basically say you're not allowed to look into the future.
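Concretely, the trick is to add a very large negative number to the scores of future positions before the softmax. A minimal sketch (-1e9 stands in for "effectively minus infinity"):

    import numpy as np

    def causal_self_attention(X, W_q, W_k, W_v):
        """Left-to-right self-attention: position i may only look at positions <= i."""
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        scores = Q @ K.T
        n = scores.shape[0]
        mask = np.triu(np.ones((n, n)), k=1).astype(bool)  # True above the diagonal = future positions
        scores = np.where(mask, -1e9, scores)               # huge negative score -> ~0 after softmax
        scores = scores - scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V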
00:10:46
Okay, so to summarise: transformers are based on self-attention; they use position embeddings to represent the sequential nature of the text; they add nonlinearities applied independently at every position; and they use some kind of masking to guarantee that you're not cheating by looking at inputs that, at that particular moment, you shouldn't be looking at. That last point is mostly important because you're not running this model once for every word prediction: you're sticking the whole thing on the GPU and pumping data through it, so you have to hard-wire in this causal relationship. And so those are the basic ideas of transformers.
00:11:41
There are a couple of other things that are just kind of deep learning black magic to get it to work, optimisation issues and so on. One thing that's important is that you're not actually just doing one attention function at every layer: you have multiple heads. That's done basically by, instead of having one query matrix, splitting it up into multiple query matrices, multiple but smaller, so it's the same amount of computation, but essentially the normalisation, the softmax, is only going over a small portion. That means you can have one attention head that looks at one kind of information and another one that looks at a different kind of information, and you can then merge all these different ways of looking at your context. That's really crucial to getting it to work.
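A hedged sketch of that splitting, reusing the self_attention function from the earlier sketch and concatenating the per-head results; the head count, sizes and output projection W_o are illustrative assumptions:

    import numpy as np

    def multi_head_self_attention(X, heads, W_o):
        """heads is a list of (W_q, W_k, W_v) triples, each projecting to a smaller dimension."""
        # self_attention is the function defined in the earlier sketch
        outputs = [self_attention(X, W_q, W_k, W_v) for (W_q, W_k, W_v) in heads]
        return np.concatenate(outputs, axis=-1) @ W_o   # merge the heads with an output projection

    d_model, n_heads = 64, 8
    d_head = d_model // n_heads                          # smaller per-head projections
    rng = np.random.default_rng(0)
    heads = [tuple(rng.normal(scale=0.02, size=(d_model, d_head)) for _ in range(3))
             for _ in range(n_heads)]
    W_o = rng.normal(scale=0.02, size=(d_model, d_model))
    X = rng.normal(size=(10, d_model))
    print(multi_head_self_attention(X, heads, W_o).shape)   # (10, 64)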
00:12:41
Another thing that's crucial to getting it to work is that, you know, the picture I showed you had this input, then something in the middle, then an output. What's really happening is that you're copying this vector, and the layer in between is just computing a modification of it. It's called a residual network because essentially we assume the output is the same as the input, and then this thing is trained to learn the residual error that corrects it. It's just a way to get the deep learning to work.
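In code, the residual idea is just an addition around each sub-layer. A minimal sketch (pre- versus post-normalisation placement and other details vary between implementations and are omitted here):

    def residual(X, sublayer, *params):
        """Output = input + correction, so the sublayer only has to learn the residual."""
        return X + sublayer(X, *params)

    # e.g. one transformer layer, composing the earlier sketches (illustrative only):
    #   X = residual(X, multi_head_self_attention, heads, W_o)
    #   X = residual(X, feed_forward, W1, b1, W2, b2)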
00:13:23
Another thing is layer normalisation. This is really black magic: you're just taking this range of values, which can vary a lot, and saying, okay, I'm not going to let it vary arbitrarily, I'm only going to let it vary within a normal distribution. So I force the mean to always be the same, I force the variance to always be the same, and I even learn some parameters that say, okay, for this dimension the variance can be higher and the mean needs to be lower. It's a total hack, but it's really important to get the optimisation to work.
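A minimal sketch of layer normalisation, with the learned per-dimension scale (gamma) and shift (beta) mentioned above:

    import numpy as np

    def layer_norm(X, gamma, beta, eps=1e-5):
        """Normalise each position's vector to zero mean and unit variance, then rescale."""
        mean = X.mean(axis=-1, keepdims=True)
        var = X.var(axis=-1, keepdims=True)
        X_hat = (X - mean) / np.sqrt(var + eps)
        return gamma * X_hat + beta     # learned per-dimension scale and shift

    d_model = 64
    gamma, beta = np.ones(d_model), np.zeros(d_model)   # learned parameters, typically initialised like this
    X = np.random.default_rng(0).normal(size=(10, d_model))
    print(layer_norm(X, gamma, beta).shape)              # (10, 64)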
00:14:06
Scaled dot-product attention is also kind of like that: if these vectors are really big, then the scores can be really big, and then the softmax just gives you zeros and ones, which is not what you want. So you divide by the square root of the dimensionality, which says, well, this is going to scale with the dimensionality, and that keeps everything in a nice, well-behaved range. So that's also important.
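That scaling is a one-line change to the earlier attention sketches; for key vectors of dimension d_k:

    import numpy as np

    def scaled_scores(Q, K):
        """Dot-product scores divided by sqrt(d_k) to keep the softmax well behaved."""
        d_k = K.shape[-1]
        return (Q @ K.T) / np.sqrt(d_k)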
00:14:36
Another thing we need to do: if we just had our tokens as words, then we'd always have new words that we haven't seen in training and we wouldn't know what to do with them, so that's no good. We want to take those words and split them up into little pieces, where we do know what to do with the little pieces. That's typically done with something like byte pair encoding or word pieces, which just says: I'm only going to include in my vocabulary sequences of characters that I see frequently enough that I can learn them, and if a sequence of characters is too infrequent, then I split it up into smaller pieces that I have seen. That's the basic idea, and so there are no unknown words.
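To illustrate only the segmentation side of that idea: real byte pair encoding and word pieces learn their vocabulary from corpus frequencies, but given such a vocabulary, a rare word can be split greedily into known pieces, with single characters as a last resort. The vocabulary below is made up for the example:

    def segment(word, vocab):
        """Greedy longest-match split of a word into known subword pieces (characters as a fallback)."""
        pieces, i = [], 0
        while i < len(word):
            for j in range(len(word), i, -1):        # try the longest piece first
                if word[i:j] in vocab:
                    pieces.append(word[i:j])
                    i = j
                    break
            else:
                pieces.append(word[i])                # unseen character: keep it as its own piece
                i += 1
        return pieces

    vocab = {"trans", "form", "er", "s", "un", "believ", "able"}   # made-up subword vocabulary
    print(segment("transformers", vocab))    # ['trans', 'form', 'er', 's']
    print(segment("unbelievable", vocab))    # ['un', 'believ', 'able']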
00:15:30
Another thing that I alluded to, and this is a bit more general: often self-attention is done bidirectionally, so every word can look at every other word. GPT and the large language models like it have this efficiency constraint that a given word can only look at the words in the preceding part of the sentence, so the attention function cannot look at future words. Even if you're currently trying to predict word ten, word five is not allowed to look at word six. I mean, it's there, you know what it is, but you're not allowed to look at it. And the reason is that if you can only look at things that are earlier, that vector never changes: I compute the embedding of word five when predicting word five, and I compute the embedding of word five when predicting word six, but it's the same, because I can't look at word six. So you just have to keep one embedding for every position, and that makes things much faster during training, and it's really nice on the GPU. So that's another thing that's important for GPT in particular.
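A small check of that claim, reusing the causal_self_attention sketch from above: the representations of the earlier positions are unchanged when more tokens are appended, which is what lets you compute (or cache) them once.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 16
    W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
    X6 = rng.normal(size=(6, d))       # embeddings for a 6-token prefix
    X5 = X6[:5]                         # the same text, one token shorter

    out5 = causal_self_attention(X5, W_q, W_k, W_v)   # defined in the earlier sketch
    out6 = causal_self_attention(X6, W_q, W_k, W_v)
    print(np.allclose(out5, out6[:5]))                # True: earlier positions never change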
00:16:45
So, a summary of transformers: they are multi-layer, attention-based sequence processing models. Each layer uses self-attention to look at all the embeddings of the previous layer; it uses multiple attention heads so that it can look at things in more than one way; and it uses a set of vector representations, so its latent space is a set of vectors, not explicitly a sequence, because self-attention only understands sets of vectors.
00:17:41
Okay, I think I'm way ahead of schedule, so let me say the slide that I didn't add, which is my personal perspective on these things. So it's representing the text as a set of vectors, but on top of that set of vectors you have an attention function. If we go back to this picture here: you have this attention function that's able to say, well, I know what vector this is and I'm going to use it to look at these vectors. So these attention functions essentially give you a graph structure; the model is learning a graph structure, which it can then implement in the attention function; it's embedding the graph relationships into these pairs of vectors. So if I know this vector and I know that vector, then I can compute that the pair has a high attention score, and that's a kind of implicit relation. So transformers are not sequence processing models, they are graph processing models. Okay, remember that; that's my take on it.
00:19:21
Okay, pretraining. There are lots and lots of things we want to do in natural language processing, and most of them require understanding text in some way. So if every time we learn a sentiment analysis task, or something like that, we need to learn from scratch how to understand text, that's going to be really hard: we don't have enough data on sentiment analysis to learn English. But what we do have is lots and lots and lots of text. So can we just first learn about understanding language, understanding English, and then learn about sentiment analysis afterwards, once we've learned about language? That's the idea behind pretraining: first we do the pretraining, which just says "how do I understand language", and then we do the fine-tuning, or other ways of using the language model that Andre will talk about later, which then extract that information for the particular task.
00:20:23
And maybe we can do that, and that's because of distributional semantics: the distributions of words in text, the whole co-occurrence of words in sequences, tell us a lot about the meaning in the text. So we can just train on these distributions and already we can understand text in some meaningful way. That's the way pretraining works. And then, once we've learned that representation, we need to transfer that knowledge to our particular task, like sentiment analysis, so that the task afterwards isn't so hard.
00:21:17
So, some models that are commonly discussed in this area. The first one was BERT, and this is really what started the revolution of transformers in NLP. It's a transformer encoder: you just give it text and it produces a set of vectors. It's pretrained on a masked language model task: you mask a word and you try to predict it from all the others, so you're learning the relationships between words, which is crucial in text.
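A hedged sketch of how such masked-language-model training examples can be constructed; the 15% masking rate and the [MASK] token are the usual BERT-style choices, assumed here for illustration:

    import random

    def make_masked_lm_example(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
        """Replace some tokens with [MASK]; the model is trained to predict the originals."""
        rng = random.Random(seed)
        inputs, targets = [], []
        for tok in tokens:
            if rng.random() < mask_prob:
                inputs.append(mask_token)
                targets.append(tok)          # loss is computed only at the masked positions
            else:
                inputs.append(tok)
                targets.append(None)         # no prediction needed here
        return inputs, targets

    tokens = "the woman walked across the street checking for traffic".split()
    print(make_masked_lm_example(tokens))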
00:21:50
BART is similar, but it's an encoder and a decoder: the encoder takes the text and produces a set of vectors, and then you have attention from a decoder that generates another text. In this case it was pretrained on just reconstruction, but the input is some noisy version of your text: you've deleted some words, substituted some words. Then the model needs to learn how to correct that, so again it's the correlations between the words that let it figure out how to correct the text, and by making the model learn those correlations, it's learning how to understand language.
00:22:35
T5 is similar, an encoder-decoder, but the objective is a bit different: it's learning to predict spans instead of words. All of these models are generally pretrained, and then you take that model and you fine-tune it, which means you do backpropagation training, gradient descent training, on your specific task and change all the parameters of the model, but you don't change them very much, just enough to do the task, so most of the knowledge about language is still in there.
00:23:15
And then the last one is GPT, which we'll talk about today. It's just a transformer decoder, so it's just generating text one word at a time, predicting the next word conditioned on all the previous words that it has already generated. It can be used for fine-tuning, but most of the time you're just using the language model directly, and we'll talk about how to do that.
00:23:50
So why does distributional semantics work? What do you learn? Well, here are some examples. "Stanford University is located in ___, California": we're learning facts about the world. We know exactly what word goes there, it needs to be "Palo Alto", but we know that because we know facts about the world, so the language model has to learn facts about the world in order to solve this problem. "I put ___ fork down on the table": it has to be "the" or "a", the syntax says what that has to be, so if we want to predict that word, we have to learn about syntax too. "The woman walked across the street, checking for traffic over ___ shoulder": well, the syntax says it's probably "her" shoulder or "his" shoulder, but it can't be "his" shoulder because it's a woman, so in order to solve this problem we need to be able to do coreference resolution, to know who it is we're talking about at the moment. So just by learning a language model, we have to learn all these different properties of language. "I went to the ocean to see the fish, turtles, seals, and ___": it's got to be some kind of sea creature, right? It can't be dogs, or, I don't know, chocolate. "Overall, the value I got from the two hours watching it was the sum total of the popcorn and the drink. This movie was ___": this is sentiment analysis. It can't be "great", right, because they're saying this movie was a total waste of my time, so it's got to be "horrible" or something like that. "Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the ___": well, it's going to be "kitchen", right? So to predict that word you have to know about spatial relationships and people moving around in the world. That's very complicated reasoning, but if you're a really, really good language model, you'll be able to answer these questions.
00:26:37
Okay, so back to some of the technical stuff. Now we're going to pretrain our language model on these sequences of text, and for GPT we're just talking about left-to-right language modelling: try to predict the next word given the words we've predicted already, or the words that are in the text already. At training time you know all the words, so at every position you know all these words, and at every position you want to predict the next word. You just have your attention function masked in such a way that the prediction can't look into the future, it can only look into the past. That makes it a left-to-right language model, but we can put this whole thing on the chip, on a GPU, all at once and train all these predictions in parallel, and that makes training enormously more efficient than trying to predict one word at a time, because you can do it in parallel.
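A hedged sketch of that training setup: the inputs and targets are just the same token sequence shifted by one, so every position's next-word prediction is trained at once. The random "logits" below stand in for the outputs of a real causally masked transformer:

    import numpy as np

    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    def next_token_loss(logits, token_ids):
        """Cross-entropy for predicting every next token in parallel (teacher forcing)."""
        inputs, targets = token_ids[:-1], token_ids[1:]   # position t predicts token t+1
        logp = log_softmax(logits[:len(inputs)])          # logits: one vocabulary-sized row per position
        return -logp[np.arange(len(targets)), targets].mean()

    # toy usage: random "model outputs" for a 6-token text over a 10-word vocabulary
    rng = np.random.default_rng(0)
    token_ids = np.array([3, 1, 4, 1, 5, 9])
    logits = rng.normal(size=(len(token_ids) - 1, 10))    # would come from the causally masked model
    print(next_token_loss(logits, token_ids))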
00:27:42
Right, so GPT is like that. It has twelve layers, so it's really deep, and it's just really big: it has huge hidden layers, it has really big feed-forward networks, and it uses this byte pair encoding, so small character n-grams, with forty thousand merges, so there are forty thousand different character n-grams. It's trained on a whole lot of data; I mean, for the most recent version we don't even know how much data, and if your text is on the web, it's probably been trained on your text. And GPT probably means something like Generative Pretrained Transformer, but we're not really sure; there's a lot of stuff we're not really sure about.
00:28:55
So we can use it for fine-tuning, or at least the earlier versions: if you have access to the model, then you can use it for fine-tuning, but generally we just have access to the predictions of the model. This is just an example of how to do textual entailment: you want to know whether, if I believe "the man is in the doorway", then I believe "the person is near the door". You want to predict whether that's true or not, so you just input these things, you fine-tune the model to make this prediction, and it does well.
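A sketch of how such a fine-tuning example is often formatted for a decoder-style model: the two sentences are packed into one token sequence with delimiters, and a small classification head is trained on top of the final hidden state. The special token names and the linear head here are illustrative assumptions, not GPT's exact recipe:

    import numpy as np

    def format_entailment(premise, hypothesis):
        """Pack premise and hypothesis into a single input sequence with delimiter tokens."""
        return ["[START]"] + premise.split() + ["[DELIM]"] + hypothesis.split() + ["[EXTRACT]"]

    def classify(final_hidden, W_cls, b_cls):
        """A small linear head on the final position's vector predicts entailed vs. not entailed."""
        logits = final_hidden @ W_cls + b_cls
        return int(np.argmax(logits))

    tokens = format_entailment("the man is in the doorway", "the person is near the door")
    print(tokens)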
00:29:38
Okay, so, summary: pretraining has been hugely successful in improving the state of the art in many tasks; it has totally changed the state of the art in just about everything in NLP. There have been lots of different types of these models, depending on the structure of the transformer, the learning objective, the pretraining objective, and the kind of data it's trained on: BART, BERT, GPT, T5. The recent GPT models are very large transformers trained on the left-to-right language modelling task, and they just end up having a huge amount of information, because they have so many parameters and they've been trained on so much data that they just encode a huge amount of information. So next we'll talk about...
