Transcriptions

Note: this content has been automatically generated.
00:00:01
Okay, thanks, Andre. I'm the head of the Natural Language Understanding group, and it's true, I spend all my time thinking about transformers. So this is the technical part: we'll start with attention, as in "Attention Is All You Need", then go on to transformers and a bit about pretraining transformers. Some of the slides are taken from the Stanford NLP course, which is a very good course if you're interested in the technical details.
00:00:38
Okay, attention. The basic problem with text is that a text can be very long, and you need to somehow be able to access all the information that's in the text. We can't just compress a whole text into a single vector and then condition on that vector, because that's just too much information. But luckily we don't normally want to look at the whole text: we want to look at some little bit for one decision and some other part for another decision. So we can solve this problem by having every part of the text get a different vector, so that the number of vectors grows with the size of the text, and then for a given question you look at one vector to answer one question and another vector to answer another question. So you need to have this alignment between what you want to know at the moment and which part of the text you want to look at, and that alignment is what attention gets you: a learned soft alignment between what you want to know and what part of the text you want to look at.
00:02:03
So it looks basically like this. The different parts of your text, the words, end up as vectors, and then you have some vector representing your current state, what you need to look at. You compute a similarity, just checking "is this the one I want to look at?", and you get a score. You take all those scores and put them through a normalised exponential, a softmax, to give you a distribution over these vectors. This distribution tells you what you want to look at, what you want to pay attention to. Now you do a weighted average: you take these vectors, multiply them by these weights and sum them together, and you get a vector which in this case is basically a copy of this one, or if it was a different state I would get a copy of a different vector, but it can also be a kind of smoothed version of multiple vectors that are all summed together. You get your resulting vector, and then you can just condition on that. So you are conditioning on an individual vector, but it's a vector that's specific to the question you want to ask at that moment.
00:03:26
And so, just to give you a bit of the details, the basic idea is that you have these vectors you want to look at, so it's a set of vectors, and you have some state, and you compute a score, which is a dot product: you take the state and multiply it by a matrix to get a query vector; you take the individual vector and multiply it by another matrix to get the key vector; the dot product gives you a score; softmax gives you a normalised weight; and you then take a weighted average of the vectors, each mapped into a value vector. So it's query, key, value, weighted average, and that gives you the result of the attention function. So with attention you take the set and you end up with a function from vectors to vectors. That's the attention function, and everything is based on that.
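To make that query-key-value recipe concrete, here is a minimal sketch in NumPy; the array shapes and the names W_q, W_k, W_v are illustrative assumptions, not the exact parameterisation of any particular model.

    import numpy as np

    def softmax(x):
        # numerically stable softmax over the last axis
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    def attention(state, X, W_q, W_k, W_v):
        """Single-query attention over a set of vectors X (one row per token)."""
        q = state @ W_q            # query vector for the current state
        K = X @ W_k                # one key vector per token
        V = X @ W_v                # one value vector per token
        scores = K @ q             # dot-product similarity, one score per token
        weights = softmax(scores)  # normalised attention distribution
        return weights @ V         # weighted average of the value vectors

    # toy example: 5 tokens with 8-dimensional vectors
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 8))
    state = rng.normal(size=8)
    W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
    print(attention(state, X, W_q, W_k, W_v).shape)  # (8,)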
00:04:40
Okay, self-attention. A crucial extension of this idea is that if you have a text and you want to encode the whole text, then every position needs to compute its own vector, but each one of those vectors needs to look at all the other vectors. That's self-attention: every word, every part of the text, is looking at every other part of the text. And you do this in multiple layers, so that information can propagate around the text: everything gets contextualised with respect to its neighbouring words, and then contextualised with respect to those contextualised representations, and so on, often for many layers.
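Building on the sketch above, self-attention just runs that same query-key-value computation for every position at once, so every token attends to every token. A minimal, hedged version, leaving out the scaling, masking, heads and other details discussed later:

    import numpy as np

    def self_attention(X, W_q, W_k, W_v):
        """Every row of X (one row per token) attends to every row of X."""
        Q, K, V = X @ W_q, X @ W_k, X @ W_v   # one query, key and value per position
        scores = Q @ K.T                       # similarity of every query with every key
        scores = scores - scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys, per query
        return weights @ V                     # each position gets its own weighted average

    def encode(X, layers):
        # stacking several self-attention layers lets information propagate around the text
        for W_q, W_k, W_v in layers:
            X = self_attention(X, W_q, W_k, W_v)
        return X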
00:05:36
Okay, so: attention is just learning a soft alignment over a sequence of tokens. This lets you deal with variable-size inputs, but those inputs are represented as a set, because attention doesn't really know what position those vectors are at; it decides what to look at just by looking at the content of the individual vector, nothing else. It's extremely effective, that's why we're all here, and self-attention is also extremely effective as a very general way to encode a text. And that last point, using self-attention as a general method for encoding text, is exactly the idea behind transformers. So let's talk about transformers.
00:06:37
Transformers are multiple layers of self-attention. Every token (here each token is a word, so we'll just refer to them as tokens) has a vector associated with it, and there's a different vector at every level. Across the multiple levels, self-attention is used to propagate information: each level uses self-attention to look at the level below, which is depicted by these lines, and the different colours are different kinds of self-attention. That gives you a kind of learned structure over the set. And then, other than attention, everything else is just a computation that's being done independently at every position, and that makes it extremely efficient to run on a GPU, because you're just doing the same computation lots and lots of times in parallel. That was a crucial factor for the success of transformers.
00:07:50
There are some problems with just using self-attention. The first is that it's representing these tokens as a set, when in fact it's a sequence: somehow we need to tell the model what the sequence is. The answer is that we just have a set of position embeddings that we add to the information about the word: we add in the fact that this is this particular word and it's at position five, this one is at position six. Those are typically learned, but they could also be hard-coded.
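A hedged sketch of that idea, assuming a learned position-embedding table added to the token embeddings (the table sizes are made up for illustration):

    import numpy as np

    vocab_size, max_len, d_model = 1000, 128, 64   # illustrative sizes only
    rng = np.random.default_rng(0)
    token_emb = rng.normal(scale=0.02, size=(vocab_size, d_model))   # learned word embeddings
    pos_emb = rng.normal(scale=0.02, size=(max_len, d_model))        # learned position embeddings

    def embed(token_ids):
        """Input to the first transformer layer: word identity plus position."""
        positions = np.arange(len(token_ids))
        return token_emb[token_ids] + pos_emb[positions]

    X = embed(np.array([5, 42, 7, 7]))   # the same word id 7 gets different vectors at positions 2 and 3
    print(X.shape)                        # (4, 64)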
00:08:32
The second problem is that, as we saw in those formulas, everything that's going on in attention is linear; it's just a dot product and a weighted average. So stacking these things on top of each other just gives you linear functions of linear functions, which is not powerful enough: we know from neural networks that we need nonlinearities. The solution is to add nonlinearities independently at every position. So in between every self-attention layer there is a little multilayer perceptron, a feed-forward neural network, that takes this vector and maps it into a new vector that gets used for the next layer of self-attention. So you alternate between a nonlinearity applied independently at every position and attention that transfers information across the positions.
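For example, a minimal position-wise feed-forward block of the kind described here might look like this (the ReLU nonlinearity and the 4x hidden width are common choices, assumed for illustration):

    import numpy as np

    def feed_forward(X, W1, b1, W2, b2):
        """Applied to each position's vector independently: expand, nonlinearity, project back."""
        hidden = np.maximum(0.0, X @ W1 + b1)   # ReLU provides the nonlinearity that attention lacks
        return hidden @ W2 + b2

    d_model, d_ff = 64, 256                      # illustrative sizes (d_ff is often ~4 * d_model)
    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(scale=0.02, size=(d_model, d_ff)), np.zeros(d_ff)
    W2, b2 = rng.normal(scale=0.02, size=(d_ff, d_model)), np.zeros(d_model)

    X = rng.normal(size=(10, d_model))           # 10 positions
    Y = feed_forward(X, W1, b1, W2, b2)          # same shape out, each row transformed separately
    print(Y.shape)                                # (10, 64)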
00:09:33
The third problem is kind of a technical issue, but it's important. If you're doing a left-to-right language model like GPT, you want to predict the next word, and you don't want the prediction of that next word to look at the next word, to condition on the next word, because that would be too easy. So we need to somehow block the computation that predicts one word from looking at itself and at future words, because it's not supposed to be able to do that: if it's a left-to-right language model, you're only supposed to condition on the left context. That's done basically by going in and hacking the attention function: you just set the score in the attention to a large negative number, and then it ends up being zero when you do the softmax. So we just zero out the attention, and basically say you're not allowed to look into the future.
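Concretely, the trick is to add a very large negative number to the scores of future positions before the softmax. A minimal sketch (-1e9 stands in for "effectively minus infinity"):

    import numpy as np

    def causal_self_attention(X, W_q, W_k, W_v):
        """Left-to-right self-attention: position i may only look at positions <= i."""
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        scores = Q @ K.T
        n = scores.shape[0]
        mask = np.triu(np.ones((n, n)), k=1).astype(bool)  # True above the diagonal = future positions
        scores = np.where(mask, -1e9, scores)               # huge negative score -> ~0 after softmax
        scores = scores - scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V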
00:10:46
Okay, so to summarise: transformers are based on self-attention; they use position embeddings to represent the sequential nature of the text; they add nonlinearities applied independently at every position; and they use some kind of masking to guarantee that you're not cheating by looking at inputs that, at that particular moment, you shouldn't be looking at. That last point is mostly important because you're not running this model once for every word prediction: you're sticking the whole thing on the GPU and pumping data through it, so you have to hard-wire in this causal relationship. And so those are the basic ideas of transformers.
00:11:41
There are a couple of other things that are just kind of deep learning black magic to get it to work, optimisation issues and so on. One thing that's important is that you're not actually just doing one attention function at every layer: you have multiple heads. That's done basically by, instead of having one query matrix, splitting it up into multiple query matrices, multiple but smaller, so it's the same amount of computation, but essentially the normalisation, the softmax, is only going over a small portion. That means you can have one attention head that looks at one kind of information and another one that looks at a different kind of information, and you can then merge all these different ways of looking at your context. That's really crucial to getting it to work.
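A hedged sketch of that splitting, reusing the self_attention function from the earlier sketch and concatenating the per-head results; the head count, sizes and output projection W_o are illustrative assumptions:

    import numpy as np

    def multi_head_self_attention(X, heads, W_o):
        """heads is a list of (W_q, W_k, W_v) triples, each projecting to a smaller dimension."""
        # self_attention is the function defined in the earlier sketch
        outputs = [self_attention(X, W_q, W_k, W_v) for (W_q, W_k, W_v) in heads]
        return np.concatenate(outputs, axis=-1) @ W_o   # merge the heads with an output projection

    d_model, n_heads = 64, 8
    d_head = d_model // n_heads                          # smaller per-head projections
    rng = np.random.default_rng(0)
    heads = [tuple(rng.normal(scale=0.02, size=(d_model, d_head)) for _ in range(3))
             for _ in range(n_heads)]
    W_o = rng.normal(scale=0.02, size=(d_model, d_model))
    X = rng.normal(size=(10, d_model))
    print(multi_head_self_attention(X, heads, W_o).shape)   # (10, 64)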
00:12:41
Another thing that's crucial to getting it to work is that, you know, the picture I showed you had this input, then something in the middle, then an output. What's really happening is that you're copying this vector, and the layer in between is just computing a modification of it. It's called a residual network because essentially we assume the output is the same as the input, and then this thing is trained to learn the residual error that corrects it. It's just a way to get the deep learning to work.
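In code, the residual idea is just an addition around each sub-layer. A minimal sketch (pre- versus post-normalisation placement and other details vary between implementations and are omitted here):

    def residual(X, sublayer, *params):
        """Output = input + correction, so the sublayer only has to learn the residual."""
        return X + sublayer(X, *params)

    # e.g. one transformer layer, composing the earlier sketches (illustrative only):
    #   X = residual(X, multi_head_self_attention, heads, W_o)
    #   X = residual(X, feed_forward, W1, b1, W2, b2)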
00:13:23
Another thing is layer normalisation. This is really black magic: you're just taking this range of values, which can vary a lot, and saying, okay, I'm not going to let it vary arbitrarily, I'm only going to let it vary within a normal distribution. So I force the mean to always be the same, I force the variance to always be the same, and I even learn some parameters that say, okay, for this dimension the variance can be higher and the mean needs to be lower. It's a total hack, but it's really important to get the optimisation to work.
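A minimal sketch of layer normalisation, with the learned per-dimension scale (gamma) and shift (beta) mentioned above:

    import numpy as np

    def layer_norm(X, gamma, beta, eps=1e-5):
        """Normalise each position's vector to zero mean and unit variance, then rescale."""
        mean = X.mean(axis=-1, keepdims=True)
        var = X.var(axis=-1, keepdims=True)
        X_hat = (X - mean) / np.sqrt(var + eps)
        return gamma * X_hat + beta     # learned per-dimension scale and shift

    d_model = 64
    gamma, beta = np.ones(d_model), np.zeros(d_model)   # learned parameters, typically initialised like this
    X = np.random.default_rng(0).normal(size=(10, d_model))
    print(layer_norm(X, gamma, beta).shape)              # (10, 64)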
00:14:06
Scaled dot-product attention is also kind of like that: if these vectors are really big, then the scores can be really big, and then the softmax just gives you zeros and ones, which is not what you want. So you divide by the square root of the dimensionality, which says, well, this is going to scale with the dimensionality, and that keeps everything in a nice, well-behaved range. So that's also important.
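That scaling is a one-line change to the earlier attention sketches; for key vectors of dimension d_k:

    import numpy as np

    def scaled_scores(Q, K):
        """Dot-product scores divided by sqrt(d_k) to keep the softmax well behaved."""
        d_k = K.shape[-1]
        return (Q @ K.T) / np.sqrt(d_k)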
00:14:36
Another thing we need to do: if we just had our tokens as words, then we'd always have new words that we haven't seen in training and we wouldn't know what to do with them, so that's no good. We want to take those words and split them up into little pieces, where we do know what to do with the little pieces. That's typically done with something like byte pair encoding or word pieces, which just says: I'm only going to include in my vocabulary sequences of characters that I see frequently enough that I can learn them, and if a sequence of characters is too infrequent, then I split it up into smaller pieces that I have seen. That's the basic idea, and so there are no unknown words.
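To illustrate only the segmentation side of that idea: real byte pair encoding and word pieces learn their vocabulary from corpus frequencies, but given such a vocabulary, a rare word can be split greedily into known pieces, with single characters as a last resort. The vocabulary below is made up for the example:

    def segment(word, vocab):
        """Greedy longest-match split of a word into known subword pieces (characters as a fallback)."""
        pieces, i = [], 0
        while i < len(word):
            for j in range(len(word), i, -1):        # try the longest piece first
                if word[i:j] in vocab:
                    pieces.append(word[i:j])
                    i = j
                    break
            else:
                pieces.append(word[i])                # unseen character: keep it as its own piece
                i += 1
        return pieces

    vocab = {"trans", "form", "er", "s", "un", "believ", "able"}   # made-up subword vocabulary
    print(segment("transformers", vocab))    # ['trans', 'form', 'er', 's']
    print(segment("unbelievable", vocab))    # ['un', 'believ', 'able']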
00:15:30
Another thing that I alluded to, and this is a bit more general: often self-attention is done bidirectionally, so every word can look at every other word. GPT and the large language models like it have this efficiency constraint that a given word can only look at the words in the preceding part of the sentence, so the attention function cannot look at future words. Even if you're currently trying to predict word ten, word five is not allowed to look at word six. I mean, it's there, you know what it is, but you're not allowed to look at it. And the reason is that if you can only look at things that are earlier, that vector never changes: I compute the embedding of word five when predicting word five, and I compute the embedding of word five when predicting word six, but it's the same, because I can't look at word six. So you just have to keep one embedding for every position, and that makes things much faster during training, and it's really nice on the GPU. So that's another thing that's important for GPT in particular.
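A small check of that claim, reusing the causal_self_attention sketch from above: the representations of the earlier positions are unchanged when more tokens are appended, which is what lets you compute (or cache) them once.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 16
    W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
    X6 = rng.normal(size=(6, d))       # embeddings for a 6-token prefix
    X5 = X6[:5]                         # the same text, one token shorter

    out5 = causal_self_attention(X5, W_q, W_k, W_v)   # defined in the earlier sketch
    out6 = causal_self_attention(X6, W_q, W_k, W_v)
    print(np.allclose(out5, out6[:5]))                # True: earlier positions never change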
00:16:45
So, a summary of transformers: they are multi-layer, attention-based sequence processing models. Each layer uses self-attention to look at all the embeddings of the previous layer; it uses multiple attention heads so that it can look at things in more than one way; and it uses a set of vector representations, so its latent space is a set of vectors, not explicitly a sequence, because self-attention only understands sets of vectors.
00:17:41
Okay, I think I'm way ahead of schedule, so let me say the slide that I didn't add, which is my personal perspective on these things. So it's representing the text as a set of vectors, but on top of that set of vectors you have an attention function. If we go back to this picture here: you have this attention function that's able to say, well, I know what vector this is and I'm going to use it to look at these vectors. So these attention functions essentially give you a graph structure; the model is learning a graph structure, which it can then implement in the attention function; it's embedding the graph relationships into these pairs of vectors. So if I know this vector and I know that vector, then I can compute that the pair has a high attention score, and that's a kind of implicit relation. So transformers are not sequence processing models, they are graph processing models. Okay, remember that; that's my take on it.
00:19:21
Okay, pretraining. There are lots and lots of things we want to do in natural language processing, and most of them require understanding text in some way. So if every time we learn a sentiment analysis task, or something like that, we need to learn from scratch how to understand text, that's going to be really hard: we don't have enough data on sentiment analysis to learn English. But what we do have is lots and lots and lots of text. So can we just first learn about understanding language, understanding English, and then learn about sentiment analysis afterwards, once we've learned about language? That's the idea behind pretraining: first we do the pretraining, which just says "how do I understand language", and then we do the fine-tuning, or other ways of using the language model that Andre will talk about later, which then extract that information for the particular task.
00:20:23
And maybe we can do that, and that's because of distributional semantics: the distributions of words in text, the whole co-occurrence of words in sequences, tell us a lot about the meaning in the text. So we can just train on these distributions and already we can understand text in some meaningful way. That's the way pretraining works. And then, once we've learned that representation, we need to transfer that knowledge to our particular task, like sentiment analysis, so that the task afterwards isn't so hard.
00:21:17
So, some models that are commonly discussed in this area. The first one was BERT, and this is really what started the revolution of transformers in NLP. It's a transformer encoder: you just give it text and it produces a set of vectors. It's pretrained on a masked language model task: you mask a word and you try to predict it from all the others, so you're learning the relationships between words, which is crucial in text.
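A hedged sketch of how such masked-language-model training examples can be constructed; the 15% masking rate and the [MASK] token are the usual BERT-style choices, assumed here for illustration:

    import random

    def make_masked_lm_example(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
        """Replace some tokens with [MASK]; the model is trained to predict the originals."""
        rng = random.Random(seed)
        inputs, targets = [], []
        for tok in tokens:
            if rng.random() < mask_prob:
                inputs.append(mask_token)
                targets.append(tok)          # loss is computed only at the masked positions
            else:
                inputs.append(tok)
                targets.append(None)         # no prediction needed here
        return inputs, targets

    tokens = "the woman walked across the street checking for traffic".split()
    print(make_masked_lm_example(tokens))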
00:21:50
BART is similar, but it's an encoder and a decoder: the encoder takes the text and produces a set of vectors, and then you have attention from a decoder that generates another text. In this case it was pretrained on just reconstruction, but the input is some noisy version of your text: you've deleted some words, substituted some words. Then the model needs to learn how to correct that, so again it's the correlations between the words that let it figure out how to correct the text, and by making the model learn those correlations, it's learning how to understand language.
00:22:35
T5 is similar, an encoder-decoder, but the objective is a bit different: it's learning to predict spans instead of words. All of these models are generally pretrained, and then you take that model and you fine-tune it, which means you do backpropagation training, gradient descent training, on your specific task and change all the parameters of the model, but you don't change them very much, just enough to do the task, so most of the knowledge about language is still in there.
00:23:15
And then the last one is GPT, which we'll talk about today. It's just a transformer decoder, so it's just generating text one word at a time, predicting the next word conditioned on all the previous words that it has already generated. It can be used for fine-tuning, but most of the time you're just using the language model directly, and we'll talk about how to do that.
00:23:50
So why does distributional semantics work? What do you learn? Well, here are some examples. "Stanford University is located in ___, California": we're learning facts about the world. We know exactly what word goes there, it needs to be "Palo Alto", but we know that because we know facts about the world, so the language model has to learn facts about the world in order to solve this problem. "I put ___ fork down on the table": it has to be "the" or "a", the syntax says what that has to be, so if we want to predict that word, we have to learn about syntax too. "The woman walked across the street, checking for traffic over ___ shoulder": well, the syntax says it's probably "her" shoulder or "his" shoulder, but it can't be "his" shoulder because it's a woman, so in order to solve this problem we need to be able to do coreference resolution, to know who it is we're talking about at the moment. So just by learning a language model, we have to learn all these different properties of language. "I went to the ocean to see the fish, turtles, seals, and ___": it's got to be some kind of sea creature, right? It can't be dogs, or, I don't know, chocolate. "Overall, the value I got from the two hours watching it was the sum total of the popcorn and the drink. This movie was ___": this is sentiment analysis. It can't be "great", right, because they're saying this movie was a total waste of my time, so it's got to be "horrible" or something like that. "Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the ___": well, it's going to be "kitchen", right? So to predict that word you have to know about spatial relationships and people moving around in the world. That's very complicated reasoning, but if you're a really, really good language model, you'll be able to answer these questions.
00:26:37
Okay, so back to some of the technical stuff. Now we're going to pretrain our language model on these sequences of text, and for GPT we're just talking about left-to-right language modelling: try to predict the next word given the words we've predicted already, or the words that are in the text already. At training time you know all the words, so at every position you know all these words, and at every position you want to predict the next word. You just have your attention function masked in such a way that the prediction can't look into the future, it can only look into the past. That makes it a left-to-right language model, but we can put this whole thing on the chip, on a GPU, all at once and train all these predictions in parallel, and that makes training enormously more efficient than trying to predict one word at a time, because you can do it in parallel.
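A hedged sketch of that training setup: the inputs and targets are just the same token sequence shifted by one, so every position's next-word prediction is trained at once. The random "logits" below stand in for the outputs of a real causally masked transformer:

    import numpy as np

    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    def next_token_loss(logits, token_ids):
        """Cross-entropy for predicting every next token in parallel (teacher forcing)."""
        inputs, targets = token_ids[:-1], token_ids[1:]   # position t predicts token t+1
        logp = log_softmax(logits[:len(inputs)])          # logits: one vocabulary-sized row per position
        return -logp[np.arange(len(targets)), targets].mean()

    # toy usage: random "model outputs" for a 6-token text over a 10-word vocabulary
    rng = np.random.default_rng(0)
    token_ids = np.array([3, 1, 4, 1, 5, 9])
    logits = rng.normal(size=(len(token_ids) - 1, 10))    # would come from the causally masked model
    print(next_token_loss(logits, token_ids))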
00:27:42
Right, so GPT is like that. It has twelve layers, so it's really deep, and it's just really big: it has huge hidden layers, it has really big feed-forward networks, and it uses this byte pair encoding, so small character n-grams, with forty thousand merges, so there are forty thousand different character n-grams. It's trained on a whole lot of data; I mean, for the most recent version we don't even know how much data, and if your text is on the web, it's probably been trained on your text. And GPT probably means something like Generative Pretrained Transformer, but we're not really sure; there's a lot of stuff we're not really sure about.
00:28:55
So we can use it for fine-tuning, or at least the earlier versions: if you have access to the model, then you can use it for fine-tuning, but generally we just have access to the predictions of the model. This is just an example of how to do textual entailment: you want to know whether, if I believe "the man is in the doorway", then I believe "the person is near the door". You want to predict whether that's true or not, so you just input these things, you fine-tune the model to make this prediction, and it does well.
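A sketch of how such a fine-tuning example is often formatted for a decoder-style model: the two sentences are packed into one token sequence with delimiters, and a small classification head is trained on top of the final hidden state. The special token names and the linear head here are illustrative assumptions, not GPT's exact recipe:

    import numpy as np

    def format_entailment(premise, hypothesis):
        """Pack premise and hypothesis into a single input sequence with delimiter tokens."""
        return ["[START]"] + premise.split() + ["[DELIM]"] + hypothesis.split() + ["[EXTRACT]"]

    def classify(final_hidden, W_cls, b_cls):
        """A small linear head on the final position's vector predicts entailed vs. not entailed."""
        logits = final_hidden @ W_cls + b_cls
        return int(np.argmax(logits))

    tokens = format_entailment("the man is in the doorway", "the person is near the door")
    print(tokens)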
00:29:38
Okay, so, summary: pretraining has been hugely successful in improving the state of the art in many tasks; it has totally changed the state of the art in just about everything in NLP. There have been lots of different types of these models, depending on the structure of the transformer, the learning objective, the pretraining objective, and the kind of data it's trained on: BART, BERT, GPT, T5. The recent GPT models are very large transformers trained on the left-to-right language modelling task, and they just end up having a huge amount of information, because they have so many parameters and they've been trained on so much data that they just encode a huge amount of information. So next we'll talk about...
