Transcriptions

Note: this content has been automatically generated.
00:00:00
Okay, thank you. Let's move a bit away from the exciting marketing stuff to something more biomedical: biomedical inference. The slides and results I'm going to present come from work carried out in collaboration with colleagues. So, why do we want to do biomedical inference using language models? Because they were trained on a huge corpus, a huge amount of text, and some of them were also trained on biomedical text, so there are a lot of scientific papers in the training corpus. We therefore believe that they learned not only the language but also some of the biological relations and biological facts. What we want to do is extract these facts, extract this information. We can do that, for example, by turning a sentence into a fill-mask task where we want to predict the masked token; that's the first example. Then we can generate text using generative models: in this example the prompt is "An important tumour suppressor gene is", the model generates the rest of the sentence, and we extract the information that was generated. And we can also ask a direct question to ChatGPT, like "What is an important tumour suppressor gene?", and we expect an answer with a definition and a nice description of the mechanism.
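To make the first approach concrete, here is a minimal fill-mask sketch; the model checkpoint and the prompt are illustrative assumptions rather than the exact setup used in this work (a biomedical masked language model could be substituted).

```python
# Minimal fill-mask sketch: predict the token behind [MASK] with an
# off-the-shelf masked language model (illustrative choice of checkpoint).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

prompt = "An important tumour suppressor gene is [MASK]."
for prediction in fill_mask(prompt, top_k=5):
    # Each prediction carries the filled-in token and the model's score.
    print(f"{prediction['token_str']:>12}  {prediction['score']:.3f}")
```

Note that multi-token names such as TP53 cannot fill a single mask slot, which already hints at the tokenization issues discussed later in the talk.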
00:01:35
The results I'm presenting come from a project that is a collaboration between Idiap and an industrial partner, a pharmaceutical company. One of the goals is to accelerate the selection of potential sources of new antibiotics using artificial intelligence. So we are extracting the relation between a fungus, a chemical compound and the bioactivity of that chemical, and we are using language models, ChatGPT for example, but we are comparing ChatGPT with other language models as well.
00:02:13
As I mentioned, broadly speaking there are two types of task we can do here: filling the mask, or text generation. For filling the mask, the example at the top, we're trying to predict which chemical entity or which fungus entity is behind the mask, just by prompting the model. An important point is that we are really strict in this evaluation: we take the models as they were trained, off the shelf, and we don't do any fine-tuning. So this is a really strict evaluation of how much biomedical, biological knowledge is in the model itself; we are not fine-tuning and not adjusting the model to our task.
00:02:59
Then the generation task: again, the prompt is a sentence with a chemical entity, and the model should generate the rest of the sentence, which in this example should contain the name of the fungus.
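A minimal sketch of that generation setup is below; the model, the prompt template and the example chemical are placeholders, not the configuration actually used in the project.

```python
# Illustrative text-generation sketch: complete a sentence that states a
# chemical entity and should end with the name of the producing fungus.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Penicillin is produced by the fungus"  # placeholder chemical
outputs = generator(prompt, max_new_tokens=20, num_return_sequences=3,
                    do_sample=True)
for out in outputs:
    # The generated continuation is what we later scan for a fungus name.
    print(out["generated_text"])
```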
00:03:13
And the last one is asking direct questions: as I mentioned, ChatGPT will give you an exact definition and a description. Semantically, when you read the text, it makes sense; you can hardly distinguish whether it was written by a human or by ChatGPT. But we can only hope that this is factual knowledge, that it is true, because it is very tricky to evaluate; I will talk about that more later.
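For completeness, a sketch of the question-answering route is shown below, assuming the legacy openai Python client (pre-1.0 interface); the API key is a placeholder and the question mirrors the slide example.

```python
# Sketch of asking ChatGPT a direct question (legacy openai client, pre-1.0).
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user",
               "content": "What is an important tumour suppressor gene?"}],
)
# The reply is fluent, but its factual accuracy still has to be checked
# against a curated knowledge base or by a domain expert.
print(response["choices"][0]["message"]["content"])
```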
00:03:43
So, based on our experience, masked language models have some limitations. In this particular task we achieved, let's say, poor performance, and this can be attributed to the fact that chemical names are difficult for these models to process. I think this is what James was mentioning too, about words that are not in the vocabulary. A chemical compound usually has a long, unusual name which most likely was not seen during training, or is simply difficult to tokenize. Maybe a future direction would be to use a more sophisticated tokenizer, but this is what we have for now.
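You can see the tokenization problem directly with a quick check like the one below; the tokenizer and the example compound name are illustrative assumptions.

```python
# Quick look at how a systematic chemical name fragments into subword tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for name in ["aspirin", "2-acetoxybenzoic acid"]:
    pieces = tokenizer.tokenize(name)
    # Rare chemical names split into many pieces, so they are hard to recover
    # from a single [MASK] position.
    print(f"{name!r} -> {len(pieces)} tokens: {pieces}")
```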
00:04:27
Another finding is that predictions are biased by the prompts, which essentially means that the way you ask the question really impacts the output. The questions are about the same thing, but depending on the words you use and the structure of the sentence, the predictions differ, and you need to take that into account: you cannot just ask one question and draw conclusions from one single sentence. You need to do what is called prompt engineering, trying many different prompts.
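In practice that means looping over several paraphrases of the same query, roughly as in the sketch below; the templates and the model are assumptions for illustration only.

```python
# Compare the top fill-mask predictions across paraphrased prompts.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

templates = [
    "Penicillin is produced by the fungus [MASK].",
    "The fungus [MASK] produces penicillin.",
    "[MASK] is a fungus that produces penicillin.",
]
for template in templates:
    top = fill_mask(template, top_k=3)
    # Semantically equivalent prompts can yield noticeably different predictions.
    print(template, "->", [p["token_str"] for p in top])
```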
00:05:03
We also observed that the model echoes information from the prompt, so the predicted chemical name contains the name of the fungus, which is incorrect in this case; sometimes it is correct, but not in this example. Some of the predictions are also just generic. And we observed that the predictions largely follow the frequency of the chemicals or of the fungi in the corpus: the entities that are most frequent in the corpus are the ones most often predicted by the model.
00:05:33
Given that we achieved, let's say, low performance, we decided to test another task: not predicting an entity, a chemical or a fungus entity, which is difficult, but predicting a word that describes the relation, the bioactivity. So in this example the chemical name and the fungus name are already in the prompt, and what we want to predict is a word such as "significantly", indicating activity, or "no growth" in the conditions tested, indicating no activity. This is a simpler task for the model, or at least that's what we believed.
00:06:18
But there are some limitations here as well. As I mentioned, there is prompt engineering, and we have many different prompts, in this case six. The meaning of the prompts is the same, but you can see that we obtain different results. For example, the first prompt has a correct prediction, which is "significantly". This one is almost the same prompt, but with more context: we add the word "compound" and a few more words, so what we are asking for is more complex, and here the prediction is also right. But that's not the rule; sometimes it's quite the opposite. So, as I said, there are limitations: the design of the prompt affects the prediction, and there are ways to optimise the prompts, but it's not that easy. You can't just ask one question and get the correct answer; you need to evaluate much more.
00:07:24
We also asked ourselves why we got such poor performance, and we investigated the prompts again; it's something we call prompt bias. We replaced the name of the chemical and the name of the fungus with random combinations of syllables, and the model still predicts words like "significant" and "excellent". So the conclusion, most likely, is that the words are predicted based on the other words in the prompt, not based on biological knowledge extracted from the model. That's one way you can test this.
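A minimal version of that bias check is sketched below; the model, the prompt template, the example entities and the syllable inventory are all illustrative assumptions.

```python
# Bias check: swap the real chemical and fungus names for random syllable
# strings and see whether the predicted relation word changes.
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def pseudo_word(n_syllables=3):
    # Build a pronounceable nonsense name from random syllables.
    syllables = ["ba", "ko", "ri", "mu", "ta", "zel", "pon", "dra"]
    return "".join(random.choice(syllables) for _ in range(n_syllables))

template = "{chem} [MASK] inhibits the growth of {fungus}."
for chem, fungus in [("penicillin", "Aspergillus niger"),
                     (pseudo_word(), pseudo_word())]:
    prompt = template.format(chem=chem, fungus=fungus)
    top = fill_mask(prompt, top_k=3)
    # If nonsense entities receive the same relation word as real ones, the
    # prediction is driven by the prompt wording, not by domain knowledge.
    print(prompt, "->", [p["token_str"] for p in top])
```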
00:08:13
Just to give you an overview of the complexity of this evaluation, this is the whole list of prompts that we used and the different relations that we are investigating, but that's a bit technical.
00:08:27
So, having said that masked language models do not perform that well, let's talk about generative language models, which are a bit more exciting and more promising for this application.
00:08:41
Again we are prompting: we ask different questions to different models. You can see the models we compared: we start with GPT-2, go through BioGPT and BioGPT-Large, and end up with ChatGPT; we also included Galactica. We start with a very simple prompt, just the entity, the name of the chemical, and the model generates whatever it knows about that entity. Then we get more specific: the entity "is a compound", "is a substance", and so on, adding more and more context. We also have prompts that evaluate the relations, so which fungus produces which chemical: the prompt is "X is produced by" and the rest is generated. In that generated text we are looking for the name of the fungus. So first we want the model to generate a fungus name at all, and second we ask whether the generated name is correct, because the model can generate some fungus name, but the question is whether it is true: does that fungus actually produce this chemical, is it factual knowledge?
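That two-step check, whether the output mentions a known fungus at all and whether the (chemical, fungus) pair is supported by the knowledge base, can be sketched as below; the names and the reference pairs are placeholders, not the project's data.

```python
# Two-step check on generated text: (1) find a known fungus name, (2) verify
# the (chemical, fungus) pair against a reference knowledge base.
known_fungi = {"Penicillium chrysogenum", "Aspergillus niger", "Fusarium oxysporum"}
true_pairs = {("penicillin", "Penicillium chrysogenum")}  # placeholder reference

def extract_fungus(generated):
    # Naive matcher: return the first known fungus name found in the text.
    for name in known_fungi:
        if name.lower() in generated.lower():
            return name
    return None

generated = "Penicillin is produced by Penicillium chrysogenum, a filamentous fungus."
fungus = extract_fungus(generated)
mentions_fungus = fungus is not None                 # step 1: a name at all
is_factual = ("penicillin", fungus) in true_pairs    # step 2: is it true
print(fungus, mentions_fungus, is_factual)
```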
00:09:58
Some examples for one chemical. We start with Galactica; to be fair to the model, Galactica has a very specific prompt format, so we followed it. This is not the biggest Galactica available; the biggest one has 120 billion parameters, and we used one of the smaller ones. Here it basically just repeats the entity, so the semantic consistency is zero. We evaluate the outputs on two criteria: semantic consistency and factual knowledge.
00:10:31
Semantic consistency here simply means: if you read the answer, is it fluent and meaningful? Could you tell it was written by a machine, with some errors? No, it's fluent text that a human could have written, so that's fine. Factual knowledge is different: if you compare the answer with the knowledge base, is it true or not?
00:10:53
For BioGPT you can see that there is semantic consistency, a nice definition-like text, but unfortunately it is not true. If I were not a biomedical expert, I would read that definition and say, fine, I believe it. The correct definition is the one produced by BioGPT-Large, which was announced only a few weeks ago, I think; I mean, it became available a few weeks ago. There the definition is correct. And then we have ChatGPT. In the case of ChatGPT we did not really limit the length of the generated text, just to see the whole picture. The semantic consistency here is also really good, it's fluent and meaningful, and it happens that in this case all the facts in the generated definition are correct. But that's not always the case; it is not always correct.
00:11:52
Another table with results, just briefly, to give you a view of a systematic evaluation of these models. Here I just want to show that it depends on the prompt: some models are consistent in generating output and some are not. If there are blank spaces, it means the model did not generate any text, or generated text that did not contain a fungus name. You can also see the evolution of the models: GPT-2 is the oldest one, about four years old, which is not that old, but still the oldest in this comparison. Moving to ChatGPT, it always generates a fungus name, and in this case it is correct. Another result is also quite interesting, because this chemical, as far as I know, can be produced by all of the fungi presented here, so all the answers are correct, yet different models produce different answers with different prompts. So again there is some inconsistency, still correct in this case, but the question is why some models predict this fungus and others a different one.
00:13:19
So how did they perform? Unsurprisingly, ChatGPT outperforms the rest of the models. Again, we are really strict in our evaluation: we are not giving in-context examples, we are not doing few-shot prompting, we are just asking simple prompts to the pre-trained model. About sixty percent of the time ChatGPT was correct and the answer also contained a specific description. Sometimes a model generates a definition like "this chemical is a hazardous compound that has many interesting properties"; that may be true, but it isn't specific, so it's not really useful in our case. The definition must also be specific.
00:14:10
Yet another interesting thing about ChatGPT: I asked the same question two times and got slightly different answers, still correct, I guess. But I wonder if you have already noticed one of the problems here. If I want to evaluate the definitions produced, it helps a lot to be an expert, because otherwise, for each output, I would have to go to Wikipedia or some other knowledge base to evaluate the model, just to compare these two texts and decide whether they are correct or not. The text is so fluent that there can be just one minor difference, like one chemical name swapped for another, and that makes the definition incorrect. So this is important: evaluating the model, evaluating generated text in this context, is really time-consuming and demanding. You need a biomedical expert who reads the generated text and really uses their knowledge to evaluate it.
00:15:30
Imagine you want a systematic analysis of a generative model and you have, say, one hundred or five hundred examples to evaluate: it is very time-consuming, and that needs to be taken into account. In computer vision, for instance, there is always the problem of annotating the data, annotating the images, but I would say that annotating images can be even easier than evaluating generated text, which is more time-consuming and more challenging.
00:16:08
And here is a graph which is maybe interesting because it shows the statistical properties of the models. These are the occurrences of fungus names in the text generated by all the models. Basically, Aspergillus niger is the most frequent fungus, and it happens that it is also the most frequent in the knowledge base, in the papers. So again we observe that the generative models still rely, to some degree, on occurrence statistics, on the frequency of the words in the corpus.
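This kind of frequency analysis can be reproduced with a few lines like the sketch below; the generated texts and the fungus list are placeholders for the project's actual outputs and knowledge base.

```python
# Count how often each known fungus name appears in the generated outputs,
# to compare against its frequency in the reference corpus.
from collections import Counter

generated_texts = [
    "... produced by Aspergillus niger ...",
    "... produced by Aspergillus niger ...",
    "... produced by Penicillium chrysogenum ...",
]
known_fungi = ["Aspergillus niger", "Penicillium chrysogenum", "Fusarium oxysporum"]

generated_counts = Counter(
    name for text in generated_texts for name in known_fungi if name in text
)
# A strong correlation with corpus frequency suggests the model favours the
# most common fungi rather than the biologically correct one.
print(generated_counts.most_common())
```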
00:16:51
So, concluding: there is, I must say, significant progress in language models. Going from GPT-2 to ChatGPT there is significant progress, no doubt about it. Still, there are some limitations. As I said, the outputs become very fluent and meaningful, but at the same time that makes it extremely challenging to verify the answers. On the other hand, we need systematic analysis, systematic evaluation, because evaluating single, individual examples really doesn't tell you what the limitations of the model are. We need a large number of evaluated examples, from which we get some kind of percentage or similar metric that gives us an estimate of the limitations of the model. And again, this is a very exciting and promising direction for AI and language models, so I guess in the next year, maybe the next two years, there will again be a huge improvement.
