Transcriptions
Note: this content has been automatically generated.
00:00:00
hello everybody, and thanks for having me, thanks for the invitation
00:00:05
so I'm going to talk a little bit
00:00:08
about privacy, machine learning, and healthcare
00:00:13
and tell you a little bit about the work of many people across, I would say, the last
00:00:19
five years. So as you know, our healthcare system
00:00:24
is facing unprecedented challenges: we have a constantly ageing population
00:00:31
the cost of healthcare is increasing without really
00:00:35
any plateau in sight, so our insurance premiums
00:00:40
will increase as well as of next year, apparently
00:00:43
and we had to deal with the pandemic. So all of these curves seem to be
00:00:49
diverging quite a bit. So something that the whole community has
00:00:56
been talking a lot about is precision medicine: how can we make sense of this
00:01:03
massive amount of data we're generating in hospitals and laboratories,
00:01:08
data that can go from the very tiny, like the genome,
00:01:14
up to the transcriptome, the effect of the environment, or also
00:01:20
mobile devices, Fitbits, and so on, up to
00:01:26
what we can collect in hospitals, which
00:01:29
can be information that you can find in the electronic health
00:01:32
record: it can be structured, it can be unstructured in the form of text, it can be signals, it can be imaging
00:01:40
so all of this data seems good, but how do we make sense of it
00:01:44
and essentially transition from the current approach, where you
00:01:51
have a one-treatment-fits-all approach, to so-called personalised medicine or
00:01:56
preventive medicine, which hopefully will be able to target therapies to
00:02:01
a subgroup of people and try to reduce costs
00:02:04
at the same time. So of course AI
00:02:10
comes into the picture, and it brings a lot of promises on
00:02:13
how to make sense of this data and how to use this data to make predictions
00:02:19
but especially in the healthcare sector,
00:02:23
its implementation is hindered by several challenges.
00:02:27
So we have the data sharing and privacy challenge;
00:02:30
I will expand a little bit on that during the talk.
00:02:34
Then of course data quality and standardisation: the data we have in hospitals
00:02:39
is very sparse, there's a lot of missing data,
00:02:41
it's often of low quality, with very few labels,
00:02:47
so it's hard to use. We have
00:02:50
challenges about transparency of data: where is the data coming
00:02:54
from, what was the process leading to the
00:02:57
data, which data was collected by whom, what kind of intermediate processing
00:03:02
it underwent, and also of course transparency of AI algorithms; the talks from
00:03:08
this morning mentioned explainability and interpretability, and this is a whole other field of research.
00:03:15
But then there are questions also about patient safety: we've seen in the AI Act
00:03:21
the different risk categories, and I'm pretty sure
00:03:25
AI in healthcare falls into the high-risk category.
00:03:29
And accountability: if a mistake is made,
00:03:34
who is accountable? Is it the AI developer, the doctor, the one who collected the data in the first place?
00:03:41
So all of these are still open questions, and another
00:03:45
important aspect is of course the interaction of AI with
00:03:49
the workforce: how will AI change the clinical workflows, and so on.
00:03:56
So now I'm going to focus on probably the very first problem we have to
00:04:03
address in the medical field, which is getting access
00:04:07
to data. In the medical field it is extremely difficult
00:04:12
to get access to data, and the
00:04:15
reality is that today, even if there are
00:04:20
tens, maybe hundreds, of papers coming out every
00:04:24
year about predictive models and how we can actually
00:04:28
do things in healthcare, the reality is that very few of these
00:04:33
models get into the clinic and can be used at the bedside.
00:04:38
And this is because most of these models are not validated:
00:04:42
they usually stop at "okay, I can get good performance on my test set",
00:04:47
but rarely do they go on to a validation on another
00:04:51
cohort, a so-called external validation. And this publication
00:04:57
from last year showed that this is essentially an evolving trend:
00:05:04
more and more publications come out, but very few are validated.
00:05:10
And one of the reasons, not the only one of course, is data sharing: data sharing is extremely difficult
00:05:16
in the medical sector. We have technical challenges about
00:05:21
interoperability: the way we collect data, for example, at the hospital in Geneva
00:05:26
is different from the way we collect data
00:05:30
in Lausanne, so already sixty kilometres apart we cannot really share data easily.
00:05:36
And then we have the fear of cybersecurity, of cyber attacks, and all that.
00:05:43
But there is also a cultural challenge: there's a lot of this notion of data ownership.
00:05:49
A doctor who sees a patient believes that the data of these patients belongs to him;
00:05:55
that's not true: the data belongs to the patient, and the hospital is just the custodian.
00:06:00
But somehow this is a cultural roadblock
00:06:07
that prevents data from flowing around more easily. And of course, as we've mentioned
00:06:12
many times, regulation: the federal act on data protection, the GDPR,
00:06:17
and so on and so forth. So how do people do it right
00:06:21
now? The usual approach for data sharing is trying to
00:06:25
centralise the data in a single place where people can then get access to it,
00:06:31
and this is known as the paradigm of sending the data to the algorithm.
00:06:36
But of course, from a security standpoint, you have to trust the institution that is collecting this data to be
00:06:43
secure, to have a secure infrastructure, to control
00:06:47
who has access to the data, and so on; and
00:06:50
this can be seen as a single point of failure: if institution
00:06:54
C in this example is compromised, then all the data from institutions A and B is also compromised.
00:07:00
So of course this implies a loss of control, and one way of mitigating
00:07:05
the risks is of course traditional pseudonymisation
00:07:08
and anonymisation techniques, which have been shown to be
00:07:12
ineffective in many examples in the literature over the last ten to twenty years.
00:07:18
So in the end what happens is that people rely on legal contracts: we put lawyers
00:07:25
into the process, and then of course it
00:07:29
takes a lot of time, it is costly, lawyers don't agree with each other,
00:07:33
and then research doesn't happen; it's very slow.
00:07:39
So a new approach that was introduced a few years ago by Google is
00:07:45
the federated approach to data sharing, where instead of sharing the data we're going to share the algorithm.
00:07:53
The advantage is that, I would say, it is compliant; it's private by design:
00:07:59
we're not exposing data to direct
00:08:03
privacy leakage, because we're not sharing the data
00:08:06
but just sharing an aggregated form of this data, which can
00:08:09
be statistics or a machine learning model that we might have trained.
00:08:15
So the advantage, as I said, is that it allows
00:08:19
hospitals or data providers to keep control over their data.
00:08:25
But it comes with some questions that I think are not yet fully
00:08:32
understood, hopefully answered. Of course, we've
00:08:36
seen this morning that a machine learning model
00:08:41
cannot be considered anonymous data, so the question is: is
00:08:44
this federated approach really privacy-preserving? Is it really worth it?
00:08:49
That is question number one. Question number two is:
00:08:53
do we need to trust institution C or not? Because in the end institution C will still get
00:08:59
all the models, or the partial models trained at each institution,
00:09:03
to do the aggregation. And what about its legal qualification:
00:09:09
are we talking about anonymisation or pseudonymisation? So this is just
00:09:15
an example of federated learning, but I guess this is not new for many of you.
00:09:19
It works as follows: a model is trained locally at each institution, and then this
00:09:25
is shared with a central party, which can be one of
00:09:28
the institutions or the cloud; then there is an aggregation
00:09:32
that happens, and the global model is then shared back
00:09:38
to the local institutions for another round of training,
00:09:42
and so on and so forth, for a certain number of epochs.
00:09:46
And this basically allows us to share not the data but only the model.
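The round structure just described is essentially federated averaging (FedAvg). A minimal sketch in NumPy, where the linear model, the synthetic hospital datasets, and the size-weighted averaging are illustrative assumptions rather than the system described in the talk:

```python
import numpy as np

def local_train(weights, X, y, lr=0.1, steps=10):
    """Local round: plain gradient descent on a linear model with
    squared loss (a stand-in for whatever trainer each site runs)."""
    w = weights.copy()
    for _ in range(steps):
        w -= lr * 2 * X.T @ (X @ w - y) / len(y)
    return w

def fedavg_round(global_w, institutions):
    """Each institution trains locally; the central party averages the
    resulting models, weighted by local dataset size, and sends the
    global model back for the next round."""
    local_models, sizes = [], []
    for X, y in institutions:
        local_models.append(local_train(global_w, X, y))
        sizes.append(len(y))
    return np.average(local_models, axis=0, weights=sizes)

# three "hospitals" holding private samples of the same linear relation
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
hospitals = []
for n in (30, 50, 20):
    X = rng.normal(size=(n, 2))
    hospitals.append((X, X @ true_w + 0.01 * rng.normal(size=n)))

w = np.zeros(2)
for _ in range(20):              # communication rounds
    w = fedavg_round(w, hospitals)
```

Only `w` travels between sites; the raw `(X, y)` pairs never leave their hospital, which is the point of the scheme.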
00:09:52
Of course, many people think that
00:09:55
sharing statistics or local models is privacy-preserving.
00:10:00
This might be the case when we have a lot of patients and not many attributes,
00:10:04
but in precision medicine we are more and more stratifying patient
00:10:09
populations, so we end up in a situation where we have
00:10:13
few patients and a lot of attributes, and this
00:10:20
poses problems. So let us look in particular at machine learning, since we were talking about AI:
00:10:28
one of the, I think, fundamental problems is that in the end we can consider
00:10:35
a machine learning model as a lossy, compressed version
00:10:39
of the original dataset: it still captures
00:10:43
information about the original training set. So essentially
00:10:49
we can exploit this knowledge, because the model will behave differently
00:10:55
if it is exposed to the training data that was used to train the model
00:11:00
or if it is exposed to new data. Essentially,
00:11:05
the whole community around privacy in machine learning has been
00:11:09
exploiting this phenomenon to perform different types of inference attacks.
00:11:15
We have the membership inference attack, we talked about this today:
00:11:19
trying to infer if a given individual was part of the training set.
00:11:23
We have the attribute inference attacks, where we try, for example, to infer attributes
00:11:29
of a partially known record in the training set.
00:11:33
There is property inference, and then we have gradient inversion,
00:11:36
or data reconstruction. Of course,
00:11:42
membership inference is the simplest one, but
00:11:46
if you can do membership inference, it is very likely that you can also do the others.
00:11:52
And the intuition behind that is that the model
00:11:59
will behave differently on samples that are in the training set
00:12:04
with respect to samples that are not in the training set. Ideally,
00:12:09
we would want the model to generalise and behave the same way on both,
00:12:14
but in reality what happens is that models, and more sophisticated models especially, tend to overfit
00:12:21
on the training data, and then you are going to be able to distinguish.
00:12:26
So as long as there is this generalisation gap between
00:12:32
the loss of your evaluation function
00:12:37
on your training data and on your validation and testing data,
00:12:42
you are always able to infer some private information about the training set.
00:12:49
And this can be represented essentially as
00:12:54
trying to reconstruct the loss distribution on the training
00:12:59
set versus the loss distribution on the non-training set, and if you are able
00:13:04
to really separate those two distributions, then
00:13:08
essentially you are able to carry out these
00:13:11
inference attacks, like the membership inference attack. And how can this attack be done?
00:13:17
Essentially, there are techniques called shadow model attacks,
00:13:22
where you create multiple models that have the same
00:13:25
architecture as the one you are trying to attack, and then
00:13:31
you can use these shadow models to rebuild the loss distributions of
00:13:38
the examples in the training set and not in the training set, and then train a second model,
00:13:45
which is essentially a classifier that will try to distinguish
00:13:49
the two distributions; and then, when you submit
00:13:53
a new record to this attacker model, it will tell you whether it belongs to the training set or not.
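A compressed sketch of this membership inference idea: one shadow model instead of many, and a simple loss threshold in place of the trained attack classifier. The synthetic data and the deliberately overfitting logistic regression are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -30, 30)))

def fit_logreg(X, y, lr=0.5, steps=500):
    """Plain logistic regression; few samples and many features
    make it overfit, which is exactly what the attack exploits."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (sigmoid(X @ w) - y) / len(y)
    return w

def per_example_loss(w, X, y):
    p = np.clip(sigmoid(X @ w), 1e-9, 1 - 1e-9)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def sample(n):
    """Synthetic 'patients': 20 features, weakly informative labels."""
    X = rng.normal(size=(n, 20))
    y = (X[:, 0] + rng.normal(scale=2.0, size=n) > 0).astype(float)
    return X, y

# 1) attacker trains a shadow model on data it controls and learns
#    a loss threshold separating member from non-member losses
Xs_in, ys_in = sample(30)
Xs_out, ys_out = sample(30)
w_shadow = fit_logreg(Xs_in, ys_in)
thr = (per_example_loss(w_shadow, Xs_in, ys_in).mean() +
       per_example_loss(w_shadow, Xs_out, ys_out).mean()) / 2

# 2) the victim trains the target model on its private records
Xt_in, yt_in = sample(30)
w_target = fit_logreg(Xt_in, yt_in)
Xt_out, yt_out = sample(30)   # records NOT in the training set

# 3) attack: guess "member" when the target model's loss is low
guess_in = per_example_loss(w_target, Xt_in, yt_in) < thr
guess_out = per_example_loss(w_target, Xt_out, yt_out) < thr
attack_acc = (guess_in.mean() + (1 - guess_out.mean())) / 2
```

The gap between the two loss distributions is exactly the generalisation gap mentioned above: the more the target model overfits, the higher `attack_acc` climbs above chance.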
00:14:00
And of course this can be pushed pretty
00:14:03
far; actually, very recent papers have shown that
00:14:08
data can be reconstructed in an almost identical manner,
00:14:14
especially in federated learning, where the datasets are smaller; that is precisely why you want
00:14:21
to put them together: the datasets at the different sites are smaller, so models
00:14:25
that are coming from the different hospitals are more susceptible to these kinds of reconstruction attacks.
00:14:31
This works for convolutional neural networks and
00:14:35
also for GPT-like models,
00:14:39
and here, for example, is an example where people
00:14:44
from the technical university in Munich, in Germany,
00:14:50
showed that they could reconstruct some MRI data with pretty good accuracy.
00:14:56
So all of this is to say that in the end, even if we're sharing the model and not the data,
00:15:02
we're not really solving the problem. So, answering the questions we asked
00:15:09
in the beginning: is the federated approach really privacy-preserving? The answer is no.
00:15:15
What is the level of trust? I think we have to treat institution
00:15:18
C as if it were seeing the individual-level data.
00:15:24
And we definitely cannot consider federated learning an anonymisation technique.
00:15:30
So, sorry about that. But the interesting thing is that the privacy
00:15:37
and security community has been developing for years so-called privacy-enhancing technologies:
00:15:43
technologies rooted in mathematical principles, cryptography, or statistics,
00:15:50
that can be used to complement federated learning and try to mitigate the problem.
00:15:55
So I am going to show you how the combination of different technologies could be
00:16:06
something really interesting. Also, as
00:16:10
one of the talks this morning mentioned, there is no single technology that
00:16:16
can solve all the problems; most of the time it is a combination of different ones.
00:16:16
So the first approach, thank you, the first approach that
00:16:25
comes to mind is using differential privacy. Differential privacy
00:16:30
is a formal notion of privacy that essentially tells you that if you
00:16:36
have two databases that differ in just one record, X and X-prime,
00:16:43
and you have an algorithm or an analysis M, then the probability of the outcome,
00:16:48
or rather the ratio of the probabilities of the outcome on the two datasets, is bounded by
00:16:53
a finite number, which is e to the epsilon,
00:16:58
and this provides very, very strong guarantees.
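Written out, this guarantee, epsilon-differential privacy, is the standard textbook formulation (the formula itself is not from the talk):

```latex
\frac{\Pr[M(X) \in S]}{\Pr[M(X') \in S]} \;\le\; e^{\varepsilon}
\qquad \text{for all neighbouring databases } X, X'
\text{ and all sets of outcomes } S.
```

A small epsilon means the analysis M produces almost the same output distribution whether or not any single record is present, which is what bounds the inference attacks described earlier.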
00:16:58
When we apply differential privacy to machine learning, we have a version of stochastic
00:17:02
gradient descent that uses differential privacy, where essentially you add some noise during
00:17:10
the gradient descent algorithm,
00:17:15
and this essentially provides you with these privacy guarantees.
00:17:20
So one way of looking at differentially private federated learning is basically that,
00:17:27
instead of sharing models in the clear,
00:17:29
you use differentially private
00:17:33
stochastic gradient descent when you train your model, so you introduce some noise into the model training,
00:17:41
and you have some provable mathematical guarantees that these
00:17:44
kinds of inference attacks cannot be performed, or at least that their success is bounded.
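The core of a DP-SGD step can be sketched as follows: clip each per-example gradient to a norm bound, sum, and add Gaussian noise. The linear model, squared loss, and parameter values below are illustrative assumptions, and the sketch leaves out mini-batch subsampling and the privacy accounting needed to state an actual (epsilon, delta) guarantee:

```python
import numpy as np

def dp_sgd_step(w, X, y, lr=0.1, clip=1.0, noise_mult=1.0, rng=None):
    """One DP-SGD step on a linear model with squared loss:
    per-example gradients are clipped to L2 norm <= clip, summed,
    and Gaussian noise with std noise_mult * clip is added before
    the averaged update is applied."""
    rng = rng or np.random.default_rng()
    grads = 2 * (X @ w - y)[:, None] * X           # per-example gradients
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    grads = grads / np.maximum(1.0, norms / clip)  # clip each gradient
    noisy_sum = grads.sum(axis=0) + rng.normal(scale=noise_mult * clip,
                                               size=w.shape)
    return w - lr * noisy_sum / len(y)

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, 0.5, -0.5])
y = X @ true_w
w = np.zeros(3)
for _ in range(300):
    w = dp_sgd_step(w, X, y, rng=rng)
```

Because the clipping caps the sensitivity of the noisy sum, the Gaussian noise masks any single record's contribution; the price is exactly the utility toll the talk discusses next.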
00:17:50
Of course, the use of noise comes with a huge toll on prediction performance and also some other
00:17:56
undesired effects: it introduces some bias and it increases
00:18:01
the arbitrariness of predictions, so it is probably not the best way of solving the problem on its own.
00:18:07
Another technology that was discussed today is homomorphic encryption: a particular type
00:18:14
of encryption that allows you to compute on encrypted data. And just as an example,
00:18:23
as a toy example: this started to become popular
00:18:27
when people wanted to use the cloud to compute on sensitive data.
00:18:34
Essentially the idea is that you can encrypt your sensitive dataset and send it to the cloud,
00:18:40
and then for example the cloud would run some inference or some
00:18:45
segmentation on the image and then send it back to the hospital, where
00:18:50
the decryption can happen. So this is more or less how homomorphic encryption works.
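As a self-contained toy of "computing on encrypted data", here is textbook Paillier encryption, which is additively homomorphic, with deliberately tiny and insecure parameters; production systems of the kind discussed later in the talk use lattice-based schemes and much larger keys:

```python
import math
import random

rnd = random.Random(0)          # seeded only to make the demo reproducible

# --- textbook Paillier key generation (toy, insecure parameters) ---
p, q = 1000003, 1000033         # real keys use ~1024-bit primes
n, n2 = p * q, (p * q) ** 2
lam = math.lcm(p - 1, q - 1)    # private key
mu = pow(lam, -1, n)            # modular inverse (Python >= 3.8)
g = n + 1                       # standard choice of generator

def encrypt(m):
    """c = g^m * r^n mod n^2, with fresh randomness r per ciphertext."""
    r = rnd.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    """m = L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) / n."""
    return ((pow(c, lam, n2) - 1) // n) * mu % n

# Homomorphic property: multiplying ciphertexts adds the plaintexts,
# so an untrusted aggregator can total the parties' inputs unseen.
c_total = 1
for value in (17, 25, 8):       # each party's private contribution
    c_total = (c_total * encrypt(value)) % n2

total = decrypt(c_total)        # only the private-key holder sees 50
```

The aggregator only ever handles `c_total`, a random-looking number mod n squared, yet the key holder recovers the exact sum.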
00:18:58
And it works well for two-party computation: you have
00:19:04
the person owning the data, the data controller, and the data processor,
00:19:09
and this works pretty well in this setting.
00:19:13
And there has been a lot of progress in breaking the boundaries and
00:19:19
the perception of this being a very complex and resource-intensive computation;
00:19:26
so now, I think, we're able to train simple machine learning models on encrypted data,
00:19:31
and this is moving pretty fast.
00:19:37
Of course, the problem comes when we have more than two parties, when you want to share data
00:19:43
across more than two parties. Another interesting technology is secure multiparty computation, which
00:19:50
does not use homomorphic encryption but uses other
00:19:53
techniques, always trying to protect privacy, where essentially the goal
00:19:57
is to compute a function over secret inputs without each party revealing
00:20:02
its input to the other parties. One of the limitations
00:20:08
of secure multiparty computation is that it comes with a high communication overhead,
00:20:13
because there is a lot of back and forth messaging across the different entities.
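The simplest instance of this idea is additive secret sharing, sketched below with made-up hospital counts: each party splits its private value into random shares that individually look uniform, so no single share (or partial sum) reveals anything, yet combining all shares reconstructs the joint result:

```python
import random

P = 2**61 - 1                 # all arithmetic is done modulo a large prime
rnd = random.Random(3)

def share(secret, n_parties):
    """Split `secret` into n additive shares: the first n-1 are uniformly
    random, the last one makes them all sum to the secret mod P."""
    shares = [rnd.randrange(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

# three hospitals, each with a private count they will not reveal
inputs = [120, 45, 300]
all_shares = [share(x, 3) for x in inputs]

# party j collects the j-th share from every hospital and adds them;
# each partial sum is itself just a random-looking number
partials = [sum(all_shares[i][j] for i in range(3)) % P
            for j in range(3)]

# only combining all three partial sums reconstructs the total
total = sum(partials) % P
```

The communication overhead the talk mentions comes from exactly these exchanges: every party must distribute a share to every other party, and operations beyond addition require further interaction.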
00:20:18
So what we decided to use eventually was a combination of the two:
00:20:25
a combination of homomorphic encryption and secure multiparty computation that
00:20:29
is known as multiparty homomorphic encryption, where the idea is that
00:20:37
essentially you use secure multiparty computation
00:20:40
for every operation that involves a secret key,
00:20:45
and you use homomorphic
00:20:47
encryption for performing operations in an outsourced fashion.
00:20:53
So essentially how this would work is that you would create
00:20:59
an encryption key that you would secret-share, with a secure multiparty computation algorithm, among the different parties,
00:21:07
and then use the encryption key, which is public, to encrypt the data;
00:21:11
the data can then be processed under homomorphic encryption,
00:21:14
and then whenever you need some bootstrapping, to refresh
00:21:18
your encryption and allow for more complex operations on the ciphertext,
00:21:24
you can instead use a secure multiparty computation protocol, which makes things much faster
00:21:30
than homomorphic encryption alone. And so the idea that we had was using
00:21:36
multiparty homomorphic encryption on top of federated learning, where essentially
00:21:41
the computation that happens within each institution's security boundary happens on
00:21:47
plaintext, and then we would encrypt the model with multiparty homomorphic encryption.
00:21:53
So this would guarantee that essentially institution C
00:21:57
would not see anything but encrypted models; we could
00:22:01
still perform the aggregation, because it relies on the homomorphic properties of the cryptosystem,
00:22:06
and as a consequence inference attacks during the training process would
00:22:10
not be possible anymore; and then you add differential privacy
00:22:14
only at the very end, essentially mitigating
00:22:18
the utility loss that you would encounter
00:22:22
if you used differential privacy at every step of the process.
00:22:29
So of course we wanted
00:22:32
to test how this approach could scale,
00:22:36
and we tried to reproduce some multi-centric studies that
00:22:41
had carried out their computation by centralising the data in a single place. So we took the same dataset,
00:22:46
we split it apart, and then we tried a survival
00:22:50
analysis, and with that we could reproduce exactly the same results
00:22:55
regardless of the number of data providers. This was scaling very well:
00:23:00
even with almost a hundred data providers we could stay within
00:23:05
ten seconds. We also tried more intensive computations, like genome-wide association studies,
00:23:12
where you would basically train one model for each of the positions
00:23:16
in the genome, and there we showed that
00:23:20
we could essentially reproduce the same results.
00:23:24
So here on the left you have the original cohort: this is a standard Manhattan
00:23:29
plot, where essentially the dots that pass the significance
00:23:34
line are considered to be associated with the phenotype
00:23:37
that you are studying; in this case it was the viral load of HIV.
00:23:42
And then you see that if you use the same approach,
00:23:45
by distributing the data in different places and running this under homomorphic encryption,
00:23:50
you get the same results. This is much better than what people do nowadays, which is
00:23:55
meta-analysis, where instead of collaboratively training a model on the data you just share statistics,
00:24:02
and it is much better than training a model on just one single hospital's data,
00:24:06
where essentially you would see no signal. And of course, in terms of scalability, this takes more than the
00:24:13
survival analysis, but it is still in the realm
00:24:16
of practicality. This was published, and many publications followed,
00:24:22
including a recent one about cell classification
00:24:27
using convolutional neural networks with this approach. And beyond the technology,
00:24:35
I mean, we had pretty good publications, but something that I think is not that
00:24:41
frequent is the transfer of the technology into
00:24:45
a spinoff and the ability to put this into
00:24:49
concrete use cases. So the lab of which
00:24:54
I was part essentially created a spinoff,
00:24:59
and we worked with a partner of the spinoff in the Swiss Personalized Health
00:25:05
Network, which is this initiative in Switzerland trying to
00:25:08
put together and share data across university hospitals,
00:25:12
on a couple of use cases. One is in precision oncology, where we built a whole software system
00:25:18
that is able to provide the oncologists in the molecular tumor board with a tool that allows
00:25:24
them, for example, to compute survival curves in real
00:25:27
time across all patients in the five university hospitals.
00:25:33
And we also implemented the same approach in laboratory medicine, where
00:25:36
you can define, for example, reference ranges for laboratory tests
00:25:41
that are more personalised to the individual patient. And one question,
00:25:47
and I'm almost at the end, is of course: what is the legal qualification of this approach?
00:25:53
So we worked with legal and ethical experts
00:25:56
from ETH, and in particular
00:26:00
we essentially analysed this, and we argue
00:26:06
that this technology can provide anonymisation according to the GDPR
00:26:10
if we take a relative approach to the GDPR. And we submitted
00:26:15
this also to the federal data protection authority in Switzerland to have their feedback,
00:26:21
and the feedback was clearly in the same
00:26:23
direction that we alluded to in the paper. So,
00:26:29
to conclude, I think that, as
00:26:33
opposed to maybe other domains that are less regulated,
00:26:37
the adoption of AI and ML in healthcare
00:26:41
is lagging due to several open challenges, and of course data sharing
00:26:46
is probably the first one that we need to solve, because
00:26:49
without data we cannot build any model. And
00:26:53
I think there is agreement in the community that privacy-enhancing technologies,
00:26:58
coupled with federated learning or federated analysis, can be really instrumental
00:27:03
in enabling large-scale, privacy-preserving data sharing
00:27:08
projects, and with this kind of legal assessment I think you can really
00:27:15
boost data sharing across jurisdictions. And of course, and I'm very proud
00:27:21
of this work, we did this also with the startup, because most of the time
00:27:26
as academics we stop at the paper, but seeing this used by doctors
00:27:33
was, I think, at least for me, what we
00:27:37
are working for. So of course we don't want to stop at Switzerland:
00:27:41
we're also talking with other colleagues in the US, in Germany, and so on;
00:27:46
so the goal would be: can we scale to hundreds of organisations with this approach?