Transcriptions

Note: this content has been automatically generated.
00:00:00
o. i. e. at all yep
00:00:12
Hello everyone, and thanks for joining us. Today, Olivier and I
00:00:17
are going to talk about high-performance privacy by design using Matryoshka and Spark.
00:00:25
I'm Wiem Zine Elabidine, a Scala backend developer,
00:00:28
and in my free time I contribute to ZIO.
00:00:33
And I'm Olivier Girardot. I'm a big data architect, engineer, and co-founder
00:00:37
of Lateral Thoughts, a Paris-based consulting company. So today we're going to
00:00:42
talk a bit about the privacy framework that we built, giving you a
00:00:46
small introduction, then we'll see the data structures and how to handle them. Then
00:00:51
we're going to talk about recursion schemes using Matryoshka,
00:00:54
which Wiem will cover, and then present in depth the
00:00:58
three privacy engines that we built with this framework. So what are
00:01:03
we talking about? Well, we're talking about user information, we're talking about protecting users'
00:01:10
data. This is especially important in the European Union with the GDPR regulation. So
00:01:16
what are we talking about precisely when we talk about user information? Well,
00:01:21
here's an example of a piece of data. In this piece of data there are,
00:01:26
well, a few fields. Some would be considered important, would be considered user information,
00:01:32
and some would be just metadata, or not that important. But what would
00:01:37
we like to protect? Well, for example, we would like to protect the name of the person,
00:01:43
their email, their phone number, their
00:01:46
address, and maybe their last position. So
00:01:51
when I say protect, I mean that in any company we need to give access to
00:01:56
this kind of data to data engineers, data
00:02:00
analysts, data scientists. The goal is to have
00:02:04
a clear sense of which kind of data you can divulge and which kind of data you need to anonymize.
00:02:10
And when I say protecting, well, there are many ways to protect information. You know, for a data science process
00:02:17
to be able to work, sometimes we don't want to anonymize
00:02:20
or hash everything. We want, for example, to
00:02:25
encrypt the company ID with a hashing mechanism, but
00:02:31
maybe we want to mask the person's name using
00:02:36
"x x x" while keeping the first letter or the last letter
00:02:40
of the name. And especially for emails, we might want to keep the domain name.
00:02:45
So we have many encryption functions that we might
00:02:49
want to apply, and these functions depend not exactly on
00:02:54
the type of the field, because the type is just going to be, for example, string, but
00:02:59
on some deeper meaning, on what this information semantically represents.
00:03:05
So our goal is to build a generic privacy framework that can
00:03:09
dynamically apply privacy on specified fields with different encryption functions.
00:03:17
When we are building a framework, we need to first clearly understand and
00:03:22
get a sense of what we're dealing with. So we're going to use a very
00:03:25
usual separation in our domain: we're going
00:03:30
to separate a dataset into
00:03:33
its schema on one hand and the data itself on the other hand. In big data it's very usual.
00:03:40
The schema is the field names and their types, so that's,
00:03:44
for instance, integers or strings, and the data is going to be like
00:03:49
raw blobs of data that we are going to store. We do that in big data for many purposes, but
00:03:55
mainly to be able to handle schema evolution and to be
00:03:59
able to keep a single schema for billions and billions of records of data.
00:04:05
But it's not sufficient for a privacy framework. We need to add
00:04:10
metadata, information that will tell you, well,
00:04:14
what is worth protecting, what is
00:04:18
related to a person: is it something I
00:04:21
need to protect, or is it something related to a company, for example?
00:04:25
So in our framework we use a specific set of tags that are,
00:04:31
well, inspired by the domain of the semantic web. So we use
00:04:35
RDFS-style tags to be able to semantically annotate pieces of the data,
00:04:40
to say: well, this is a person and this is their address, this is a person and this
00:04:45
is their email, this is a company and this is an address, so I don't need to protect that.
00:04:52
Using this kind of separation between, you know, the
00:04:55
schema, the data, and the schema annotated with tags, we
00:04:58
can define privacy strategies based on those tags,
00:05:02
in a more generic way than we would
00:05:06
on a specific dataset. Well, we're not going to say "I'm going to protect
00:05:10
the name field of this dataset"; we're going to say: for any given person's email,
00:05:15
apply this strategy; for any person's password, apply this strategy; for any person's ID, delete the field,
00:05:22
we don't want it to appear anywhere else. This is particularly important. Well, this
00:05:27
is of interest to anyone interested in law, but
00:05:33
the European Union went to great lengths not to define precisely what user information and
00:05:38
what personal data are. So this is a way to define what is worth protecting for us.
00:05:45
So how can we express such a privacy framework using simple
00:05:49
types? Well, we're going to say that our privacy strategies are a simple map:
00:05:53
we have the sequence of tags that we want to match
00:05:57
and a given privacy strategy that we are going to apply to this
00:06:01
set of tags. A privacy strategy is just a simple contract
00:06:06
that is going to define how we are going to encrypt the data,
00:06:10
the encryption strategy, but also how this privacy
00:06:14
strategy is going to change the schema
00:06:17
of our data. Because, for example, if we protect an ID by hashing this integer,
00:06:24
then it's going to change its type into a string. So we need to have a mutation on the schema itself as well.
00:06:32
So, for example, if we want to apply a certain privacy strategy
00:06:35
to the age, which can be useful to a data science
00:06:39
process, in the end we might want to protect the specific age of a person
00:06:44
by turning it into a category like young adult, adult, senior, teenager, et cetera.
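
As an illustration of the contract just described, here is a minimal Scala sketch of a map from tag sequences to privacy strategies. All names (`PrivacyStrategy`, `applyTo`, `schemaAfter`, `AgeToCategory`, `MaskName`), the tag values, and the age thresholds are assumptions made for this example; the talk does not show the exact definitions.

```scala
object PrivacyStrategies {
  // Hypothetical, simplified leaf types (not the framework's real ones).
  sealed trait ValueType
  case object IntType    extends ValueType
  case object StringType extends ValueType

  // A strategy says how to change a value AND how the field's type changes with it.
  trait PrivacyStrategy {
    def applyTo(value: Any): Any
    def schemaAfter(before: ValueType): ValueType
  }

  // Bucket an exact age into a coarse category: Int in, String out.
  // The thresholds are arbitrary and only serve the example.
  object AgeToCategory extends PrivacyStrategy {
    def applyTo(value: Any): Any = value match {
      case age: Int if age < 13 => "child"
      case age: Int if age < 20 => "teenager"
      case age: Int if age < 30 => "young adult"
      case age: Int if age < 65 => "adult"
      case _: Int               => "senior"
      case other                => other
    }
    def schemaAfter(before: ValueType): ValueType = StringType
  }

  // Mask a name but keep its first letter; the type stays the same.
  object MaskName extends PrivacyStrategy {
    def applyTo(value: Any): Any = value match {
      case s: String if s.nonEmpty => s.take(1) + "x" * (s.length - 1)
      case other                   => other
    }
    def schemaAfter(before: ValueType): ValueType = before
  }

  // The "simple map": a sequence of tags to match, and the strategy to apply.
  type Tags = Seq[String]
  val strategies: Map[Tags, PrivacyStrategy] = Map(
    Seq("person", "age")  -> AgeToCategory,
    Seq("person", "name") -> MaskName
  )
}
```

Keeping `schemaAfter` next to `applyTo` is what lets the schema mutation (for example integer to string) travel with the data mutation.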
00:06:52
So how can we build such a privacy framework?
00:06:56
Well, we are going to create a schema, which is
00:06:59
pretty usual. As usual, we are going to define that
00:07:03
any piece of data is either a struct, an array,
00:07:07
or a value. The value is like the leaf part of our
00:07:11
tree, it's the end game. It's going to be booleans,
00:07:15
strings, dates, doubles: simple types, those types and other kinds of
00:07:20
data types. Let's just say that a struct is a field name
00:07:23
and a TSchema, and the array just has a type parameter, which is practically the same thing.
00:07:30
So what does it look like? It looks like a sealed trait TSchema, a simple ADT, where a struct is
00:07:37
going to be a list of field names and TSchemas,
00:07:40
and the array is going to be like a simple, consistent
00:07:46
element type: the array contains only elements of the same type,
00:07:50
and the element type is going to be once again a TSchema. Then we have the leaf nodes at the end.
00:07:56
And on every piece of our schema we're going to add metadata.
00:08:01
This metadata is going to contain the semantic data that
00:08:05
we talked about, and maybe some information regarding nullable or mandatory fields.
00:08:11
Well, it's part of the same concept, because the metadata goes together.
00:08:17
So we need to have the data, which is going to follow the same hierarchy.
00:08:21
In terms of types it's way simpler, because it's supposed to only handle the data. So we've
00:08:26
got the concrete values, and for the array type it's going to be a sequence of data.
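
The two hierarchies described above could look roughly like the following Scala sketch. Only the name `TSchema` is heard in the talk; `ColumnMetadata`, the case class names, and the sample tags are assumptions made for illustration.

```scala
object SchemaAndData {
  // Metadata attached to every schema node (field names are assumptions).
  final case class ColumnMetadata(nullable: Boolean, tags: List[String])

  // The schema: a plain recursive ADT of struct, array and leaf values.
  sealed trait TSchema { def metadata: ColumnMetadata }
  final case class TStruct(fields: List[(String, TSchema)], metadata: ColumnMetadata) extends TSchema
  final case class TArray(elementType: TSchema, metadata: ColumnMetadata) extends TSchema
  sealed trait TValue extends TSchema
  final case class TString(metadata: ColumnMetadata) extends TValue
  final case class TInt(metadata: ColumnMetadata) extends TValue

  // The data follows the same hierarchy but carries no metadata.
  sealed trait TData
  final case class DStruct(fields: List[(String, TData)]) extends TData
  final case class DArray(elements: Seq[TData]) extends TData
  final case class DString(value: String) extends TData
  final case class DInt(value: Int) extends TData

  // struct<name: string, emails: array<string>> with semantic tags on the leaves.
  val personSchema: TSchema = TStruct(
    List(
      "name" -> TString(ColumnMetadata(nullable = false, tags = List("person", "name"))),
      "emails" -> TArray(
        TString(ColumnMetadata(nullable = false, tags = List("person", "email"))),
        ColumnMetadata(nullable = true, tags = Nil)
      )
    ),
    ColumnMetadata(nullable = false, tags = Nil)
  )

  val person: TData = DStruct(List(
    "name"   -> DString("John McClane"),
    "emails" -> DArray(Seq(DString("john@example.com")))
  ))
}
```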
00:08:33
That's the only change. Well, Spark has practically the same
00:08:38
concepts: it has a DataType, with array types, struct types, and the values.
00:08:43
And in terms of data it's a bit more lenient, but a bit more
00:08:48
performant: it's going to be a Spark SQL Row, which is basically an Array of
00:08:53
Any, which contains, well, primitive types
00:08:56
or rows or lists. So that's
00:09:01
not what we want to be dealing with. We are going to deal with our
00:09:06
simple set of types that are type safe, and
00:09:09
we're going to create privacy functions on top of those types.
00:09:14
Yeah, we are going to traverse recursive data structures, you know,
00:09:18
just to apply privacy on specified fields. So we have to handle
00:09:24
recursive functions. And with recursive functions, we still have to think about how
00:09:30
to iterate over our data structure and what to do on each iteration.
00:09:36
And our data structure is deeply nested, and
00:09:39
we have two data structures: our schema and the data.
00:09:44
So we need to think about how to iterate over our
00:09:49
data structure and recurse through this data structure,
00:09:54
and what to do on each layer. With this we would have a stack
00:09:59
overflow exception in our mind before we get it in our system.
00:10:04
So we would have
00:10:09
complex code. What we want to do is to
00:10:12
focus only on what to do on each
00:10:17
layer, and what we want to do is to apply privacy.
00:10:24
So, well, the best approach to do this is
00:10:29
recursion schemes, which are described in the
00:10:35
paper "Functional Programming with Bananas,
00:10:38
Lenses, Envelopes and Barbed Wire".
00:10:42
Recursion schemes are about separating how to traverse
00:10:48
a recursive structure from what to do with each layer.
00:10:53
So we want to have more maintainable
00:10:56
code, and let recursion schemes automate the recursion, so that we are
00:11:03
thinking only about our business logic.
00:11:08
And we are using Matryoshka. It's a functional programming Scala library,
00:11:13
called Matryoshka, that specializes in recursion schemes.
00:11:22
It generalizes the traversal, the recursive traversal,
00:11:26
using anamorphisms, catamorphisms, hylomorphisms, and things like that.
00:11:33
However, what are ana, cata, hylo? Those are
00:11:38
functions that generalize our recursion.
00:11:44
And in order to use those functions we have to prepare our data.
00:11:51
To simplify this, let's take as an example our schema. So
00:11:57
in order to traverse our schema we have ingredients,
00:12:03
so those are our magical steps to follow
00:12:07
before we can fold our schema.
00:12:14
Our schema is recursive, so as a first step we want to remove the recursion,
00:12:20
in order to make Matryoshka able to evaluate our recursive
00:12:25
structure into a single value of type A. So we will replace
00:12:31
our recursive references with a type
00:12:35
parameter, a new generic type parameter A.
00:12:40
But let's see: if we want to define or describe
00:12:45
that A is another SchemaF of A, how would we describe
00:12:52
a recursive SchemaF with that new type,
00:12:57
and how would we define its type with a recursive SchemaF?
00:13:06
And, you know, there is also something to think about: if, for example,
00:13:10
we want to have a function that takes as a parameter a recursive SchemaF,
00:13:15
or which returns a SchemaF
00:13:18
of SchemaF of whatever, and so on.
00:13:23
So we need to generalize this type, and luckily Matryoshka
00:13:29
has a fixed-point type that can help us
00:13:34
describe the recursion. So now, if we want to say
00:13:39
our schema is a recursive schema, we can define a schema as
00:13:44
a Fix of SchemaF. It actually wraps each layer with a Fix.
00:13:52
In order to make Matryoshka able to traverse our data structure,
00:13:59
we have to say that SchemaF is a functor,
00:14:04
and we can implement the map function of the functor to
00:14:07
transform a SchemaF of A into a SchemaF of B.
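
The ingredients listed above (a pattern functor with the recursion replaced by a type parameter, a fixed-point type, and a functor instance) can be sketched in plain Scala as follows. `Functor` and `Fix` are hand-rolled stand-ins so that the example needs no dependency; with Matryoshka they would come from the library and its functional programming dependency instead. The case class names are assumptions.

```scala
object PatternFunctor {
  // The schema with its recursive positions replaced by the type parameter A.
  sealed trait SchemaF[A]
  final case class StructF[A](fields: List[(String, A)]) extends SchemaF[A]
  final case class ArrayF[A](element: A) extends SchemaF[A]
  final case class ValueF[A](typeName: String) extends SchemaF[A]

  // Hand-rolled stand-ins for the library's Functor type class and Fix type.
  trait Functor[F[_]] { def map[A, B](fa: F[A])(f: A => B): F[B] }
  final case class Fix[F[_]](unFix: F[Fix[F]])

  // map only touches the positions that used to be recursive.
  implicit val schemaFunctor: Functor[SchemaF] = new Functor[SchemaF] {
    def map[A, B](fa: SchemaF[A])(f: A => B): SchemaF[B] = fa match {
      case StructF(fields) => StructF(fields.map { case (name, a) => (name, f(a)) })
      case ArrayF(element) => ArrayF(f(element))
      case ValueF(name)    => ValueF[B](name)
    }
  }

  // struct<name: string, emails: array<string>>, every layer wrapped in Fix.
  val schema: Fix[SchemaF] = Fix[SchemaF](StructF(List(
    "name"   -> Fix[SchemaF](ValueF("string")),
    "emails" -> Fix[SchemaF](ArrayF(Fix[SchemaF](ValueF("string"))))
  )))
}
```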
00:14:13
Now we have prepared our ingredients
00:14:18
to be able to use the recursion scheme functions,
00:14:23
ana, cata, hylo, which automatically
00:14:27
recurse through our data structure. For example, ana would
00:14:32
recursively unfold, or construct, our data structure; cata
00:14:38
would recursively fold our data structure; and hylo would re-fold
00:14:44
our data structure. So now, in our next step, we want to have three things.
00:14:52
The first one is to create a SchemaF
00:14:57
from a single value of type Spark DataType; the second
00:15:03
is to fold a SchemaF into a
00:15:08
single value of type Spark DataType; the third
00:15:12
one is to transform one data type into another via SchemaF.
00:15:20
In order to create a SchemaF
00:15:22
from a Spark schema, we need
00:15:27
a function that goes from A to F of A.
00:15:33
And Matryoshka has a generic function called
00:15:38
ana, with which our recursive
00:15:42
data structure will be wrapped in Fix. And this data structure
00:15:50
should be an instance of the Functor type class, which we already did. And
00:15:57
now we want to go from a single value of type Spark DataType, and what we want to get,
00:16:05
what we want, is to have a Fix of SchemaF.
00:16:12
And now, well, it really builds our
00:16:16
SchemaF from a Spark schema from top to bottom.
00:16:25
So,
00:16:28
in order to be able to do it
00:16:33
using an anamorphism, we need to define our coalgebra. And a coalgebra
00:16:39
is the function that I mentioned,
00:16:43
the one that goes from A to F of A. And
00:16:47
in our case A is a DataType,
00:16:51
and F of A is a SchemaF.
00:16:56
Now we are able to define our logic in each layer, to define
00:17:03
our recipe, what we want to do on each layer.
00:17:08
And then we can call ana, and ana will automatically
00:17:14
do the recursion for us and build, from a single Spark DataType,
00:17:22
a Fix of SchemaF.
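
A hedged sketch of the unfold just described: a coalgebra `A => F[A]` gives the recipe for one layer, and `ana` drives the recursion. To stay self-contained, `Functor`, `Fix` and `ana` are written by hand rather than imported from Matryoshka, and `DataType` is a small stand-in for Spark's `DataType`, not the real class.

```scala
object AnaDemo {
  sealed trait SchemaF[A]
  final case class StructF[A](fields: List[(String, A)]) extends SchemaF[A]
  final case class ArrayF[A](element: A) extends SchemaF[A]
  final case class ValueF[A](typeName: String) extends SchemaF[A]

  // Hand-rolled stand-ins for the library's Functor and Fix.
  trait Functor[F[_]] { def map[A, B](fa: F[A])(f: A => B): F[B] }
  final case class Fix[F[_]](unFix: F[Fix[F]])

  implicit val schemaFunctor: Functor[SchemaF] = new Functor[SchemaF] {
    def map[A, B](fa: SchemaF[A])(f: A => B): SchemaF[B] = fa match {
      case StructF(fields) => StructF(fields.map { case (name, a) => (name, f(a)) })
      case ArrayF(element) => ArrayF(f(element))
      case ValueF(name)    => ValueF[B](name)
    }
  }

  // Simplified stand-in for Spark's DataType hierarchy (an assumption).
  sealed trait DataType
  final case class StructType(fields: List[(String, DataType)]) extends DataType
  final case class ArrayType(elementType: DataType) extends DataType
  final case class ValueType(name: String) extends DataType

  type Coalgebra[F[_], A] = A => F[A]

  // Anamorphism: unfold one layer, then recurse into the seeds it produced.
  def ana[F[_], A](a: A)(coalg: Coalgebra[F, A])(implicit F: Functor[F]): Fix[F] =
    Fix[F](F.map(coalg(a))(child => ana[F, A](child)(coalg)))

  // The recipe for ONE layer: which SchemaF node corresponds to this DataType.
  val dataTypeToSchemaF: Coalgebra[SchemaF, DataType] = {
    case StructType(fields) => StructF(fields)
    case ArrayType(element) => ArrayF(element)
    case ValueType(name)    => ValueF(name)
  }

  val sparkLike: DataType =
    StructType(List("name" -> ValueType("string"), "ages" -> ArrayType(ValueType("integer"))))

  val schema: Fix[SchemaF] = ana[SchemaF, DataType](sparkLike)(dataTypeToSchemaF)
}
```

Note that the coalgebra never calls itself: the children it returns are still plain `DataType` seeds, and `ana` is what expands them.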
00:17:27
Cool. Now let's move to the next recipe: we want to fold
00:17:34
a SchemaF to a Spark schema, so we need
00:17:38
a function that goes from F of A to A.
00:17:43
And Matryoshka provides a recursive fold
00:17:49
function, a generic function called cata. And how a catamorphism works: it
00:17:57
takes our data from the most deeply
00:18:00
nested layer, not from the top,
00:18:04
and it folds that to a single type, and then it
00:18:08
keeps the result, goes to the upper layer, takes it, and
00:18:14
folds that to a single type, and so on up to the top level. So a catamorphism
00:18:21
will fold our data type
00:18:28
from bottom to top. And how would we define this?
00:18:36
How would this recursive function work? How
00:18:41
would we handle the elements? For example, if we have an array
00:18:47
of simple values, how would we take them to the next
00:18:53
level, in order to fold
00:18:58
and collapse our SchemaF? We need an algebra.
00:19:05
And an algebra is a function from F of A to A.
00:19:10
In our case F is SchemaF and A is DataType.
00:19:14
The magic happens in the type parameter, DataType, because it really
00:19:22
keeps what was computed in the previous iteration, and
00:19:28
would keep it for the next iteration. And now
00:19:34
we can define our recipe on each layer.
00:19:38
And as you can see, for example, for the struct
00:19:42
it has fields, and it takes the fields from the
00:19:46
previous level, which contain
00:19:52
the list of the data types that we
00:19:55
computed on the previous iteration
00:20:01
of the catamorphism. So now we can take a
00:20:05
SchemaF and fold it using cata and using our algebra.
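
The fold can be sketched the same way: an algebra `F[A] => A` collapses one layer whose children have already been folded, and `cata` applies it bottom-up. As before, `Functor`, `Fix` and `cata` are hand-written stand-ins for the library versions, and `DataType` is a simplified stand-in for Spark's type.

```scala
object CataDemo {
  sealed trait SchemaF[A]
  final case class StructF[A](fields: List[(String, A)]) extends SchemaF[A]
  final case class ArrayF[A](element: A) extends SchemaF[A]
  final case class ValueF[A](typeName: String) extends SchemaF[A]

  // Hand-rolled stand-ins for the library's Functor and Fix.
  trait Functor[F[_]] { def map[A, B](fa: F[A])(f: A => B): F[B] }
  final case class Fix[F[_]](unFix: F[Fix[F]])

  implicit val schemaFunctor: Functor[SchemaF] = new Functor[SchemaF] {
    def map[A, B](fa: SchemaF[A])(f: A => B): SchemaF[B] = fa match {
      case StructF(fields) => StructF(fields.map { case (name, a) => (name, f(a)) })
      case ArrayF(element) => ArrayF(f(element))
      case ValueF(name)    => ValueF[B](name)
    }
  }

  // Simplified stand-in for Spark's DataType hierarchy (an assumption).
  sealed trait DataType
  final case class StructType(fields: List[(String, DataType)]) extends DataType
  final case class ArrayType(elementType: DataType) extends DataType
  final case class ValueType(name: String) extends DataType

  type Algebra[F[_], A] = F[A] => A

  // Catamorphism: fold the children first, then collapse the current layer.
  def cata[F[_], A](fix: Fix[F])(alg: Algebra[F, A])(implicit F: Functor[F]): A =
    alg(F.map(fix.unFix)(child => cata[F, A](child)(alg)))

  // One layer only: the A positions already hold the DataTypes computed below.
  val schemaFToDataType: Algebra[SchemaF, DataType] = {
    case StructF(fields) => StructType(fields)
    case ArrayF(element) => ArrayType(element)
    case ValueF(name)    => ValueType(name)
  }

  val schema: Fix[SchemaF] = Fix[SchemaF](StructF(List(
    "name" -> Fix[SchemaF](ValueF("string")),
    "ages" -> Fix[SchemaF](ArrayF(Fix[SchemaF](ValueF("integer"))))
  )))

  val folded: DataType = cata[SchemaF, DataType](schema)(schemaFToDataType)
}
```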
00:20:12
Cool. Let's see how to transform
00:20:16
a data type into another data type. As we
00:20:21
have seen in the previous slides,
00:20:26
we can build a SchemaF from
00:20:29
a value of type DataType using ana, and we can fold a SchemaF to a data type.
00:20:38
So we need to collapse it, right? And actually Matryoshka
00:20:42
provides a generic function that will do that in one step,
00:20:49
which is called a hylomorphism, and we can
00:20:52
call it once we define our algebra and coalgebra.
00:20:59
So cool, now
00:21:01
we understand what our functions are, the
00:21:05
ones we use in order to apply privacy.
00:21:10
And for the first thing, the first
00:21:13
privacy strategy that we apply,
00:21:18
it's how to apply privacy on our schema, which
00:21:22
was easy to do, because we define an algebra and it lets us
00:21:28
do the same work and the same logic on
00:21:32
each layer. So the code was simple, and
00:21:36
we call a catamorphism in order to automate the recursion.
00:21:42
Now let's see how we apply privacy on our data.
00:21:51
So we're going to talk now about the three engines that we built.
00:21:57
Just as a reminder: we have come up with
00:22:01
a sequence of tags that we want to match against
00:22:05
the schema of our data, and we have a privacy strategy that will actually do the work.
00:22:12
But the thing is that we want to change the data
00:22:17
only when the tags within its schema match those of the privacy strategy.
00:22:24
So if we are thinking about a very naive approach, we know that
00:22:30
when we have a specific piece of data, we need to be able to look at its
00:22:34
dedicated schema, to be able to access
00:22:39
the tags and check whether those tags are, well,
00:22:43
what we're looking for, what we want to protect. So the first, very naive approach that we
00:22:50
can have to this problem is to zip, recursively, the data and the schema together.
00:22:55
That way, at each level, at each layer, we're going to have
00:22:59
two things, the data and the schema, and then we're going to apply just a simple pattern match
00:23:05
that is going to say: okay, well, if for this specific
00:23:10
piece of data I have the specific set of tags that I'm looking for, I'm going to apply the privacy strategy; if not,
00:23:16
then just output the data as it was, there's no need to secure that piece of data.
00:23:21
So how can we do that? Well, luckily, well, it's not luck, but the
00:23:27
Matryoshka framework gives us EnvT. EnvT is basically a pattern functor that,
00:23:34
for the functor F, in our case our DataF,
00:23:38
lets us put right next to it a label
00:23:42
of type E, which is going to be our schema in this case, and it still has a type parameter A.
00:23:48
That type parameter is very important: this is the hole
00:23:52
that is going to get filled at each computation by the Matryoshka
00:23:57
framework. So A is what is going to be the final output.
00:24:04
So EnvT is basically a case class, so we can pattern
00:24:07
match on it to create it, or we can pattern match on it to extract the
00:24:11
desired components. And we have access to two methods, ask and lower, which
00:24:16
give us respectively the label and the functor itself.
00:24:22
So, for example, if we have a specific piece of data and a specific piece of
00:24:27
schema, which is John McClane of gender zero: we have the person name on one side
00:24:35
and the data on the other side. Then, in the end result, what we want is a struct
00:24:40
that contains on the left-hand side the schema with its tags and on the right-hand side the data itself.
00:24:47
That's what we want in the end. So using EnvT, what we'll need is to match schema and data.
00:24:54
But in the real world, well, schema and data might not be compatible. I mean, most of the time
00:25:01
the schema evolutions we make, maybe they do work, but most of the
00:25:05
time, if someone didn't follow the rules, then it doesn't work at all.
00:25:10
So we need to take into account the fact that when we'll be zipping schema and data together
00:25:17
we might have incompatibilities: we might have a specific piece of
00:25:20
data that is not remotely related to what the schema is.
00:25:24
And so we're not going to output just the EnvT, that is, the data with its schema; we're
00:25:29
going to output an Either with Incompatibility, a simple case class which will give you the problems you're looking
00:25:35
for, and fail. Because there's nothing that we want
00:25:41
to output from this framework which hasn't been verified. If
00:25:44
you have a specific piece of data that does not conform to the schema, you have no idea what you're
00:25:49
manipulating, and the last thing you want is to give that out when it's a credit card, or
00:25:55
something like that. Okay, so the zip with schema
00:26:00
that we are going to define is actually a very simple pattern match. As
00:26:05
we said, well, in Matryoshka you are only responsible for giving the recipe for a specific layer. So
00:26:12
at a specific layer you're only going to say: okay, when
00:26:16
the schema specifies that I'm expecting a struct, the data must be a struct;
00:26:22
when the schema specifies that I'm expecting an array, the data must be an array;
00:26:26
and that's it. And if anything else happens, then you have an incompatibility
00:26:31
and you stop right there. Well, you don't stop right there, but
00:26:34
you're going to output the left-hand side, that is to say the incompatibility. Okay.
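
The one-layer zip described above can be sketched as follows. This is only an illustration under assumptions: `EnvT` here is a simplified local case class exposing `ask` and `lower` (the library's own `EnvT` has a different constructor), `DataF`, `Incompatibility` and `zipWithSchema` are hypothetical names, and the function handles a single layer, leaving the recursion to whichever scheme drives it.

```scala
object ZipDemo {
  sealed trait Schema
  final case class SStruct(fields: List[(String, Schema)]) extends Schema
  final case class SArray(element: Schema) extends Schema
  final case class SValue(tags: List[String]) extends Schema

  sealed trait Data
  final case class DStruct(fields: List[(String, Data)]) extends Data
  final case class DArray(elements: List[Data]) extends Data
  final case class DValue(value: Any) extends Data

  // Pattern functor of the data: recursive positions replaced by A.
  sealed trait DataF[A]
  final case class StructF[A](fields: List[(String, A)]) extends DataF[A]
  final case class ArrayF[A](elements: List[A]) extends DataF[A]
  final case class ValueF[A](value: Any) extends DataF[A]

  // Simplified local stand-in: a label (ask) next to one layer of F (lower).
  final case class EnvT[E, F[_], A](ask: E, lower: F[A])
  final case class Incompatibility(schema: Schema, data: Data)

  type Seed   = (Schema, Data)
  type Zipped = Either[Incompatibility, EnvT[Schema, DataF, Seed]]

  private def layer(schema: Schema, data: DataF[Seed]): Zipped =
    Right(EnvT[Schema, DataF, Seed](schema, data))

  // One layer only: pair the schema with the data, or report an incompatibility.
  def zipWithSchema(pair: Seed): Zipped = pair match {
    case (s @ SStruct(sFields), DStruct(dFields)) if sFields.map(_._1) == dFields.map(_._1) =>
      layer(s, StructF(sFields.zip(dFields).map { case ((name, fs), (_, fd)) => (name, (fs, fd)) }))
    case (s @ SArray(element), DArray(elements)) =>
      layer(s, ArrayF(elements.map(d => (element, d))))
    case (s @ SValue(_), DValue(v)) =>
      layer(s, ValueF(v))
    case (s, d) =>
      Left(Incompatibility(s, d))
  }
}
```

The children are returned as new `(Schema, Data)` seeds, so the same function can be applied to them in turn, and any mismatch surfaces as a `Left` instead of unverified output.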
00:26:42
So, zipping the data and the schema: now we have the way.
00:26:46
Now, changing the data: well, this time it's another transformation, from this EnvT
00:26:53
of the first transformation to a Fix of DataF,
00:26:58
that is to say, the transformed data, the protected
00:27:02
data. All we need to do at this point is define the corresponding
00:27:06
algebra that is going to extract the schema and the data
00:27:10
from the EnvT and say: okay, I'm just going
00:27:14
to check the privacy strategies for this specific set
00:27:18
of tags, whether they match, and if they don't then I'm going to just output the fixed value as it was,
00:27:24
and if they do then I'm going to apply the privacy strategy and output the result instead of the fixed value.
00:27:30
And that's it, that's privacy applied. The beauty of functional programming is that most of the time you end up doing eighty percent of
00:27:36
the work in the types, and in the end, when you have to really do something, then it's quite easy.
00:27:44
So we have a coalgebra, we have an algebra, so what we're going to do is to plug them together with a hylo.
00:27:50
The hylo is going to take as input the schema and the
00:27:53
data, because it's a tuple, so we can zip them together and
00:27:57
then get the necessary output, and we only need to match the incompatibilities
00:28:01
at the end. So this is a very naive approach.
00:28:07
It's quite easy once you get the types right. So we
00:28:11
have a very versatile and generic engine, we have...
00:28:16
yeah, no, it's not very efficient. So we're going to try
00:28:22
to do better: we're going to build a lambda privacy engine.
00:28:28
The first engine is fine, but the thing is that if you have a thousand records then you're
00:28:33
doing a thousand times the zipping of the schema and the data together. The schema is not going to change;
00:28:40
the data is going to change, but the schema gives you the recipe. I mean, the whole point
00:28:45
of Spark is that the schema gives you the recipe. So is it possible to build some kind of
00:28:52
one-time lens, that is to say something that is going to apply the
00:28:55
mutation, go down into the data, and modify just what we need?
00:29:00
We can do that, we can do that by chaining functions. So we are going to
00:29:04
try to build a lambda that will go down into the data according to the schema
00:29:10
and apply the privacy. The thing is that the check we did previously
00:29:14
is always going to be applied, because we need to look at the schema
00:29:19
to be able to know if we need to do something. But this time
00:29:24
it's only going to be done once: checking the
00:29:29
schema and checking if there is anything to do on that schema.
00:29:34
So if there's nothing to do, then you can just call identity, and that's it, that's over.
00:29:40
So I'm going to define two types of operations. We're going to define a sealed trait that is MutationOp,
00:29:46
and you've got a case object that is going to be the no-op: nothing to do, don't go there;
00:29:53
and the go-down operation, which has two functions: apply, to actually do the mutation, and
00:30:01
andThen, to chain functions, to be able to get inside the elements.
00:30:08
So once again we have a transformation using Matryoshka, but this time it's
00:30:12
not a hylo, we don't need two transformations: we're only going from SchemaF to MutationOp.
00:30:19
So we're going to call it prepare transform, because in the case of, for example, a streaming
00:30:24
application you want to do that at the start of your streaming application and then just apply it
00:30:29
for the rest of the life cycle of your application. We're going to use an algebra and a cata
00:30:34
that is going to go from the SchemaF to a MutationOp, evaluated just once. Once
00:30:40
again, it's a pattern match on all the possible cases, that is, struct, array and values,
00:30:46
and once again one layer and that's it. In Matryoshka you always assume that the rest of the work has been done for you.
00:30:54
So in the case of values, it's practically the same code as before: we're going
00:30:58
to just check the schema, check its tags against the privacy strategies that we have,
00:31:04
but this time we're deferring the execution to later: we're creating a go-down operation that is
00:31:09
going to take a specific piece of data later and apply the privacy strategy using the closure.
00:31:17
And that's it. If no privacy strategy was matched, then it's a no-op. When you're dealing with arrays,
00:31:25
remember, the hole in your functor is always filled with
00:31:28
the previous computation, so the element type is now the previous operation.
00:31:33
It's not an element type anymore, it's an A, which has become a MutationOp. You match
00:31:39
the element type, you match what should be applied to the elements: you match for no-op or
00:31:47
anything else. And if anything else needs to happen, that means that you've got to secure all the elements in your array;
00:31:53
that means that you need to loop over the elements of your array and recreate an array
00:31:58
with the privacy strategy applied on each element of your array.
00:32:05
This is quite simple, it's just unwrapping and rewrapping.
00:32:10
And it's a no-op if you don't have anything to do with your data: then you don't need
00:32:14
to do anything on your array as well, because the array itself is just a container for the data.
00:32:21
For structs it's a bit more complicated, but interesting.
00:32:24
A struct has many fields, so you're going to check if
00:32:29
every field is safe. If at least one field
00:32:34
needs to be, well, changed, then you're going
00:32:38
to unwrap and rebuild the struct, to be able to apply
00:32:43
on this specific element the privacy function.
00:32:48
So now, according to any given schema, we can build, once, a
00:32:54
lambda that will zoom into your data and only change what it needs to.
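
A rough sketch of this "prepare once, apply many times" idea follows. It is written with plain recursion over a simplified schema for brevity, whereas the talk folds a SchemaF with a catamorphism; `MutationOp`, `NoOp`, `GoDown`, `prepare` and `applyOp` are illustrative names, and the strategies are reduced to plain `Any => Any` functions.

```scala
object LambdaEngine {
  sealed trait Schema
  final case class SStruct(fields: List[(String, Schema)]) extends Schema
  final case class SArray(element: Schema) extends Schema
  final case class SValue(tags: List[String]) extends Schema

  sealed trait Data
  final case class DStruct(fields: List[(String, Data)]) extends Data
  final case class DArray(elements: List[Data]) extends Data
  final case class DValue(value: Any) extends Data

  // Either nothing to do below this node, or a function that goes down into it.
  sealed trait MutationOp
  case object NoOp extends MutationOp
  final case class GoDown(run: Data => Data) extends MutationOp

  def applyOp(op: MutationOp, data: Data): Data = op match {
    case NoOp        => data
    case GoDown(run) => run(data)
  }

  // Built once per schema; the tag lookup never happens again per record.
  def prepare(schema: Schema, strategies: Map[List[String], Any => Any]): MutationOp =
    schema match {
      case SValue(tags) =>
        strategies.get(tags) match {
          case Some(f) => GoDown {
            case DValue(v) => DValue(f(v))
            case other     => other
          }
          case None => NoOp
        }
      case SArray(element) =>
        prepare(element, strategies) match {
          case NoOp => NoOp // nothing to secure in the elements: skip the array
          case op   => GoDown {
            case DArray(elements) => DArray(elements.map(applyOp(op, _)))
            case other            => other
          }
        }
      case SStruct(fields) =>
        val ops = fields.map { case (name, s) => name -> prepare(s, strategies) }.toMap
        if (ops.values.forall(_ == NoOp)) NoOp // every field is safe
        else GoDown {
          case DStruct(values) =>
            DStruct(values.map { case (name, d) => name -> applyOp(ops.getOrElse(name, NoOp), d) })
          case other => other
        }
    }
}
```

A possible usage: build the operation once from the schema, then call `applyOp` on every record. When the schema contains nothing to protect, the result is `NoOp` and each record is returned untouched.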
00:33:01
It's practically the same as very specialized code that would do something like: get
00:33:07
zero, get one, get field one, and, yeah, that's what I need to change. But for arbitrary data.
00:33:14
And it can be serialized and applied many times:
00:33:18
in a Spark process you can serialize it, in a streaming process you can serialize it and use it when you want.
00:33:25
So it's better, it's efficient. But
00:33:30
we can do better, because it's still managed by the garbage collector.
00:33:36
Who is not familiar with Spark?
00:33:40
Okay, so Spark is simple: you've got this really
00:33:42
neat engine, which is the Catalyst engine, a symbolic manipulation engine that
00:33:47
allows you to define your job and manipulate the data and do codegen. That is to say that
00:33:52
Spark is going to generate Java code, Java bytecode, that is
00:33:56
going to be compiled on the fly, sent to the executors, and run.
00:34:01
And it's extensible. Not many people do that, but you can; it's not a hack. You can
00:34:08
integrate into Spark a bit more. So, for example, in a batch Spark job
00:34:12
with millions of records, any of the previous methods will actually generate a lot of conversions
00:34:18
back and forth: in the worst case it will go from the Spark
00:34:23
Row to our data form, apply the privacy there, and then back to Row,
00:34:28
which is awful. And it's not really integrated
00:34:32
with Spark, so you're breaking the logical plan execution and
00:34:35
optimization of Spark, because you're going back to RDDs, applying the transformation on the piece of data,
00:34:41
applying the transformation on the schema, then recreating the DataFrame. It's
00:34:46
not perfect. So the Catalyst engine is going to do
00:34:54
all the steps needed to go end to end to an optimized plan.
00:34:57
But at the end, there's a selected physical plan and
00:35:00
code generation. We can actually be part of that story: we can
00:35:06
use the Spark Catalyst to generate adequate, optimized Java code that will go down into the data,
00:35:12
precisely the same as what we did with the lambda, but this time
00:35:16
in the unsafe world, that is to say using the unsafe API and Java code.
00:35:22
It's not perfect, but look at it this way: you're doing
00:35:27
rich, interesting functional programming to give orders to Java, which is cool.
00:35:34
And you're type safe. Well, it's not that you're type safe: you're going to mutate it according
00:35:39
to privacy and stay as much as possible in the unsafe world. So we're going to go there
00:35:46
and shot the territory so this part life cycle is pretty fly for what
00:35:50
we're going to go former scheme i have to drive a code as string
00:35:55
and then compiled by jane you know which is a java compiler of
00:35:58
a quick send the bike out to the executors on do its magic
00:36:04
So we need a little bit of groundwork to be able not to lose our minds.
00:36:08
We're going to wrap in a value class the input variable that we're going to use,
00:36:13
and we're going to define the CatalystCode as being something that takes an input variable
00:36:17
and generates a String, which is going to be the Java code. And
00:36:21
this String is going to be executable code that is going to put,
00:36:26
in Java, the output of its computation in an output variable that is documented in this case class.
00:36:33
So in a sense, the input is given by the outside and the output is given by you.
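The two building blocks just described can be sketched in plain Scala. This is only an illustration: the names `InputVariable` and `CatalystCode` follow the words used in the talk, but the exact shape of the classes is an assumption, and a plain case class is used where the talk mentions a value class.

```scala
object CatalystCodeSketch {
  // Name of the Java variable that holds the input (the talk wraps it in a value class).
  final case class InputVariable(name: String)

  // `code` turns the input variable name into a Java snippet;
  // `outputVariable` is the Java variable in which that snippet stores its result.
  final case class CatalystCode(code: InputVariable => String, outputVariable: String)

  // Hypothetical example: a snippet that upper-cases a Java String.
  val upper: CatalystCode = CatalystCode(
    in => s"String out_0 = ${in.name}.toUpperCase();",
    outputVariable = "out_0"
  )

  def main(args: Array[String]): Unit =
    println(upper.code(InputVariable("in_0")))
}
```

The caller chooses the input variable name, while the snippet itself fixes where the result ends up, which is the "input from the outside, output from you" contract.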
00:36:38
That's it. So how do you create a new expression? Well, you don't need any of the other children
00:36:45
that are going to be the columns of your DataFrame, but you're going to extend Expression.
00:36:50
In a simple case you can extend UnaryExpression, but this is not
00:36:54
our case. It's not simple, because we're going to apply it to the whole dataset.
00:36:59
Is your expression nullable? How does your expression transform the original schema of your data?
00:37:05
Which is quite great: it's in the contract, so we don't have to specify
00:37:08
it anywhere else, and we already have what we need for that because we have all the
00:37:12
necessary translations between our SchemaF, our transformed
00:37:15
SchemaF and the DataTypes. So we need
00:37:19
this, and Spark is giving it to us. There's something quite nice as well, which is the eval function.
00:37:26
Spark sometimes doesn't rely on code generation because it
00:37:29
thinks that it can do better on heap,
00:37:34
and this is just basically any of the previous engines that we defined; it's going to be like that.
00:37:40
And that's it. But for the most complex cases, or
00:37:44
in production, that is not what is going to be used; what is used is doGenCode. So let's start with the end.
00:37:52
The end is what? We're going to define an algebra once again for SchemaF,
00:37:57
which is the basic recipe, and it's going to give a Catalyst op together with the DataType, because I
00:38:03
need that DataType as an intermediate in the computation. I'm not
00:38:07
going to use it in the end, but I need it in between.
00:38:12
If there's nothing to do, then you've got a very neat
00:38:15
piece of Java code in a String, which is: output equals input.
00:38:20
Don't forget the semicolon, it's Java. And if
00:38:24
you've got code, then you're going to call the
00:38:28
code generator that you created, with the input variable that you have,
00:38:34
and it's going to generate a huge chunk of code,
00:38:38
a block of code that is going to define the
00:38:42
privacy output variable that you can finally assign to your
00:38:47
result. So it's cool, but that's the end. Let's go inside. Once again, we
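The final step described here can be sketched as follows. This is a minimal illustration, not the project's code: `CatalystOp`, `NoOp`, `finalBlock` and the `result` parameter are assumed names, and the generated Java is kept deliberately simple.

```scala
object FinalBlockSketch {
  final case class InputVariable(name: String)

  // Either nothing has to be done, or some Java code has been generated.
  sealed trait CatalystOp
  case object NoOp extends CatalystOp
  final case class CatalystCode(code: InputVariable => String, outputVariable: String)
      extends CatalystOp

  // Builds the last block of Java: assign either the untouched input or the
  // generated output variable to `result` (do not forget the semicolon).
  def finalBlock(op: CatalystOp, input: InputVariable, result: String): String = op match {
    case NoOp =>
      s"${result} = ${input.name};"
    case CatalystCode(code, out) =>
      code(input) + "\n" + s"${result} = ${out};"
  }
}
```

For example, `finalBlock(NoOp, InputVariable("in_0"), "result")` yields the single Java statement `result = in_0;`.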
00:38:54
are going to segregate between all the different cases: struct, array and value.
00:39:01
For the value, it's practically the same thing: we're going to check, according to the schema, the privacy strategies.
00:39:07
Remember, we're doing things on the driver side, but we need to send
00:39:13
them to the executor side, so you need to serialize, one way or another, the privacy strategy that you have
00:39:20
and send it to the executor. But luckily Spark has a method, addReferenceObj,
00:39:26
which allows you to transfer
00:39:30
objects from your driver world to the executor world.
00:39:34
This is going to give you a variable String that you can use in your Java code.
00:39:39
You're going to safely define your output type and your output variable, cast it, because
00:39:46
that's what we do, and then apply the privacy function in the executor world.
00:39:55
This is going to give you your first CatalystCode. And then it's
00:39:59
practically the same thing for the arrays. Just check
00:40:05
whether you have a NoOp, that is to say no code was generated, and then you have nothing to
00:40:09
do and you're all right. But if you have code that was generated, then you're going to take that code
00:40:14
and apply it in a very neat and very type-safe Java loop,
00:40:20
a for loop, well, not one but two for loops, with
00:40:26
the most basic elements that you can find. It's going to apply
00:40:29
it, fill a temporary array of objects, and give you the specific output.
00:40:36
You're writing code blocks, but with string interpolation it's not that bad.
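As a rough sketch of the array case, string interpolation can wrap the element-level code in a generated Java loop. Everything here is an assumption for illustration: the helper name `forEachElement`, the variable naming scheme, and the use of a plain `Object[]` instead of Spark's internal array representation. It also emits a single loop.

```scala
object ArrayLoopSketch {
  final case class InputVariable(name: String)
  final case class CatalystCode(code: InputVariable => String, outputVariable: String)

  // Wraps the code generated for one element into a Java for loop that
  // fills a temporary Object[] and exposes it as the output variable.
  def forEachElement(element: CatalystCode, id: Int): CatalystCode = {
    val out  = s"arr_${id}"
    val elem = s"elem_${id}"
    val i    = s"i_${id}"
    CatalystCode(
      in => Seq(
        s"Object[] ${out} = new Object[${in.name}.length];",
        s"for (int ${i} = 0; ${i} < ${in.name}.length; ${i}++) {",
        s"  Object ${elem} = ${in.name}[${i}];",
        s"  ${element.code(InputVariable(elem))}",
        s"  ${out}[${i}] = ${element.outputVariable};",
        "}"
      ).mkString("\n"),
      out
    )
  }
}
```

The `id` suffix keeps variable names unique when loops are nested, since the generated snippets end up concatenated into one Java method body.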
00:40:42
And the struct is basically the same thing, but on each and every field, with
00:40:46
the same logic: if there's nothing to do on any of the fields, I'm good.
00:40:50
If there's something to do on one field, well, this time this is a mutable
00:40:55
InternalRow, so you can just mutate the data you need.
00:41:01
Sometimes, because Spark relies heavily on the Unsafe API,
00:41:05
and the Unsafe API is memory that is not managed by
00:41:08
the garbage collector. So for fixed-size types you can be
00:41:13
very optimised, but for arbitrarily sized types like
00:41:18
strings or another struct, then you need to be more clever.
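The struct case can be sketched in the same spirit: stay a no-op when no field needs work, otherwise emit an in-place update only for the fields that have code. The names below are hypothetical, and the generated Java mutates a plain `Object[]` as a stand-in for a mutable row; it does not model the fixed-size versus variable-size distinction of the Unsafe API.

```scala
object StructSketch {
  final case class InputVariable(name: String)
  sealed trait CatalystOp
  case object NoOp extends CatalystOp
  final case class CatalystCode(code: InputVariable => String, outputVariable: String)
      extends CatalystOp

  // `fields` holds one op per field, in ordinal order.
  def struct(fields: List[CatalystOp], id: Int): CatalystOp =
    if (fields.forall(_ == NoOp)) NoOp // nothing to do on any field
    else {
      val out = s"row_${id}"
      CatalystCode(
        in => {
          val updates = fields.zipWithIndex.collect {
            case (CatalystCode(code, fieldOut), idx) =>
              val field = s"${out}_f${idx}"
              Seq(
                s"Object ${field} = ${out}[${idx}];",
                code(InputVariable(field)),
                s"${out}[${idx}] = ${fieldOut};" // mutate in place
              ).mkString("\n")
          }
          (s"Object[] ${out} = ${in.name};" +: updates).mkString("\n")
        },
        out
      )
    }
}
```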
00:41:23
Okay, so we are done. We have our final method
00:41:27
that is going to use all the Java strings that we nested neatly
00:41:33
together, and output the final code block, which we hope is not too big.
00:41:42
It's tough, but at least the data stays in the unsafe world.
00:41:48
When it's not needed, you can even stay in the
00:41:50
Tungsten binary format, when it's of fixed size. It's
00:41:55
tightly integrated with Spark, I must admit, but it's not
00:41:59
a hack: this is all public API and documented.
00:42:02
It's not widely used, but you can do it.
00:42:07
And the results are pretty cool, because for a simple Apache Spark job on a Mesos cluster with ten
00:42:13
cores and five gigs of heap, on five gigs of compressed data in Parquet, the first engine takes basically seventy minutes,
00:42:20
the lambda engine is slightly better with forty-five minutes, and the
00:42:23
codegen is unbeatable, because you can't beat that: it's twenty-one minutes long.
00:42:31
Yeah. To sum up, a design is at
00:42:34
least an algorithm or pattern for solving problems, and
00:42:40
you can come up with a good
00:42:44
solution, an elegant solution, if you have a good design. And in our case
00:42:49
we used functional programming, the Scala library Matryoshka, and we
00:42:55
came up with three engines to apply
00:42:59
privacy with it, test it, and maintain our code.
00:43:04
If you are interested in the code, you can check out our
00:43:08
repository; it is all there,
00:43:14
and you can also use Matryoshka and try that out. Also,
00:43:20
if you are interested in the idea of recursion schemes, you can check out the paper.
00:43:27
And we want to take this opportunity to thank Valentin Kasas
00:43:33
for laying the foundation of this design, and
00:43:38
our colleagues especially and then sat down and then i met
00:43:43
And thank you all for your attention. You can follow us.
00:43:56
Any questions?
00:44:04
Okay, thank you, we are around. Ah, I'm sorry, go ahead, with the microphone.
00:44:11
from your was struck by right you i saw like this one and other dozens of that's not that or
00:44:19
or spark as well uh looking out the of a schema that we defined
00:44:24
yeah and sometimes the the sauce team r. c. m. u.'s in tucson scheme our scheme
00:44:31
are as as thought it could be like that numeric and certain positions and that too
00:44:38
doubles of them please isn't meant to get like that's the morning
00:44:41
a different scale and precision. Would that affect what you have
00:44:47
been doing? Not much, to be honest. With
00:44:51
this design, from an organisational standpoint,
00:44:56
this design is the abstraction of the code.
00:45:00
The first schemas were designed by data management teams
00:45:03
in JSON Schema, then they are translated to this
00:45:08
ADT, and then we define the transformation afterwards. But
00:45:12
any kind of data type that can be represented with Spark...
00:45:16
I mean, it's not a matter of Spark. We're using
00:45:20
Spark here, but if you want to define your precise decimal type and everything,
00:45:25
you can do it, and it doesn't change anything in the code that has been designed.
00:45:31
The only thing that might change is your privacy strategies, your implementation of the privacy for that.
00:45:37
We didn't get into that in much depth, but as you can guess,
00:45:42
they need to be typed as well. I mean, you're going to take the privacy strategy:
00:45:47
you cannot apply, I would say, a
00:45:53
fuzzing of GPS coordinates on ints.
00:45:56
So your privacy strategies are going to be, well, type safe, or checking the
00:46:01
input types to check that the data you are applying them onto is actually compatible.
00:46:07
So it doesn't change anything to the rationale, but the more types
00:46:13
you have, the more you need to take into account, and that's it.
00:46:16
But a value is a value. So for example, when we receive a TValue,
00:46:23
most of the time you don't want to get into what kind of value it is in the ADT; you just need to know
00:46:29
that it's not a recursive data type, and that's it. That's the only thing that we ask, or need, in a way.
00:46:36
So it doesn't matter that much whether it's a decimal type or a binary or whatever you've got. It doesn't matter. Okay.
00:46:45
Yep. About encryption: so the question was,
00:46:56
is it possible to add a salt value to the encryption?
00:47:00
We didn't add it to the open-source project, but we
00:47:04
have something called the privacy context in our implementation, and the privacy context can
00:47:10
vary depending on the target you're going to, well, target,
00:47:16
and will contain, for example, salts for different
00:47:19
business units or different stakeholders, and
00:47:23
your hash functions are going to take into account the privacy context. So
00:47:27
yes, you would need to have that context; we needed it too, a privacy context. Yeah.
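A minimal sketch of the idea in this answer, assuming a hypothetical `PrivacyContext` that maps each stakeholder to a salt. The speakers say this part is not in the open-source project, so the names and the salt-prepending scheme below are purely illustrative; only `java.security.MessageDigest` is a real API.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object SaltedHashSketch {
  // Hypothetical context: one salt per business unit or stakeholder.
  final case class PrivacyContext(salts: Map[String, String])

  // Hashes `value` with the salt of the given stakeholder;
  // returns None when the context has no salt for that stakeholder.
  def hash(value: String, stakeholder: String, ctx: PrivacyContext): Option[String] =
    ctx.salts.get(stakeholder).map { salt =>
      val digest = MessageDigest.getInstance("SHA-256")
      val bytes  = digest.digest((salt + value).getBytes(StandardCharsets.UTF_8))
      bytes.map(b => f"${b & 0xff}%02x").mkString // hex encoding
    }
}
```

With such a context, the same email hashed for two stakeholders gives two unrelated digests, so datasets handed to different teams cannot be joined on the hashed column.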
00:47:36
Thank you very much for the talk.
00:47:41
Just one thing: what a nice application of recursion
00:47:44
schemes to schema representations.
00:47:50
one thing i have read about the and you may have to leave out all
00:47:54
the second that they're the scheme or presentations key law that pleasure is actually using uh
00:48:00
room recast it starts at six point that's asking the first one or another was a lot on the schema f.
00:48:06
on the second oh that's on your first before marcus for variable factor or yeah oh yeah
00:48:13
It depends. Okay, let's be clear on one thing.
00:48:18
This design has the TSchema and GData, which are ADTs, okay?
00:48:24
In real life, if you don't have an application for these ADTs,
00:48:30
if you don't have them in concrete types, like in JSON
00:48:33
serialisation, or using the ADTs as a generalisation, you don't even
00:48:37
need to have them: you can just keep the pattern
00:48:41
functors, that's to say the SchemaF or the DataF.
00:48:44
And then, when you're manipulating those,
00:48:48
even if they are, you know, intermediary computation types that you're
00:48:53
not going to be using outside of the realm of Matryoshka,
00:48:57
then to manipulate them and make the compiler happy, yes, you need
00:49:02
to make them fixed: you need to have Fix of DataF or Fix of SchemaF.
00:49:07
For simplifying the code in this presentation, I just picked
00:49:13
the TSchema: we kept the TSchema, we kept the GData.
00:49:18
But in a real-life application, if these types are not needed,
00:49:24
then you can just keep the pattern functor and use it as a pivot
00:49:30
for the concrete types that you have, whether it be a JSON Schema or
00:49:34
a Spark DataType or an XML or an SQL one. Just have
00:49:38
this SchemaF, this pivot, and you can use it, just
00:49:43
like, you know, the HList of shapeless, to transform between arbitrary formats.
00:49:50
In the data lake, they are used as the pivot format for going
00:49:55
from JSON Schema to Spark DataType, to Avro data type, to Parquet data type, I mean...
00:50:03
Yeah, we use them as the pivot format. So,
00:50:09
if you use SchemaF or DataF on their own, then you need a fixed point,
00:50:15
whether it be Fix or Nu; well, there are many fixed points that you can use. We used the simplest one.
00:50:25
Does that answer your question?
00:50:31
Okay.
00:50:36
Thank you, thank you.

High performance Privacy By Design using Matryoshka & Spark
Wiem Zine El Abidine and Olivier Girardot, Scala Backend Developer at MOIA / co-founder of Lateral Thoughts
June 13, 2019 · 3:48 p.m.
754 views
