Transcriptions

Note: this content has been automatically generated.
00:00:00
Alright, thank you. It's a real honor to be here for the ten-year anniversary of the Scala Days conference. When I was brainstorming with Martin about what to talk about, I offered: hey, I could talk a lot more about how Spark and Scala work together, how great and essential Scala is to Spark. And Martin said: why don't you talk about something else? The rest of the conference is already full of Scala stories and Scala tech. So, let's talk about something else.
00:00:34
So I'm going to talk to you about three different things. First, for those of you who have never been to a Spark Summit conference and don't know how Spark started, I'll tell the story of how Spark was first created. Then, since a lot has changed since Spark was created, I'll talk about what has changed and what new problems we have seen among our customers and Spark users. And finally, two open source projects that were created in response to those problems: Delta and MLflow.
00:01:01
First, a little more about myself. I did my PhD at UC Berkeley in the AMPLab. The thing I'm most proud of is that I actually deleted more code than I added in Spark, once you consider all the work combined. I think the day my net contribution reached zero, I popped a bottle of champagne: I'd made it.
00:01:23
So here are the three things we'll be talking about; let's get started. Spark is also around ten years old by this point. It started as a very simple academic prototype around ten years ago, and the reason we started it was because of this guy up there.
00:01:42
Ten years ago, Lester was a PhD student at the UC Berkeley AMPLab, doing machine learning, advised by Michael Jordan. And Lester back then found out about one thing: the Netflix Prize. How many of you remember what this was? Right, about a third of you put up your hands.
00:02:00
Netflix back then decided: hey, we can create a challenge. They anonymized their data sets, the movie rating data sets from their users, and put them out in a public contest: whoever could come up with the best recommendation algorithm for movies would win a million dollars. A million dollars was a lot of money for Lester back then; he was making about two thousand dollars a month as a PhD student, unlike the PhD students at EPFL, who are making a lot more.
00:02:29
So Lester decided to join the contest. But he quickly realized: hey, this was the first data set he'd had to work with that was larger than the memory and disk space on his laptop, and he needed some solution just to process all that user movie rating data. He looked around, and there were not a lot of good solutions back then: they either worked really well on a single node but not in a distributed setting, or they worked well in a distributed setting but were very inefficient to prototype with, very slow to iterate.
00:03:01
So he talked to the student next door, Matei, the original creator of Spark, and told him: hey, I think if you gave me a couple of primitives, I could leverage them to prototype my machine learning algorithms far more quickly. One of the keys with machine learning is that you have to be doing a lot of experiments, and the rate of iteration matters a lot more than the result you get at any particular point in time.
00:03:26
So Matei said sure. I believe the first version he showed was a quick prototype, and then they decided: hey, this could really work, let's use a more proper language. He actually picked up Scala; I think Spark was one of the first projects Matei used Scala for. After about a week there were only six hundred lines of code. This was the very, very first version of Spark, and of course it looked very different from what Spark is today.
00:03:55
But you could call it the first unified thing, in the sense that Spark had two functionalities. One: there was a way to process data, because a large part of machine learning is about how to process and prepare data so it's ready for machine learning models to train on. Two: you could actually use Spark to run machine learning algorithms, especially distributed ones. And Spark just grew from there.
00:04:24
Now, what happened to Lester is usually what people care about. This is the actual leaderboard of the Netflix Prize from back then, and you can see the top two places essentially tied; Lester's team submitted their solution about twenty minutes later than the first-place team, so they lost the million dollars. And here's a picture of the other team happily accepting the check. If Spark had been invented twenty minutes earlier, Lester would be a million dollars richer.
00:05:00
Now, a lot has changed. We started Databricks, the company, in 2013, about six years ago, and we've been working with a lot of real customers in production. The question we've always tried to answer is: what are people doing with data, and what are the biggest hurdles they're facing?
00:05:15
There were a couple of assumptions when we started Databricks, and I think a very important one was: hey, Spark can do everything, and that's all you need. It can do machine learning and train the models for you; it can do data prep for you; so let's focus on that and forget about everything else. But as we worked more and more with customers, we started to realize: Spark is pretty powerful, but there are a lot of surrounding pieces that are hard, call it data engineering and the software engineering lifecycle, that are not solved by Spark itself. Some of it involves a completely different paradigm from what Spark touches, so it's not even appropriate to extend Spark to cover those use cases.
00:05:56
One of the misconceptions about machine learning in the industry, when you talk to people who haven't actually done it, is that they think machine learning is: I go download some library like TensorFlow or PyTorch, apply the latest neural net architecture, feed some data into it, and boom, I get great results. But the reality is there's a lot more to machine learning than just the machine learning itself, and this, I think, is probably the best illustration of that.
00:06:25
I'm taking this out of a Google NIPS paper from 2015, "Hidden Technical Debt in Machine Learning Systems." What this chart shows is that Google did an analysis of all the machine learning applications they run. Each box represents a specific module of a machine learning application, and the size of the box indicates roughly the amount of code written for that part of the system, so it's a proxy for the complexity of the component.
00:06:54
As you can see, or maybe you can't, in the middle there is a little black box that says "ML code," and everywhere else are a lot of other boxes: configuration, data collection, verification, serving infrastructure, resource management, all of which are not specific to machine learning. But they all need to be done. As a matter of fact, when you build real-world machine learning applications, you spend most of your time on those.
00:07:21
Even more interesting: the people who really understand the black box are typically called data scientists or machine learning engineers, while the people who understand the rest of the boxes are what we call data engineers or software engineers. These two personas use very different technologies; they often sit in different places, and in many enterprises they don't even have the same reporting chain. One of my favorite questions when I visit customers is to ask the data scientists: do you know where the data engineers sit in this building? And often the response is: no, I occasionally talk to them on Slack. The tech stacks they use are also very different, which creates a lot of challenges in building the overall system.
00:08:05
So a lot of what we've been doing at Databricks, after realizing that, is figuring out how to bridge the gap between the different personas: how do we actually make all of this work by enabling people to collaborate? With that, I'll be talking about two separate projects. The first one is Delta, which is about scalable, reliable data lakes; the key to Delta is its focus on making your data ready for analytics, which can involve a lot of data massaging. The second is MLflow, which focuses more on the lifecycle management of machine learning.
00:08:39
Let's get started. I don't know how many of you are data engineers here, but Scala has been extremely essential to data engineers; it has become basically the de facto programming language of data engineering, thanks to Spark, Kafka, and other frameworks. And one of the big changes in the industry in the past decade has been the transition to, or rather the addition of, the data lake.
00:09:04
In addition to the old-school data warehouses, the concept of the data lake is pretty simple. You collect all your data, everything you have, structured and unstructured: sensor data, images, tables, transactions, logs. You dump everything into this data lake, which is typically a distributed file system like HDFS or some object store in the cloud like S3. And then you can run all kinds of downstream analytics use cases on it, from fancy machine learning to business intelligence and data science, and the world is perfect. That's the picture the industry has painted.
00:09:38
When we started Databricks, one of the assumptions we went in with was: let's focus on compute, which is Spark; as for storage, the data lakes, I think that's a solved problem. We actually believed this picture, until we started building data lakes ourselves and realized a few problems.
00:09:59
What follows actually mirrors the journey we had to go through at Databricks in building our internal data lake, a data lake about Databricks itself. What we want to do is collect all the events coming in from our services; I believe they're on the order of tens of terabytes a day by now.
00:10:17
and then there's a few things want to do one is we want
00:10:20
we pour our metrics in no time to can see what exceptions ordering
00:10:25
um across all of our manage bar clusters want to see how
00:10:28
people are using this bar clusters how you were using sparky the eyes
00:10:32
um and so streaming on alex and the same time was on the dome all those into
00:10:37
the lake so we can do historic analysis for example we want our like word to decide hey
00:10:43
what um eighty i can we break in the future for say the next version of spark when dallas how often do people you'd
00:10:49
be on a guy isn't it got almost nobody uses is causing all okay she read remove it right and then we also have
00:10:57
so the um machine learning algorithms running to predict for example what
00:11:01
the behaviour of different uh customers and users in terms of usage
00:11:07
Now, the first thing we built was the real-time part. Once we dump all the events into Kafka, it's not too difficult to get Spark working: you write a streaming job with Spark Structured Streaming, your events come in, and a few seconds later they show up on a dashboard somewhere. It's great.
00:11:27
The problem is that you also want to analyze and store those events, and your dashboards can't just look at the last day; in many cases they also need to look at what happened in the past, for example to show historical trends. So we applied this thing called the lambda architecture, which many of you might have heard of. What it does is bifurcate your pipelines: you have one pipeline for real-time data, and one pipeline for batch, which is basically the offline, historical data.
00:11:59
Now, any time you bifurcate anything in software engineering, the architects will tell you something is wrong with this picture, because you have to write everything twice. But it's not too bad, because Spark helps here in many ways: the Spark API allows you to express one set of programs and run it either in a batch fashion or in a streaming fashion, without much reconfiguration.
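To make that concrete, here is a minimal sketch of that unification, assuming a hypothetical event schema, S3 paths, and sinks: the same transformation function is applied once to a batch DataFrame and once to a streaming one.

```python
# A sketch of Spark's batch/streaming unification, with made-up paths and schema.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("unified-pipeline").getOrCreate()

schema = StructType([
    StructField("timestamp", TimestampType()),
    StructField("event_type", StringType()),
])

def summarize(events):
    # One set of logic, written once against the DataFrame API.
    return (events
            .withColumn("date", F.to_date("timestamp"))
            .groupBy("date", "event_type")
            .count())

# Batch: run the logic over the historical files.
batch = summarize(spark.read.schema(schema).json("s3://bucket/events/"))
batch.write.mode("overwrite").parquet("s3://bucket/daily_counts/")

# Streaming: the exact same function over a live stream of the same schema.
stream = summarize(spark.readStream.schema(schema).json("s3://bucket/events/"))
(stream.writeStream
       .outputMode("complete")
       .format("memory")
       .queryName("daily_counts")
       .start())
```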
00:12:25
Then the other thing we wanted to do was the reporting. The problem with streaming data is that as events come in, in real time, and you want to write them out with low latency, you start writing data to your data lake very quickly, in our case to S3. As part of that, you create the small-files problem: small files beget more small files, and when you turn around and run your jobs, the runtime is dominated by just the metadata operations over all those small files, even just listing them.
00:12:59
The other problem is that the number of pipelines grows. I'm showing you just one simple diagram, but in reality there are different data engineers writing different pipelines, sometimes collecting from different data sources. We would actually have one program writing data with a specific schema, and another program, written three months later, maybe by a different team, assuming a slightly different schema, writing to the same destination. It gets pretty messy. So we added validation: let's make sure we have validation jobs, which of course you have to add in both places. And the other big problem: we write software with bugs. Everybody does.
00:13:40
What happens if there's some failure, sometimes not even because of a bug, but because a machine went down in the middle of processing? The way most of the industry has tackled this is to partition the data into disjoint chunks, for example by date, and whenever a failed job has partially written a specific set of data, say today's, you fix it by overwriting the entire day of data. That's one more piece of logic added to the application. Then you reprocess today's data; and sometimes you only find out about the failure maybe three days later, so you replace all three days of data and reprocess them.
00:14:25
Then the other one is mutation: say I realize one of my customers changed their name, and I want to update that and have it reflected in all my records. The standard process is, again, partition replacement: sometimes you partition the data by customer as well, and then you replace that whole customer's data. And sometimes you can't even do that; for example, if you get too many customers, you'd have to rewrite everything.
00:14:56
It's all very difficult. And the last one: hey, I have downstream jobs, maybe even doing real-time querying on the very data I'm rewriting. Every time I reprocess, the reprocessing requires deleting the existing data, and if there happens to be a job reading the data while it's being deleted, that job will fail. So now you even have to come up with scheduling conventions: maintenance only happens at, say, 3 a.m. San Francisco time. Then you add a European office, and you've ruined everybody's day.
00:15:33
So, one complexity after another. We realized we had this team of maybe five rock-solid data engineers working not on data problems but on low-level distributed systems problems and concurrency problems, and that's because the underlying data lake doesn't provide them sufficient guarantees. They were distracted by, I think, three or four big things; let me go into them in a bit more detail.
00:15:59
the first there's no item a c. d. um provided by down the line depicts the store system
00:16:05
just takes whatever uh it gets if you have a
00:16:09
partial right coming from a job that she fell the sources
00:16:13
and understand any of it because it's so so but it should be the file system doesn't have any higher level semantics
00:16:20
and we have actually partially written data is very difficult to trust the data system and the other was no quality enforcement
00:16:28
you could add these writing garbage you have data coming from priests like the difference schema
00:16:33
um and that's very difficult to actually uh make sure the downstream jobs are correct
00:16:39
and last is this knoll isolation which means when you have a right job is
00:16:44
the meeting data has to do with reprocessing the re job would just fell right
00:16:51
Let me show you how real and how widespread these problems are. I went through a lot of my email and the support tickets we've received from customers, and took screenshots of them, anonymized. There are a lot of problems with data lakes, despite these being ten-year-old, mature technologies. Here are some examples. One: a simple DataFrame loading command blocked on metadata operations; when we profiled it, the query itself finished in one or two seconds, but Spark spent about a minute just listing files on S3.
00:17:29
Then there's the FileNotFoundException, which is very difficult to get to the bottom of. This is what I was talking about: we have a read job reading while a write job deletes the data. "Different files have conflicting schemas": you might have programs written at different times assuming different schemas, and we see a lot of this coming in. And "too many small files," one of the classics: we've spent endless engineering time helping customers concatenate small files. How do I take a thousand small files and concatenate them, how do I control them, and so on. At some point this class of issues took up maybe half of our engineering support tickets.
00:18:07
So, after realizing that all of these problems have very little to do with Spark or anything on the compute side, and that it's really the underlying storage that doesn't provide the right guarantees, we started a new open source project called Delta. Before I tell you how Delta works, let me show you what the picture looks like when you go from the earlier, pretty complicated architecture to Delta.
00:18:40
Here's the picture we get with Delta. You have all your events coming in from all your different sources, and you dump them into a single Delta table. Typically, this first table is what we call the bronze table, and it just stores the raw events.
00:18:57
Then you incrementally refine it: you create basically a pipeline of Delta tables, each connected by some ETL job that takes you from one to the next. The diagram shows exactly three, but you might have pipelines of twenty; I've seen thirty, because everyone has different business logic. The way it works is that the first table typically holds the raw events, with virtually no parsing and no application logic, so it becomes an archive, and its retention is as long as whatever you can fit in HDFS or S3.
00:19:31
Then you incrementally refine the data; its quality gets higher and higher as you go through this Delta pipeline, and at some point you have data that's completely ready for analytics, whether that's machine learning, streaming, or a dashboard. What Delta provides is a few guarantees. It has full ACID transactions: you actually get atomicity, you get isolation, and you get basically serializable transactions. And it's open source, powered by Spark.
00:20:08
Let me just walk through all of that. The bronze table is as simple as possible to write to; the whole point of it is to store all of the raw data. Then, in the intermediate tables, maybe we do some parsing, for example JSON parsing, so now you've structured the JSON into different fields and cleaned up some of the garbage. And often, for the business-level aggregates, people roll up the data. For example, you might be getting event data at microsecond granularity, but for the purposes of your end use cases you don't care about microseconds, so the idea is to roll the data up to every second, or even every day. All of that you can do with Spark.
00:20:58
One of the keys here: you might look at this and say, okay, now you have a pipeline, and maybe it even bifurcates, so what's the big deal? Well, the thing is, if your business logic changes, let's say the way you parse dates changes, likely because you realize you're getting datetimes in a new format, it's actually very easy to reprocess all that data, because all of it is stored in the single bronze Delta table, with retention going back as far as you want. All you need to do is write maybe a batch job (the Spark API lets you write it either as batch or as streaming) to rewrite your downstream tables, and then just restart the streaming job from a specific point in time.
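As an illustration, here is a hedged sketch of one such refinement step, streaming from a raw bronze table into a cleaner silver table; the table paths and the JSON payload layout are invented for the example.

```python
# A sketch of one bronze-to-silver step in a Delta pipeline: stream from the
# raw table, parse, clean, and write out. Paths and schema are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

bronze = spark.readStream.format("delta").load("/delta/events_bronze")

silver = (bronze
          # Parse the raw JSON payload into typed columns...
          .select(F.from_json("payload",
                              "ts TIMESTAMP, user STRING, action STRING").alias("e"))
          .select("e.*")
          # ...and drop records that failed to parse.
          .where(F.col("ts").isNotNull()))

(silver.writeStream
       .format("delta")
       .option("checkpointLocation", "/delta/_checkpoints/silver")
       .start("/delta/events_silver"))
```

If the parsing logic later changes, the same transformation can be run as a batch job over the full bronze history to rebuild the silver table, and the streaming query restarted from that point on.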
00:21:49
So it's pretty powerful, and it makes a lot of things very simple. After we rolled out Delta and our customers started using it, we started seeing very few of those support tickets. At that point: how does it work? We're at a technical conference, and people care about exactly how things work, no matter what.
00:22:09
The way Delta works is actually pretty simple: we apply a lot of the old-school database techniques to this new setting, with some tweaks. The idea is that we keep a write-ahead transaction log for the data. If you look at how any Delta table is stored on disk, you'll see there are data files, typically partitioned, for example by date or by country, and those data files are stored in Parquet, the most common columnar format for data.
00:22:39
for a data and then um you have the transaction walk in addition
00:22:43
to just storing a bunch of all files transaction log some label the
00:22:47
monotonically increasing it was a zero dodgy someone dodgy designs keep going up
00:22:52
and the table was spacey defined by a set of actions in the transaction and the different
00:22:57
actions here are uh it's gonna this of them you have adding a file removing a file
00:23:04
um and there's one other thing is a lot different maybe get into here
00:23:08
um but once you have actually for example really in all the transaction logs effectively have
00:23:14
the latest snapshot um of the table you know what other valid files in the table
00:23:22
If you want to change the table, you just keep appending new transaction log files to it. For example, the first transaction log, 000000.json, might say: add one.parquet, add two.parquet. But let's say you realize one.parquet and two.parquet are too small and reading them isn't efficient, so you run a compaction that combines them into a single bigger file, three.parquet. You record that in your next transaction log, 000001.json, and what it describes is: remove one.parquet, remove two.parquet, add three.parquet, which is just rewriting files one and two together as file three. Simple.
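To make that concrete, the two commits could look roughly like this on disk. This is heavily simplified: real Delta log actions carry more fields, such as file sizes, timestamps, and partition values.

```
_delta_log/00000000000000000000.json      # first commit: two data files
{"add": {"path": "one.parquet", "dataChange": true}}
{"add": {"path": "two.parquet", "dataChange": true}}

_delta_log/00000000000000000001.json      # compaction: same data, fewer files
{"remove": {"path": "one.parquet", "dataChange": false}}
{"remove": {"path": "two.parquet", "dataChange": false}}
{"add": {"path": "three.parquet", "dataChange": false}}
```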
00:24:05
Now, to actually enforce this and give you transaction guarantees, we need to agree on the ordering of changes when there are multiple writers. For example, user one writes 000000.json, and then both user one and user two try to write 000001.json; one of them has to lose. Say user two races ahead and wins, so user two's commit claims that version. But in many cases there's no real conflict, so user one can just retry, and the software does this automatically for the user (you shouldn't have to worry about it) and commits its changes as the next version.
00:24:45
So how does Delta actually resolve conflicts? Two transactions might conflict; they might, for example, both compact the same two files and write them out, which would duplicate data. The resolution is also pretty simple. If somebody else commits before you (in this case, user two tries to commit and realizes the version is already taken), all it needs to do is read that .json file and check what was changed. If user one's transaction didn't touch anything that user two's transaction read or wrote, user two can just commit again at the next version, with no failure visible to the end user's job. But if there's a real conflict, for example both deleted the same file, that probably means an actual application-logic conflict. Then we need to fail the job and tell the user: somebody else has done something that conflicts with what you're doing, so please retry. Up until this point, this is textbook material: if you've ever taken a database internals class, you know exactly what's happening here.
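For intuition, here is a toy sketch of that optimistic commit loop. It is not Delta's actual implementation; it just shows the shape of the protocol, using exclusive file creation as the "only one writer wins a version" primitive and a caller-supplied conflict check.

```python
# Toy optimistic-concurrency commit loop in the spirit of the talk.
import json
import os

def try_commit(log_dir, version, actions):
    """Atomically create <version>.json; fail if someone else already has."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_CREAT | O_EXCL: creation succeeds for exactly one writer.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")
    return True

def commit(log_dir, read_version, actions, conflicts_with):
    """Retry at increasing versions until we win or hit a real conflict."""
    version = read_version + 1
    while not try_commit(log_dir, version, actions):
        # Someone beat us to this version: read their commit and check
        # whether it logically conflicts with what we read and wrote.
        with open(os.path.join(log_dir, f"{version:020d}.json")) as f:
            theirs = [json.loads(line) for line in f]
        if conflicts_with(theirs, actions):
            raise RuntimeError("conflicting concurrent transaction; fail the job")
        version += 1  # no logical conflict: re-commit at the next version
    return version
```

On a local file system, O_CREAT | O_EXCL gives that mutual exclusion directly; on object stores like S3, which lack an atomic put-if-absent, a separate coordination mechanism is needed for the same "only one writer can create version N" guarantee.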
00:25:49
One of the big changes with big data is: if you have a streaming job committing maybe every second, or even every hundred milliseconds, writing to your transaction log, you end up with a lot of JSON files. That would turn the too-many-small-files problem into a too-much-metadata problem.
00:26:09
So we thought about this a lot, and we decided: we actually have a very scalable engine for processing large amounts of data, so when the metadata itself gets too large, why don't we treat it exactly as data? Metadata is not a special class in the system. What we do is, every once in a while, take all the JSON files, read them using Spark itself, and checkpoint them into Parquet format, which is extremely scalable and much higher-throughput to read. From then on, we read that checkpoint directly using Spark itself; all the metadata becomes just normal data. This is how we can handle billions of files in a single table, and have literally hundred-petabyte tables, because the metadata is no longer the bottleneck.
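In the same spirit, here is a toy sketch of the "metadata is just data" idea, folding the JSON commits into a Parquet snapshot with Spark itself. It deliberately glosses over commit ordering and the precise checkpoint schema the real implementation uses.

```python
# Toy sketch: treat the transaction log as data and checkpoint it with Spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

log = spark.read.json("/delta/table/_delta_log/*.json")      # every commit

adds = log.where("add IS NOT NULL").select("add.*")          # files added
removes = log.where("remove IS NOT NULL").select("remove.path")

# Live snapshot = files that were added and never removed.
snapshot = adds.join(removes, "path", "left_anti")

(snapshot.write.mode("overwrite")
         .parquet("/delta/table/_delta_log/_checkpoint.parquet"))
```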
00:27:01
I don't have enough time to go into the different use cases of Delta here. The project is basically one year old; it's been in production for a year, and Databricks open-sourced it recently, about two months ago. Every month it now processes exabytes of data on the Databricks platform alone, and that number is increasing very quickly. The thing is production-ready; it solved a lot of our big data engineering problems. And we decided it shouldn't just be for Databricks customers, so we created the open source version to make sure it works for everybody.
00:27:34
So that was Delta, basically the data engineering piece of analytics and machine learning. The next thing I'll talk about is MLflow, which is about machine learning lifecycle management. Let's take a more machine-learning-centric view of data pipelines, or just look at machine learning pipelines.
00:27:57
I'm sure there are charts breaking this down into many different components, but let's simplify it to three basic steps; this is the typical process machine learning engineers and data scientists go through. First, data prep: you prepare the data. A lot of this is done in Spark, and a lot of it on Delta. Second, based on that data, you build models.
00:28:20
and one of the things as i uh talked about earlier is
00:28:23
hey the thing about machine earnings you gotta be experimenting a lot
00:28:27
to get to the best result is not about the best result a particular snapshot in time it's about iterate iterations
00:28:34
do you have to be a lot of things ever to be doing a lot of experiments and then last but not least once you have something you actually
00:28:39
kind of happy with you have to deploy in production or you can just do
00:28:42
a model and say hey i'm done with it you gonna use them or somehow
00:28:47
There are very disparate technologies throughout this entire stack, technologies not really designed to work with each other, and, as I said earlier, it also involves different personas: different engineers and data scientists. They need to work together, but there was really no tool to make it easier for them to work together. So there needed to be a way to standardize across these three steps, and this is why we started the open source MLflow project, with three separate components: Tracking, Projects, and Models. I'll go into each one of them.
00:29:24
But before I explain MLflow, let's look at the before-and-after picture. Here's what a very simple Python model-training script might look like. A lot of data scientists use Python, which, by the way, has also become the language most data engineers use. What they do is just print the different numbers at every iteration; it's a form of printf debugging: here's the set of hyperparameters for my machine learning model and for my data, and here's the accuracy I get on my test data set. And once I'm done, I dump the model somewhere with Python pickle, so I can use it later in a different program.
00:30:15
So the result you get is whatever ends up in standard out, and now different questions come up. What if I change my input data? The output doesn't describe which data the parameters belong to, and a machine learning model is produced by the combination of code, parameters, and data: when your data set changes, you get a different model. What about tuning the other parameters? Maybe I should put them in a spreadsheet. And what if the library I depend on gets upgraded, and they fixed a bug, which actually caused a regression in my model? Over a span of months, I might change my program quite a bit.
00:30:56
So what happens? I've got this log file, but what exactly happened, and when? Some people, a lot of people, use Excel; spreadsheets are really good at tracking things, so a lot of us do that. And funny things happen. I remember a very similar situation I ran into in college, taking physics labs. I was working in a small team doing experiments, so we started tracking the experiments in Excel, and of course we shared it with the colleagues running those experiments. So I'd send them one file, then v2, then v3, then final, final-for-real-this-time, final-final-final. It became a mess.
00:31:41
The other problem: now let's say you, the data scientist, have got a model you're happy with, and you want to deploy it to production. Data scientists usually don't know all the production engineering stuff. They don't know what an SLA really means or how to achieve it; they don't know what Kubernetes or containers are. So you ask the production engineer for help.
00:32:03
And then this sort of conversation happens; we've seen it among our customers. "Here's something I trained, can you please deploy it?" "Here's something I trained with scikit-learn." "I trained this with TensorFlow." And the production engineers say: "I have no idea what you're talking about. I'm used to writing Java or Scala code, and I'm really good at deploying Java application servers. All this Python, scikit-learn, Spark stuff, I don't know what it is." And even more interesting, someone will say: "I read about something interesting on arXiv; could you also deploy this for me?"
00:32:39
No. So here's the cool thing with MLflow. You're able to write a very simple program; it's the same program as before, except instead of just printing, you import mlflow, use its API, and actually log the parameters and metrics. All it does is call MLflow's API, which stores them in a tracking database.
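Here is a hedged sketch of that "after" picture: the same kind of training script, now logging to MLflow Tracking. The model, data set, and parameter choices are made up for the example.

```python
# The "after" picture: log params, metrics, and the model to MLflow Tracking.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    n_estimators, max_depth = 100, 6
    mlflow.log_param("n_estimators", n_estimators)   # instead of print()
    mlflow.log_param("max_depth", max_depth)

    model = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth)
    model.fit(X_train, y_train)

    mlflow.log_metric("r2", r2_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, "model")         # instead of pickle.dump()
```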
00:33:00
When you launch MLflow, the tracking component gives you a UI for your experiments. Here's a very simple screenshot: it visualizes and charts all the experiments, and you can go in and look at the output of each experiment, what you got out of it. It even tracks your code via the git commit, and it also integrates with Delta.
00:33:24
Let me mention one other piece of functionality Delta gives you that I didn't talk about: time travel. Because we save the transaction logs, you can go back and refer to any specific version of the data in the past. By integrating that with MLflow, we let you basically reproduce your model end to end: the version of the data, your code itself down to the git commit hash, and all the parameters are tracked, so you can now reproduce everything.
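Delta exposes that time travel as a simple read option; a small sketch, with a placeholder path and version:

```python
# Read a Delta table as it was at an earlier version.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

training_data_v5 = (spark.read.format("delta")
                    .option("versionAsOf", 5)        # or "timestampAsOf"
                    .load("/delta/events_silver"))
```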
00:33:52
You can also compare the different versions of the model, because that's usually what you want to visualize in a big experiment.
00:34:01
The next one is the Projects component. MLflow offers a standard spec that basically lets you define a project with its code, dependencies, and configuration, and then you can run that project in all sorts of different places: for example locally, or on a Spark cluster, all with just a single line of command. Here's what a project looks like: you basically have a YAML file, and because a lot of machine learning dependencies are in Python, we let you define the environment using Conda and list your Python entry points there. This becomes a standard container of sorts; think of it almost like a Maven-style spec for machine learning, and you can run it pretty much anywhere.
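For example, once a repository contains an MLproject file, it can be launched in one line, either with the `mlflow run` CLI or programmatically; here is a sketch with a hypothetical repository URI and parameter.

```python
# Launch an MLflow project in one line; `mlflow run <uri>` is the CLI twin.
import mlflow

mlflow.projects.run(
    uri="https://github.com/example/my-ml-project",  # repo with an MLproject file
    parameters={"alpha": 0.5},                       # hypothetical parameter
)
```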
00:34:54
The last piece of MLflow is the spec for Models, and an API for running them. What it does is really just provide a simple API that says: here are different types of models, and here's how I can execute them. For example, for a TensorFlow-specific model, I can just run it with TensorFlow itself, or turn it into a generic model that runs as an arbitrary Python function. And you can deploy an arbitrary Python function directly in Spark itself, so you can parallelize inference across a cluster of machines, even if you trained your model with TensorFlow on one single node.
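A sketch of that last point: loading a logged model as a generic Python function and applying it as a Spark UDF for cluster-wide inference. The model URI and the feature column names are placeholders.

```python
# Score a logged model in parallel across a cluster, regardless of the
# framework that trained it.
import mlflow.pyfunc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

predict = mlflow.pyfunc.spark_udf(spark, "runs:/<run_id>/model")

scored = (spark.read.parquet("/data/features")
          .withColumn("prediction", predict("feature1", "feature2")))
```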
00:35:36
So now, with this standard spec, all the production engineer needs to understand is: here's a command line to run. I'm given an MLflow project, and I can just run it across all the different environments I have.
00:36:03
These three components, Tracking, Projects, and Models, are really there to make life easier for data scientists, and also for all the supporting staff, including production engineers and data engineers, because those are the things we saw data scientists and their supporting teams really struggling with every day, and this is how we can make their lives better. MLflow is about a year old; we open-sourced it a year ago, at the last Spark Summit in San Francisco, and it has already gotten a hundred-plus contributors. I looked up the PyPI download index last night, and it actually clocks more than half a million downloads a month. Now, a lot of those downloads are probably CI systems fetching the package over and over, but nonetheless that's pretty impressive for an only one-year-old project. It's solving some real problems for data scientists that were historically overlooked, because everybody focused so much on just how to build the machine learning piece, without looking at the surrounding infrastructure.
00:37:07
Just to wrap up the talk: Spark was created (and I'm a huge fan of unification) to really try to unify big data analytics, and it took the approach of building the compute part of this picture. As we got more and more involved with different customers and users, we realized there are a lot of other pieces; even the ones we thought were solved problems are not really solved, and people are struggling with them. A big part of the responsibility of this industry, and of Databricks in particular, is to keep understanding the problems users and customers run into, and to provide higher-level solutions so they don't have to spend as much time fiddling with the infrastructure and can focus on their domain-specific problems. This requires tools that actually unify a lot of these pieces, and it puts the onus on us to unify different languages and tech stacks, and to go on from there.
00:38:04
The two specific projects I talked about: one is Delta Lake, which is all about making data ready to support the downstream analytics; and the second is MLflow, which is all about managing the lifecycle of machine learning projects. That's all I have. I can't be up here without telling you that we're also hiring: we have offices in Amsterdam and San Francisco, plus a smaller one as well, and we've also opened up remote opportunities. And I think I can take a few questions with the rest of the time.
00:38:52
Okay, if there are no questions, that means I explained everything perfectly. I think there's one over there.
00:39:03
[Audience] Yeah, I had a question about Delta. You mentioned that there's a sort of compaction step with regard to the metadata files, is that right? Is there any sort of Delta compaction for the data files? Is that something you need to run as an out-of-band process, or does it just come for free when you use Delta?
00:39:28
Yeah, so that's a very good question. It's not something that's done automatically for you right now, although we're experimenting with that. It's actually very easy to do the compaction yourself: you just schedule a regular job, say every three minutes, and the job is literally one line; Spark reads the files and rewrites them, and does it all in a single transaction.
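That one-liner looks roughly like this; a hedged sketch where the path, the partition filter, and the target file count are placeholders.

```python
# Out-of-band compaction: rewrite one partition into fewer, larger files
# inside a single Delta transaction.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

(spark.read.format("delta")
      .load("/delta/events_bronze")
      .where("date = '2019-06-12'")                    # compact one partition
      .repartition(16)                                 # target file count
      .write.format("delta")
      .option("dataChange", "false")                   # pure rearrangement
      .option("replaceWhere", "date = '2019-06-12'")   # replace only that slice
      .mode("overwrite")
      .save("/delta/events_bronze"))
```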
00:39:53
We are also experimenting with doing that automatically, directly as part of the writes. Part of the challenge is that compaction jobs, depending on data volumes, can be very expensive, so some of our customers explicitly don't want us to compact automatically for them, because they worry about the cost if they don't control it themselves. It is something we're actively looking at, but I suspect it will be something you can opt in to, automatic compaction, with a way to opt out, or maybe on by default with an opt-out, just to cover the different use cases.
00:40:40
Could you repeat your question? Ah, the question is to compare Delta with Iceberg. I think Delta is much more production-ready, and it probably tries to solve a few more problems. Delta has been running in production for a long time, and, more uniquely, one of the big things we solve is actually supporting streaming, and sort of incremental computation, which I don't think Iceberg does; that came from a lot of our real-time requirements, where people want to see data in a more real-time fashion. That's one big difference, but the underlying...
