Transcriptions

Note: this content has been automatically generated.
00:00:00
Hi everybody, my name is Benoît Quartier, and I'm going to present
00:00:08
our Elasticsearch, Logstash and Kibana setup: how it works and why we do
00:00:13
it. I know it's not something really new; I mean, a lot of people do it,
00:00:17
but there are quite a few challenges in setting this up, so
00:00:23
I think it may be interesting to discuss this together. So this is the agenda, what we're going to talk about.
00:00:30
We start with a short presentation of me: who I am, where I'm working, what I'm doing.
00:00:36
Then I will describe the problem we try to solve,
00:00:41
why we put Elasticsearch, Logstash and Kibana in place,
00:00:48
and then I'll show you how we solve it with ELK.
00:00:54
We'll finish with two points: the difficulties we encountered during this process
00:01:00
of setting up Elasticsearch, Logstash and Kibana, and what the next steps are.
00:01:05
So we have had Elasticsearch, Logstash and Kibana running in production for about
00:01:11
two or three years.
00:01:14
So first, a few words about me: I'm a software engineer. I
00:01:21
studied just a few meters away from here, at EPFL.
00:01:25
I have about ten years of software development experience. For the last four years
00:01:31
I've been working mostly on distributed systems using Elasticsearch,
00:01:36
doing quite a lot of Kafka and Spark.
00:01:39
I'm helping my colleagues who work in the development part of the company
00:01:45
to architect and operate distributed applications,
00:01:50
mostly in the cloud: the Amazon cloud mostly, but others as well.
00:01:58
I work for a company called Camptocamp that is just a few blocks,
00:02:03
a few buildings away from here. We have three departments. One is
00:02:07
geospatial, where we build websites that deliver geospatial information,
00:02:12
for example websites
00:02:15
for public administrations, and so on.
00:02:19
Another one is focused on enterprise resource planning, and the third one,
00:02:24
where I'm working, is the infrastructure department.
00:02:30
We work with automation; we do a lot with tools such as Docker,
00:02:37
and generally infrastructure as code and automation.
00:02:41
We almost never do anything by connecting to the server; we always automate things. So
00:02:48
if it works for one server, it should work for twenty-one.
00:02:54
So we go from infrastructure creation and architecture
00:02:58
to management, to monitoring, and scaling out
00:03:03
or scaling down whenever necessary, and always in a
00:03:08
virtualized environment, like the Amazon cloud mostly.
00:03:14
So let's go
00:03:17
to the problem.
00:03:21
What are the problems when you're running a distributed application that is composed
00:03:26
of many different services and runs on many servers per service? Because
00:03:33
each service always runs on a cluster of at least two
00:03:37
machines to have redundancy, and most of the time more, to be able to handle the load.
00:03:45
So it becomes difficult, when you have so many different servers,
00:03:49
when your customer calls you and tells you,
00:03:53
"A few of my customers say the website is very slow." And then what?
00:04:00
How can you handle this problem and try to find the solution?
00:04:06
Of course you cannot just try it and say, "It's quick for me, it
00:04:10
works for me," because there are many different servers, so you would not necessarily hit
00:04:14
the same server. It's not because it works for you that it's working for your customers.
00:04:20
So how do we get a clear view of what is happening in a distributed environment?
00:04:25
How do we measure the impact of a change? If we change the configuration, if we
00:04:30
put a new version of the application online, how do we
00:04:33
know if it's quicker or slower, if we have more or fewer errors?
00:04:39
How do we monitor the error rate and the response time across the different servers,
00:04:46
for a given service or for the whole application? And one more
00:04:51
difficult point is that sometimes one service will
00:04:54
just call another service, so it's not because one service is slow that
00:05:00
this service really has a problem, because maybe it's just one of the services that it calls,
00:05:08
that it asks for information, that answers slowly or is not working as expected.
00:05:15
So log analysis helps a lot to answer this kind
00:05:18
of question, and what I would like to underline is that
00:05:23
log analysis is not only a tool for system administrators;
00:05:29
different people in different roles can really
00:05:35
find interesting information for their daily job with log analysis. For
00:05:40
example application owners, people who are
00:05:44
not really technical, would be interested to see what are
00:05:48
the most used features of the application, for example,
00:05:52
or what are the most viewed or used datasets in their application.
00:05:56
And we can answer this kind of question with
00:06:00
log analysis and the ELK stack.
00:06:04
Application developers are very interested in understanding,
00:06:11
in seeing whether there are errors in the application. I'm talking mostly
00:06:16
about web applications, where an error means that the answer is something like
00:06:20
a 404 or a 5xx status, and it just doesn't work in the browser
00:06:25
of the customer. They're also interested in knowing,
00:06:31
if they just put a new release in production, is it faster
00:06:35
than the previous one, is it slower, what is the impact of this new version.
00:06:41
Another question that is also important quite a lot of the time
00:06:46
is: is the cache working as expected? I don't want
00:06:50
to rebuild the response each time, so I need to make sure that
00:06:55
the cache is working as expected, and it's not that easy to test that.
00:07:00
What is the ratio of hits to misses, of cache hits to cache
00:07:05
misses? And for the system administrators, other questions that are very interesting
00:07:11
and not that easy to answer: is the load balancing working as expected? Is the load
00:07:17
balanced evenly between all the servers? And what is
00:07:22
the impact of a configuration change? Maybe you just
00:07:25
changed the configuration or upgraded the version of a middleware, and you
00:07:30
want to see whether it has an impact.
00:07:34
And in a distributed application you cannot imagine just connecting to the server and having
00:07:39
a look at the log files there. You really need to centralize, index and search them.
00:07:47
So that's why we use Elasticsearch,
00:07:51
Logstash and Kibana. So this is the view of a web application
00:07:55
from, let's say, the developer's point of view. The application
00:07:59
is in yellow. There is the client browser
00:08:03
sending the request to the load balancer; the load balancer forwards the request
00:08:07
to one server or another, the servers running in the cluster.
00:08:13
The server may call another service or may call a database,
00:08:16
and then answers to the client browser.
00:08:22
Now, this is how we see the data flows: we will add
00:08:25
two new data flows to each application we manage in production.
00:08:30
We'll always add the data flow that you see here in green,
00:08:36
where we collect metrics on the servers, for example the
00:08:40
CPU usage, memory usage,
00:08:42
disk space, or maybe quite a lot of different metrics, but these are all numbers.
00:08:50
We centralize them in a server, the one in
00:08:53
green, that's called the Graphite server. Graphite is quite old and it could be
00:08:57
something else, but anyway you will have a time series database where we collect all these metrics.
00:09:05
Then we will build dashboards to see
00:09:10
what's happening on your servers. These are all numbers. This could be
00:09:15
part of another talk, but today we focus on the blue,
00:09:19
on the blue data flow, where we ship the
00:09:23
logs. Here we're talking mostly about access logs: access logs
00:09:27
generated by a web server like Apache, or a caching
00:09:31
server like Varnish. We use a lot
00:09:35
the logs from the load balancer; we like them because they are the ones that are
00:09:41
closest to the end users, so what you see in the logs of the
00:09:44
load balancer is more or less what the end user will see.
00:09:50
So we will take these logs and ship them from each
00:09:54
server. We ship them to a queuing server; we'll maybe come back to this later.
00:10:00
And then we'll have Logstash reading these logs, transforming them,
00:10:06
parsing them, adding information, enriching them for example,
00:10:10
and then indexing them in Elasticsearch. And then, at the top, we use Kibana
00:10:17
to have an aggregated view, a clear
00:10:22
view of all the logs in the dashboards. So
00:10:26
this is the basic architecture. I mean, this is just
00:10:30
an example web application, but the important point is that each time we
00:10:35
put an application in production, we always start by having these two data
00:10:40
flows: the flow of the logs and the flow of the metrics.
00:10:44
Now,
00:10:47
these are really mandatory for us to be able to run an application in production.
00:10:52
If we go into more detail on this log data
00:10:58
flow, it can be divided more or less into four
00:11:01
stages. The first stage is to collect and ship the logs:
00:11:06
on the web server I need to read these
00:11:09
logs and ship them to the centralized system.
00:11:14
The second part is to transform and enrich the logs: so read
00:11:18
them, maybe add fields, maybe parse them, maybe change the format,
00:11:23
make them more meaningful or more useful.
00:11:29
The third stage is to put them in Elasticsearch to make them searchable, so index them in Elasticsearch.
00:11:35
And finally the final product, what we want, is being able to visualize
00:11:41
aggregated information in a Kibana dashboard, or sometimes also
00:11:48
extract interesting information with direct queries to Elasticsearch.
00:11:55
So now we go a little bit deeper into each of these four stages.
00:12:01
This is a simplified view where you just see one blue box
00:12:07
for each of the stages: first collect, then analyze and extract,
00:12:13
then index and make them searchable, and finally have this Kibana dashboard.
00:12:21
The first step is to collect and transport the logs,
00:12:26
which means move the logs from the application server, the web
00:12:30
server, the proxy or whatever, to the log management infrastructure.
00:12:37
This sounds very easy, but in fact I think it's maybe
00:12:42
the most difficult part, and for sure the most dangerous one,
00:12:46
because if you don't do it well, maybe you can kill your application server,
00:12:51
which is of course not the goal. So it must be very light,
00:12:57
light on the server side, because if you do parsing or
00:13:01
something that is CPU or memory consuming,
00:13:05
when you have a high load you will have a lot of logs, so it will
00:13:09
consume a lot of resources exactly when you want your server to be able to use
00:13:14
its resources to answer requests, not to parse the logs; that's not its job. So
00:13:21
you should do the least possible on this server.
00:13:25
And the very important question that you have to ask yourself when setting up
00:13:32
this part is: what happens when the receiving side is not responding?
00:13:37
What happens if Logstash is down for one reason or another,
00:13:41
the network link is down, or you have a maintenance?
00:13:46
So the first idea is that you will buffer, so maybe you would
00:13:49
keep them on your web server. And then maybe
00:13:53
this will be full, and you won't have enough space to continue to buffer them.
00:14:01
And then in this case you will have to decide if you just block
00:14:05
your application and say, "Oh no, I cannot log, I cannot
00:14:09
keep my logs, so I prefer not to answer requests." That is what a bank, for example, would probably do:
00:14:14
they prefer not to answer the request because they cannot log.
00:14:20
Or maybe you decide that you just drop the logs, because you say, "Okay, I run an information
00:14:26
website; if I lose a few logs, the most important thing is that the service continues to work."
00:14:31
But you have to take this decision. It's for sure that you don't want
00:14:34
surprises: either you block or you drop once your buffer is full.
00:14:39
And if you don't decide, the software will probably decide for you.
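The block-or-drop decision described above can be made explicit in code. The following is only a minimal sketch, not the speaker's implementation: the names `LogBuffer`, `OverflowPolicy` and `BufferFull` are hypothetical, and a real shipper would also persist to disk and retry sending.

```python
from collections import deque
from enum import Enum


class OverflowPolicy(Enum):
    BLOCK = "block"  # refuse new log lines; the caller must stop serving
    DROP = "drop"    # discard the oldest line and keep serving


class BufferFull(Exception):
    """Raised under the BLOCK policy when no more lines can be buffered."""


class LogBuffer:
    """Bounded in-memory buffer used while the receiving side is down."""

    def __init__(self, capacity, policy):
        self.capacity = capacity
        self.policy = policy
        self.lines = deque()
        self.dropped = 0  # number of lines lost under the DROP policy

    def append(self, line):
        if len(self.lines) >= self.capacity:
            if self.policy is OverflowPolicy.BLOCK:
                raise BufferFull("log buffer full, refusing new line")
            # DROP policy: lose the oldest line, count the loss
            self.lines.popleft()
            self.dropped += 1
        self.lines.append(line)

    def drain(self):
        """Return and clear buffered lines, e.g. once the receiver is back."""
        drained = list(self.lines)
        self.lines.clear()
        return drained
```

With `OverflowPolicy.DROP` the service keeps answering and `dropped` records how many lines were lost; with `OverflowPolicy.BLOCK` the caller gets a `BufferFull` exception and has to decide to stop serving, which is the "bank" behaviour from the talk.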
00:14:46
The second step, once you have shipped your logs from
00:14:51
your web server to Logstash, to your centralized log infrastructure,
00:14:58
is to use Logstash, or maybe something else, but in
00:15:03
this case we use Logstash a lot, to add contextual information
00:15:08
or to enrich with information from external resources. So for example
00:15:12
we add the contextual information from the source:
00:15:15
in a standard access log you don't have the exact name of the server that produced
00:15:20
the logs, so we add this information on the receiving side, because we
00:15:25
know where the connection is coming from. So we add the information about who
00:15:30
sent these logs, and we also add information
00:15:34
that is coming from a classification system. So our classification system
00:15:39
tells us, for each server, whether it's a development, integration or production server.
00:15:46
Another possible transformation is to enrich the information.
00:15:52
For example what we do is geolocation of the requesting IP, so you can use a database
00:15:58
with GeoIP and try to put the position
00:16:03
in your log line, to see where your users
00:16:09
are coming from. Or the country, maybe, if you don't want
00:16:12
the position. Or maybe, when you have a service that is only used by authenticated users,
00:16:17
maybe we add something about your users taken from your user database.
00:16:22
Another very classical one,
00:16:28
something that we do very often, is to just parse your URL
00:16:31
and extract part of your parameters to put them in another field, because in this case it will be far easier
00:16:38
to search or to aggregate based on this field of interest. Whereas if it's just
00:16:43
within the URL, your requests would become more and more complicated.
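To illustrate the URL-parsing enrichment, here is a small sketch using only the Python standard library. In the talk this is done in Logstash; the function name `extract_url_fields`, the `param_` prefix and the example parameter names (`layer`, `format`) are assumptions made for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def extract_url_fields(url, wanted=("layer", "format")):
    """Copy selected query parameters of a request URL into their own fields.

    The parameter names in `wanted` are hypothetical examples.
    """
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    fields = {"url_path": parts.path}
    for name in wanted:
        values = query.get(name)
        if values:
            # keep only the first value if the parameter is repeated
            fields["param_" + name] = values[0]
    return fields


print(extract_url_fields("/wms?layer=roads&format=png&width=256"))
```

Once a parameter lives in its own field, a search or an aggregation can target that field directly instead of matching patterns inside the full URL string.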
00:16:50
So that's for the transformation part. Just be careful with this: if you
00:16:54
put a lot of transformations with a lot of regular expressions
00:16:59
in Logstash, it can become quite heavy; the grok parsing is
00:17:04
consuming quite a lot of CPU.
00:17:09
Once we have enriched all this log information,
00:17:14
we index it in Elasticsearch. So Logstash,
00:17:18
I think one of the very nice features
00:17:21
of Logstash is that it has a lot of input and output plugins: you can read from a
00:17:26
lot of different sources and you can write to a lot of different outputs. In this case we talk about
00:17:33
Elasticsearch, but it's really amazing, the number of outputs and inputs you have.
00:17:41
So when indexing in Elasticsearch, we strongly recommend that you...
00:17:48
You know, Elasticsearch can use dynamic mapping: it just tries to
00:17:51
put a type on your data based on the first document it indexes.
00:17:56
But we strongly recommend that you use a predefined mapping that you build yourself,
00:18:01
where you decide what the type of each field is.
00:18:07
This makes things easier when later on you want to do aggregations,
00:18:12
and it avoids bad surprises.
00:18:17
Another piece of advice: the
00:18:22
maximum indexing rate will depend on the number of shards you have,
00:18:26
so if you have a really high-volume website you need to have enough shards so you can
00:18:31
let Elasticsearch parallelize the indexing process.
00:18:37
But keep in mind that indexed logs will take quite a lot of space, so
00:18:44
for example for a customer, one of
00:18:49
our biggest customers, with a really high-volume website, it can
00:18:54
quickly become quite costly to keep all the logs for a very long time.
00:19:00
So you have to think a little bit about the data lifecycle: how long do you
00:19:04
keep these logs? And maybe after a few months you only keep a subset of the logs,
00:19:11
the ones that are interesting, or a
00:19:14
summary, an aggregated subset of the logs.
00:19:19
You can use a tool called Curator to remove old indices; this is the
00:19:24
lifecycle of your data. Once you have your logs in Elasticsearch,
00:19:32
you can start to use them, so you can use Kibana to build dashboards,
00:19:38
and you can build different dashboards depending on your
00:19:43
target users. So for example we have a dashboard for the system administrators,
00:19:49
also a dashboard for application developers, and another dashboard for customers.
00:19:54
In the beginning we started doing
00:19:58
ELK really as a tool for technical people,
00:20:02
and it's really interesting because we realized that the management of one of our customers, they have
00:20:10
Kibana open like eight hours per day: they probably begin their day by opening Kibana,
00:20:16
and they finish the day by closing this software. It
00:20:21
really gives them a concrete view of their system, something really
00:20:26
easy to understand, and we were very surprised to see how much they use it, in fact.
00:20:33
Right now they push us to have more of this,
00:20:38
while in the beginning we really had to push quite hard to make them understand how useful it could be.
00:20:44
We also use this log aggregation stack as a tool
00:20:50
for data extraction: we have
00:20:53
time-based jobs, cron jobs, where we do a few requests on Elasticsearch
00:20:58
and then put the results in another system, because they have a system where, for the last ten
00:21:03
years, they have the number of requests per day, for example. So we do a few regular extractions
00:21:12
of aggregated data from Elasticsearch. And we also use it to measure
00:21:19
the compliance of a given service with its service level agreement. For
00:21:25
the service manager, or whoever is responsible,
00:21:29
we can really easily see the number of requests,
00:21:33
when they failed and why they failed, and it's very
00:21:37
interesting for the quality management of your applications.
00:21:43
So this is a small example. This is quite old; I mean, right now the
00:21:50
latest version of Kibana doesn't look exactly like this, it changes quite a lot and quite fast.
00:21:55
This is just to show you an example where we
00:21:59
have a few different graphics. So this
00:22:05
aggregates data from, I don't know, like six or
00:22:09
seven servers I think. That's just an example to show
00:22:12
the different kinds of graphics that we can have and
00:22:16
that may be of interest to different kinds of people.
00:22:20
The first one is just the number of requests per second here,
00:22:25
or per minute I think. Yes, per minute.
00:22:30
We have a different colour depending on the response time,
00:22:34
so if the response time is less than half a second it's green,
00:22:38
between half a second and one second it's light green, I think,
00:22:42
from one second to 1.5 it's orange, and then red if it's higher.
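The colour bands described for this graph amount to a simple bucketing of the response time. A minimal sketch, assuming the thresholds quoted above; how the exact boundary values (0.5, 1.0, 1.5 seconds) are assigned is an assumption, since the talk does not say.

```python
def response_time_bucket(seconds):
    """Map a response time in seconds to the colour band used on the dashboard."""
    if seconds < 0.5:
        return "green"
    if seconds < 1.0:
        return "light green"
    if seconds <= 1.5:
        return "orange"
    return "red"


for sample in (0.2, 0.7, 1.2, 3.0):
    print(sample, response_time_bucket(sample))
```

Computing such a bucket once at indexing time, as an extra field, keeps the dashboard queries simple: the graph only has to count documents per bucket.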
00:22:48
Here we just have the proportion of cache hits: the question is how
00:22:52
many of the requests come just from the cache, and how many do not.
00:22:56
And here, for example, we have the datasets that have been requested;
00:23:00
that is something that is extracted from the URL of the request, so
00:23:05
that's mostly for the customer: they see the data that have been extracted.
00:23:10
And here we have a zoomed view on the requests that
00:23:14
are slow, so it's only the orange and red part,
00:23:18
because here you don't see a lot, since they are a very low proportion.
00:23:24
So here is a view of the slow requests, and the status of the slow requests, the
00:23:30
response code: we see that they are slow but they don't fail.
00:23:34
And what's interesting here is, for example, we have different kinds of performance depending on
00:23:39
the dataset, because maybe sometimes they didn't add the index that was needed
00:23:45
in the database. So then you can select one of the datasets: just click
00:23:51
on it, and Kibana will add a filter and refresh all the graphs
00:23:57
with this filter, showing you only the statistics for this given dataset.
00:24:02
Or we could just click, for example, on cache misses and see what are the datasets that are
00:24:08
never in the cache, and maybe have a look at that to see if it's well configured.
00:24:17
So,
00:24:20
a few lessons we learned along the way. So maybe I can just put
00:24:24
them here so maybe you don't have to spend time hitting the same problems.
00:24:30
The log collection and shipping part is the most critical, because if it
00:24:37
fails, with a lot of programs it can maybe
00:24:41
block your application. If you use something like syslog and
00:24:46
then you don't have a connection, if you use a TCP configuration,
00:24:52
at some point it will just block your web server and stop there.
00:24:56
You need to decide what you do, and if you carefully configure
00:25:03
it you can do a lot of things, but you will have to spend time on
00:25:07
this, and you will have to decide whether it's acceptable for you to drop logs in case
00:25:14
the receiving end, the Logstash, is not available anymore.
00:25:21
The other point is that you can do a lot of things in
00:25:27
Logstash: it has input, filter and output features, and we use the input and the output a lot.
00:25:34
If you do a lot of things with regular expression parsing in the filter part of Logstash,
00:25:41
it will use a lot of resources;
00:25:46
yeah, it needs quite a lot of resources, and testing
00:25:50
these transformations can be quite difficult. So if you have a few
00:25:54
quite easy things, I think it's really the right choice. But
00:26:00
if you want to do a lot of things in your filters, if you want to change a lot of things in your logs before
00:26:06
inserting them in Elasticsearch, if you want to aggregate things, compute things,
00:26:11
modify things, maybe you should use something other than the Logstash filters. Maybe the
00:26:19
Elasticsearch 5 ingest node; I never tried
00:26:23
it, but I know that right now
00:26:26
Elasticsearch can do a lot of transformations on the data when you index them.
00:26:32
On our side we will use Kafka Streams, which is a framework
00:26:38
that you can use to modify streams of data.
00:26:42
Other people use Spark Streaming, I don't know. I'm just saying it's good to have a
00:26:47
few filters, but beyond that it's better to write a program that transforms your data.
00:26:53
I already said it, but keeping data for a very long time can be quite
00:26:58
expensive: data indexed in Elasticsearch takes quite a lot of space. With the default Logstash
00:27:04
template, if you use it, it will keep each field in two formats: the one that is analyzed and
00:27:11
tokenized, to get nice full-text search, and another one that is just the raw string.
00:27:17
So it will take far more space, of course, than if
00:27:20
you just have your log files compressed on disk.
00:27:25
So if you want to keep data for a high-volume website,
00:27:29
to keep them for a very long time, maybe you should build
00:27:33
aggregated datasets. Maybe you should aggregate, I don't know, per day or per week: how many
00:27:38
requests, how many errors,
00:27:43
or per hour, and things like this, per unit of time,
00:27:47
because you will be able to keep them far longer, of course: if you have
00:27:51
just one document per hour, you can keep them for years.
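The idea of keeping one small summary document per unit of time can be sketched as follows. This is an illustration only: the event fields (`timestamp`, `status`), the choice of one summary per day, and counting status codes of 500 and above as errors are all assumptions, and in practice the summary would typically be computed with an Elasticsearch aggregation rather than in Python.

```python
from collections import defaultdict
from datetime import datetime


def rollup_per_day(events):
    """Summarize log events into one small record per day.

    `events` is an iterable of dicts with a 'timestamp' (ISO 8601 string
    without timezone suffix) and an integer HTTP 'status'.
    """
    days = defaultdict(lambda: {"requests": 0, "errors": 0})
    for event in events:
        day = datetime.fromisoformat(event["timestamp"]).date().isoformat()
        days[day]["requests"] += 1
        if event["status"] >= 500:  # assumption: only 5xx counts as an error
            days[day]["errors"] += 1
    return dict(days)


sample = [
    {"timestamp": "2016-11-29T23:59:10", "status": 200},
    {"timestamp": "2016-11-30T08:00:00", "status": 200},
    {"timestamp": "2016-11-30T08:00:01", "status": 503},
]
print(rollup_per_day(sample))
```

Each day collapses to a single record, so the raw logs can be deleted after a short retention period while the long-term trend stays available.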
00:27:57
And I would also say, with Elasticsearch as with any distributed system,
00:28:03
I would recommend to start quite big, with quite a big infrastructure. Don't
00:28:08
be shy to put in a lot of RAM, a lot of CPU, and then
00:28:13
you can lower your requirements if you see that it works well,
00:28:17
that it works quite well and doesn't use all your resources.
00:28:21
Doing it the opposite way, starting small and growing, is quite
00:28:24
time consuming, so I really think the good way
00:28:28
is to start with something bigger than what you expect you will need, and then lower it.
00:28:37
And I think there's something else that is important to understand:
00:28:43
it's a tool that will work perfectly well most of the time,
00:28:50
but if you have a huge spike, like a distributed denial of
00:28:54
service attack on your application for example,
00:28:58
it will likely not help you as much as you would like, because parsing
00:29:02
the logs takes quite a lot of time. So you
00:29:08
won't have your infrastructure designed to handle the kind of traffic that you have during
00:29:13
your huge spike, so it won't help you a lot, because it will probably
00:29:18
fall behind trying to index all this stuff, and you won't
00:29:23
really see the last minute, the last second, in your Kibana,
00:29:26
because you cannot design your infrastructure to be able
00:29:31
to index an unusual spike in real time; it would
00:29:33
be just too expensive to maintain. So for this
00:29:37
kind of really huge and unexpected spike,
00:29:41
don't expect to have real-time data with this, and maybe you need
00:29:46
to focus on using metrics, just the numbers coming from the servers,
00:29:50
and not something that involves computing, parsing and indexing logs.
00:29:58
So I think that's more or less how we do it. I will just give you in a few
00:30:04
words what the next steps are for us, because I think it's interesting.
00:30:10
Right now we use Elasticsearch 2 point something, I don't remember exactly, and Kibana
00:30:17
3, I think, I'm not so sure. We are in the process of reinstalling,
00:30:24
putting everything on version 5. A lot of things changed in between,
00:30:29
so we have to redo a lot of things. That's what happens with
00:30:33
a fast-moving company like Elastic. When we started, I think
00:30:38
Kibana was not yet well integrated, or something like that.
00:30:43
So a lot of things move very quickly, and we have to change a lot of things.
00:30:49
Right now, as I showed you, we have a buffer, but this buffer is just
00:30:53
an in-memory buffer. It gives us one day
00:30:59
to replace something or to change something. But if
00:31:04
we have a huge spike, like the denial of service that we had once,
00:31:08
then it only lasts a few hours, so it's not enough to really
00:31:14
stop losing logs; it fills up very quickly and it's not enough for us to react. So we will change this,
00:31:20
replace this with Kafka, which will open a new
00:31:24
world of possibilities. Kafka is a queue system
00:31:27
that can write the queue on disk, so you can keep a huge amount of logs, and it's very performant.
00:31:34
In our next steps we will still use Logstash a lot, because
00:31:41
we do a lot of input and output, and it's really amazing for this; it saves us a lot of time.
00:31:46
But we won't use the filters anymore, because we will do most
00:31:50
of the transformations in Kafka Streams. Also, personally, I think
00:31:56
it's quite difficult to test and debug complex transformations in Logstash; it's far
00:32:00
easier in a programming language, with Kafka Streams.
00:32:06
The last point, not very important: we will use Docker everywhere. I
00:32:13
don't know if you know Docker a bit, but we have this stack running

Understand your Distributed Application with Elasticsearch, Logstash and Kibana
Benoît Quartier, Camptocamp
30 Nov. 2016 · 5:15 p.m.
