Transcriptions

Note: this content has been automatically generated.
00:00:00
So, contrary to big data, there are situations where we can only work with small amounts of data in our datasets, and the aim of my presentation is to discuss some methods and specific tools that may be used in such situations.
00:00:28
Before turning to these methods, I just wanted to recall some basic aspects of data analysis using statistical tools. The main question, in general, is to tell something from observations in a sample about a population which we don't know, a population that is not observed. Another possible question would be to estimate how we can locate an observed sample, or sometimes a single person, with respect to a reference population. So there is always, in statistics, the issue of saying something about the population when observing only some part of this population. And if we want to say something about the population, we have to estimate the properties of this population correctly. For example, one very frequent basic assumption in statistics is the Gaussian distribution for continuous variables. We also have to choose these tools in agreement with the properties of the variables that we've observed: if variables are, say, categorical, we can't apply the very same tools.
00:02:04
In statistics there is a constant distinction between what we call parametric testing and non-parametric testing. Parameters are numeric entities that characterize sampling distributions; for example, when we talk about the Gaussian distribution, the parameters of the Gaussian distribution are its mean and standard deviation.
00:02:33
The plan I'm going to follow for this presentation is, first, identifying issues and solutions that are associated with small samples, then presenting two very specific tools for analysing small samples, and then introducing resampling methods, which are general methods that may be appropriate for solving some problems when dealing with both small and not-so-small samples.
00:03:09
So, parametric tools rely on an estimation of the distribution parameters for the statistical estimator: in such tools, the hypothetical population is defined by a mathematical formulation. In contrast, when we use non-parametric tools, we do not rely on such an estimation, and the results do not depend on the distribution of the estimator or the variable.
00:03:39
Another classic distinction which is often used is whether the tools are distribution-dependent or distribution-free. Distribution-dependent tools are classically, for example, ANOVA, Student's t-tests, and regression, whether linear, logistic, or based on mixed modelling. Distribution-free tools are, for example, chi-square tests, binomial tests, Kruskal-Wallis, Wilcoxon, and also the basic bootstrap and resampling methods, more specifically what we call the non-parametric bootstrap. There are also some intermediate tools, which I'm not going to talk about, that fall under the umbrella of what we call parametric resampling methods.
00:04:29
So I'm going to address mainly non-parametric approaches to data analysis. The information in these two slides is easily available in textbooks, so I'm just going to move on, and people can read these pages if they need to once I make the slides available.
00:04:59
At first I will present two tools for analysing small samples or individual data, and then the second part of the presentation will be devoted to introducing resampling methods. The examples that I'm going to work out here focus more particularly on analysing individual data, that is, very small sets of data; this is also called single-case research. The two tools I'm going to discuss are, first, one which is based on parametric statistics but dedicated to small sets of data, and second, one which is typically non-parametric and dedicated to evaluating how, and if, an estimator varies between two discrete time slots.
00:06:07
The first example is dedicated to deciding whether we can locate an individual estimator with respect to a reference sample. These issues are discussed by John Crawford on his website, which is given on this page. The aim is to estimate, and this is a very classical aim in, for example, pathological or neuropsychological work, whether an individual exhibits a performance that is different from what one would expect should this individual come from the general population. So we want to decide whether we can consider that an individual is extreme with respect to a reference population, but this comparison is made to a non-normative reference sample, because we don't know the population for sure, and so we have to compare data from an individual to data that has been collected from a sample which is supposed to be a fraction of the population.
00:07:17
One traditional approach in neuropsychology has been to compare a patient's z-score with the estimated distribution of z-scores in the reference sample. The issue that has been addressed by Crawford and Howell, for example, is that these approaches tend to overestimate the probability that the observed patient is extreme, that is, to overestimate the probability that the observed patient is not part of the reference sample, not coming from the reference population.
00:07:55
In the slides I've put some R code, so people who want to test these aspects may use the code written in the slides. I'm not going to go through the code each time, only sometimes when I think it's useful to explain it. So here is just an example for computing z-scores in R.
00:08:26
Let's take an example: we get a measurement from a patient where the observed value is 12, and we have the sampling of this measure from a sample of the reference population, from which we've estimated that the mean would be 25 and the standard deviation would be 5. We want to locate the measure from this patient with respect to the sample that is associated with these parameters. For example, we would compute the value of the z-score from the patient's score and the mean and standard deviation of the reference sample, and we would compare this z-score with the standard z-score corresponding to, for example, a five percent criterion level. We would then decide whether the z-score falls below that value and, depending on this, whether the patient is extreme. So in this situation, for example, we may locate the patient at this position on the estimated Gaussian distribution of z-scores and then decide that the patient is extreme, and so not compatible with the reference sample. In other situations, for example with a higher score, the patient would not be considered as being extreme.
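As a rough illustration of the decision rule just described, here is a minimal R sketch using the example values (the variable names are mine, not from the slides):

```r
# Classical z-score approach: patient score 12, reference mean 25, SD 5.
patient_score <- 12
ref_mean <- 25
ref_sd <- 5

z <- (patient_score - ref_mean) / ref_sd  # (12 - 25) / 5 = -2.6

# One-tailed five percent criterion: is the patient abnormally low?
z_crit <- qnorm(0.05)                     # about -1.645
z < z_crit                                # TRUE: z falls in the extreme 5%
```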
00:10:08
The issues with this approach, as stated by Crawford and Howell, are related to the fact that the size of the reference sample is not taken into account: in a sense, the sample is used as if it were a direct fraction of the population, though we know that the sample is only an estimate of the population.
00:10:33
So the proposal from Crawford and Howell has been to use what they call a modified Student's t-test, for which the formula is given here; I won't get into the details, it's available on the slides. If you want to apply this formula, the main information to take into account is that, of course, you compare your observation with the mean of the sample, depending on the standard deviation in the sample, but you also take into account the size of the reference sample in order to compute this modified t-test. Depending on the degrees of freedom of the comparison, which depend on the number of observations in the reference sample, you will get information that is adapted to the reference sample. In this situation, for example, applying this formula in R would let you decide more adequately whether the individual observation is extreme with respect to the reference sample.
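For reference, the Crawford and Howell statistic is t = (x − m) / (s · sqrt(1 + 1/n)), evaluated on n − 1 degrees of freedom, where m, s and n are the mean, standard deviation and size of the reference sample. A minimal sketch, reusing the example values and assuming a hypothetical reference sample of n = 20:

```r
# Crawford & Howell's modified t-test (n = 20 is an assumed sample size).
x <- 12; m <- 25; s <- 5; n <- 20

t_stat <- (x - m) / (s * sqrt(1 + 1/n))  # modified t statistic, about -2.54
p <- pt(t_stat, df = n - 1)              # one-tailed p-value, about .01

p < 0.05                                 # TRUE: the patient can be called extreme
```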
00:11:57
Another aspect that may be useful in pathology research, for example, is estimating the evolution of an individual over discrete time slots. These issues are discussed mainly on the Single Case Research website, and have been described, for example, in one of the papers mentioned on this website, by Parker, Vannest and Davis. They propose four methods for analysing what they call non-overlap between sets of data, and I will just discuss one example of their proposals, which is the Tau index for non-overlap. Here you will find the code for generating the dataset that is displayed on the slide.
00:12:56
For example, let's say we've got a pretest and a posttest, with six observations during the pretest and seven observations during the posttest, and we've got some scores. The scores are displayed graphically here, and we would like to ask whether there is an increase in performance from the pretest session to the posttest session.
00:13:26
In order to apply what the authors, Parker, Vannest and Davis, call the Tau non-overlap test, we have to consider each pair of points: each pretest point has to be paired with each posttest point, and we will compute some numbers from these pairs.
00:13:56
We need to compute the total number of pairs over the whole study, so here it is 42, and the number of decreasing pairs within what the authors call the overlap zone. The overlap zone is defined by the two red lines on the graph: in our situation, the red lines are the maximum for phase one and the minimum for phase two. The question here is to ask whether there is an increase, and if we want to know whether there is an increase, we restrict some computations to the overlap zone, that is, we limit them to the maximum in phase one and the minimum in phase two. Some of the computations will only involve these points, the ones that are between the two lines, within the overlap zone.
00:15:00
So the numbers we want to compute are the total number of pairs over the whole study, plus the number of decreasing pairs within the overlap zone, and the number of stable pairs within the overlap zone. From all these numbers we can deduce the total number of increasing pairs, which is the total number of pairs minus the number of decreasing pairs minus the number of stable pairs. The value of the Tau non-overlap index, which is the parameter for the non-overlap test, is the number of increasing pairs minus the number of decreasing pairs, divided by the total number of pairs in the set.
00:15:53
As the authors mention, this is easy to compute by hand, but it's a little bit tricky to compute automatically. It is feasible, though, and I've put the R code for computing all these values in the slides; I won't spend too much time discussing the code, and again, it will be available.
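Since the slide code is not reproduced in this transcript, here is a minimal sketch of the same pairwise computation with made-up scores (six pretest, seven posttest), so the resulting value differs from the 83 percent obtained in the talk. Note that decreasing and tied pairs can only occur inside the overlap zone, so counting them over all pairs is equivalent to restricting the count to that zone:

```r
# Hypothetical pretest/posttest scores (not the data from the slides).
pre  <- c(3, 5, 4, 6, 4, 5)
post <- c(6, 7, 5, 8, 7, 9, 8)

# Compare every pretest point with every posttest point: 6 x 7 = 42 pairs.
d <- outer(post, pre, FUN = "-")   # positive = increase, negative = decrease

n_pairs      <- length(d)                        # 42
n_decreasing <- sum(d < 0)
n_stable     <- sum(d == 0)
n_increasing <- n_pairs - n_decreasing - n_stable

tau <- (n_increasing - n_decreasing) / n_pairs   # about 0.88 with these values
tau
```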
00:16:19
So we can use various methods to get all these numbers, and once we have them we can compute the value of the Tau non-overlap, which here is approximately 83 percent. We can say that there is approximately 83 percent non-overlap in these data, depending on the way the computation is performed, which is specific to the Tau non-overlap. Then, once we get this number, 83 percent, we need to estimate whether this value is significantly high enough to reject the hypothesis that there is no change between the two sets of data.
00:17:17
Just to compute the value, there is also another package, which is called SingleCaseES, so you can compute the Tau non-overlap using this package, but I think it's interesting to see how it is computed by hand.
00:17:36
In R you would use the Kendall package; after applying some modifications to the data, you can apply a Kendall correlation test on these data, and the Kendall correlation test gives you a p-value which tells you whether the Tau non-overlap is significant or not. Here, in this situation, the p-value is approximately .01, so comparing it to the value .05, for example, we may conclude that the Tau non-overlap value here is significant, and conclude that there is actually an increase in performance from the pretest to the posttest.
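A minimal way to run this significance test in base R, as a stand-in for the Kendall package mentioned here, is to correlate a phase indicator with the scores (same hypothetical vectors as in the previous sketch):

```r
pre  <- c(3, 5, 4, 6, 4, 5)
post <- c(6, 7, 5, 8, 7, 9, 8)

scores <- c(pre, post)
phase  <- rep(c(0, 1), c(length(pre), length(post)))  # 0 = pretest, 1 = posttest

# One-sided Kendall test for an increase; ties trigger a warning, but an
# approximate p-value is still returned.
cor.test(phase, scores, method = "kendall", alternative = "greater")
```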
00:18:36
Oh, sorry; now here is the inverse situation, in which we would ask the question whether there is a decrease in performance. If we ask this question, we have to compute the data a little bit differently, because if we want to test a decrease in performance, then rather than defining the overlap zone by the maximum point in the pretest and the minimum point in the posttest, we reverse the situation: we compute the overlap zone using the maximum point in the posttest and the minimum point in the pretest. Apart from this change, the situation would be similar: we would compute the total number of pairs, the number of increasing pairs (it's the reverse of the previous situation), the number of stable pairs, and the total number of decreasing pairs, from the total number of pairs and the specific numbers of increasing and stable pairs within the overlap zone. We would then get the non-overlap rate, which here would be 61.9 percent, and applying the Kendall correlation test on these data would tell us that this value wouldn't let us conclude that there is a decrease in performance, because the p-value is not lower than .05.
00:20:11
How much time do I have left? Sorry, I didn't look at the time when I started. Thirty minutes? Oh, so I've got plenty of time, but I'll try to make it sufficiently short so you can ask questions.
00:20:45
So, the two tools I've discussed here aimed at asking specific questions on very small datasets, using, in the second case, a non-parametric approach. Another basic approach that may be used with small datasets, but which is also useful for dealing with big datasets, is what we call resampling methods. Resampling methods are part of a more general approach which is called Monte Carlo simulation. Basically, resampling methods are associated with generating random data, but generating random data from the data that we've got: we are going to perform a resampling of the original data on a random basis.
00:21:56
Just to get back again to basics in statistics: when we want to analyse a set of data, we want to estimate the properties of a population, or to evaluate hypotheses, for example to compare the difference between two groups or between two situations. We want to estimate these properties in order to say something about the population, but the population, again, is not accessible; the only thing that we know is the sample. We have some information about the sample, but we don't have any clear information about the population, and so we have to make hypotheses about the population.
00:22:51
In specific applications this is very clear. For example, when we want to estimate a statistical parameter, say a central tendency or a dispersion value, or to estimate the correlation between data, and to compute what we call confidence intervals for these parameters, we have to make a hypothesis, a statement about how the population should behave, about what the distribution of the parameter in the population would be. It's the same thing in hypothesis testing: when we want to compare means using analysis of variance, mixed-effects modelling, or Student's t-tests, for example, we have to make some assumptions about the distributions of the parameters in the population.
00:23:50
These approaches, which make assumptions about the distribution, are also called asymptotic approaches, because they actually work better when the sample is larger: the larger the sample, the better the information you get will be at describing the population.
00:24:16
On the contrary, resampling methods are aimed at being independent of these assumptions, even though some parametric resampling methods are available. These resampling methods can be used for various tasks in statistics.
00:24:39
We can do what we call bootstrap: the bootstrap is usually used to do parameter estimation, so for example if we want to estimate the confidence interval for a parameter, we would use the bootstrap. But we can also use another way of doing resampling, which is called the permutation test; permutation tests are used for hypothesis testing. And we can also use these tools to do outlier detection; when I talk about outlier detection, it's partly related to the missing data that were discussed in the previous presentation.
00:25:27
It's also a very useful tool to attack what we call power computation, because power computation is a very important step of data analysis before data collection, and it's relatively rarely applied by scientists. Most current tools for statistical analysis nowadays, like mixed modelling for example, don't allow the computation of power from a basic formula, contrary to ANOVA for example. Resampling methods offer a very straightforward way to compute power, for instance when building mixed analysis models.
00:26:24
Finally, although I will only talk about a very small subset of the aspects I've mentioned here, resampling methods are also something that is interesting for starting to understand what is done in evaluation frameworks which are very different from current null-hypothesis testing approaches.
00:26:55
So, in asymptotic approaches we make assumptions about the underlying distribution: we have to know, or to estimate, a mathematical model of the underlying distribution to which we refer, and the sample that we are working on is viewed as one random exemplar that is drawn from the underlying population. If we want to compute the confidence interval for any parameter in this sample, we have to use a specific mathematical formula that takes into account what we are assuming about the population, for example that the distribution of the parameter in the population is Gaussian.
00:27:51
In resampling approaches we don't make any assumptions about this underlying distribution, and so the mathematical model of the underlying distribution is replaced with a computer simulation of the population, by generating a very large number of samples from the observed sample. The original sample is the source for what we call the bootstrap samples, which are random extracts, with replacement in the case that I'm talking about here, from the original sample. In this situation, we'll see that it's possible to compute a confidence interval without any assumption, and so we can compute a confidence interval for any parameter, even if we don't know a formula for this parameter.
00:28:53
So let's say we have a Gaussian-distributed variable, and we want to compute a confidence interval, for example a 95 percent confidence interval. We have to apply a formula: we compute the confidence interval as the mean plus or minus 1.96 times the standard deviation divided by the square root of the sample size. This is directly related to the normal distribution, so this formula is actually valid only for normally distributed variables; I will just leave the details aside here. We can compute this confidence interval in any software, and here, for a small set of data, I computed the confidence interval using this formula. If we increase the size of the sample, of course, we get better and better estimates of the confidence interval.
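For illustration, a sketch of this classical interval on a small, made-up sample:

```r
# 95% Gaussian confidence interval for a mean; valid only under normality.
x  <- c(23, 27, 21, 30, 25, 24)                      # hypothetical data
ci <- mean(x) + c(-1, 1) * 1.96 * sd(x) / sqrt(length(x))
ci
```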
00:29:58
But the problem with these confidence interval computations is that they rely on distributional assumptions, and they are specific to particular parameters: we know the formula for computing the confidence interval for a mean, but if we want to compute the confidence interval for a median, or for a correlation coefficient, then we have to know another formula, one adapted to that parameter.
00:30:27
Within the resampling, or bootstrap, framework, the bootstrap principle is that the bootstrap sample is to the sample what the sample is to the population. In general, in classical terms, we think of the sample as extracted from the population, and we don't know the population; in the bootstrap, we treat the sample as if it were the originating population: we have the sample, and we don't have anything else. We are then going to generate samples of this sample a very large number of times.
00:31:10
We can then use this principle to build a population of bootstrap samples, repeated a very high number of times, and this can be done for any parameter, as we'll see later.
00:31:25
For example, if we generate a set of original data, let's say this is our original sample, we can draw a single bootstrap sample using this approach in R: the sample function extracts random sets of data from the original data, and we say we want to extract exactly the same number of data points as we had in the original sample, but the same value can be picked several times. In this way we get another sample that originates from the original sample. On the left I've got my original sample, and on the right a bootstrap sample, the first one. Then we can do the same for another trial and get another bootstrap sample, and do the same again.
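A minimal sketch of drawing one bootstrap sample with the sample function (the data vector is hypothetical):

```r
set.seed(1)                               # only to make the example reproducible
original <- c(23, 27, 21, 30, 25, 24)     # hypothetical original sample

# Same size as the original, drawn with replacement, so the same value
# can be picked several times.
boot_sample <- sample(original, length(original), replace = TRUE)
boot_sample
```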
00:32:34
Actually, the general approach consists in looping over a very high number of replications, and for each replication we generate a bootstrap sample, compute the statistical parameter on this bootstrap sample, and store the result in a vector. In the end we get a full distribution of the computed parameters, one for each bootstrap sample, and we can estimate how these parameters are distributed. Here is some basic R code for computing this; I won't take too much time on it, we can discuss it later if needed. In the end, for example, here we've got four thousand replications of random bootstrap samples, and so four thousand estimates of the median, one on each of these bootstrap samples. We can compute and display a histogram, so the distribution of this parameter, and we can also simply estimate the confidence interval directly from the distribution, using the quantile function in R, which gives the set of values shown here.
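A sketch of that basic loop, with the same hypothetical vector: 4000 replications, the median computed on each bootstrap sample, and a 95 percent interval read directly from the distribution with quantile:

```r
original <- c(23, 27, 21, 30, 25, 24)     # hypothetical data, as above
R <- 4000
boot_medians <- numeric(R)
for (i in 1:R) {
  b <- sample(original, length(original), replace = TRUE)  # one bootstrap sample
  boot_medians[i] <- median(b)                             # store the parameter
}

hist(boot_medians)                         # bootstrap distribution of the median
quantile(boot_medians, c(0.025, 0.975))    # 95% percentile confidence interval
```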
00:34:11
There are some issues with the standard bootstrap, which have been addressed by several authors in the literature, and there is a package in R which lets one perform bootstrap approaches using methods that address these issues. So we don't need to generate the bootstrap ourselves, but we do have to know how it works in order to be aware of what it's going to do.
00:34:50
The package is the boot package in R. When using the boot package, we first have to define a home-made function which will perform the computation of the parameter in a specific way that is adapted to the package, and then the boot function will call this home-made function.
00:35:14
So, for example, let's say we want to compute the median for a set of data. We define a function with two arguments: the first argument is the object that will contain the data, which will be each bootstrap sample in turn, and the second argument is the index, that is, the indices of the data in the sample. The median is then simply applied to the subset of the data given by the indices inside the original sample; the indices correspond to the numbers of the observations in the original sample that make up a single bootstrap sample. We can verify what this function computes: for example, if we compute the median of a vector over all the observations of the vector, we should get exactly the same value as the one provided by median.
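A sketch of such a home-made function for the median, with a sanity check on a hypothetical vector:

```r
# First argument: the data; second argument: the indices describing one
# bootstrap sample within the original data.
boot_median <- function(data, indices) {
  median(data[indices])
}

vec <- c(23, 27, 21, 30, 25, 24)                  # hypothetical data vector
boot_median(vec, seq_along(vec)) == median(vec)   # TRUE: reproduces median()
```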
00:36:28
Then we can call the boot function, with the function we defined passed as an argument. The main arguments of the boot function are the original sample containing the data, the function that will perform the computation of the parameter, and the number of replications; this may be two thousand, four thousand, ten thousand, depending on the choices of the experimenter. The boot function returns a list object which contains various pieces of information, among which are the original value of the parameter in the original sample, each of the bootstrap values from each bootstrap sample, the number of replications, and the original data, all in a single object.
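A sketch of the call itself, reusing the vector and function defined above, with 4000 replications as one of the mentioned choices:

```r
library(boot)

vec <- c(23, 27, 21, 30, 25, 24)
boot_median <- function(data, indices) median(data[indices])

b <- boot(data = vec, statistic = boot_median, R = 4000)
b$t0        # value of the parameter in the original sample
head(b$t)   # bootstrap values, one per bootstrap sample
b$R         # number of replications
```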
00:37:28
It is then possible to use the library to compute corrected values for the confidence interval. For example, here I've requested, using the boot.ci function, a specific version of the bootstrap confidence interval, the BCa version, which stands for bias-corrected and accelerated. We get an object which again contains various pieces of information: the number of replications, the original parameter value, and the values for the confidence interval. We can extract these values for the confidence interval; they are partly different from what we observed in our basic computation, because they are corrected in order to solve some issues with naive bootstrap confidence intervals.
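A sketch of the corrected interval via boot.ci, requesting the BCa type on the object returned above:

```r
# BCa = bias-corrected and accelerated bootstrap confidence interval.
ci <- boot.ci(b, conf = 0.95, type = "bca")
ci
ci$bca      # the last two columns hold the corrected interval bounds
```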
00:38:36
So this was mainly an introduction to bootstrapping, and my aim was to show that there are methods that can be used to perform analyses without being too dependent on assumptions about the distributions. But bootstrap, and in general resampling approaches, may be useful for a very high number of different tasks; I've just listed some of these here. We've seen how to use it for parameter estimation, and I've also included something like a real-life example, on real data, with linear regression, so people can see that it's not only for means that we can use the bootstrap: it can be used for any parameter.
00:39:36
We may also use it for hypothesis testing, with what is called the permutation test; a sketch follows below. It is also very important in the case of deciding whether my sample is small or large, because sometimes it's difficult to decide a priori whether a given sample is small or not, and whether I'm going to be able to apply the tools that I want using a sample of this size.
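Here is a minimal sketch of a permutation test for a difference in means between two hypothetical groups (this is not code from the slides):

```r
g1 <- c(23, 27, 21, 30, 25, 24)            # hypothetical group data
g2 <- c(28, 31, 27, 33, 30, 29, 32)
obs_diff <- mean(g2) - mean(g1)

pooled <- c(g1, g2)
n1 <- length(g1)
R <- 4000
perm_diffs <- numeric(R)
for (i in 1:R) {
  shuffled <- sample(pooled)               # random relabelling of the groups
  perm_diffs[i] <- mean(shuffled[-(1:n1)]) - mean(shuffled[1:n1])
}

# Two-sided p-value: how often a relabelling is at least as extreme.
mean(abs(perm_diffs) >= abs(obs_diff))
```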
00:40:09
A very useful approach that may be performed with the bootstrap, as I suggested, is the computation of power before running a hypothesis test, because it is important, for example, to decide at first what sample size we need for estimating some comparison between two groups. It's also important, once we've decided that we need some size, to know whether the tools that we plan to apply are adapted to it. Using the bootstrap, you can combine these two questions, because you can bootstrap for a specific statistical tool, like linear mixed modelling for example.
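The talk applies this idea to mixed models; as a simpler sketch of the principle, here is power estimation by resampling pilot data at a candidate sample size and counting how often a planned t-test rejects (all numbers hypothetical):

```r
pilot_g1 <- c(23, 27, 21, 30, 25, 24)      # hypothetical pilot data
pilot_g2 <- c(28, 31, 27, 33, 30, 29, 32)
n_per_group <- 30                          # candidate sample size
R <- 2000
reject <- logical(R)
for (i in 1:R) {
  s1 <- sample(pilot_g1, n_per_group, replace = TRUE)
  s2 <- sample(pilot_g2, n_per_group, replace = TRUE)
  reject[i] <- t.test(s1, s2)$p.value < 0.05
}
mean(reject)                               # estimated power at this sample size
```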
00:41:03
These methods may also be used to estimate the specific contribution of some data points to statistical models. But most of them are computationally intensive, and may sometimes require access to computing servers in order to be run more efficiently. So, here I've put some conclusions, but I will stop now and leave room for questions if you wish. Thank you.
