7 Probability distributions

Exploratory data analysis differs from statistics. A statistical model for a measurement involves a description of the random nature of the measurement. To describe randomness, the language of probability is used. Much of this language is implemented in the Distributions package of Julia, loaded below with other packages:

using Distributions
using StatsBase, DataFrames
using CairoMakie, AlgebraOfGraphics
7.1 Probability
This section quickly reviews the basic concepts of probability.
Mathematically a probability is an assignment of numbers to a collection of events (sets) of a probability space. These values may be understood from a model or through long-term frequencies. For example, consider the tossing of a fair coin. By writing “fair” the assumption is implicitly made that each side (heads or tails) is equally likely to occur on a given toss. That is a mathematical assumption. This can be reaffirmed by tossing the coin many times and counting the frequency of a heads occurring. If the coin is fair, the expectation is that heads will occur in about half the tosses.
The mathematical model involves a formalism of sample spaces and events. There are some subtleties due to infinite sets, but we limit our use of events to subsets of finite or countably infinite sets or intervals of the real line. A probability measure is a function, P, assigning a number to each event and satisfying:

- The probability of the empty event is 0; the probability of the sample space is 1.
- If events A_1, A_2, … are disjoint, then the probability of their union is the sum of the individual probabilities.
A random variable, X, is a function assigning a number to each outcome in the sample space; probabilities of events such as {X ≤ a} are computed through the probability measure.
The rules of probability lead to a few well-used formulas, such as the complement rule, P(not A) = 1 − P(A), and the addition rule, P(A or B) = P(A) + P(B) − P(A and B).
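These rules can be checked numerically. A small sketch (the distribution and value of k are illustrative), using the identity relating the cdf and the complementary cdf:

```julia
using Distributions

# the complement rule, P(X > k) = 1 - P(X ≤ k), checked for a binomial
B = Binomial(10, 1/2)
k = 4
ccdf(B, k) ≈ 1 - cdf(B, k)   # true
```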
A distribution of a random variable is a description of the probabilities of events generated by a random variable. For our purposes, it is sufficient to describe events of the type {X ≤ a}.
There are two types of distributions for which a related function describes the distribution: discrete and continuous distributions.
A discrete random variable is one which has a finite or countably infinite set of possible values; its distribution is described by a function f(k) = P(X = k), the probability mass function.
A continuous random variable is described by a function f(x), the probability density function, for which P(a ≤ X ≤ b) is the area under f between a and b.
When defined, the pdf is the basic description of the distribution of a random variable. It says what is possible and how likely possible things are. For the two cases above, this is done differently. In the discrete case, the possible values are all k with f(k) > 0; in the continuous case, they are the values where the pdf is positive.
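Both descriptions can be illustrated with Distributions (the particular distributions below are arbitrary choices). For a discrete distribution, the pmf values over the support sum to 1; for a continuous one, areas under the pdf are recovered from the cdf:

```julia
using Distributions

# discrete case: the pmf values over the support sum to 1
B = Binomial(10, 1/4)
sum(pdf.(B, support(B)))       # ≈ 1.0

# continuous case: P(0.25 ≤ U ≤ 0.75) is an area under the pdf
U = Uniform(0, 1)
cdf(U, 0.75) - cdf(U, 0.25)    # 0.5
```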
A data set in statistics, x_1, x_2, …, x_n, is modeled as the realization of a sequence of random variables X_1, X_2, …, X_n.
The joint cumulative distribution is the probability P(X_1 ≤ a_1, X_2 ≤ a_2, …, X_n ≤ a_n).
Two assumptions are commonly made about the random variables:

- identically distributed, meaning each X_i has the same distribution.
- independent, which means that knowing the value of X_i does not affect the probabilities regarding values for X_j, j ≠ i. (If you know a first toss of a fair coin is heads, it doesn’t change the odds that the second toss will be heads.)
With these two assumptions, the random variables form what is called an iid (independent and identically distributed) random sample.
In general, the distribution of a random variable may be hard to describe. For example, suppose the random variable is the sample median, the middle value of X_1, X_2, …, X_n; its distribution depends on the joint distribution of all n variables and is rarely available in a simple closed form.
When a distribution cannot be fully identified, it can still be described. A few basic summaries of a probability distribution are:
- The mean or average value. For a discrete distribution this is the sum of k·f(k) over the possible values, which is a weighted average of the possible values, weighted by how likely they are. For a continuous distribution, a similar formula using calculus concepts applies. Both can be viewed from a center of mass perspective. The symbol μ or E(X) (expectation) is used to represent the mean.¹
- The variance of a probability distribution describes the spread; the standard deviation is the square root of the variance. For a random variable, the variance is described by the average value of the centered random variable, squared: E((X − μ)²). The symbol σ² is used to represent the variance, then σ is the standard deviation. VAR(X) is also used to represent the variance of a random variable; similarly, SD(X) is used for the standard deviation.
The transformation X − μ centers the random variable, giving it mean 0. The transformation (X − μ)/σ standardizes it, giving it mean 0 and standard deviation 1.
For a random sample X_1, X_2, …, X_n, the mean of the sum is the sum of the means: E(X_1 + ⋯ + X_n) = E(X_1) + ⋯ + E(X_n).

While for independent random variables the variance of a sum is the sum of the variances, in general VAR(X + Y) = VAR(X) + VAR(Y) + 2·COV(X, Y),

where the covariance is COV(X, Y) = E((X − μ_X)(Y − μ_Y)). For independent random variables the covariance is 0.
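The variance formula can be checked with sample quantities, for which the identity holds exactly (the two distributions chosen are arbitrary):

```julia
using Distributions, Statistics, Random

# for sample variances and covariances the identity
# var(x + y) = var(x) + var(y) + 2cov(x, y) is exact
Random.seed!(1234)
N = 100_000
x = rand(Normal(0, 2), N)
y = rand(Uniform(0, 1), N)
var(x .+ y), var(x) + var(y) + 2cov(x, y)  # equal; the covariance term is near 0
```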
7.1.1 Statistical language
In statistics, we have seen, a random sample is a sequence of random variables X_1, X_2, …, X_n that are independent and identically distributed.
The population is the common distribution of each random variable in the random sample. Populations have a pdf and cdf, often denoted f and F.
A statistic is some summary of a random sample. For example, the median, or middle value, or the sample mean, X̄ = (X_1 + X_2 + ⋯ + X_n)/n.
For the sample mean from an iid random sample, the above says E(X̄) = μ and VAR(X̄) = σ²/n.
The standard deviation then is σ/√n.
In short, the expected value of the mean is the expected value of the population; the variance of the mean (of an iid random sample) is the variance of the population divided by the sample size, n.
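A simulation sketch of this fact (the population and sample sizes below are illustrative choices):

```julia
using Distributions, Statistics, Random

# the standard deviation of many simulated sample means is close to σ/√n
Random.seed!(1234)
D, n, N = Exponential(1), 25, 10_000
xbars = [mean(rand(D, n)) for _ in 1:N]
std(xbars), std(D) / sqrt(n)   # the two values should be close
```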
There are parallels between random variables and data sets:

- The random sample X_1, X_2, …, X_n is realized in a data set by values x_1, x_2, …, x_n.
- The distribution, F, is reflected in the data set.
- The distribution, F, is typically described by key parameters.
- The data set is typically summarized by sample statistics.
Statistical inference makes statements using the language of probability about the parameters in terms of various sample statistics.
An intuitive example is the tossing of a fair coin, modeling heads by a 1 and tails by a 0, a Bernoulli random variable:
- A given data set is not random, but it may be viewed as the result of a random process, and had that process been run again it would likely result in a different outcome. These different outcomes may be described probabilistically in terms of a distribution.
- If the sample size, n, is large enough, the sample proportion should be close to the population proportion, p.
- Were the sampling repeated, the variation in the values of the sample proportion should be smaller for larger sample sizes, n.
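The bullet points above can be sketched with a small simulation (sample sizes and repetition counts are arbitrary):

```julia
using Distributions, Statistics, Random

# the spread of simulated sample proportions shrinks as n grows
Random.seed!(1234)
p = 1/2
phat(n) = mean(rand(Bernoulli(p), n))   # one simulated sample proportion
spreads = [std([phat(n) for _ in 1:1000]) for n in (10, 100, 1000)]
# spreads is decreasing, roughly like √(p(1-p)/n)
```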
These imprecise statements can be made more precise, as will be seen. Figure 7.2 illustrates a population and several different samples from it, with the sample means visualized by a density plot. The density plot is centered at the same location as the population, but is narrower, reflecting smaller variability.
Inferential statistics is similar: a population is a distribution summarized by parameters; a random sample drawn from a population is summarized by statistics. A statistic summarizes just one sample, but the language of probability is used to infer from that one sample statements about the parameters which would be apparent were there many different random samples.
7.2 The Distributions package
While populations are in general described by a cdf, populations which are discrete or continuous are more naturally described by their pdf. There are many standard pdfs used to describe different types of randomness, supported in the Distributions package. This section reviews several.
using Distributions
7.2.1 Bernoulli, Binomial, Geometric
The simplest non-trivial distribution is that used to model a coin toss: P(X = 1) = p and P(X = 0) = 1 − p.
In the Distributions package, named distributions are given a type, in this case Bernoulli, which requires a value of p to be specified:

p = 0.25
Be = Bernoulli(p)
mean(Be)

0.25
The Bernoulli(p) distribution is very simple, but the sum of n independent Bernoulli(p) random variables gives the Binomial distribution. It has two parameters, n and p, and is modeled by the Binomial(n,p) type. There are several summary statistics for the types supported by Distributions, including mean, median, and std:

n, p = 10, 1/2
B = Binomial(n, p)
mean(B), median(B), std(B)

(5.0, 5, 1.5811388300841898)
The mean and standard deviation for the Binomial can be easily computed when this statistic is viewed as a sum of 10 Bernoulli random variables (the mean is n·p = 5 and the standard deviation is √(n·p·(1−p)) ≈ 1.58).
More is known. For example, the value of this random variable can only be between 0 and 10. The extrema function combines both the minimum and maximum functions to give this:
extrema(B)
(0, 10)
The insupport function is similar, returning true or false depending on whether a value, x, is in the support of the distribution, that is, if x is a possible value:
insupport(B, 5), insupport(B, 15), insupport(B, 5.5)
(true, false, false)
We see that, of these, only 5 is a possible value. Its probability is computed by pdf:
pdf(B, 5)
0.24609375000000022
Figure 7.3 shows different examples of the binomial distribution. When both n·p and n·(1−p) are large, the distribution appears symmetric and bell shaped.
The cumulative distribution function is computed by cdf with the same calling style, cdf(D, x). The complementary cumulative distribution function, 1 - F(x), is computed by ccdf(D, x).
The inverse of the cumulative distribution function is the quantile function.² Quantiles are computed with the calling style quantile(D, p).
A related question is the number of coin tosses needed to get the first heads. If each toss independently lands heads with probability p, the number of failures before the first success is modeled by the Geometric(p) distribution.
For example, we can see here the mean and standard deviation, and also that the number is unbounded:

p = 1/2
G = Geometric(p)
mean(G), std(G), extrema(G)  # (1-p)/p, √(1-p)/p, support 0,1,...,∞

(1.0, 1.4142135623730951, (0, Inf))
7.2.2 Uniform and DiscreteNonParametric distributions
A discrete uniform random variable on a, a+1, …, b assigns equal probability to each of these values; the DiscreteUniform(a,b) type models this distribution. For example, to model the roll of a 6-sided die, we might have:
D = DiscreteUniform(1, 6)
mean(D), std(D)

(3.5, 1.707825127659933)
More generally, a distribution which assigns weight p_k to value x_k is modeled by the DiscreteNonParametric(xs, ps) distribution. For example, Benford’s “law” is an observation that the first non-zero digit of many data sets follows a certain pattern. For this the possible values are 1 through 9, with probabilities log10(k+1) − log10(k):
xs = 1:9
ps = log10.(xs .+ 1) - log10.(xs)
B = DiscreteNonParametric(xs, ps)
DiscreteNonParametric{Int64, Float64, UnitRange{Int64}, Vector{Float64}}(
support: 1:9
p: [0.3010299956639812, 0.17609125905568124, 0.12493873660829996, 0.09691001300805646, 0.07918124604762478, 0.06694678963061318, 0.057991946977686726, 0.05115252244738133, 0.04575749056067513]
)
We can answer questions like the mean, the standard deviation, and the probability the number is 5 or less with:
mean(B), std(B), cdf(B, 5)
(3.4402369671232065, 2.4609982997506656, 0.7781512503836436)
a = 5
ks = support(B)
ps = pdf.(B, ks)
as = ifelse.(ks .<= a, "X ≤ $a", "X > $a")
layers = visual(Rangebars) + visual(Scatter)
p = data((; ks, ps, as)) * layers * mapping(:ks, :ps, :ps => zero, color=:as)
draw(p)
In Distributions the Categorical type could also have been used to construct this distribution, it being a special case of DiscreteNonParametric with the xs being 1, 2, …, k.
The multinomial distribution is the distribution of counts for a sequence of n independent draws from a Categorical distribution. This generalizes the binomial distribution. Let p_1, …, p_k be the category probabilities; the vector of counts is modeled by the Multinomial(n, p) type.
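A small sketch of the Multinomial type (the values of n and the probabilities are illustrative):

```julia
using Distributions

# counts for 10 draws across 3 categories
n, ps = 10, [0.2, 0.3, 0.5]
M = Multinomial(n, ps)
mean(M)        # expected counts: n .* ps = [2.0, 3.0, 5.0]
sum(rand(M))   # a draw is a vector of counts, which always sums to n
```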
7.2.3 The continuous uniform distribution
The continuous uniform distribution models equally likely outcomes over an interval [a, b]. The rand function for Float64 values returns random numbers which are modeled as being uniform over [0, 1); 0 is a possible value but 1 is not. (This asymmetry is a reality of how computers and mathematical models aren’t always exactly the same, but rather the model is an abstraction of the implementation.)
The Uniform(a,b) type models these numbers. Here we see rand being used to randomly sample 3 values; this method extends the rand function in base Julia:
U = Uniform(0, 1)
rand(U, 3)
3-element Vector{Float64}:
0.902347057616552
0.23833132872089147
0.6566070633905712
If U is uniform over [0, 1], then a*U + b is uniform over [b, a + b], and the mean and standard deviation transform accordingly:

a, b = 2, 3
Y = a * U + b
q1 = mean(U) == mean(Y) - (b + mean(U)) == (1 + 0)/2
q2 = std(U) == std(Y)/a == sqrt(1/12)
q1, q2
(true, true)
7.2.4 The normal distribution
The most important distribution in statistics is called the normal distribution. This is a bell-shaped distribution completely described by its mean and standard deviation, modeled by the Normal(mu, sigma) type (though some math books use the variance, σ², for the second parameter).
Z = Normal(0, 1)
mean(Z), std(Z)
(0.0, 1.0)
There are many facts about standard normals that are useful to know. First we have the three rules of thumb: about 68%, 95%, and 99.7% of the area lies within 1, 2, and 3 standard deviations of the mean. These areas can be computed using cdf with:
between(Z, a, b) = cdf(Z,b) - cdf(Z,a)
between(Z, -1, 1), between(Z, -2, 2), between(Z, -3, 3)
(0.6826894921370861, 0.9544997361036416, 0.9973002039367398)
These values are important, as these rules of thumb are widely used to describe normally distributed data.
The transformation μ + σ·Z takes a standard normal Z to a normal with mean μ and standard deviation σ, modeled by Normal(mu, sigma).
A common question in introductory statistics is what values a < b contain a central area of 1 − α. This can be answered with quantile and the observation that, since the pdf of Z is symmetric about 0, the two values are symmetric about the mean. From b = quantile(Y, 1 − α/2), the value a = μ − (b − μ) follows.
For example:
alpha = 0.05
mu, sigma = 10, 20
Y = Normal(mu, sigma)
b = quantile(Y, 1 - alpha + alpha/2)
a = mu - (b - mu)
a, b, between(Y, a, b)
(-29.19927969080115, 49.19927969080115, 0.9500000000000004)
An observation of De Moivre in 1733 about the binomial proved foundational. Suppose we consider the binomial distribution with a large value of n: it is well approximated by the normal distribution with μ = np and σ = √(np(1−p)).
For example, the following searches for the largest discrepancy in the cdfs:
n, p = 100, 1/3
mu, sigma = n*p, sqrt(n*p*(1-p))
B, Y = Binomial(n, p), Normal(mu, sigma)
ks = 0:100
findmax(abs(cdf(B, k) - cdf(Y, k)) for k in ks)
(0.046989293624614625, 34)
This shows an error of not more than about 0.047. A continuity correction, comparing the binomial cdf at k with the normal cdf at k + 1/2, reduces the error by roughly a factor of 10:
findmax(abs(cdf(B, k) - cdf(Y, k+1/2)) for k in ks)
(0.004701503073942237, 34)
Of course, computationally it is often just as easy to call cdf(B, k) as it is to call cdf(Y, k + 1/2), so the advantage here is theoretical for computationally tractable values of n.
7.2.5 The Chi-squared distribution
Let Z_1, Z_2, …, Z_n be independent standard normal random variables. The distribution of the sum Z_1² + Z_2² + ⋯ + Z_n² is called the Chi-squared distribution with n degrees of freedom and is modeled by the Chisq(n) type.
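This characterization can be sketched with a simulation (the degrees of freedom and repetition count are arbitrary):

```julia
using Distributions, Statistics, Random

# sums of n squared standard normals resemble Chisq(n)
Random.seed!(1234)
n, N = 4, 10_000
samples = [sum(abs2, rand(Normal(0, 1), n)) for _ in 1:N]
mean(samples), mean(Chisq(n))   # both near n = 4
```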
7.2.6 The T and F distributions
There are two main distributions which arise in the distribution of many sample statistics related to the linear regression model; these are the T and F distributions.

Consider two independent random variables, a standard normal Z and a Chi-squared χ²_n with n degrees of freedom.

The ratio Z / √(χ²_n / n) has the T-distribution with n degrees of freedom. This distribution is bell shaped, but has longer tails than the standard normal, as large values occur more frequently:
T5 = TDist(5)
maximum(abs, rand(T5, 100)), maximum(abs, rand(Z, 100))
(4.890560051290039, 2.903111699090995)
Figure 7.5 shows a quantile-normal plot of samples from these and other distributions.
The skewness method measures asymmetry. For both the T and normal distributions the skewness is 0. The kurtosis function measures excess kurtosis, a measure of the size of the tails as compared to the normal. We can see:
skewness(T5), skewness(Z), kurtosis(T5), kurtosis(Z)
(0.0, 0.0, 6.0, 0.0)
A distribution with excess kurtosis exceeding the normal is called leptokurtic; such distributions more frequently produce values far from the mean than the normal. One with less is called platykurtic. There are predicate functions for these properties in Distributions:
isleptokurtic(T5), isplatykurtic(U)
(true, true)
For the T-distribution, as the degrees of freedom increase, the distribution gets closer to the standard normal:
T100 = TDist(100)
xs = range(-3, 3, length=100)
findmax(abs(cdf(T100, x) - cdf(Z, x)) for x in xs)
(0.0015794482887208222, 25)
In Figure 7.5 the lower-left graphic is of the uniform distribution; the short tails lead to a deviation from a straight line in this graphic. The lower-right graphic is of the skewed exponential distribution; the skew shows up in the lack of symmetry about the fitted line.
Consider now two independent Chi-squared random variables with m and n degrees of freedom. The ratio (χ²_m / m) / (χ²_n / n)

has a known distribution called the F distribution with m and n degrees of freedom, modeled by the FDist(m, n) type.
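A simulation sketch of this ratio (the degrees of freedom are illustrative):

```julia
using Distributions, Statistics, Random

# ratios of scaled, independent Chi-squared variables resemble FDist(m, n)
Random.seed!(1234)
m, n, N = 5, 10, 10_000
fs = [(rand(Chisq(m))/m) / (rand(Chisq(n))/n) for _ in 1:N]
median(fs), median(FDist(m, n))   # the two medians should be close
```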
This distribution has different shapes for different parameter values (Figure 7.6), with the distributions becoming more normal as the two parameters get large.
kurtosis(FDist(5, 10)), kurtosis(FDist(5,100)), kurtosis(FDist(100,100))
(50.86153846153846, 3.1474470779483217, 0.7278883188966445)
7.3 Sampling distributions
The normal distribution, along with distributions derived from it such as the T, F, and Chi-squared distributions, describes the sampling distribution of many common statistics.
7.3.1 The sample mean
For an iid random sample from a population with mean μ and standard deviation σ, the sample mean X̄ has expected value μ and standard deviation σ/√n.
What about the distribution itself? This depends on the underlying population.
If the underlying population is normal, then X̄ is also normally distributed, with mean μ and standard deviation σ/√n.
Central limit theorem
De Moivre’s observation that the binomial distribution is asymptotically normal is even more general: the central limit theorem informs us that for large enough n the standardized sample mean, (X̄ − μ)/(σ/√n), is approximately standard normal,

where μ and σ are the mean and standard deviation of the population.
The central limit theorem has some assumptions on the underlying population. The assumption that a mean and standard deviation exist is sufficient for application. It is true exactly for normal populations; for non-normal populations it holds asymptotically. For a binomial population, a rule of thumb is that both np and n(1 − p) should be 5 or more.
In general, how large n must be depends on the population: the more skewed or heavy-tailed the population, the larger the n needed.
Figure 7.7 shows simulations of the distribution of the standardized sample mean for several populations. The simulations use the rand method for a distribution type, with rand(D, n) returning a random sample of size n from the distribution D:
xbar(D, n, N=200) = [mean(rand(D, n)) for _ in 1:N]
sxbar(D, n, N=200) = (xbar(D,n,N) .- mean(D)) / (std(D)/sqrt(n))
Ds = (N = Normal(0,1),
      normal = Normal(0,1),
      leptokurtic = TDist(3),
      skewed = FDist(2,5),
      platykurtic = Uniform(0,1))

probs = range(0.01, 0.99, 200)
m = DataFrame([quantile.((sxbar(D,10),), probs) for D in Ds], collect(keys(Ds)))
d = stack(m, Not(:N)) # need to use QQPlot
p = (data(d) * mapping(:N, :value, layout=:variable => nonnumeric) *
     visual(QQPlot, qqline=:fit))
draw(p)
The T-statistic
The central limit theorem requires standardizing by the population standard deviation. At times this is neither known nor assumed to be known. However, the standard error may be used:
The standard error of a statistic is the standard deviation of the statistic or an estimate of the standard deviation.
For the sample mean, the standard error is s/√n, where s is the sample standard deviation.
The scaling in the central limit theorem can be redone using the standard error, giving the statistic T = (X̄ − μ)/(s/√n),

where s is the sample standard deviation computed from the same sample.
The variability in the sampling means that the standard error will occasionally be much smaller than the population standard deviation; as this appears in the denominator, the random variable T has longer tails than the standard normal.
For a normal population, the sampling distribution of T = (X̄ − μ)/(s/√n) is the T-distribution with n − 1 degrees of freedom.
In general, a T-statistic has the form (estimate − parameter) / (standard error of the estimate).
7.3.2 The sample variance
For an iid random sample from a normal distribution, the distribution of the sample variance, S², is known.
If the sample is an iid random sample of normals and were the population mean known, the sum of the squared, standardized deviations would have a Chi-squared distribution with n degrees of freedom; using the sample mean in place of the population mean costs one degree of freedom.
For an iid random sample from a normal population, (n − 1)S²/σ²

has a Chi-squared distribution with n − 1 degrees of freedom.
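A simulation sketch of this fact (the sample size and population parameters are illustrative):

```julia
using Distributions, Statistics, Random

# (n - 1)S²/σ² for normal samples resembles Chisq(n - 1)
Random.seed!(1234)
n, mu, sigma, N = 10, 5, 2, 10_000
stat() = (n - 1) * var(rand(Normal(mu, sigma), n)) / sigma^2
samples = [stat() for _ in 1:N]
mean(samples), mean(Chisq(n - 1))   # both near n - 1 = 9
```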
The statistic (n − 1)S²/σ² is the basis for making inferences about the population variance.
When there are two independent samples, there are two sample means and standard deviations. The ratio of the sample standard deviations gives insight into the question of whether the two populations have the same spread. The F-statistic is based on this ratio.
If the two populations are normal with standard deviations σ₁ and σ₂, the ratio (S₁²/σ₁²)/(S₂²/σ₂²) has an F-distribution with n₁ − 1 and n₂ − 1 degrees of freedom.
A special case is the F-distribution with 1 and n degrees of freedom. That distribution describes the square of a T-distributed random variable with n degrees of freedom.
The distribution of the T and F statistics for non-normal populations can be explored with simulation. In the following we use a long-tailed population (the T-distribution with 3 degrees of freedom) and a skewed one (the exponential distribution):
T_n(D, n) = (xs = rand(D,n); (mean(xs) - mean(D)) / (std(xs) / sqrt(n)))
F_n(D, n1, n2) = (xs = rand(D,n1); ys = rand(D, n2); (std(xs)/(n1-1)) / (std(ys)/(n2-1)))
n = 10
n1, n2 = 10, 5
N = 100

ps = range(0.01, 0.99, length=N)
SDs = (T=TDist(n-1), F=FDist(n1-1, n2-1))
Pops = (var"long-tailed"=TDist(3), skewed=Exponential(1))

# use scatter to produce qqplot
df = DataFrame([(D="$SDk $Popk", x=quantile(Popv, ps), y = quantile(rand(SDv, N), ps))
                for (SDk,SDv) in pairs(SDs) for (Popk, Popv) in pairs(Pops)])
d = data(flatten(df, [:x,:y]))
p = d * (visual(Scatter) + linear(;interval=nothing)) *
    mapping(:x, :y, layout=:D, color=:D)
draw(p, facet=(; linkxaxes=:none, linkyaxes=:none))
7.3.3 The sample median
The sample median is the “middle” of an iid random sample. Like the sample mean, it is a measure of center that, unlike the sample mean, is resistant to outliers. The central limit theorem instructs us that for most populations the distribution of the sample mean is approximately normal. The distribution of the sample median can also be computed: it is asymptotically normal with mean the population median, m, and standard deviation 1/(2·f(m)·√n), where f is the population pdf.
Figure 7.9 shows through boxplots that the variability of the sample median and sample mean depends on the population. For the standard normal population, the ratio of the variances of the sample median to the sample mean is, asymptotically, π/2 ≈ 1.57.
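This variance ratio can be sketched by simulation (the sample size and repetition count are arbitrary):

```julia
using Distributions, Statistics, Random

# for a normal population the sample median varies more than the sample
# mean; the variance ratio is close to π/2 ≈ 1.57 for large n
Random.seed!(1234)
n, N = 100, 10_000
stats = [(x = rand(Normal(0, 1), n); (median(x), mean(x))) for _ in 1:N]
var(first.(stats)) / var(last.(stats))   # roughly 1.57
```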
7.3.4 The Chi-squared statistic
For the multinomial distribution with a fixed number of draws, n, the counts in each category are random; the Chi-squared statistic summarizes the discrepancy between the observed counts and the expected counts.
For this scenario, the asymptotic distribution of the statistic is Chi-squared with k − 1 degrees of freedom.
Suppose there are k categories with observed counts N_1, …, N_k and expected counts n·p_1, …, n·p_k; the statistic is the sum over the categories of (observed − expected)² / expected.
The Chi-squared sampling distribution is asymptotic, meaning it applies for large n; a common rule of thumb is that each expected count should be 5 or more.
A related statistic is the Cressie-Read power-divergence statistic (Cressie and Read 1984), parameterized by a real number λ.
When λ = 1 the statistic is Pearson’s Chi-squared statistic; other choices of λ give related statistics, such as the likelihood-ratio statistic as λ → 0.
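A hypothetical worked example of the λ = 1 (Pearson) case, with made-up counts and probabilities:

```julia
using Distributions

# Pearson's Chi-squared statistic for made-up counts and probabilities
observed = [22, 41, 37]
ps = [0.25, 0.40, 0.35]
expected = sum(observed) .* ps              # [25.0, 40.0, 35.0]
X2 = sum((observed .- expected).^2 ./ expected)
X2, ccdf(Chisq(length(ps) - 1), X2)         # statistic and asymptotic p-value
```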
1. Notationally, E(X) is used for the mean of a random variable and μ is used for the mean of a probability distribution, though we use them interchangeably.↩︎
2. This definition of a quantile is for a continuous distribution; when the cumulative distribution is not strictly monotonic, the definition may be the infimum over all x for which F(x) ≥ p.↩︎