Topics

- Stable distributions
- Asymptotics of factorials
- Poisson distribution
- Normal distribution
- Extreme value distributions

Further reading

- Many of the common probability distributions have special relationships between each other – see Distribution relationships.
- Experimentation and measurement by W. J. Youden (1961) is a classic text, and outlines an excellent way to approach problems of measurement.

There are many different distributions that can be useful in modelling, and you can explore more on your own (maybe even discover a new one ;). Many random processes that we want to understand in the real world are complicated with lots of variables and interactions, each of which needs to be accounted for in constructing a distribution theory for the outcomes.

But there are a few distribution families that seem to come up frequently – surprisingly frequently when you first think about it. They are the Poisson distribution, the Normal distribution, and the Extreme value distributions. These are *stable asymptotic distributions*. They are *asymptotic* because their formulas are calculated using the idea that the number of events is infinitely large (\(n \rightarrow \infty\)). They are *stable* in the sense that many different distributions of individual events converge to the same asymptotic distributions. Each of the three families appears under different circumstances. The Poisson distribution appears as we study the frequencies of superimposed independent rare events. The Normal distribution appears as we study the accumulation of large numbers of independent events. Extreme-value distributions appear when we study the extremes of large numbers of independent events.

In a modelling context, stable distributions have pros and cons. On one hand, it is very useful to have a few common and well-understood distributions that can be used to explain so many data sets with reasonably good accuracy. On the other hand, it is very difficult to extract additional information beyond parameter estimates from data sets that are well-fit by stable distributions; in some sense, these distributions have *forgotten* the nature of the original processes that created them.

Factorials \(n!\) are a major pain to calculate even when \(n\) is only modestly large. For example, \(25!\) has 26 digits, and the numbers just keep getting bigger as \(n\) grows. But in order to simplify our calculations, we need some way to work with factorials. Abraham de Moivre (1667 - 1754), a great neglected mathematician (in life and in death), discovered a very useful formula that allows us to approximate factorials using only multiplication and exponentiation.

\[n! \sim \sqrt{2 \pi n} \left( \frac{n}{e} \right)^n\]

We call this formula Stirling’s approximation (remember, I said de Moivre is forgotten – Stirling was one of de Moivre’s contemporaries), and it is an example of an *asymptotic approximation*. Two functions \(f(n)\) and \(g(n)\) are asymptotically equivalent (\(f \sim g\)) if and only if \[\lim _ {n\rightarrow
\infty} \frac{f(n)}{g(n)} = 1.\] Asymptotic equivalence is not the same as equality. For example, \[\begin{align*}
n &\sim n + 12,
\\
n^3 &\sim n (n+1) (n+2),
\\
1/n &\sim n/(n^2 + 1).
\end{align*}\] Asymptotic approximation methods are a deep and powerful field, invigorated by Poincaré, Stieltjes, and particularly Prandtl’s boundary-layer theory of aerodynamics. In applied mathematics, important contributions to the theory were made by Watson, Erdélyi, Nayfeh, and Kevorkian, among others. But for now, let’s just take Stirling’s approximation as-is and see what we can do with it.
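As a quick numerical sanity check (a sketch, not part of any derivation), we can watch the ratios of the three example pairs above approach 1 as \(n\) grows:

```python
# Ratios f(n)/g(n) for the three asymptotically equivalent pairs above;
# each ratio should approach 1 as n grows, even though f and g are never equal.
def ratios(n):
    return (n / (n + 12),
            n**3 / (n * (n + 1) * (n + 2)),
            (1 / n) / (n / (n**2 + 1)))

for n in (10, 1000, 100000):
    print(n, ratios(n))
```

Note that the first ratio is still far from 1 at \(n = 10\); asymptotic equivalence says nothing about small \(n\).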

Stirling’s formula is commonly derived using the “method of steepest descent”, a very useful trick for approximating some integrals that would otherwise have to be approximated numerically.

First, if you recall from differential equations, the Laplace transform of a monomial \(t^n\) is given by \[\begin{gather*} \mathscr{L}[t^n](s) = \int_{0}^{\infty} t^n e^{-st} dt = \frac{n!}{s^{n+1}}. \end{gather*}\] In the special case \(s = 1\), \[\begin{gather*} n! = \int_{0}^{\infty} t^n e^{-t} dt. \end{gather*}\] At first this seems like a rather silly formula, but it does two very useful things. First, note that the integral is defined not just for integer values of \(n\), but for all positive real numbers. Thus, we now have a way to make sense of \((1/2)!\). Second, we now have an algebraic expression we can approximate. In particular, by Taylor series expansion around the maximum of the exponent, \[\begin{align*} \int_{0}^{\infty} t^n e^{-t} dt &= \int_{0}^{\infty} e^{- n ( t/n - \ln t)} dt \\ & \approx \int_{0}^{\infty} e^{ n \log{\left (n \right )} - n - \frac{\left(t- n\right)^{2}}{2 n} } dt \\ & = e^{ n \log{\left (n \right )} - n} \int_{0}^{\infty} e^{-\frac{\left(t- n\right)^{2}}{2 n} } dt \\ & \approx \left( \frac{n}{e} \right)^n \int_{-\infty}^{\infty} e^{-\frac{\left(t- n\right)^{2}}{2 n} } dt \\ & = \sqrt{2 \pi n} \left( \frac{n}{e} \right)^n \end{align*}\] If this approximation feels a little weird to you, that’s probably good. It is not obvious that a local Taylor series approximation like this must give an accurate approximation across the whole integration interval, nor that the semi-infinite interval can safely be extended to a bi-infinite one. A more convincing argument for the reasonableness of these approximations can be given, but must be left aside for now.
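To make the claim about \((1/2)!\) concrete, here is a small sketch checking the integral numerically: a midpoint-rule estimate of \(\int_0^\infty t^{1/2} e^{-t}\, dt\) should match \(\Gamma(3/2) = \sqrt{\pi}/2\), since \(n! = \Gamma(n+1)\). The step size and truncation point are arbitrary choices for illustration.

```python
from math import gamma, pi, sqrt, exp

# Midpoint-rule estimate of the factorial integral at n = 1/2,
# truncated at t = 50 (the tail beyond is negligible).
dt = 0.001
integral = sum((k * dt + dt / 2) ** 0.5 * exp(-(k * dt + dt / 2)) * dt
               for k in range(int(50 / dt)))

# (1/2)! interpreted via the Gamma function: n! = Gamma(n + 1).
print(integral, gamma(1.5), sqrt(pi) / 2)
```

All three printed values agree to several digits, so the integral really does give a sensible meaning to \((1/2)!\).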

Suppose we are interested in the distribution of the number of new phone calls a receptionist receives in an hour. One approach to this is to break the hour up into 15 minute quarters. In each quarter-hour, there’s a chance of the phone ringing, and as long as each ring is a new caller acting independently of previous callers, we can model the distribution of calls as the binomial distribution where \(k\) is the number of calls and \(p _ {15}\) is the chance of a call in a quarter hour:

\[\frac{4!}{k! (4-k)!} \, p _ {15}^k(1-p _ {15})^{4-k}.\]

However, this is a bit artificial – since we only considered 4 calling periods in our hour, the model predicts the receptionist will never have to handle 5 calls in the hour. We can fix this by breaking the hour down into 12 smaller intervals of 5 minutes each. When we do this, though, the chance of a call arriving in each interval should be smaller as well. Perhaps \(p _ {5} \approx p _ {15}/3\). But 5-minute intervals are rather arbitrary as well. Why not 1-minute intervals, or even 1-second intervals?

But at the same time, as we do this, the numbers in the binomial formula become more and more difficult to work with. The binomial and hypergeometric distributions rely on factorials, and hence are difficult to work with when the number of events \(n\) is large – \(100!\) has 158 digits! Fortunately, the calculations can be simplified. Consider the binomial distribution’s probability generating function \[\hat{x}(s;p,n) = \sum _ {k=0}^{n} \frac{n!}{k! (n-k)!} \, p^k(1-p)^{n-k} s^k.\] When \(n\) is large but the probability of an event occurring is small, we can choose some positive constant \(\lambda\) such that \(p = \lambda/n\).

According to Stirling’s formula and Bernoulli’s limit, when \(n\) is large, \[\frac{n!}{(n-k)!} \sim n^k\] \[\left(1-\frac{\lambda}{n}\right)^{n} \sim e^{-\lambda}\] \[\frac{ (\lambda/n)^k }{(1-\lambda/n)^{k}} \sim (\lambda/n)^k\] And so, making the appropriate substitutions into the binomial distribution’s probability formula, we find the probability of observing \(k\) events, \[P(k) = \frac{ \lambda^k e^{-\lambda} }{k!}\] This is what we call the Poisson distribution. The Poisson distribution’s probability and cumulant generating functions are, respectively, \[\hat{x}(s; \lambda) \sim \sum _ {k=0}^{\infty} \frac{ \lambda^k e^{-\lambda} }{k!} s^{k} = e^{\lambda (s-1)}, \quad\text{where}\; \lambda = p n, p \approx 0, n \approx \infty\] \[\tilde{x}(t) = \lambda(e^t-1).\] Poisson rediscovered this formula after de Moivre’s work was forgotten, and then it was forgotten again until its successive rediscoveries and revivals by Bortkiewicz, Grinsted, Molina, Erlang, and probably others. With a little calculation, we can show that the parameter \(\lambda\) is both the mean (the expected number of impacts per unit time) AND the variance of a Poisson distribution.
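We can check this limit numerically with a short sketch (the values of \(\lambda\) and the \(n\) sequence are arbitrary choices for illustration): with \(p = \lambda/n\), the binomial probabilities approach the Poisson probabilities as \(n\) grows, and the Poisson mean and variance both come out equal to \(\lambda\).

```python
from math import comb, exp, factorial

lam = 3.0  # example rate parameter; any modest positive value works

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return lam**k * exp(-lam) / factorial(k)

# With p = lam/n, the binomial pmf converges to the Poisson pmf as n grows.
for n in (10, 100, 10000):
    print(n, max(abs(binom_pmf(k, n, lam / n) - poisson_pmf(k, lam))
                 for k in range(10)))

# lam is both the mean and the variance of the Poisson distribution.
mean = sum(k * poisson_pmf(k, lam) for k in range(100))
var = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in range(100))
print(mean, var)
```

The maximum pointwise discrepancy printed in the loop shrinks as \(n\) grows, which is the stability property in action.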

“Everybody believes in the exponential law of errors [i.e., the Normal distribution]: the experimenters, because they think it can be proved by mathematics; and the mathematicians, because they believe it has been established by observation.” - E. T. Whittaker and G. Robinson

The Normal distribution, also known as the Gaussian distribution, is famous for its many applications in statistical analysis. Like the Poisson distribution, it was first discovered by Abraham de Moivre. However, it did not rise to prominence until Carl Gauss showed the method of linear least squares was equivalent to assuming measurement errors were “normally” distributed. Gauss originally used it in accounting for errors in astronomical measurements. Its importance was further cemented by Ronald Fisher’s development of “analysis of variance”, a.k.a. ANOVA, methods at the beginning of the 20th century.

The normal distribution is described by the formula \[\frac{1}{\sigma\sqrt{2\pi} } \; e^{ -\frac{(x-\mu)^2}{2\sigma^2} }.\]
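As a quick sketch (with \(\mu\) and \(\sigma\) chosen arbitrarily for illustration), we can confirm numerically that this formula integrates to 1 and that \(\mu\) and \(\sigma\) really are its mean and standard deviation:

```python
from math import exp, pi, sqrt

mu, sigma = 2.0, 1.5   # example parameters for illustration

def normal_pdf(x):
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# Riemann-sum checks over [mu - 10 sigma, mu + 10 sigma]:
# total mass should be 1, the mean mu, and the standard deviation sigma.
dx = 0.001
xs = [mu - 10 * sigma + k * dx for k in range(int(20 * sigma / dx))]
total = sum(normal_pdf(x) * dx for x in xs)
mean = sum(x * normal_pdf(x) * dx for x in xs)
var = sum((x - mean) ** 2 * normal_pdf(x) * dx for x in xs)
print(total, mean, var ** 0.5)
```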

Like the Poisson distribution, the Gauss distribution is also **additive** - the sum of two Gauss variables is also a Gauss variable

\[\frac{1}{N} \sum_{i=1}^{N} X_i \sim \frac{1}{\sigma\sqrt{2\pi} } \; e^{ -\frac{(x-\mu)^2}{2\sigma^2} }\] where \(\mu\) is the expected value and \(\sigma\) is the standard deviation of the average.

In fact, more general versions of this property also hold. If we take any realistic sequence of independent identically-distributed random variables \(X _ i\) (one with finite mean and variance), then as the sequence grows, the distribution of the average of this sequence converges to the Normal distribution. This phenomenon is celebrated by the name **The Central limit theorem**, and is the reason we can call the Normal distribution a **stable** distribution – no matter what distribution the original variables have, their average converges to be Normally distributed!
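A minimal simulation sketch of the central limit theorem (the sample sizes here are arbitrary): averages of \(N\) uniform(0,1) variables, which have mean \(1/2\) and variance \(1/12\), should be approximately Normal with mean \(1/2\) and variance \(1/(12N)\).

```python
import random
random.seed(0)

# Averages of N uniform(0,1) variables should be approximately
# Normal(1/2, 1/(12 N)) by the central limit theorem.
N, runs = 50, 20000
means = [sum(random.random() for _ in range(N)) / N for _ in range(runs)]
m = sum(means) / runs
s2 = sum((x - m) ** 2 for x in means) / runs
print(m, s2, 1 / (12 * N))  # sample variance vs. theoretical 1/600

# For a Normal distribution, about 68.3% of the mass lies within
# one standard deviation of the mean.
sd = s2 ** 0.5
frac = sum(abs(x - m) < sd for x in means) / runs
print(frac)
```

The uniform distribution looks nothing like a bell curve, yet its averages already pass this crude normality check at \(N = 50\).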

One illustration of this stability is how the Normal distribution can also be derived from the Binomial distribution under the assumption of a large sample size, but fixed positive probability of success for each attempt. Note that this differs from our derivation of the Poisson distribution, where we assumed the chance of success with each flip decreased in inverse proportion to the sample size. This result was first discovered by de Moivre, but not widely appreciated until it appeared in Laplace’s widely read work.

Suppose we begin with the binomial distribution \[\begin{gather*} \frac{n!}{r!(n-r)!} p^r (1-p)^{n-r} \end{gather*}\] as \(n\) gets very large but when \(p\) is fixed between \(0\) and \(1\). From Stirling’s formula, we know

\[\begin{align*} n! &\sim \sqrt{2 \pi n } \left( \frac{n}{e} \right)^n \\ r! &\sim \sqrt{2 \pi r } \left( \frac{r}{e} \right)^r \\ (n-r)! &\sim \sqrt{2 \pi (n-r) } \left( \frac{(n-r)}{e} \right)^{(n-r)} \end{align*}\]

so

\[\begin{align*} \frac{n!}{r!(n-r)!} &\sim \frac{1}{\sqrt{2 \pi n (r/n) ( 1 - r/n)}} \frac{n^n}{r^r (n-r)^{n-r}} \\ &\sim \frac{1}{\sqrt{2 \pi n (r/n) ( 1 - r/n)}} \frac{n^n (n-r)^r}{r^r (n-r)^n} \\ &\sim \frac{1}{\sqrt{2 \pi n (r/n) ( 1 - r/n)}} (r/n)^{-r} (1-r/n)^{-(n-r)} \end{align*}\] Substituting back into the binomial and rearranging, \[\begin{gather*} \frac{1}{\sqrt{2 \pi n (r/n) ( 1 - r/n)}} (r/n)^{-r} (1-r/n)^{-(n-r)} p^r (1-p)^{n-r} \\ \frac{1}{\sqrt{2 \pi n (r/n) ( 1 - r/n)}} \left( \frac{np}{r} \right)^{r} \left( \frac{1-p}{1-r/n} \right)^{(n-r)} \end{gather*}\]

In this case the extreme values of \(r\) (all heads or all tails) will be very rare, while the most likely values of \(r\) will be those where heads comes up about as frequently as predicted by \(p\), \(r \approx n p\). Let’s make the substitution \(r = n p + z\), and now treat \(z\) as the variable of the distribution. \[\begin{gather*} \frac{1}{\sqrt{2 \pi n (p + z/n) ( 1 - p - z/n)}} \left( \frac{np}{np+z} \right)^{np+z} \left( \frac{1-p}{1-p -z/n} \right)^{(n-np-z)} \\ \frac{1}{\sqrt{2 \pi n (p + z/n) ( 1 - p - z/n)}} e^{ - (np+z)\ln \left( 1+ \frac{z}{np} \right) - (n-np-z) \ln \left( 1 - \frac{z}{n(1-p)} \right) } \end{gather*}\] Using the Maclaurin series \[\begin{gather*} \ln(1 + x ) = x - \frac{1}{2} x^2 + \frac{1}{3} x^3 - \ldots \end{gather*}\] we can expand the exponent in \(z\) to get \[\begin{gather*} \frac{1}{\sqrt{2 \pi n (p + z/n) ( 1 - p - z/n)}} e^{ - \frac{z^2}{2np(1-p)} + O\left(\frac{z^3}{n^2}\right) } \end{gather*}\] Discarding the higher-order correction terms, we obtain the Normal distribution \[\begin{gather*} \frac{1}{\sqrt{2 \pi n p ( 1 - p )}} e^{ - \frac{z^2}{2np(1-p)} } \end{gather*}\] Thus, the Binomial distribution is closely approximated by the Normal distribution when the number of samples is large and the probability of success \(p\) is not too close to \(0\) or \(1\). The same result can be obtained as an approximation to the hypergeometric distribution as well, because the act of replacement has very little effect on the full probability when \(n\) is very large.
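A quick numerical check of this approximation (with \(n\) and \(p\) chosen arbitrarily for illustration): near the peak \(r \approx np\), the exact binomial probabilities and the Normal formula should agree closely.

```python
from math import comb, exp, pi, sqrt

n, p = 1000, 0.3   # example sample size and success probability

def binom_pmf(r):
    # Exact binomial probability of r successes in n trials.
    return comb(n, r) * p**r * (1 - p) ** (n - r)

def normal_approx(r):
    # Normal density with mean np and variance np(1-p), evaluated at r.
    z = r - n * p
    return exp(-z * z / (2 * n * p * (1 - p))) / sqrt(2 * pi * n * p * (1 - p))

# Compare near the peak r = np = 300: the two should agree closely.
for r in (280, 300, 320):
    print(r, binom_pmf(r), normal_approx(r))
```

Farther out in the tails the relative error grows, which is why the approximation is only trusted when \(p\) is not too close to \(0\) or \(1\) and \(r\) is not too extreme.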

The extreme-value distribution actually takes one of three different forms, depending on the nature of the random events involved.

Example: when making thread, the strength depends on the weakest spot along the length, so it is the minimum of the strengths along the length.

While we don’t study them much in our traditional school curriculum, \(\min(x,y)\) and \(\max(x,y)\) are just as valid mathematical operations as addition and subtraction. Perhaps even more so, since one of the first signs of intelligence is being able to pick out bigger things from smaller things.

Suppose \(F(x)\) is the cumulative probability distribution of independent identically distributed random variables \(X _ i\) over the interval \([0,1]\), so \(P(X _ i \leq x) = F(x)\), and that for \(x \approx 0\), \(F(x) \approx a x^k\). Then \[\begin{align*} P\left( \min_{i=1 \ldots N} X_i \leq x \right) &= 1 - ( 1 - F(x) )^N \\ &= 1 - e^{N \ln ( 1 - F(x) )} \\ &\approx 1 - e^{N \ln ( 1 - a x^k )} \\ &\approx 1 - e^{- N a x^k } \end{align*}\] This is the CDF of a Weibull distribution. To see how our approximation does, we can perform some simulations and compare the observed distribution to our Weibull distribution predictions. For the minimum of a modest sequence of uniformly distributed samples, the approximation is pretty good, and it becomes more accurate as the sequences become longer.


```
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sat Feb 24 11:21:44 2018
@author: treluga
"""
from pylab import *

runs = 500
N = 10
# Sorted empirical minima of N uniform(0,1) samples.
a = array([min(rand(N)) for i in range(runs)])
a.sort()
# Weibull prediction 1 - exp(-N x); for uniform samples F(x) = x,
# so the tail coefficients are a = k = 1.
y = 1 - exp(-N*a)
b = linspace(1/runs, 1 - 1/runs, runs)
figure(1, figsize=(12, 6))
subplot(1, 2, 1)
plot(a, b, 'bx', label='Empirical simulation')
plot(a, y, 'r-', label='Predicted')
legend(framealpha=0.)
xlabel('Minimum value', fontsize=18)
ylabel('Cumulative probability', fontsize=18)
title('Min of %d calls to rand()' % N, fontsize=18)
xlim(0, .5)
ylim(0, 1)
subplot(1, 2, 2)
plot(y, b, 'b-', (0, 1), (0, 1), 'k-')
title('Probability plot', fontsize=18)
xlabel('Predicted', fontsize=18)
ylabel('Observed', fontsize=18)
tight_layout()
savefig('WeibullSim.png', transparent=True)
savefig('WeibullSim.pdf')
#show()
```

If instead \(F(x)\) has no absolute minimum value, but rather an exponential lower tail, \(F(x) \approx e^{\alpha + \beta x}\), \[\begin{align*} P\left( \min_{i=1 \ldots N} X_i \leq x \right) &= 1 - e^{N \ln ( 1 - F(x) )} \\ &\approx 1 - e^{- N e^{\alpha + \beta x} } \\ & = 1 - e^{- k e^{\beta x} } \end{align*}\] where \(k = N e^{\alpha}\). This is a version of the Gumbel distribution. If the cumulative distribution has fatter tails than exponential, the minimum behaves according to a Fréchet distribution (though this rarely appears in practice).

- Emil Gumbel, who wrote the first major book on extreme-value probability theory, was involved in the fight against the Nazis long before others foresaw the omens of World War II.


```
#!/usr/bin/env python3
"""
Created on Sat Feb 24 11:21:44 2018
@author: treluga
"""
from pylab import *
from scipy.linalg import lstsq

runs = 100
N = 40
# Sorted empirical minima of N standard-normal samples.
x = array([min(randn(N)) for i in range(runs)])
x.sort()
y = linspace(1/runs, 1 - 1/runs, runs)
# The Gumbel CDF 1 - exp(-exp(alpha + beta x)) is linearized by the
# transform log(-log(1-y)) = alpha + beta x, so alpha and beta can be
# estimated with linear least squares.
b = log(-log(1-y))
A = vstack([x**0, x**1]).T
(p_opt, residues, a_rank, svd_vals) = lstsq(A, b)
print(p_opt)
yopt = 1 - exp(-exp(p_opt[0] + x*p_opt[1]))
figure(1, figsize=(18, 6))
subplot(1, 3, 1)
plot(x, y, 'b.', label='Empirical simulation')
plot(x, yopt, 'r-', label='Predicted')
ylim(0, 1)
legend(framealpha=0., loc='upper left')
xlabel('Minimum value', fontsize=18)
ylabel('Cumulative probability', fontsize=18)
title('Min of %d calls to randn()' % N, fontsize=18)
subplot(1, 3, 2)
plot(x, log(-log(1-y)), 'b.')
xlabel('Minimum value', fontsize=18)
ylabel('Nonlinear data transform', fontsize=18)
subplot(1, 3, 3)
plot((0, 1), (0, 1), 'k-')
plot(yopt, y, 'b.')
title('Probability plot', fontsize=18)
xlabel('Predicted', fontsize=18)
ylabel('Observed', fontsize=18)
xlim(0, 1)
ylim(0, 1)
tight_layout()
savefig('gumbelSim.png', transparent=True)
savefig('gumbelSim.pdf')
#show()
```

- The generalized extreme-value distributions, which include the Weibull, Gumbel, and Fréchet families as special cases, have cumulative distribution function

\[1 - \exp \left\{-\left( 1 + s \frac{x - \mu}{\sigma}\right)^{1/s}\right\}\]

A final class of distributions is the Pareto distributions, also known as power-law distributions, which have probability density

\[\begin{gather*} P(x) = \frac{a \eta^a}{x^{a+1}}, \quad x \geq \eta. \end{gather*}\]
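A short sketch of working with this density (the shape \(a\) and scale \(\eta\) below are arbitrary example values): the corresponding CDF is \(1 - (\eta/x)^a\) for \(x \geq \eta\), so Pareto samples can be generated by inverse-transform sampling and checked against it.

```python
import random
random.seed(1)

a, eta = 2.5, 1.0   # example shape and scale parameters

# Inverse-transform sampling: the Pareto CDF is 1 - (eta/x)^a for x >= eta,
# so x = eta * (1 - u)^(-1/a) for uniform u in [0, 1).
samples = [eta * (1 - random.random()) ** (-1 / a) for _ in range(100000)]

# Empirical CDF at a few points versus the formula 1 - (eta/x)^a.
for x in (1.5, 2.0, 4.0):
    emp = sum(s <= x for s in samples) / len(samples)
    print(x, emp, 1 - (eta / x) ** a)
```

Note that every sample is at least \(\eta\), reflecting the density's restricted support.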

Scholars have constructed many different distributions over the years. Here, I have given examples of just a few of the most common. Some are constructed to model specific processes, like the hypergeometric and binomial distributions.

Physicists studying interacting particle systems often talk about “universality classes”. Universality classes are asymptotic limits of the behavior of certain interacting particle systems, and arose from the study of phase transitions of materials in statistical mechanics.

Brownian motion was observed by Ingenhousz (1785) in solutions of coal dust in alcohol and by Brown (1827).

Examples include:

- Phase transitions between solid and gas.
- The transition between magnetized and unmagnetized states.
- The surface of a growing crystal.
- The shape of a forest-fire front.

- Let’s use python to test how good Stirling’s formula is. For \(n\) from 1 to 12, plot \(n!\) with circles and Stirling’s approximation \(\sqrt{2 \pi n} \left(\frac{n}{e}\right)^n\) with x’s using arithmetic axes. What does this plot tell you about the accuracy of the approximation?

- For \(n\) from 1 to 12, plot \[1-\frac{\sqrt{2 \pi n} \left(\frac{n}{e}\right)^n}{n!}.\] Based on this plot, what can we hypothesize about the relative errors of Stirling’s formula for \(n>12\)?

- Use Stirling’s formula to find an asymptotic approximation for \(\dfrac{n!}{i!}\) when both \(n\) and \(i\) are large.

- Show that when the number of stones in an urn is taken to be large, but of fixed initial proportion, the hypergeometric distribution for drawing a small number of stones from the urn is approximately a binomial distribution, and hence approximately a Poisson distribution.

- Track down one of the data sets presented or cited by Gumbel in his classic book *Statistics of Extremes* and recover its original source.