Horse kicks and Height




Application of the binomial

For the Gombaud-Pascal problem, a double-six may not appear at all in 24 throws, or it may appear once, twice, or even as many as 24 times. We will probably never observe 24 double-sixes in a row, although we will often observe none at all. Here is a simulation that records how often each outcome occurs over many repeated experiments and compares those frequencies with the predictions of the binomial distribution.

[Show code]
from numpy import array
from numpy.random import rand
from scipy.stats import binom
import matplotlib.pyplot as mpl

"""
Simulate a number of independent Bernoulli trials
and bin them to generate a histogram that should
be closely approximated by the Binomial distribution.
"""

numdraws = 24
p = 1./36

def run():
    # one experiment: 24 throws, count how many come up double-six
    outcome = [ (p > rand()) for i in range(numdraws) ]
    return outcome.count(True)

#numreps = int(1e6)
numreps = int(1e4)

A = [ run() for i in range(numreps) ]
binbnds = array(range(numdraws+2)) - 0.5
n, bins, patches = mpl.hist(A, binbnds, density=True, facecolor='green', alpha=0.5,
                            label='Observed Histogram')
x = binbnds + 0.5
y = [ binom.pmf(i, numdraws, p) for i in x ]
error = sum(abs(array(n) - y[0:-1]))
print(n)
print(y)
mpl.plot(x, y, 'bo-', label='Binomial Prediction')
mpl.legend(loc='upper right', framealpha=0.)
mpl.xlim(-0.5, numdraws+0.5)
mpl.title('Experimental test of the Binomial measure', fontsize=20)
mpl.xlabel('Number of successes', fontsize=18)
mpl.ylabel('Frequency of occurrence', fontsize=18)
mpl.text(12, .23, "Probability $p=1/36$", fontsize=18)
mpl.text(12, .2, "Draws $n=%d$" % numdraws, fontsize=18)
mpl.text(12, .17, "%d Replications" % numreps, fontsize=18)
mpl.text(12, .14, "%.1f %% error" % (100 * error), fontsize=18)
mpl.savefig('binomial_experiment.png', transparent=True)
mpl.savefig('binomial_experiment.pdf')
Experiment showing difference from binomial prediction

The binomial distribution’s predictions match the results of our simulation runs quite closely. The advantage of using the binomial distribution is that it is much faster to calculate, and its formulas can be incorporated more efficiently into subsequent calculations. But note that this simulation experiment is not the same as testing Gombaud’s problem in the physical world. There are many possible complications that we did not account for in writing our simulation (are the dice imperfectly weighted, are the dice thrown well, …) that could affect the outcomes.
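
For instance, the binomial form answers Gombaud's original question, the chance of at least one double-six in 24 throws, in one line instead of thousands of simulated games. A minimal sketch:

[Show code]
from scipy.stats import binom

# chance of at least one double-six in 24 throws of a pair of dice,
# computed directly from the binomial distribution
p_at_least_one = 1 - binom.pmf(0, 24, 1./36)
print(p_at_least_one)   # approximately 0.491, just under one half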

Application of the hypergeometric

Our chocolate testing problem from the previous lecture is one that is perfectly suited to an application of the hypergeometric distribution.

There are 128 crates, 25 with bad cacao, and the border agents test 5 crates. If all crates are equally likely to be sampled, then

[Show code]
>>> from sympy import factorial
>>> nCr = lambda n,r : factorial(n)/factorial(r)/factorial(n-r)
>>> [ nCr(25,i) * nCr(103,5-i)/nCr(128,5) for i in range(6) ]
>>> [ (nCr(25,i) * nCr(103,5-i)/nCr(128,5)).evalf() for i in range(6) ]

The results are approximately …

[0.33088, 0.41778, 0.20053, 0.045666, 0.00492, 0.00020]

which are in close agreement with the simulation results we initially obtained.
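
The same numbers can also be read off from scipy's built-in hypergeometric distribution, as a quick cross-check (scipy.stats.hypergeom is parameterized by the population size, the number of marked items, and the number of draws):

[Show code]
>>> from scipy.stats import hypergeom
>>> # 128 crates total, 25 bad, 5 sampled; pmf(k) = chance the agents find k bad crates
>>> [ hypergeom.pmf(k, 128, 25, 5) for k in range(6) ]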

Deaths from horse-kicks

The Poisson distribution does a good job of describing some observations. But before we can make such a comparison, we have to know what the parameter \(\lambda\)’s true value is. How can we compare our prediction to our data if we don’t know \(\lambda\)?

Well, one answer is to just try many values of \(\lambda\) and look for the one that does the best job of fitting the data. Let’s look at a classic example of something that seems totally unrelated but can be explained by the Poisson distribution. Notice that in this case, the Poisson distribution fails to explain only about 4 percent of the distribution’s mass when the rate parameter is picked optimally. That’s pretty good, given the “totally random” nature of the process.

In 1898, the Russian economist Ladislaus Bortkiewicz published his first statistics book, entitled Das Gesetz der kleinen Zahlen (The Law of Small Numbers). In it he investigated the annual deaths by horse kicks in the Prussian Army from 1875 to 1894, across 14 different army corps including the “Guard Corps”. Have a look at the data:

[ Data ]

# number of deaths, number of occurrences 
# Annual deaths from horse kicks in the Prussian army (1875-1894)
# https://books.google.com/books?id=dh24EaSrmBkC&lpg=PA92&dq=prussian%20soldier%20horse%20deaths%20poisson&pg=PA92#v=onepage&q&f=false
# Originally from von Bortkiewicz's 1898 book, The Law of Small Numbers.
# Bortkiewicz rediscovered the Poisson distribution on his own
0,144
1,91
2,32
3,11
4,2
5,0
6,0
7,0

What we’re interested in is the distribution of the number of deaths per corps per year. For that we have 280 observations (14 corps over 20 years). We can “bin” the data (how many years with 0 deaths? how many with 1 death? etc.) by hand or quickly using python:

[Show code]
# Extracts & bins horse kick data

import numpy
from numpy import loadtxt
import csv

# First extract the raw data: one row per year, with the deaths in
# each of the 14 corps in columns 1-14 (column 0 is presumably the year)
data = loadtxt('HorseKicks.csv', delimiter=",")

# Usually you would use data.count(1) to get the number of ones,
# for example, but that doesn't work for an ndarray!
# Instead...
unique, counts = numpy.unique(data[:,1:15], return_counts=True)

# now write the binned counts to a .csv file
with open('horsekicks.txt', 'wt') as f:
    writer = csv.writer(f)
    writer.writerow( ('#Deaths', 'Count') )
    for i in range(len(unique)):
        writer.writerow( (int(unique[i]), counts[i]) )

### if you want to use list.count() instead...
def histogram(x, maxval):
    return [ (i, x.count(i)) for i in range(maxval+1) ]

hist = histogram(list(data[:,1:15].flatten()), 5)

Now we want to fit this to a Poisson distribution, \(\Pr(N=k)=\dfrac{\lambda^ke^{-\lambda}}{k!}\). So what’s \(\lambda\)? The brute force approach is to try a bunch of \(\lambda\) values and then check the error between the predicted probabilities and the observed frequencies in the data, say with the 1-norm \(E_1=\sum_j|y^{data}_j-y^{pred}_j|\) where \(j=0,1,2,3,4\).
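
Here is a minimal sketch of that brute-force search, working directly from the binned counts above over a simple grid of candidate \(\lambda\) values (the fuller script below does the same job with scipy's fminbound and animates it):

[Show code]
from numpy import array, linspace
from scipy.stats import poisson

# binned horse-kick data: number of corps-years with k deaths
k = array([0, 1, 2, 3, 4])
counts = array([144, 91, 32, 11, 2])
freq = counts / counts.sum()

# 1-norm error between the observed frequencies and the Poisson pmf
def error(lam):
    return sum(abs(freq - poisson.pmf(k, lam)))

# brute force: evaluate the error on a grid and keep the best candidate
grid = linspace(0.01, 3.0, 300)
lam_best = min(grid, key=error)
print(lam_best, error(lam_best))   # should land near the 0.665 quoted later in the text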

Best fit of a Poisson distribution to horse-kick mortality data

see gif animation of fit

[Show code]
#!/usr/bin/env python

"""
This script creates an animation showing how
a distribution can be fit to a data.

First, we simulate or load a data set.  Then we
compare the data to poisson distributions with
a range of success probabilities.
"""

from numpy import floor, cumsum, array, linspace, loadtxt
from numpy.random import rand
from scipy.stats import poisson
from scipy.optimize import fminbound
from matplotlib.pyplot import *
import time

def norm(x,e):
    return sum(abs(x)**e)**(1./e)

def rndBinomial(p, n, replicas=1):
    if replicas == 1:
        return int(sum(floor(rand(n) + p)))
    trials = [ rndBinomial(p,n) for i in range(replicas)]
    return trials

def main():
    # load data
    data = loadtxt('horse_kicks.txt',delimiter=',')
    x, y = map(array, zip(*data))
    N = sum(y)

    Y = y*1./N
    # distribution fitting
    def fit_error(L_guess):
        # calculate the distance between our guess and the data
        z = poisson.pmf(x, L_guess)
        return norm(Y-z, 1)/norm(Y,1) # can be any p-norm
    L_best = fminbound(fit_error, 1e-5, 3-1e-5)
    err_best = fit_error(L_best)
    print "Best fit: L = %f,  error = %f"%(L_best, fit_error(L_best))

    # generate data for animation
    F = [ (p, fit_error(p)) for p in linspace(1e-4,3-1e-4, 137) ]
    u, v = map(array, zip(*F))

    # draw basic figure
    fig = figure()

    subplot(2,1,2)
    plot(u, v, 'b-')
    #plot( L_best, fit_error(L_best), 'ko')
    marker_fit_error = plot( u[2], v[2], 'ro').pop()
    ylim(0,1)
    xlabel('Poisson intensity $\lambda$')

    subplot(2,1,1)
    plot(x, y, 'ko')
    z = poisson.pmf(x, L_best)
    width = .3
    xw = x-width*.5
    bar_ob = bar(xw, z*N, width, color='r')
    ttl = 'N = %d observations, $\lambda_{best} = %.3f$, $Err_{best} = %1.2g$'
    title(ttl%(N, L_best, err_best))
    ylabel('Number of times observed')
    xlabel('Number of successes')
    xlim(-0.5,max(x)+.5)
    ylim(0,max(y)*1.2)

    show(block=False)

    i, di = 0, 1
    i_max = len(u)
    j = 0
    while True:
        marker_fit_error.set_data([ u[i] ], [v[i]])
        z = poisson.pmf(x, u[i])*N
        for rect, h in zip(bar_ob, z):
            rect.set_height(h)
        fig.canvas.draw()
        savefig('frame%04d.tif'%j)
        j += 1
        i += di
        if i == i_max:
            di = -1
            i += 2*di
        elif i == 0:
            break

main()

Statisticians will usually say that a better way to determine \(\lambda\) is by maximum likelihood: for what value of \(\lambda\) are we most likely to observe this data?

\[\mathcal{L}(\vec{x}|\lambda) = \prod_{n=0}^{\infty} \left( \frac{\lambda^n e^{-\lambda}}{n!} \right)^{x_n}\] \[\ln \mathcal{L} (\vec{x}|\lambda) = \sum _ {n=0}^{\infty} x _ n \left( n \ln \lambda - \ln n! - \lambda \right)\]

We could maximize this numerically without too much effort, but in this problem, we can make more analytic progress. Maximizing by differentiating in \(\lambda\) and setting the derivative equal to zero, we find \[\lambda _ {MLE} = \frac{\sum _ n n x _ n}{\sum _ n x _ n} = 0.7\] Note that this is pretty close to the \(0.665\) obtained by minimizing the one-norm, but not exactly the same.
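
That arithmetic is easy to check directly from the binned counts (a minimal sketch):

[Show code]
# maximum likelihood estimate of the Poisson intensity:
# total deaths observed divided by total corps-years observed
deaths = [0, 1, 2, 3, 4]
counts = [144, 91, 32, 11, 2]
lam_mle = sum(d * c for d, c in zip(deaths, counts)) / sum(counts)
print(lam_mle)   # 196 / 280 = 0.7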

Maximum likelihood fit of a Poisson distribution to horse-kick mortality data

see gif animation showing fit of Poisson distribution to horse-kick mortality data from Prussian army using maximum likelihood

[Show code]
#!/usr/bin/env python3

from numpy import cumsum, array, linspace, log10, loadtxt
from scipy.stats import poisson
from scipy.optimize import fminbound
from matplotlib.pyplot import *
import time

def main():
    # load data
    csvfilename = 'horse_kicks.csv'
    data = loadtxt(csvfilename, delimiter=',')
    x, y = list(map(array, list(zip(*data))))
    N = sum(y)

    Y = y*1./N
    # distribution fitting
    def loglikelihood(L_guess):
        # average log likelihood per observation (base-10 log; the
        # maximizing L is the same as for the natural log)
        z = sum(log10(poisson.pmf(x, L_guess))*y)/N
        return z
    Lbounds = 1e-5, 3-1e-5
    L_best = fminbound(lambda x : -loglikelihood(x), min(Lbounds), max(Lbounds))
    err_best = loglikelihood(L_best)
    z = poisson.pmf(x, L_best)
    cz = cumsum(z)
    print("Best fit: L = %f,  error = %f"%(L_best, loglikelihood(L_best)))
    print("\t", sum(x*y)/sum(y))

    # generate data for likelihood
    F = [ (p, loglikelihood(p)) for p in linspace(Lbounds[0],Lbounds[1], 137) ]
    u, v = list(map(array, list(zip(*F))))

    # draw basic figure
    fig = figure()
    subplot(2,1,2)
    plot(u, v, 'b-')
    #plot( L_best, fit_error(L_best), 'ko')
    marker_fit_error = plot( L_best, err_best, 'ro').pop()
    ylim(-1.5,0)
    xlabel('Poisson intensity $\lambda$')
    ylabel('Log likelihood')

    subplot(2,1,1)
    plot(x, y, 'ko', label='Observations')
    width = .3
    bar_ob = bar(x, z*N, width, color='r')
    title('N = %d observations, $\lambda_{MLE} = %.3f$, $LL_{MLE} = %1.2g$'%(N, L_best, err_best))
    ylabel('Number of times observed')
    xlabel('Number of successes')
    legend(framealpha=0)
    xlim(-0.5,max(x)+.5)
    ylim(0,max(y)*1.2)
    tight_layout()
    savefig('horsesMLE1.png', transparent=True)

    figure(2)
    for t in range(20):
        r = list(poisson.rvs(L_best,size=int(N)))
        yy = cumsum(array([r.count(i) for i in range(7+1)]))/N
        plot(cz, yy, 'r+')
    plot(cz,cumsum(y/N),'bo',label='Model fit',markersize=8)
    plot((0,1),(0,1),'k-',linewidth=2)
    xlabel('Predicted probability')
    ylabel('Observed frequency')
    xlim(0,1)
    ylim(0,1)
    legend(framealpha=0)
    tight_layout()
    savefig('horsesMLEPP.png', transparent=True)
    show()
    return

main()

No matter how you do it, the Poisson distribution does a remarkably good job explaining the Prussian army horse kick data. Yet death by horse-kick is a rare thing, and it is hard to imagine a single common circumstance accounting for all of these deaths. Perhaps if there is a particular circumstance related to these deaths, the value of the parameter \(\lambda\) carries additional information and the Poisson distribution can tell us something more about the circumstances of the deaths. But this is probably not the case.

The explanation of the data in terms of a Poisson distribution is probably a consequence of the stability of the Poisson distribution under the superposition of many independent and even-more-rare circumstances. One of the special properties of the Poisson distribution is that it is additive: if you take two or more circumstances well-described by the Poisson distribution and add them together, the new distribution is also a Poisson distribution. The convergence of the binomial distribution to the Poisson distribution is sometimes called “the law of rare events” (Bortkiewicz, 1898), and it holds more generally than just for the binomial distribution. The hypergeometric distribution of draws from an urn is also approximately a Poisson distribution when the number of stones is large and the number of draws is small (see exercises). If you shuffle a large shoe of cards well, the number of cards that end up in the same position they started is approximately Poisson distributed. If each event is independent and has its own probability, and those probabilities are all small, then the frequency of the events in a population will converge to the Poisson as the population gets infinitely large. And it turns out that the original circumstances don’t even have to be described by the Poisson distribution for the superposition to be a Poisson distribution. Khinchin showed that as independent renewal processes with nice interevent distributions are superimposed, the distribution of events in a finite window converges to a Poisson process (Mathematical Methods in the Theory of Queuing, 1960 in translation).
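
The law of rare events is easy to see in a quick simulation: superimpose many independent event streams, each individually rare with its own probability, and the total count looks Poisson. A minimal sketch (the number of streams and their rates are arbitrary choices, for illustration only):

[Show code]
from numpy.random import rand
from scipy.stats import poisson

# 200 independent rare circumstances, each with its own small probability
rates = 0.01 * rand(200)
lam = rates.sum()

# superimpose them many times and count the total number of events each time
reps = 10000
totals = (rand(reps, len(rates)) < rates).sum(axis=1)

# compare the observed frequencies with a Poisson pmf of the same intensity
for k in range(6):
    print(k, round((totals == k).mean(), 4), round(poisson.pmf(k, lam), 4))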

Here is another example that seems like it should also be Poisson distributed: counts of accidents among machinists during World War I.

[ Data ]

# number of accidents, number of machinists with that many accidents
#
# The number of accidents over 3 months among 414 machinists during World War
# One.  The data comes from a 1920 study of Greenwood & Yule.
# http://www.jstor.org/stable/2341080
#
0,296
1,74
2,26
3,8
4,4
5,4
6,1
7,0
8,1
[Show code]
#!/usr/bin/env python

"""
This script creates an animation showing how
a distribution can be fit to a data.

First, we simulate or load a data set.  Then we
compare the data to binomial distributions with
a range of success probabilities.
"""

from numpy import floor, cumsum, array, linspace, loadtxt
from numpy.random import rand
from scipy.stats import poisson
from scipy.optimize import fminbound
from matplotlib.pyplot import *
import time

def norm(x,e):
    return sum(abs(x)**e)**(1./e)

def rndBinomial(p, n, replicas=1):
    if replicas == 1:
        return int(sum(floor(rand(n) + p)))
    trials = [ rndBinomial(p,n) for i in range(replicas)]
    return trials

def main():
    # load data
    data = loadtxt('munitions_factory.data',delimiter=',')
    x, y = map(array, zip(*data))
    N = sum(y)

    Y = y*1./N
    # distribution fitting
    def fit_error(L_guess):
        # calculate the distance between our guess and the data
        z = poisson.pmf(x, L_guess)
        return norm(Y-z, 1)/norm(Y,1) # can be any p-norm
    L_best = fminbound(fit_error, 1e-5, 3-1e-5)
    err_best = fit_error(L_best)
    print "Best fit: L = %f,  error = %f"%(L_best, fit_error(L_best))

    # generate data for animation
    F = [ (p, fit_error(p)) for p in linspace(1e-2,1.2-1e-4, 137) ]
    u, v = map(array, zip(*F))

    # draw basic figure
    fig = figure()

    subplot(2,1,2)
    plot(u, v, 'b-')
    #plot( L_best, fit_error(L_best), 'ko')
    marker_fit_error = plot( u[2], v[2], 'ro').pop()
    ylim(0,1)
    xlabel('Poisson intensity $\lambda$')

    subplot(2,1,1)
    plot(x, y, 'ko')
    z = poisson.pmf(x, L_best)
    width = .3
    xw = x-width*.5
    bar_ob = bar(xw, z*N, width, color='r')
    title('N = %d observations, $\lambda_{best} = %.3f$, $Err_{best} = %1.2g$'%(N, L_best, err_best))
    ylabel('Number of times observed')
    xlabel('Number of successes')
    xlim(-0.5,max(x)+.5)
    ylim(0,max(y)*1.2)
    tight_layout()

    show(block=False)

    i, di = 0, 1
    i_max = len(u)
    j = 0
    while True:
        marker_fit_error.set_data([ u[i] ], [v[i]])
        z = poisson.pmf(x, u[i])*N
        for rect, h in zip(bar_ob, z):
            rect.set_height(h)
        fig.canvas.draw()
        savefig('frame%04d.tif'%j)
        j += 1
        i += di
        if i == i_max:
            di = -1
            i += 2*di
        elif i == 0:
            break

main()

Student Height

Photograph of students binned by height, year 1914

[ Data ]

#feet,inches,total inches,count
#
# @article{bib:Blakeslee1914,
#   author = {Blakeslee, Albert F.},
#   title = {{CORN AND MEN}: The Interacting Influence of Heredity and Environment-Movements
#       for Betterment of Men, or Corn, or Any Other Living Thing
#       One-sided Unless They Take Both Factors into Account},
#   journal = {Journal of Heredity},
#   volume = {5},
#   number = {11},
#   pages = {511},
#   year = {1914},
#   doi = {10.1093/oxfordjournals.jhered.a107785},
#   URL = {http://dx.doi.org/10.1093/oxfordjournals.jhered.a107785},
# }
#
# This is the data from a famous photograph
# visually showing the distribution of male
# height at the Connecticut State
# Agricultural College
#
# http://www.sbs.utexas.edu/levin/bio213/genetics/height.dist.gif
#
4,09,57,0
4,10,58,1
4,11,59,0
5,00,60,0
5,01,61,1
5,02,62,5
5,03,63,7
5,04,64,7
5,05,65,22
5,06,66,25
5,07,67,26
5,08,68,27
5,09,69,17
5,10,70,11
5,11,71,17
6,00,72,4
6,01,73,4
6,02,74,1
6,03,75,0

Students at the Connecticut Agricultural College in 1914 were binned according to their height in inches and asked to stand in order on their sports field. The resulting picture bears a strong resemblance to a Normal distribution. However, fitting a Normal distribution to data is a little more complicated than fitting a Poisson distribution – the Normal has two parameters that must be determined, the mean \(\mu\) and the standard deviation \(\sigma\). Again, we can measure the goodness of fit in several different ways, with different merits. If we use likelihood, things are a little different for continuous distributions, since the probability of each individual number is infinitesimally small. But if we use the log-likelihood, this infinitesimal part just becomes an additive constant and has no effect on the parameter inference.

\[\ln \mathcal{L}(x | \mu, \sigma) \propto C - n \ln \sigma -\frac{1}{2\sigma^2} \sum _ {i=1}^n (x _ i - \mu)^2\]

Using a standard minimization algorithm, we can find the parameters for which the Normal curve fits the data best. But we don’t have to search for the best fit numerically. There are quick formulas for the parameter values where the data are most likely. If we differentiate the likelihood with respect to the mean and standard deviation of the normal distribution and then solve for the values where the derivatives are zero (maximizing the likelihood), we find the maximum likelihood estimates (MLE) \[\mu _ {MLE} = \frac{\sum _ i x _ i}{n}\] and \[\sigma _ {MLE} = \sqrt{ \frac{1}{n} \sum _ i (x _ i - \mu _ {MLE})^2}.\] These are standard results in statistics, and they are built into most statistical software tools like R and SAS.

[Show code]
#!/usr/bin/python

from pylab import *
import csv

from scipy.stats import norm as gaussian
from scipy.optimize import fmin


def main():
    fname='height.csv'
    data = loadtxt(fname, delimiter=',', comments='#',dtype=int)
    x = data[:,2]
    y = data[:,3]
    m =data.shape[0]

    z = []
    for i in range(m):
        z += [x[i]]*y[i]
    n = len(z)
    z = array(z)

    # slow, inaccurate method
    def fit_likelihood(X):
        a,b = X
        return -sum(log(gaussian.pdf((z-a)/b)/b))
    objective = fit_likelihood
    result = fmin(objective, [67.,5.])
    mu, sigma = result[0], result[1]
    print("Numerical Parameter fits: ", (mu, sigma))
    print("\tAverage Log-Likelihood: ", fit_likelihood(result)/n)

    # fast, exact method
    mu = sum(z)/float(n)
    sigma = sqrt(sum((z-mu)**2)/n)
    print("Exact Parameter fits: ", (mu,sigma))
    print("\tAverage Log-Likelihood: ", fit_likelihood((mu,sigma))/n)
    print("\tn = : ", fit_likelihood((mu,sigma))/n)


    u = linspace(min(z)*.95,max(z)*1.05, 256)
    figure(1, figsize=(8,16))
    subplot(3,1,1)

    plot(u, n*gaussian.pdf((u-mu)/sigma)/sigma, 'b-')
    width = .3
    bar_ob = bar(x-width*.5, y, width, color='k')
    xlim(min(u), max(u))
    ylabel('Count', fontsize=18)
    title('Best Fit Gaussian: $\mu = %f$, $\sigma = %f$'%(mu,sigma), fontsize=18)

    subplot(3,1,2)
    q = linspace(0,1,n)
    plot(z,q,'ko', u, gaussian.cdf((u-mu)/sigma),'b-')
    xlabel('Height (inches)', fontsize=18)
    ylabel('Cumulative Probability', fontsize=18)
    xlim(min(u), max(u))

    subplot(3,1,3)
    plot(gaussian.cdf((z-mu)/sigma), q, 'o', q, q, 'k-')
    xlabel('Expected probability', fontsize=18)
    ylabel('Observed probability', fontsize=18)

    savefig('height_gaussian.png', bbox_inches='tight')
    savefig('height_gaussian.pdf', bbox_inches='tight')
    show()

main()
Visual illustration of our maximum likelihood fit of a Normal distribution to the student height data

We can see from the figures that the Normal does a good job of representing the data. The fitted CDF goes right through the middle of the empirical CDF, and the probability plot lies right along the diagonal line.

Measurements of paper thickness

Another example is a set of measurements of paper-thickness performed by students.

[ Data ]

# student number, measurement 1, measurement 2, measurement 3, measurement 4
#
# measurements of paper thickness, in millimeters
# Each student made 4 measurements
#
# Data from Youden, Experimentation and Measurement, 1962
# reprinted September, 1991
#
01, .0757, .0762, .0769, .0746
02, .0808, .0793, .0781, .0821
03, .0811, .0772, .0770, .0756
04, .0655, .0683, .0714, .0746
05, .0741, .0710, .0748, .0711
06, .0756, .0772, .0776, .0759
07, .0775, .0785, .0760, .0761
08, .0747, .0765, .0735, .0776
09, .0719, .0762, .0802, .0713
10, .0734, .0833, .0833, .0783
11, .0755, .0740, .0714, .0743
12, .0788, .0817, .0794, .0766
13, .0731, .0716, .0726, .0714
14, .0833, .0794, .0783, .0788
15, .0767, .0775, .0765, .0793
16, .0787, .0798, .0864, .0817
17, .0784, .0799, .0789, .0802
18, .0784, .0820, .0796, .0818
19, .0830, .0796, .0778, .0767
20, .0741, .0680, .0733, .0723
21, .0759, .0766, .0772, .0466
22, .0810, .0812, .0789, .0776
23, .0777, .0759, .0795, .0790
24, .0784, .0786, .0797, .0859

These data are not binned by default. We could bin them or use a kernel estimator, but they can be analyzed in their current form using the Normal’s cumulative distribution function. The Normal distribution does a moderately good job of fitting the data, but there is some systematic error, as revealed by the probability plot.

Comparison of the empirical (for paper thickness data) and predicted (by maximum likelihood Normal) cumulative distributions
Probability plot of quality of fit of the Normal distribution to paper thickness data
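
A sketch of how such a fit and probability plot can be produced, along the same lines as the height example above (this assumes the measurements have been saved to a file called paper_thickness.csv, a hypothetical filename):

[Show code]
from numpy import loadtxt, sort, linspace
from scipy.stats import norm as gaussian
from matplotlib.pyplot import plot, xlabel, ylabel, show

# load the 24 x 4 table of measurements and flatten it into one sample
data = loadtxt('paper_thickness.csv', delimiter=',', comments='#')
z = sort(data[:, 1:].flatten())
n = len(z)

# maximum likelihood estimates for the Normal distribution
mu = z.mean()
sigma = z.std()   # divides by n, matching the MLE formula above

# probability plot: predicted CDF at each observation against its empirical frequency
q = linspace(0, 1, n)
plot(gaussian.cdf(z, mu, sigma), q, 'o', q, q, 'k-')
xlabel('Expected probability')
ylabel('Observed probability')
show()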

Exercises

  1. In a 1-page paper in 1946, Clarke described the number of flying-bomb strikes in south London during World War II. Fit these data with a Poisson distribution, estimating the intensity \(\lambda\).

    count, # sites with this many bombs
    0, 229
    1, 211
    2, 93
    3, 35
    4, 7
    5+, 1
  2. Is the number of hurricanes reaching Florida each year between 1850 and 2010 Poisson-distributed?

  3. In 1975, a subset of Penn State’s female students were binned according to their height and asked to stand in order in front of Old Main (see below).
    Fit these data with a Normal distribution.

[ Data ]

# Data from Figure 1 of Joiner's Living Histograms (1975)
# histogram of women sorted by height
# height (in), number
59,2
60,5
61,7
62,10
63,16
64,23
65,20
66,16
67,9
68,6
69,6
70,3
71,1
72,1
Photograph of students binned by height (1975)
  4. A common problem in rock and mineral mining is estimating the distribution of stone sizes during crushing. Empirical data is collected by dropping crushed stone through mesh sieves of successively smaller hole size, and measuring the fraction that does not fall through each. These numbers can be interpreted as the complementary cumulative distribution function (\(1- F _ {CDF}(x)\)). In 1933, Rosin and Rammler proposed a model for the distribution of particle sizes of ground coal based on a two-parameter Weibull distribution (a member of the Fisher-Tippett-Gumbel family), which has cumulative distribution function \[F _ {CDF}(x) = 1 - e^{-(x/\lambda)^n}.\] A very convenient feature of the Weibull distribution is that its CDF becomes a linear function of \(\ln x\) under the transformation \(\ln\left( - \ln(1-F _ {CDF}(x))\right)\).

    1. Use the transformation above and linear least squares to estimate \(\lambda\) and \(n\) for this data on milled cork particle sizes.

    2. Plot the best-fit complementary CDF and the sieve data on top of each other. How good a job did the Weibull do?

[ Data ]

#Min size (mm), max size (mm), Mesh size (mm), Fraction (g), Fraction (%), Cumulative % weight (under), Cumulative % weight (over)
# doi:10.1016/j.matchar.2004.04.007
# table 1, p 161
# PSD analysis of milled cork
0,0.71,0.71,1.00,1.39,1.39,98.61
0.71,1,1.0,1.70,2.36,3.75,96.25
1,1.4,1.4,8.60,11.94,15.69,84.31
1.4,2,2.0,7.20,10.00,25.69,74.31
2,2.8,2.8,17.70,24.58,50.27,49.73
2.8,4,4.0,20.50,28.47,78.74,21.26
4,5.6,5.6,11.40,15.83,94.57,5.43
5.6,8,8.0,2.30,3.19,97.76,2.24
8,11.2,11.2,1.60,2.22,100,0
  5. Below is a data set of maximum rainfalls in Brussels, Belgium over a 50-year period. Two possible models of this data are the Normal distribution (a.k.a. the Gaussian distribution) and Gumbel’s distribution. Gumbel’s distribution seems the better choice, a priori, because of its frequent use in risk analysis to describe extreme natural events, but we’d like to test this hypothesis ourselves.

    1. Plot the time series.

    2. Plot the empirical cumulative distribution function (CDF) for the maximum daily rainfall.

    3. Find the Normal distribution under which the data are most likely.

    4. Using scipy’s fmin function, find Gumbel’s distribution under which the data are most likely.

    5. Draw the empirical CDF together with the best-fit Normal CDF and the best-fit Gumbel CDF on a single plot. Using this plot, discuss the quality of the distribution fits.

[ Data ]

# year, maximum rainfall in one day (mm)
#
# Annual maximum daily rainfall for Uccle (Brussels, Belgium) for
# the 50-year period 1934-1983 measured in millimeters.
#
# @book{demaree1985intensity,
#   title={Intensity-duration-frequency Relationship of Point Precipitation at Uccle: Reference Period 1934-1983},
#   author={Demar{\'e}e, Gaston R},
#   year={1985},
#   publisher={Institut Royal M{\'e}t{\'e}orologique de Belgique},
#   url={https://books.google.com/books/about/Intensity_duration_frequency_Relationshi.html?id=5dectgAACAAJ}
# }
1934,23.9
1935,37.6
1936,23.8
1937,29.2
1938,33.9
1939,37.7
1940,60.1
1941,30.9
1942,72.1
1943,51.4
1944,24.2
1945,46.9
1946,26.6
1947,29.3
1948,29.5
1949,19.8
1950,39.0
1951,43.0
1952,51.1
1953,39.2
1954,31.7
1955,23.0
1956,34.6
1957,37.7
1958,42.9
1959,30.8
1960,48.0
1961,32.2
1962,58.6
1963,78.3
1964,43.0
1965,48.4
1966,39.3
1967,23.9
1968,26.0
1969,52.9
1970,29.2
1971,45.7
1972,28.6
1973,30.0
1974,30.7
1975,25.8
1976,33.0
1977,26.3
1978,49.2
1979,33.6
1980,38.2
1981,56.8
1982,50.1
1983,24.2
  6. The Boeing Airplane Company makes large use of aluminum in airplane construction, so it was important to them to know the failure distribution of aluminum parts. As an experiment, they subjected 101 aluminum strips to periodic loading of up to 21,000 pounds per square inch, 18 times per second. The numbers of cycles until failure (in thousands) are given below.

    1. Find the Normal distribution under which the data are most likely.

    2. Using scipy’s fmin function, find Gumbel’s distribution under which the data are most likely.

[ Data ]

# Thousands of cycles until rupture of rectangular strips of 6061-T6 aluminum
# sheeting, subjected to periodic loading with maximum stress of 21,000 psi
# 
# Birnbaum, Z. W. and Saunders, S. C., A Statistical Model for Life-Length of
# Materials, Journal of the American Statistical Association, 53(281) p 151-160. 1958.
# http://www.itl.nist.gov/div898/handbook/eda/section4/eda4291.htm
# 
370
706
716
746
785
797
844
855
858
886
886
930
960
988
990
1000
1010
1016
1018
1020
1055
1085
1102
1102
1108
1115
1120
1134
1140
1199
1200
1200
1203
1222
1235
1238
1252
1258
1262
1269
1270
1290
1293
1300
1310
1313
1315
1330
1355
1390
1416
1419
1420
1420
1450
1452
1475
1478
1481
1485
1502
1505
1513
1522
1522
1530
1540
1560
1567
1578
1594
1602
1604
1608
1630
1642
1674
1730
1750
1750
1763
1768
1781
1782
1792
1820
1868
1881
1890
1893
1895
1910
1923
1940
1945
2023
2100
2130
2215
2268
2440
  7. We’ve described stable distributions for processes that involve superposition, addition, and minimization, but other processes can generate asymptotic distributions as well. For example, the Wigner distribution describes the distribution of the eigenvalues of a very large symmetric matrix where the individual entries in the matrix are drawn from a standard normal distribution. Generate 100 random matrices of this form for matrix sizes of \(N = 10\), \(10^2\), \(10^3\), and \(10^4\). Empirically estimate and plot the distribution of the eigenvalues for such matrices.

There are more data sets like these that you can explore on the internet, for example at the Journal of Statistics Education, UFL, and GitHub.
