( previous, home, next )


Further reading

Common discrete distributions

Probability distributions provide handy models of many random phenomena and are an important building blocks for many sophisticated stochastic models. Many of the common probability distributions have special relationships between each other – see Distribution relationships.

While probability distributions in general can be applied to data with any number of dimensions, their are a few common ones describing chance outcomes on the natural numbers with which everybody should be familiar. Here is a list of a few common and important discrete probability distributions. For reference, given a random variable \(x\) that takes the value of a natural number, then the probability generating function for \(x\) is \[\hat{x}(s) = E[s^n] = \sum _ {n=0}^{\infty} p _ n s^n\] while the cumulant generating function \[\tilde{x}(t) = \ln E[e^{t}] = \ln \hat{x}(e^{t}).\]

Bernoulli Distribution

The Bernoulli distribution describes the outcome of one flip of a (possibly) biased coin. A random variable representing a single coin flip will take the value 1 (heads, success) with probablity \(p\) and 0 (tails, failure) with probability \(1-p\).

\[\hat{x}(t) = \sum _ {k=0}^1 \Pr(X = k) s^k = (1-p) + p s\] \[\tilde{x}(t) = \ln( 1- p + p e^t)\]

The Bernoulli distribution is the simplest meaningful distribution.

Hypergeometric Distribution

Assume an urn contains \(n\) stones, of which \(j\) are black, and the rest are white. From this urn, draw exactly \(m\) stones all at once. What is the probability that \(k\) of those stones are black?

\[\hat{x}(s) = \sum_{k=0}^{m} \Pr(X = k) s^k = \sum_{k=0}^{m} \frac{{}_jC_k \; {}_{n-j}C_{m-k}}{{}_nC_m} s^k.\]

The hypergeometric distribution can also be thought of as describing the probability of drawing \(k\) wins in \(n\) trials, without replacing stones as their drawn.

Binomial Distribution

What is the probability of having \(k\) successes among \(n\) trials when each trial is independent and identical? The answer is the binomial probability distribution, with \(p\) being the probability of an individual success. The binomial probability generating function

\[\hat{x}(s) = \sum_{k=0}^{n} \Pr(X = k) s^k = \sum_{k=0}^{n} {}_nC_k \, p^k(1-p)^{n-k} s^k = (1-p + ps)^n.\]

\[\tilde{x}(t) = n \log(1-p + pe^t).\]

The symbol \({}_nC_k\) represents a number of possible combinations. Suppose we have \(n\) different items. If we don’t care about the order in which they were picked, how many different outcomes are there for picking \(k\) of these \(n\) items. Well, don’t pick any (\(k=0\)), then there is only one possible outcome (an empty selection), so \(_nC_0 = 1\). If we pick 1 item, then there are \(n\) possible outcomes, so \(_nC_1 = n\). And if we pick all of them, then there is only one outcome, so \(_nC_n = 1\). All other cases can be calculated from the two-dimensional linear difference equation \[{}_nC_k = {}_{n-1}C_{k-1} + {}_{n-1}C_{k}\] governing Pascal’s triangle. (If you find this interesting, you might want to explore Gian-Carlo Rota’s 12-fold way, as discussed in Stanley’s book.) The general formula can then be shown to be \[_nC_k = \frac{n!}{k! (n-k)!}\] where \(n!\) (read “n factorial”) is the product of all integers from \(1\) to \(n\) (and 0! = 1).

Geometric Distribution

What is the probability that you can have k successes before your first failure? The answer is a geometric distribution with success probability p. \[\hat{x}(s) = \sum _ {k=0}^{\infty} (1-p) p^k s^k = \frac{1-p}{1 - p s}\] \[\tilde{x}(t) = \ln((1-p)/(1-pe^{t}))\]

Pascal Distribution (a.k.a. Negative binomial)

What is the probability of having k successes before r failures, when each trial is independent and identical? The answer is a Pascal probability distribution, (a special case of the negative binomial probability distribution) \[\hat{x}(s) = \sum _ {k=0}^{\infty} {} _ {k+r-1}C _ {k} (1-p)^r p^k s^k = \left( \frac{1-p}{1-ps} \right)^r\] \[\tilde{x}(\theta) = r\ln((1-p)/(1-pe^{\theta}))\]

Common distributions a line

All of the examples we have given are distributions over the natural numbers. There are distributions defined over the real numbers as well. In fact, each of the Binomial, Geometric, and Pascal distributions have continuous analogous. Distributions over continuums feel different than discrete ones because the probability of each outcome is (usually) zero – when you have pick one thing from an infinitely large set and all the elements have similar chances of being picked, each element alone must have a vanishing chance of being picked. Only when targetting full intervals of outcomes can we expect to have a positive probability of success. The formal mathematics of this is subtle, and was the motivation for the development of “measure theory”, which is now a fundamental component of mathematical analysis. However, this digression will take us too far afield from our current aims.

Instead, we will rely on our calculus experience and intuition against pathologies. Let us say that when drawing a sample from a distribution \(P\) over the positive numbers, the probability of drawing a specific outcome \(x\) is \(P(x) dx\), which is infinitely small. But, collectively, the probability of drawing an outcome from a finite interval \([a,b]\) has probability \[\int _ a ^ b P(x) dx\] which may well be finite. One of the consequences of this is that these distributions do not have probability generating functions although they do still have cumulant generating functions.

Exponential distribution

The most common distribution over a line is the exponential distribution \[P(t) = \lambda e^{-\lambda t}.\] The exponential distribution is the continuous analog of the geometric distribution. The exponential distribution’s atom \(\lambda e^{-\lambda t} dt\) is the chance of waiting exactly \(t\) for a failure event.
\[\tilde{x}(\theta) = - \ln ( 1 - \theta/\lambda)\] The mean \(1/\lambda\) is expressed in terms of the single parameter \(\lambda\) called the hazard rate,

Gamma distribution

Unimodal distributions on a continuous line

It might be tempting to think that given the “shape” of some data distribution you observe, there is a distribution to match that data and that that distribution hence informs the data you are studying. Sometimes, this is the case, but you should also be aware that for any given shape, there are many different distributions that can be good matches – we are only limited by the limits we allow our imaginations. To illustrate this point, let me list a few different unimodal distributions on a line

\[\frac{1}{2 \pi} e^{-x^2/2}\]

\[\frac{1}{2 \cosh\left( \frac{\pi}{2} x \right)}\]

All of these look about the same – a single peak, centered on the origin, symmetric to the left and right and decreasing to zero. There are others as well, though their expressions become increasingly more complex and abstract.

Stable Distributions

There are many different distributions that can be useful in modelling, and you can explore more on your own (maybe even discover a new one ;). Many random processes that we want to understand in the real world are complicated with lots of variables and interactions, each of which needs to be accounted for in constructing a distribution theory for the outcomes.

But there are a few distribution families that seem to come up frequently – surprisingly frequently when you first think about it. They are the Poisson distribution, the Normal distribution, and the Extreme value distributions. These are stable asymptotic distributions. They are asymptotic because their formulas are calculated using the idea that the number of events is infinitely large (\(n \rightarrow \infty\)). They are stable in the sense that many different distributions of individual events converge to the same asymptotic distributions. Each of the three families appears under different circumstances. The Poisson distribution appears as we study the frequencies of superimposed independent rare events. The Normal distribution appears as we study the accumulation of large numbers of independent events. Extreme-value distributions appear when we study the extremes of large numbers of independent events.

In a modelling context, stable distributions have pro’s and con’s. On one hand, it is very useful to have a few common and well-understood distributions that can be used to explain so many data sets with reasonably good accuracy. On the other hand, it is very difficult to extract additional information beyond parameter estimates from data sets that are well-fit by stable distributions; in some sense, these distributions have forgotten the nature of the original processes that created them.


Factorials \(n!\) are major pains to calculate with even when \(n\) is only modestly large. For example, \(25!\) has 26 digits, and the numbers just keep getting bigger as \(n\) gets larger. But in order to simplify our calculation, we have to be able to do some calculations with factorials. Abraham de Moivre (1667 - 1754), a great neglected mathematician (in life and death), discovered a very useful formula that allows us to approximate factorials with multiplication and exponentiation.

\[n! \sim \sqrt{2 \pi n} \left( \frac{n}{e} \right)^n\]

We call this formula Stirling’s approximation (remember, I said de Moivre is forgotten – Stirling was one of de Moivre’s contemporaries), and it is an example of an asymptotic approximation. Two functions \(f(n)\) and \(g(n)\) are asymptotically equivalent (\(f \sim g\)) if and only if \[\lim _ {n\rightarrow \infty} \frac{f(n)}{g(n)} = 1.\] Asymptotic equivalence is not the same as equality. For example, \[\begin{align} n &\sim n + 12, \\ n^3 &\sim n (n+1) (n+2), \\ 1/n &\sim n/(n^2 + 1). \end{align}\] Asymptotic approximation methods are a deep and powerful field invigorated by Poincare, Stieltjes, and particularly Prandlt’s boundary layer theory of aerodynamics. In applied mathematics, important contributions to the theory were made by Watson, Erdelyi, Nayfeh, and Kevorkian, among others. But for now, let’s just take Stirling’s approximation as-is and see what we can do with it.

Stirling’s formula is commonly derived using the “method of steepest descent”, a very useful trick for approximating some integrals that would otherwise have to approximated numerically.

First, if you recall from differential equations, the Laplace transform of a monomial \(t^n\) is given by \[\begin{gather} \mathscr{L}[t^n](s) = \int_{0}^{\infty} t^n e^{-st} dt = \frac{n!}{s^{n+1}}. \end{gather}\] In the special case of \(s = 1\), then \[\begin{gather} n! = \int_{0}^{\infty} t^n e^{-t} dt. \end{gather}\] At first, this seems like a rather silly formula, does two very useful things. First, not that the integral is defined, not just for integer values of \(n\), but for all positive real numbers. Thus, we know have a way to make sense of \((1/2)!\). Second, we know have an algebraic expression we can approximate. In particular, by Taylor series expansion around an exponent maximum, \[\begin{align*} \int_{0}^{\infty} t^n e^{-t} dt &= \int_{0}^{\infty} e^{- n ( t/n - \ln t)} dt \\ & \approx \int_{0}^{\infty} e^{ n \log{\left (n \right )} - n - \frac{\left(t- n\right)^{2}}{2 n} } dt \\ & = e^{ n \log{\left (n \right )} - n} \int_{0}^{\infty} e^{-\frac{\left(t- n\right)^{2}}{2 n} } dt \\ & \approx \left( \frac{n}{e} \right)^n \int_{-\infty}^{\infty} e^{-\frac{\left(t- n\right)^{2}}{2 n} } dt \\ & = \sqrt{2 \pi n} \left( \frac{n}{e} \right)^n \end{align*}\] If this approximation feels a little weird to you, that’s probably good. It is not obvious that a local Taylor series approximation like this must give an accurate approximation accross a semi-infinite integration interval turned into a bi-infinite interval. A more convincing arguement for the reasonablness of these approximations can be give, but must be left aside for now.

The Poisson Distribution

Suppose we are interest in the distribution of the number of new phone calls a receptionist recieves in an hour. One approach to this is to break the hour up into 15 minute quarters. In each quarter-hour, there’s a chance of the phone ringing, and as long as each ring is a new caller acting independently of previous callers, we can model the distribution of calls as the binomial distribution where \(k\) is the number of calls and \(p _ {15}\) is the chance of a call in a quarter hour:

\[\frac{4!}{k! (4-k)!} \, p _ {15}^k(1-p _ {15})^{4-k}.\]

However, this is a bit artificial – since we only considered 4 calling periods in our hour, the model predicts the receptionist will never have to handle 5 calls in the hour. We can fix this by breaking the hour down into 12 smaller intervals of 5 minutes each. When we do this, though, the chance of a call arriving in each interval should be smaller as well. Perhaps \(p _ {5} \approx p _ {15}/3\). But 5 minute intervals is rather arbitrary as well. Why not 1-minute intervals or 1 second intervals even?

But at the same time, as we do this, the numbers we have to work with in the binomial formula become more and more difficult to work with. The binomial and hypergeometric distributions rely on factorials, and hence difficult to work with when the number of events \(n\) is large – 100! has 157 digits! Fortunately, the calculations can be simplified. Consider the binomial distribution \[\hat{x}(s;p,n) = \sum _ {k=0}^{n} \frac{n!}{k! (n-k)!} \, p^k(1-p)^{n-k} s^k.\] When \(n\) is large but the probability of an event occurring is small, we can choose some positive constant \(\lambda\) such that \(p = \lambda/n\).

According to Stirling’s formula and Bernoulli’s limit, when \(n\) is large, \[\frac{n!}{(n-k)!} \sim n^k\] \[\left(1-\frac{\lambda}{n}\right)^{n} \sim e^{-\lambda}\] \[\frac{ (\lambda/n)^k }{(1-\lambda/n)^{k}} \sim (\lambda/n)^k\] And so, if we make the appropriate substitutions into the binomial distribution’s probability distribution, we find the probability of observing \(k\) events, \[P(k) = \frac{ \lambda^k e^{-\lambda} }{k!}\] This is what we call the Poisson distribution. The Poisson distributions probability and cumulant generating functions are respectively \[\hat{x}(s; \lambda) \sim \sum _ {k=0}^{\infty} \frac{ \lambda^k e^{-\lambda} }{k!} s^{k} = e^{\lambda (s-1)}, \quad\text{where}\; \lambda = p n, p \approx 0, n \approx \infty\] \[\tilde{x}(t) = \lambda(e^t-1).\] Poisson rediscovered this formula after de Moivre’s work was forgotten, and then it was forgotten again until Bortkiewicz, Molina, Erlang, and others revived it. With a little calculation, we can show that the parameter \(\lambda\) is both the mean (the expected number of impacts per unit time) AND the variance of a Poisson distribution.

The Normal distribution

“Everybody believes in the exponential law of errors [i.e., the Normal distribution]: the experimenters, because they think it can be proved by mathematics; and the mathematicians, because they believe it has been established by observation.” - Whittaker, E. T. and Robinson

The Normal distribution, also known widely as the Gaussian distribution, is probably the most widely known distribution because of its wide applications in statistical analysis. Like the Poisson distribution, it was first discovered by Abraham de Moivre. However, it did not rise to prominence until Carl Gauss showed the method of linear least squares was equivalent to assuming measurement errors were “normally” distributed. Gauss originally used it in accounting for errors in astronomical measurements. Its importance was further cemented by Ronald Fisher’s development of “analysis of variance”, a.k.a. ANOVA, methods at the beginning of the 20th century.

The normal distribution is described by the formula \[\frac{1}{\sigma\sqrt{2\pi} } \; e^{ -\frac{(x-\mu)^2}{2\sigma^2} }.\] The cumulant generating function for a normally distributed random variable

\[\tilde{x}(t) = \mu \theta + \frac{1}{2} \sigma^2 \theta^2.\]

Like the Poisson distribution, the Gauss distribution is also additive - the sum of to Gauss variables is also a Gauss variable

\[\frac{1}{N} \sum_{i=1}^{N} X_i \sim \frac{1}{\sigma\sqrt{2\pi} } \; e^{ -\frac{(x-\mu)^2}{2\sigma^2} }\] where \(\mu\) is the expected value and \(\sigma\) is the standard deviation.

In fact, more general versions of this property also hold. If we take any realistic sequence of independent identically-distributed random variables \(X _ i\), then as the sequence grows, the distribution of the average of this sequence converges to the Normal distribution. This phenomena is celebrated by the name The Central limit theorem, and is the reason we can call the Normal distribution a stable distribution – no matter what distribution the original variables have, their average converges to be Normally distribited!

De Moivre-Laplace theorem

To illustrate this stability, let me show you how the Normal distribution can also be derived from the Binomial distribution under the assumption of a large sample size, but fixed positive probability of success for each attempt. Note that this differs from our derivation of the Poisson distribution, where we assumed the chance of success with each flip decreased linearly as sample size increased. This result was first discovered by De Moivre, but not widely appreciated until widely read in Laplace’s work.

Suppose we begin with the binomial distribution \[\begin{gather} \frac{n!}{r!(n-r)!} p^r (1-p)^{n-r} \end{gather}\] as \(n\) gets very large but when \(p\) is fixed between \(0\) and \(1\). From Sterling’s formula, we know

\[\begin{align} n! &\sim \sqrt{2 \pi n } \left( \frac{n}{e} \right)^n \\ r! &\sim \sqrt{2 \pi r } \left( \frac{r}{e} \right)^r \\ (n-r)! &\sim \sqrt{2 \pi (n-r) } \left( \frac{(n-r)}{e} \right)^{(n-r)} \end{align}\]


\[\begin{align} \frac{n!}{r!(n-r)!} &\sim \frac{1}{\sqrt{2 \pi n (r/n) ( 1 - r/n)}} \frac{n^n}{r^r (n-r)^{n-r}} \\ &\sim \frac{1}{\sqrt{2 \pi n (r/n) ( 1 - r/n)}} \frac{n^n (n-r)^r}{r^r (n-r)^n} \\ &\sim \frac{1}{\sqrt{2 \pi n (r/n) ( 1 - r/n)}} (r/n)^{-r} (1-r/n)^{-(n-r)} \end{align}\]

Substituting back into the binomial and rearranging, \[\begin{gather} \frac{1}{\sqrt{2 \pi n (r/n) ( 1 - r/n)}} (r/n)^{-r} (1-r/n)^{-(n-r)} p^r (1-p)^{n-r} \\ \frac{1}{\sqrt{2 \pi n (r/n) ( 1 - r/n)}} \left( \frac{np}{r} \right)^{r} \left( \frac{1-p}{1-r/n} \right)^{(n-r)} \end{gather}\]

In this case the extreme values of \(r\) (all heads or all tails) will be very rare, while the most likely values of \(r\) will be those where heads comes up about as frequently as predicted by \(p\), \(r \approx n p\). Let’s make the substitution \(r = n p + z\), and now treat \(z\) as the variable of the distribution. \[\begin{gather} \frac{1}{\sqrt{2 \pi n (p + z/n) ( 1 - p - z/n)}} \left( \frac{np}{np+z} \right)^{np+z} \left( \frac{1-p}{1-p -z/n} \right)^{(n-np-z)} \\ \frac{1}{\sqrt{2 \pi n (p + z/n) ( 1 - p - z/n)}} e^{ - (np+z)\ln \left( 1+z/np \right) - (n-np-z) \ln \left( 1 -z/n/(1-p) \right) } \end{gather}\] Using the McLauren series \[\begin{gather} \ln(1 + x ) = x - \frac{1}{2} x^2 + \frac{1}{3} x^3 + \ldots \end{gather}\] we can expand the exponent in \(z\) to get \[\begin{gather} \frac{1}{\sqrt{2 \pi n (p + z/n) ( 1 - p - z/n)}} e^{ - \frac{z^2}{2np(1-p)} + O\left(\frac{z^3}{n^2}\right) } \end{gather}\] Discarding lower order terms, we obtain the Normal distribution \[\begin{gather} \frac{1}{\sqrt{2 \pi n p ( 1 - p )}} e^{ - \frac{z^2}{2np(1-p)} } \end{gather}\] Thus, the Binomial distribution is closely approximated by the Normal distribution when the number of samples is large and the probability of successs \(p\) is not too close to \(0\) or \(1\). The same result can be obtained as an approximation to the hypergeometric distribution as well because the act of replacement has very little effect on the full probability when \(n\) is very large.

Extreme-value distributions

markings of famous flood heights in Ellicott City, Maryland

The extreme-value distribution, actually takes one of 3 different forms, depending on the nature of the random events involved.

Fisher–Tippett derivation

Suppose \(F(x)\) is the cumulative probability distribution of independent identically distributed random variables \(X _ i\) over the interval \([0,1]\), so \(P(x < X _ i) = F(x)\), and that for \(x \approx 0\), \(F(x) \approx a x^k\). Then \[\begin{align*} P( x < min_{i=1 .. N} X_i ) &= 1 - ( 1 - F(x) )^N \\ &= 1 - e^{N \ln ( 1 - F(x) )} \\ &\approx 1 - e^{N \ln ( 1 - a x^k )} \\ &\approx 1 - e^{- N a x^k } \end{align*}\] This is a CDF of a Weibull distribution. To see how our approximation does, we can perform some simulations and compare the observed distribution to our Weibull distribution predictions. For minimum of a modest sequence of uniformly distributed samples, the approximation is pretty good, and the approximation becomes more accurate as the sequences become longer.

Comparison of simulation to Weibull approximation
Comparison of simulation to Weibull approximation
[Show code]
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
Created on Sat Feb 24 11:21:44 2018

@author: treluga

from pylab import *

runs = 500
N = 10
a = array([ min(rand(N)) for i in range(runs)])
y = 1 - exp(-N*a)
b = linspace(1/runs,1-1/runs,runs)
plot(a,b,'bx',label='Empirical simulation')
xlabel('Minumum value',fontsize=18)
ylabel('Cumulative probability',fontsize=18)
title('Min of %d calls to rand()'%N,fontsize=18)
title('Probability plot',fontsize=18)

If instead \(F(x)\) does not have an absolute minimum, but rather has an exponential tail, \(F(x) \approx e^{\alpha + \beta x}\), \[\begin{align*} P( x < min_{i=1 .. N} X_i ) &= 1 - e^{N \ln ( 1 - F(x) )} \\ &\approx 1 - e^{- N e^{\alpha + \beta x} } \\ & = 1 - e^{- k e^{\beta x} } \end{align*}\] This is a version of the Gumbel distribution. If the cumulative distribution has fatter tails than exponential, the minimum behavies according to a Frechet distribution (though this rarely appears in practice).

Comparison of simulation to Gumbel approximation
Comparison of simulation to Gumbel approximation
[Show code]
#!/usr/bin/env python3
Created on Sat Feb 24 11:21:44 2018

@author: treluga

from pylab import *
from scipy.linalg import lstsq

runs = 100
N = 40
x = array([ min(randn(N)) for i in range(runs)])
y = linspace(1/runs,1-1/runs,runs)

b = log(-log(1-y))
A = vstack([x**0, x**1]).T

(p_opt, residues, a_rank, svd_vals) = linalg.lstsq(A, b)

yopt = 1-exp(-exp(p_opt[0] + x*p_opt[1]))

plot(x,y,'b.',label='Empirical simulation')
legend(framealpha=0.,loc='upper left')
xlabel('Minumum value',fontsize=18)
ylabel('Cumulative probability',fontsize=18)
title('Min of %d calls to randn()'%N,fontsize=18)
xlabel('Minumum value',fontsize=18)
ylabel('Nonlinear data transform',fontsize=18)
title('Probability plot',fontsize=18)

\[\frac{[ Min _ i( X_i ) - X_N ]}{N} \sim \frac{d}{dx} \left( 1 - \exp \left\{-\left( 1 + s \frac{x - \mu}{\sigma}\right)^{1/s}\right\} \right)\]

Pareto distributions

A final class of distributions are Pareto distributions, also known as power-law distributions, which have formulas

\[\begin{gather} P _ x = \frac{a \eta^a}{x^{a+1}} \end{gather}\]


Scholars have constructed many different distributions over the years. Here, I have given examples of just a few of the most common of these. Some are constructed to model specific processes like the Hypergeometric and Binomial.

Physicists studying interacting particle systems often talk about “universality classes”. Universality classes are asymptotic limits of the behavior of certain interacting particle systems, and arose from the study of phase transitions of materials in statistical mechanics.

Brownian motion observed by Ingenhousz (1785) in solutions of coal-dust in alcohal and Brown (1827)

Phase transitions between solid and gas.

Transition between magnetized and not magnetized.

Surface of a growing crystal.

Shape of a forefire front.


  1. Let’s use python to test how good Stirling’s formula is.
    1. For \(n\) from 1 to 12, plot \(n!\) with circles and Stirling’s approximation \(\sqrt{2 \pi n} \left(\frac{n}{e}\right)^n\) with x’s using arithmetic axes. What does this plot tell you about the accuracy of the approximation?

    2. For \(n\) from 1 to 12, plot \[1-\frac{\sqrt{2 \pi n} \left(\frac{n}{e}\right)^n}{n!}.\] Based on this plot, what can we hypothesize about the relative errors of Stirling’s formula for \(n>12\)?

  2. Use Stirling’s formula to find an asymptotic approximation for \(\dfrac{n!}{i! (n-i)!}\) when both \(n\) and \(i\) are large.

  3. Show that when the number of stones in an urn is taken to be large, but of fixed initial proportion, then the hypergeometric distribution for drawing a small number of stones from the jar is approximately a binomial distribution, and hence, approximately a Poisson distribution.

  4. Track down and recover one of the data sets presented or cited by Gumbel in his classic book statistics of extremes for its original source.

  5. Show that the geometric distribution formula \(P(k;p)\) is the solution of the following linear difference equation recurrence formula and provide an intuitive explanation of why the formula works. \[\begin{gather} P(k; p) = p P(k-1;p) \end{gather}\]

  6. Show that the binomial distribution formula \(P(k;n,p)\) is the solution of the following linear difference equation recurrence formula and provide an intuitive explanation of why the formula works. \[\begin{gather} P(k; n,p) = (1-p) P(k;n-1,p) + p P(k-1; n-1,p) \end{gather}\]

  7. Show that the hypergeometric distribution formula \(P(k;n,m,j)\) is the solution of the following linear difference equation recurrence formula and provide an intuitive explanation of why the formula works. \[\begin{gather} P(k; n,m,j) = \left(1 - \frac{j - k}{n - (m - 1)}\right) P(k; n,m - 1,j) + \frac{(j - (k - 1))}{n - (m - 1)} P(k-1; n,m - 1,j) \end{gather}\]

( previous, home, next )