Markov chains


(Chutes and Ladders game board)


While probability distributions can be very useful models, they don't help us describe or analyze how a system changes over time. For that, we need to create new tools. One of the most useful is the Markov chain. A Markov chain is a special kind of stochastic process that can describe systems changing randomly over time. A Markov chain consists of a set of "states", together with a rule for each state that gives the chances or rates of transitions from that state to others.

The childhood board game Chutes and Ladders can be described as a Markov chain. The game is played on a \(10 \times 10\) board, with the squares numbered 1 to 100. A player has a piece to mark their current spot on the board. In each turn, you roll the dice, and depending on your roll and your current spot, you count forward to see where you land next, including any chutes or ladders you land on at the end of your turn. Your 'state' in the game is your piece's current spot. The allowed transitions are all the possible spots you might land on next from your current spot. And the transition probabilities are the probabilities of each needed dice roll.
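The construction of states, transitions, and probabilities can be sketched in code. The miniature board below is invented for illustration (squares 1 through 9, one ladder, one chute, and a two-value spinner); the real game uses a \(10\times 10\) board and a spinner with six values, but the construction is the same.

```python
import numpy as np

# A hypothetical miniature board: squares 1..9, square 9 wins,
# one ladder (2 -> 7), one chute (8 -> 3), and a spinner that
# shows 1 or 2 with equal probability.
jumps = {2: 7, 8: 3}
N = 9

# Column-stochastic transition matrix: A[next_spot, current_spot],
# with spot 0 meaning "not yet on the board".
A = np.zeros((N + 1, N + 1))
for current in range(N + 1):
    if current == N:                       # the winning square is absorbing
        A[N, N] = 1.0
        continue
    for roll in (1, 2):
        landing = current + roll
        if landing > N:                    # overshoot: stay put
            A[current, current] += 0.5
        else:                              # slide along any chute or ladder
            A[jumps.get(landing, landing), current] += 0.5

assert np.allclose(A.sum(axis=0), 1.0)     # every column sums to 1
```

For example, from square 1 a roll of 1 lands on square 2 and climbs the ladder to 7, so the entry `A[7, 1]` comes out to 1/2.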

The key property of a Markov chain is that once you know the current state, extra information about the past history of the system does not provide any additional information for predicting future states -- all useful information is captured in the current state. Any stochastic process with this property is called "Markovian". Some stochastic processes are not Markovian. For example, a marble bouncing between different numbers on a roulette wheel does not form a Markov chain, because the chances of different outcomes depend not just on the marble's current number, but also on whether this is the first number the marble hits (and so it is moving fast), or a later number (and so it is getting close to settling).

After this first pass, the dependence only on the current state seems like a serious restriction, and Markov chains seem like they might be useful only occasionally -- lots of things that we might like to model have history dependence in some form. The roulette marble's speed depends on how many times it has already bounced, and in baseball, the outcome of a pitch depends not just on how it is thrown but also on how many balls and strikes the batter has accumulated in the past. BUT, as we will see, this is not as big a hurdle as it first appears.

Weather in the land of Oz

In Oz, the weather switches among three states from day to day: Sunny, Rainy, or Snowy.
The weather department observed the following daily history in Oz over 10 days (numbered 0 through 9).

0 Sunny
1 Sunny
2 Sunny
3 Sunny
4 Rainy
5 Snowy
6 Snowy
7 Rainy
8 Snowy
9 Sunny

A longer history over 1600 days, compressed using (s) for sunny, (w) for snowy, and (r) for rainy, appears in the code below.


Matrix analysis and Digraph representation

The first step in analyzing data we want to model as a Markov chain is to count the transitions to see what occurs and how often.

[Show code]
data = "sssrwwrwssssrwwwrwwrssswwrssssswwrswrwrssswswrwwrsrwrswsrswssswwssrwwrsssssrwrsssrwwrswssssrwwwwsrwwsswssrssrsssswrsrwwwrsrsrwrsrwsrsssssrwwssrwsswsswrwswswwwswwwrssssrssswssssswwwwwwwrssswwwwssssswrwwwwwwwswswrwwswrwwrsrwrswsswrsrwwwwwwwrswwwrswswwrssssswrwswssssrssssrsrwrsssssrssrwrwwrwwwwwwsrwssrwwwwrsssrsswsssrsswwwswswrssssswwrssrssrwwrswwwwwwwssssrswrsssrwwsssrwwrsrwsrwswwwsswwwwwwwrssswrwsrsrswrswsswwssssssrwswrwswwwwwrswsssrwwwwwwwwsrwrwsrwrswswwwwrwwwwwrswrswsssssrwwrwsrsssssrssrsrswwwrwsswwwrsssssswwrwsrwwwwrwwwrssrwrwwssrssswswrwwwsswrsrsssrwrswrssrsrswssrwwrswrwrwwswsrwssssrsssswwsssrswwrsswrwrssrswrwwwrwssrssrsrsswwwrwwwsrwwwsswswwwwsswsssrsrsrswwwwwwrssssrwrwsswsswsrwrssssrsrwwswrsrssrssssswswrsswrsrssrwwwwrswwrswrswwwrwrssrwwwsrsswwssrswwwwwwwwrwwwsswswsrwrssswrwwsrwwsrwswrsrwrswwrswsrsrsswwwwswwssrssssrwsrsswssswssssrwssrsswwssrsssswwwrswwsrwwsrsrsswwrwrssswwsrssrssrsrwwsrwwsrwwwsrsrsrwrswwrwwsswrsrssssrsrwwwssswwwwwsrwwwssrsswwwrssssrwwrwwrwrsswwwwwwwssrwwwwwrswrssswsrwwrsswwrwwwwwrwrwswwrsrwwrwrsrsrssrwsrsrsswwswwsssssssrwwwwssswwwrsrwsswsswrssrssrwrsssrssssrwsrwsrswrswsrwsrssssswrwswssrswssswrswwsswwrssrsssswwrwsssssrsssrswsrsrwwrwwwwrwsrssrwwrwrwsssrwswsssrwssrwwrwrwswwwrwrwwwwwwwssswwsrsswrsrsswsrssrwrwwwwsssrwwwrwwwrssssswrssswrssrwwsrsrswrssswsrswwsrwrwwwrsswwrssssrwwwwwrwwwwrwwwwwwrsrsssrwrwswswrwwwwrsrsrwrwrsswsrwwsssssswwwswwwwwwwrsswwwwrswwwrssrwswrwsrwwwsrwwwwsrssrwswswwsrwswwwwsswrsssswwwswrsswsswsssrwwsrwwwrwssrssssrsssssrsssrwrswrsrwwsswrwrwwwwswwrwwssswsrsswsssrwwwwwwwwwwrwwsrssrwwsrwrssssswrssssswwswwwrssssswsswwwwsssswwssswwrswwssssswwwrsrs"

states = ['s', 'r', 'w']

pairs = list(zip(data[:-1], data[1:]))

counts = {}
for i in states:
    for j in states:
        counts[(i, j)] = pairs.count((i, j))
        print(i, j, counts[(i, j)])

from pylab import array, diag, ones

# C[i][j] counts transitions from state j (today) to state i (tomorrow),
# so, like the transition matrix A, the columns of C index today's state.
C = array([[counts[(j, i)] for j in states] for i in states])
# S holds the reciprocals of the column totals of C on its diagonal,
# so that A = C S has columns that each sum to 1.
S = diag(ones(3).dot(C)**-1.0)
A = C.dot(S)
print(C)
print(S)
print(A)
(Table of transition-pair counts produced by the code above: columns are today's state -- Sunny, Rainy, or Snowy -- and rows are tomorrow's state.)
One of the things we see in inspecting this is that it never rains two days in a row, although it may snow or be sunny for many days in a row.

At this point, our counts are equally valid reading the weather forward or backward. But since we like to think of time as flowing forward, it's useful to renormalize the data by dividing each column by its total, to get approximately the chance of each transition given the current state. Based on the Normal approximation of the binomial distribution, we cannot expect an accuracy of better than \(1.5/\sqrt{600} \approx 1/16\). If we round to the nearest sixteenth (making sure columns sum to 1), the transition probabilities come out as follows.

Prob( x -> y )    Sunny today   Rainy today   Snowy today

Sunny tomorrow        1/2           1/2           1/4

Rainy tomorrow        1/4            0            1/4

Snowy tomorrow        1/4           1/2           1/2
Notice that the columns now sum to 1. In every proper Markov chain, the columns of the transition matrix always sum to 1. We can also represent the transitions between states with a labelled directed graph.
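As a quick check, the table can be transcribed directly and both claims verified -- the columns sum to 1 exactly, and the one missing edge of the digraph is rainy-to-rainy:

```python
from fractions import Fraction as F

# The Oz transition matrix read off the table above.  Column j holds
# the probabilities of tomorrow's weather given today's state j, with
# states ordered (sunny, rainy, snowy).
A = [[F(1, 2), F(1, 2), F(1, 4)],
     [F(1, 4), F(0),    F(1, 4)],
     [F(1, 4), F(1, 2), F(1, 2)]]

# Every column of a proper transition matrix sums to exactly 1.
assert all(sum(row[j] for row in A) == 1 for j in range(3))

# The edges of the labelled digraph are the nonzero entries; the one
# missing edge, rainy -> rainy, is the "never rains twice" observation.
edges = [(i, j) for i in range(3) for j in range(3) if A[i][j] != 0]
assert (1, 1) not in edges and len(edges) == 8
```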

We can simulate this chain by hand with an 8-sided die, or on a computer with a very simple Python program.

[Show code]
#!/usr/bin/env python3

from random import randrange

s = 'S'
r = 'R'
w = 'W'

# From each state, the next state is one of four equally likely options,
# matching the transition probabilities (in quarters) in the table above.
transitions = {s: (s,s,r,w), r: (s,s,w,w), w: (s,r,w,w)}
states = list(transitions.keys())

def rndStep(state):
    return transitions[state][randrange(4)]

def rndPath(t, x, tmax):
    # x is the initial state
    # t is the initial time
    # tmax is the time to stop at
    while t < tmax:
        yield t, x
        t, x = t + 1, rndStep(x)

def main():
    n = 80*20
    path = [x for t, x in rndPath(0, s, n)]
    print(''.join(path))

if __name__ == '__main__':
    main()

This method is fast but not very general. For simulating a Markov chain described by an arbitrary transition matrix, better methods like the alias method exist.
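A more general approach (though slower than the alias method) is inverse-CDF sampling: accumulate each column of the transition matrix into a cumulative distribution, draw a uniform random number, and find where it falls. A minimal sketch for the Oz matrix:

```python
import bisect
import itertools
import random

def make_sampler(column):
    """Sample a next-state index from one column of a transition
    matrix by inverting its cumulative distribution."""
    cdf = list(itertools.accumulate(column))
    return lambda: bisect.bisect(cdf, random.random())

# Columns of the Oz matrix, one per current state (sunny, rainy, snowy).
columns = [[0.5, 0.25, 0.25], [0.5, 0.0, 0.5], [0.25, 0.25, 0.5]]
samplers = [make_sampler(c) for c in columns]

def rnd_step(state):
    return samplers[state]()

# Quick check: from rainy (state 1), the chain can never stay rainy.
assert all(rnd_step(1) != 1 for _ in range(1000))
```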

Path integrals

If it is sunny today, what is the chance that it will snow in 3 days? (We don't care about the weather on the 2 days in between.)

Well, there are several paths the weather might take between now and then. If we can list all the paths, calculate the probability of each, and then sum over all of them, that will give us the probability of snow in 3 days.

Paths                    Probability

s -> s -> s -> w         (1/2)(1/2)(1/4) = 4/64
s -> s -> r -> w         (1/2)(1/4)(1/2) = 4/64
s -> s -> w -> w         (1/2)(1/4)(1/2) = 4/64
s -> r -> s -> w         (1/4)(1/2)(1/4) = 2/64
s -> r -> r -> w         (1/4)( 0 )(1/2) = 0
s -> r -> w -> w         (1/4)(1/2)(1/2) = 4/64
s -> w -> s -> w         (1/4)(1/4)(1/4) = 1/64
s -> w -> r -> w         (1/4)(1/4)(1/2) = 2/64
s -> w -> w -> w         (1/4)(1/2)(1/2) = 4/64
So the probability of snow in 3 days is \((4+4+4+2+0+4+1+2+4)/64 = 25/64\), a little better than 1 in 3. This kind of calculation is a special case of what's popularly called a "path integral". Path integrals are a generalization of calculus introduced by the famous physicist Richard Feynman to help with calculations in quantum mechanics.

Doing a path integral this way is rather tedious -- there are many paths to check and it turns out there's a faster way. We just have to raise the matrix \(A\) to the appropriate power. If \(\vec{p}(0)\) is the vector of probabilities of starting in each state at step 0, and \(\vec{p}(t)\) is the probabilities of ending in each state in step \(t\),

\[\vec{p}(t) = A^t \vec{p}(0)\]

By raising the matrix \(A\) to the third power, we can calculate the probabilities of ending in each state, depending on the starting state.

\[A^3 = \left[\begin{matrix}\frac{13}{32} & \frac{13}{32} & \frac{25}{64}\\\frac{13}{64} & \frac{3}{16} & \frac{13}{64}\\\frac{25}{64} & \frac{13}{32} & \frac{13}{32}\end{matrix}\right]\]

It's sunny today, so \(p(0)=[1,0,0]^T\) and so

\[p(3)=A^3 p(0)=\left[\begin{matrix}\frac{13}{32} & \frac{13}{32} & \frac{25}{64}\\\frac{13}{64} & \frac{3}{16} & \frac{13}{64}\\\frac{25}{64} & \frac{13}{32} & \frac{13}{32}\end{matrix}\right]\left[\begin{matrix}1\\0\\0\end{matrix}\right]=\left[\begin{matrix}\frac{13}{32}\\\frac{13}{64}\\\frac{25}{64}\end{matrix}\right].\]

Therefore, three days from now it will be sunny with probability 13/32, rainy with probability 13/64, and snowy with probability 25/64.
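The matrix-power shortcut is easy to check numerically; the result below matches the hand calculation above.

```python
import numpy as np

# The Oz transition matrix; columns index today's state
# in the order (sunny, rainy, snowy).
A = np.array([[1/2, 1/2, 1/4],
              [1/4, 0,   1/4],
              [1/4, 1/2, 1/2]])

p0 = np.array([1.0, 0.0, 0.0])            # sunny today
p3 = np.linalg.matrix_power(A, 3) @ p0    # distribution in 3 days

# Agrees with the path-by-path sum: [13/32, 13/64, 25/64].
assert np.allclose(p3, [13/32, 13/64, 25/64])
```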

Steady-state calculations

Some Markov chains, when run long enough into the future, forget their initial state and approach a stationary distribution. These kinds of Markov chains are called ergodic. The simplest test for ergodicity is to check whether the chain can ever get trapped in a subset of state-space. If you can always move from each state to every other state by some path (and the chain is aperiodic), then the chain is ergodic.

For an ergodic Markov chain, the steady-state distribution \(p(\infty)\) must solve the equation

\[p(\infty) = A p(\infty).\]

This also happens to be the eigenvector equation for eigenvalue \(\lambda = 1\). Since the columns of \(A\) sum to \(1\), there must be an eigenvalue equal to \(1\).

This chain has simple eigenvalues, and hence a unique steady state: the characteristic polynomial \(z^{3} - z^{2} -{z}/{16} + {1}/{16} = 0\) factors as \[\left(z - 1\right) \left(4 z - 1\right) \left(4 z + 1\right) = 0,\] giving eigenvalues \(1\), \(1/4\), and \(-1/4\).

Calculate the equilibrium distribution using reduced row-echelon form. (\(I-A\) will be singular! So we replace one of the redundant equations with the normalization condition \(p_1 + p_2 + p_3 = 1\).)

\[p(\infty) = \frac{1}{5} \begin{bmatrix}2\\1\\2\end{bmatrix}\]

So, after forgetting the initial condition, we expect it to be sunny \(2/5\)ths of the time, snowy \(2/5\)ths of the time, and rainy only \(1/5\)th of the time.
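The same steady state can be found numerically by extracting the eigenvector for eigenvalue 1 and rescaling it into a probability distribution (a numeric alternative to the row reduction above):

```python
import numpy as np

A = np.array([[1/2, 1/2, 1/4],
              [1/4, 0,   1/4],
              [1/4, 1/2, 1/2]])

# Pick out the eigenvector whose eigenvalue is (numerically) 1,
# then normalize it so its entries sum to 1.
vals, vecs = np.linalg.eig(A)
k = np.argmin(abs(vals - 1))
p_inf = np.real(vecs[:, k])
p_inf = p_inf / p_inf.sum()

assert np.allclose(p_inf, [2/5, 1/5, 2/5])
```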

(There is actually a bit more work here that I'm glossing over. Specifically, how can I be sure that \(\lim _ {n\rightarrow\infty} A^n p(0)\) doesn't blow up? It's because the other eigenvalues of \(A\) must have magnitude less than 1, which we establish via the Perron-Frobenius theorem: the eigenvalue 1 is the "Perron root.")

Continuous-time Markov chains

Let \(p_k(t)\) be the probability that a system is in state \(k\) at time \(t\). Then, if \(a_{kj}(\Delta t)\) is the probability of a system changing from state \(j\) to state \(k\) in an interval of time \(\Delta t\), then \[\begin{gather*} p_k(t+\Delta t) = \sum_{j} a_{kj}(\Delta t) p_j(t). \end{gather*}\] This can also be written as a matrix equation \[\begin{gather*} p(t+\Delta t) = A(\Delta t) p(t), \end{gather*}\] where \(A(\Delta t) = [ a_{kj}(\Delta t) ]\). The individual transition probabilities \(a _ {ji}(\Delta t)\) from state \(i\) to state \(j\) of \(A(\Delta t)\) depend on the time step. Since this is a Markov chain, the transition probabilities out of state \(i\) must sum to one: \[1 = \sum _ j a _ {ji}(\Delta t).\] Now, suppose we start shrinking our time-step. How do we expect the transition probabilities to change? Well, intuition suggests that as the time-step becomes smaller, it is less likely that the chain will change state, so the probability of staying in the same state will increase, while the probabilities of moving to a different state will decrease: \[\begin{gather*} \lim_{\Delta t \rightarrow 0} a_{ij}(\Delta t) = \begin{cases} 1 & \text{if $i=j$,} \\ 0 & \text{if $i \neq j$.} \end{cases} \end{gather*}\] Let's suppose that for small time-steps, the off-diagonal entries are approximately linear in the time-step size, so \(a _ {ji}(\Delta t) \approx m _ {ji} \Delta t\) for \(j \neq i\). The coefficients \(m _ {ji}\) are called the "transition rates" or "hazard rates". Since the columns must sum to unity, we must also have \(a _ {ii}(\Delta t) = 1 - \Delta t \sum _ {j \neq i} m _ {ji}\). Rewriting these relationships in matrix form, we have for small \(\Delta t\), \[\begin{gather*} A(\Delta t) = I + \Delta t M \end{gather*}\] where the diagonal entries of \(M\) are non-positive, the off-diagonal entries of \(M\) are non-negative, and the columns of \(M\) sum to zero.
(Matrices of this kind fall into a special class called Metzler matrices with a number of useful properties.) We call \(M\) the transition-rate matrix of the chain. If we now re-arrange \[\begin{align*} p(t + \Delta t) &= A(\Delta t) p(t), \\ p(t + \Delta t) &= (I + \Delta t M ) p(t), \\ p(t + \Delta t) &= p(t) + \Delta t M p(t), \\ \frac{p(t + \Delta t) - p(t)}{\Delta t} &= M p(t), \end{align*}\] and take the limit as \(\Delta t \rightarrow 0\), we recover a first-order linear differential equation system \[\begin{gather*} \frac{dp}{dt} = M p. \end{gather*}\]

This equation is called the master equation of a continuous-time Markov chain.
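The derivation can be illustrated numerically by integrating the master equation with small Euler steps, which is exactly the \(p \mapsto (I + \Delta t\, M)p\) update above. The two-state chain here is invented for the example (think of a machine that is "working" or "broken", with made-up failure rate a and repair rate b):

```python
import numpy as np

# Hypothetical rates for a two-state chain: working -> broken at
# rate a, broken -> working at rate b.
a, b = 1.0, 2.0
M = np.array([[-a,  b],
              [ a, -b]])
assert np.allclose(M.sum(axis=0), 0.0)   # columns of M sum to zero

# Integrate dp/dt = M p with small Euler steps p <- p + dt * M p,
# mirroring the A(dt) = I + dt M approximation.
dt = 0.001
p = np.array([1.0, 0.0])                 # start in the working state
for _ in range(20000):                   # integrate out to t = 20
    p = p + dt * (M @ p)

# The chain forgets its start and approaches (b, a)/(a+b) = (2/3, 1/3).
assert np.allclose(p, [2/3, 1/3], atol=1e-3)
```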

Continuous-time Markov chains have one important advantage over discrete-time Markov chains -- no two independent events ever happen at exactly the same time. In discrete time, for example, we have to consider the possibilities that one thing happens in a time-step, another thing happens in that time-step, or both happen in that time-step. But in continuous time, we'll never see both events happen at the exact same time -- one will always happen before the other. The practical consequence of this is that the transition-rate matrix \(M\) is usually simpler than the transition-probability matrix \(A(\Delta t)\).

The big postulate we've used in this derivation is that the transition probabilities are approximately linear functions of the step size. There are alternatives to the linearity postulate which are very interesting from a research standpoint. But practical experience suggests that it is sufficient for many common applications of countable-state Markov chains.

The art of state-spaces

Modelling with Markov chains has two parts. (1) specifying a set of states, called a "state-space", for the process of interest, and (2) estimating transition probabilities among the states. Determining the transition probabilities, though not always easy, is a statistical inference problem which we will leave to other studies. The art of Markov chain modelling comes in choosing a good state-space.

The state is the part of the system that can change from time to time and that we are interested in tracking. The state-space is a set that contains all the possible states of the system. In a Markov chain, since the transitions can only depend on the current state, not on any of the past states, we have to make sure the state contains all of the information we need to make our predictions about the future.

Whenever we use language to talk about nature, we're forced to "tokenize" our observations -- to assign one of a relatively small number of discrete states of being or coordinates to represent a moment. For example, in our modeling of the weather in Oz, our data described every day's weather with one of three states: rain, snow, or sun. Nature is seldom this tidy. Oz's meteorologist made no comment about days that started rainy and ended up sunny, or days that were cloudy but neither rainy nor snowy. As a general rule, we'd like to use as few states as possible while still capturing all the important features of the natural process, but the more precise an answer you want, the more states you'll need to use.

Example 1: A one-window ice-cream parlor serves customers 1 at a time. To model this simple service process, we need a state-space that includes at least the current length of the queue and the current order being filled. If we associate each order with an integer, then the state can be modelled by two natural numbers, both of which can be given upper bounds in practice.

Example 2: The shuffling of a deck of cards is an important process in gambling. In a standard deck, there are 52 cards, each of which can occupy one of 52 positions. So a naive representation of the state of a deck of cards might use 52 natural numbers, one for each position in the deck, with each number recording which of the 52 cards is in that position. Thus, we would allow for \(52^{52} \approx 1.7 \times 10^{89}\) states. But not all of these states are realistic -- no card can appear more than once in the deck, for example. There are only \(52! \approx 8\times 10^{67}\) actual permutations of a deck of cards. It would be nice if we could find a way to represent just the realistic states as simply as the 52 integers we've already suggested (see Diaconis's work for further discussion). In practice, it is quite common to use representations that include unrealistic states, and we just have to be careful to avoid those states when constructing our transitions. Interestingly, you could also represent the deck's state the opposite way -- one number for each card, with each number recording the position of that card in the deck. Both styles work.
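The two deck representations just described are inverse permutations of each other: the position-to-card map and the card-to-position map encode exactly the same state, as a few lines of code can verify.

```python
import random

# position_to_card[p] = c  exactly when  card_to_position[c] = p.
deck = list(range(52))
random.shuffle(deck)                 # the position -> card representation

card_to_position = [0] * 52          # build the card -> position representation
for position, card in enumerate(deck):
    card_to_position[card] = position

# Composing the two maps recovers the identity, in either order.
assert all(card_to_position[deck[p]] == p for p in range(52))
assert all(deck[card_to_position[c]] == c for c in range(52))
```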

Markov chains make a major assumption: predictions of future events are based only on the current state of the chain. Knowing more about past events in the chain never improves our predictions of the future once we know the current state. In the examples above, there are natural relationships that describe how the current state allows us to predict the next state without needing to know more about the past history. But sometimes we need to include more in the "state" to nullify extra history information. For example, in our ice-cream parlor example, our state-space neglected the amount of time each customer had been waiting in line.

Example 3: When a skier is travelling down a ski slope, the state-space of the model requires knowing the location of the skier on the slope. However, position alone doesn't give full predictive power. If we knew the past location as well as the current location, we could use their difference to estimate the speed of the skier and make better predictions of where the skier will go next. But if we include both the current position and velocity in the state, then knowing the past won't improve the predictive power. A state representing position and velocity in three dimensions needs \(3 + 3 = 6\) dimensions. We'd also probably like to know the forces currently applied to the skier, but those are typically functions of the current location that can be accounted for in the transition rates. All information about past forces has already been integrated into the current location and velocity. A better model could also account for body orientation and limb position.

Example 4: To represent the state of a college basketball game, we need to track the positions and velocities of the players and the ball, and then perhaps also rules variables like which half we are in, the time in the half, the number of fouls on each player and the number of time outs each team still has. If player motion is tracked in 2 dimensions, while ball motion is tracked in 3 dimensions, then the state requires \(2 \times 5 \times (2 + 2) + (3 + 3) = 46\) dimensions plus rules variables.

Markov chain modelling can have a flavor similar to that of lateral-thinking brain-teasers. The smaller the state-space you can use, the easier calculations will be. But too small a state-space may not be able to represent all the important features of the process. So the choice of Markov chain state-space embodies the aphorism (attributed to both Albert Einstein and Herbert Spencer) "Everything should be as simple as it can be, but not simpler". The two puzzles that follow exhibit the opportunities and pitfalls of simple state-spaces.

Myst's Elevator problem

The first example will suggest some of the advantages of working with a simple state-space. In the 2001 video game "Myst III: Exile", one of the puzzles involved setting the lock combination below to 3-1-3 by pushing 2 different buttons.

(Combination lock display, initially reading 1, 1, 1.)

If we push the buttons at random, this becomes a Markov chain with \(3^3 = 27\) states. However, we can analyze the puzzle more efficiently by working in a smaller state-space. For each combination (x,y,z), let the state \(s = (x-y+z)\% 3\), where the percent sign represents the remainder operation. This state-space has only 3 states, \(s \in \{ 0, 1, 2\}\). Even more informatively, we can convince ourselves with some experimentation or analysis that no matter what order we push the buttons, the state never actually changes:

( 1 - 1 + 1 )%3 = 1

Since the initial state is 1 and the desired final state we want to reach is (3-1+3)%3 = 2, the puzzle is unsolvable, no matter how we try to push our buttons.
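The invariance is easy to check by brute force. The button behavior below is an assumption for illustration (the game's actual wiring may differ): suppose one button advances the first two wheels together and the other advances the last two, each modulo 3. Either way, \((x-y+z)\%3\) is untouched.

```python
import random

# Hypothetical button actions, invented for the example:
def push_a(x, y, z):
    return ((x + 1) % 3, (y + 1) % 3, z)   # (x+1)-(y+1)+z = x-y+z

def push_b(x, y, z):
    return (x, (y + 1) % 3, (z + 1) % 3)   # x-(y+1)+(z+1) = x-y+z

def invariant(x, y, z):
    return (x - y + z) % 3

state = (1, 1, 1)
for _ in range(10000):                     # mash buttons at random
    state = random.choice([push_a, push_b])(*state)
    assert invariant(*state) == 1          # never leaves the starting value

# The target 3-1-3 (i.e. 0-1-0 mod 3) sits on a different level set.
assert invariant(0, 1, 0) == 2
```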

The kind of formula we've used in this trick is sometimes called an "invariant" or a "conserved quantity". Energy is probably the most famous example of an invariant. Invariants can be powerful tools, and besides being used in the analysis of other games like the 15-puzzle, they have applications in knot theory, mechanics, chemistry, and crystallography. In this case, using the invariant simplifies the state-space to the point where the analysis is obvious.

The river-crossing puzzle

The second puzzle is a variant of the river-crossing problems posed more than 1,000 years ago (Alcuin of York, 8th century AD), and illustrates the dangers of over-simplification.

In colonial times, a woman was heading to market in Williamsport with a quail, a sack of magic beans, and a cat. At the Susquehanna river, there was a dory she could use to cross to the other bank. The dory was small and could only carry herself and one of her possessions at a time. However, if she leaves the quail and cat alone together, the cat will kill the quail, and if she leaves the quail alone with the sack, the quail will eat the beans. Can she ferry all her goods to market safely?

The problem is one of finding a good sequence for the woman to load and unload the dory on each crossing of the river. The full puzzle involves tracking the locations of 5 objects (woman, dory, cat, quail, and sack), so it is at least 5-dimensional. However, we can greatly simplify things by just tracking which side of the river things are on. Each possession can be in only 1 of the 2 locations after each trip -- 2 possibilities for the cat, 2 for the quail, and 2 for the sack, so a total of \(2^3 = 8\) possible states. We know we start with all of them located on the south side of the river, and finish when we get all of them to the north side of the river.

(Graph of our simplified state-space for river-crossing problem. Nodes are marked with who has crossed the river: (c)at, (q)uail, and (s)ack. Red nodes are ruled out from allowed paths by the puzzle constraints.)

We've discovered that only 4 of the possible 8 states are legal under the conditions of the problem, and that there is no sequence of moves allowed through these 4 states that start with all the goods on the south bank and end up with all the goods on the north bank. It appears that it was impossible for the woman to get to market with all her goods.

But we have fooled ourselves by over-simplifying our state-space. Let's take a step backward and make one more observation. At the beginning of the puzzle, the cat, quail, and sack were all on the same side of the river, but it was okay because the woman was there to keep them apart from each other -- the woman's location matters. However, our initial guess at the state-space did not include the woman's location -- perhaps her location will make a difference.

If we include the woman's location in addition to the cat, the quail, and the sack of beans, we are considering \(2^4 = 16\) possible states. In this enlarged state-space, the illegal states are a little different: the cat and quail can not be left alone together (meaning they CAN be together as long as the woman is there also); and the quail and sack can not be left alone together (similarly). Using our physical intuition for the puzzle, we can now fill in which transitions are allowed between which states.

(Graph of expanded state-space for river-crossing problem. Nodes are marked with who has crossed the river: (w)oman, (c)at, (q)uail, and (s)ack. Red nodes are ruled out.)

In this enlarged state-space, there is a sequence of allowed state changes that gets all the goods across the river safely. There are actually two solutions. One of them is this.

  1. Ferry the quail across. (reaching state "wq")
  2. Row back. (reaching state "q")
  3. Ferry the cat across. (reaching state "wcq")
  4. Ferry the quail back. (reaching state "c")
  5. Ferry the sack across. (reaching state "wcs")
  6. Row back. (reaching state "cs")
  7. Ferry the quail across. (reaching state "wqcs")

Our initial error was to over-simplify the state-space. Once we had done that, there was no solution, no matter how hard we looked. When we reconsidered our state-space and used a larger but still manageable version, our search discovered a solution. One can argue that our state-space is still oversimplified, because there are times when the woman and dory are in the middle of the river, where she cannot protect things on either bank; our 16-state space has implicitly assumed the river crossing is so fast as to be ignored. An algorithmist may point out that the final solution can be discovered with a depth-first search once a sufficiently precise state-space is formulated. The downside of larger state-spaces is that they take more time to search.
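The search just mentioned takes only a few lines. The sketch below uses a breadth-first search rather than the depth-first search named above, so that the path it finds is also a shortest one; a state is the set of who has reached the north bank, with the dory always accompanying the woman.

```python
from collections import deque

ALL = frozenset('wcqs')   # (w)oman, (c)at, (q)uail, (s)ack

def legal(state):
    # Neither bank may hold cat+quail or quail+sack without the woman.
    for bank in (state, ALL - state):
        if 'w' not in bank and ({'c', 'q'} <= bank or {'q', 's'} <= bank):
            return False
    return True

def moves(state):
    # The woman crosses alone, or with one item from her own bank.
    here = state if 'w' in state else ALL - state
    for cargo in [set()] + [{x} for x in here - {'w'}]:
        yield frozenset(state ^ ({'w'} | cargo))

start, goal = frozenset(), ALL
paths = {start: [start]}
queue = deque([start])
while queue:
    state = queue.popleft()
    for nxt in moves(state):
        if legal(nxt) and nxt not in paths:
            paths[nxt] = paths[state] + [nxt]
            queue.append(nxt)

assert goal in paths
assert len(paths[goal]) - 1 == 7   # seven crossings, as listed above
```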

Application to code-breaking

Consider the following patristocrat cipher.


We know immediately that this is not English -- for example, no English word contains "hqqh" or "whlt". Decrypting this message is hard if we aren't smart about it. Every letter has been switched with some other letter, meaning that there are \(26! \approx 4 \times 10^{26}\) different possibilities to sort through -- far too many, even for computers. But there are good ideas we can use to reduce the solution-space we need to explore -- Edgar Allan Poe lays many of them out nicely in his entertaining story The Gold-Bug. For example, different letters appear with different frequencies in the English language -- vowels appear in every word, consonants like j and z are used infrequently, the letter q is almost always followed by the letter u, and so on.

Frequency of use of letters in the English alphabet


One of the challenges in writing a computer program to test solutions is scoring how close the resulting text is to meaningful English -- while we ourselves are pretty good at telling when something is written in English and when it isn't, it is not immediately clear how to turn that knowledge into an algorithm. A Markov chain model can fix this -- while the letters in English do NOT actually obey a Markov chain, they can usefully be modelled as one.

If \(f(x)\) is a bijection that maps each letter to its true value, and letters appear according to a Markov chain, the log-likelihood of \(f\) is \[\log \mathcal{L}(f) = \sum _ {i=1} ^ {n-1} \log A(f(s_i),f(s_{i+1}))\] where \(A(i,j)\) is the probability that letter \(j\) will follow letter \(i\). We can estimate \(A(i,j)\) by examining a large corpus of works, or even just a single book.

Digram use of letters in the English alphabet


Here's a plain-text data file of approximate natural English digram frequencies.


# English plain text digram log frequencies
# Empirical log-likelihoods over seeing the first letter be
# followed by the second, as if the language were a 
# discrete time Markov chain.  Data were calculated by
# parsing the texts of a number of classic free online books.
# T. Reluga, 2019-04-05.
gw, -3.
gv, -6.
gu, -3.
gt, -2.
gs, -3.
gr, -2.
gq, -7.
gp, -4.
gz, -9.
gy, -5.
gg, -2.
gf, -4.
ge, -1.
gd, -4.
gc, -4.
gb, -4.
ga, -2.
go, -2.
gn, -3.
gm, -4.
gl, -3.
gk, -5.
gj, -7.
gi, -2.
gh, -1.
ty, -3.
tv, -6.
tw, -3.
tt, -2.
tu, -3.
tr, -3.
ts, -3.
tp, -5.
tq, -7.
tn, -5.
to, -2.
tl, -4.
tm, -4.
tj, -7.
tk, -6.
th, -1.
ti, -2.
tf, -4.
tg, -6.
td, -5.
te, -2.
tb, -4.
tc, -4.
ta, -2.
zl, -3.
zo, -3.
zh, -5.
zi, -2.
zj, -5.
zd, -5.
ze, -0.
zf, -5.
za, -2.
zb, -4.
zc, -5.
zy, -4.
zz, -2.
zt, -4.
zu, -4.
zv, -4.
zw, -4.
zs, -5.
va, -2.
vi, -1.
vo, -2.
me, -1.
md, -6.
mg, -6.
mf, -5.
ma, -1.
mc, -5.
mb, -3.
mm, -3.
ml, -5.
mo, -2.
mn, -4.
mi, -2.
mh, -4.
mk, -8.
mj, -7.
mu, -3.
mt, -3.
mw, -4.
mv, -7.
mq, -8.
mp, -2.
ms, -3.
mr, -3.
vt, -9.
my, -2.
vu, -5.
vw, -9.
fp, -4.
fq, -9.
fr, -2.
fs, -3.
ft, -1.
fu, -3.
fv, -6.
fw, -4.
fy, -5.
fz, -9.
fa, -2.
fb, -4.
fc, -4.
fd, -4.
fe, -2.
ff, -3.
fg, -5.
fh, -3.
fi, -2.
fj, -6.
fk, -6.
fl, -3.
fm, -3.
fn, -5.
fo, -1.
sz, -10.
sy, -5.
ss, -2.
sr, -4.
sq, -6.
sp, -3.
sw, -3.
sv, -5.
su, -3.
st, -1.
sk, -5.
sj, -6.
si, -2.
sh, -2.
so, -2.
sn, -4.
sm, -3.
sl, -4.
sc, -3.
sb, -4.
sa, -2.
sg, -5.
sf, -3.
se, -2.
sd, -4.
lf, -3.
lg, -5.
ld, -2.
le, -1.
lb, -4.
lc, -4.
la, -2.
ln, -5.
lo, -2.
ll, -2.
lm, -4.
lj, -8.
lk, -5.
lh, -4.
li, -2.
lv, -4.
lw, -4.
lt, -3.
lu, -4.
lr, -4.
ls, -3.
lp, -4.
lq, -8.
lz, -10.
ly, -2.
yi, -2.
yh, -3.
yk, -6.
yj, -6.
ym, -2.
yl, -3.
yo, -1.
yn, -4.
ya, -2.
yc, -3.
yb, -3.
ye, -2.
yd, -3.
yg, -4.
yf, -3.
yy, -5.
yz, -8.
yq, -6.
yp, -3.
ys, -2.
yr, -3.
yu, -5.
yt, -2.
yw, -2.
yv, -5.
em, -3.
el, -3.
eo, -3.
en, -2.
ei, -3.
eh, -3.
ek, -5.
ej, -6.
ee, -3.
ed, -2.
eg, -4.
ef, -3.
ea, -2.
ec, -3.
eb, -4.
ey, -4.
ex, -4.
ez, -7.
eu, -5.
et, -2.
ew, -3.
ev, -3.
eq, -5.
ep, -3.
es, -2.
er, -1.
rt, -2.
ru, -3.
rv, -4.
rw, -4.
rp, -4.
rq, -8.
rr, -3.
rs, -2.
rx, -7.
ry, -3.
rz, -10.
rd, -3.
re, -1.
rf, -3.
rg, -4.
ra, -2.
rb, -4.
rc, -3.
rl, -4.
rm, -3.
rn, -3.
ro, -2.
rh, -4.
ri, -2.
rj, -7.
rk, -4.
xj, -7.
xh, -3.
xi, -2.
xn, -6.
xo, -4.
xl, -5.
xm, -4.
xb, -4.
xc, -1.
xa, -2.
xf, -4.
xg, -5.
xd, -4.
xe, -2.
xx, -4.
xy, -4.
xr, -4.
xs, -3.
xp, -1.
xq, -4.
xv, -4.
xw, -3.
xt, -1.
xu, -3.
kc, -4.
kb, -4.
ka, -2.
kg, -5.
kf, -4.
ke, -1.
kd, -5.
kk, -8.
kj, -7.
ki, -1.
kh, -3.
ko, -2.
kn, -2.
km, -4.
kl, -3.
ks, -2.
kr, -5.
kq, -8.
kp, -4.
kw, -3.
kv, -7.
ku, -5.
kt, -3.
ky, -4.
dn, -3.
do, -2.
dl, -4.
dm, -3.
dj, -6.
dk, -6.
dh, -2.
di, -2.
df, -3.
dg, -4.
dd, -3.
de, -1.
db, -3.
dc, -4.
da, -2.
dz, -8.
dy, -4.
dv, -5.
dw, -3.
dt, -2.
du, -3.
dr, -3.
ds, -2.
dp, -3.
dq, -6.
qu, -0.
qo, -6.
wn, -3.
wg, -6.
wf, -5.
we, -1.
wd, -4.
wc, -5.
wb, -5.
wa, -1.
wo, -2.
ju, -1.
wm, -5.
wl, -4.
wk, -7.
wj, -8.
wi, -1.
wh, -1.
ww, -4.
wv, -7.
wu, -6.
jo, -1.
ws, -4.
ji, -3.
wp, -6.
je, -1.
ja, -2.
wy, -5.
wt, -4.
wr, -4.
ck, -3.
cj, -8.
ci, -2.
ch, -1.
co, -1.
cn, -9.
cm, -7.
cl, -3.
cc, -3.
cb, -6.
ca, -2.
cg, -7.
cf, -7.
ce, -1.
cd, -7.
cy, -5.
cs, -5.
cr, -3.
cq, -6.
cp, -7.
cw, -6.
cv, -8.
cu, -3.
ct, -2.
pr, -2.
ps, -3.
pp, -2.
pv, -9.
pw, -5.
pt, -3.
pu, -3.
py, -4.
pb, -6.
pc, -7.
pa, -1.
pf, -6.
pg, -7.
pd, -7.
pe, -1.
pk, -8.
ph, -3.
pi, -2.
pn, -8.
po, -2.
pl, -2.
pm, -5.
iy, -9.
ix, -5.
iz, -5.
vd, -8.
ve, -0.
iq, -7.
ip, -4.
is, -2.
ir, -3.
iu, -6.
it, -2.
iw, -5.
iv, -3.
ii, -6.
ih, -5.
ik, -5.
ij, -9.
im, -3.
il, -2.
io, -2.
in, -1.
ia, -3.
vy, -5.
ic, -2.
ib, -4.
ie, -3.
id, -3.
ig, -3.
if, -3.
vr, -7.
vs, -9.
bd, -5.
be, -1.
bf, -8.
ba, -2.
bb, -4.
bc, -7.
bl, -2.
bm, -5.
bn, -8.
bo, -2.
bh, -7.
bi, -3.
bj, -4.
bt, -4.
bu, -2.
bv, -7.
bw, -6.
bp, -7.
br, -2.
bs, -3.
by, -2.
oo, -3.
on, -1.
om, -2.
ol, -3.
ok, -4.
oj, -7.
oi, -4.
oh, -4.
og, -4.
of, -2.
oe, -5.
od, -3.
oc, -4.
ob, -4.
oa, -4.
oz, -8.
oy, -5.
ox, -7.
ow, -2.
ov, -3.
ou, -1.
ot, -2.
os, -3.
or, -2.
oq, -7.
op, -3.
hy, -5.
hr, -4.
hs, -4.
hp, -5.
hq, -8.
hv, -7.
hw, -4.
ht, -3.
hu, -4.
hj, -8.
hk, -8.
hh, -4.
hi, -1.
hn, -6.
ho, -2.
hl, -5.
hm, -4.
hb, -5.
hc, -5.
ha, -1.
hf, -5.
hg, -6.
hd, -5.
he, -0.
uy, -6.
ux, -6.
uz, -7.
uu, -8.
ut, -1.
uw, -5.
uv, -7.
up, -3.
us, -2.
ur, -1.
um, -3.
ul, -2.
uo, -5.
un, -2.
ui, -3.
uh, -5.
uk, -6.
ue, -3.
ud, -3.
ug, -3.
uf, -4.
ua, -3.
uc, -3.
ub, -4.
aa, -6.
ac, -3.
ab, -3.
ae, -7.
ad, -2.
ag, -3.
af, -4.
ai, -3.
ah, -5.
ak, -4.
aj, -6.
am, -3.
al, -2.
ao, -5.
an, -1.
aq, -7.
ap, -3.
as, -2.
ar, -2.
au, -4.
at, -2.
aw, -4.
av, -3.
ay, -3.
ax, -8.
az, -6.
nh, -3.
ni, -3.
nj, -6.
nk, -4.
nl, -4.
nm, -4.
nn, -4.
no, -2.
na, -2.
nb, -4.
nc, -2.
nd, -1.
ne, -2.
nf, -4.
ng, -2.
nx, -6.
ny, -4.
nz, -10.
np, -5.
nq, -5.
nr, -5.
ns, -2.
nt, -1.
nu, -4.
nv, -5.
nw, -3.

We can use this data, in combination with an optimization algorithm, to maximize the likelihood of the observed text under the natural digram frequencies. Note that we don't need very precise estimates of the frequencies -- estimates with just a single significant digit are sufficient.

#!/usr/bin/env python2

from random import sample
from pylab import log, rand
import string
import textwrap

def loadEncrypted(s):
    # strip whitespace and punctuation so only the letters remain
    badchars = "\t\n\x0b\x0c\r!\"#$%&\'()*+,-./:;<=>?@[\\]^`{|}~ "
    return string.translate(s.lower(), string.maketrans('', ''), badchars)

def loadLikelihoodDict(filename):
    f = open(filename, 'r')
    lines = f.readlines()
    f.close()
    d = {}
    for i in lines:
        if i[0] == '#':  # skip comments
            continue
        i = i.split(',')
        d[i[0]] = float(i[1])
    return d

def calcLogLike(guess, digrams, L):
    # When a digram does not appear in
    # the cannon of text we use for training,
    # the default would be to give it a
    # probability of 0.  However, that implies
    # certain things are impossible, when in practice
    # new things we have never seen before DO happen,
    # especially when people are playing tricky games.
    # To avoid this trap, we assign a default log-likelihood
    # of log_p_rare to any digram we have not seen
    # in our training set.
    log_p_rare = -10.0
    score = 0
    for x in digrams:
        q = guess[x[0]] + guess[x[1]]
        score += L.get(q, log_p_rare)*digrams[x]
    return score


def main():
    likelihoods = loadLikelihoodDict('digrams.txt')
    encrypted = loadEncrypted(puzzle)
    print(textwrap.fill(encrypted, 50))
    alphabet = list(string.ascii_lowercase)
    digrams = {}
    for i in range(2,len(encrypted)):
        pair = encrypted[(i-2):i]
        digrams[pair] = digrams.get(pair,0) + 1

    # construct initial guess
    guess = dict(zip(alphabet,alphabet))
    score = calcLogLike(guess, digrams, likelihoods)

    initial_temperature = 1e5
    final_temperature = 1e-2
    cooling_rate = 1-1e-3

    # Use simulated annealing to optimize
    print("# Initial log score: %.0f"%score)
    T = initial_temperature
    while T > final_temperature:
        T *= cooling_rate
        x,y = sample(alphabet,2)
        guess[x], guess[y] = guess[y], guess[x]
        delta = calcLogLike(guess, digrams, likelihoods) - score
        if delta < 0 and delta < log(rand())*T:
            # reject: undo the swap
            guess[x], guess[y] = guess[y], guess[x]
        else:
            # accept: keep the swap and update the score
            score += delta
    print("# Final log score: %.0f"%score)
    # decode and print
    plain = ''.join([guess[i] for i in list(encrypted)])
    print(textwrap.fill(plain, 50))

if __name__ == '__main__':
    main()


With some (or much) work, our algorithm will discover the following decrypted text.

The passage, which at first scarcely admitted us, quickly grew narrower and lower; we were almost bent double; yet still we persisted in making our way through it. At length we entered a wider space, and the low roof heightened; but, as we congratulated ourselves on this change, our torch was extinguished by a current of air, and we were left in utter darkness. The guides bring with them materials for renewing the light, but we had none--our only resource was to return as we came. We groped round the widened space to find the entrance, and after a time fancied that we had succeeded. This proved however to be a second passage, which evidently ascended. It terminated like the former; though something approaching to a ray, we could not tell whence, shed a very doubtful twilight in the space. By degrees, our eyes grew somewhat accustomed to this dimness, and we perceived that there was no direct passage leading us further; but that it was possible to climb one side of the cavern to a low arch at top, which promised a more easy path, from whence we now discovered that this light proceeded. With considerable difficulty we scrambled up, and came to another passage with still more of illumination, and this led to another ascent like the former.

Of course, that still leaves the mystery of where it is from and what it means ;)


  1. The following examples are to give you some practice with Markov chain calculations. For each transition matrix, draw a directed graph, with nodes labeled for each state and edges labeled with their transition probabilities. Then, try to find the equilibrium distribution \(\tilde{p}(\infty)\) such that \(\tilde{p}(\infty) = A \tilde{p}(\infty)\) and discuss the meaning of your result.

    1. \[A := \begin{bmatrix}\frac{1}{2} & 0 & \frac{1}{3}\\\frac{1}{2} & \frac{1}{2} & 0\\0 & \frac{1}{2} & \frac{2}{3}\end{bmatrix}.\]

    2. \[A := \begin{bmatrix}\frac{1}{2} & 0 & 0 & \frac{2}{3}\\\frac{1}{2} & \frac{1}{2} & 0 & 0\\0 & \frac{1}{2} & \frac{1}{2} & 0\\0 & 0 & \frac{1}{2} & \frac{1}{3}\end{bmatrix}\]

    3. \[A:=\begin{bmatrix}1 & \frac{1}{4} & 0 & 0\\0 & \frac{1}{2} & \frac{1}{4} & 0\\0 & \frac{1}{4} & \frac{1}{2} & 0\\0 & 0 & \frac{1}{4} & 1\end{bmatrix}\]

  2. (from Breneman) Alan Quartermain has fallen off the side of a mountain while adventuring in South America. He has stopped his fall by grabbing a tree root, but just barely. If his footing slips and he takes one more step backwards, he will fall to his death. He must take 3 steps up the very crumbly face of the cliff from his current spot to reach the top and safety. Each time he tries to take a step, the odds are 2 in 3 that he moves up the cliff towards safety, and 1 in 3 that he moves down the cliff towards his death. What are Alan's chances of surviving his dire predicament?

  3. The breakdown of machines and structures is often modelled using Markov chains. One example of this is the deterioration of bridges. Imagine a new rope bridge in the Andes mountains held up by 3 jute ropes. Every time a person crosses the bridge, there is a chance that one of the ropes will break. Let \(p _ n\) be the probability that one of the ropes breaks when there are \(n\) ropes, and assume that two ropes never break at once. For our purposes, \(p _ n = \dfrac{1}{200 n}\).

    1. Draw a graph of this Markov chain. Label each node with its corresponding state and each edge with its transition probability. Discuss any uncertainties in the model structure and parameterization.

    2. Construct a transition matrix for changes in the bridge's state every time a person crosses the bridge.

    3. If the Markov chain begins when the bridge is first built, what is the initial condition?

    4. What is the probability that the bridge will have collapsed by the time 1000 people have crossed it? (Hint: Use python to perform this calculation. The function linalg.matrix_power will be useful. Be sure to explain your answer.)

  4. (Jonas, 2014) In a manufacturing setting, Markov chains can be useful for describing throughput for certain processes. One example of this is a repair process at an electronics manufacturer. For this process, the states of the chain are called "stages". A device enters through the "receive" stage and then is moved to the "debug" stage. Depending on the test results from "debug", it moves to the "scrap" or "repair" stage. From "repair", it moves to "test". From the "test" stage, the device moves either to "ship" or back to "debug". About 40% of the time, the debugging stage fails and the device is scrapped. About 80% of the time, the testing stage succeeds and the device is shipped. The average processing times for each transient stage are 10 minutes in receiving, 1 hour in debugging, 30 minutes in repair, and 15 minutes in testing. (The "scrap" and "ship" stages are absorbing stages, and formally have infinite processing times.)

    1. What is the probability that a device in the debug stage eventually returns to the debug stage?

    2. What is the expected number of times a device will enter the debug stage? (Hint: the number of returns is geometrically distributed)

    3. The total load of a device is the expected number of hours needed to handle each device before it is scrapped or shipped. Calculate the total load under the above model description.

  5. For each of the dynamic processes below, determine which aspects of the process you would want to incorporate into a Markov chain model's state space. For each, give a feature which can be observed and could be included as part of the state space, but which you feel can be ignored when answering the question at hand. Explain.

    1. What's the typical score distribution in a round of shuffleboard?

    2. How likely is a skater to fall during a short-track ice-skating race?

    3. What's the chance a student living in a dorm this winter gets the flu?
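For the equilibrium calculations in exercise 1, it can help to check your hand computations numerically. Below is a minimal sketch, assuming the column-stochastic convention used above (so \(\tilde{p}(t+1) = A \tilde{p}(t)\)); the matrix is the one from part 1.1, and `matrix_power` is the function hinted at in the bridge exercise.

```python
import numpy as np

# Transition matrix from exercise 1.1.  Columns sum to one,
# so distributions evolve as p(t+1) = A p(t).
A = np.array([[1/2., 0.,   1/3.],
              [1/2., 1/2., 0.  ],
              [0.,   1/2., 2/3.]])

# The equilibrium distribution is the eigenvector of A with
# eigenvalue 1, rescaled so that its entries sum to one.
vals, vecs = np.linalg.eig(A)
k = np.argmin(abs(vals - 1.0))
p_eq = np.real(vecs[:, k])
p_eq = p_eq / p_eq.sum()
print(p_eq)                        # equilibrium distribution

# Iterating the chain many steps (as in the bridge exercise)
# can be done with matrix_power.
p0 = np.array([1., 0., 0.])        # start in the first state
p1000 = np.linalg.matrix_power(A, 1000).dot(p0)
print(p1000)                       # close to p_eq for this chain
```

An equilibrium found this way should agree with solving \(\tilde{p} = A \tilde{p}\) by hand, up to floating-point error.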
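Exercises 2 through 4 all involve absorbing chains, where a standard trick is useful: collect the transient-to-transient transition probabilities into a sub-matrix \(Q\); then the "fundamental matrix" \(N = (I - Q)^{-1}\) gives the expected number of visits to each transient state before absorption. A minimal sketch of the idea, with made-up numbers (the matrix below is purely illustrative, not taken from any of the exercises):

```python
import numpy as np

# Transient-to-transient block Q of a small absorbing chain,
# using the same column convention as above: Q[i, j] is the
# chance of stepping from transient state j to transient
# state i.  The leftover probability mass in each column
# leaks into the absorbing states.  Numbers are made up.
Q = np.array([[0.0, 0.3],
              [0.5, 0.0]])

# Fundamental matrix: N[i, j] is the expected number of
# visits to transient state i when starting in state j.
N = np.linalg.inv(np.eye(2) - Q)
print(N)

# Expected visits to state 0, starting from state 0:
print(N[0, 0])
```

Weighting each expected visit count by the time spent per visit then gives expected total loads, which is the kind of bookkeeping the repair-process exercise asks for.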

( previous, home, next )