# Zipf and preferential attachment

( previous, home, next )

Zipf’s law does not just apply to word-frequencies. It’s been found to apply to city sizes, income distributions, social networks, and a variety of contexts.

One common form of data we might encounter in a study is what people call a “bag” or “multiset”. A bag is like a set in that the elements have no intrinsic order, but a bag differs from a set in the one important way. In a set an element is either present or absent, so when we list the elements, each thing in the set is listed only once. In a bag, each element can occur once, twice, or any arbitrary number of times, and two bags are equal if and only if each element occurs the same number of times in both. Bags are such a natural idea that they have been used in mathematics for centuries even though they weren’t formally recognized and named until the 1970’s. Here are some examples of bags.

• First names of all students
• Cities different people live in
• Number of links to a website
• List of each species of each tree in a forest
• Words in the bible, ordered alphabetically

Often, rather than list out all the elements of a multiset repeatedly, we can save space by listing each element along with an integer representing how many times that element appears (it’s multiplicity).
So the bag $\{ a,a,a,f,b,b,a,a \}$ might be written $\begin{gather*} \{ (a,5), (b,2), (f,1) \} \quad\text{or just}\quad a^5 b^2 f \end{gather*}$ In computations, vectors are often used to represent bags.

The commonness of words in the King James bible is an easily accessible example, as the data can be found online.

[ Data : hide , shown as table , shown as CSV shown as QR ]

# https://www.kingjamesbibleonline.org/Popular-Bible-Words.php
# times words show up in the King James version of the Bible, ranked from most to least (top 250)

28364
28269
21257
12044
11683
11285
9022
8684
8326
7582
7365
7127
6796
6664
6638
6579
6568
6442
6073
5973
5900
5606
5431
4563
4487
4432
4369
4293
3981
3679
3608
3603
3581
3478
3441
3410
3307
3237
3162
3058
2952
2937
2902
2834
2828
2747
2634
2606
2551
2509
2496
2271
2251
2210
2210
2165
2154
2145
2124
2078
2042
1990
1980
1953
1881
1874
1860
1840
1760
1759
1727
1692
1658
1656
1648
1641
1609
1575
1547
1544
1526
1488
1487
1477
1468
1438
1419
1408
1396
1392
1381
1358
1329
1306
1296
1255
1254
1239
1207
1189
1164
1163
1122
1088
1084
1080
1070
1061
1049
1046
1024
1020
1012
992
989
978
977
977
977
974
965
959
956
953
953
944
934
925
923
923
894
881
878
873
861
859
835
834
831
819
819
817
804
798
793
793
787
785
778
774
770
756
737
734
727
726
716
714
700
699
699
671
659
657
648
644
641
640
639
633
633
631
621
617
610
609
604
604
603
601
599
591
590
584
578
571
565
554
553
551
550
550
548
544
544
540
538
533
528
528
526
524
523
523
511
503
503
501
501
498
496
496
495
490
487
483
482
480
475
470
470
463
463
462
458
456
452
450
443
442
441
440
433
426
424
424
423
423
415
414
413
410
405
402
401
399
399
399
398
398


You might think “So what? What can we hope to do with a data set of number of times each word is used? Without the order, the words are nearly meaningless random jumble.” That’s true, in sense that the order is key in communicating meaning. But in another way, it turns out there is atleast one somewhat surprising thing lurking in these data.

Words are unequal. Some words are used much more commonly than others – “the” and “and” are used all at the time, while “gerrymander” and “upholster” are used rarely, and only in specific contexts. One might argue that the linguistic importance of a word is predicted by how often it is used relative to other words. One way to represent this is to plot the empirical CDF of word usage. To make an empirical CDF, make a list of the number of times each word appears (but forget about the words themselves). Sort numbers from smallest (most rarely used) to largest (most commonly used), and plot the value as a function of normalized rank, i.e.,

$\mathcal{F} _ n(t)=\frac{\textrm{num of values }\leq t}{\textrm{num of samples, }n}$ where $$n$$ is the total number of different words,

[Show code]
plot(value, rank/float(n))

This picture shows that there are lots of words that are rather uncommon. There are almost 800,000 words in the bible. Half of those words occur less than 2000 times each, while a few words occur more than 10,000 times each. But this picture doesn’t reveal much more of a pattern than that. A better picture can be made by ranking all the words, in order from most used to least used, and then plotting the frequency of each word’s use as a function of it’s rank. These are called rank-order statistics. Rank order statistics are particularly useful if you don’t know the denominator. What we’d need for the empirical CDF, above, is frequency, but we don’t know how many words there are in the Bible. (OK, there are 783,137 words, but in many contexts, the denominator is uncertain or inconvenient to obtain. Recall, for example, our analysis of telephone line loading.) If plotted on a log-log plot, they reveal a surprising pattern.

[Show code]
n = len(values)
values.sort().reverse()
rank = range(1,n+1)
subplot(2,1,1)
plot(rank, value, 'bo')
subplot(2,1,2)
loglog(rank, value, 'go')

Notice that the rank-order statistic is a straight line on log-log plot! This indicates a power law relationship between rank and abundance, $$y=ax^{-b}$$. Using linear least squares or polyfit or whatever you prefer, we find the elasticity $$b \approx 0.93$$.

The predictive power here is a function of rank. We predict that in the Bible, the $$1000^{\textrm{th}}$$ most common word will appear roughly 129 times. This phenomena is an example of Zipf’s law: given some body of written work, the frequency of any word is inversely proportional to its rank. Today, Zipf’s law is used to refer to any discrete power-law distribution of ranked data: $f(r;b) \propto 1/r^b.$ The associated probability generating function is $\hat{r}(s) = \frac{1}{\zeta(b)} \sum _ {r=1}^{\infty} \frac{s^b}{r^b}$ where $$\zeta(b)$$ is the Riemann zeta-function.

The normalization is, on the one hand, limiting: without normalizing we were able to extrapolate to less frequent words not observed. However with the normalization we can compare with other works. For example, word frequency in “A Connecticut Yankee in King Arthur’s Court” by Mark Twain:

[Show code]
from pylab import *

###############
# BIBLE EXAMPLE
###############

# Empirical CDF
data = list(data)
data.sort()
n = len(data)
rank = array(range(1,n+1))
figure(1,figsize=(8,6))
plot(data,rank/float(n+1),'bo-')
xlabel('# of times word appears',fontsize=24)
ylabel('Emp. Cum. Prob.',fontsize=24)
xticks(fontsize=20)
yticks(fontsize=20)
plt.tight_layout()
savefig('BibleWords_EmpCDF.png',transparent=True)
clf()

# Rank order statistics
# Sort data from most abundant to least
data.sort() # makes it least to most
n = len(data)
rank = array(range(1,n+1))
figure(2,figsize=(12,6))
subplot(1,2,1)
plot(rank,data[::-1],'bo-') # data[::-1] makes it most to least
ylabel('# of times word appears',fontsize=24)
xlabel('Rank',fontsize=24)
xticks(fontsize=20)
yticks(fontsize=20)
subplot(1,2,2)
loglog(rank,data[::-1], 'go')
#ylabel('# of times word appears',fontsize=24)
xlabel('Rank',fontsize=24)
xticks(fontsize=20)
yticks(fontsize=20)
plt.tight_layout()
savefig('BibleWords_Rank.png',transparent=True)
clf()

## Fit power law & plot
coeff = polyfit(log(rank),log(data[::-1]),1)
# prints the polynomial expression
rank2=array(range(1,1000))
y_fitted = exp(coeff[1])*rank2**coeff[0]
loglog(rank,data[::-1], 'go')
loglog(rank2,y_fitted, 'r')
legend(['data', 'fit'],fontsize=18,framealpha=0.)
ylabel('# of times word appears',fontsize=24)
xlabel('rank',fontsize=24)
xticks(fontsize=20)
yticks(fontsize=20)
plt.tight_layout()
savefig('BibleWords_loglogfit.png',transparent=True)
clf()

## Fit power law & plot -- ZIPF
coeff = polyfit(log(rank),log(data[::-1]),1)
# prints the polynomial expression
rank2=array(range(1,1000))
y_fitted = exp(coeff[1])*rank2**coeff[0]
loglog(rank,data[::-1]/sum(data), 'go')
loglog(rank2,(1./rank2)**(-coeff[0])/sum((1./rank2)**(-coeff[0])),linewidth=2)
legend(['data', 'fit'],fontsize=18,framealpha=0.)
ylabel('word frequency',fontsize=24)
xlabel('rank',fontsize=24)
xticks(fontsize=20)
yticks(fontsize=20)
#title('Bible (King James version)',fontsize=26)
plt.tight_layout()
savefig('BibleWords_zipf.png',transparent=True)
clf()

###############
# Repeat for Twain's A Connecticut Yankee...
###############
data2 = loadtxt('CYnums.txt')## Fit power law & plot -- ZIPF
data2.sort() # makes it least to most
n = len(data2)
rank = array(range(1,n+1))
rank2 = array(range(1,1000))
coeff = polyfit(log(rank),log(data2[::-1]),1)
# prints the polynomial expression
rank2=array(range(1,1000))
loglog(rank,data2[::-1]/sum(data2), 'go')
loglog(rank2,(1./rank2)**(-coeff[0])/sum((1./rank2)**(-coeff[0])),linewidth=2)
legend(['data', 'fit'],fontsize=18,framealpha=0.)
ylabel('word frequency',fontsize=24)
xlabel('rank',fontsize=24)
xticks(fontsize=20)
yticks(fontsize=20)
title('A Connecticut Yankee...',fontsize=26)
plt.tight_layout()
savefig('CYiKAC_zipf.png',transparent=True)
clf()

###############
###############
data3 = loadtxt('FBnums.txt')## Fit power law & plot -- ZIPF
data3.sort() # makes it least to most
n = len(data3)
rank = array(range(1,n+1))
rank2 = array(range(1,1000))
coeff = polyfit(log(rank[200:len(rank)]),log(data3[::-1][200:len(rank)]),1)
# prints the polynomial expression
rank2=array(range(1,1000))
loglog(rank,data3[::-1]/sum(data3), 'go')
loglog(rank2,(1./rank2)**(-coeff[0])/sum((1./rank2)**(-coeff[0])),linewidth=2)
legend(['data', 'fit'],fontsize=18,framealpha=0.)
ylabel('word frequency',fontsize=24)
xlabel('rank',fontsize=24)
xticks(fontsize=20)
yticks(fontsize=20)
plt.tight_layout()
savefig('FB_zipf.png',transparent=True)
clf()


They look pretty close! In the case of A Connecticut Yankee, word frequency in the first 250 words is only slightly different, approximately $$f(r)\propto r^{-1}$$. (Fun fact: In both cases, the top two words are ‘and’ and ‘the’, but the order is switched! It’s ‘and’ first in the Bible and ‘the’ first in Twain’s book).

Warning: Be careful of the interpretation. The empirical CDF at the start of lecture, with words on the $$x$$-axis, gave us the probability that a word chosen from a list of unique words appearing in the Bible less than $$n$$ times.

Zipf’s law is a closely related to the Pareto distribution and power-law distributions. These are important particularly because the way a distribution’s tail thins out can have important consequences in many theories.

## Preferential attachment

One proposal to explain the commonness of data fit well by Zipf’s law is that there is a network underlying these data and that this network has self-similar properties reflected by power-law behaviors.

The more connections you have, the faster you make new connections.

Consider a network where each node represents an person and an edge between two nodes represents a “friendship” between two people. What happens to the network as new people join? When someone new joins the network, they are represented by a new node corresponding to this person. And let’s suppose that the new person immediately adds $$m$$ relationships to distinct existing network members.

Let’s let $$N(k,t)$$ be the expected number of nodes of degree $$k$$ (i.e. person connected to $$k$$ others) after $$t$$ new people have joined the network. The new subscriber is more likely to find friends that are well-connected. So, suppose the new member’s relationships are essentially draw at random from all the other connections among members. If the network initially had $$T _ 0$$ connections, then will be $$T(t) = T _ 0 + 2mt$$ connections total out there after $$t$$ new people have joined, since $$t$$ people each add $$m$$ relationship edges, and each edge is connected to two individuals. A degree $$k$$-individual (person with $$k$$ friends) has $$k$$ connections, so the chance that that individual is picked by the new subscriber will be $$k/(2mt + T _ 0)$$.

The precise process for the evolution of $$N(k,t)$$ requires finding the expectation of $$P(x,k,m)$$ where $$P$$, the probability that $$x$$ nodes of degree $$k$$ are picked after $$m$$ draws of edge connections, with replacement, is determined by the difference equation $\begin{gather*} P(x+1,k,s+1) = \left( \frac{k N(k)-k x }{T} \right) P(x,k,s) + \left(1 - \frac{k (N(k) - (x+1))}{T} \right) P(x+1,k,s), \\ P(0,k,0) = 1, \quad P(x>0,k,0) = 0, \quad P(x > N(k),k,s) = 0. \end{gather*}$ But when the number of connections is large enough ($$T \gg k N(k)$$, $$T\gg m$$) and there are many people ($$N(k) \gg m$$), then, on average, $$m k N(k,t) /(2mt + T _ 0)$$ people with $$k$$ connections gain a new relationship and now have $$k+1$$ connections.

If we apply this to actors of all degree, taking $$T _ 0 = 0$$, we expect $$N(k,t)$$ to obey the following difference equation as each new movie is produced. $\begin{gather*} N(k,t+1) = N(k,t) + \frac{ m (k-1)}{2 m t} N(k-1, t-1) - \frac{ m k}{2 m t} N(k, t-1) + 1_{k = m} \end{gather*}$ If $$k > m$$, then we can simplify this to the difference equation

$N(k, t+1) - N(k, t) = \frac{(k-1) N(k-1, t)}{2t} - \frac{k N(k,t)}{2 t}.$

If $$k = m$$, then we have the slightly simplier equation

$N(m,t+1) - N(m,t) = 1 - \frac{mN(m,t)}{2t}.$

We can’t get $$k < m$$ since everyone starts with $$m$$ connections! This should make us throw our assumption of $$c _ 0 = 0$$ kind-of out the window, but I’ll leave it as something for you to ponder.

So, we now have a system of linear recurrence equations, where the recurrences occur in two variables: the degree $$k$$ of nodes in our network, and the number of movies $$t$$ made so far. The variable $$t$$ does not appear in our coefficients, but the variable $$k$$ does. Thus, our recurrence system is similar to a linear, variable coefficient, partial differential equation. Variable coefficient differential equations rarely have solutions that can be expressed without recourse to special functions, and so we don’t have much prior reason to be optimistic here. But, it turns out that if we try the ansatz $N(k,t) = \frac{Ct}{k(k+1)(k+2)}$ we discover that it works provided $$C = 2 m (m+1)$$. And so, we discover the solution $N(k,t) = \frac{2 m t \left(m + 1\right)}{k \left(k + 1\right) \left(k + 2\right)}$

That’s the average number of people with $$k$$ friends. For $$m=10$$,

Sympy code to help with solving:

[Show code]
from sympy import *
from sympy.abc import *
N = lambda k,t : C * t/k/(k+1)/(k+2) # ansatz for solution

"""
If the above form is a solution,
the two equations should simplify to zero.

We substitute into the first and show that it does equal zero.
Then we use the second to determine what value our unknown
constant C should take.
"""

eq1 = N(k,t) + m*(k-1)/(2*m*t)*N(k-1,t) - m*k/(2*m*t)*N(k,t) - N(k,t+1)
assert 0 == eq1.factor()

eq2 = N(m,t+1) - N(m,t) - 1 + N(m,t)*m/2/t
N(k,t).subs(C, solve(eq2,C).pop())

Sum(2*m*t*(m+1)/k/(k+1)/(k+2),(k,m,oo))

This is a scale free relationship, in the sense that the curve is approximately self-similar for a range of observations – for large $$k$$, $$N(k,t) \sim k^{-3}$$.

## Your friends have more friends than you do

From this calculation, we can compute the number of friends the average person will have: $\textrm{num of friends }=\frac{\sum_{k=m}^\infty k N(k,t)}{\sum_{k=m}^\infty N(k,t)}=2m.$

And the fraction of people who have less than the average number is $\frac{\sum_{k=m}^{2m-1}N(k,t)}{\sum_{k=m}^\infty N(k,t)}=\frac{(3m+1)}{2(2m+1)}\geq\frac{2}{3}.$

(since $$2/3\leq (3m+1)/(2(2m+1))<3/4$$ for finite $$m\geq 1$$). But that means, more than half the people have less than the average number of friends!

That’s why your friends probably have more friends than you do…

# Exercises

1. Another important form of Zipf’s law is the citation patterns found in scientific articles. Rather than doing equations, let’s make a simulation of a simplified version of this.

2. Choose one number randomly from the list and append it back to the list. You can use the sample()'' function from therandom’’ module to do this.