If there are 200 typographical errors randomly distributed in a 500 page manuscript, find the probability that a given page contains exactly 3 errors.

We can abstract this type of problems as follows:

Suppose there are nn distinguishable boxes and kk indistinguishable balls. Now, we randomly put the balls into the boxes. For each of the boxes, what is the probability that it contains mm balls?

For example, if the first page contains 3 errors, the second page contains 197 errors, and the rest of the pages contain no errors, then the situation corresponds to the situation where the first box contains 3 balls, the second box contains 197 balls, and the rest of the boxes contain no balls. The balls are indistinguishable because we can only determine how many errors are on each page but not which errors are on the page.

To deal with the problem, we simply need to find these two numbers:

  • the number of ways to put kk indistinguishable balls into nn distinguishable boxes, and
  • the number of ways to put kmk-m indistinguishable balls into n1n-1 distinguishable boxes.

The latter corresponds to the number of ways to put the balls into the boxes provided that we already know that the given box contains mm balls. After we find these two numbers, their ratio is the probability in question.

To find the number of ways to put kk indistinguishable balls into nn distinguishable boxes, we can use the stars and bars method. To see this, we write a special example. Here is an example of n=4n=4 and k=6k=6:

,{}|{}\star{}\star{}|{}\star{}|{}\star{}\star{}\star{},

which corresponds to the distribution 0,2,1,30,2,1,3. We can see that there are n1n-1 bars and kk stars. Therefore, the number of ways to put the balls is the same as the number of ways to choose the kk positions of the stars among n+k1n+k-1 positions. Therefore, the number of ways is

Nn,k=(n+k1k)=(n+k1)!k!(n1)!.N_{n,k}=\binom{n+k-1}{k}=\frac{\left(n+k-1\right)!}{k!\left(n-1\right)!}.

Therefore, the final probability of the given box containing mm balls is

Pn,k(m)=Nn1,kmNn,k=(n1)k!(n+km2)!(km)!(n+k1)!.P_{n,k}(m)=\frac{N_{n-1,k-m}}{N_{n,k}} =\frac{\left(n-1\right)k!\left(n+k-m-2\right)!}{\left(k-m\right)!\left(n+k-1\right)!}.


Another easy way to derive this result is by using the generating function. The number Nn,kN_{n,k} is just the coefficient of xkx^k in the expansion of the generating function (1+x+x2+)n\left(1+x+x^2+\cdots\right)^n. The generating function is just (1x)n\left(1-x\right)^{-n}, which can be easily expanded by using the binomial theorem.


We are now interested in the limit n,kn,k\to\infty with λk/n\lambda\coloneqq k/n fixed. By Stirling’s approximation, we have

Pn,k(m)(n1)kk+1/2(n+km2)n+km2+1/2(km)km+1/2(n+k1)n+k1+1/2ekm+n+k1knk+m+2.P_{n,k}(m)\sim\left(n-1\right) \frac{k^{k+1/2}\left(n+k-m-2\right)^{n+k-m-2+1/2}}{\left(k-m\right)^{k-m+1/2}\left(n+k-1\right)^{n+k-1+1/2} } \mathrm e^{k-m+n+k-1-k-n-k+m+2}.

The 1/21/2’s in the exponents can just be dropped because you may find that if we extract the 1/21/2’s, the factor tends to unity. The exponential is just constant e\mathrm e. Therefore, we have

Pn,k(m)(n1)(λn)λn(n+λnm2)n+λnm2(λnm)λnm(n+λn1)n+λn1e=(n+λnm2n+λn1)n((n+λnm2)λn(λnm)(n+λn1))λn(λnmn+λnm2)m(n1)(n+λn1)(n+λnm2)2eem+1λ+1emem+1λ+1λ(λλ+1)m1λ+1e=(λλ+1)m1λ+1.\begin{align*} P_{n,k}(m)&\sim\left(n-1\right) \frac{\left(\lambda n\right)^{\lambda n}\left(n+\lambda n-m-2\right)^{n+\lambda n-m-2} } {\left(\lambda n-m\right)^{\lambda n-m}\left(n+\lambda n-1\right)^{n+\lambda n-1}}\mathrm e\\ &=\left(\tfrac{n+\lambda n-m-2}{n+\lambda n-1}\right)^n \left(\tfrac{\left(n+\lambda n-m-2\right)\lambda n}{\left(\lambda n-m\right)\left(n+\lambda n-1\right)}\right)^{\lambda n} \left(\tfrac{\lambda n-m}{n+\lambda n-m-2}\right)^m \tfrac{\left(n-1\right)\left(n+\lambda n-1\right)}{\left(n+\lambda n-m-2\right)^2}\mathrm e\\ &\to\mathrm e^{-\frac{m+1}{\lambda+1}}\,\mathrm e^m\, \mathrm e^{-\frac{m+1}{\lambda+1}\lambda}\left(\tfrac\lambda{\lambda+1}\right)^m\tfrac1{\lambda+1}\mathrm e\\ &=\left(\tfrac\lambda{\lambda+1}\right)^m\tfrac1{\lambda+1}. \end{align*}

This is just the geometric distribution with parameter p=1/(λ+1)=n/(k+n)p=1/(\lambda+1)=n/(k+n).


If you want to simulate the number of balls in a box, here is a simple way to do this. First, because each box is the same, we can just focus on the first box without loss of generality. Then, we just need to randomly generate the positions of the n1n-1 bars among the n+k1n+k-1 positions, and then return the index of the first bar (which is the number of balls in the first box).

We can then write the following Ruby code to simulate the number of balls in the first box:

1
2
3
def simulate n, k
  (n-1).times.inject(npkm1 = n+k-1) { |bar, i| [rand(npkm1 - i), bar].min }
end

Compare the simulated result with the theoretical result:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
def frequency m, n, k, trials
  trials.times.count { simulate(n, k) == m } / trials.to_f
end

def truth m, n, k
  (n-1) * (k-m+1..k).reduce(1,:*) / (n+k-m-1..n+k-1).reduce(1,:*).to_f
end

def approx m, n, k
  n*k**m / ((n+k)**(m+1)).to_f
end

srand 1108
m, n, k = 3, 5000, 8000
p frequency m, n, k, 10000 # => 0.0902
p truth m, n, k # => 0.08965012972626446
p approx m, n, k # => 0.08963271594131858