The distribution when indistinguishable balls are put into boxes
If there are 200 typographical errors randomly distributed in a 500 page manuscript, find the probability that a given page contains exactly 3 errors.
We can abstract this type of problems as follows:
Suppose there are distinguishable boxes and indistinguishable balls. Now, we randomly put the balls into the boxes. For each of the boxes, what is the probability that it contains balls?
For example, if the first page contains 3 errors, the second page contains 197 errors, and the rest of the pages contain no errors, then the situation corresponds to the situation where the first box contains 3 balls, the second box contains 197 balls, and the rest of the boxes contain no balls. The balls are indistinguishable because we can only determine how many errors are on each page but not which errors are on the page.
To deal with the problem, we simply need to find these two numbers:
- the number of ways to put indistinguishable balls into distinguishable boxes, and
- the number of ways to put indistinguishable balls into distinguishable boxes.
The latter corresponds to the number of ways to put the balls into the boxes provided that we already know that the given box contains balls. After we find these two numbers, their ratio is the probability in question.
To find the number of ways to put indistinguishable balls into distinguishable boxes, we can use the stars and bars method. To see this, we write a special example. Here is an example of and : which corresponds to the distribution . We can see that there are bars and stars. Therefore, the number of ways to put the balls is the same as the number of ways to choose the positions of the stars among positions. Therefore, the number of ways is Therefore, the final probability of the given box containing balls is
Another easy way to derive this result is by using the generating function. The number is just the coefficient of in the expansion of the generating function . The generating function is just , which can be easily expanded by using the binomial theorem.
We are now interested in the limit with fixed. By Stirling’s approximation, we have The ’s in the exponents can just be dropped because you may find that if we extract the ’s, the factor tends to unity. The exponential is just constant . Therefore, we have This is just the geometric distribution with parameter .
If you want to simulate the number of balls in a box, here is a simple way to do this. First, because each box is the same, we can just focus on the first box without loss of generality. Then, we just need to randomly generate the positions of the bars among the positions, and then return the index of the first bar (which is the number of balls in the first box).
We can then write the following Ruby code to simulate the number of balls in the first box:
1 2 3 |
def simulate n, k
(n-1).times.inject(npkm1 = n+k-1) { |bar, i| [rand(npkm1 - i), bar].min }
end
|
Compare the simulated result with the theoretical result:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
def frequency m, n, k, trials
trials.times.count { simulate(n, k) == m } / trials.to_f
end
def truth m, n, k
(n-1) * (k-m+1..k).reduce(1,:*) / (n+k-m-1..n+k-1).reduce(1,:*).to_f
end
def approx m, n, k
n*k**m / ((n+k)**(m+1)).to_f
end
srand 1108
m, n, k = 3, 5000, 8000
p frequency m, n, k, 10000 # => 0.0902
p truth m, n, k # => 0.08965012972626446
p approx m, n, k # => 0.08963271594131858
|