Both the Gini coefficient and the variance are measures of statistical dispersion, which motivates looking for a relationship between them. It turns out that there is a neat one.
This article is translated from a Chinese article on my Zhihu account. The original article was posted on 2021-04-25 10:06 +0800.
First, define the Lorenz curve: it is the curve consisting of all points $(u,v)$ such that the poorest fraction $u$ of the country's population owns a fraction $v$ of the total wealth.
The Gini coefficient, written $G/\mu$ here (with $\mu$ the average wealth, defined below), is the area between the Lorenz curve and the line $u=v$ divided by the area of the triangle enclosed by the three lines $u=v$, $v=0$, and $u=1$.
Now, suppose the wealth distribution in the country is $p$, where $p(x)\,\mathrm{d}x$ is the fraction of the population that has wealth in the range $[x,x+\mathrm{d}x]$.
Then, the Lorenz curve is the graph of the function $g$ defined by
$$g(F(x))=\frac{1}{\mu}\int_{-\infty}^{x}t\,p(t)\,\mathrm{d}t,$$
where $F(x):=\int_{-\infty}^{x}p(t)\,\mathrm{d}t$ is the cumulative distribution function of $p$, and
$$\mu:=\int_{-\infty}^{+\infty}t\,p(t)\,\mathrm{d}t\tag{1}$$
is the average wealth of the population, which is just $\mathrm{E}[X]$ (where $X$ is a random variable such that $X\sim p$).
Equivalently, writing $u=F(x)$, the Lorenz curve is
$$v=g(u):=\frac{1}{\mu}\int_{-\infty}^{F^{-1}(u)}t\,p(t)\,\mathrm{d}t.$$
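To make the definition concrete, here is a minimal Python sketch that computes the vertices of the Lorenz curve for a finite sample, treating the sample as the whole population. The function name `lorenz_points` and the example data are illustrative assumptions, not part of the original derivation.

```python
from itertools import accumulate


def lorenz_points(wealth):
    """Return the vertices (u, v) of the empirical Lorenz curve.

    Assumes a non-empty sample whose total wealth is positive
    (hypothetical helper, for illustration only).
    """
    xs = sorted(wealth)          # poorest first
    n = len(xs)
    total = sum(xs)
    points = [(0.0, 0.0)]
    # After the k poorest people, u = k / n and v = (their wealth) / total.
    for k, partial_sum in enumerate(accumulate(xs), start=1):
        points.append((k / n, partial_sum / total))
    return points


if __name__ == "__main__":
    for u, v in lorenz_points([1, 2, 3, 4]):
        print(u, v)
```

Between consecutive vertices the empirical Lorenz curve is a straight segment, because the quantile function $F^{-1}$ of a finite sample is a step function.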
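According to the definition of the Gini coefficient,
$$\begin{aligned}
G&:=2\mu\int_0^1\left(u-g(u)\right)\mathrm{d}u=\mu-2\mu\int_0^1 g(u)\,\mathrm{d}u\\
&=\mu-2\int_{u=0}^{1}\int_{t=-\infty}^{F^{-1}(u)}t\,p(t)\,\mathrm{d}t\,\mathrm{d}u.
\end{aligned}$$
Interchange the order of integration, and we have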
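$$G=\mu-2\int_{t=-\infty}^{+\infty}\int_{u=F(t)}^{1}t\,p(t)\,\mathrm{d}u\,\mathrm{d}t=\mu-2\int_{-\infty}^{+\infty}\left(1-F(t)\right)t\,p(t)\,\mathrm{d}t.$$
Substitute Equation 1 into the above equation, and we have
$$G=\int_{-\infty}^{+\infty}2tF(t)\,p(t)\,\mathrm{d}t-\mu=\int_{-\infty}^{+\infty}\left(2F(t)-1\right)t\,p(t)\,\mathrm{d}t=\int_0^1(2u-1)F^{-1}(u)\,\mathrm{d}u,$$
where the last step substitutes $u=F(t)$. Now here is the neat part. Separate it into two parts, and write them as double integrals: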
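$$\begin{aligned}
G&=\int_0^1 uF^{-1}(u)\,\mathrm{d}u-\int_0^1(1-u)F^{-1}(u)\,\mathrm{d}u\\
&=\int_{u_2=0}^{1}\int_{u_1=0}^{u_2}F^{-1}(u_2)\,\mathrm{d}u_1\,\mathrm{d}u_2-\int_{u_1=0}^{1}\int_{u_2=u_1}^{1}F^{-1}(u_1)\,\mathrm{d}u_2\,\mathrm{d}u_1.
\end{aligned}$$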
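Interchange the order of integration of the second term, and we have
$$\begin{aligned}
G&=\int_{u_2=0}^{1}\int_{u_1=0}^{u_2}\left(F^{-1}(u_2)-F^{-1}(u_1)\right)\mathrm{d}u_1\,\mathrm{d}u_2\\
&=\frac{1}{2}\int_{u_2=0}^{1}\int_{u_1=0}^{1}\left|F^{-1}(u_2)-F^{-1}(u_1)\right|\mathrm{d}u_1\,\mathrm{d}u_2\\
&=\frac{1}{2}\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}\left|x_2-x_1\right|p(x_1)\,p(x_2)\,\mathrm{d}x_1\,\mathrm{d}x_2\\
&=\frac{1}{2}\mathrm{E}\left[\left|X_2-X_1\right|\right],
\end{aligned}$$
(the second equality holds because $F^{-1}$ is non-decreasing, so the integrand is non-negative on $u_1\le u_2$, and the absolute value is symmetric under swapping $u_1$ and $u_2$)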
where $X_1$ and $X_2$ are two independent random variables, each with density $p$: $(X_1,X_2)\sim p(x_1)\,p(x_2)$.
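The identity can be checked numerically on a finite sample treated as the whole population. The following is a sketch under that assumption; the function names `gini_from_lorenz` and `gini_from_pairs` are made up for illustration. The first function computes $G/\mu$ from the area under the piecewise-linear Lorenz curve, and the second computes it as $\mathrm{E}[|X_2-X_1|]/(2\mu)$, with $X_1$ and $X_2$ drawn independently (with replacement) from the sample.

```python
from itertools import accumulate


def gini_from_lorenz(wealth):
    """Gini coefficient G/mu as 1 - 2 * (area under the Lorenz curve)."""
    xs = sorted(wealth)
    n, total = len(xs), sum(xs)
    cum = [0] + list(accumulate(xs))
    # The empirical Lorenz curve is piecewise linear, so summing
    # trapezoids of width 1/n gives the exact area under it.
    area_under = sum(cum[k - 1] + cum[k] for k in range(1, n + 1)) / (2 * n * total)
    return 1 - 2 * area_under


def gini_from_pairs(wealth):
    """Gini coefficient G/mu as E|X2 - X1| / (2 * mu) over all ordered pairs."""
    n = len(wealth)
    mean = sum(wealth) / n
    mean_abs_diff = sum(abs(a - b) for a in wealth for b in wealth) / n**2
    return mean_abs_diff / (2 * mean)


if __name__ == "__main__":
    sample = [1, 2, 3, 4]
    print(gini_from_lorenz(sample), gini_from_pairs(sample))
```

The pairwise version costs $O(n^2)$ operations while the Lorenz version only needs a sort, so the former is mainly useful as a check.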
This result makes it easy to see how the Gini coefficient represents statistical dispersion: $G$ is half the expected absolute difference between the wealth of two independently chosen people.
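We can apply similar tricks to the variance $\sigma_X^2$:
$$\begin{aligned}
\sigma_X^2&=\mathrm{E}\left[X^2\right]-\mathrm{E}[X]^2\\
&=\int_{-\infty}^{+\infty}t^2p(t)\,\mathrm{d}t-\left(\int_{-\infty}^{+\infty}t\,p(t)\,\mathrm{d}t\right)^2\\
&=\int_0^1F^{-1}(u)^2\,\mathrm{d}u-\left(\int_0^1F^{-1}(u)\,\mathrm{d}u\right)^2.
\end{aligned}$$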
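Split the first term into two halves, and write all three terms as double integrals:
$$\begin{aligned}
\sigma_X^2&=\frac{1}{2}\int_0^1F^{-1}(u_2)^2\,\mathrm{d}u_2\int_0^1\mathrm{d}u_1\\
&\phantom{=}-\int_0^1F^{-1}(u_1)\,\mathrm{d}u_1\int_0^1F^{-1}(u_2)\,\mathrm{d}u_2\\
&\phantom{=}+\frac{1}{2}\int_0^1F^{-1}(u_1)^2\,\mathrm{d}u_1\int_0^1\mathrm{d}u_2\\
&=\frac{1}{2}\int_0^1\int_0^1\left(F^{-1}(u_2)^2-2F^{-1}(u_1)F^{-1}(u_2)+F^{-1}(u_1)^2\right)\mathrm{d}u_1\,\mathrm{d}u_2\\
&=\frac{1}{2}\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}(x_2-x_1)^2\,p(x_1)\,p(x_2)\,\mathrm{d}x_1\,\mathrm{d}x_2\\
&=\frac{1}{2}\mathrm{E}\left[(X_2-X_1)^2\right].
\end{aligned}$$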
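As a quick sanity check of the pairwise form of the variance, here is a small Python sketch that compares the usual population variance with half the mean squared difference over all ordered pairs of a finite sample. The function names and the data are assumptions chosen for illustration.

```python
def population_variance(xs):
    """sigma_X^2 = E[X^2] - E[X]^2, computed as the mean squared deviation."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


def half_mean_squared_difference(xs):
    """(1/2) * E[(X2 - X1)^2], with X1 and X2 drawn independently from xs."""
    n = len(xs)
    return sum((a - b) ** 2 for a in xs for b in xs) / (2 * n**2)


if __name__ == "__main__":
    sample = [1, 2, 3, 4]
    print(population_variance(sample), half_mean_squared_difference(sample))
```

Note that the pairs include $i=j$ and the divisor is $n^2$, which matches the population variance (divisor $n$), not the unbiased sample variance (divisor $n-1$).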
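Then we can derive the relationship between the Gini coefficient and the variance. Since $\sigma_{|X_2-X_1|}^2=\mathrm{E}\left[(X_2-X_1)^2\right]-\mathrm{E}\left[|X_2-X_1|\right]^2$, with $\mathrm{E}\left[(X_2-X_1)^2\right]=2\sigma_X^2$ and $\mathrm{E}\left[|X_2-X_1|\right]=2G$, we have
$$2\sigma_X^2-4G^2=\sigma_{|X_2-X_1|}^2.$$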
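The final identity can also be verified numerically. The sketch below treats a finite sample as the whole population and draws $X_1$, $X_2$ independently from it (all $n^2$ ordered pairs); the function name `dispersion_identity` is a hypothetical helper, not something from the original article.

```python
def dispersion_identity(sample):
    """Return (lhs, rhs) of 2*sigma_X^2 - 4*G^2 == Var(|X2 - X1|)."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / n      # sigma_X^2

    # All n*n ordered pairs: X1 and X2 independent, each uniform on the sample.
    diffs = [abs(a - b) for a in sample for b in sample]
    mean_diff = sum(diffs) / n**2                       # E|X2 - X1| = 2G
    g = mean_diff / 2                                   # G (not divided by mu)
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / n**2

    return 2 * var - 4 * g**2, var_diff


if __name__ == "__main__":
    lhs, rhs = dispersion_identity([1.0, 2.0, 3.0, 4.0])
    print(lhs, rhs)
```

Here `g` is $G$ itself, that is, the Gini coefficient multiplied by the mean $\mu$, following the convention used throughout this article.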