The Gini coefficient is commonly used to measure the inequality of a population, particularly in economic contexts. For example, the Gini coefficient is commonly used to measure and compare wealth or income inequality across countries or areas (see Wikipedia). The Gini coefficent is defined as \[ \frac{ \sum_i \sum_j | x_i - x_j |}{2 n \sum_j x_j}. \] The Gini coefficient ranges from 0, when all \(x_i = x_j\) and the population is completely uniform, to 1, when some \(x_k > 0\) and all other \(x_{i} = 0\), i.e. one person has all the wealth.
The Gini coefficient is used in genomic contexts to compare inequality of a sampled population, e.g. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1010-4 or https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0843-6. The latter paper is where our motivation comes from. The Gini coefficient is used to assess the quality or selection of sequenced CRISPR screening libraries. Where the usual Gini coefficients assess the distribution of sampled quantities, e.g. sampling random people in a population and obtaining the income for each one, our motivating application involves Poisson or multinomial sampling from a population to assess the frequency. With the latter, there is a sampling-based bias that can affect our estimates.
To see this point, consider the extreme situation where the population being sampled is completely uniform. The true Gini coefficient is equal to zero, however because the observed sample is not completely uniform we get a stricly positive estimated Gini coefficient. Therefore we have some sampling-based bias. How much exactly? We shall see in this investigation.
We’ll do some simulations to determine the rate of bias in the Poisson case.
gini <- function(x){
if(length(x) == 0){
return(0)
}
if(sum(x) == 0){
return(1)
}
if(!all(x >= 0)){
return(1)
}
x.sorted = sort(x)
l = length(x)
numer = sum((l - 1:l + 1)*x.sorted)
denom = sum(x.sorted)
return((l + 1 - 2*numer/denom)/l)
}
n_rep = 100
n_sgrna = 200000
sampling_depth = n_sgrna*c(1/16, 1/8, 1/4, 1/2, 1, 2, 4, 8, 16, 32)
sampled_gini = data.frame(sampling_depth = rep(sampling_depth, each = n_rep),
gini = rep(0, times = length(sampling_depth)*n_rep))
for(i in 1:dim(sampled_gini)[1]){
lam = sampled_gini$sampling_depth[i]
x = rpois(n_sgrna, lambda = lam)
sampled_gini$gini[i] = gini(x)
}
library(dplyr)
gini_summary = sampled_gini %>% group_by(sampling_depth) %>% summarize(gini = mean(gini))
library(ggplot2)
ggplot(gini_summary, aes(x = sampling_depth, y = gini)) + geom_line() + theme_bw()
ggplot(gini_summary, aes(x = sampling_depth, y = gini)) + geom_line() + theme_bw() + scale_y_log10() + scale_x_log10()
We would naturally think that the convergence rate is some \(O(n^p)\), e.g. \(O(n^{1/2})\), as we see from the above \(\log - \log\) plot.
summary(lm(log(sampling_depth) ~ log(gini), data = gini_summary))
##
## Call:
## lm(formula = log(sampling_depth) ~ log(gini), data = gini_summary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.340e-04 -1.449e-04 -2.100e-07 1.369e-04 5.385e-04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.1448271 0.0007031 -1628 <2e-16 ***
## log(gini) -2.0000006 0.0001016 -19686 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0003198 on 8 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 3.875e+08 on 1 and 8 DF, p-value: < 2.2e-16
We see that the convergence rate of the error is \(O(n^{-2})\) in the uniform Poisson case. But what if the we’re not in the perfectly uniform case, as we’re commonly not.
We’ll get to that next time.