A common problem in bioinformatic data analysis is how to handle zero counts when taking a \(\log\) transformation. The standard solution is to add 1, because this ensures that \(0\) maps to \(0\). However, this introduces bias into the transformed distribution (see Lun 2018). Here I’ll discuss an alternative transformation that relies upon a mechanistic model of library preparation and sequencing.

First I’ll explain Good-Turing smoothing and connect it to adding 1 to the counts.

Good-Turing smoothing

Suppose we have count data that arises from Poisson sampling with \(\lambda\)s sampled from a (prior) distribution \(f(\lambda)\). Given an observation \(x\), what’s the posterior estimate of \(\lambda\)?

\[\begin{align} \mathrm{E} (\lambda | x) &= \int_{0}^{\infty} \lambda \Pr(x \cap \lambda) / \Pr(x) \, d \lambda \notag \\ &= \frac{1}{\Pr(x)} \int_{0}^{\infty} \lambda \frac{\lambda^{x} e^{-\lambda}}{x!} f(\lambda) \, d \lambda \notag \\ &= \frac{1}{\Pr(x)} (x + 1) \int_{0}^{\infty} \frac{\lambda^{x + 1} e^{-\lambda}}{(x + 1)!} f(\lambda) \, d \lambda \notag \\ &= \frac{1}{\Pr(x)} (x + 1) \Pr(x + 1). \tag{1} \end{align}\]

This formula was originally derived by IJ Good for the Binomial model, but he attributes it to Alan Turing’s work during WWII (see IJ Good 1953 and IJ Good 1979).
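As a quick numerical check of formula (1), consider a Gamma prior: the marginal of \(x\) is then negative binomial and the posterior mean has the closed form \((x + a)/(b + 1)\), so we can compare it directly against the Good-Turing expression. The values of \(a\) and \(b\) below are arbitrary choices for illustration.

# check formula (1) against the closed-form posterior mean for a Gamma prior:
# with lambda ~ Gamma(shape = a, rate = b) the marginal of x is negative binomial
# and E(lambda | x) = (x + a)/(b + 1)
a = 2; b = 0.5
x = 0:10
marginal = function(k) dnbinom(k, size = a, prob = b/(b + 1))
good_turing = (x + 1)*marginal(x + 1)/marginal(x)
closed_form = (x + a)/(b + 1)
all.equal(good_turing, closed_form)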

Good-Turing and adding 1: Laplace’s rule of succession

Consider formula (1) above with a uniform prior on all possible counts, i.e. \(\Pr(x) = \Pr(x + 1)\) for all \(x\). Then \(\Pr(x + 1)/\Pr(x) = 1\) and formula (1) reduces to \(\mathrm{E}(\lambda | x) = x + 1\), so the posterior estimates are equivalent to adding \(1\) to all the counts. This is known as Laplace’s rule of succession.

The idea of assigning equal probability to all possible counts is absurd: zero and one billion are obviously not equally likely. What if we instead take an empirical Bayes approach and use the whole sample to inform the estimate for each individual observation?

The log-Normal-Poisson as a mechanistic prior

Consider that at the start every distinct molecule is present once in the pool of molecules. At each round of PCR amplification we should double the number of copies of each molecule, but PCR amplification is not 100% efficient. So at each step we double some random fraction of the copies of each molecule, and after \(R\) rounds we get \[\begin{equation} 2^{\epsilon_{1}} \cdot 2^{\epsilon_{2}} \cdots 2^{\epsilon_{R}} = 2^{\sum_{i=1}^{R} \epsilon_{i}} \tag{2} \end{equation}\] copies of a specific molecule in the pool, where the \(\epsilon_{i}\) are random variables with mean less than \(1\). Note that they are not identically distributed, because they depend on the outcomes of the previous rounds of PCR. Taking the log of equation (2), the log of the copy number is (up to a factor of \(\log 2\)) a sum of random variables. If \(R\) is large enough then we can invoke the Central Limit Theorem and get that the log of the number of copies is approximately Normally distributed. Since sequencing essentially samples molecules from the pool in proportion to their abundance, we get our desired log-Normal-Poisson distribution for the number of sequenced reads for each original molecule.
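To see the log-Normality emerge, here is a rough simulation of this mechanism; the number of molecules, the number of rounds, and the per-copy duplication probability eff below are made-up values, and duplicating each existing copy independently with probability eff per round is just one concrete way to realise “doubling a random fraction”.

# simulate R rounds of imperfect PCR: each existing copy is duplicated with
# probability eff in each round (all parameter values are hypothetical)
set.seed(1)
n_molecules = 10000
rounds = 15
eff = 0.7
copies = rep(1, n_molecules)
for (r in seq_len(rounds)) {
  copies = copies + rbinom(n_molecules, size = copies, prob = eff)
}
# the log copy numbers should look roughly Normal
qqnorm(log(copies)); qqline(log(copies))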

Including zero-inflation

A common issue in NGS data analysis is zero inflation. This may occur if a molecule “drops out” of the library, e.g. see this review. The idea is that molecules can literally get stuck to a test tube and drop out of the sample. To model this we can add a zero-inflation component to the prior, so that \[ \Pr(\lambda) = p \, 1(\lambda = 0) + (1 - p) \, f(\lambda). \]

A standard way to fit this is to fit a zero-truncated distribution to the non-zero counts and then attribute the missing zero mass to zero-inflation.

Following the work from above, if \(x > 0\) then necessarily \(\lambda > 0\) and so the previously derived formula holds. However, if \(x = 0\) then there is some chance that \(\lambda = 0\). Therefore \[ \begin{aligned} \mathrm{E}(\lambda | x = 0, p, f) &= p \cdot 0 + (1 - p) \frac{\Pr(1 | f)}{\Pr(0 | f)}. \end{aligned} \] (Strictly, the weight on the dropout component should be the posterior probability of dropout given a zero count, \(p / (p + (1 - p)\Pr(0 | f))\); using the prior \(p\) is a shortcut that is reasonable when \(p\) is small relative to \(\Pr(0 | f)\).)

Essentially the zero-inflation will push \(\lambda\) towards \(0\) for the zero observations. This will capture the uncertainty about whether zero counts are zero because they’ve dropped out or because of random chance.
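As a small illustration of this adjustment (the values of mu, sig, and p below are hypothetical, not fitted to any data), the zero-count estimate with and without zero-inflation would be:

# E(lambda | x = 0) without and with zero-inflation, for made-up parameters
library(poilog)
mu = 6; sig = 1; p = 0.1
lambda0 = poilog::dpoilog(1, mu = mu, sig = sig)/poilog::dpoilog(0, mu = mu, sig = sig)
c(plain = lambda0, zero_inflated = (1 - p)*lambda0)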

Application: Quality Control for pooled CRISPR screens

To illustrate the application of this idea I’ll use it for quality control analysis of pooled CRISPR screens. I’ll use my standard dataset, the Toronto Knockout Library experiments, because it doesn’t require any mapping or preprocessing.

require(poilog)
# fit a log-Normal-Poisson (Poisson-lognormal) distribution to all the counts,
# including the zeros
estimate_lnp <- function(counts){
  lnorm_fit = poilog::poilogMLE(counts, zTrunc = FALSE)
  return(list(mu = lnorm_fit[['par']]['mu'], 
              sig = lnorm_fit[['par']]['sig']))
}

# fit a zero-inflated log-Normal-Poisson: fit a zero-truncated distribution to
# the non-zero counts, then attribute any excess zeros to zero-inflation
estimate_zilnp <- function(counts){
  ztlnorm_fit = poilog::poilogMLE(counts[counts > 0], zTrunc = TRUE)
  # probability of a zero under the fitted (untruncated) distribution
  p0 = poilog::dpoilog(0, mu = ztlnorm_fit[['par']]['mu'], sig = ztlnorm_fit[['par']]['sig'])
  D = sum(counts > 0)
  # expected number of sampling zeros given D non-zero observations
  n0 = D*p0/(1.0 - p0)
  # the non-dropout observations account for D + n0 of the total;
  # the remaining fraction is attributed to zero-inflation
  p = 1 - (D + n0)/length(counts)
  return(list(mu = ztlnorm_fit[['par']]['mu'], 
              sig = ztlnorm_fit[['par']]['sig'],
              p = p))
}

# Good-Turing smoothing (formula (1)) under the fitted log-Normal-Poisson prior;
# p0 > 0 applies the zero-inflation adjustment to the zero counts
smooth_counts_lnp <- function(counts, mu, sigma, p0 = 0){
  smoothed_table = data.frame(x = unique(counts))
  # E(lambda | x) = (x + 1) Pr(x + 1) / Pr(x), vectorised over the unique counts
  smoothed_table$lambda = (smoothed_table$x + 1)*
    poilog::dpoilog(smoothed_table$x + 1, mu = mu, sig = sigma)/
    poilog::dpoilog(smoothed_table$x, mu = mu, sig = sigma)
  if(p0 > 0){
    # with zero-inflation the estimate for the zero counts is shrunk by (1 - p0)
    smoothed_table$lambda[smoothed_table$x == 0] = (1 - p0)*smoothed_table$lambda[smoothed_table$x == 0]
  }
  lambdas = smoothed_table$lambda[match(counts, smoothed_table$x)]
  return(lambdas)
}
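As a quick sanity check we can simulate counts from a log-Normal-Poisson with a known dropout rate and see whether estimate_zilnp roughly recovers it; the simulation parameters below are arbitrary choices.

# simulate a zero-inflated log-Normal-Poisson sample and check the estimator
set.seed(1)
true_p = 0.05
sim_counts = poilog::rpoilog(S = 20000, mu = 2, sig = 1, keep0 = TRUE)
sim_counts[runif(length(sim_counts)) < true_p] = 0
estimate_zilnp(sim_counts)[['p']]  # should come out roughly 0.05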
# since the column names are not the same for each library we will have to do each library separately
lib = 'DLD1'
lib_path = paste0('readcount-', lib, '-lib1')
counts_df = read.table(file = lib_path, header = T)
counts = counts_df$DLD_T0
counts_lnp_fit = estimate_lnp(counts)
print("log norm fit:")
## [1] "log norm fit:"
print(counts_lnp_fit)
## $mu
##       mu 
## 6.020901 
## 
## $sig
##       sig 
## 0.9052012
counts_compare = data.frame(original_counts = counts, 
                            corrected_counts = smooth_counts_lnp(counts,
                                                                 mu = counts_lnp_fit[['mu']],
                                                                 sigma = counts_lnp_fit[['sig']]))
library(ggplot2)
ggplot(data.frame(counts = counts + 1), aes(x = counts)) + geom_density() +  theme_bw() + ggtitle("density of original counts (+ 1), log scale") + scale_x_log10() 

ggplot(data.frame(counts = counts_compare$corrected_counts), aes(x = counts)) + geom_density() + theme_bw() + ggtitle("density of smoothed counts, log scale") + scale_x_log10()

binwidth = 30
fitted_line = data.frame(x = 0:max(counts))
fitted_line['y'] = binwidth*length(counts)*poilog::dpoilog(fitted_line$x, counts_lnp_fit[['mu']], counts_lnp_fit[['sig']])
ggplot(data.frame(counts = counts), aes(x = counts)) + geom_histogram(colour = "black", fill = "white", binwidth = binwidth) + geom_line(data = fitted_line, aes(x = x, y = y), colour = "red") +  theme_bw() + ggtitle("density of original counts with log-Normal-Poisson fit") 

ggplot(counts_compare, aes(x = original_counts + 1, y = corrected_counts)) +geom_point(alpha = 0.33) + geom_abline(intercept = 0, slope = 1, color = "red") + scale_y_log10() + scale_x_log10() + theme_bw() + ggtitle("comparison of original and smoothed counts")

head(counts_compare)
##   original_counts corrected_counts
## 1             813        812.17231
## 2              61         63.29544
## 3             842        841.12953
## 4             275        275.49323
## 5             272        272.50656
## 6            1156       1154.74260

What we see happening is that the small counts get pushed up and the large counts get pushed down. This is the smoothing effect of the log-Normal-Poisson prior.

Now let’s look at the zero-inflated version.

counts_zilnp_fit = estimate_zilnp(counts)
print("zero-inflated log norm fit:")
## [1] "zero-inflated log norm fit:"
print(counts_zilnp_fit)
## $mu
##       mu 
## 6.055796 
## 
## $sig
##       sig 
## 0.8173307 
## 
## $p
## [1] 1.343549e-09
counts_compare_zi = data.frame(counts_compare,
                            zi_corrected_counts = smooth_counts_lnp(counts,
                                                                 mu = counts_zilnp_fit[['mu']],
                                                                 sigma = counts_zilnp_fit[['sig']],
                                                                 p0 = counts_zilnp_fit[['p']]))

binwidth = 30
fitted_line = data.frame(x = 0:max(counts))
zilnp_density <- function(y, mu, sig, p){
  return(p*(y == 0) + (1 - p)*poilog::dpoilog(y, mu = mu, sig = sig))
}
fitted_line['y'] = binwidth*length(counts)*zilnp_density(fitted_line$x, counts_zilnp_fit[['mu']], counts_zilnp_fit[['sig']], counts_zilnp_fit[['p']])
ggplot(data.frame(counts = counts), aes(x = counts)) + geom_histogram(colour = "black", fill = "white", binwidth = binwidth) + geom_line(data = fitted_line, aes(x = x, y = y), colour = "red") +  theme_bw() + ggtitle("density of original counts with zero-inflated log-Normal-Poisson fit") 

ggplot(counts_compare_zi, aes(x = original_counts + 1, y = zi_corrected_counts)) +geom_point(alpha = 0.33) + geom_abline(intercept = 0, slope = 1, color = "red") + scale_y_log10() + scale_x_log10() + theme_bw() + ggtitle("comparison of original and zero-inflated smoothed counts")

head(counts_compare_zi)
##   original_counts corrected_counts zi_corrected_counts
## 1             813        812.17231           812.03727
## 2              61         63.29544            63.85445
## 3             842        841.12953           840.98480
## 4             275        275.49323           275.65633
## 5             272        272.50656           272.67265
## 6            1156       1154.74260          1154.51024

The amount of zero inflation is estimated to be \(\approx 10^{-9}\) and the number of guides in the pool is about 91K. Since \(10^{-9} \ll 1/91000 \approx 10^{-5}\), we can infer that zero-inflation is not really an issue here (assuming the model is correct).

Estimating the Gini coefficient

The MAGeCK software package proposed the use of the Gini coefficient for quality control analysis of pooled CRISPR screens. This is computed using log-transformed abundances. Let’s compare the unsmoothed and smoothed Gini coefficients.

What’s interesting about using the smoothed counts is that we don’t need to add one before taking the log, as we do with the original counts, because the smoothed values are always strictly positive, even for zero counts.

gini_coef <- function(y){
  # standard formula: G = (n + 1 - 2*sum_i (n + 1 - i) y_(i) / sum(y)) / n,
  # with y_(1) <= ... <= y_(n) the values sorted in increasing order
  y_sorted = sort(y, decreasing = F)
  n = length(y)
  denom = sum(y)
  num = sum((n + 1 - 1:n)*y_sorted)
  return((n + 1 - 2*num/denom)/n)
}
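A couple of quick sanity checks on gini_coef: a perfectly even vector should give \(0\), and a vector where a single element holds everything should be close to \(1\).

gini_coef(rep(1, 100))         # 0
gini_coef(c(rep(0, 99), 100))  # 0.99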

libs = c("DLD1", "GBM", "HCT116_1", "HCT116_2", "HeLa", "RPE1")
gini_df = data.frame(lib = libs, original_gini_coeff = 0, 
                     smoothed_gini_coeff = 0)
# for some reason the authors didn't use a consistent naming schema,
# so we can't loop over the libraries
lib = "DLD1"
lib_path = paste0('readcount-', lib, '-lib1')
counts_df = read.table(file = lib_path, header = T)
counts = counts_df$DLD_T0
counts_lnp_fit = estimate_lnp(counts)
corrected_counts = smooth_counts_lnp(counts,
                                     mu = counts_lnp_fit[['mu']],
                                     sigma = counts_lnp_fit[['sig']])
gini_df$original_gini_coeff[gini_df$lib == lib] = gini_coef(log(counts + 1))
gini_df$smoothed_gini_coeff[gini_df$lib == lib] = gini_coef(log(corrected_counts))


lib = "GBM"
lib_path = paste0('readcount-', lib, '-lib1')
counts_df = read.table(file = lib_path, header = T)
counts = counts_df$T0
counts_lnp_fit = estimate_lnp(counts)
corrected_counts = smooth_counts_lnp(counts,
                                     mu = counts_lnp_fit[['mu']],
                                     sigma = counts_lnp_fit[['sig']])
gini_df$original_gini_coeff[gini_df$lib == lib] = gini_coef(log(counts + 1))
gini_df$smoothed_gini_coeff[gini_df$lib == lib] = gini_coef(log(corrected_counts ))


lib = "HCT116_1"
lib_path = paste0('readcount-', lib, '-lib1')
counts_df = read.table(file = lib_path, header = T)
counts = counts_df$LIB1_T0
counts_lnp_fit = estimate_lnp(counts)
corrected_counts = smooth_counts_lnp(counts,
                                     mu = counts_lnp_fit[['mu']],
                                     sigma = counts_lnp_fit[['sig']])
gini_df$original_gini_coeff[gini_df$lib == lib] = gini_coef(log(counts + 1))
gini_df$smoothed_gini_coeff[gini_df$lib == lib] = gini_coef(log(corrected_counts ))

lib = "HCT116_2"
lib_path = paste0('readcount-', lib, '-lib1')
counts_df = read.table(file = lib_path, header = T)
counts = counts_df$T0
counts_lnp_fit = estimate_lnp(counts)
corrected_counts = smooth_counts_lnp(counts,
                                     mu = counts_lnp_fit[['mu']],
                                     sigma = counts_lnp_fit[['sig']])
gini_df$original_gini_coeff[gini_df$lib == lib] = gini_coef(log(counts + 1))
gini_df$smoothed_gini_coeff[gini_df$lib == lib] = gini_coef(log(corrected_counts))

lib = "HeLa"
lib_path = paste0('readcount-', lib, '-lib1')
counts_df = read.table(file = lib_path, header = T)
counts = counts_df$T0
counts_lnp_fit = estimate_lnp(counts)
corrected_counts = smooth_counts_lnp(counts,
                                     mu = counts_lnp_fit[['mu']],
                                     sigma = counts_lnp_fit[['sig']])
gini_df$original_gini_coeff[gini_df$lib == lib] = gini_coef(log(counts + 1))
gini_df$smoothed_gini_coeff[gini_df$lib == lib] = gini_coef(log(corrected_counts))

lib = "RPE1"
lib_path = paste0('readcount-', lib, '-lib1')
counts_df = read.table(file = lib_path, header = T)
counts = counts_df$T0
counts_lnp_fit = estimate_lnp(counts)
corrected_counts = smooth_counts_lnp(counts,
                                     mu = counts_lnp_fit[['mu']],
                                     sigma = counts_lnp_fit[['sig']])
gini_df$original_gini_coeff[gini_df$lib == lib] = gini_coef(log(counts + 1))
gini_df$smoothed_gini_coeff[gini_df$lib == lib] = gini_coef(log(corrected_counts))

df = data.frame(gini_coef = c(gini_df$original_gini_coeff, 
                              gini_df$smoothed_gini_coeff),
                type = rep(c("unsmoothed", "smoothed"), each = length(libs)),
                lib = rep(gini_df$lib, times = 2))
ggplot(df, aes(x = lib, y = gini_coef, color = type, fill = type)) + geom_bar(stat = 'identity', position = 'dodge') + scale_fill_brewer(palette = 'Set1') + scale_colour_brewer(palette = 'Set1') + ggtitle("day 0 Gini coefficients") + theme_bw()
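The per-library blocks above (and the post-selection ones below) differ only in the file suffix and the name of the count column, so a small helper taking both explicitly would remove most of the repetition. Here’s a sketch; the function name is mine, and the paths and column names follow the pattern used above.

# compute original and smoothed Gini coefficients for one library / count column
smoothed_gini_for_lib <- function(lib, count_col){
  counts_df = read.table(file = paste0('readcount-', lib, '-lib1'), header = TRUE)
  counts = counts_df[[count_col]]
  fit = estimate_lnp(counts)
  corrected = smooth_counts_lnp(counts, mu = fit[['mu']], sigma = fit[['sig']])
  c(original = gini_coef(log(counts + 1)), smoothed = gini_coef(log(corrected)))
}
# e.g. smoothed_gini_for_lib("DLD1", "DLD_T0")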


libs = c("DLD1", "GBM", "HCT116_1", "HCT116_2", "HeLa", "RPE1")
post_sel_gini_df = data.frame(lib = libs, original_gini_coeff = 0, 
                     smoothed_gini_coeff = 0)
# for some reason the authors didn't use a consistent naming schema,
# so we can't loop over the libraries
lib = "DLD1"
lib_path = paste0('readcount-', lib, '-lib1')
counts_df = read.table(file = lib_path, header = T)
counts = counts_df$DLD_ETOH_R1
counts_lnp_fit = estimate_lnp(counts)
corrected_counts = smooth_counts_lnp(counts,
                                     mu = counts_lnp_fit[['mu']],
                                     sigma = counts_lnp_fit[['sig']])
post_sel_gini_df$original_gini_coeff[post_sel_gini_df$lib == lib] = gini_coef(log(counts + 1))
post_sel_gini_df$smoothed_gini_coeff[post_sel_gini_df$lib == lib] = gini_coef(log(corrected_counts))


lib = "GBM"
lib_path = paste0('readcount-', lib, '-lib1')
counts_df = read.table(file = lib_path, header = T)
counts = counts_df$T21A
counts_lnp_fit = estimate_lnp(counts)
corrected_counts = smooth_counts_lnp(counts,
                                     mu = counts_lnp_fit[['mu']],
                                     sigma = counts_lnp_fit[['sig']])
post_sel_gini_df$original_gini_coeff[post_sel_gini_df$lib == lib] = gini_coef(log(counts + 1))
post_sel_gini_df$smoothed_gini_coeff[post_sel_gini_df$lib == lib] = gini_coef(log(corrected_counts ))


lib = "HCT116_1"
lib_path = paste0('readcount-', lib, '-lib1')
counts_df = read.table(file = lib_path, header = T)
counts = counts_df$LIB1_T18_A
counts_lnp_fit = estimate_lnp(counts)
corrected_counts = smooth_counts_lnp(counts,
                                     mu = counts_lnp_fit[['mu']],
                                     sigma = counts_lnp_fit[['sig']])
post_sel_gini_df$original_gini_coeff[post_sel_gini_df$lib == lib] = gini_coef(log(counts + 1))
post_sel_gini_df$smoothed_gini_coeff[post_sel_gini_df$lib == lib] = gini_coef(log(corrected_counts ))

lib = "HCT116_2"
lib_path = paste0('readcount-', lib, '-lib1')
counts_df = read.table(file = lib_path, header = T)
counts = counts_df$T18A
counts_lnp_fit = estimate_lnp(counts)
corrected_counts = smooth_counts_lnp(counts,
                                     mu = counts_lnp_fit[['mu']],
                                     sigma = counts_lnp_fit[['sig']])
post_sel_gini_df$original_gini_coeff[post_sel_gini_df$lib == lib] = gini_coef(log(counts + 1))
post_sel_gini_df$smoothed_gini_coeff[post_sel_gini_df$lib == lib] = gini_coef(log(corrected_counts))

lib = "HeLa"
lib_path = paste0('readcount-', lib, '-lib1')
counts_df = read.table(file = lib_path, header = T)
counts = counts_df$T18A
counts_lnp_fit = estimate_lnp(counts)
corrected_counts = smooth_counts_lnp(counts,
                                     mu = counts_lnp_fit[['mu']],
                                     sigma = counts_lnp_fit[['sig']])
post_sel_gini_df$original_gini_coeff[post_sel_gini_df$lib == lib] = gini_coef(log(counts + 1))
post_sel_gini_df$smoothed_gini_coeff[post_sel_gini_df$lib == lib] = gini_coef(log(corrected_counts))

lib = "RPE1"
lib_path = paste0('readcount-', lib, '-lib1')
counts_df = read.table(file = lib_path, header = T)
counts = counts_df$T18A
counts_lnp_fit = estimate_lnp(counts)
corrected_counts = smooth_counts_lnp(counts,
                                     mu = counts_lnp_fit[['mu']],
                                     sigma = counts_lnp_fit[['sig']])
post_sel_gini_df$original_gini_coeff[post_sel_gini_df$lib == lib] = gini_coef(log(counts + 1))
post_sel_gini_df$smoothed_gini_coeff[post_sel_gini_df$lib == lib] = gini_coef(log(corrected_counts))

df = data.frame(gini_coef = c(post_sel_gini_df$original_gini_coeff, 
                              post_sel_gini_df$smoothed_gini_coeff),
                type = rep(c("unsmoothed", "smoothed"), each = length(libs)),
                lib = rep(post_sel_gini_df$lib, times = 2))
ggplot(df, aes(x = lib, y = gini_coef, color = type, fill = type)) + geom_bar(stat = 'identity', position = 'dodge') + scale_fill_brewer(palette = 'Set1') + scale_colour_brewer(palette = 'Set1') + ggtitle("post-selection Gini coefficients") + theme_bw()

df = data.frame(gini_coef = c(gini_df$original_gini_coeff,
                         gini_df$smoothed_gini_coeff,
                         post_sel_gini_df$original_gini_coeff, 
                         post_sel_gini_df$smoothed_gini_coeff),
           type = rep(rep(c("unsmoothed", "smoothed"), each = length(libs)), times = 2),
           lib = rep(post_sel_gini_df$lib, times = 4),
           selection = rep(c("before", "post"), each = 2*length(libs)))
ggplot(df, aes(x = lib, y = gini_coef, color = type, fill = type, alpha = selection)) + geom_bar(stat = 'identity', position = 'dodge') + scale_fill_brewer(palette = 'Set1') + scale_colour_brewer(palette = 'Set1') + ggtitle("Gini coefficients") + theme_bw()
## Warning: Using alpha for a discrete variable is not advised.

We see that the log-Normal-Poisson smoothed Gini coefficients are generally lower than the unsmoothed ones. This is expected: the smoothed counts are pulled towards each other, because the posterior estimates of \(\lambda\) are essentially trying to remove the variation due to Poisson sampling. Whether this is desirable for quality control I have yet to determine.