This post arises out of an argument I had with another researcher about the nature and interpretation of correlations of counts between replicates and conditions in a CRISPR screening experiment. A lot of researchers have the mindset that high correlation between replicates and between experiments is a good thing, taking it as a sign that the data are clean and that whatever noise remains is mostly benign sampling noise. To be clear, I'm not talking about the fold changes, but rather the raw count data (e.g. comparing two replicates of the control condition). For example, consider the following quotes:
However, I don't think that this is true. It might seem counterintuitive that high correlation is not necessarily a good thing, but I hope that the toy examples below will help to convince you.
Consider an extreme toy example. Suppose that all guides are represented in the library at exactly equal abundance. Then the only noise would be sampling noise, which would mean that the correlation between the control replicates would be approximately \(0\). For the corresponding treatment replicates, assuming a standard matched control-treatment design, the guides targeting true hit genes would have positive correlation because of the selection effects. All other guides would have zero correlation between replicates.
Let’s consider a specific example. Suppose we have 100,000 guides and 10,000,000 (ten million) reads, for an average sequencing depth of 100 reads per guide. If all guides have the same abundance, then the replicates will look as follows.
set.seed(12345)
n_guides = 100000
avg_seq_depth = 100
counts_df = data.frame(replicate_1 = log(rpois(n_guides, avg_seq_depth) + 1),
                       replicate_2 = log(rpois(n_guides, avg_seq_depth) + 1))
library(ggplot2)
library(ggpubr)
ggplot(counts_df, aes(x = replicate_1, y = replicate_2)) +
  geom_point(alpha = 0.3) +
  theme_bw() +
  stat_cor(method = "pearson") +
  ggtitle("Example correlation of perfect experimental replicates") +
  xlab("log(counts + 1)") +
  ylab("log(counts + 1)")
Now let's consider the case where the abundances are not equal. Let's assume that the changes in abundance due to technical effects are normally distributed on the log scale and that these changes are consistent between replicates. What happens as the spread in the abundances increases?
lambdas = exp(rep(log(100), times = n_guides) + rnorm(n_guides, 0, 0.1))
# renormalize to have same mean
lambdas = avg_seq_depth*lambdas/mean(lambdas)
counts_df = data.frame(replicate_1 = log(rpois(n_guides, lambdas) + 1),
                       replicate_2 = log(rpois(n_guides, lambdas) + 1))
ggplot(counts_df, aes(x = replicate_1, y = replicate_2)) +
  geom_point(alpha = 0.3) +
  theme_bw() +
  stat_cor(method = "pearson") +
  ggtitle("Example correlation of somewhat correlated experimental replicates") +
  xlab("log(counts + 1)") +
  ylab("log(counts + 1)")
lambdas = exp(rep(log(100), times = n_guides) + rnorm(n_guides, 0, 0.75))
# renormalize to have same mean
lambdas = avg_seq_depth*lambdas/mean(lambdas)
counts_df = data.frame(replicate_1 = log(rpois(n_guides, lambdas) + 1),
                       replicate_2 = log(rpois(n_guides, lambdas) + 1))
ggplot(counts_df, aes(x = replicate_1, y = replicate_2)) +
  geom_point(alpha = 0.3) +
  theme_bw() +
  stat_cor(method = "pearson") +
  ggtitle("Example correlation of highly correlated experimental replicates") +
  xlab("log(counts + 1)") +
  ylab("log(counts + 1)")
We see that as the technical effects increase relative to the sampling noise, the correlation between replicates increases.
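To see the trend more compactly, here is a small sketch (my addition, using the same simulation setup as above) that sweeps over the standard deviation of the technical effects and records the replicate correlation of the log counts:

# Sketch (my addition): replicate correlation of log counts as a function of the
# technical-effect SD, with consistent technical effects and independent Poisson sampling.
sapply(c(0, 0.1, 0.25, 0.5, 0.75), function(sd_tech) {
  lams = exp(log(avg_seq_depth) + rnorm(n_guides, 0, sd_tech))
  lams = avg_seq_depth * lams / mean(lams)  # renormalize to the same mean depth
  cor(log(rpois(n_guides, lams) + 1), log(rpois(n_guides, lams) + 1))
})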
What about the treatment replicates? Compared to the control, there will be consistent changes in a few guides. This should result in a relative increase in the correlation from the control replicates to the treatment replicates.
lambdas = rep(avg_seq_depth, times = n_guides)
n_positive = 1000
# bump up the abundance of the true-hit guides (roughly a 2-3x fold change on average)
lambdas[1:n_positive] = lambdas[1:n_positive] + exp(rnorm(n_positive, mean = log(avg_seq_depth), 1))
# renormalize to have same mean
lambdas = avg_seq_depth*lambdas/mean(lambdas)
counts_df = data.frame(replicate_1 = log(rpois(n_guides, lambdas) + 1),
                       replicate_2 = log(rpois(n_guides, lambdas) + 1))
ggplot(counts_df, aes(x = replicate_1, y = replicate_2)) +
  geom_point(alpha = 0.3) +
  theme_bw() +
  stat_cor(method = "pearson") +
  ggtitle("Example correlation of treatment replicates for a high signal-to-noise experiment") +
  xlab("log(counts + 1)") +
  ylab("log(counts + 1)")
print(paste0("mean of true hit guides: ",
mean(lambdas[1:n_positive])))
## [1] "mean of true hit guides: 262.454666838809"
print(paste0("mean of null guides: ",
mean(lambdas[(n_positive+1):length(lambdas)])))
## [1] "mean of null guides: 98.359043769305"
lambdas = exp(rep(log(100), times = n_guides) + rnorm(n_guides, 0, 0.1))
n_positive = 1000
# bump up the abundance of the true-hit guides (roughly a 2-3x fold change on average)
lambdas[1:n_positive] = lambdas[1:n_positive] + exp(rnorm(n_positive, mean = log(avg_seq_depth), 1))
# renormalize to have same mean
lambdas = avg_seq_depth*lambdas/mean(lambdas)
counts_df = data.frame(replicate_1 = log(rpois(n_guides, lambdas) + 1),
                       replicate_2 = log(rpois(n_guides, lambdas) + 1))
ggplot(counts_df, aes(x = replicate_1, y = replicate_2)) +
  geom_point(alpha = 0.3) +
  theme_bw() +
  stat_cor(method = "pearson") +
  ggtitle("Example correlation of treatment replicates for a medium signal-to-noise experiment") +
  xlab("log(counts + 1)") +
  ylab("log(counts + 1)")
print(paste0("mean of true hit guides: ",
mean(lambdas[1:n_positive])))
## [1] "mean of true hit guides: 270.052867167592"
print(paste0("mean of null guides: ",
mean(lambdas[(n_positive+1):length(lambdas)])))
## [1] "mean of null guides: 98.2822942710344"
lambdas = exp(rep(log(100), times = n_guides) + rnorm(n_guides, 0, 0.75))
n_positive = 1000
# bump up the abundance of the true-hit guides (roughly a 2-3x fold change on average)
lambdas[1:n_positive] = lambdas[1:n_positive] + exp(rnorm(n_positive, mean = log(avg_seq_depth), 1))
# renormalize to have same mean
lambdas = avg_seq_depth*lambdas/mean(lambdas)
counts_df = data.frame(replicate_1 = log(rpois(n_guides, lambdas) + 1),
                       replicate_2 = log(rpois(n_guides, lambdas) + 1))
ggplot(counts_df, aes(x = replicate_1, y = replicate_2)) +
  geom_point(alpha = 0.3) +
  theme_bw() +
  stat_cor(method = "pearson") +
  ggtitle("Example correlation of treatment replicates for a low signal-to-noise experiment") +
  xlab("log(counts + 1)") +
  ylab("log(counts + 1)")
print(paste0("mean of true hit guides: ",
mean(lambdas[1:n_positive])))
## [1] "mean of true hit guides: 219.948468344646"
print(paste0("mean of null guides: ",
mean(lambdas[(n_positive+1):length(lambdas)])))
## [1] "mean of null guides: 98.78839930965"
We see that as the technical noise increases, the change in correlation due to the signal decreases. This suggests that it will be more difficult to distinguish the true signal guides from the background null guides.
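One rough way to quantify this difficulty (this snippet is my own addition, not part of the original analysis) is an AUC-style summary: the probability that a randomly chosen true-hit guide has a larger log count than a randomly chosen null guide in a single treatment replicate, estimated from the Mann-Whitney statistic. Values near 0.5 mean the raw counts barely separate the hits from the nulls.

# Sketch (my addition): AUC-style separation of hit vs null guides using the
# last simulated treatment replicate (the low signal-to-noise case above).
is_hit = rep(c(TRUE, FALSE), times = c(n_positive, n_guides - n_positive))
w = wilcox.test(counts_df$replicate_1[is_hit], counts_df$replicate_1[!is_hit])
unname(w$statistic) / (sum(is_hit) * sum(!is_hit))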
Let's now take a more formal approach and consider a specific case, where we specify the generating distribution and make some model-based assumptions. Suppose that the abundances are determined by three effects:

- sampling effects;
- technical effects that are consistent between replicates;
- on-target selection effects, which are only present in the treatment samples.
For the moment let's assume simple Poisson sampling, so that the expected correlation from sampling alone is zero. This is obviously a simplification, since we expect overdispersion in the counts, but this assumption will allow us to gain some valuable insights. As my old PDE professor Dr Wang used to say, "We lie so that we can see the truth."
Let \(x_{ij}\) be the count of guide \(i\) in library \(j\) and let \(\lambda_{ij}\) be the corresponding expected count, which we will call the abundance (and \(p_{ij} = \lambda_{ij}/\sum_{k = 1}^{N} \lambda_{kj}\) is relative abundance). We’ll assume, as previously, that effects happen on the log scale.
\[
\log \lambda_{ij} = \mathrm{E}(\log \lambda_{ij}) + \epsilon_{ij} + \zeta_{ij} + \eta_{ij},
\] where \(\epsilon_{ij}\) are random effects (part of the sampling effects); \(\zeta_{ij}\) are the technical effects; and \(\eta_{ij}\) are selection effects.
We will assume (for the moment) that the random effects are zero. The other effects, however, will not be zero. We shall assume that the technical effects affect the initial abundance of the library and are consistent between replicates, i.e. \(\zeta_{ij} = \zeta_{ik}\) for all \(j\), \(k\). The selection effects only affect the treatment conditions, i.e. \(\eta_{ij} = 0\) for \(j \in \text{ control}\). We assume that the selection effects are positive (typical for CRISPRa screens, whereas CRISPRi/knockout screens tend to have negative selection effects). An increase in abundance for some molecules will cause a decrease in the relative abundance of all other molecules; to keep the calculations simple we will ignore this (otherwise we would have some constant terms floating around, which would not affect the variance/covariance calculations anyway).
Because the effects are independent and additive (in log space), we can calculate the variance as follows, \[ \text{Var} (\log \lambda_{\cdot j}) = \text{Var} (\epsilon_{\cdot j}) + \text{Var} (\zeta_{\cdot j}) + 1(j \in \text{treatment}) \text{Var}(\eta_{\cdot j}). \]
For two libraries \(j, k\) from the control condition, the covariance can be calculated as follows. \[\begin{align} \text{Cov}(\log \lambda_{\cdot j}, \log \lambda_{\cdot k}) &= \mathrm{E} \big( (\log \lambda_{\cdot j} - \mathrm{E} (\log \lambda_{\cdot j})) (\log \lambda_{\cdot k} - \mathrm{E} (\log \lambda_{\cdot k})) \big) \notag \\ &= \mathrm{E} \big( (\epsilon_{\cdot j} + \zeta_{\cdot j} + \eta_{\cdot j}) (\epsilon_{\cdot k} + \zeta_{\cdot k} + \eta_{\cdot k}) \big) \notag \\ &= \mathrm{E}(\zeta_{\cdot j} \zeta_{\cdot k}), \end{align}\] where the last line follows because the different types of effects are independent, the selection effects are zero in the control condition, and the sampling effects have mean zero and are independent between replicates.
For two libraries \(j, k\) from the treatment condition we can calculate the covariance as, \[\begin{align} \text{Cov}(\log \lambda_{\cdot j}, \log \lambda_{\cdot k}) &= \mathrm{E} \big( (\log \lambda_{\cdot j} - \mathrm{E} (\log \lambda_{\cdot j})) (\log \lambda_{\cdot k} - \mathrm{E} (\log \lambda_{\cdot k})) \big) \notag \\ &= \mathrm{E} \big( (\epsilon_{\cdot j} + \zeta_{\cdot j} + (\eta_{\cdot j} - \mathrm{E} \eta_{\cdot j})) (\epsilon_{\cdot k} + \zeta_{\cdot k} + (\eta_{\cdot k} - \mathrm{E} \eta_{\cdot k})) \big) \notag \\ &= \mathrm{E}(\zeta_{\cdot j} \zeta_{\cdot k}) + \mathrm{E}((\eta_{\cdot j} - \mathrm{E} \eta_{\cdot j}) (\eta_{\cdot k} - \mathrm{E} \eta_{\cdot k})). \end{align}\]
For a library \(j\) from the control condition and a library \(k\) from the treatment condition, we can calculate the covariance as \[\begin{align} \text{Cov}(\log \lambda_{\cdot j}, \log \lambda_{\cdot k}) &= \mathrm{E} \big( (\log \lambda_{\cdot j} - \mathrm{E} (\log \lambda_{\cdot j})) (\log \lambda_{\cdot k} - \mathrm{E} (\log \lambda_{\cdot k})) \big) \notag \\ &= \mathrm{E} \big( (\epsilon_{\cdot j} + \zeta_{\cdot j} + \eta_{\cdot j}) (\epsilon_{\cdot k} + \zeta_{\cdot k} + (\eta_{\cdot k} - \mathrm{E} \eta_{\cdot k})) \big) \notag \\ &= \mathrm{E}(\zeta_{\cdot j} \zeta_{\cdot k}). \end{align}\]
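As a quick numerical sanity check of these covariance results (my addition, with arbitrary effect sizes), we can draw the effects directly on the log scale, with the technical effects shared by all libraries and the selection effects shared only by the treatment libraries:

# Sanity check (my addition): zeta shared across all libraries, eta only in the treatment
# libraries, plus independent per-library noise standing in for the sampling-related effects.
zeta = rnorm(n_guides, 0, 0.5)        # technical effects, Var = 0.25
eta  = abs(rnorm(n_guides, 1, 0.5))   # positive selection effects (treatment only)
ctrl1  = zeta + rnorm(n_guides, 0, 0.2)
ctrl2  = zeta + rnorm(n_guides, 0, 0.2)
treat1 = zeta + eta + rnorm(n_guides, 0, 0.2)
treat2 = zeta + eta + rnorm(n_guides, 0, 0.2)
c(control_control     = cov(ctrl1, ctrl2),    # approx Var(zeta)
  treatment_treatment = cov(treat1, treat2),  # approx Var(zeta) + Var(eta)
  control_treatment   = cov(ctrl1, treat1))   # approx Var(zeta)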
We see that, under our assumptions, there is higher covariance between the treatment replicates, and that this extra covariance is due to the signal part of the experiment. We can estimate the variance due to signal as the covariance of the treatment replicates minus the covariance of the control replicates. The ratio of this difference to the covariance of the treatment replicates can therefore be taken as the fraction of the consistent variance that is due to the selection (signal) effects, which seems like a promising quality-control metric. Let's take a look.
An issue with comparing variances between experiments is that the scales can differ. To control for scale we can divide by the square roots of the mean sequencing depths of the two libraries, since under pure Poisson sampling the standard deviation is the square root of the mean.
A seemingly reasonable alternative is to standardize by the standard deviations of each library, i.e. to use the correlation. However, I observed through experimentation that increasing the sampling variance, while leaving all other effects the same, increased the correlation-based metric. This is the opposite of what we would want.
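To make the recipe concrete, here is a small helper wrapping the covariance-based metric exactly as it is computed in the chunks below (the function name signal_fraction is my own invention, not from any package):

# Hypothetical helper (my addition): fraction of the consistent (co)variance attributable
# to selection, scaled by the square roots of the means of the inputs, matching the inline
# code below. Inputs are log(counts + 1) vectors.
signal_fraction = function(ctrl1, ctrl2, treat1, treat2) {
  treat_stat   = cov(treat1, treat2) / (sqrt(mean(treat1)) * sqrt(mean(treat2)))
  control_stat = cov(ctrl1, ctrl2) / (sqrt(mean(ctrl1)) * sqrt(mean(ctrl2)))
  (treat_stat - control_stat) / treat_stat
}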
log_lambdas = rep(log(100), times = n_guides) + rnorm(n_guides, 0, 0.1)
n_positive = 1000
# exponentiate to get the abundances (renormalization to a common mean is left commented out below)
lambdas = exp(log_lambdas)
#lambdas = avg_seq_depth*lambdas/mean(lambdas)
counts_df = data.frame(control_rep1 = log(rpois(n_guides, lambdas) + 1),
control_rep2 = log(rpois(n_guides, lambdas) + 1))
# add a positive selection effect to the hit guides (natural-log fold change of about 1, i.e. roughly 2-3x)
log_lambdas[1:n_positive] = log_lambdas[1:n_positive] + rnorm(n_positive, mean = 1, 0.5)
# exponentiate to get the abundances (renormalization to a common mean is left commented out below)
lambdas = exp(log_lambdas)
#lambdas = avg_seq_depth*lambdas/mean(lambdas)
counts_df = data.frame(counts_df,
treatment_rep1 = log(rpois(n_guides, lambdas) + 1),
treatment_rep2 = log(rpois(n_guides, lambdas) + 1),
treatment = factor(c(rep(TRUE, times = n_positive), rep(FALSE, times = n_guides - n_positive))))
# shuffle data frame for better visualization of positives
counts_df = counts_df[sample(nrow(counts_df)), ]
ggplot(counts_df, aes(x = control_rep1, y = control_rep2)) +
  geom_point(alpha = 0.3) +
  theme_bw() +
  stat_cor(method = "pearson") +
  ggtitle("Example correlation of control replicates for a low technical effects, high signal screen") +
  xlab("log(counts + 1)") +
  ylab("log(counts + 1)")
ggplot(counts_df, aes(x = treatment_rep1, y = treatment_rep2, col = treatment)) +
  geom_point(alpha = 0.3) +
  theme_bw() +
  stat_cor(method = "pearson") +
  ggtitle("Example correlation of treatment replicates for a low technical effects, high signal screen") +
  xlab("log(counts + 1)") +
  ylab("log(counts + 1)") +
  scale_color_manual(values = c("black", "red"))
treat_stat = cov(counts_df$treatment_rep1, counts_df$treatment_rep2) /
  (sqrt(mean(counts_df$treatment_rep1)) * sqrt(mean(counts_df$treatment_rep2)))
control_stat = cov(counts_df$control_rep1, counts_df$control_rep2) /
  (sqrt(mean(counts_df$control_rep1)) * sqrt(mean(counts_df$control_rep2)))
control_cov = cov(counts_df$control_rep1, counts_df$control_rep2)
print(paste0("proportion of variance due to signal: ",
(treat_stat - control_stat)/treat_stat))
## [1] "proportion of variance due to signal: 0.553844999562577"
log_lambdas = rep(log(100), times = n_guides) + rnorm(n_guides, 0, 0.5)
n_positive = 1000
# exponentiate to get the abundances (renormalization to a common mean is left commented out below)
lambdas = exp(log_lambdas)
#lambdas = avg_seq_depth*lambdas/mean(lambdas)
counts_df = data.frame(control_rep1 = log(rpois(n_guides, lambdas) + 1),
control_rep2 = log(rpois(n_guides, lambdas) + 1))
# add a positive selection effect to the hit guides (natural-log fold change of about 1, i.e. roughly 2-3x)
log_lambdas[1:n_positive] = log_lambdas[1:n_positive] + rnorm(n_positive, 1, 0.5)
# exponentiate to get the abundances (renormalization to a common mean is left commented out below)
lambdas = exp(log_lambdas)
#lambdas = avg_seq_depth*lambdas/mean(lambdas)
counts_df = data.frame(counts_df,
treatment_rep1 = log(rpois(n_guides, lambdas) + 1),
treatment_rep2 = log(rpois(n_guides, lambdas) + 1),
treatment = factor(c(rep(TRUE, times = n_positive), rep(FALSE, times = n_guides - n_positive))))
# shuffle data frame for better visualization of positives
counts_df = counts_df[sample(nrow(counts_df)), ]
ggplot(counts_df, aes(x = control_rep1, y = control_rep2)) +
  geom_point(alpha = 0.3) +
  theme_bw() +
  stat_cor(method = "pearson") +
  ggtitle("Example correlation of control replicates for a high technical effects, high signal screen") +
  xlab("log(counts + 1)") +
  ylab("log(counts + 1)")
ggplot(counts_df, aes(x = treatment_rep1, y = treatment_rep2, col = treatment)) +
  geom_point(alpha = 0.3) +
  theme_bw() +
  stat_cor(method = "pearson") +
  ggtitle("Example correlation of treatment replicates for a high technical effects, high signal screen") +
  xlab("log(counts + 1)") +
  ylab("log(counts + 1)") +
  scale_color_manual(values = c("black", "red"))
treat_stat = cov(counts_df$treatment_rep1, counts_df$treatment_rep2) /
  (sqrt(mean(counts_df$treatment_rep1)) * sqrt(mean(counts_df$treatment_rep2)))
control_stat = cov(counts_df$control_rep1, counts_df$control_rep2) /
  (sqrt(mean(counts_df$control_rep1)) * sqrt(mean(counts_df$control_rep2)))
control_cov = cov(counts_df$control_rep1, counts_df$control_rep2)
print(paste0("proportion of variance due to signal: ",
(treat_stat - control_stat)/treat_stat))
## [1] "proportion of variance due to signal: 0.0460179175700384"
Now consider the case where the sampling noise is larger than pure Poisson noise. We expect the covariance to stay the same, but the variance of each replicate to increase. This, in turn, will decrease the correlation, particularly in the treatment replicates. Let's look at some examples.
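A minimal sketch of that intuition (my addition, with arbitrary numbers): adding independent noise to two correlated vectors leaves their covariance roughly unchanged but pulls the correlation down.

# Minimal sketch (my addition): covariance is stable under added independent noise,
# but correlation drops because the per-vector variances grow.
shared  = rnorm(n_guides, 0, 1)
x = shared + rnorm(n_guides, 0, 0.1)
y = shared + rnorm(n_guides, 0, 0.1)
x_noisy = x + rnorm(n_guides, 0, 1)
y_noisy = y + rnorm(n_guides, 0, 1)
c(cov_before = cov(x, y), cov_after = cov(x_noisy, y_noisy),
  cor_before = cor(x, y), cor_after = cor(x_noisy, y_noisy))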
log_lambdas = rep(log(100), times = n_guides) + rnorm(n_guides, 0, 0.1)
n_positive = 1000
# extra per-replicate sampling effects on the log scale (noise beyond simple Poisson sampling)
rep1_sampling_effects = rnorm(n_guides, 0, 1)
rep2_sampling_effects = rnorm(n_guides, 0, 1)
rep1_lambdas = exp(log_lambdas + rep1_sampling_effects)
#rep1_lambdas = avg_seq_depth*rep1_lambdas/mean(rep1_lambdas)
rep2_lambdas = exp(log_lambdas + rep2_sampling_effects)
#rep2_lambdas = avg_seq_depth*rep2_lambdas/mean(rep2_lambdas)
counts_df = data.frame(control_rep1 = log(rpois(n_guides, rep1_lambdas) + 1),
control_rep2 = log(rpois(n_guides, rep2_lambdas) + 1))
# add a positive selection effect to the hit guides (natural-log fold change of about 1, i.e. roughly 2-3x)
log_lambdas[1:n_positive] = log_lambdas[1:n_positive] + rnorm(n_positive, 1, 0.5)
rep1_sampling_effects = rnorm(n_guides, 0, 0.25)
rep2_sampling_effects = rnorm(n_guides, 0, 0.25)
rep1_lambdas = exp(log_lambdas + rep1_sampling_effects)
#rep1_lambdas = avg_seq_depth*rep1_lambdas/mean(rep1_lambdas)
rep2_lambdas = exp(log_lambdas + rep2_sampling_effects)
#rep2_lambdas = avg_seq_depth*rep2_lambdas/mean(rep2_lambdas)
counts_df = data.frame(counts_df,
treatment_rep1 = log(rpois(n_guides, rep1_lambdas) + 1),
treatment_rep2 = log(rpois(n_guides, rep2_lambdas) + 1),
treatment = factor(c(rep(TRUE, times = n_positive), rep(FALSE, times = n_guides - n_positive))))
# shuffle data frame for better visualization of positives
counts_df = counts_df[sample(nrow(counts_df)), ]
ggplot(counts_df, aes(x = control_rep1, y = control_rep2)) +
  geom_point(alpha = 0.3) +
  theme_bw() +
  stat_cor(method = "pearson") +
  ggtitle("Example correlation of control replicates for a low technical effects, high signal, high sampling noise screen") +
  xlab("log(counts + 1)") +
  ylab("log(counts + 1)")
ggplot(counts_df, aes(x = treatment_rep1, y = treatment_rep2, col = treatment)) +
  geom_point(alpha = 0.3) +
  theme_bw() +
  stat_cor(method = "pearson") +
  ggtitle("Example correlation of treatment replicates for a low technical effects, high signal, high sampling noise screen") +
  xlab("log(counts + 1)") +
  ylab("log(counts + 1)") +
  scale_color_manual(values = c("black", "red"))
treat_stat = cov(counts_df$treatment_rep1, counts_df$treatment_rep2) /
  (sqrt(mean(counts_df$treatment_rep1)) * sqrt(mean(counts_df$treatment_rep2)))
control_stat = cov(counts_df$control_rep1, counts_df$control_rep2) /
  (sqrt(mean(counts_df$control_rep1)) * sqrt(mean(counts_df$control_rep2)))
control_cov = cov(counts_df$control_rep1, counts_df$control_rep2)
print(paste0("proportion of variance due to signal: ",
(treat_stat - control_stat)/treat_stat))
## [1] "proportion of variance due to signal: 0.65866543772626"
log_lambdas = rep(log(100), times = n_guides) + rnorm(n_guides, 0, 0.5)
n_positive = 1000
# extra per-replicate sampling effects on the log scale (noise beyond simple Poisson sampling)
rep1_sampling_effects = rnorm(n_guides, 0, 1)
rep2_sampling_effects = rnorm(n_guides, 0, 1)
rep1_lambdas = exp(log_lambdas + rep1_sampling_effects)
#rep1_lambdas = avg_seq_depth*rep1_lambdas/mean(rep1_lambdas)
rep2_lambdas = exp(log_lambdas + rep2_sampling_effects)
#rep2_lambdas = avg_seq_depth*rep2_lambdas/mean(rep2_lambdas)
counts_df = data.frame(control_rep1 = log(rpois(n_guides, rep1_lambdas) + 1),
control_rep2 = log(rpois(n_guides, rep2_lambdas) + 1))
# add a positive selection effect to the hit guides (natural-log fold change of about 1, i.e. roughly 2-3x)
log_lambdas[1:n_positive] = log_lambdas[1:n_positive] + rnorm(n_positive, mean = 1, 0.5)
rep1_sampling_effects = rnorm(n_guides, 0, 0.25)
rep2_sampling_effects = rnorm(n_guides, 0, 0.25)
rep1_lambdas = exp(log_lambdas + rep1_sampling_effects)
#rep1_lambdas = avg_seq_depth*rep1_lambdas/mean(rep1_lambdas)
rep2_lambdas = exp(log_lambdas + rep2_sampling_effects)
#rep2_lambdas = avg_seq_depth*rep2_lambdas/mean(rep2_lambdas)
counts_df = data.frame(counts_df,
treatment_rep1 = log(rpois(n_guides, rep1_lambdas) + 1),
treatment_rep2 = log(rpois(n_guides, rep2_lambdas) + 1),
treatment = factor(c(rep(TRUE, times = n_positive), rep(FALSE, times = n_guides - n_positive))))
# shuffle data frame for better visualization of positives
counts_df = counts_df[sample(nrow(counts_df)), ]
ggplot(counts_df, aes(x = control_rep1, y = control_rep2)) +
  geom_point(alpha = 0.3) +
  theme_bw() +
  stat_cor(method = "pearson") +
  ggtitle("Example correlation of control replicates for a high technical effects, high signal, high sampling noise screen") +
  xlab("log(counts + 1)") +
  ylab("log(counts + 1)")
ggplot(counts_df, aes(x = treatment_rep1, y = treatment_rep2, col = treatment)) +
  geom_point(alpha = 0.3) +
  theme_bw() +
  stat_cor(method = "pearson") +
  ggtitle("Example correlation of treatment replicates for a high technical effects, high signal, high sampling noise screen") +
  xlab("log(counts + 1)") +
  ylab("log(counts + 1)") +
  scale_color_manual(values = c("black", "red"))
treat_stat = cov(counts_df$treatment_rep1, counts_df$treatment_rep2) /
  (sqrt(mean(counts_df$treatment_rep1)) * sqrt(mean(counts_df$treatment_rep2)))
control_stat = cov(counts_df$control_rep1, counts_df$control_rep2) /
  (sqrt(mean(counts_df$control_rep1)) * sqrt(mean(counts_df$control_rep2)))
control_cov = cov(counts_df$control_rep1, counts_df$control_rep2)
print(paste0("proportion of variance due to signal: ",
(treat_stat - control_stat)/treat_stat))
## [1] "proportion of variance due to signal: 0.0306341409460458"
We see above that the effect size remains the same, but the technical noise drowns out the effects, making it difficult to determine which guides are the true hits. In the low technical noise example, on the other hand, the true hits are clearly separated.