There are a number of issues I want to discuss in the problem of analysing CRISPR inhibition or CRISPR activation (CRISPRi/a) screens. Most of this was learned during my postdoc at Stanford, working in the labs of Stanley Qi and Wing Hung Wong.
The most prevalant screen design is the classic treatment-control. The key to this design is to ensure that everything is the same between the treatment and control, except for the the effect you want to measure. For example, in gene essentiality screens the treatment is CRISPRi or CRISPRko treated and the the control is the untreated population. However, you need to make sure that everything else between the two conditions are the same. Another example is drug treatment screens to identify the mechanism or pathways of a drug. In this case, the CRISPRi/a treatment is applied to both treatment and control. The difference is that the treatment population is treated with the drug.
Another design, discussed in the Qi lab's neuron screen paper, is to compare positive vs negative selection. This is particularly relevant in the context of gain of function screens. In this design, the CRISPRi/a treatment is applied to the population. Then after the effect is allowed to set in, the population is selected for and against the desired phenotype. In the neuron screen this was beta-tubulin postive and beta-tubulin negative, respectively. This design should control for effects of the CRISPR modulation that are not associated with the desired phenotype.
The outcomes we see from the screen are the counts of sequenced guide RNAs, typically like a table below taken from Gilbert 2014. This is a genome-wide CRISPRi screen for ricin sensitivity/susceptibility.
import sys
!{sys.executable} -m pip install seaborn
import pandas as pd
import seaborn
data_loc = '20150112_ctxdta_crispri_mergedcountstable.txt'
gilbert_2014 = pd.read_csv(data_loc, sep = '\t', header = 0)
gilbert_2014.head()
import numpy as np
def log2_fc(df, condition_col, control_col, zero_correction = 1):
# formula is log2(treatment_prob/condition_prob)
# where x_prob is relative proportion
# x_prob = (count + zero_correction) / sum(count + zero_correction)
# more accurate to work in log space
log_num = np.log2(df[condition_col] + zero_correction) - np.log2(sum(df[condition_col] + zero_correction))
log_denom = np.log2(df[control_col] + zero_correction) - np.log2(sum(df[control_col] + zero_correction))
return log_num - log_denom
gilbert_2014['essential_lfc_JEV'] = log2_fc(gilbert_2014, "untreatedJEV", "T0JEV")
gilbert_2014['essential_lfc_MH'] = log2_fc(gilbert_2014, "untreatedMH", "T0MH")
gilbert_2014['ricin_lfc_MH'] = log2_fc(gilbert_2014, "treatedMH", "untreatedMH")
gilbert_2014['ricin_lfc_JEV'] = log2_fc(gilbert_2014, "treatedJEV", "untreatedJEV")
gilbert_2014.head()
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax = seaborn.scatterplot(x = "essential_lfc_JEV", y = "essential_lfc_MH",
data = gilbert_2014, alpha = 0.1, color = 'black')
ax.set_xlabel('JEV')
ax.set_ylabel('MH')
ax.set_title("Distribution of essential log2 fold changes")
fig, ax = plt.subplots()
ax = seaborn.scatterplot(x = "ricin_lfc_JEV", y = "ricin_lfc_MH",
data = gilbert_2014, alpha = 0.1, color = 'black')
ax.set_xlabel('JEV')
ax.set_ylabel('MH')
ax.set_title("Distribution of ricin log2 fold changes")
We see that the distribution of the log fold changes of the "essential" experiment is towards the negative side, while for the "ricin" experiment it goes both ways but mostly towards the postive side. This is a definite difference in the experiments.
We see a lot of noise in the above plots. Except for the extreme outliers it's difficult to determine what the signal and what's the noise. A good solution to this problem is using a moderated log fold change. There are a number of packages to compute this. The basic idea behind all of them is to smooth the log fold changes towards zero, accounting for noise from sequencing, usually based on a negative binomial regression. I particularly like DESeq2, but it's implemented in R.
import sys
!{sys.executable} -m pip install rpy2
%load_ext rpy2.ipython
import warnings
warnings.filterwarnings('ignore')
%R require(DESeq2); require(ggplot2);
%Rpush gilbert_2014
%R counts = gilbert_2014[, c(4, 5, 8, 9)]; coldata = data.frame(condition = c(0, 0, 1, 1), rep = c(0, 1, 0, 1)); \
rownames(coldata) = colnames(counts);\
WeissmanCRISPRiRicinTreatmentUntreatedVsDay0DESeq = DESeq2::DESeqDataSetFromMatrix(countData = counts, colData = coldata, design = ~ condition);\
WeissmanCRISPRiRicinTreatmentUntreatedVsDay0DESeq = DESeq2::DESeq(WeissmanCRISPRiRicinTreatmentUntreatedVsDay0DESeq);\
WeissmanCRISPRiRicinTreatmentUntreatedVsDay0DESeq = DESeq2::results(WeissmanCRISPRiRicinTreatmentUntreatedVsDay0DESeq);\
log2fc = WeissmanCRISPRiRicinTreatmentUntreatedVsDay0DESeq$log2FoldChange;\
adj_pvals = WeissmanCRISPRiRicinTreatmentUntreatedVsDay0DESeq$padj
%Rpull log2fc adj_pvals
gilbert_2014['essential_moderated_log2fc'] = log2fc
gilbert_2014['essential_moderated_adj_pval'] = adj_pvals
gilbert_2014.head()
import warnings
warnings.filterwarnings('ignore')
%R require(DESeq2); require(ggplot2);
%Rpush gilbert_2014
%R counts = gilbert_2014[, c(8, 9, 6, 7)]; coldata = data.frame(condition = c(0, 0, 1, 1), rep = c(0, 1, 0, 1)); \
rownames(coldata) = colnames(counts);\
WeissmanCRISPRiRicinTreatmentDESeq = DESeq2::DESeqDataSetFromMatrix(countData = counts, colData = coldata, design = ~ condition);\
WeissmanCRISPRiRicinTreatmentDESeq = DESeq2::DESeq(WeissmanCRISPRiRicinTreatmentDESeq);\
WeissmanCRISPRiRicinTreatmentDESeq = DESeq2::results(WeissmanCRISPRiRicinTreatmentDESeq);\
log2fc = WeissmanCRISPRiRicinTreatmentDESeq$log2FoldChange;\
adj_pvals = WeissmanCRISPRiRicinTreatmentDESeq$padj
%Rpull log2fc adj_pvals
gilbert_2014['ricin_moderated_log2fc'] = log2fc
gilbert_2014['ricin_moderated_adj_pval'] = adj_pvals
gilbert_2014.head()
Let's compare the moderated log fold changes to the average of the individual log fold changes.
gilbert_2014['ricin_lfc_avg'] = gilbert_2014[['ricin_lfc_MH', 'ricin_lfc_JEV']].mean(axis=1)
gilbert_2014['essential_lfc_avg'] = gilbert_2014[['essential_lfc_JEV', 'essential_lfc_MH']].mean(axis=1)
seaborn.scatterplot(x = 'essential_moderated_log2fc', y = 'essential_lfc_avg',
data = gilbert_2014, alpha = 0.2, color = 'black')
fig, ax = plt.subplots()
for x in ['essential_moderated_log2fc', 'essential_lfc_avg']:
seaborn.kdeplot(gilbert_2014[x], ax=ax, label = x)
ax.set_xlim(-12, 6)
ax.legend()
seaborn.scatterplot(x = 'ricin_moderated_log2fc', y = 'ricin_lfc_avg',
data = gilbert_2014, alpha = 0.2, color = 'black')
fig, ax = plt.subplots()
for x in ['ricin_moderated_log2fc', 'ricin_lfc_avg']:
seaborn.kdeplot(gilbert_2014[x], ax=ax, label = x)
ax.set_xlim(-12, 6)
ax.legend()
[gilbert_2014['ricin_moderated_log2fc'].mean(), gilbert_2014['ricin_lfc_avg'].mean()]
Note the difference in means between the moderated log fold change and the average of the individual log fold changes. The former is close to zero, while the latter seems a bit different than zero. It's worth checking the distribution of negative control guides.
gilbert_2014['gene'].value_counts()
neg_ctrl_flag = (gilbert_2014['gene'] == 'negative_control')
fig, ax = plt.subplots()
seaborn.kdeplot(gilbert_2014[neg_ctrl_flag]['ricin_moderated_log2fc'], ax=ax,
label = 'negative control', color = 'darkred')
seaborn.kdeplot(gilbert_2014[~neg_ctrl_flag]['ricin_moderated_log2fc'], ax=ax,
label = 'gene targeting', color = 'skyblue')
ax.set_xlim(-12, 6)
ax.legend()
fig, ax = plt.subplots()
seaborn.kdeplot(gilbert_2014[neg_ctrl_flag]['ricin_lfc_avg'], ax=ax,
label = 'negative control', color = 'darkred')
seaborn.kdeplot(gilbert_2014[~neg_ctrl_flag]['ricin_lfc_avg'], ax=ax,
label = 'gene targeting', color = 'skyblue')
ax.set_xlim(-12, 6)
ax.legend()
Negative control guides give us an estimate of the null distribution. It's always good to check the distribution of the negative control guides. One thing to check is whether the peaks of the gene targeting and the peak of the negative control guides match up. This validates the assumption that most genes have no effect. The other main thing to check is if there's a over-representation of the gene targeting guides in the tail in which you expect the signal to lie. For example, in the essential log fold change we expect an over-representation towards the negative end of the log fold change. This gives us an assurance that there is a signal present in the data.
fig, ax = plt.subplots()
seaborn.kdeplot(gilbert_2014[neg_ctrl_flag]['essential_moderated_log2fc'], ax=ax,
label = 'negative control', color = 'darkred')
seaborn.kdeplot(gilbert_2014[~neg_ctrl_flag]['essential_moderated_log2fc'], ax=ax,
label = 'gene targeting', color = 'skyblue')
ax.set_xlim(-12, 0)
ax.set_ylim(0, 0.2)
ax.legend()
A pernicious problem is when the distributions of the negative controls and gene targeting guides don't match up. I've seen this once before, in our neuron screen. This was an activation screen of transcription factors that led mouse embryonic stem cells to differentiation to neurons. But there is a perfectly reasonable explanation for this behavior. If a transcription factor is associated with some cell fate other than neurons then we should expect that those factors would be depleted (compared to the negative control, which have some non-zero base rate of neuron differentation). This would be true even if a non-negligible number of transcription factors would lead to non-neuronal fates.
neuron = pd.read_csv('NeuronPerfectMatchCounts.txt', sep = '\t')
neuron.head()
neuron['gene'].value_counts()
# there are micro RNA's in the library. need to subset to TFs only
mir_flag = neuron['gene'].str.startswith('Mir')
print(mir_flag.value_counts())
print(neuron.shape)
neuron = neuron[~mir_flag]
print(neuron.shape)
neuron['pos_vs_neg_lfc_rep1'] = log2_fc(neuron, "CD8_pos_rep1", "CD8_neg_rep1")
neuron['pos_vs_neg_lfc_rep2'] = log2_fc(neuron, "CD8_pos_rep2", "CD8_neg_rep2")
neuron['pos_vs_neg_lfc_avg'] = neuron[['pos_vs_neg_lfc_rep1', 'pos_vs_neg_lfc_rep2']].mean(axis=1)
neuron.head()
neuron['gene'].value_counts()
neg_ctrl_flag = (neuron['gene'] == 'random')
fig, ax = plt.subplots()
seaborn.kdeplot(neuron[neg_ctrl_flag]['pos_vs_neg_lfc_avg'], ax=ax,
label = 'negative control', color = 'darkred')
seaborn.kdeplot(neuron[~neg_ctrl_flag]['pos_vs_neg_lfc_avg'], ax=ax,
label = 'gene targeting', color = 'skyblue')
ax.legend()
The solution proposed by Salil Bhate was to calcaulte a score of the enrichment of the gene targeting guides compared to the negative controls.
import random
import statistics
def enrich_score(scores, neg_ctrl_scores):
assert(len(scores) == len(neg_ctrl_scores))
return(statistics.mean(scores) - statistics.mean(neg_ctrl_scores))
def cal_enrich(gene_targeting_guides, neg_ctrl_guides, n_bootstraps = 10000):
n_gene = len(gene_targeting_guides)
samples_scores = []
for i in range(n_bootstraps):
neg_ctrl_sample = random.sample(neg_ctrl_guides, n_gene)
samples_scores.append(enrich_score(gene_targeting_guides, neg_ctrl_sample))
return statistics.mean(samples_scores)
g = 'Gm4736'
guides = neuron[neuron['gene'] == g]['pos_vs_neg_lfc_avg'].tolist()
print(len(guides))
print(guides)
neg_ctrl_guides = neuron[neuron['gene'] == 'random']['pos_vs_neg_lfc_avg'].tolist()
print(len(neg_ctrl_guides))
neg_ctrl_sample = random.sample(neg_ctrl_guides, len(guides))
print(len(neg_ctrl_sample))
print(neg_ctrl_sample)
cal_enrich(neuron[neuron['gene'] == g]['pos_vs_neg_lfc_avg'].tolist(),
neuron[neuron['gene'] == 'random']['pos_vs_neg_lfc_avg'].tolist())
genes = neuron['gene'].unique().tolist()
genes.remove('random')
print(len(genes))
gene_scores = pd.DataFrame({'gene': genes})
gene_scores['score'] = \
gene_scores['gene'].map(lambda g: cal_enrich(neuron[neuron['gene'] == g]['pos_vs_neg_lfc_avg'].tolist(),
neuron[neuron['gene'] == 'random']['pos_vs_neg_lfc_avg'].tolist()))
gene_scores['n_guides'] = gene_scores['gene'].map(lambda g: sum(neuron['gene'] == g))
gene_scores.head()
gene_scores.sort_values(by = 'score', ascending = False).head(30)
gene_scores['score'].hist(bins = 25, color = 'black')
gene_scores['score'].median()
We see a lot of these genes overlap with the original list, though it doesn't correspond exactly. But that's not totally unexpected. We also see that the median of the gene scores is slightly negative, which means that most gene scores are negative. We cannot assume there that a null gene has zero effect, relative to the negative controls. Instead most genes have a negative effect. Seems strange, but it can trip you up.
Sorry for the aside, let's return to the Gilbert data.
We now come to the problem of how to combine information at the guide level to make inferences at the gene level. Simple methods, such as taking the mean of the guide-level log fold changes will ignore issues such as non-working guides or off-target effects.
With these types of screens it's difficult to know the ground truth. We usually have to use proxy information in order to judge how good our inferences are, such as GO terms of the top hits. Essential screens are pretty well studied. Hart et al, 2014 compiled a list of common essential genes across a number of a number of different studies. We can use this as a "ground truth" to compare different methods.
essential_genes = pd.read_csv("core-essential-genes-sym_HGNCID.txt", sep = '\t',
header = None, names = ['gene', 'hgnc'])
print(essential_genes.shape)
essential_genes.head()
sum(essential_genes['gene'].isin(gilbert_2014['gene']))
There's a few ways to combine p-values. Fisher's method, Stouffer's method, and robust rank aggregation (which MAGeCK uses). Let's compare them. But one thing to account for is that the default implementation of Stouffer's method in scipy.stats won't distinguish the two sided p-values. So we need to do it ourselves.
import scipy.stats as stats
import math
def stouffers_method(pvals, orig_z_vals):
assert(pvals.shape == orig_z_vals.shape)
# convert p-values to z_values
# don't know why scipy uses point percent function to refer to the quantile function
transformed_zvals = np.sign(orig_z_vals)*pvals.map(lambda p: stats.norm.ppf(p/2.0))
z = sum(transformed_zvals)/math.sqrt(pvals.shape[0])
# the absolute value of the z-value is half-normally distributed
# P(x > |z|) = 1 - 2*(phi(|z|) - 0.5)
return 1.0 - 2.0*(stats.norm.cdf(abs(z)) - 0.5)
# test if opposite signs cancel
def test_stouffers():
test_pvals = pd.Series([0.05, 0.05])
test_zvals = pd.Series([-2.16, 2.16])
return stouffers_method(test_pvals, test_zvals)
test_stouffers()
gilbert_gene_lvl = pd.DataFrame({'gene': gilbert_2014['gene'].unique().tolist()})
gilbert_gene_lvl['essential_stouffer_pval'] = gilbert_gene_lvl['gene'].map(lambda g: stouffers_method(gilbert_2014[gilbert_2014['gene'] == g]['essential_moderated_adj_pval'],
gilbert_2014[gilbert_2014['gene'] == g]['essential_moderated_log2fc']))
gilbert_gene_lvl.head()
from scipy.stats import combine_pvalues
fishers_pvals = gilbert_2014[['gene', 'essential_moderated_adj_pval']].groupby('gene', as_index = False)\
.agg(lambda p: combine_pvalues(p)[1])
fishers_pvals.rename(columns = {'essential_moderated_adj_pval': 'essential_fisher_pval'},
inplace = True)
fishers_pvals.head()
print(gilbert_gene_lvl.shape)
gilbert_gene_lvl = gilbert_gene_lvl.merge(fishers_pvals, on = 'gene', how = 'inner')
print(gilbert_gene_lvl.shape)
gilbert_gene_lvl.head()
# I noticed N/A's above. Let's see if they're identical
print("number with stouffer N/A: ", sum(gilbert_gene_lvl['essential_stouffer_pval'].isna()))
print("number with fisher N/A: ", sum(gilbert_gene_lvl['essential_fisher_pval'].isna()))
print("number with both N/A: ", sum(gilbert_gene_lvl['essential_fisher_pval'].isna() & \
gilbert_gene_lvl['essential_stouffer_pval'].isna()))
Cool! Looks like the missing p-values are the same for both method. Let's see if we can figure out why.
gilbert_gene_lvl[gilbert_gene_lvl['essential_stouffer_pval'].isna()].head()
gilbert_2014[gilbert_2014['gene'].isin(['ANGPT2', 'ARID1A', 'CD40LG'])]
Looks like I forgot to remove the entries with all zeros. Whoops. We can remove these rows are redo the calculations.
print(gilbert_2014.shape)
gilbert_2014_no_zeros = gilbert_2014[(gilbert_2014[['T0JEV', 'T0MH', 'treatedJEV', 'treatedMH', 'untreatedJEV',
'untreatedMH']].apply(lambda row: sum(row), axis = 1) > 10)]
print(gilbert_2014_no_zeros.shape)
gilbert_gene_lvl = pd.DataFrame({'gene': gilbert_2014_no_zeros['gene'].unique().tolist()})
gilbert_gene_lvl['essential_stouffer_pval'] = gilbert_gene_lvl['gene'].map(lambda g: stouffers_method(gilbert_2014_no_zeros[gilbert_2014_no_zeros['gene'] == g]['essential_moderated_adj_pval'],
gilbert_2014_no_zeros[gilbert_2014_no_zeros['gene'] == g]['essential_moderated_log2fc']))
fishers_pvals = gilbert_2014_no_zeros[['gene', 'essential_moderated_adj_pval']].groupby('gene', as_index = False)\
.agg(lambda p: combine_pvalues(p)[1])
fishers_pvals.rename(columns = {'essential_moderated_adj_pval': 'essential_fisher_pval'},
inplace = True)
gilbert_gene_lvl = gilbert_gene_lvl.merge(fishers_pvals, on = 'gene', how = 'inner')
print(gilbert_gene_lvl.shape)
print("number with stouffer N/A: ", sum(gilbert_gene_lvl['essential_stouffer_pval'].isna()))
print("number with fisher N/A: ", sum(gilbert_gene_lvl['essential_fisher_pval'].isna()))
print("number with both N/A: ", sum(gilbert_gene_lvl['essential_fisher_pval'].isna() & \
gilbert_gene_lvl['essential_stouffer_pval'].isna()))
print(gilbert_gene_lvl['essential_fisher_pval'].corr(gilbert_gene_lvl['essential_stouffer_pval']))
seaborn.scatterplot(x = 'essential_fisher_pval', y = 'essential_stouffer_pval',
data = gilbert_gene_lvl, alpha = 0.5, color = 'black')
It looks like when the Stouffer value is high, the Fisher value is also high. However, when the Fisher value is high it's not necessarily true that the Stouffer value is high. This is likely because of the issues I've discussed before (https://timydaley.github.io/FisherVsStouffer.html) with Fisher's method in that it doesn't reward signals with a consistent direction and it's susceptibility to off-target and outlier effects.
# top genes
gilbert_gene_lvl.sort_values(by = 'essential_fisher_pval', ascending = True).head(20)
gilbert_gene_lvl.sort_values(by = 'essential_stouffer_pval', ascending = True).head(20)
Now we can test the performance using the essential gene list from Hart et al 2014. This was generated from separate RNAi screens across multiple cell lines (probably including K562 because it's the most common cell line).
essential_genes = pd.read_csv("core-essential-genes-sym_HGNCID.txt", header = None,
sep = '\t', names = ['gene', 'hgnc'])
essential_genes.head()
sum(essential_genes['gene'].isin(gilbert_gene_lvl['gene']))
essential_gene_mask = (gilbert_gene_lvl['gene'].isin(essential_genes['gene']))
fig, ax = plt.subplots()
seaborn.distplot(gilbert_gene_lvl[essential_gene_mask]['essential_stouffer_pval'],
ax = ax, color = 'darkred', label = 'stouffer', kde = False, bins = np.linspace(0, 1, 20))
seaborn.distplot(gilbert_gene_lvl[essential_gene_mask]['essential_fisher_pval'],
ax = ax, color = 'skyblue', label = 'fisher', kde = False, bins = np.linspace(0, 1, 20))
ax.set_xlabel('gene-level p-value')
ax.legend()
Looks like the fisher and stouffer agree on the clear signal genes, while stouffer shows lower p-values for a large number of genes that fisher is uncertain about. What about the "non-essential" genes? I put non-essential in quotes because we don't actually know if they are non-essential or not, but the majority should be.
fig, ax = plt.subplots()
seaborn.distplot(gilbert_gene_lvl[~essential_gene_mask]['essential_stouffer_pval'],
ax = ax, color = 'darkred', label = 'stouffer', kde = False, bins = np.linspace(0, 1, 20))
seaborn.distplot(gilbert_gene_lvl[~essential_gene_mask]['essential_fisher_pval'],
ax = ax, color = 'skyblue', label = 'fisher', kde = False, bins = np.linspace(0, 1, 20))
ax.set_xlabel('gene-level p-value')
ax.legend()
Interesting that Fisher's method is very bimodal, while Stouffer's method is a lot more smooth. This is one reason why I prefer Stouffer's.