There are a number of issues I want to discuss in the problem of analysing CRISPR inhibition or CRISPR activation (CRISPRi/a) screens. Most of this was learned during my postdoc at Stanford, working in the labs of Stanley Qi and Wing Hung Wong.

Design¶

The most prevalant screen design is the classic treatment-control. The key to this design is to ensure that everything is the same between the treatment and control, except for the the effect you want to measure. For example, in gene essentiality screens the treatment is CRISPRi or CRISPRko treated and the the control is the untreated population. However, you need to make sure that everything else between the two conditions are the same. Another example is drug treatment screens to identify the mechanism or pathways of a drug. In this case, the CRISPRi/a treatment is applied to both treatment and control. The difference is that the treatment population is treated with the drug.

Another design, discussed in the Qi lab's neuron screen paper, is to compare positive vs negative selection. This is particularly relevant in the context of gain of function screens. In this design, the CRISPRi/a treatment is applied to the population. Then after the effect is allowed to set in, the population is selected for and against the desired phenotype. In the neuron screen this was beta-tubulin postive and beta-tubulin negative, respectively. This design should control for effects of the CRISPR modulation that are not associated with the desired phenotype.

Measurement¶

The outcomes we see from the screen are the counts of sequenced guide RNAs, typically like a table below taken from Gilbert 2014. This is a genome-wide CRISPRi screen for ricin sensitivity/susceptibility.

import sys
!{sys.executable} -m pip install seaborn

Requirement already satisfied: seaborn in /Users/tim.daley/git/interviews/engineer/blog/venv/lib/python3.9/site-packages (0.11.2)
Requirement already satisfied: scipy>=1.0 in /Users/tim.daley/git/interviews/engineer/blog/venv/lib/python3.9/site-packages (from seaborn) (1.7.1)
Requirement already satisfied: matplotlib>=2.2 in /Users/tim.daley/git/interviews/engineer/blog/venv/lib/python3.9/site-packages (from seaborn) (3.4.1)
Requirement already satisfied: pandas>=0.23 in /Users/tim.daley/git/interviews/engineer/blog/venv/lib/python3.9/site-packages (from seaborn) (1.2.4)
Requirement already satisfied: numpy>=1.15 in /Users/tim.daley/git/interviews/engineer/blog/venv/lib/python3.9/site-packages (from seaborn) (1.20.2)
Requirement already satisfied: cycler>=0.10 in /Users/tim.daley/git/interviews/engineer/blog/venv/lib/python3.9/site-packages (from matplotlib>=2.2->seaborn) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /Users/tim.daley/git/interviews/engineer/blog/venv/lib/python3.9/site-packages (from matplotlib>=2.2->seaborn) (1.3.1)
Requirement already satisfied: pyparsing>=2.2.1 in /Users/tim.daley/git/interviews/engineer/blog/venv/lib/python3.9/site-packages (from matplotlib>=2.2->seaborn) (2.4.7)
Requirement already satisfied: python-dateutil>=2.7 in /Users/tim.daley/git/interviews/engineer/blog/venv/lib/python3.9/site-packages (from matplotlib>=2.2->seaborn) (2.8.1)
Requirement already satisfied: pillow>=6.2.0 in /Users/tim.daley/git/interviews/engineer/blog/venv/lib/python3.9/site-packages (from matplotlib>=2.2->seaborn) (8.2.0)
Requirement already satisfied: pytz>=2017.3 in /Users/tim.daley/git/interviews/engineer/blog/venv/lib/python3.9/site-packages (from pandas>=0.23->seaborn) (2021.1)
Requirement already satisfied: six in /Users/tim.daley/git/interviews/engineer/blog/venv/lib/python3.9/site-packages (from cycler>=0.10->matplotlib>=2.2->seaborn) (1.15.0)
WARNING: You are using pip version 20.2.3; however, version 21.3.1 is available.
You should consider upgrading via the '/Users/tim.daley/git/interviews/engineer/blog/venv/bin/python -m pip install --upgrade pip' command.

import pandas as pd
import seaborn

data_loc = '20150112_ctxdta_crispri_mergedcountstable.txt'
gilbert_2014 = pd.read_csv(data_loc, sep = '\t', header = 0)
gilbert_2014.head()

import numpy as np

def log2_fc(df, condition_col, control_col, zero_correction = 1):
  # formula is log2(treatment_prob/condition_prob)
  # where x_prob is relative proportion
  # x_prob = (count + zero_correction) / sum(count + zero_correction)
  # more accurate to work in log space
  log_num = np.log2(df[condition_col] + zero_correction) - np.log2(sum(df[condition_col] + zero_correction))
  log_denom = np.log2(df[control_col] + zero_correction) - np.log2(sum(df[control_col] + zero_correction))
  return log_num - log_denom

gilbert_2014['essential_lfc_JEV'] = log2_fc(gilbert_2014, "untreatedJEV", "T0JEV")
gilbert_2014['essential_lfc_MH'] = log2_fc(gilbert_2014, "untreatedMH", "T0MH")
gilbert_2014['ricin_lfc_MH'] = log2_fc(gilbert_2014, "treatedMH", "untreatedMH")
gilbert_2014['ricin_lfc_JEV'] = log2_fc(gilbert_2014, "treatedJEV", "untreatedJEV")

gilbert_2014.head()

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax = seaborn.scatterplot(x = "essential_lfc_JEV", y = "essential_lfc_MH", 
                         data = gilbert_2014, alpha = 0.1, color = 'black')
ax.set_xlabel('JEV')
ax.set_ylabel('MH')
ax.set_title("Distribution of essential log2 fold changes")

Text(0.5, 1.0, 'Distribution of essential log2 fold changes')

fig, ax = plt.subplots()
ax = seaborn.scatterplot(x = "ricin_lfc_JEV", y = "ricin_lfc_MH", 
                         data = gilbert_2014, alpha = 0.1, color = 'black')
ax.set_xlabel('JEV')
ax.set_ylabel('MH')
ax.set_title("Distribution of ricin log2 fold changes")

Text(0.5, 1.0, 'Distribution of ricin log2 fold changes')

We see that the distribution of the log fold changes of the "essential" experiment is towards the negative side, while for the "ricin" experiment it goes both ways but mostly towards the postive side. This is a definite difference in the experiments.

Moderated log fold change¶

We see a lot of noise in the above plots. Except for the extreme outliers it's difficult to determine what the signal and what's the noise. A good solution to this problem is using a moderated log fold change. There are a number of packages to compute this. The basic idea behind all of them is to smooth the log fold changes towards zero, accounting for noise from sequencing, usually based on a negative binomial regression. I particularly like DESeq2, but it's implemented in R.

import sys
!{sys.executable} -m pip install rpy2

Requirement already satisfied: rpy2 in /Users/tim.daley/git/interviews/engineer/blog/venv/lib/python3.9/site-packages (3.4.5)
Requirement already satisfied: cffi>=1.10.0 in /Users/tim.daley/git/interviews/engineer/blog/venv/lib/python3.9/site-packages (from rpy2) (1.14.5)
Requirement already satisfied: jinja2 in /Users/tim.daley/git/interviews/engineer/blog/venv/lib/python3.9/site-packages (from rpy2) (2.11.3)
Requirement already satisfied: pytz in /Users/tim.daley/git/interviews/engineer/blog/venv/lib/python3.9/site-packages (from rpy2) (2021.1)
Requirement already satisfied: tzlocal in /Users/tim.daley/git/interviews/engineer/blog/venv/lib/python3.9/site-packages (from rpy2) (4.0)
Requirement already satisfied: pycparser in /Users/tim.daley/git/interviews/engineer/blog/venv/lib/python3.9/site-packages (from cffi>=1.10.0->rpy2) (2.20)
Requirement already satisfied: MarkupSafe>=0.23 in /Users/tim.daley/git/interviews/engineer/blog/venv/lib/python3.9/site-packages (from jinja2->rpy2) (1.1.1)
Requirement already satisfied: pytz-deprecation-shim in /Users/tim.daley/git/interviews/engineer/blog/venv/lib/python3.9/site-packages (from tzlocal->rpy2) (0.1.0.post0)
Requirement already satisfied: tzdata; python_version >= "3.6" in /Users/tim.daley/git/interviews/engineer/blog/venv/lib/python3.9/site-packages (from pytz-deprecation-shim->tzlocal->rpy2) (2021.2.post0)
WARNING: You are using pip version 20.2.3; however, version 21.3.1 is available.
You should consider upgrading via the '/Users/tim.daley/git/interviews/engineer/blog/venv/bin/python -m pip install --upgrade pip' command.

%load_ext rpy2.ipython

import warnings
warnings.filterwarnings('ignore')
%R require(DESeq2); require(ggplot2); 
%Rpush gilbert_2014
%R counts = gilbert_2014[, c(4, 5, 8, 9)]; coldata = data.frame(condition = c(0, 0, 1, 1), rep = c(0, 1, 0, 1)); \
  rownames(coldata) = colnames(counts);\
  WeissmanCRISPRiRicinTreatmentUntreatedVsDay0DESeq = DESeq2::DESeqDataSetFromMatrix(countData = counts, colData = coldata, design = ~ condition);\
  WeissmanCRISPRiRicinTreatmentUntreatedVsDay0DESeq = DESeq2::DESeq(WeissmanCRISPRiRicinTreatmentUntreatedVsDay0DESeq);\
  WeissmanCRISPRiRicinTreatmentUntreatedVsDay0DESeq = DESeq2::results(WeissmanCRISPRiRicinTreatmentUntreatedVsDay0DESeq);\
  log2fc = WeissmanCRISPRiRicinTreatmentUntreatedVsDay0DESeq$log2FoldChange;\
  adj_pvals = WeissmanCRISPRiRicinTreatmentUntreatedVsDay0DESeq$padj
%Rpull log2fc adj_pvals

R[write to console]: Loading required package: DESeq2

R[write to console]: Loading required package: S4Vectors

R[write to console]: Loading required package: stats4

R[write to console]: Loading required package: BiocGenerics

R[write to console]: Loading required package: parallel

R[write to console]: 
Attaching package: ‘BiocGenerics’


R[write to console]: The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB


R[write to console]: The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs


R[write to console]: The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which, which.max, which.min


R[write to console]: 
Attaching package: ‘S4Vectors’


R[write to console]: The following object is masked from ‘package:base’:

    expand.grid


R[write to console]: Loading required package: IRanges

R[write to console]: Loading required package: GenomicRanges

R[write to console]: Loading required package: GenomeInfoDb

R[write to console]: Loading required package: SummarizedExperiment

R[write to console]: Loading required package: Biobase

R[write to console]: Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.


R[write to console]: Loading required package: DelayedArray

R[write to console]: Loading required package: matrixStats

R[write to console]: 
Attaching package: ‘matrixStats’


R[write to console]: The following objects are masked from ‘package:Biobase’:

    anyMissing, rowMedians


R[write to console]: 
Attaching package: ‘DelayedArray’


R[write to console]: The following objects are masked from ‘package:matrixStats’:

    colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges


R[write to console]: The following objects are masked from ‘package:base’:

    aperm, apply, rowsum


R[write to console]: Loading required package: ggplot2

R[write to console]:   the design formula contains one or more numeric variables with integer values,
  specifying a model with increasing fold change for higher values.
  did you mean for this to be a factor? if so, first convert
  this variable to a factor using the factor() function

R[write to console]: estimating size factors

R[write to console]: estimating dispersions

R[write to console]: gene-wise dispersion estimates

R[write to console]: mean-dispersion relationship

R[write to console]: -- note: fitType='parametric', but the dispersion trend was not well captured by the
   function: y = a/x + b, and a local regression fit was automatically substituted.
   specify fitType='local' or 'mean' to avoid this message next time.

R[write to console]: final dispersion estimates

R[write to console]: fitting model and testing

gilbert_2014['essential_moderated_log2fc'] = log2fc
gilbert_2014['essential_moderated_adj_pval'] = adj_pvals
gilbert_2014.head()

import warnings
warnings.filterwarnings('ignore')
%R require(DESeq2); require(ggplot2); 
%Rpush gilbert_2014
%R counts = gilbert_2014[, c(8, 9, 6, 7)]; coldata = data.frame(condition = c(0, 0, 1, 1), rep = c(0, 1, 0, 1)); \
  rownames(coldata) = colnames(counts);\
  WeissmanCRISPRiRicinTreatmentDESeq = DESeq2::DESeqDataSetFromMatrix(countData = counts, colData = coldata, design = ~ condition);\
  WeissmanCRISPRiRicinTreatmentDESeq = DESeq2::DESeq(WeissmanCRISPRiRicinTreatmentDESeq);\
  WeissmanCRISPRiRicinTreatmentDESeq = DESeq2::results(WeissmanCRISPRiRicinTreatmentDESeq);\
  log2fc = WeissmanCRISPRiRicinTreatmentDESeq$log2FoldChange;\
  adj_pvals = WeissmanCRISPRiRicinTreatmentDESeq$padj
%Rpull log2fc adj_pvals

R[write to console]:   the design formula contains one or more numeric variables with integer values,
  specifying a model with increasing fold change for higher values.
  did you mean for this to be a factor? if so, first convert
  this variable to a factor using the factor() function

R[write to console]: estimating size factors

R[write to console]: estimating dispersions

R[write to console]: gene-wise dispersion estimates

R[write to console]: mean-dispersion relationship

R[write to console]: final dispersion estimates

R[write to console]: fitting model and testing

gilbert_2014['ricin_moderated_log2fc'] = log2fc
gilbert_2014['ricin_moderated_adj_pval'] = adj_pvals
gilbert_2014.head()

Let's compare the moderated log fold changes to the average of the individual log fold changes.

gilbert_2014['ricin_lfc_avg'] = gilbert_2014[['ricin_lfc_MH', 'ricin_lfc_JEV']].mean(axis=1)
gilbert_2014['essential_lfc_avg'] = gilbert_2014[['essential_lfc_JEV', 'essential_lfc_MH']].mean(axis=1)

seaborn.scatterplot(x = 'essential_moderated_log2fc', y = 'essential_lfc_avg', 
                    data = gilbert_2014, alpha = 0.2, color = 'black')

<AxesSubplot:xlabel='essential_moderated_log2fc', ylabel='essential_lfc_avg'>

fig, ax = plt.subplots()
for x in ['essential_moderated_log2fc', 'essential_lfc_avg']:
    seaborn.kdeplot(gilbert_2014[x], ax=ax, label = x)
ax.set_xlim(-12, 6)
ax.legend()

<matplotlib.legend.Legend at 0x138a02610>

seaborn.scatterplot(x = 'ricin_moderated_log2fc', y = 'ricin_lfc_avg', 
                    data = gilbert_2014, alpha = 0.2, color = 'black')

<AxesSubplot:xlabel='ricin_moderated_log2fc', ylabel='ricin_lfc_avg'>

fig, ax = plt.subplots()
for x in ['ricin_moderated_log2fc', 'ricin_lfc_avg']:
    seaborn.kdeplot(gilbert_2014[x], ax=ax, label = x)
ax.set_xlim(-12, 6)
ax.legend()

<matplotlib.legend.Legend at 0x13b5d2610>

[gilbert_2014['ricin_moderated_log2fc'].mean(), gilbert_2014['ricin_lfc_avg'].mean()]

[0.024711828301861735, -0.19521762342504134]

Note the difference in means between the moderated log fold change and the average of the individual log fold changes. The former is close to zero, while the latter seems a bit different than zero. It's worth checking the distribution of negative control guides.

gilbert_2014['gene'].value_counts()

negative_control    10569
EIF4G2                 56
EIF5A                  52
MATR3                  50
UBTF                   50
                    ...  
RGS18                   6
GZMK                    6
PTPRQ                   6
ZNF705D                 4
IFNA6                   4
Name: gene, Length: 15976, dtype: int64

neg_ctrl_flag = (gilbert_2014['gene'] == 'negative_control')

fig, ax = plt.subplots()
seaborn.kdeplot(gilbert_2014[neg_ctrl_flag]['ricin_moderated_log2fc'], ax=ax, 
                label = 'negative control', color = 'darkred')
seaborn.kdeplot(gilbert_2014[~neg_ctrl_flag]['ricin_moderated_log2fc'], ax=ax, 
                label = 'gene targeting', color = 'skyblue')
ax.set_xlim(-12, 6)
ax.legend()

<matplotlib.legend.Legend at 0x1370911f0>

fig, ax = plt.subplots()
seaborn.kdeplot(gilbert_2014[neg_ctrl_flag]['ricin_lfc_avg'], ax=ax, 
                label = 'negative control', color = 'darkred')
seaborn.kdeplot(gilbert_2014[~neg_ctrl_flag]['ricin_lfc_avg'], ax=ax, 
                label = 'gene targeting', color = 'skyblue')
ax.set_xlim(-12, 6)
ax.legend()

<matplotlib.legend.Legend at 0x138ad51f0>

The importance of negative controls¶

Negative control guides give us an estimate of the null distribution. It's always good to check the distribution of the negative control guides. One thing to check is whether the peaks of the gene targeting and the peak of the negative control guides match up. This validates the assumption that most genes have no effect. The other main thing to check is if there's a over-representation of the gene targeting guides in the tail in which you expect the signal to lie. For example, in the essential log fold change we expect an over-representation towards the negative end of the log fold change. This gives us an assurance that there is a signal present in the data.

fig, ax = plt.subplots()
seaborn.kdeplot(gilbert_2014[neg_ctrl_flag]['essential_moderated_log2fc'], ax=ax, 
                label = 'negative control', color = 'darkred')
seaborn.kdeplot(gilbert_2014[~neg_ctrl_flag]['essential_moderated_log2fc'], ax=ax, 
                label = 'gene targeting', color = 'skyblue')
ax.set_xlim(-12, 0)
ax.set_ylim(0, 0.2)
ax.legend()

<matplotlib.legend.Legend at 0x138b175b0>

When the negative control and gene targeting distributions don't match up¶

A pernicious problem is when the distributions of the negative controls and gene targeting guides don't match up. I've seen this once before, in our neuron screen. This was an activation screen of transcription factors that led mouse embryonic stem cells to differentiation to neurons. But there is a perfectly reasonable explanation for this behavior. If a transcription factor is associated with some cell fate other than neurons then we should expect that those factors would be depleted (compared to the negative control, which have some non-zero base rate of neuron differentation). This would be true even if a non-negligible number of transcription factors would lead to non-neuronal fates.

neuron = pd.read_csv('NeuronPerfectMatchCounts.txt', sep = '\t')
neuron.head()

neuron['gene'].value_counts()

random    8237
Hmga1      825
Nfatc2     807
Aire       632
Tcf3       530
          ... 
Mir15a       9
Gm4736       5
Zfp628       4
Obox3        3
Obox5        2
Name: gene, Length: 824, dtype: int64

# there are micro RNA's in the library.  need to subset to TFs only
mir_flag = neuron['gene'].str.startswith('Mir')
print(mir_flag.value_counts())
print(neuron.shape)
neuron = neuron[~mir_flag]
print(neuron.shape)

False    97388
True      3497
Name: gene, dtype: int64
(100885, 9)
(97388, 9)

neuron['pos_vs_neg_lfc_rep1'] = log2_fc(neuron, "CD8_pos_rep1", "CD8_neg_rep1")
neuron['pos_vs_neg_lfc_rep2'] = log2_fc(neuron, "CD8_pos_rep2", "CD8_neg_rep2")
neuron['pos_vs_neg_lfc_avg'] = neuron[['pos_vs_neg_lfc_rep1', 'pos_vs_neg_lfc_rep2']].mean(axis=1)
neuron.head()

neuron['gene'].value_counts()

random    8237
Hmga1      825
Nfatc2     807
Aire       632
Tcf3       530
          ... 
Mycs        10
Gm4736       5
Zfp628       4
Obox3        3
Obox5        2
Name: gene, Length: 786, dtype: int64

neg_ctrl_flag = (neuron['gene'] == 'random')

fig, ax = plt.subplots()
seaborn.kdeplot(neuron[neg_ctrl_flag]['pos_vs_neg_lfc_avg'], ax=ax, 
                label = 'negative control', color = 'darkred')
seaborn.kdeplot(neuron[~neg_ctrl_flag]['pos_vs_neg_lfc_avg'], ax=ax, 
                label = 'gene targeting', color = 'skyblue')
ax.legend()

<matplotlib.legend.Legend at 0x13897ce50>

The solution proposed by Salil Bhate was to calcaulte a score of the enrichment of the gene targeting guides compared to the negative controls.

import random 
import statistics

def enrich_score(scores, neg_ctrl_scores):
    assert(len(scores) == len(neg_ctrl_scores))
    return(statistics.mean(scores) - statistics.mean(neg_ctrl_scores))

def cal_enrich(gene_targeting_guides, neg_ctrl_guides, n_bootstraps = 10000):
    n_gene = len(gene_targeting_guides)
    samples_scores = []
    for i in range(n_bootstraps):
        neg_ctrl_sample = random.sample(neg_ctrl_guides, n_gene)
        samples_scores.append(enrich_score(gene_targeting_guides, neg_ctrl_sample))
    return statistics.mean(samples_scores)

g = 'Gm4736'
guides = neuron[neuron['gene'] == g]['pos_vs_neg_lfc_avg'].tolist()
print(len(guides))
print(guides)

neg_ctrl_guides = neuron[neuron['gene'] == 'random']['pos_vs_neg_lfc_avg'].tolist()
print(len(neg_ctrl_guides))

neg_ctrl_sample = random.sample(neg_ctrl_guides, len(guides))
print(len(neg_ctrl_sample))
print(neg_ctrl_sample)

cal_enrich(neuron[neuron['gene'] == g]['pos_vs_neg_lfc_avg'].tolist(),
           neuron[neuron['gene'] == 'random']['pos_vs_neg_lfc_avg'].tolist())

5
[-0.051638582979339986, -0.5709901880241777, 0.17943040819011102, 0.17943040819011102, 0.5476740973761256]
8237
5
[-1.5268412000817353, 3.888907026395401, 2.5909816179634557, 0.9260165549450683, -0.3309388391867021]

-0.2631524074368839

genes = neuron['gene'].unique().tolist()
genes.remove('random')
print(len(genes))
gene_scores = pd.DataFrame({'gene': genes})
gene_scores['score'] = \
  gene_scores['gene'].map(lambda g: cal_enrich(neuron[neuron['gene'] == g]['pos_vs_neg_lfc_avg'].tolist(),
                                               neuron[neuron['gene'] == 'random']['pos_vs_neg_lfc_avg'].tolist()))
gene_scores['n_guides'] = gene_scores['gene'].map(lambda g: sum(neuron['gene'] == g))
gene_scores.head()

785

gene_scores.sort_values(by = 'score', ascending = False).head(30)

gene_scores['score'].hist(bins = 25, color = 'black')

<AxesSubplot:>

gene_scores['score'].median()

-0.06582876582313937

We see a lot of these genes overlap with the original list, though it doesn't correspond exactly. But that's not totally unexpected. We also see that the median of the gene scores is slightly negative, which means that most gene scores are negative. We cannot assume there that a null gene has zero effect, relative to the negative controls. Instead most genes have a negative effect. Seems strange, but it can trip you up.

Sorry for the aside, let's return to the Gilbert data.

Combining guide level measurements¶

We now come to the problem of how to combine information at the guide level to make inferences at the gene level. Simple methods, such as taking the mean of the guide-level log fold changes will ignore issues such as non-working guides or off-target effects.

With these types of screens it's difficult to know the ground truth. We usually have to use proxy information in order to judge how good our inferences are, such as GO terms of the top hits. Essential screens are pretty well studied. Hart et al, 2014 compiled a list of common essential genes across a number of a number of different studies. We can use this as a "ground truth" to compare different methods.

essential_genes = pd.read_csv("core-essential-genes-sym_HGNCID.txt", sep = '\t', 
                              header = None, names = ['gene', 'hgnc'])
print(essential_genes.shape)
essential_genes.head()

(1580, 2)

sum(essential_genes['gene'].isin(gilbert_2014['gene']))

1561

Combining p-values¶

There's a few ways to combine p-values. Fisher's method, Stouffer's method, and robust rank aggregation (which MAGeCK uses). Let's compare them. But one thing to account for is that the default implementation of Stouffer's method in scipy.stats won't distinguish the two sided p-values. So we need to do it ourselves.

import scipy.stats as stats
import math

def stouffers_method(pvals, orig_z_vals):
    assert(pvals.shape == orig_z_vals.shape)
    # convert p-values to z_values
    # don't know why scipy uses point percent function to refer to the quantile function
    transformed_zvals = np.sign(orig_z_vals)*pvals.map(lambda p: stats.norm.ppf(p/2.0))
    z = sum(transformed_zvals)/math.sqrt(pvals.shape[0])
    # the absolute value of the z-value is half-normally distributed
    # P(x > |z|) = 1 - 2*(phi(|z|) - 0.5)
    return 1.0 - 2.0*(stats.norm.cdf(abs(z)) - 0.5)

# test if opposite signs cancel
def test_stouffers():
    test_pvals = pd.Series([0.05, 0.05])
    test_zvals = pd.Series([-2.16, 2.16])
    return stouffers_method(test_pvals, test_zvals)
test_stouffers()

1.0

gilbert_gene_lvl = pd.DataFrame({'gene': gilbert_2014['gene'].unique().tolist()})
gilbert_gene_lvl['essential_stouffer_pval'] = gilbert_gene_lvl['gene'].map(lambda g: stouffers_method(gilbert_2014[gilbert_2014['gene'] == g]['essential_moderated_adj_pval'], 
                                                                                                      gilbert_2014[gilbert_2014['gene'] == g]['essential_moderated_log2fc']))
gilbert_gene_lvl.head()

from scipy.stats import combine_pvalues

fishers_pvals = gilbert_2014[['gene', 'essential_moderated_adj_pval']].groupby('gene', as_index = False)\
                                                                      .agg(lambda p: combine_pvalues(p)[1])
fishers_pvals.rename(columns = {'essential_moderated_adj_pval': 'essential_fisher_pval'},
                     inplace = True)
fishers_pvals.head()

print(gilbert_gene_lvl.shape)
gilbert_gene_lvl = gilbert_gene_lvl.merge(fishers_pvals, on = 'gene', how = 'inner')
print(gilbert_gene_lvl.shape)
gilbert_gene_lvl.head()

(15976, 2)
(15976, 3)

# I noticed N/A's above.  Let's see if they're identical
print("number with stouffer N/A: ", sum(gilbert_gene_lvl['essential_stouffer_pval'].isna()))
print("number with fisher N/A: ", sum(gilbert_gene_lvl['essential_fisher_pval'].isna()))
print("number with both N/A: ", sum(gilbert_gene_lvl['essential_fisher_pval'].isna() & \
                                    gilbert_gene_lvl['essential_stouffer_pval'].isna()))

number with stouffer N/A:  1063
number with fisher N/A:  1063
number with both N/A:  1063

Cool! Looks like the missing p-values are the same for both method. Let's see if we can figure out why.

gilbert_gene_lvl[gilbert_gene_lvl['essential_stouffer_pval'].isna()].head()

gilbert_2014[gilbert_2014['gene'].isin(['ANGPT2', 'ARID1A', 'CD40LG'])]

Looks like I forgot to remove the entries with all zeros. Whoops. We can remove these rows are redo the calculations.

print(gilbert_2014.shape)
gilbert_2014_no_zeros = gilbert_2014[(gilbert_2014[['T0JEV', 'T0MH', 'treatedJEV', 'treatedMH', 'untreatedJEV', 
                                                    'untreatedMH']].apply(lambda row: sum(row), axis = 1) > 10)]
print(gilbert_2014_no_zeros.shape)
gilbert_gene_lvl = pd.DataFrame({'gene': gilbert_2014_no_zeros['gene'].unique().tolist()})
gilbert_gene_lvl['essential_stouffer_pval'] = gilbert_gene_lvl['gene'].map(lambda g: stouffers_method(gilbert_2014_no_zeros[gilbert_2014_no_zeros['gene'] == g]['essential_moderated_adj_pval'], 
                                                                                                      gilbert_2014_no_zeros[gilbert_2014_no_zeros['gene'] == g]['essential_moderated_log2fc']))
fishers_pvals = gilbert_2014_no_zeros[['gene', 'essential_moderated_adj_pval']].groupby('gene', as_index = False)\
                                                                      .agg(lambda p: combine_pvalues(p)[1])
fishers_pvals.rename(columns = {'essential_moderated_adj_pval': 'essential_fisher_pval'},
                     inplace = True)
gilbert_gene_lvl = gilbert_gene_lvl.merge(fishers_pvals, on = 'gene', how = 'inner')
print(gilbert_gene_lvl.shape)
print("number with stouffer N/A: ", sum(gilbert_gene_lvl['essential_stouffer_pval'].isna()))
print("number with fisher N/A: ", sum(gilbert_gene_lvl['essential_fisher_pval'].isna()))
print("number with both N/A: ", sum(gilbert_gene_lvl['essential_fisher_pval'].isna() & \
                                    gilbert_gene_lvl['essential_stouffer_pval'].isna()))

(209065, 19)
(207537, 19)
(15976, 3)
number with stouffer N/A:  0
number with fisher N/A:  0
number with both N/A:  0

print(gilbert_gene_lvl['essential_fisher_pval'].corr(gilbert_gene_lvl['essential_stouffer_pval']))
seaborn.scatterplot(x = 'essential_fisher_pval', y = 'essential_stouffer_pval', 
                    data = gilbert_gene_lvl, alpha = 0.5, color = 'black')

0.8828464352628952

<AxesSubplot:xlabel='essential_fisher_pval', ylabel='essential_stouffer_pval'>

It looks like when the Stouffer value is high, the Fisher value is also high. However, when the Fisher value is high it's not necessarily true that the Stouffer value is high. This is likely because of the issues I've discussed before (https://timydaley.github.io/FisherVsStouffer.html) with Fisher's method in that it doesn't reward signals with a consistent direction and it's susceptibility to off-target and outlier effects.

# top genes
gilbert_gene_lvl.sort_values(by = 'essential_fisher_pval', ascending = True).head(20)

gilbert_gene_lvl.sort_values(by = 'essential_stouffer_pval', ascending = True).head(20)

Now we can test the performance using the essential gene list from Hart et al 2014. This was generated from separate RNAi screens across multiple cell lines (probably including K562 because it's the most common cell line).

essential_genes = pd.read_csv("core-essential-genes-sym_HGNCID.txt", header = None, 
                              sep = '\t', names = ['gene', 'hgnc'])
essential_genes.head()

sum(essential_genes['gene'].isin(gilbert_gene_lvl['gene']))

1561

essential_gene_mask = (gilbert_gene_lvl['gene'].isin(essential_genes['gene']))
fig, ax = plt.subplots()
seaborn.distplot(gilbert_gene_lvl[essential_gene_mask]['essential_stouffer_pval'], 
                 ax = ax, color = 'darkred', label = 'stouffer', kde = False, bins = np.linspace(0, 1, 20))
seaborn.distplot(gilbert_gene_lvl[essential_gene_mask]['essential_fisher_pval'], 
                 ax = ax, color = 'skyblue', label = 'fisher', kde = False, bins = np.linspace(0, 1, 20))
ax.set_xlabel('gene-level p-value')
ax.legend()

<matplotlib.legend.Legend at 0x13ca0f5e0>

Looks like the fisher and stouffer agree on the clear signal genes, while stouffer shows lower p-values for a large number of genes that fisher is uncertain about. What about the "non-essential" genes? I put non-essential in quotes because we don't actually know if they are non-essential or not, but the majority should be.

fig, ax = plt.subplots()
seaborn.distplot(gilbert_gene_lvl[~essential_gene_mask]['essential_stouffer_pval'], 
                 ax = ax, color = 'darkred', label = 'stouffer', kde = False, bins = np.linspace(0, 1, 20))
seaborn.distplot(gilbert_gene_lvl[~essential_gene_mask]['essential_fisher_pval'], 
                 ax = ax, color = 'skyblue', label = 'fisher', kde = False, bins = np.linspace(0, 1, 20))
ax.set_xlabel('gene-level p-value')
ax.legend()

<matplotlib.legend.Legend at 0x13c79cd00>

Interesting that Fisher's method is very bimodal, while Stouffer's method is a lot more smooth. This is one reason why I prefer Stouffer's.

	sgId	sequence	gene	T0JEV	T0MH	treatedJEV	treatedMH	untreatedJEV	untreatedMH
0	Apoptosis+Cancer+Other_Cancer=A2M_+_9268488.25...	GACCAGATGGATTGTAGGGAGT	A2M	988	1114	813	1116	1131	1050
1	Apoptosis+Cancer+Other_Cancer=A2M_+_9268495.24...	GTTTGGGACCAGATGGATTGT	A2M	587	592	356	526	670	507
2	Apoptosis+Cancer+Other_Cancer=A2M_+_9268513.26...	GCCCAGTTGCTTTGGGAAGTGTT	A2M	843	909	818	721	1057	867
3	Apoptosis+Cancer+Other_Cancer=A2M_+_9268524.25...	GCAGCATAAAGCCCAGTTGCTT	A2M	572	655	848	797	855	690
4	Apoptosis+Cancer+Other_Cancer=A2M_+_9268659.23...	GGGATTCTATTTAGCCCGCC	A2M	431	417	349	396	394	411

	sgId	sequence	gene	T0JEV	T0MH	treatedJEV	treatedMH	untreatedJEV	untreatedMH	essential_lfc_JEV	essential_lfc_MH	ricin_lfc_MH	ricin_lfc_JEV
0	Apoptosis+Cancer+Other_Cancer=A2M_+_9268488.25...	GACCAGATGGATTGTAGGGAGT	A2M	988	1114	813	1116	1131	1050	0.054000	-0.054468	0.121453	-0.336574
1	Apoptosis+Cancer+Other_Cancer=A2M_+_9268495.24...	GTTTGGGACCAGATGGATTGT	A2M	587	592	356	526	670	507	0.049665	-0.192390	0.086561	-0.771189
2	Apoptosis+Cancer+Other_Cancer=A2M_+_9268513.26...	GCCCAGTTGCTTTGGGAAGTGTT	A2M	843	909	818	721	1057	867	0.185193	-0.037358	-0.232110	-0.230205
3	Apoptosis+Cancer+Other_Cancer=A2M_+_9268524.25...	GCAGCATAAAGCCCAGTTGCTT	A2M	572	655	848	797	855	690	0.438244	0.105803	0.241289	0.127353
4	Apoptosis+Cancer+Other_Cancer=A2M_+_9268659.23...	GGGATTCTATTTAGCCCGCC	A2M	431	417	349	396	394	411	-0.270010	0.009955	-0.019919	-0.035298

	sgId	sequence	gene	T0JEV	T0MH	treatedJEV	treatedMH	untreatedJEV	untreatedMH	essential_lfc_JEV	essential_lfc_MH	ricin_lfc_MH	ricin_lfc_JEV	essential_moderated_log2fc	essential_moderated_adj_pval
0	Apoptosis+Cancer+Other_Cancer=A2M_+_9268488.25...	GACCAGATGGATTGTAGGGAGT	A2M	988	1114	813	1116	1131	1050	0.054000	-0.054468	0.121453	-0.336574	-0.026463	0.999736
1	Apoptosis+Cancer+Other_Cancer=A2M_+_9268495.24...	GTTTGGGACCAGATGGATTGT	A2M	587	592	356	526	670	507	0.049665	-0.192390	0.086561	-0.771189	-0.087428	0.996625
2	Apoptosis+Cancer+Other_Cancer=A2M_+_9268513.26...	GCCCAGTTGCTTTGGGAAGTGTT	A2M	843	909	818	721	1057	867	0.185193	-0.037358	-0.232110	-0.230205	0.052726	0.999736
3	Apoptosis+Cancer+Other_Cancer=A2M_+_9268524.25...	GCAGCATAAAGCCCAGTTGCTT	A2M	572	655	848	797	855	690	0.438244	0.105803	0.241289	0.127353	0.251089	0.536598
4	Apoptosis+Cancer+Other_Cancer=A2M_+_9268659.23...	GGGATTCTATTTAGCCCGCC	A2M	431	417	349	396	394	411	-0.270010	0.009955	-0.019919	-0.035298	-0.156360	0.960393

	sgId	sequence	gene	T0JEV	T0MH	treatedJEV	treatedMH	untreatedJEV	untreatedMH	essential_lfc_JEV	essential_lfc_MH	ricin_lfc_MH	ricin_lfc_JEV	essential_moderated_log2fc	essential_moderated_adj_pval	ricin_moderated_log2fc	ricin_moderated_adj_pval
0	Apoptosis+Cancer+Other_Cancer=A2M_+_9268488.25...	GACCAGATGGATTGTAGGGAGT	A2M	988	1114	813	1116	1131	1050	0.054000	-0.054468	0.121453	-0.336574	-0.026463	0.999736	0.085196	0.984008
1	Apoptosis+Cancer+Other_Cancer=A2M_+_9268495.24...	GTTTGGGACCAGATGGATTGT	A2M	587	592	356	526	670	507	0.049665	-0.192390	0.086561	-0.771189	-0.087428	0.996625	-0.151280	0.974268
2	Apoptosis+Cancer+Other_Cancer=A2M_+_9268513.26...	GCCCAGTTGCTTTGGGAAGTGTT	A2M	843	909	818	721	1057	867	0.185193	-0.037358	-0.232110	-0.230205	0.052726	0.999736	-0.054219	0.990361
3	Apoptosis+Cancer+Other_Cancer=A2M_+_9268524.25...	GCAGCATAAAGCCCAGTTGCTT	A2M	572	655	848	797	855	690	0.438244	0.105803	0.241289	0.127353	0.251089	0.536598	0.358094	0.832268
4	Apoptosis+Cancer+Other_Cancer=A2M_+_9268659.23...	GGGATTCTATTTAGCCCGCC	A2M	431	417	349	396	394	411	-0.270010	0.009955	-0.019919	-0.035298	-0.156360	0.960393	0.150326	0.973473

	guide	gene	distance2TSS	plasmid_rep1	plasmid_rep2	CD8_pos_rep1	CD8_neg_rep1	CD8_pos_rep2	CD8_neg_rep2
0	GGGTATCTGGACTAGAAGCC	Sox17	128	249	400	50	196	3	21
1	GTTTGAGTGTGATACCGCTG	Sox17	170	197	287	190	286	363	582
2	GTATCACACTCAAACAGTGC	Sox17	178	539	517	1462	2171	111	193
3	GACCTGAGGCTCGAAAGGCT	Sox17	241	198	278	604	76	11	21
4	GCTGTGCCCGTAACCATTTG	Sox17	291	372	455	184	688	223	159

	guide	gene	distance2TSS	plasmid_rep1	plasmid_rep2	CD8_pos_rep1	CD8_neg_rep1	CD8_pos_rep2	CD8_neg_rep2	pos_vs_neg_lfc_rep1	pos_vs_neg_lfc_rep2	pos_vs_neg_lfc_avg
0	GGGTATCTGGACTAGAAGCC	Sox17	128	249	400	50	196	3	21	-1.661383	-2.397803	-2.029593
1	GTTTGAGTGTGATACCGCTG	Sox17	170	197	287	190	286	363	582	-0.299234	-0.617929	-0.458581
2	GTATCACACTCAAACAGTGC	Sox17	178	539	517	1462	2171	111	193	-0.281850	-0.730929	-0.506390
3	GACCTGAGGCTCGAAAGGCT	Sox17	241	198	278	604	76	11	21	3.262249	-0.812840	1.224704
4	GCTGTGCCCGTAACCATTTG	Sox17	291	372	455	184	688	223	159	-1.608735	0.547056	-0.530840

	gene	score	n_guides
0	Sox17	-0.347266	252
1	Mybl1	-0.426841	48
2	Ncoa2	-0.432888	72
3	Msc	0.169258	60
4	Arid5a	-0.015604	237

	gene	score	n_guides
216	Obox5	2.404559	2
86	Snai1	1.029969	82
205	Zfp239	0.926932	40
6	Pou3f3	0.874267	60
201	Pcbp1	0.818536	40
280	Zfp423	0.815143	138
212	Zfp628	0.812128	4
500	Pcbp2	0.794597	65
572	Foxd4	0.763076	38
604	Pax3	0.731000	141
124	Foxd3	0.723851	78
407	Sox11	0.723564	62
42	Ptf1a	0.667576	96
409	Twist1	0.633802	82
541	Zbtb12	0.611154	44
171	Foxk1	0.592274	34
135	Nfyc	0.571604	93
556	Pura	0.568692	37
342	Ascl1	0.566426	78
475	Zic2	0.548996	127
454	Foxd1	0.521924	82
82	Tgif2	0.519055	258
542	Pou5f1	0.517626	169
23	Myog	0.514009	110
391	Ikzf3	0.508076	31
563	Smad7	0.481705	114
239	Myod1	0.480629	148
112	Pou3f2	0.475212	90
444	Neurog1	0.474284	116
645	Nr1h2	0.473144	36

	gene	hgnc
0	AAMP	HGNC:18
1	AARS2	HGNC:21022
2	AARS	HGNC:20
3	AASDHPPT	HGNC:14235
4	AATF	HGNC:19235

	gene	essential_stouffer_pval
0	A2M	8.509142e-01
1	A4GALT	6.993867e-01
2	AACS	3.940236e-01
3	AATF	7.682743e-14
4	ABCA13	7.684682e-01

	gene	essential_fisher_pval
0	1-Dec	0.999999
1	1-Mar	1.000000
2	1-Sep	0.995081
3	10-Mar	NaN
4	10-Sep	1.000000

	gene	essential_stouffer_pval	essential_fisher_pval
0	A2M	8.509142e-01	1.000000e+00
1	A4GALT	6.993867e-01	1.000000e+00
2	AACS	3.940236e-01	9.959100e-01
3	AATF	7.682743e-14	1.081587e-27
4	ABCA13	7.684682e-01	1.000000e+00

	gene	essential_stouffer_pval	essential_fisher_pval
139	ANGPT2	NaN	NaN
187	ARID1A	NaN	NaN
453	CD40LG	NaN	NaN
821	DUX4	NaN	NaN
871	ELN	NaN	NaN

	sgId	sequence	gene	T0JEV	T0MH	treatedJEV	treatedMH	untreatedJEV	untreatedMH	essential_lfc_JEV	essential_lfc_MH	ricin_lfc_MH	ricin_lfc_JEV	essential_moderated_log2fc	essential_moderated_adj_pval	ricin_moderated_log2fc	ricin_moderated_adj_pval	ricin_lfc_avg	essential_lfc_avg
1718	Apoptosis+Cancer+Other_Cancer=ANGPT2_+_6420678...	GACACAGCCCTCCCAAGTGAGC	ANGPT2	625	689	641	569	738	704	0.098580	0.061840	-0.273075	-0.063802	0.054315	0.999736	0.012386	0.997272	-0.168438	0.080210
1719	Apoptosis+Cancer+Other_Cancer=ANGPT2_+_6420729...	GGATCTGGGGAGAGAGGAACAA	ANGPT2	642	585	346	587	783	622	0.145203	0.119145	-0.049830	-1.036719	0.108920	0.987145	-0.327465	0.917055	-0.543274	0.132174
1720	Apoptosis+Cancer+Other_Cancer=ANGPT2_+_6420737...	GACACTGTAGGATCTGGGGAGAG	ANGPT2	470	567	744	298	571	670	0.139456	0.271235	-1.132581	0.520425	0.184226	0.897740	0.016118	0.997709	-0.306078	0.205346
1721	Apoptosis+Cancer+Other_Cancer=ANGPT2_+_6420746...	GATACTGACACTGTAGGATC	ANGPT2	675	930	652	448	819	890	0.137769	-0.032542	-0.955124	-0.189341	0.017666	0.999736	-0.370208	0.870948	-0.572233	0.052613
1722	Apoptosis+Cancer+Other_Cancer=ANGPT2_+_6420752...	GATTCGGATACTGACACTGT	ANGPT2	1807	1810	1582	1603	2188	1974	0.135046	0.155879	-0.266592	-0.328411	0.119888	0.784776	-0.120716	0.963820	-0.297502	0.145463
1723	Apoptosis+Cancer+Other_Cancer=ANGPT2_+_6420789...	GCTCTTATCAACTTATCATATA	ANGPT2	0	0	0	0	0	0	-0.140832	0.030813	0.033586	0.139199	NaN	NaN	NaN	NaN	0.086393	-0.055009
1724	Apoptosis+Cancer+Other_Cancer=ANGPT2_-_6420667...	GTAAACTGTCAGATTGCAG	ANGPT2	709	760	628	568	813	906	0.056378	0.284019	-0.639088	-0.232769	0.148588	0.901040	-0.260508	0.915055	-0.435928	0.170199
1725	Apoptosis+Cancer+Other_Cancer=ANGPT2_-_6420691...	GAAGAACAGTCCTGCTCACTT	ANGPT2	395	392	314	588	458	469	0.072162	0.288945	0.359193	-0.403943	0.155260	0.956851	0.218926	0.963503	-0.022375	0.180553
1726	Apoptosis+Cancer+Other_Cancer=ANGPT2_-_6420832...	GATAAGAGCAGCCAGACATGTGTAG	ANGPT2	1204	1283	843	1148	1221	1077	-0.120621	-0.221475	0.125608	-0.394730	-0.195593	0.447121	0.056322	0.990337	-0.134561	-0.171048
1727	Apoptosis+Cancer+Other_Cancer=ANGPT2_-_6420833...	GAGCAGCCAGACATGTGTAGT	ANGPT2	1313	1269	1010	1188	1442	1313	-0.005726	0.079950	-0.110630	-0.374089	0.010341	0.999736	-0.061964	0.987661	-0.242360	0.037112
2317	Apoptosis+Cancer+Other_Cancer=ARID1A_+_2702251...	GACTCTCCGCTTTCTGCTGC	ARID1A	745	762	539	441	736	843	-0.158343	0.176373	-0.899610	-0.309506	-0.011429	0.999736	-0.424945	0.806955	-0.604558	0.009015
2318	Apoptosis+Cancer+Other_Cancer=ARID1A_+_2702262...	GCTCTGTAGGCTCGGGACC	ARID1A	34	61	59	77	43	47	0.189317	-0.338420	0.734026	0.586658	-0.145774	0.999736	0.855925	0.871925	0.660342	-0.074552
2319	Apoptosis+Cancer+Other_Cancer=ARID1A_+_2702272...	Gctcctctcccccccgccccgc	ARID1A	171	176	207	201	232	180	0.297090	0.063054	0.191952	-0.024547	0.163607	0.991712	0.253817	0.960213	0.083702	0.180072
2320	Apoptosis+Cancer+Other_Cancer=ARID1A_+_2702278...	Ggctgccctccccacccgccc	ARID1A	0	0	0	0	0	0	-0.140832	0.030813	0.033586	0.139199	NaN	NaN	NaN	NaN	0.086393	-0.055009
2321	Apoptosis+Cancer+Other_Cancer=ARID1A_+_2702286...	GTCTCGTCCGCGGAGAGACC	ARID1A	342	358	322	702	375	413	-0.008308	0.236460	0.797480	-0.079999	0.091994	0.997765	0.634014	0.777153	0.358741	0.114076
2322	Apoptosis+Cancer+Other_Cancer=ARID1A_+_2702292...	GCGGCGGGTTGCCCAGGCTGC	ARID1A	511	532	410	1170	520	515	-0.115692	-0.015951	1.215884	-0.202946	-0.091562	0.993440	0.866134	0.615286	0.506469	-0.065822
2323	Apoptosis+Cancer+Other_Cancer=ARID1A_+_2702292...	GGCGGCGGCGGCGGGTTGCCC	ARID1A	247	228	308	179	324	269	0.249268	0.268425	-0.551376	0.066367	0.234704	0.923972	-0.013157	0.997851	-0.242505	0.258847
2324	Apoptosis+Cancer+Other_Cancer=ARID1A_+_2702295...	GGCTTTCTTCAGCTCCGAGGG	ARID1A	790	1013	854	480	1092	910	0.325712	-0.123721	-0.887828	-0.215098	0.072590	0.996342	-0.314418	0.920233	-0.551463	0.100995
2325	Apoptosis+Cancer+Other_Cancer=ARID1A_+_2702305...	GGGGCCCTCGCTTTCCTGCC	ARID1A	541	510	379	496	438	393	-0.444904	-0.344314	0.368636	-0.069022	-0.422776	0.091847	0.337701	0.899128	0.149807	-0.394609
2326	Apoptosis+Cancer+Other_Cancer=ARID1A_+_2702309...	GTCCTGCAGCTCCTTTCCCAG	ARID1A	957	879	1023	899	1150	986	0.123959	0.196360	-0.099539	-0.029473	0.133119	0.920551	0.114926	0.973934	-0.064506	0.160159
2327	Apoptosis+Cancer+Other_Cancer=ARID1A_-_2702254...	GCAGAAAGCGGAGAGTCACAG	ARID1A	1399	1444	1931	2323	1595	1436	0.048202	0.022804	0.727136	0.414834	0.010417	0.999736	0.752902	0.088016	0.570985	0.035503
2328	Apoptosis+Cancer+Other_Cancer=ARID1A_-_2702259...	GCCCCCCTCATTCCCAGGCAA	ARID1A	518	503	624	588	531	669	-0.105140	0.441551	-0.152307	0.371629	0.154616	0.959416	0.274976	0.921261	0.109661	0.168205
2329	Apoptosis+Cancer+Other_Cancer=ARID1A_-_2702262...	GGGGAATGAGCCGGGAGAGCC	ARID1A	1040	1162	1219	1159	1219	1153	0.088079	0.019605	0.041068	0.139199	0.027664	0.999736	0.268680	0.877806	0.090134	0.053842
2330	Apoptosis+Cancer+Other_Cancer=ARID1A_-_2702272...	GccAGCCCGGAGCCTGAGCCgg	ARID1A	859	951	825	1272	1001	910	0.079642	-0.032697	0.516296	-0.139469	-0.002022	0.999843	0.395721	0.825335	0.188413	0.023473
2331	Apoptosis+Cancer+Other_Cancer=ARID1A_-_2702276...	Gagcgagcgcagcgcagcag	ARID1A	896	1142	761	769	988	1069	0.000031	-0.064401	-0.441094	-0.236980	-0.061360	0.998586	-0.164600	0.956437	-0.339037	-0.032185
2332	Apoptosis+Cancer+Other_Cancer=ARID1A_-_2702279...	Gcgaggcccgcccgggcgggt	ARID1A	442	461	301	1127	520	628	0.093145	0.475981	0.876221	-0.647535	0.268799	0.664981	0.565885	0.882205	0.114343	0.284563
2333	Apoptosis+Cancer+Other_Cancer=ARID1A_-_2702289...	GGACGAGACAGCGGGGATCA	ARID1A	1150	1288	1470	1121	1339	1321	0.078513	0.067283	-0.203063	0.273764	0.046741	0.999165	0.228261	0.924375	0.035350	0.072898
2334	Apoptosis+Cancer+Other_Cancer=ARID1A_-_2702304...	GgcCGAGCGCGGGGAAATGA	ARID1A	843	791	466	697	1077	919	0.212211	0.246947	-0.364821	-1.067663	0.203918	0.650802	-0.516246	0.711782	-0.716242	0.229579
2335	Apoptosis+Cancer+Other_Cancer=ARID1A_-_2702308...	GGAAAGCGAGGGCCCCGCCG	ARID1A	480	491	598	396	472	533	-0.165028	0.148995	-0.394114	0.479915	-0.029334	0.999736	0.249278	0.947266	0.042900	-0.008017
2336	Apoptosis+Cancer+Other_Cancer=ARID1A_-_2702313...	GGACGGGGCCGAGAGCAATG	ARID1A	759	754	681	503	910	735	0.120620	-0.005958	-0.512696	-0.278480	0.036317	0.999736	-0.204848	0.953583	-0.395588	0.057331
5576	Apoptosis+Cancer+Other_Cancer=CD40LG_+_1357303...	GTCAAAGTGGCCACCTTACTC	CD40LG	548	593	351	357	558	730	-0.114790	0.330222	-0.996326	-0.528073	0.099328	0.994684	-0.603928	0.655443	-0.762199	0.107716
5577	Apoptosis+Cancer+Other_Cancer=CD40LG_+_1357303...	GTTAAAGTTGAAATGGTATCTTC	CD40LG	734	885	1060	667	824	866	0.025818	-0.000461	-0.342598	0.502158	-0.014592	0.999736	0.297952	0.915055	0.079780	0.012678
5578	Apoptosis+Cancer+Other_Cancer=CD40LG_+_1357304...	GCCGCAGATCGGGGAGAAGTT	CD40LG	892	1015	1607	486	1049	1028	0.092826	0.049156	-1.045663	0.754078	0.044438	0.999736	0.284075	0.958708	-0.145793	0.070991
5579	Apoptosis+Cancer+Other_Cancer=CD40LG_+_1357304...	GTCCAGTGGCCGCAGATCG	CD40LG	674	768	433	468	831	714	0.160864	-0.074227	-0.574769	-0.799689	0.018987	0.999736	-0.512502	0.673486	-0.687229	0.043319
5580	Apoptosis+Cancer+Other_Cancer=CD40LG_+_1357305...	GGACCCCAAATTTAGTTAATAA	CD40LG	574	725	854	649	707	692	0.159356	-0.036301	-0.058829	0.411375	0.030436	0.999736	0.369766	0.848349	0.176273	0.061527
5581	Apoptosis+Cancer+Other_Cancer=CD40LG_-_1357303...	GCTCTTAACTAATCCTGAGTA	CD40LG	0	0	0	0	0	0	-0.140832	0.030813	0.033586	0.139199	NaN	NaN	NaN	NaN	0.086393	-0.055009
5582	Apoptosis+Cancer+Other_Cancer=CD40LG_-_1357305...	GCTGTGTATCTTCATAGAAGGT	CD40LG	611	699	933	733	748	743	0.150602	0.118761	0.014064	0.457656	0.108243	0.973946	0.425928	0.776571	0.235860	0.134682
5583	Apoptosis+Cancer+Other_Cancer=CD40LG_-_1357305...	GTATCTTCATAGAAGGTTGGACA	CD40LG	676	802	542	1108	712	841	-0.066085	0.099234	0.430953	-0.253750	-0.003657	0.999843	0.342783	0.917059	0.088602	0.016574
5584	Apoptosis+Cancer+Other_Cancer=CD40LG_-_1357306...	GGTTGGTTCTACCCAAATGA	CD40LG	342	366	409	327	370	393	-0.027621	0.133229	-0.230914	0.283404	0.029023	0.999736	0.212633	0.959365	0.026245	0.052804
5585	Apoptosis+Cancer+Other_Cancer=CD40LG_-_1357306...	GTTCTACCCAAATGATGGATGA	CD40LG	410	445	718	515	622	525	0.459262	0.268832	0.005895	0.345959	0.342030	0.270834	0.373413	0.871294	0.175927	0.364047

	gene	essential_stouffer_pval	essential_fisher_pval
11197	SUPV3L1	0.0	0.0
1741	NBAS	0.0	0.0
15162	METTL16	0.0	0.0
4245	METAP2	0.0	0.0
6708	RTF1	0.0	0.0
3961	HSD17B10	0.0	0.0
8815	ABCB7	0.0	0.0
6428	ORC5	0.0	0.0
12114	CLNS1A	0.0	0.0
11227	TBCD	0.0	0.0
5751	EIF6	0.0	0.0
7185	ZCCHC9	0.0	0.0
11029	SHFM1	0.0	0.0
12883	MTPAP	0.0	0.0
12854	MRPS27	0.0	0.0
5773	ESF1	0.0	0.0
3549	DARS	0.0	0.0
1647	MNAT1	0.0	0.0
1627	MIS12	0.0	0.0
13358	RPL32	0.0	0.0

	gene	essential_fisher_pval
1676	MRPS35	2.133178e-166
765	DHX16	6.340320e-118
4335	NARS2	7.454323e-33
10724	PRPF8	1.646771e-111
768	DICER1	2.090871e-152
10722	PRPF3	1.430806e-155
11793	ACTR10	1.286949e-80
772	DIS3	2.115976e-36
10710	PPWD1	8.043431e-91
8689	TMEM261	6.947308e-28
9271	CENPC	3.202081e-174
785	DLST	4.991865e-78
9269	CENPA	1.263108e-175
2213	PTTG1	4.133527e-74
10702	PPIE	4.294164e-87
789	DNA2	3.718237e-141
13948	XRN2	2.024498e-64
14893	GPN2	1.184472e-196
5252	TRAPPC5	1.548175e-149
6708	RTF1	0.000000e+00