There is an endemic problem today in Bioinformatics. With Deep Learning$^{\text{TM}}$ becoming the next Hot New Thing, the Bioinformatics community has worked quickly to catch up. However, I think some important lessons from the Machine Learning community have not filtered through to the Bioinformatics community, specifically the big issue of how to choose a test set.

What purpose does a test set serve?

The fundamental purpose of a test set is a fair and honest evaluation of the performance of the trained model. In other words, the test set is used to answer the question: if we train the model and use it on future data, how well can we expect the model to perform? The honesty part is absolutely critical, because otherwise we're going to overpromise the performance. This is why the machine learning community hammers on the concept of a proper test set, and it also guides the choice of the test set. I'll illustrate this with some examples.

Examples

Suppose we want to build a model to predict stock prices. If we train the model today, then we would use it to predict prices for tomorrow. In particular, no information from tomorrow will be seen by the model during training. So if stocks A & B are correlated, say they're both in the same industry and will be similarly affected by underlying economic conditions, then we cannot use the fact that stock A is up tomorrow to predict that stock B will also be up tomorrow (except in a Granger sense, meaning we can use the current rise in A to predict a likely rise of B in the future). The same goes for an economic event that causes a general shift in stock prices: that shift would be apparent if you saw 80% of future prices, but not if you saw none of them. Therefore the proper split to simulate this behavior is a time-based split. If we don't do this, then we will fool ourselves about the performance of the algorithm, and we might deploy an algorithm that under-performs and loses us money.
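As a concrete sketch (not from any particular trading setup), a time-based split with scikit-learn might look like the following; X and y are placeholder arrays standing in for features and prices that are already sorted by date.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import TimeSeriesSplit

# placeholder data: rows are ordered by date, which is what makes a time-based split meaningful
rng = np.random.default_rng(0)
X = rng.normal(size = (500, 10))
y = rng.normal(size = 500)

# each fold trains only on indices that come strictly before the test indices,
# so no information from the "future" leaks into training
for train_idx, test_idx in TimeSeriesSplit(n_splits = 5).split(X):
    model = Ridge().fit(X[train_idx], y[train_idx])
    print(r2_score(y[test_idx], model.predict(X[test_idx])))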

Or consider the following example from r/MachineLearning: a Nature paper proposed a deep neural network to predict the locations of aftershocks. If we train a model today and an earthquake happens tomorrow, then what information is available to predict the locations of the subsequent aftershocks? We can use past earthquakes and their aftershocks, as well as information we get from the initial earthquake. What we don't get is the aftershocks of the current earthquake. Therefore, a proper split would be either by time (as above) or by earthquakes grouped with their aftershocks. As the linked post shows, doing the latter type of split results in a simple regularized regression having better test set performance than the deep neural network, which indicates that the deep neural network is over-fitting.
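A minimal sketch of the grouped version of this split, assuming each observation is labeled with the ID of the earthquake it belongs to (the arrays below are placeholders):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold

# placeholder data: groups[i] is the ID of the earthquake that observation i belongs to
rng = np.random.default_rng(0)
X = rng.normal(size = (600, 12))
y = rng.normal(size = 600)
groups = rng.integers(0, 50, size = 600)

# all observations from a given earthquake land entirely in train or entirely in test,
# so the model never sees aftershocks of the earthquakes it is evaluated on
for train_idx, test_idx in GroupKFold(n_splits = 5).split(X, y, groups = groups):
    model = Ridge().fit(X[train_idx], y[train_idx])
    print(model.score(X[test_idx], y[test_idx]))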

Now consider examples from bioinformatics. Suppose we want to build a model to predict which guide RNAs are going to be effective. If we want to apply the model to help design guides for a new experiment, then we typically would not have access to a previous experiment in the same cell type and target phenotype (the one we are trying to select for). If we did, then we could just use that experiment to select which guides to use. In particular, if we use the same experiment both to train the model and to evaluate it, then there will be several confounders, such as batch effects, which will make us over-confident in our predictions. One paper clearly showed this with an out-of-sample test set (hidden in the supplementary material), where a simple regularized regression showed better performance than their proposed deep learning model.

Now consider the problem of predicting gene expression from other modalities, such as the promoter genetic sequence plus the open chromatin of the particular cell type. If we want to deploy the model, then we would take the model, the genetic sequence, and the open chromatin data and predict the gene expression of a sample for which we have no gene expression data, such as a new patient sample. The key here is that there is a lot of biological variation (cell type to cell type or person to person) and there are batch effects present. These effects account for a huge part of the variation, but they won't be available to the model in production. Therefore, if the model is able to "see" those batch effects (say, through a simple train-test split), we will overestimate the accuracy of the model.

Example: guide RNA design

To clearly illustrate how this issue arises we'll use the third example above. Let's say we want to build a model to predict the on-target activity of CRISPRko (CRISPR knockout) guides. To train the model we'll use the Toronto Knockout Library dataset, a collection of CRISPRko gene essentiality experiments on 5 different cell lines. To remove the bias of biological effect and the bias of using the training data to select positive hit genes, we'll subset the training data to previously known essential genes (from http://www.ncbi.nlm.nih.gov/pubmed/24987113).

First we'll have to process the counts to convert them to log fold changes. We'll do this using all guides.

Preprocessing


# this was done in R outside and is not run in this notebook
tko_loc = '/Users/tim.daley/blog/timydaley.github.io/crispr_tko/'
libs = c("DLD1", "GBM", "HCT116_1", "HeLa", "RPE1")
df_list = list()
for(l in libs){
  loc = paste0(tko_loc, "readcount-", l, "-lib1")
  x = read.table(loc, header = T)
  df_list[[l]] = x
}
for(l in libs){
  df_list[[l]]["SEQ"] = sapply(df_list[[l]]$GENE_CLONE, function(s) unlist(strsplit(s, "_"))[2])
}
design_matrices = list()
counts_list = list()
# we need custom design matrices for each experiment because the designs are not identical
# DLD1
counts_list[["DLD1"]] = df_list[["DLD1"]][c("DLD_T0", "DLD_ETOH_R1", "DLD_ETOH_R2", "DLD_ETOH_R3")]
design_matrices[["DLD1"]] = data.frame(condition = c(0, 1, 1, 1), row.names = colnames(counts_list[["DLD1"]]))
# GBM
counts_list[["GBM"]] = df_list[["GBM"]][c("T0", "T21A", "T21B")]
design_matrices[["GBM"]] = data.frame(condition = c(0, 1, 1), row.names = colnames(counts_list[["GBM"]]))
# HCT116_1
counts_list[["HCT116_1"]] = df_list[["HCT116_1"]][c("LIB1_T0", "LIB1_T18_A", "LIB1_T18_B")]
design_matrices[["HCT116_1"]] = data.frame(condition = c(0, 1, 1), row.names = colnames(counts_list[["HCT116_1"]]))
# HeLa
counts_list[["HeLa"]] = df_list[["HeLa"]][c("T0", "T18A", "T18B", "T18C")]
design_matrices[["HeLa"]] = data.frame(condition = c(0, 1, 1, 1), row.names = colnames(counts_list[["HeLa"]]))
# RPE1
counts_list[["RPE1"]] = df_list[["RPE1"]][c("T0", "T18A", "T18B")]
design_matrices[["RPE1"]] = data.frame(condition = c(0, 1, 1), row.names = colnames(counts_list[["RPE1"]]))
# now compute log2 fold changes
log2fc_list = list()
for(l in libs){
  d = DESeq2::DESeqDataSetFromMatrix(countData = counts_list[[l]],
                                     colData = design_matrices[[l]],
                                     design = ~condition)
  d = DESeq2::DESeq(d)
  d = DESeq2::results(d)
  log2fc_list[[l]] = data.frame(d, seq = df_list[[l]]$SEQ, gene = df_list[[l]]$GENE)
}
# now subset to known positive genes
#essential_genes = factor(scan(paste0(tko_loc, "ConstitutiveCoreEssentialGenes.txt"), what = character()))
essential_genes = read.table(file = paste0(tko_loc, "reference_essentials_and_nonessentials_sym_hgnc_entrez/constitutive_core_essentials_hg-Table1.tsv"), header = T)
#sum(essential_genes$Gene %in% factor(df_list[["DLD1"]]$GENE))

# what we really want is a table with log2fc, guide sequence, gene, and cell type
log2fc = data.frame()
for(l in libs){
  log2fc = rbind(log2fc, data.frame(log2fc_list[[l]][c("seq", "gene", "log2FoldChange")], lib = l))
}
log2fc['essential'] = 1*(log2fc$gene %in% essential_genes$Gene)
write.table(log2fc, file = paste0(tko_loc, "CombinedLog2FoldChanges.txt"), quote = F, sep = '\t', row.names = F)

OK, so the T's at the end of the guides appear to be missing from the data. Let's re-index to add those back in.
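A minimal sketch of this step, assuming the combined table written above is loaded into a pandas DataFrame and that each stored sequence is simply missing its final T:

import pandas as pd

# load the combined table produced by the R preprocessing step above
# (the path is shown without the directory prefix used there)
log2fc = pd.read_table("CombinedLog2FoldChanges.txt")

# append the trailing T that appears to be missing from the stored guide sequences
log2fc["seq"] = log2fc["seq"].astype(str) + "T"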

To control for variable gene effect sizes I'll include a gene indicator.
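A sketch of what the resulting feature matrix might look like, continuing from the log2fc DataFrame above: one-hot encoded guide nucleotides plus a per-gene dummy variable as the gene indicator (the exact featurization used in the notebook may differ).

import numpy as np
import pandas as pd

def one_hot_seq(seq):
    # encode a guide sequence as a flat 0/1 vector, one column per (position, nucleotide)
    lookup = {"A": 0, "C": 1, "G": 2, "T": 3}
    mat = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        mat[i, lookup[base]] = 1
    return mat.flatten()

# drop guides whose fold change is undefined (e.g. filtered by DESeq2)
log2fc = log2fc.dropna(subset = ["log2FoldChange"]).reset_index(drop = True)

seq_features = np.vstack([one_hot_seq(s) for s in log2fc["seq"]])
gene_indicator = pd.get_dummies(log2fc["gene"]).to_numpy()  # the gene indicator
X = np.hstack([seq_features, gene_indicator])
y = log2fc["log2FoldChange"].to_numpy()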

Simple train-test split

First, let's look at a simple train-test split. Since there are 5 libraries/data sources, I'll use a 20% test set size. Note that the data are ordered by library with an equal number of guides per library, so an unshuffled 5-fold CV split would coincide with a split by library. To avoid that here, I'll shuffle the data frame before computing the CV scores.
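A minimal sketch of this split, using the X and y built in the sketch above (the number of trees and random seeds are placeholders):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# shuffling breaks the library ordering, so each 20% test fold mixes all 5 libraries
shuffled_cv = KFold(n_splits = 5, shuffle = True, random_state = 0)
rf = RandomForestRegressor(n_estimators = 100, random_state = 0)
scores = cross_val_score(rf, X, y, cv = shuffled_cv)  # default scoring is R^2
print(scores, scores.mean())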

Split by library

Now let's take a look at what happens when you split by library. Note that since the libraries are in order and there are an equal number of guides per library, a standard (unshuffled) 5-fold cross validation splits exactly by library.
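A minimal sketch of the by-library split; passing the library labels as groups makes the leave-one-library-out structure explicit and is equivalent to the unshuffled 5-fold split described above (again using the X, y, and log2fc from the earlier sketches):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# each fold holds out one entire library, so batch effects in the held-out library are unseen at training time
rf = RandomForestRegressor(n_estimators = 100, random_state = 0)
scores = cross_val_score(rf, X, y, groups = log2fc["lib"], cv = LeaveOneGroupOut())
print(scores, scores.mean())  # one score per held-out library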

Interpretation

Note that the average $R^{2}$ (the default score for RandomForestRegressor) is lower when we split by library. In addition, the variance across folds is higher: some libraries are predicted very well (e.g. DLD1 and GBM) and some are predicted very poorly (HeLa). The latter was noted in a previous project I was involved in with Sunil Bodapati.

The order of the cell types is as follows: DLD1, GBM, HCT116_1, HeLa, RPE1.

Note that the first two have the highest test set scores. It seems reasonable that DLD1 and HCT116 would be highly predictive of each other, since they are similar cell types. And it is reasonable that HeLa is very difficult to predict, since the karyotype of HeLa is completely haywire. I really have no hypotheses about the good test score of GBM. Critically, what we're missing is the metadata, such as the specific experimental design and who prepared the libraries. In my experience, such details are crucial to evaluating the quality of a sequencing-based experiment. When researchers outside the organization use a publicly available ML tool, the batch effects will be new (to the ML tool).