Fast adaptive spectral clustering in R: brain cancer RNA-seq

Spectral clustering refers to a family of algorithms that cluster the eigenvectors derived from a matrix representing the input data's graph. An important step in this method is applying a kernel function to the input data to generate an N×N similarity matrix, or graph (where N is the number of input observations). Subsequent steps include computing the normalised graph Laplacian from this similarity matrix, obtaining the eigensystem of the Laplacian, and finally applying k-means to the top K eigenvectors to obtain K clusters. Clustering in this way adds flexibility in the range of data that may be analysed, and spectral clustering will often outperform k-means. It is an excellent option for image and bioinformatic cluster analysis, including single-platform and multi-omics data.
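
As a minimal sketch of those steps in base R (a plain Gaussian-kernel version for illustration only, not the Spectrum implementation; X is assumed to be a numeric matrix with observations in rows):

## minimal sketch of vanilla spectral clustering (illustration only, not Spectrum)
spectral_sketch <- function(X, K, sigma = 1){
  D2 <- as.matrix(dist(X))^2                      # pairwise squared Euclidean distances
  A <- exp(-D2/(2*sigma^2)); diag(A) <- 0         # N x N Gaussian similarity matrix (the graph)
  d <- rowSums(A)
  L <- diag(1/sqrt(d)) %*% A %*% diag(1/sqrt(d))  # symmetrically normalised graph (Ng-Jordan-Weiss form)
  V <- eigen(L, symmetric = TRUE)$vectors[, 1:K]  # top K eigenvectors
  V <- V/sqrt(rowSums(V^2))                       # row-normalise before clustering
  kmeans(V, K)$cluster                            # final K clusters
}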

‘Spectrum’ is a fast adaptive spectral clustering algorithm for R that I developed at QMUL together with David Watson from Oxford. It contains both novel methodological advances and implementations of pre-existing methods. Spectrum has the following features: 1) a new density-aware kernel that increases similarity between observations that share common nearest neighbours; 2) a tensor product graph data integration and noise reduction system; 3) the eigengap method to decide on the number of clusters; 4) Gaussian mixture modelling for the final clustering of the eigenvectors; 5) an implementation of Fast Approximate Spectral Clustering for very big datasets; 6) data imputation for multi-view analyses. Spectrum has recently been published as an article in Bioinformatics. It is available to download from CRAN: https://cran.r-project.org/web/packages/Spectrum/index.html.

In this demonstration, we are going to use Spectrum to cluster brain cancer RNA-seq data to find distinct patient groups with different survival times. The test data for the analysis can be downloaded here: braincancer_test_data. The data can also be accessed through Synapse (https://www.synapse.org/#!Synapse:syn18911542/files/).

## load libraries required for analysis
library(Spectrum)
library(plot3D)
library(survival)
library(survminer)
library(scales)

The next block of code runs Spectrum, then does a t-SNE analysis to visualise the clusters embedded in a lower dimensional space.

I have found 3D t-SNE works well for seeing the complex patterns of Gaussian clusters that can occur in omic data (as in this cancer data), compared with PCA, which is better suited to simpler structure and outlier detection.

## run Spectrum
r <- Spectrum(brain[[1]])
## do t-sne analysis of results
y <- Rtsne::Rtsne(t(brain[[1]]), dims = 3)
## 3D scatter plot coloured by cluster assignment
scatter3D(y$Y[,1], y$Y[,2], y$Y[,3], phi = 0,
          ticktype = "detailed", colvar = r$assignments,
          col = gg.col(100), pch = 20, cex = 2,
          xlab = 'Dim1', ylab = 'Dim2', zlab = 'Dim3')

Each eigenvector of the graph Laplacian numerically indicates the membership of each observation to a block (cluster) in the data's graph, and each eigenvalue represents how disconnected that cluster is from the other clusters. In the ideal case, if we had eigenvalues reading 1, 1, 1, 1, 0.2 in the plot below, that would tell us we have four completely disconnected clusters, followed by one that is much more connected (undesirable when trying to find separate clusters). This principle can be extended to look for the greatest gap in the eigenvalues to find the number of clusters (K).
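
As a toy illustration of the eigengap heuristic (made-up eigenvalues, not taken from the brain cancer data):

## toy eigengap example: the largest drop in the eigenvalues suggests K
evals <- c(1, 1, 1, 1, 0.2, 0.15, 0.1)
gaps <- abs(diff(evals))   # gaps between consecutive eigenvalues
which.max(gaps)            # largest gap sits after the 4th eigenvalue, suggesting K = 4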

This first plot, produced automatically by Spectrum on the brain cancer data, shows the eigenvalues for each eigenvector of the graph Laplacian. Here the biggest gap is between the 4th and 5th eigenvectors, corresponding to K=4.

egap_brain.png

This next plot shows the 3D t-SNE of the brain cancer RNA-seq clusters. We can see clear separation of the groups. On this type of complex data, Spectrum tends to excel because it prioritises local similarities and reduces noise by constructing a k-nearest-neighbour graph and diffusing on it.

brain_tsne_rnaseq.png

Next, we can run a survival analysis using a Cox proportional hazards model; the log rank test is used to test whether there is a difference between the survival times of the different groups (clusters).

## do test
clinicali <- brain[[2]]
clinicali$Death <- as.numeric(as.character(clinicali$Death))
coxFit <- coxph(Surv(time = Time, event = Death) ~ as.factor(r$assignments), data = clinicali, ties = "exact")
coxresults <- summary(coxFit)
print(coxresults$logtest[3])

## survival curve
survival_p <- coxresults$logtest[3]
fsize <- 18
clinicalj <- clinicali
clinicalj$cluster <- as.factor(r$assignments)
fit <- survfit(Surv(Time, Death) ~ cluster, data = clinicalj)

## plot
gg <- ggsurvplot(fit, data = clinicalj, legend = 'right', font.x = fsize, font.y = fsize, font.tickslab = fsize,
                 palette = gg.col(4),
                 legend.title = 'Cluster',
                 ggtheme = theme_classic2(base_family = "Arial", base_size = fsize))
gg$plot + theme_bw() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.text.y = element_text(size = fsize, colour = 'black'),
        axis.text.x = element_text(size = fsize, colour = 'black'),
        axis.title.x = element_text(size = fsize),
        axis.title.y = element_text(size = fsize),
        legend.title = element_text(size = fsize),
        legend.text = element_text(size = fsize))

This is the survival curve for the analysis. The code above also gives us the p value for the log rank test (4.03E-22), which is highly significant, suggesting Spectrum yields good results. More extensive comparisons are included in our manuscript.

brain_survival.png

Spectrum is effective at reducing noise on a single view (as here) as well as across multiple views, because it performs a tensor product graph diffusion operation in either case and works on a k-nearest-neighbour graph. Multi-omic clustering examples are included in the package vignette. Spectrum is also well suited to a broad range of data types beyond this demonstration.
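
As a very rough sketch of the kind of operation involved (k-nearest-neighbour sparsification followed by an iterative graph diffusion; an illustration of the idea only, not Spectrum's exact implementation, assuming A is an N x N similarity matrix):

## sketch: KNN sparsification plus iterative diffusion on a similarity matrix A
diffuse_sketch <- function(A, k = 10, iters = 4){
  for (i in 1:nrow(A)){
    keep <- order(A[i, ], decreasing = TRUE)[1:k]  # indices of the k most similar neighbours
    A[i, -keep] <- 0                               # zero out the weaker edges
  }
  A <- sweep(A, 1, rowSums(A), "/")                # row-normalise the sparsified graph
  Q <- A
  for (t in 1:iters){
    Q <- A %*% Q %*% t(A) + diag(nrow(A))          # diffusion step reinforces consistent edges
  }
  Q                                                # smoothed similarity matrix
}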

The density-adaptive kernel enhances intra-cluster similarity, which helps to improve the quality of the similarity matrix. A good-quality similarity matrix (one that faithfully represents the data's graph) is key to the performance of spectral clustering: intra-cluster similarity should be maximised and inter-cluster similarity minimised.
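
As a rough illustration of what a density-aware kernel of this kind can look like (a sketch combining a Zelnik-Manor style self-tuning local scale with a common-nearest-neighbour term; see the Spectrum paper and source code for the actual kernel):

## hedged sketch of a density-aware kernel -- not the exact Spectrum kernel
## X: numeric matrix with observations in rows; nn: number of nearest neighbours
density_kernel_sketch <- function(X, nn = 3){
  D <- as.matrix(dist(X))                              # Euclidean distances between observations
  sig <- apply(D, 1, function(r) sort(r)[nn + 1])      # distance to each point's nn-th neighbour (local scale)
  knn_sets <- apply(D, 1, function(r) order(r)[2:(nn + 1)])  # nn nearest neighbours of each point
  cnn <- matrix(0, nrow(D), ncol(D))
  for (i in 1:nrow(D)){
    for (j in 1:ncol(D)){
      cnn[i, j] <- length(intersect(knn_sets[, i], knn_sets[, j]))  # shared nearest neighbours
    }
  }
  ## larger local scales and more shared neighbours both increase similarity
  exp(-D^2/((sig %o% sig) * (cnn + 1)))
}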

Spectrum is a fast new algorithm for spectral clustering in R. It is largely based on work by Zelnik-Manor, Ng, and Zhang, and includes implementations of pre-existing methods as well as new innovations. This is the second of a pair of clustering tools we have developed for precision medicine, the first being M3C, a consensus clustering algorithm available on Bioconductor. In many ways, spectral clustering is a more elegant and powerful method than consensus-clustering-type algorithms, but both are useful for examining data from different perspectives.

Spectrum can be downloaded here: https://cran.r-project.org/web/packages/Spectrum/index.html

Source code is available here: https://github.com/crj32/Spectrum

Reference

Christopher R John, David Watson, Michael R Barnes, Costantino Pitzalis, Myles J Lewis, Spectrum: fast density-aware spectral clustering for single and multi-omic data, Bioinformatics, btz704, https://doi.org/10.1093/bioinformatics/btz704

Running UMAP for data visualisation in R

UMAP is a non-linear dimensionality reduction algorithm in the same family as t-SNE. In the first phase of UMAP a weighted k-nearest-neighbour graph is computed; in the second, a low-dimensional layout of this graph is calculated. The embedded data points can then be visualised in the new space and compared with other variables of interest.

It can be used for the analysis of many types of data, including single cell RNA-seq and cancer omic data. One easy way to run UMAP on your data and visualise the results is to make a wrapper function that uses the umap R package and ggplot2. This is easy to do yourself, but in this post we are going to look at the one included in the M3C package (https://bioconductor.org/packages/devel/bioc/html/M3C.html). The code for this simple function can be found here (https://github.com/crj32/M3C/blob/master/R/umap.R).
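
For reference, a minimal version of such a wrapper might look something like this (a sketch only, assuming a matrix with features as rows and samples as columns, as the M3C functions expect; umap is called with its namespace to avoid a clash with M3C's own umap function):

## minimal sketch of a umap + ggplot2 wrapper (features as rows, samples as columns)
library(ggplot2)
umap_sketch <- function(mydata, labels = NULL){
  u <- umap::umap(t(mydata))                        # umap expects observations in rows
  df <- data.frame(X1 = u$layout[, 1], X2 = u$layout[, 2])
  if (!is.null(labels)) df$label <- labels
  p <- ggplot(df, aes(x = X1, y = X2)) + theme_bw()
  if (is.null(labels)) p + geom_point(colour = 'skyblue') else p + geom_point(aes(colour = label))
}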

M3C is a clustering algorithm for finding the number of clusters in a given dataset, but it also has several other functions that can be used more generally. I don't use M3C for single cell RNA-seq clustering; it is better suited to other types of omic data, such as bulk RNA-seq and standard omic platforms. However, its visualisation functions are useful for a number of different analyses I do; otherwise I end up copying and pasting lengthy blocks of ggplot2 code again and again.

There have been two previous posts on the related PCA and t-SNE functions contained in the M3C package which both contain the same helpful parameters for extra labelling of the data points.

Let’s load some single cell RNA-seq data: this is the pollen_test_data we used in the last post when running t-SNE.

So this is just a basic UMAP plot. In this data the samples within clusters are very similar, so the clusters appear very compact in these plots.

# load your omic data here as mydata
library(M3C)
umap(pollen$data,colvec=c('skyblue'))

UMAP.png

This is for labelling the sample points with our ground truth labels, i.e. cell types. The colour scale is set to be controlled manually, which here uses some colours defined internally in the function.

umap(pollen$data,labels=as.factor(pollen$celltypes),controlscale=TRUE,scale=3)

UMAPlabeled.png

This is for looking at an individual gene's expression across the samples, so we are just labelling the points with a continuous variable, in this case GAGE4, because its variance (standardised to the mean) is among the highest.

umap(pollen$data, labels=scale(as.numeric(pollen$data[row.names(pollen$data)=='GAGE4',])),
controlscale = TRUE,scale=2,legendtitle = 'GAGE4')

UMAPlabeled.png

Let’s just cluster this data with M3C and then display the clustering assignments on the UMAP plot.

r <- M3C(pollen$data,method=2)
umap(pollen$data,labels=as.factor(r$assignments),printres = TRUE,printwidth = 24)

UMAPlabeled.png

So that’s pretty neat; simple functions like this can be quite a lot of fun.

Let’s do one more plot. We will make up a continuous variable that could be a covariate we are interested in and display that on the plot. I'll also show the code for printing the plot here; this is done internally by the function to simplify things.

Setting scale=1 means we use the spectral palette for colouring the data points. We have to set controlscale=TRUE to use custom colours; it is also possible to set 'high' and 'low' colours manually by changing the scale parameter to 2. Type ?umap for the relevant documentation.

# make an example continuous variable
var <- seq(1,ncol(pollen$data))
# do plot
umap(pollen$data,labels=var,controlscale=TRUE,scale=1,
legendtitle='var',printres = TRUE,printwidth = 24)

UMAPlabeled.png

Note that if you want to control the various parameters that UMAP uses internally, rather than doing a quick analysis like this, you'll have to make your own function wrapping ggplot2 and umap (https://cran.r-project.org/web/packages/umap/index.html), or just run the steps individually.
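
For example, with the umap package itself the internal parameters can be changed through its config object (a sketch; see ?umap::umap.defaults for the full list of settings):

## sketch: running umap directly with custom internal parameters
config <- umap::umap.defaults
config$n_neighbors <- 30                              # e.g. use a larger neighbourhood
config$min_dist <- 0.3
embedding <- umap::umap(t(pollen$data), config = config)
head(embedding$layout)                                # 2D coordinates, one row per sample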

So, that brings us to the end of this little blog series on the M3C data visualisation functions. Hope you all enjoyed it!

Quick and easy t-SNE analysis in R

t-SNE is a useful nonlinear dimensionality reduction method that allows you to visualise data embedded in a lower number of dimensions, e.g. 2, in order to see patterns and trends in the data. It can deal with more complex, nonlinear patterns of Gaussian clusters in multidimensional space than PCA, so it is well suited to single cell RNA-seq analysis. It is not suited to finding outliers, however, because the arrangement of the samples does not directly represent distance as it does in PCA; t-SNE preserves local distances at the expense of global distances.

An easy way to run t-SNE on your data is to use a pre-made wrapper function that uses the Rtsne package and ggplot2, like the one that comes with the M3C package (https://bioconductor.org/packages/devel/bioc/html/M3C.html). You can also take this code from GitHub and adapt it for your own needs if necessary (https://github.com/crj32/M3C); it is quite straightforward. The package also has equivalent functions for PCA and UMAP.
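
If you would rather not use a wrapper at all, the underlying steps are only a few lines (a sketch, assuming features as rows and samples as columns, and using the pollen data loaded below):

## sketch: running Rtsne directly and plotting with ggplot2
library(Rtsne)
library(ggplot2)
ts <- Rtsne(t(pollen$data), dims = 2, perplexity = 15)   # samples in rows after transposing
df <- data.frame(X1 = ts$Y[, 1], X2 = ts$Y[, 2], celltype = pollen$celltypes)
ggplot(df, aes(x = X1, y = X2, colour = celltype)) + geom_point() + theme_bw()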

Let’s load some single cell RNA-seq data and demonstrate this function.

This is the Pollen et al. single cell RNA-seq data, but you can use a different kind of omic data, or even non-omic data. Let’s take a look at the data.

# load your omic data here as mydata
library(M3C)
tsne(pollen$data,colvec=c('gold'))

TSNE1.png

So, we can see this data is neatly grouped into clusters for us. Let’s compare these with the cell type labels. This is just using the default colours; for changing to custom colours we can use the ‘controlscale’, ‘scale’, and ‘colvec’ parameters. You can read about these in the documentation using ?tsne.

library(M3C)
tsne(pollen$data,labels=as.factor(pollen$celltypes))

TSNElabeled.png

So now we might want to display the expression of a specific gene on the t-SNE plot, which we can easily do with this function. The code below can be modified for any given gene; this example uses the gene ‘HBG1’. It just pulls a numerical vector out of the gene expression matrix for us and z-score scales it for display.

I chose HBG1 because I had run the ‘featurefilter’ function from M3C, which collects statistics about all genes in the data frame, such as the coefficient of variation and its second-order derivative (use ?featurefilter to see more information). This gene was the most variable one in the data. Basically, this function is for filtering the data according to variance and producing summary statistics per gene.
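
The same idea can be reproduced in base R if you just want a quick look at per-gene variability (a rough sketch only, not the featurefilter implementation):

## quick base R look at per-gene variability (not the featurefilter implementation)
cv <- apply(pollen$data, 1, function(x) sd(x)/mean(x))   # coefficient of variation per gene
cv <- cv[is.finite(cv)]                                  # drop genes with zero mean
head(sort(cv, decreasing = TRUE))                        # most variable genes first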

Here we choose to control the colour scale; setting scale to 2 means low values are grey and high values are red. However, the ‘low’ and ‘high’ colours may also be user specified.

tsne(pollen$data,
labels=scale(as.numeric(pollen$data[row.names(pollen$data)=='HBG1',])),
controlscale = TRUE,scale=2)

TSNElabeled.png

Lastly, I want to find out which cell type the HBG1 expression is coming from. We can do this using the PCA function, because on the t-SNE plot the dots are too close together to read the text labels. To do this we add a text variable to the pca function.

pca(pollen$data,labels=scale(as.numeric(pollen$data[row.names(pollen$data)=='HBG1',])),
controlscale=TRUE,scale=2,text=pollen$celltypes,
textlabelsize = 2)

PCAlabeled.png

So, basically we can zoom in on the printed image and see that the red cluster is all the K562 cell type. The text parameter is primarily for finding outliers with PCA, but it can be used in other ways too.

M3C is not intended for clustering single cell RNA-seq data, because of the computational complexity of the algorithm and the type of consensus clustering it does, but we can use its functions for other purposes anyway. It is good for clustering cancer single-omic data, for example. It is essentially an enhanced version of the Monti consensus clustering algorithm, originally implemented in the ConsensusClusterPlus package.

For an algorithm that can handle single cell RNA-seq better, as well as a wide range of omic and non-omic data, Spectrum is a good option (https://cran.r-project.org/web/packages/Spectrum/index.html). It is faster than, and in many ways more sophisticated than, the consensus clustering approach. SC3 and Seurat are also good for single cell RNA-seq clustering.


Easy quick PCA analysis in R

Principal component analysis (PCA) is very useful for doing some basic quality control (e.g. looking for batch effects) and for assessing how the data are distributed (e.g. finding outliers). A straightforward way is to make your own wrapper function for prcomp and ggplot2; another is to use the one that comes with M3C (https://bioconductor.org/packages/devel/bioc/html/M3C.html) or another package. M3C is a consensus clustering tool that makes some major modifications to the Monti et al. (2003) algorithm so that it behaves better, and it also provides functions for data visualisation.

Let’s have a go on an example cancer microarray dataset.

# M3C loads an example dataset mydata with metadata desx
library(M3C)
# do PCA
pca(mydata,colvec=c('gold'),printres=TRUE)

PCApriorclustering.png

So, what prcomp has done is extract the eigenvectors of the data’s covariance matrix, then project the original samples onto them via linear combinations. This yields PC scores, which are plotted here on PC1 and PC2 (eigenvectors 1 and 2). The eigenvectors are sorted, and these first two explain the most variation in the data, but it may be a good idea to look at additional PCs, which is beyond this simple blog post and function.
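
For reference, the raw prcomp equivalent of what the wrapper does is only a few lines (a sketch, assuming mydata has features as rows and samples as columns):

## sketch: the underlying prcomp steps
p <- prcomp(t(mydata))                  # prcomp wants samples in rows, hence the transpose
scores <- p$x[, 1:2]                    # PC scores plotted above as PC1 and PC2
varexp <- summary(p)$importance[2, ]    # proportion of variance explained per PC
head(varexp)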

You can see above there are no obvious outliers for removal here, which is good for cluster analysis. However, were there outliers, we could label all samples with their names using the ‘text’ parameter.

library(M3C)
pca(mydata,colvec=c('gold'),printres=TRUE,text=colnames(mydata))

PCApriorclustering.png

Other objectives would be comparing the samples with batch to make sure we do not have batch effects driving the variation in our data, and comparing with other variables such as gender and age. Since the metadata only contain tumour class, we are going to use that next to see how it relates to the PC scores.

This is a categorical variable, so the ‘scale’ parameter should be set to 3, ‘controlscale’ is set to TRUE because I want to choose the colours, and ‘labels’ parameter passes the metadata tumour class into the function. I am increasing the ‘printwidth’ from its default value because the legend is quite wide.

For more information see the function documentation using ?pca.

# first reformat meta to replace NA with Unknown
desx$class <- as.character(desx$class)
desx$class[is.na(desx$class)] <- 'Unknown'
# next do the plot
pca(mydata,colvec=c('gold','skyblue','tomato','midnightblue','grey'),
labels=desx$class,controlscale=TRUE,scale=3,printres=TRUE,printwidth=25)

PCAlabeled.png

So, it now appears that the variation governing these two PCs is indeed related to tumour class, which is good. But what if the variable is continuous, and we wanted to compare read mapping rate, read duplication percentage, or age with our data? A simple change of the parameters allows this too. Let’s make up a continuous variable, then add it to the plot. In this case we change the ‘scale’ parameter to reflect that we are using continuous data, and the spectral palette is used for the colours by default.

randomvec <- seq(1,50)
pca(mydata,labels=randomvec,controlscale=TRUE,scale=1,legendtitle='Random',
printres=TRUE,printwidth=25)

PCAlabeled.png

So, since this is a random variable, we can see it has no relationship with our data. Let’s just define a custom scale now. So we change ‘scale’ to 2, then use the ‘low’ and ‘high’ parameters to define the scale colour range.

pca(mydata,labels=randomvec,controlscale=TRUE,scale=2,legendtitle='Random',
printres=TRUE,printwidth=25,low='grey',high='red')

PCAlabeled.png

Super easy, yet pretty cool. The idea here is just to minimise the amount of code hanging around for doing basic analyses like these. You can take the code from here: https://github.com/crj32/M3C/blob/master/R/pca.R. Remember, if your features are on different scales, the data need transforming to be comparable beforehand.
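
For example, one option is to let prcomp standardise the features for you (a sketch; constant features would need removing first):

## z-score each feature before PCA when features are on different scales
p_scaled <- prcomp(t(mydata), center = TRUE, scale. = TRUE)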

M3C is part of a series of cluster analysis tools that we are releasing, the next is ‘Spectrum’, a fast adaptive spectral clustering tool, which is already on CRAN (https://cran.r-project.org/web/packages/Spectrum/index.html). Perhaps there will be time to blog about that in the future.

For quantitative analysis of the drivers of variation in the data, I recommend checking out David Watson’s great function ‘plot_drivers’ in the ‘bioplotr’ package (https://github.com/dswatson/bioplotr). This compares the PCs with categorical and continuous variables and performs univariate statistical analysis on them, producing a very nice plot.

Using clusterlab to benchmark clustering algorithms

Clusterlab is a CRAN package (https://cran.r-project.org/web/packages/clusterlab/index.html) for the routine testing of clustering algorithms. It can simulate positive controls (datasets with >1 clusters) and negative controls (datasets with 1 cluster). Why test clustering algorithms? Because they often fail to identify the true K in practice, published algorithms are not always well tested, and we need to know about the ones with strange behaviour. I’ve found in many of my own experiments that the algorithms many people are using are not necessarily the ones that provide the most sensible results. I can give a good example below.
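
For example, a positive and a negative control can be generated in one line each (a minimal sketch, leaving clusterlab's other simulation parameters at their defaults):

## positive control (K=4) and negative control (K=1, a null dataset)
library(clusterlab)
pos <- clusterlab(centers = 4)   # synthetic dataset containing four clusters
neg <- clusterlab(centers = 1)   # null dataset with a single cluster
dim(pos$synthetic_data)          # the simulated matrix passed into the clustering tools below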

I was interested to see that clusterExperiment, a relatively new clustering package on Bioconductor, cannot detect the ground truth K in my testing so far, instead yielding solutions with many more clusters than there are in reality. On the other hand, the package I developed with David Watson here at QMUL works rather well. It is a derivative of the Monti et al. (2003) consensus clustering algorithm, fitted with a Monte Carlo reference procedure to eliminate overfitting. We called this algorithm M3C.

library(clusterExperiment)
library(clusterlab)
library(M3C)

Experiment 1: Simulate a positive control dataset (K=5)

PCAlabeled_real.png

k5 <- clusterlab(centers=5)

Experiment 2: Test RSEC (https://bioconductor.org/packages/release/bioc/html/clusterExperiment.html)

RSEC_test.png

rsec_test <- RSEC(as.matrix(k5$synthetic_data),ncores=10)
assignments <- primaryCluster(rsec_test)
pca(k5$synthetic_data,labels=assignments)

Experiment 3: Test M3C (https://bioconductor.org/packages/release/bioc/html/M3C.html)

PCAlabeled_M3C.png

M3C_test <- M3C(k5$synthetic_data,iters=10,cores=10)
optk <- which.max(M3C_test$scores$RCSI)+1
pca(M3C_test$realdataresults[[optk]]$ordered_data,
labels=M3C_test$realdataresults[[optk]]$ordered_annotation$consensuscluster,printres=TRUE)

Interesting, isn't it, R readers?

Well, all I can say is that I recommend comparing different machine learning methods and, if in doubt, running your own control datasets (positive and negative controls) to test them. In the other post we showed a remarkable bias in optimism corrected bootstrapping compared with LOOCV under certain conditions, simply by simulating null datasets and passing them into the method.

If you are clustering omic data from a single platform (RNA-seq, protein, methylation, etc.) I have tested and recommend the following packages:

CLEST: https://www.rdocumentation.org/packages/RSKC/versions/2.4.2/topics/Clest
M3C: https://bioconductor.org/packages/release/bioc/html/M3C.html

And to be honest, that is about it. I have also tested PINSplus, the GAP statistic, and SNF, but they did not provide satisfactory results in my experiments on single-platform clustering (currently unpublished data). Multi-omic and single cell RNA-seq analysis is another story; there will be more on that to come in the future, R readers.

Remember there is a darkness out there R readers, not just in Washington, but in the world of statistics. It is there because of the positive results bias in science, because of people not checking methods and comparing them with one another, and because of several other reasons I can’t even be bothered to mention.


Part 5: Code corrections to optimism corrected bootstrapping series

The truth is out there, R readers, but often it is not what we have been led to believe. The previous post examined the strong positive results bias in optimism corrected bootstrapping (a method of assessing a machine learning model’s predictive power) with increasing p (completely random features). This bias is real and is affecting publications. However, there were two implementations of the method given in the previous post: the first had a slight error, while the second seemed fine. The trend is still the same with the corrected code; the problem with the earlier code was that I did not set ‘replace=TRUE’ in the call to the ‘sample’ function. The problem inherent to the ‘optimism corrected bootstrapping’ method is more about estimating error using the same samples for training and testing than about how we resample the data, so this change does not make a large difference to the results. We will go into a little more detail about bootstrapping in this article and repeat the analyses with the corrected code. Once again, you are welcome to re-run the code written here and also to test using the independent Caret implementation shown on the other page.

For the uninformed reader, I have also shown the problem using Caret (https://intobioinformatics.wordpress.com/2018/12/25/optimism-corrected-bootstrapping-a-problematic-method/), and where it originates from in the method here (it is obvious with a bit of statistical knowledge) (https://intobioinformatics.wordpress.com/2018/12/28/part-4-more-bias-and-why-does-bias-occur-in-optimism-corrected-bootstrapping/). It is a very simple test I have done (using simulated null datasets with increasing features) and it shows the problem with the method very clearly.

Thanks to ECOQUANT for pointing out to me that the ‘sample’ function should have been called with an additional parameter (replace=TRUE).

Let’s just recap what bootstrapping is and what optimism corrected bootstrapping is before we redo the experiments:

This is from Jason’s excellent blog (https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/), bootstrapping is:

  1. Choose a number of bootstrap samples to perform
  2. Choose a sample size
  3. For each bootstrap sample (b=1 … B)
    1. Draw a sample with replacement with the chosen size
    2. Calculate the statistic on the sample
  4. Calculate the mean of the calculated sample statistics.

The ‘with replacement’ part means we put each sampled observation back before drawing the next one in the bth bootstrap iteration. Thus, we usually have duplicate samples in our resampled data when bootstrapping.
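
A one-line demonstration of what sampling with replacement means in practice:

## sampling 10 items from 1:10 with replacement -- duplicates are expected
set.seed(1)
sample(1:10, 10, replace = TRUE)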

This is the optimism corrected bootstrapping algorithm:

  1. Fit a model M to entire data S and estimate predictive ability C.
  2. Iterate from b=1…B:
    1. Take a resample from the original data, S*
    2. Fit the bootstrap model M* to S* and get predictive ability, C_boot
    3. Use the bootstrap model M* to get predictive ability on S, C_orig
  3. Optimism O is calculated as mean(C_boot – C_orig)
  4. Calculate optimism corrected performance as C-O.

Since we use the same data in step 3 of the bootstrap to train and test the model (an information leak), we would expect increasing bias (C_orig will be too high, and thus O too small) as more and more random features are added. See the previous post for more explanation of this. Another point is that the optimism corrected bootstrap is usually done with a resample size of N rather than just a fraction of N. I found the following quote to support this:

“The small data set was repeatedly re-sampled to produce b replicated data sets, each the same size as the original. We used b = 200. The predictive model was fitted to each of the b replicated data sets in turn. Each fitted model was then applied both to the resampled data set from which it was generated and to the original data set.”

Smith, Gordon CS, et al. “Correcting for optimistic prediction in small data sets.” American journal of epidemiology 180.3 (2014): 318-324.

I have tried reducing the resampling size to a fraction of N, which reduces the bias somewhat, but it is still there. This makes sense given the information leak in this method, which results in an underestimation of the optimism (O).

You're welcome to experiment with this code yourselves. If you are thinking of using this method, I recommend first simulating null datasets with the same number of dimensions to check how biased your AUC/model performance estimate will be. With high numbers of features, using this method is clearly a serious mistake.

This code can be directly copied and pasted into R to repeat the experiments.

Experiment 1: my implementation – glmnet (lasso logistic regression)


library(glmnet)
library(pROC)
library(caret)
library(ggplot2)
library(kimisc)

### TEST 1: bootstrap optimism with glmnet
cc <- c()
for (zz in seq(2,100,1)){
  print(zz)
  ## set up test data
  test <- matrix(rnorm(100*zz, mean = 0, sd = 1),
                 nrow = 100, ncol = zz, byrow = TRUE)
  labelsa <- as.factor(c(rep('A',50),rep('B',50)))
  colnames(test) <- paste('Y',seq(1,zz),sep='')
  row.names(test) <- paste('Z',seq(1,100),sep='')

  ### my own implementation of optimism corrected bootstrapping

  ## 1. fit model to entire test data (i.e. overfit it)
  orig <- glmnet(test,y=labelsa,alpha=1,family = "binomial")
  preds <- predict(orig,newx=test,type='response',s=0.01)
  auc <- roc(labelsa,as.vector(preds))
  original_auc <- as.numeric(auc$auc)

  ## 2. take resample of data and try to estimate optimism
  test2 <- cbind(test,labelsa)
  B <- 50
  results <- matrix(ncol=2,nrow=B)
  for (b in seq(1,B)){
    ## get the bootstrapped resampled data
    boot <- test2[sample(row.names(test2),100,replace=TRUE),]
    labelsb <- boot[,ncol(boot)]
    boot <- boot[,-ncol(boot)]
    ## use the bootstrapped model to predict its own labels
    bootf <- glmnet(boot,y=labelsb,alpha=1,family = "binomial")
    preds <- predict(bootf,newx=boot,type='response',s=0.01)
    auc <- roc(labelsb,as.vector(preds))
    boot_auc <- as.numeric(auc$auc)
    ## use bootstrap model to predict original labels
    preds <- predict(bootf,newx=test,type='response',s=0.01)
    auc <- roc(labelsa,as.vector(preds))
    boot_original_auc <- as.numeric(auc$auc)
    ## add the data to results
    results[b,1] <- boot_auc
    results[b,2] <- boot_original_auc
  }
  ## calculate the optimism
  O <- mean(results[,1]-results[,2])
  ## calculate optimism corrected measure of prediction (AUC)
  corrected <- original_auc-O
  ##
  cc <- c(cc,corrected)
  print(cc)
}

## print it
df <- data.frame(p=seq(2,100,1),optimism_corrected_boot_AUC=cc)
p1 <- ggplot(df, aes(x=p, y=optimism_corrected_boot_AUC)) +
  geom_line() + ggtitle('glmnet - random data only gives positive result with optimism corrected bootstrap')
print(p1)
png('glmnet_test_upto100.png', height = 15, width = 27, units = 'cm',
    res = 900, type = 'cairo')
print(p1)
dev.off()

Here are the results with 100 samples and 50 bootstrap iterations from 2 to 100 random features from a Gaussian distribution. We are re-sampling using the original sample size (N=100).

Random features are being added iteratively on the X axis, and on the Y axis we have the AUC. The AUC should be 0.5, reflecting that the data have no real predictive power, but it is highly inflated.

glmnet_test_upto100

Experiment 2: another implementation – glm (logistic regression)


## TEST2
auc.adjust <- function(data, fit, B){
  fit.model <- fit
  data$pred.prob <- fitted(fit.model)
  # get overfitted AUC
  auc.app <- roc(data[,1], data$pred.prob, data=data)$auc # require 'pROC'
  auc.boot <- vector(mode = "numeric", length = B)
  auc.orig <- vector(mode = "numeric", length = B)
  o <- vector(mode = "numeric", length = B)
  for(i in 1:B){
    boot.sample <- sample.rows(data, nrow(data), replace=TRUE) # require 'kimisc'
    fit.boot <- glm(formula(fit.model), data = boot.sample, family = "binomial")
    boot.sample$pred.prob <- fitted(fit.boot)
    # get bootstrapped AUC
    auc.boot[i] <- roc(boot.sample[,1], boot.sample$pred.prob, data=boot.sample)$auc
    # get original data boot AUC
    data$pred.prob.back <- predict.glm(fit.boot, newdata=data, type="response")
    auc.orig[i] <- roc(data[,1], data$pred.prob.back, data=data)$auc
    # calculate the optimism for this iteration
    o[i] <- auc.boot[i] - auc.orig[i]
  }
  auc.adj <- auc.app - (sum(o)/B)
  return(auc.adj)
}
cc <- c()
for (zz in seq(2,100,1)){
  print(zz)
  ## set up test data
  test <- matrix(rnorm(100*zz, mean = 0, sd = 1),
                 nrow = 100, ncol = zz, byrow = TRUE)
  labelsa <- as.factor(c(rep('A',50),rep('B',50)))
  colnames(test) <- paste('Y',seq(1,zz),sep='')
  row.names(test) <- paste('Z',seq(1,100),sep='')
  test2 <- data.frame(cbind(labelsa,test))
  test2$labelsa <- as.factor(test2$labelsa)
  ## 1. make overfitted model
  model <- glm(labelsa ~., data = test2, family = "binomial")
  ## 2. estimate optimism and correct
  d <- auc.adjust(test2, model, B=200)
  cc <- c(cc,d)
}
## print it
df <- data.frame(p=seq(2,100,1),optimism_corrected_boot_AUC=cc)
p2 <- ggplot(df, aes(x=p, y=optimism_corrected_boot_AUC)) +
  geom_line() + ggtitle('glm - random data only gives positive result with optimism corrected bootstrap')
print(p2)
png('glmt_test_upto100.png', height = 15, width = 27, units = 'cm',
    res = 900, type = 'cairo')
print(p2)
dev.off()

glmt_test_upto100.png

Part 4: Why does bias occur in optimism corrected bootstrapping?

In the previous parts of the series we demonstrated a positive results bias in optimism corrected bootstrapping simply by adding random features to our labels. This problem is due to an ‘information leak’ in the algorithm, meaning the training and test datasets are not kept separate when estimating the optimism. Because of this, the optimism can, under some conditions, be badly underestimated. Let’s analyse the code; it is pretty straightforward to understand, and then we can see where the problem originates.

  1. Fit a model M to entire data S and estimate predictive ability C. ## this part is our overfitted estimation of performance (can be AUC, accuracy, etc)
  2. Iterate from b=1…B: ## now we are doing resampling of the data to estimate the error
    1. Take a resample from the original data, S*
    2. Fit the bootstrap model M* to S* and get predictive ability, C_boot ## this will clearly give us another overfitted model performance, which is fine
    3. Use the bootstrap model M* to get predictive ability on S, C_orig ## we are using the same data (samples) used to train the model to test it, therefore it is not surprising that we have values with better performance than expected. C_orig values will be too high.
  3. Optimism O is calculated as mean(C_boot – C_orig)
  4. Calculate optimism corrected performance as C-O.

One way of correcting for this would be to change step 3 of the bootstrap: instead of testing on the original data, test on the left-out (unseen) samples of that bootstrap, as sketched below. That way the training and test data are kept entirely separate in terms of samples, eliminating the bias towards inflated model performance on null datasets with many features. There is probably no point in doing this anyway when we have methods such as LOOCV and K-fold cross validation.
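
Here is a hedged sketch of that idea on a toy dataset (out-of-bag evaluation inside the bootstrap loop; an illustration of the idea only, not a validated replacement for cross validation):

## sketch: evaluate the bootstrap model on the out-of-bag (unseen) samples only
set.seed(1)
dat <- data.frame(y = factor(rep(c('A','B'), 25)), x = rnorm(50))   # toy data frame
n <- nrow(dat)
idx <- sample(1:n, n, replace = TRUE)     # bootstrap sample (with replacement)
oob <- setdiff(1:n, unique(idx))          # samples never drawn: unseen by this model
fit.boot <- glm(y ~ x, data = dat[idx, ], family = "binomial")
preds <- predict(fit.boot, newdata = dat[oob, ], type = "response")
## performance computed here (e.g. pROC::roc(dat$y[oob], preds)$auc) uses unseen samples only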

As p (features) becomes much larger than N (samples), the bootstrap models get better and better at apparently predicting the original data. Why? Because the original data contain the same samples as the bootstrap, and as we add more features, there is a greater chance that some of them randomly correlate with our response variable. When we test the bootstrap model on the original data (the bootstrap samples plus the ones that were left out), it retains some of this random ability to predict the real labels. This is a typical overfitting problem when we have higher numbers of features, and the procedure is faulty.

Let’s take another experimental look at the problem; this code can be directly copied and pasted into R to repeat the analyses and plots. We have two implementations of the method: the first is mine, for glmnet (lasso logistic regression); the second, for glm (logistic regression), comes from this website (http://cainarchaeology.weebly.com/r-function-for-optimism-adjusted-auc.html). Feel free to try different machine learning algorithms and play with the parameters.


library(glmnet)
library(pROC)
library(caret)
library(ggplot2)
library(kimisc)

### TEST 1: bootstrap optimism with glmnet
cc <- c()
for (zz in seq(2,100,1)){
print(zz)
## set up test data
test <- matrix(rnorm(100*zz, mean = 0, sd = 1),
nrow = 100, ncol = zz, byrow = TRUE)
labelsa <- as.factor(c(rep('A',50),rep('B',50)))
colnames(test) <- paste('Y',seq(1,zz),sep='')
row.names(test) <- paste('Z',seq(1,100),sep='')

### my own implementation of optimism corrected bootstrapping

## 1. fit model to entire test data (i.e. overfit it)

orig <- glmnet(test,y=labelsa,alpha=1,family = "binomial")
preds <- predict(orig,newx=test,type='response',s=0.01)
auc <- roc(labelsa,as.vector(preds))
original_auc <- as.numeric(auc$auc)

## 2. take resample of data and try to estimate optimism

test2 <- cbind(test,labelsa)
B <- 50
results <- matrix(ncol=2,nrow=B)
for (b in seq(1,B)){
## get the bootstrapped resampled data
boot <- test2[sample(row.names(test2),50),]
labelsb <- boot[,ncol(boot)]
boot <- boot[,-ncol(boot)]
## use the bootstrapped model to predict its own labels
bootf <- glmnet(boot,y=labelsb,alpha=1,family = "binomial")
preds <- predict(bootf,newx=boot,type='response',s=0.01)
auc <- roc(labelsb,as.vector(preds))
boot_auc <- as.numeric(auc$auc)
## use bootstrap model to predict original labels
preds <- predict(bootf,newx=test,type='response',s=0.01)
auc <- roc(labelsa,as.vector(preds))
boot_original_auc <- as.numeric(auc$auc)
## add the data to results
results[b,1] <- boot_auc
results[b,2] <- boot_original_auc
}
## calculate the optimism
O <- mean(results[,1]-results[,2])
## calculate optimism corrected measure of prediction (AUC)
corrected <- original_auc-O
##
cc <- c(cc,corrected)
print(cc)
}
## print it
df <- data.frame(p=seq(2,100,1),optimism_corrected_boot_AUC=cc)
p1 <- ggplot(df, aes(x=p, y=optimism_corrected_boot_AUC)) +
geom_line() + ggtitle('glmnet - random data only gives positive result with optimism corrected bootstrap')
print(p1)
png('glmnet_test_upto100.png', height = 15, width = 27, units = 'cm',
res = 900, type = 'cairo')
print(p1)
dev.off()

So here are the results: as the number of noise-only features on the x axis increases, our 'corrected' estimate of the AUC (on the y axis) also increases, once there are enough features for the noise to randomly predict the labels. This shows the problem starts at about 40-50 features, then gets worse until about 75+. This is with the 'glmnet' function.

glmnet_test_upto100.png

Let’s look at the method using glm; we find the same trend with a different implementation.

## TEST2
auc.adjust <- function(data, fit, B){
fit.model <- fit
data$pred.prob <- fitted(fit.model)
# get overfitted AUC
auc.app <- roc(data[,1], data$pred.prob, data=data)$auc # require 'pROC'
auc.boot <- vector (mode = "numeric", length = B)
auc.orig <- vector (mode = "numeric", length = B)
o <- vector (mode = "numeric", length = B)
for(i in 1:B){
boot.sample <- sample.rows(data, nrow(data), replace=TRUE) # require 'kimisc'
fit.boot <- glm(formula(fit.model), data = boot.sample, family = "binomial")
boot.sample$pred.prob <- fitted(fit.boot)
# get bootstrapped AUC
auc.boot[i] <- roc(boot.sample[,1], boot.sample$pred.prob, data=boot.sample)$auc
# get original data boot AUC
data$pred.prob.back <- predict.glm(fit.boot, newdata=data, type="response")
auc.orig[i] <- roc(data[,1], data$pred.prob.back, data=data)$auc
# calculated optimism corrected version
o[i] <- auc.boot[i] - auc.orig[i]
}
auc.adj <- auc.app - (sum(o)/B)
return(auc.adj)
}
cc <- c()
for (zz in seq(2,100,1)){
print(zz)
## set up test data
test <- matrix(rnorm(100*zz, mean = 0, sd = 1),
nrow = 100, ncol = zz, byrow = TRUE)
labelsa <- as.factor(c(rep('A',50),rep('B',50)))
colnames(test) <- paste('Y',seq(1,zz),sep='')
row.names(test) <- paste('Z',seq(1,100),sep='')
test2 <- data.frame(cbind(labelsa,test))
test2$labelsa <- as.factor(test2$labelsa)
## 1. make overfitted model
model <- glm(labelsa ~., data = test2, family = "binomial")
## 2. estimate optimism and correct
d <- auc.adjust(test2, model, B=200)
cc <- c(cc,d)
}
## print it
df <- data.frame(p=seq(2,100,1),optimism_corrected_boot_AUC=cc)
p2 <- ggplot(df, aes(x=p, y=optimism_corrected_boot_AUC)) +
geom_line() + ggtitle('glm - random data only gives positive result with optimism corrected bootstrap')
print(p2)
png('glmt_test_upto100.png', height = 15, width = 27, units = 'cm',
res = 900, type = 'cairo')
print(p2)
dev.off()

So there we have it: the method has a problem and should not be used with more than about 40 features. Unfortunately, this method is currently being applied to datasets with higher numbers of dimensions than this (40+), because people assume that a published method is safe; that is not how the world works, R readers. Remember, the system is corrupt, science and statistics are full of lies, and if in doubt, do your own tests on positive and negative controls.

This is with the 'glm' function. Random features added on x axis, the corrected AUC on y axis.

glmt_test_upto100.png

What if you don’t believe it? I mean, this is a textbook method. Well, R readers, if that is the case, I suggest coding it yourself and trying the code here: run the experiments on null (random data only) datasets with increasing numbers of features.

This is the last part in the series on debunking the optimism corrected bootstrap method. I consider it, debunked.