Running UMAP for data visualisation in R

UMAP is a non-linear dimensionality reduction algorithm in the same family as t-SNE. In the first phase of UMAP a weighted k-nearest-neighbour graph is computed; in the second, a low-dimensional layout of this graph is calculated. The embedded data points can then be visualised in the new space and compared with other variables of interest.

It can be used for the analysis of many types of data, including single cell RNA-seq and cancer omic data. One easy way to run UMAP on your data and visualise the results is to make a wrapper function that uses the umap R package and ggplot2. This is easy to do yourself (see the sketch below), but in this post we are going to look at the one included in the M3C package (https://bioconductor.org/packages/devel/bioc/html/M3C.html). The code for this simple function can be found here (https://github.com/crj32/M3C/blob/master/R/umap.R).
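
To give an idea of what such a wrapper involves, here is a minimal sketch of my own; this is not the M3C code, the function name umap_plot is purely illustrative, and it assumes your samples are columns (as in M3C).

library(umap)
library(ggplot2)

umap_plot <- function(mydata, labels = NULL) {
  emb <- umap::umap(t(mydata)) # the umap package expects samples as rows
  df <- data.frame(X1 = emb$layout[, 1], X2 = emb$layout[, 2])
  if (is.null(labels)) {
    ggplot(df, aes(x = X1, y = X2)) + geom_point(colour = 'skyblue') + theme_bw()
  } else {
    df$label <- labels
    ggplot(df, aes(x = X1, y = X2, colour = label)) + geom_point() + theme_bw()
  }
}
# e.g. umap_plot(pollen$data, labels = as.factor(pollen$celltypes))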

M3C is a clustering algorithm for finding the number of clusters in a given dataset, but it also has several other useful functions that can be used more generally. I don't use M3C for single cell RNA-seq clustering; it is better suited to other types of omic data, such as RNA-seq and standard omic platforms. However, its visualisation functions are useful for a number of different analyses I do. Otherwise I end up copying and pasting lengthy blocks of ggplot2 code again and again.

There have been two previous posts on the related PCA and t-SNE functions in the M3C package, both of which have the same helpful parameters for extra labelling of the data points.

Let's load some single cell RNA-seq data. This is the pollen_test_data we used in the last post when running t-SNE.

So this is just a basic UMAP plot. In this data the samples within clusters are very similar, so the clusters appear very compact in these plots.

# load your omic data here as mydata; we use the pollen data that comes with M3C
library(M3C)
umap(pollen$data,colvec=c('skyblue'))

UMAP.png

This is for labelling the sample points with our ground truth labels, i.e. cell types. Setting controlscale=TRUE means the colour scale is controlled manually, and scale=3 uses some internal colours defined inside the function.

umap(pollen$data,labels=as.factor(pollen$celltypes),controlscale=TRUE,scale=3)

UMAPlabeled.png

This is for looking at an individual gene's expression across the samples, so we are just labelling the points with a continuous variable, in this case GAGE4, because its mean-standardised variance is among the highest.

umap(pollen$data, labels=scale(as.numeric(pollen$data[row.names(pollen$data)=='GAGE4',])),
controlscale = TRUE,scale=2,legendtitle = 'GAGE4')

UMAPlabeled.png

Let's just cluster this data with M3C and then display the cluster assignments on the UMAP plot.

r <- M3C(pollen$data,method=2)
umap(pollen$data,labels=as.factor(r$assignments),printres = TRUE,printwidth = 24)

UMAPlabeled.png

So that's pretty neat; simple functions like this can be quite a lot of fun.

Let's do one more plot. We will make up a continuous variable that could be a covariate we are interested in and display that on the plot. I'll show the parameters for printing the plot to a file here; the printing itself is done internally by the function to simplify things.

Setting scale=1 means we use the spectral palette for colouring the data points. We have to set controlscale=TRUE to use custom colours; it is also possible to set 'high' and 'low' colours manually by changing the scale parameter to 2. Type ?umap for the relevant documentation.

# make a dummy continuous variable
var <- seq(1,ncol(pollen$data))
# do plot
umap(pollen$data,labels=var,controlscale=TRUE,scale=1,
legendtitle='var',printres = TRUE,printwidth = 24)

UMAPlabeled.png

Note that if you want to control the various parameters that UMAP uses internally, rather than doing a quick analysis like this, you'll have to make your own function wrapping ggplot2 and umap (https://cran.r-project.org/web/packages/umap/index.html) or just run them as individual steps.
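
For example, a hedged sketch of running the umap package directly with a custom configuration; n_neighbors and min_dist are genuine umap package settings, but the values below are arbitrary choices for illustration.

library(umap)
library(M3C) # for the pollen example data
config <- umap.defaults
config$n_neighbors <- 20
config$min_dist <- 0.2
embedding <- umap::umap(t(pollen$data), config = config)
head(embedding$layout) # the 2D coordinates, one row per cell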

So, that brings us to the end of this little blog series on the M3C data visualisation functions. Hope you all enjoyed it!

Quick and easy t-SNE analysis in R

t-SNE is a useful non-linear dimensionality reduction method that allows you to visualise data embedded in a lower number of dimensions, e.g. 2, in order to see patterns and trends in the data. It can deal with more complex, non-linear patterns of Gaussian clusters in multidimensional space than PCA, so it is good for single cell RNA-seq analysis. It is not suited to finding outliers, however, because the arrangement of the samples does not directly represent distance as it does in PCA; t-SNE preserves local distances at the expense of global distances.

An easy way to run t-SNE on your data is to use a pre-made wrapper function that uses the Rtsne package and ggplot2, like the one that comes with the M3C package (https://bioconductor.org/packages/devel/bioc/html/M3C.html). You can also just take this code from the github and remake it for your own specific needs if necessary (https://github.com/crj32/M3C); it is quite straightforward. The package also has equivalent functions for PCA and UMAP.
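
To give a rough idea of what such a wrapper does internally, here is a minimal sketch using Rtsne directly; this is not the M3C code, it assumes samples are columns, and perplexity = 30 is just the common default.

library(Rtsne)
library(ggplot2)
library(M3C) # for the pollen example data
set.seed(123)
tsne_out <- Rtsne(t(pollen$data), dims = 2, perplexity = 30)
df <- data.frame(X = tsne_out$Y[, 1], Y = tsne_out$Y[, 2])
ggplot(df, aes(x = X, y = Y)) + geom_point(colour = 'gold') + theme_bw()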

Let’s load some single cell RNA-seq data and demonstrate this function.

This is the Pollen et al. single cell RNA-seq data, but you can use a different kind of omic data, or non-omic data. Let's just take a look at that data.

# load your omic data here as mydata; we use the pollen data that comes with M3C
library(M3C)
tsne(pollen$data,colvec=c('gold'))

TSNE1.png

So, we can see this data is neatly grouped into clusters for us. Let's compare these with the cell type labels. This is just using the default colours; for changing to custom colours we can use the 'controlscale', 'scale', and 'colvec' parameters. You can read about these in the documentation using ?tsne.

library(M3C)
tsne(pollen$data,labels=as.factor(pollen$celltypes))

TSNElabeled.png

So now we might want to display the expression of a specific gene on the t-SNE plot, which we can easily do with this function. The code below can be modified for any given gene; this is for the gene 'HBG1', and we can change it as needed. It just pulls a numerical vector out of the gene expression matrix for us and z-score scales it for display.

I chose HBG1 because I had run the 'featurefilter' function from M3C, which collects statistics about all genes in the data frame, such as the coefficient of variation and its second order derivative. Use ?featurefilter to see more information. This gene was the most variable one in the data. Basically this function is for filtering the data according to variance and producing summary statistics per gene; a sketch of how to run it is below.
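
As a rough sketch of that step (the percentile argument is the one I recall for keeping the most variable fraction of genes; check ?featurefilter for the exact interface and returned elements):

library(M3C)
# keep roughly the top 10% most variable genes and get per-gene statistics
filtered <- featurefilter(pollen$data, percentile = 10)
str(filtered, max.level = 1) # inspect what the function returns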

Here we choose to control the colour scale; setting 'scale' to 2 means low values are grey and high values are red. However, the 'low' and 'high' colours may be user specified.

tsne(pollen$data,
labels=scale(as.numeric(pollen$data[row.names(pollen$data)=='HBG1',])),
controlscale = TRUE,scale=2)

TSNElabeled.png

Lastly, I want to find out which cell type the HBG1 expression is coming from. We can do this using the PCA function, because the dots are too close together to read the text on the t-SNE plot. To do this we have to pass a text variable to the pca function.

pca(pollen$data,labels=scale(as.numeric(pollen$data[row.names(pollen$data)=='HBG1',])),
controlscale=TRUE,scale=2,text=pollen$celltypes,
textlabelsize = 2)

PCAlabeled.png

So, basically, we can zoom in on the printed image and see that the red cluster is all the K562 cell type. The 'text' parameter is primarily for finding outliers with PCA, but it can be used in other ways too.

M3C is not for clustering single cell RNA-seq data, because of the high complexity of the algorithm and the type of consensus clustering it does, but we can use its functions for other purposes anyway. It is good for clustering cancer single-omic data, for example. It is essentially an enhanced version of the Monti consensus clustering algorithm, originally implemented in the ConsensusClusterPlus package.

For an algorithm that can handle single cell RNA-seq better, as well as a wide range of omic and non-omic data, Spectrum is a good option (https://cran.r-project.org/web/packages/Spectrum/index.html). It is faster and in many ways more sophisticated than the consensus clustering approach. SC3 and Seurat are also good for single cell RNA-seq clustering.


Easy quick PCA analysis in R

Principal component analysis (PCA) is very useful for doing some basic quality control (e.g. looking for batch effects) and for assessing how the data is distributed (e.g. finding outliers). A straightforward way is to make your own wrapper function for prcomp and ggplot2; another is to use the one that comes with M3C (https://bioconductor.org/packages/devel/bioc/html/M3C.html) or another package. M3C is a consensus clustering tool that makes some major modifications to the Monti et al. (2003) algorithm so that it behaves better, and it also provides functions for data visualisation.

Let’s have a go on an example cancer microarray dataset.

# M3C loads an example dataset mydata with metadata desx
library(M3C)
# do PCA
pca(mydata,colvec=c('gold'),printres=TRUE)

PCApriorclustering.png

So, what prcomp has done is extract the eigenvectors of the data's covariance matrix, then project the original data samples onto them as linear combinations. This yields PC scores, which are plotted on PC1 and PC2 here (eigenvectors 1 and 2). The eigenvectors are sorted by their eigenvalues, so these first two capture the most variation in the data, but it might be a good idea to look at additional PCs, which is beyond this simple blog post and function.
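
For the curious, a minimal sketch of those internal steps in base R, assuming samples are columns as in the M3C example data:

pr <- prcomp(t(mydata)) # eigendecomposition of the covariance matrix
scores <- pr$x[, 1:2] # PC scores: samples projected onto eigenvectors 1 and 2
summary(pr)$importance[2, 1:5] # proportion of variance captured by the first PCs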

You can see above there are no obvious outliers for removal here, which is good for cluster analysis. However, if there were outliers, we could label all samples with their names using the 'text' parameter.

library(M3C)
pca(mydata,colvec=c('gold'),printres=TRUE,text=colnames(mydata))

PCApriorclustering.png

Other objectives would be comparing the samples with their batch, to make sure we do not have batch effects driving the variation in our data, and comparing them with other variables such as gender and age. Since the metadata only contains tumour class, we are going to use that next to see how it relates to these PC scores.

This is a categorical variable, so the 'scale' parameter should be set to 3, 'controlscale' is set to TRUE because I want to choose the colours, and the 'labels' parameter passes the metadata tumour class into the function. I am increasing the 'printwidth' from its default value because the legend is quite wide.

For more information see the function documentation using ?pca.

# first reformat meta to replace NA with Unknown
desx$class <- as.character(desx$class)
desx$class[is.na(desx$class)] <- 'Unknown'
# next do the plot
pca(mydata,colvec=c('gold','skyblue','tomato','midnightblue','grey'),
labels=desx$class,controlscale=TRUE,scale=3,printres=TRUE,printwidth=25)

PCAlabeled.png

So now it appears that the variation governing these two PCs is indeed related to tumour class, which is good. But what if the variable is continuous, and we wanted to compare read mapping rate, read duplication percentage, or age with our data? A simple change of the parameters allows this too. Let's make up a continuous variable, then add it to the plot. In this case we change the 'scale' parameter to reflect that we are using continuous data, and the spectral palette is used for the colours by default.

randomvec <- seq(1,50)
pca(mydata,labels=randomvec,controlscale=TRUE,scale=1,legendtitle='Random',
printres=TRUE,printwidth=25)

PCAlabeled.png

So, since this is a made-up variable, we can see it has no relationship with our data. Let's just define a custom scale now. We change 'scale' to 2, then use the 'low' and 'high' parameters to define the colour range of the scale.

pca(mydata,labels=randomvec,controlscale=TRUE,scale=2,legendtitle='Random',
printres=TRUE,printwidth=25,low='grey',high='red')

PCAlabeled.png

Super easy, yet pretty cool. The idea here is just to minimise the amount of code hanging around for doing basic analyses like these. You can rip the code from here: https://github.com/crj32/M3C/blob/master/R/pca.R. Remember, if your features are on different scales, the data needs transforming to be comparable beforehand.
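
For example, a minimal way to z-score each feature (row) first:

# z-score each feature (row) so features on different scales become comparable
mydata_scaled <- t(scale(t(mydata)))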

M3C is part of a series of cluster analysis tools that we are releasing, the next is ‘Spectrum’, a fast adaptive spectral clustering tool, which is already on CRAN (https://cran.r-project.org/web/packages/Spectrum/index.html). Perhaps there will be time to blog about that in the future.

For quantitative analysis of the drivers of variation in data, I recommend checking out David Watson's great function in the 'bioplotr' package called 'plot_drivers' (https://github.com/dswatson/bioplotr). It compares the PCs with categorical and continuous variables and performs univariate statistical analysis on them, producing a very nice plot.

Using clusterlab to benchmark clustering algorithms

Clusterlab is a CRAN package (https://cran.r-project.org/web/packages/clusterlab/index.html) for the routine testing of clustering algorithms. It can simulate positive controls (data-sets with >1 clusters) and negative controls (data-sets with 1 cluster). Why test clustering algorithms? Because they often fail to identify the true K in practice, published algorithms are not always well tested, and we need to know about ones that have strange behaviour. I've found in many of my own experiments that the clustering algorithms many people are using are not necessarily the ones that provide the most sensible results. I can give a good example below.
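
As a minimal sketch of how the controls are simulated (centers is a genuine clusterlab parameter; the values below are arbitrary):

library(clusterlab)
library(M3C) # for the pca plotting function
neg <- clusterlab(centers = 1) # negative control: one cluster, no real structure
pos <- clusterlab(centers = 4) # positive control: ground truth K = 4
pca(pos$synthetic_data) # eyeball the simulated clusters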

I was interested to see that clusterExperiment, a relatively new clustering package on Bioconductor, cannot detect the ground truth K in my testing so far, instead yielding solutions with many more clusters than there are in reality. On the other hand, the package I developed with David Watson here at QMUL does work rather well. It is a derivative of the Monti et al. (2003) consensus clustering algorithm, fitted with a Monte Carlo reference procedure to eliminate overfitting. We called this algorithm M3C.

library(clusterExperiment)
library(clusterlab)
library(M3C)

Experiment 1: Simulate a positive control dataset (K=5)

PCAlabeled_real.png

k5 <- clusterlab(centers=5)

Experiment 2: Test RSEC (https://bioconductor.org/packages/release/bioc/html/clusterExperiment.html)

RSEC_test.png

rsec_test <- RSEC(as.matrix(k5$synthetic_data),ncores=10)
assignments <- primaryCluster(rsec_test)
pca(k5$synthetic_data,labels=assignments)

Experiment 3: Test M3C (https://bioconductor.org/packages/release/bioc/html/M3C.html)

PCAlabeled_M3C.png

M3C_test <- M3C(k5$synthetic_data,iters=10,cores=10)
optk <- which.max(M3C_test$scores$RCSI)+1
pca(M3C_test$realdataresults[[optk]]$ordered_data,
labels=M3C_test$realdataresults[[optk]]$ordered_annotation$consensuscluster,printres=TRUE)

Interesting, isn't it, R readers?

Well, all I can say is that I recommend comparing different machine learning methods and, if in doubt, running your own control data-sets (positive and negative controls) to test them. In the other post we showed a remarkable bias in optimism corrected bootstrapping compared with LOOCV under certain conditions, simply by simulating null data-sets and passing them into the method.

If you are clustering omic data from a single platform (RNA-seq, protein, methylation, etc.) I have tested and recommend the following packages:

CLEST: https://www.rdocumentation.org/packages/RSKC/versions/2.4.2/topics/Clest
M3C: https://bioconductor.org/packages/release/bioc/html/M3C.html

And to be honest, that is about it. I have also tested PINSplus, the GAP-statistic, and SNF, but they did not provide satisfactory results in my experiments on single platform clustering (currently unpublished data). Multi-omic and single cell RNA-seq analysis is another story; there will be more on that to come in the future, R readers.

Remember there is a darkness out there R readers, not just in Washington, but in the world of statistics. It is there because of the positive results bias in science, because of people not checking methods and comparing them with one another, and because of several other reasons I can’t even be bothered to mention.


Part 5: Code corrections to optimism corrected bootstrapping series

The truth is out there, R readers, but often it is not what we have been led to believe. The previous post examined the strong positive results bias in optimism corrected bootstrapping (a method of assessing a machine learning model's predictive power) with increasing p (completely random features). This bias is real and is affecting publications. However, of the two implementations of the method given in the previous post, the first has a slight error while the second seems fine. The trend is still the same with the corrected code, but the problem with the earlier code is that I did not set 'replace=TRUE' in the call to the 'sample' function. The problem inherent to the 'optimism corrected bootstrapping' method is more about estimating error using the same samples for training and testing than about how we resample the data, so this change does not make a large difference to the results. We will go into a little more detail about bootstrapping in this article and repeat the analyses with the corrected code. Once again, you are welcome to re-run the code written here and also test using the independent Caret implementation shown on the other page.

For the uninformed reader, I have also shown the problem using Caret (https://intobioinformatics.wordpress.com/2018/12/25/optimism-corrected-bootstrapping-a-problematic-method/), and where it originates from in the method here (it is obvious with a bit of statistical knowledge) (https://intobioinformatics.wordpress.com/2018/12/28/part-4-more-bias-and-why-does-bias-occur-in-optimism-corrected-bootstrapping/). It is a very simple test I have done (using simulated null data-sets with increasing features) and shows the problem with the method very clearly.

Thanks to ECOQUANT for pointing out to me that the 'sample' function should have been called with an additional parameter (replace=TRUE).

Let’s just recap what bootstrapping is and what optimism corrected bootstrapping is before we redo the experiments:

This is from Jason Brownlee's excellent blog (https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/); bootstrapping is:

  1. Choose a number of bootstrap samples to perform
  2. Choose a sample size
  3. For each bootstrap sample (b=1 … B)
    1. Draw a sample with replacement with the chosen size
    2. Calculate the statistic on the sample
  4. Calculate the mean of the calculated sample statistics.

The 'with replacement' part means we put each individual sample back before drawing the next one in the bth bootstrap iteration. Thus, we usually have duplicate samples in our resample of the data when doing bootstrapping.
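
As a minimal base R illustration of plain bootstrapping (the data and the number of resamples are arbitrary choices):

set.seed(1)
x <- rnorm(30, mean = 5)
# draw 1000 bootstrap resamples of x and calculate the mean of each
boot_means <- replicate(1000, mean(sample(x, length(x), replace = TRUE)))
mean(boot_means) # bootstrap estimate of the statistic
sd(boot_means) # bootstrap estimate of its standard error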

This is the optimism corrected bootstrapping algorithm:

  1. Fit a model M to entire data S and estimate predictive ability C.
  2. Iterate from b=1…B:
    1. Take a resample from the original data, S*
    2. Fit the bootstrap model M* to S* and get predictive ability, C_boot
    3. Use the bootstrap model M* to get predictive ability on S, C_orig
  3. Optimism O is calculated as mean(C_boot – C_orig)
  4. Calculate optimism corrected performance as C-O.

Since we use the same data in step 3 of the bootstrap loop to train and test the model (an information leak), we would expect increasing bias (C_orig will be too high, and thus O too small) as more and more random features are added. See the previous post for more explanation of this. Another point: the optimism corrected bootstrap is usually done with a resample size of N, instead of just a fraction of N. I found the following quote to support this:

“The small data set was repeatedly re-sampled to produce b replicated data sets, each the same size as the original. We used b = 200. The predictive model was fitted to each of the b replicated data sets in turn. Each fitted model was then applied both to the resampled data set from which it was generated and to the original data set.”

Smith, Gordon CS, et al. “Correcting for optimistic prediction in small data sets.” American journal of epidemiology 180.3 (2014): 318-324.

I have tried reducing the re-sampling size to a fraction of N, which reduces the bias somewhat, but it is still there. This makes sense given the information leak in this method, which results in an underestimation of the optimism (O).

You're welcome to experiment with this code yourselves. If you are thinking of using this method, I recommend first simulating null datasets with the same number of dimensions to check how biased your AUC/model performance will be. When we have high numbers of features, using this method is clearly a serious mistake.

This code can be directly copied and pasted into R to repeat the experiments.

Experiment 1: my implementation – glmnet (lasso logistic regression)


library(glmnet)
library(pROC)
library(caret)
library(ggplot2)
library(kimisc)

### TEST 1: bootstrap optimism with glmnet
cc <- c()
for (zz in seq(2,100,1)){
  print(zz)
  ## set up test data
  test <- matrix(rnorm(100*zz, mean = 0, sd = 1),
                 nrow = 100, ncol = zz, byrow = TRUE)
  labelsa <- as.factor(c(rep('A',50),rep('B',50)))
  colnames(test) <- paste('Y',seq(1,zz),sep='')
  row.names(test) <- paste('Z',seq(1,100),sep='')

  ### my own implementation of optimism corrected bootstrapping

  ## 1. fit model to entire test data (i.e. overfit it)

  orig <- glmnet(test,y=labelsa,alpha=1,family = "binomial")
  preds <- predict(orig,newx=test,type='response',s=0.01)
  auc <- roc(labelsa,as.vector(preds))
  original_auc <- as.numeric(auc$auc)

  ## 2. take resample of data and try to estimate optimism

  test2 <- cbind(test,labelsa)
  B <- 50
  results <- matrix(ncol=2,nrow=B)
  for (b in seq(1,B)){
    ## get the bootstrapped resampled data
    boot <- test2[sample(row.names(test2),100,replace=TRUE),]
    labelsb <- boot[,ncol(boot)]
    boot <- boot[,-ncol(boot)]
    ## use the bootstrapped model to predict its own labels
    bootf <- glmnet(boot,y=labelsb,alpha=1,family = "binomial")
    preds <- predict(bootf,newx=boot,type='response',s=0.01)
    auc <- roc(labelsb,as.vector(preds))
    boot_auc <- as.numeric(auc$auc)
    ## use bootstrap model to predict original labels
    preds <- predict(bootf,newx=test,type='response',s=0.01)
    auc <- roc(labelsa,as.vector(preds))
    boot_original_auc <- as.numeric(auc$auc)
    ## add the data to results
    results[b,1] <- boot_auc
    results[b,2] <- boot_original_auc
  }
  ## calculate the optimism
  O <- mean(results[,1]-results[,2])
  ## calculate optimism corrected measure of prediction (AUC)
  corrected <- original_auc-O
  ##
  cc <- c(cc,corrected)
  print(cc)
}

## print it
df <- data.frame(p=seq(2,100,1),optimism_corrected_boot_AUC=cc)
p1 <- ggplot(df, aes(x=p, y=optimism_corrected_boot_AUC)) +
geom_line() + ggtitle('glmnet - random data only gives positive result with optimism corrected bootstrap')
print(p1)
png('glmnet_test_upto100.png', height = 15, width = 27, units = 'cm',
res = 900, type = 'cairo')
print(p1)
dev.off()

Here are the results with 100 samples, 50 bootstrap iterations, and from 2 to 100 random features drawn from a Gaussian distribution. We are re-sampling using the original sample size (N=100).

Random features are being added iteratively on the x axis, and on the y axis we have AUC. The AUC should be 0.5 to reflect that the data has no real predictive power, but it is highly inflated.

glmnet_test_upto100.png

Experiment 2: another implementation – glm (logistic regression)


## TEST2
auc.adjust <- function(data, fit, B){
  fit.model <- fit
  data$pred.prob <- fitted(fit.model)
  # get overfitted AUC
  auc.app <- roc(data[,1], data$pred.prob, data=data)$auc # requires 'pROC'
  auc.boot <- vector(mode = "numeric", length = B)
  auc.orig <- vector(mode = "numeric", length = B)
  o <- vector(mode = "numeric", length = B)
  for(i in 1:B){
    boot.sample <- sample.rows(data, nrow(data), replace=TRUE) # requires 'kimisc'
    fit.boot <- glm(formula(fit.model), data = boot.sample, family = "binomial")
    boot.sample$pred.prob <- fitted(fit.boot)
    # get bootstrapped AUC
    auc.boot[i] <- roc(boot.sample[,1], boot.sample$pred.prob, data=boot.sample)$auc
    # get original data boot AUC
    data$pred.prob.back <- predict.glm(fit.boot, newdata=data, type="response")
    auc.orig[i] <- roc(data[,1], data$pred.prob.back, data=data)$auc
    # calculate the optimism for this resample
    o[i] <- auc.boot[i] - auc.orig[i]
  }
  auc.adj <- auc.app - (sum(o)/B)
  return(auc.adj)
}
cc <- c()
for (zz in seq(2,100,1)){
  print(zz)
  ## set up test data
  test <- matrix(rnorm(100*zz, mean = 0, sd = 1),
                 nrow = 100, ncol = zz, byrow = TRUE)
  labelsa <- as.factor(c(rep('A',50),rep('B',50)))
  colnames(test) <- paste('Y',seq(1,zz),sep='')
  row.names(test) <- paste('Z',seq(1,100),sep='')
  test2 <- data.frame(cbind(labelsa,test))
  test2$labelsa <- as.factor(test2$labelsa)
  ## 1. make overfitted model
  model <- glm(labelsa ~., data = test2, family = "binomial")
  ## 2. estimate optimism and correct
  d <- auc.adjust(test2, model, B=200)
  cc <- c(cc,d)
}
## print it
df <- data.frame(p=seq(2,100,1),optimism_corrected_boot_AUC=cc)
p2 <- ggplot(df, aes(x=p, y=optimism_corrected_boot_AUC)) +
geom_line() + ggtitle('glm - random data only gives positive result with optimism corrected bootstrap')
print(p2)
png('glmt_test_upto100.png', height = 15, width = 27, units = 'cm',
res = 900, type = 'cairo')
print(p2)
dev.off()

glmt_test_upto100.png

Part 4: Why does bias occur in optimism corrected bootstrapping?

In the previous parts of the series we demonstrated a positive results bias in optimism corrected bootstrapping simply by adding random features to our labels. This problem is due to an 'information leak' in the algorithm, meaning the training and test datasets are not kept separate when estimating the optimism. Because of this, the optimism can, under some conditions, be severely underestimated. Let's analyse the code; it is pretty straightforward to understand, and then we can see where the problem originates.

  1. Fit a model M to entire data S and estimate predictive ability C. ## this part is our overfitted estimation of performance (can be AUC, accuracy, etc)
  2. Iterate from b=1…B: ## now we are doing resampling of the data to estimate the error
    1. Take a resample from the original data, S*
    2. Fit the bootstrap model M* to S* and get predictive ability, C_boot ## this will clearly give us another overfitted model performance, which is fine
    3. Use the bootstrap model M* to get predictive ability on S, C_orig ## we are using the same data (samples) used to train the model to test it, therefore it is not surprising that we have values with better performance than expected. C_orig values will be too high.
  3. Optimism O is calculated as mean(C_boot – C_orig)
  4. Calculate optimism corrected performance as C-O.

One way of correcting for this would be to change step 3 of the bootstrap loop to test on the left out (unseen) data in the bootstrap instead of on the original data. That way the training and test data are kept entirely separate in terms of samples, eliminating our bias towards inflated model performance on null datasets with many features; a minimal sketch of this idea follows below. There is probably no point in doing this anyway when we have methods such as LOOCV and k-fold cross validation.
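
A minimal sketch of that corrected resampling step might look like this; the data frame here is an arbitrary stand-in with samples as rows, not a full implementation.

set.seed(1)
data <- data.frame(matrix(rnorm(100*5), nrow = 100))
idx <- sample(seq_len(nrow(data)), nrow(data), replace = TRUE)
boot_sample <- data[idx, ] # fit the bootstrap model M* on this
oob_sample <- data[-unique(idx), ] # ~36.8% of samples on average stay unseen: test M* here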

As p (features) >> N (samples), we get better and better apparent model performance when using the bootstrapped data against the original data. Why? Because the original data contains the same samples as the bootstrap, and as we get more features, the greater the chance that some of them randomly correlate with our response variable. When we test the bootstrap model on the original data (the same samples plus a few more), it retains some of this random ability to predict the real labels. This is a typical overfitting problem when we have higher numbers of features, and the procedure is faulty.

Let's take another experimental look at the problem; this code can be directly copied and pasted into R to repeat the analyses and plots. We have two implementations of the method: the first by me for glmnet (lasso logistic regression), the second for glm (logistic regression) from this website (http://cainarchaeology.weebly.com/r-function-for-optimism-adjusted-auc.html). Feel free to try different machine learning algorithms and play with the parameters.


library(glmnet)
library(pROC)
library(caret)
library(ggplot2)
library(kimisc)

### TEST 1: bootstrap optimism with glmnet
cc <- c()
for (zz in seq(2,100,1)){
  print(zz)
  ## set up test data
  test <- matrix(rnorm(100*zz, mean = 0, sd = 1),
                 nrow = 100, ncol = zz, byrow = TRUE)
  labelsa <- as.factor(c(rep('A',50),rep('B',50)))
  colnames(test) <- paste('Y',seq(1,zz),sep='')
  row.names(test) <- paste('Z',seq(1,100),sep='')

  ### my own implementation of optimism corrected bootstrapping

  ## 1. fit model to entire test data (i.e. overfit it)

  orig <- glmnet(test,y=labelsa,alpha=1,family = "binomial")
  preds <- predict(orig,newx=test,type='response',s=0.01)
  auc <- roc(labelsa,as.vector(preds))
  original_auc <- as.numeric(auc$auc)

  ## 2. take resample of data and try to estimate optimism

  test2 <- cbind(test,labelsa)
  B <- 50
  results <- matrix(ncol=2,nrow=B)
  for (b in seq(1,B)){
    ## get the bootstrapped resampled data
    boot <- test2[sample(row.names(test2),50),] # NB: no replace=TRUE here; this is the slight error corrected in Part 5
    labelsb <- boot[,ncol(boot)]
    boot <- boot[,-ncol(boot)]
    ## use the bootstrapped model to predict its own labels
    bootf <- glmnet(boot,y=labelsb,alpha=1,family = "binomial")
    preds <- predict(bootf,newx=boot,type='response',s=0.01)
    auc <- roc(labelsb,as.vector(preds))
    boot_auc <- as.numeric(auc$auc)
    ## use bootstrap model to predict original labels
    preds <- predict(bootf,newx=test,type='response',s=0.01)
    auc <- roc(labelsa,as.vector(preds))
    boot_original_auc <- as.numeric(auc$auc)
    ## add the data to results
    results[b,1] <- boot_auc
    results[b,2] <- boot_original_auc
  }
  ## calculate the optimism
  O <- mean(results[,1]-results[,2])
  ## calculate optimism corrected measure of prediction (AUC)
  corrected <- original_auc-O
  ##
  cc <- c(cc,corrected)
  print(cc)
}
## print it
df <- data.frame(p=seq(2,100,1),optimism_corrected_boot_AUC=cc)
p1 <- ggplot(df, aes(x=p, y=optimism_corrected_boot_AUC)) +
geom_line() + ggtitle('glmnet - random data only gives positive result with optimism corrected bootstrap')
print(p1)
png('glmnet_test_upto100.png', height = 15, width = 27, units = 'cm',
res = 900, type = 'cairo')
print(p1)
dev.off()

So here are the results: as the number of noise-only features on the x axis increases, our 'corrected' estimate of AUC (on the y axis) also increases, once there are enough features to allow the noise to randomly predict the labels. This shows the problem starts at about 40-50 features, then gets worse until about 75+. This is with the 'glmnet' function.

glmnet_test_upto100.png

Let's look at the method using glm; we find the same trend with a different implementation.

## TEST2
auc.adjust <- function(data, fit, B){
  fit.model <- fit
  data$pred.prob <- fitted(fit.model)
  # get overfitted AUC
  auc.app <- roc(data[,1], data$pred.prob, data=data)$auc # requires 'pROC'
  auc.boot <- vector(mode = "numeric", length = B)
  auc.orig <- vector(mode = "numeric", length = B)
  o <- vector(mode = "numeric", length = B)
  for(i in 1:B){
    boot.sample <- sample.rows(data, nrow(data), replace=TRUE) # requires 'kimisc'
    fit.boot <- glm(formula(fit.model), data = boot.sample, family = "binomial")
    boot.sample$pred.prob <- fitted(fit.boot)
    # get bootstrapped AUC
    auc.boot[i] <- roc(boot.sample[,1], boot.sample$pred.prob, data=boot.sample)$auc
    # get original data boot AUC
    data$pred.prob.back <- predict.glm(fit.boot, newdata=data, type="response")
    auc.orig[i] <- roc(data[,1], data$pred.prob.back, data=data)$auc
    # calculate the optimism for this resample
    o[i] <- auc.boot[i] - auc.orig[i]
  }
  auc.adj <- auc.app - (sum(o)/B)
  return(auc.adj)
}
cc <- c()
for (zz in seq(2,100,1)){
  print(zz)
  ## set up test data
  test <- matrix(rnorm(100*zz, mean = 0, sd = 1),
                 nrow = 100, ncol = zz, byrow = TRUE)
  labelsa <- as.factor(c(rep('A',50),rep('B',50)))
  colnames(test) <- paste('Y',seq(1,zz),sep='')
  row.names(test) <- paste('Z',seq(1,100),sep='')
  test2 <- data.frame(cbind(labelsa,test))
  test2$labelsa <- as.factor(test2$labelsa)
  ## 1. make overfitted model
  model <- glm(labelsa ~., data = test2, family = "binomial")
  ## 2. estimate optimism and correct
  d <- auc.adjust(test2, model, B=200)
  cc <- c(cc,d)
}
## print it
df <- data.frame(p=seq(2,100,1),optimism_corrected_boot_AUC=cc)
p2 <- ggplot(df, aes(x=p, y=optimism_corrected_boot_AUC)) +
geom_line() + ggtitle('glm - random data only gives positive result with optimism corrected bootstrap')
print(p2)
png('glmt_test_upto100.png', height = 15, width = 27, units = 'cm',
res = 900, type = 'cairo')
print(p2)
dev.off()

So there we have it: the method has a problem and should not be used with more than about 40 features. Unfortunately, this method is currently being used for datasets with more dimensions than this (40+), because people think that since it is a published method it is safe; unfortunately, that is not how the world works, R readers. Remember, the system is corrupt, science and statistics are full of lies, and if in doubt, do your own tests on positive and negative controls.

This is with the 'glm' function. Random features are added on the x axis, with the corrected AUC on the y axis.

glmt_test_upto100.png

What if you don't believe it? I mean, this is a textbook method. Well, R readers, if that is so, I suggest coding it yourself and trying the code here: run the experiments on null (random data only) datasets with increasing features.

This is the last part in the series on debunking the optimism corrected bootstrap method. I consider it debunked.

Part 3: Two more implementations of optimism corrected bootstrapping show clear positive results bias

Previously we saw, with a reproducible code implementation, that optimism corrected bootstrapping is very biased when we have many features (50-100 or more). Just re-run the code yourself and make up your own mind if in doubt.

This time I have used a different person's implementation, using the 'glm' function, i.e. logistic regression, to show that the same misleading trend occurs: positive results on purely random data. The code has been taken directly from http://cainarchaeology.weebly.com/r-function-for-optimism-adjusted-auc.html. The second implementation is from another unnamed statistician.

If you disagree with these findings, please show your own implementation of the method in R (not using Caret), following the same experiment of increasing the number of purely noise features to a very high level with binary labels. Also, check the last two posts and re-run the code yourself before making your mind up and providing counter arguments.

There are no real predictors in this data; all of them are random data from a normal distribution. The bias is less severe using glm than with glmnet.

You should be able to copy and paste this code directly into R to repeat the results. I can’t vouch for this code because I did not write it, but both implementations show the same result as in the last blog post with my code.

Implementation 1. A glm experiment.


### example of logistic regression optimism corrected bootstrapping
# source: http://cainarchaeology.weebly.com/r-function-for-optimism-adjusted-auc.html

auc.adjust <- function(data, fit, B){
  fit.model <- fit
  data$pred.prob <- fitted(fit.model)
  auc.app <- roc(data[,1], data$pred.prob, data=data)$auc # require 'pROC'
  auc.boot <- vector(mode = "numeric", length = B)
  auc.orig <- vector(mode = "numeric", length = B)
  o <- vector(mode = "numeric", length = B)
  for(i in 1:B){
    boot.sample <- sample.rows(data, nrow(data), replace=TRUE) # require 'kimisc'
    fit.boot <- glm(formula(fit.model), data = boot.sample, family = "binomial")
    boot.sample$pred.prob <- fitted(fit.boot)
    auc.boot[i] <- roc(boot.sample[,1], boot.sample$pred.prob, data=boot.sample)$auc
    data$pred.prob.back <- predict.glm(fit.boot, newdata=data, type="response")
    auc.orig[i] <- roc(data[,1], data$pred.prob.back, data=data)$auc
    o[i] <- auc.boot[i] - auc.orig[i]
  }
  auc.adj <- auc.app - (sum(o)/B)
  boxplot(auc.boot, auc.orig, names=c("auc.boot", "auc.orig"))
  title(main=paste("Optimism-adjusted AUC", "\nn of bootstrap resamples:", B),
        sub=paste("auc.app (blue line)=", round(auc.app, digits=4),
                  "\nadj.auc (red line)=", round(auc.adj, digits=4)), cex.sub=0.8)
  abline(h=auc.app, col="blue", lty=2)
  abline(h=auc.adj, col="red", lty=3)
}

## generate random data

xx.glmnet <- matrix(nrow=53,ncol=1)
xx.glmnet <- data.frame(xx.glmnet)
xx.glmnet[,1] <- c(rep(0,25),rep(1,28))
test <- matrix(rnorm(53*1000, mean = 0, sd = 1),
nrow = 53, ncol = 1000, byrow = TRUE)
xx.glmnet <- cbind(xx.glmnet,test)
colnames(xx.glmnet) <- c('outcome',paste('Y',seq(1,1000),sep=''))

## 1. make overfitted model

model <- glm(outcome ~., data = xx.glmnet, family = "binomial")

## 2. estimate optimism and correct

auc.adjust(xx.glmnet, model, B=200)

Let's have a look at the results, which agree nicely with our previous findings from a from-scratch implementation of the method. The red line is supposedly our corrected AUC, but the AUC should be 0.5 when running on random data. See the previous part 1 and part 2 posts for cross validation results on random data, which give the correct result.

Rplot

Implementation 2: Another glmnet experiment

Edit: It has come to my attention that the implementation below suffers from numerical instability, possibly due to the cv.glmnet function giving unstable results. Sometimes we see bias, sometimes not. This is not the best example.


### example of optimism corrected bootstrapping implementation
# source: an unnamed programmer

library(boot)
library(pROC)
library(glmnet)

# glmnet - all predictors penalised
compare_opt_glmnet <- function(orig_data_glmnet, i){
  # orig_data_glmnet = the whole data
  train_data_glmnet <- orig_data_glmnet[i, ] # get bootstrap
  model_glmnet <- model_process_glmnet(train_data_glmnet)
  # return a summary of the optimism results
  AUC_results_glmnet <- c(
    # boot against original
    AUC_train1 = roc(orig_data_glmnet$outcome,
                     as.vector(predict(model_glmnet, type="response",
                                       newx = model.matrix(outcome~., orig_data_glmnet)[,-1],
                                       s = 'lambda.min')))$auc,
    # boot against boot
    AUC_train2 = roc(train_data_glmnet$outcome,
                     as.vector(predict(model_glmnet, type="response",
                                       newx = model.matrix(outcome~., train_data_glmnet)[,-1],
                                       s = 'lambda.min')))$auc,
    # boot against boot minus boot against original (the optimism)
    AUC_diff = roc(train_data_glmnet$outcome,
                   as.vector(predict(model_glmnet, type="response",
                                     newx = model.matrix(outcome~., train_data_glmnet)[,-1],
                                     s = 'lambda.min')))$auc -
               roc(orig_data_glmnet$outcome,
                   as.vector(predict(model_glmnet, type="response",
                                     newx = model.matrix(outcome~., orig_data_glmnet)[,-1],
                                     s = 'lambda.min')))$auc
  )
  return(AUC_results_glmnet)
}

model_process_glmnet <- function(the_data_glmnet){
  model_final_glmnet <- cv.glmnet(model.matrix(outcome~., the_data_glmnet)[,-1],
                                  the_data_glmnet$outcome, alpha=1, family = "binomial")
  return(model_final_glmnet)
}

# generate random data

test <- matrix(rnorm(53*1000, mean = 0, sd = 1),
nrow = 53, ncol = 1000, byrow = TRUE)
test <- data.frame(test)
labs <- factor(c(0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,1,1,1,0,1,1,0,1,0,1,0,0,0,0,0,
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0))
labs <- data.frame(as.matrix(labs))
colnames(labs)[1] <- 'outcome'
xx.glmnet <- cbind(labs,test)

## 1. make overfitted model

tempd <- xx.glmnet[,2:ncol(xx.glmnet)]
labels <- xx.glmnet[,1]
overfittedmodel <- cv.glmnet(as.matrix(tempd),y=labels,alpha=1,family="binomial")
lasso.prob <- predict(overfittedmodel,type="response",newx=as.matrix(tempd),s='lambda.min')
auc <- roc(labels,as.vector(lasso.prob))
overfitted_auc <- as.numeric(auc$auc)

## 2. try to correct for optimism

Repeats = 100
res_opt_glmnet <- boot(xx.glmnet, statistic = compare_opt_glmnet, R = Repeats)
optimism_glmnet <- mean(res_opt_glmnet$t[ , 3])
enhanced_glmnet <- overfitted_auc - optimism_glmnet

Here are the results of the above code on random data only.

> enhanced_glmnet
[1] 0.8916488

If in doubt, compare your results with LOOCV and k-fold cross validation; this is how I discovered this method is dodgy, by seeing substantial differences on null data-sets where no real predictive ability should be found. These two methods are more widely used and reliable, and a quick sanity check along these lines is sketched below.
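
As a hedged sketch of such a sanity check, using caret's k-fold cross validation on null data like that used above; the cross-validated ROC should sit near 0.5.

library(caret)
set.seed(1)
test <- matrix(rnorm(100*50), nrow = 100, ncol = 50)
colnames(test) <- paste('Y', seq(1, 50), sep = '')
labelsa <- factor(c(rep('A', 50), rep('B', 50)))
ctrl <- trainControl(method = 'cv', number = 10, classProbs = TRUE,
                     summaryFunction = twoClassSummary)
fit <- train(x = test, y = labelsa, method = 'glmnet',
             metric = 'ROC', trControl = ctrl)
max(fit$results$ROC) # best cross-validated AUC across the tuning grid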

Sorry @ Frank Harrell, I don’t have time atm to do the extra bits you suggest.
Game on @ Davis Vaughan. Do your own implementation with the same data as I have used with glmnet and glm.

The steps of this method, briefly, are as follows:

  1. Fit a model M to entire data S and estimate predictive ability C.
  2. Iterate from b=1…B:
    1. Take a resample from the original data, S*
    2. Fit the bootstrap model M* to S* and get predictive ability, C_boot
    3. Use the bootstrap model M* to get predictive ability on S, C_orig
  3. Optimism O is calculated as mean(C_boot – C_orig)
  4. Calculate optimism corrected performance as C-O.