Part 4: Why does bias occur in optimism corrected bootstrapping?

In the previous parts of this series we demonstrated a positive results bias in optimism corrected bootstrapping simply by adding random features to our labels. This problem is due to an ‘information leak’ in the algorithm, meaning the training and test datasets are not kept separate when estimating the optimism. Because of this, the optimism can, under some conditions, be severely underestimated. Let’s analyse the code; it is pretty straightforward to understand, and then we can see where the problem originates.

  1. Fit a model M to entire data S and estimate predictive ability C. ## this part is our overfitted estimation of performance (can be AUC, accuracy, etc)
  2. Iterate from b=1…B: ## now we are doing resampling of the data to estimate the error
    1. Take a resample from the original data, S*
    2. Fit the bootstrap model M* to S* and get predictive ability, C_boot ## this will clearly give us another overfitted model performance, which is fine
    3. Use the bootstrap model M* to get predictive ability on S, C_orig ## we are testing the model on the same samples that were used to train it, so it is not surprising that performance is better than expected; C_orig values will be too high.
  3. Optimism O is calculated as mean(C_boot – C_orig)
  4. Calculate optimism corrected performance as C-O.

One way of correcting for this would be to change step 3 of the bootstrap: instead of testing on the original data, test on the left out (unseen) data of each bootstrap. This way the training and test data are kept entirely separate in terms of samples, which eliminates the bias towards inflated model performance on null datasets with many features. There is probably no point in doing this anyway when we have methods such as LOOCV and K fold cross validation.
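To illustrate what I mean, here is a minimal sketch of that modification (not the published algorithm), where each bootstrap model is evaluated only on its out-of-bag samples; it assumes the same glmnet settings (alpha=1, s=0.01) as my code below.

## sketch: evaluate each bootstrap model on its out-of-bag (unseen) samples
library(glmnet)
library(pROC)
oob_boot_auc <- function(x, y, B = 50, s = 0.01){
  aucs <- rep(NA, B)
  for (b in seq_len(B)){
    idx <- sample(nrow(x), nrow(x), replace = TRUE)  # bootstrap resample with replacement
    oob <- setdiff(seq_len(nrow(x)), idx)            # samples never drawn = out-of-bag test set
    if (length(oob) < 2 || length(unique(y[oob])) < 2) next
    fit <- glmnet(x[idx, ], y = y[idx], alpha = 1, family = "binomial")
    p <- predict(fit, newx = x[oob, ], type = "response", s = s)
    aucs[b] <- as.numeric(roc(y[oob], as.vector(p))$auc)
  }
  mean(aucs, na.rm = TRUE)  # no sample appears in both training and test in any iteration
}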

As p (features) >> N (samples), the bootstrapped models get better and better at predicting the original data. Why? Because the original data contains the same samples as the bootstrap, and the more features we add, the greater the chance that some of them randomly correlate with our response variable. When we test the bootstrap model on the original data (the same samples plus a few more), it retains some of this random ability to predict the real labels. This is a typical overfitting problem when we have large numbers of features, and the procedure is faulty.
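A quick way to convince yourself of this, separate from the bootstrap itself, is to look at how the strongest spurious correlation between purely random features and a random binary label grows with p (a minimal sketch):

## sketch: the best spurious correlation with random labels grows with the number of features
set.seed(123)
random_labels <- rep(c(0, 1), each = 50)
for (p in c(10, 100, 1000)){
  x <- matrix(rnorm(100 * p), nrow = 100, ncol = p)  # pure noise features
  best <- max(abs(cor(x, random_labels)))            # strongest random association
  print(paste('p =', p, ': best absolute correlation =', round(best, 2)))
}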

Let’s take another experimental look at the problem. This code can be copied and pasted directly into R to repeat the analyses and plots. We have two implementations of the method: the first is my own for glmnet (lasso logistic regression), the second is for glm (logistic regression) and comes from this website (http://cainarchaeology.weebly.com/r-function-for-optimism-adjusted-auc.html). Feel free to try different machine learning algorithms and play with the parameters.


library(glmnet)
library(pROC)
library(caret)
library(ggplot2)
library(kimisc)

### TEST 1: bootstrap optimism with glmnet
cc <- c()
for (zz in seq(2,100,1)){
  print(zz)
  ## set up test data
  test <- matrix(rnorm(100*zz, mean = 0, sd = 1),
                 nrow = 100, ncol = zz, byrow = TRUE)
  labelsa <- as.factor(c(rep('A',50),rep('B',50)))
  colnames(test) <- paste('Y',seq(1,zz),sep='')
  row.names(test) <- paste('Z',seq(1,100),sep='')

  ### my own implementation of optimism corrected bootstrapping

  ## 1. fit model to entire test data (i.e. overfit it)

  orig <- glmnet(test,y=labelsa,alpha=1,family = "binomial")
  preds <- predict(orig,newx=test,type='response',s=0.01)
  auc <- roc(labelsa,as.vector(preds))
  original_auc <- as.numeric(auc$auc)

  ## 2. take resample of data and try to estimate optimism

  test2 <- cbind(test,labelsa)
  B <- 50
  results <- matrix(ncol=2,nrow=B)
  for (b in seq(1,B)){
    ## get the bootstrapped resampled data
    boot <- test2[sample(row.names(test2),50),]
    labelsb <- boot[,ncol(boot)]
    boot <- boot[,-ncol(boot)]
    ## use the bootstrapped model to predict its own labels
    bootf <- glmnet(boot,y=labelsb,alpha=1,family = "binomial")
    preds <- predict(bootf,newx=boot,type='response',s=0.01)
    auc <- roc(labelsb,as.vector(preds))
    boot_auc <- as.numeric(auc$auc)
    ## use bootstrap model to predict original labels
    preds <- predict(bootf,newx=test,type='response',s=0.01)
    auc <- roc(labelsa,as.vector(preds))
    boot_original_auc <- as.numeric(auc$auc)
    ## add the data to results
    results[b,1] <- boot_auc
    results[b,2] <- boot_original_auc
  }
  ## calculate the optimism
  O <- mean(results[,1]-results[,2])
  ## calculate optimism corrected measure of prediction (AUC)
  corrected <- original_auc-O
  ##
  cc <- c(cc,corrected)
  print(cc)
}
## print it
df <- data.frame(p=seq(2,100,1),optimism_corrected_boot_AUC=cc)
p1 <- ggplot(df, aes(x=p, y=optimism_corrected_boot_AUC)) +
geom_line() + ggtitle('glmnet - random data only gives positive result with optimism corrected bootstrap')
print(p1)
png('glmnet_test_upto100.png', height = 15, width = 27, units = 'cm',
res = 900, type = 'cairo')
print(p1)
dev.off()

So here are the results: as the number of noise only features (x axis) increases, our 'corrected' estimate of AUC (y axis) also increases, once there are enough features for the noise to randomly predict the labels. The problem starts at about 40-50 features, then gets worse up to about 75+. This is with the 'glmnet' function.

glmnet_test_upto100.png

Let’s look at the method using glm; we find the same trend with a different implementation.

## TEST2
auc.adjust <- function(data, fit, B){
  fit.model <- fit
  data$pred.prob <- fitted(fit.model)
  # get overfitted AUC
  auc.app <- roc(data[,1], data$pred.prob, data=data)$auc # require 'pROC'
  auc.boot <- vector(mode = "numeric", length = B)
  auc.orig <- vector(mode = "numeric", length = B)
  o <- vector(mode = "numeric", length = B)
  for(i in 1:B){
    boot.sample <- sample.rows(data, nrow(data), replace=TRUE) # require 'kimisc'
    fit.boot <- glm(formula(fit.model), data = boot.sample, family = "binomial")
    boot.sample$pred.prob <- fitted(fit.boot)
    # get bootstrapped AUC
    auc.boot[i] <- roc(boot.sample[,1], boot.sample$pred.prob, data=boot.sample)$auc
    # get original data boot AUC
    data$pred.prob.back <- predict.glm(fit.boot, newdata=data, type="response")
    auc.orig[i] <- roc(data[,1], data$pred.prob.back, data=data)$auc
    # calculated optimism corrected version
    o[i] <- auc.boot[i] - auc.orig[i]
  }
  auc.adj <- auc.app - (sum(o)/B)
  return(auc.adj)
}
cc <- c()
for (zz in seq(2,100,1)){
  print(zz)
  ## set up test data
  test <- matrix(rnorm(100*zz, mean = 0, sd = 1),
                 nrow = 100, ncol = zz, byrow = TRUE)
  labelsa <- as.factor(c(rep('A',50),rep('B',50)))
  colnames(test) <- paste('Y',seq(1,zz),sep='')
  row.names(test) <- paste('Z',seq(1,100),sep='')
  test2 <- data.frame(cbind(labelsa,test))
  test2$labelsa <- as.factor(test2$labelsa)
  ## 1. make overfitted model
  model <- glm(labelsa ~., data = test2, family = "binomial")
  ## 2. estimate optimism and correct
  d <- auc.adjust(test2, model, B=200)
  cc <- c(cc,d)
}
## print it
df <- data.frame(p=seq(2,100,1),optimism_corrected_boot_AUC=cc)
p2 <- ggplot(df, aes(x=p, y=optimism_corrected_boot_AUC)) +
geom_line() + ggtitle('glm - random data only gives positive result with optimism corrected bootstrap')
print(p2)
png('glmt_test_upto100.png', height = 15, width = 27, units = 'cm',
res = 900, type = 'cairo')
print(p2)
dev.off()

So there we have it, the method has a problem and should not be used with more than about 40 features. Unfortunately, this method is currently being used for datasets with more dimensions than this (40+), because people assume that a published method is a safe method; sadly, that is not how the world works, R readers. Remember, the system is corrupt, science and statistics are full of lies, and if in doubt, do your own tests on positive and negative controls.

This is with the 'glm' function. The number of random features added is on the x axis, and the corrected AUC is on the y axis.

glmt_test_upto100.png

What if you don’t believe it? I mean, this is a text book method. Well, R readers, if that is so, I suggest you code it yourself or try the code here, and run the experiments on null (random data only) datasets with increasing numbers of features.

This is the last part in the series on debunking the optimism corrected bootstrap method. I consider it debunked.

Part 3: Two more implementations of optimism corrected bootstrapping show clear positive results bias

Previously we saw, with a reproducible code implementation, that optimism corrected bootstrapping is very biased when we have many features (50-100 or more). Just re-run the code yourself and make up your own mind if in doubt.

This time I have used a different person’s implementation, using the ‘glm’ function, i.e. logistic regression, to show that the same misleading trend occurs, i.e. positive results on purely random data. The code has been taken directly from http://cainarchaeology.weebly.com/r-function-for-optimism-adjusted-auc.html. The second implementation is from another unnamed statistician.

If you disagree with these findings, please show your own implementation of the method in R (not using Caret), following the same experiment of increasing the number of purely noise features to very high values with binary labels. Also, check the last two posts and re-run the code yourself before making your mind up and providing counter arguments.

There are no real predictors in this data; all of them are random draws from a normal distribution. The bias is less severe using glm than with glmnet.

You should be able to copy and paste this code directly into R to repeat the results. I can’t vouch for this code because I did not write it, but both implementations show the same result as my own code in the last blog post.

Implementation 1. A glm experiment.


### example of logistic regression optimism corrected bootstrapping
# source: http://cainarchaeology.weebly.com/r-function-for-optimism-adjusted-auc.html

library(pROC)   # for roc()
library(kimisc) # for sample.rows()

auc.adjust <- function(data, fit, B){
  fit.model <- fit
  data$pred.prob <- fitted(fit.model)
  auc.app <- roc(data[,1], data$pred.prob, data=data)$auc # require 'pROC'
  auc.boot <- vector(mode = "numeric", length = B)
  auc.orig <- vector(mode = "numeric", length = B)
  o <- vector(mode = "numeric", length = B)
  for(i in 1:B){
    boot.sample <- sample.rows(data, nrow(data), replace=TRUE) # require 'kimisc'
    fit.boot <- glm(formula(fit.model), data = boot.sample, family = "binomial")
    boot.sample$pred.prob <- fitted(fit.boot)
    auc.boot[i] <- roc(boot.sample[,1], boot.sample$pred.prob, data=boot.sample)$auc
    data$pred.prob.back <- predict.glm(fit.boot, newdata=data, type="response")
    auc.orig[i] <- roc(data[,1], data$pred.prob.back, data=data)$auc
    o[i] <- auc.boot[i] - auc.orig[i]
  }
  auc.adj <- auc.app - (sum(o)/B)
  boxplot(auc.boot, auc.orig, names=c("auc.boot", "auc.orig"))
  title(main=paste("Optimism-adjusted AUC", "\nn of bootstrap resamples:", B),
        sub=paste("auc.app (blue line)=", round(auc.app, digits=4),
                  "\nadj.auc (red line)=", round(auc.adj, digits=4)), cex.sub=0.8)
  abline(h=auc.app, col="blue", lty=2)
  abline(h=auc.adj, col="red", lty=3)
}

## generate random data

xx.glmnet <- matrix(nrow=53,ncol=1)
xx.glmnet <- data.frame(xx.glmnet)
xx.glmnet[,1] <- c(rep(0,25),rep(1,28))
test <- matrix(rnorm(53*1000, mean = 0, sd = 1),
nrow = 53, ncol = 1000, byrow = TRUE)
xx.glmnet <- cbind(xx.glmnet,test)
colnames(xx.glmnet) <- c('outcome',paste('Y',seq(1,1000),sep=''))

## 1. make overfitted model

model <- glm(outcome ~., data = xx.glmnet, family = "binomial")

## 2. estimate optimism and correct

auc.adjust(xx.glmnet, model, B=200)

Let's have a look at the results, which agree nicely with our previous findings using a from scratch implementation of the method. The red line is supposedly our corrected AUC, but the AUC should be 0.5 when running on random data. See the previous part 1 and part 2 posts for demonstrations of cross validation results on random data, which give the correct result.

Rplot

Implementation 2: Another glmnet experiment

Edit: It has come to my attention that the implementation below suffers from numerical instability, possibly due to the cv.glmnet function giving unstable results. Sometimes we see the bias, sometimes not. This is not the best example.
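If you want to make runs of the code below more comparable, one option (my suggestion, not part of the original code) is to fix the folds that cv.glmnet uses via its foldid argument, so that repeated calls select the same lambda:

## sketch: fix the cross validation folds so repeated cv.glmnet calls give the same lambda
library(glmnet)
set.seed(123)
x <- matrix(rnorm(53 * 100), nrow = 53)            # toy random predictors
y <- factor(c(rep(0, 25), rep(1, 28)))             # toy binary outcome
foldid <- sample(rep(1:10, length.out = nrow(x)))  # fixed fold assignment
fit1 <- cv.glmnet(x, y, alpha = 1, family = "binomial", foldid = foldid)
fit2 <- cv.glmnet(x, y, alpha = 1, family = "binomial", foldid = foldid)
identical(fit1$lambda.min, fit2$lambda.min)        # TRUE, the lambda choice is now stable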


### example of optimism corrected bootstrapping implementation
# source: an unnamed programmer

library(boot)
library(pROC)
library(glmnet)

#glmnet -all predictors penalised
compare_opt_glmnet <- function(orig_data_glmnet, i){
  # orig_data_glmnet = the whole data
  train_data_glmnet <- orig_data_glmnet[i, ] # get bootstrap
  model_glmnet <- model_process_glmnet(train_data_glmnet)
  # return a summary optimism results
  AUC_results_glmnet <- c(
    # boot against original
    AUC_train1 = roc(orig_data_glmnet$outcome,
                     as.vector(predict(model_glmnet, type="response",
                                       newx = model.matrix(outcome~., orig_data_glmnet)[,-1],
                                       s = 'lambda.min')))$auc,
    # boot against boot
    AUC_train2 = roc(train_data_glmnet$outcome,
                     as.vector(predict(model_glmnet, type="response",
                                       newx = model.matrix(outcome~., train_data_glmnet)[,-1],
                                       s = 'lambda.min')))$auc,
    AUC_diff = roc(train_data_glmnet$outcome,
                   as.vector(predict(model_glmnet, type="response",
                                     newx = model.matrix(outcome~., train_data_glmnet)[,-1],
                                     s = 'lambda.min')))$auc -
               roc(orig_data_glmnet$outcome,
                   as.vector(predict(model_glmnet, type="response",
                                     newx = model.matrix(outcome~., orig_data_glmnet)[,-1],
                                     s = 'lambda.min')))$auc
  )
  return(AUC_results_glmnet)
}

model_process_glmnet <- function(the_data_glmnet){
  model_final_glmnet <- cv.glmnet(model.matrix(outcome~., the_data_glmnet)[,-1],
                                  the_data_glmnet$outcome, alpha=1, family = "binomial")
  return(model_final_glmnet)
}

# generate random data

test <- matrix(rnorm(53*1000, mean = 0, sd = 1),
nrow = 53, ncol = 1000, byrow = TRUE)
test <- data.frame(test)
labs <- factor(c(0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,1,1,1,0,1,1,0,1,0,1,0,0,0,0,0,
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0))
labs <- data.frame(as.matrix(labs))
colnames(labs)[1] <- 'outcome'
xx.glmnet <- cbind(labs,test)

## 1. make overfitted model

tempd <- xx.glmnet[,2:ncol(xx.glmnet)]
labels <- xx.glmnet[,1]
overfittedmodel <- cv.glmnet(as.matrix(tempd),y=labels,alpha=1,family="binomial")
lasso.prob <- predict(overfittedmodel,type="response",newx=as.matrix(tempd),s='lambda.min')
auc <- roc(labels,as.vector(lasso.prob))
overfitted_auc <- as.numeric(auc$auc)

## 2. try to correct for optimism

Repeats = 100
res_opt_glmnet <- boot(xx.glmnet, statistic = compare_opt_glmnet, R = Repeats)
optimism_glmnet <- mean(res_opt_glmnet$t[ , 3])
enhanced_glmnet <- overfitted_auc - optimism_glmnet

Here are the results of the above code on random data only.

> enhanced_glmnet
[1] 0.8916488

If in doubt, compare your results with LOOCV and k fold cross validation; this is how I discovered this method is dodgy, by seeing substantial differences on null datasets where no real predictive ability should be found. These two methods are more widely used and reliable.
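For completeness, here is a minimal sketch of the kind of sanity check I mean, running k fold cross validation and LOOCV in Caret on the same sort of null data used above (the data dimensions here are illustrative, and LOOCV will be slow):

## sketch: k fold CV and LOOCV on null data, both should give an AUC of roughly 0.5
library(caret)
set.seed(123)
null_data <- data.frame(outcome = factor(rep(c('A', 'B'), each = 25)),
                        matrix(rnorm(50 * 100), nrow = 50))  # 100 noise features, 50 samples
ctrl_cv  <- trainControl(method = 'cv', number = 10,
                         summaryFunction = twoClassSummary, classProbs = TRUE)
ctrl_loo <- trainControl(method = 'LOOCV',
                         summaryFunction = twoClassSummary, classProbs = TRUE)
fit_cv  <- train(outcome ~ ., data = null_data, method = 'glmnet',
                 trControl = ctrl_cv, metric = 'ROC')
fit_loo <- train(outcome ~ ., data = null_data, method = 'glmnet',
                 trControl = ctrl_loo, metric = 'ROC')
max(fit_cv$results$ROC)   # roughly 0.5 on null data
max(fit_loo$results$ROC)  # roughly 0.5 on null data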

Sorry @ Frank Harrell, I don’t have time atm to do the extra bits you suggest.
Game on @ Davis Vaughan. Do your own implementation with the same data as I have used with glmnet and glm.

The steps of this method, briefly, are as follows:

  1. Fit a model M to entire data S and estimate predictive ability C.
  2. Iterate from b=1…B:
    1. Take a resample from the original data, S*
    2. Fit the bootstrap model M* to S* and get predictive ability, C_boot
    3. Use the bootstrap model M* to get predictive ability on S, C_orig
  3. Optimism O is calculated as mean(C_boot – C_orig)
  4. Calculate optimism corrected performance as C-O.

Part 2: Optimism corrected bootstrapping is definitely biased, further evidence

Some people are very fond of the technique known as ‘optimism corrected bootstrapping’; however, this method is biased, and this becomes apparent as we increase the number of noise features to high numbers (as shown very clearly in my previous blog post). This needs exposing; I have neither the time nor the interest to do a publication on this, hence this article. Now I have reproduced the bias with my own code.

Let’s first show the plot from last time again to recap: as the number of features (that are just noise) increases, this method gives a strongly positive AUC compared with cross validation. The response variable is random and the explanatory variables are random, so we should not get a positive result. Now we are going to implement the method from scratch, instead of using Caret, and show the same thing.

bias_in_optimism_corrected_bootstrapping

Perhaps in lower dimensions the method is not so bad, but I suggest avoiding it in all cases. I consider it something of a hack to get statistical power, regardless of where it is published and by whom. Why? Because I have tested it myself. I am sure we could find some simulations that support its use somewhere, but LOOCV and K-fold cross validation do not have this problem. This method is infecting people’s work with a positive results bias in high dimensions, and possibly lower dimensions as well.

After reading the last post, Frank Harrell decided my results were ‘erroneous’. I doubted this was the case, so now we have more evidence to expose the truth. This simple statistical method is described in a text book and a blog post (below), for those who are interested:

Steyerberg, Ewout W. Clinical prediction models: a practical approach to development, validation, and updating. Springer Science & Business Media, 2008.

http://thestatsgeek.com/2014/10/04/adjusting-for-optimismoverfitting-in-measures-of-predictive-ability-using-bootstrapping/

The steps of this method, briefly, are as follows:

  1. Fit a model M to entire data S and estimate predictive ability C.
  2. Iterate from b=1…B:
    1. Take a resample from the original data, let this be S*
    2. Fit the bootstrap model M* to S* and get predictive ability, C_boot
    3. Use the bootstrap model M* to get predictive ability on S, C_orig
  3. Optimism O is calculated as mean(C_boot – C_orig)
  4. Calculate optimism corrected performance as C-O.

This is a general procedure (Steyerberg, 2008), so I can use glmnet to fit the model and AUC as my measure of predictive performance. Glmnet is basically logistic regression with built-in feature selection, so if this method for estimating ‘optimism’ is a good one, we should be able to estimate the optimism from overfitting this model using the above procedure, and then correct for it.

Let’s re-run the analysis with my implementation of this method and compare with Caret (which is great software). Caret agrees with my implementation as shown below in fully reproducible code.

This code should be fine to copy and paste into R. The data is random noise, yet optimism corrected bootstrapping gives a strongly positive AUC of 0.86, way above the 0.5 we would expect.

In contrast, below I give code for Caret cross validation, which gives the correct result; just test it for yourself instead of believing statistical dogma.

> corrected
[1] 0.858264

##
library(glmnet)
library(pROC)
library(caret)

## set up test data
test <- matrix(rnorm(100*1000, mean = 0, sd = 1),
nrow = 100, ncol = 1000, byrow = TRUE)
labelsa <- as.factor(c(rep('A',50),rep('B',50)))
colnames(test) <- paste('Y',seq(1,1000),sep='')
row.names(test) <- paste('Z',seq(1,100),sep='')

### my own implementation of optimism corrected bootstrapping

## 1. fit model to entire test data (i.e. overfit it)

orig <- glmnet(test,y=labelsa,alpha=1,family = "binomial")
preds <- predict(orig,newx=test,type='response',s=0.01)
auc <- roc(labelsa,as.vector(preds))
original_auc <- as.numeric(auc$auc)

## 2. take resample of data and try to estimate optimism

test2 <- cbind(test,labelsa)
B <- 50
results <- matrix(ncol=2,nrow=B)
for (b in seq(1,B)){
  ## get the bootstrapped resampled data
  boot <- test2[sample(row.names(test2),50),]
  labelsb <- boot[,ncol(boot)]
  boot <- boot[,-ncol(boot)]
  ## use the bootstrapped model to predict its own labels
  bootf <- glmnet(boot,y=labelsb,alpha=1,family = "binomial")
  preds <- predict(bootf,newx=boot,type='response',s=0.01)
  auc <- roc(labelsb,as.vector(preds))
  boot_auc <- as.numeric(auc$auc)
  ## use bootstrap model to predict original labels
  preds <- predict(bootf,newx=test,type='response',s=0.01)
  auc <- roc(labelsa,as.vector(preds))
  boot_original_auc <- as.numeric(auc$auc)
  ## add the data to results
  results[b,1] <- boot_auc
  results[b,2] <- boot_original_auc
}
## calculate the optimism
O <- mean(results[,1]-results[,2])
## calculate optimism corrected measure of prediction (AUC)
corrected <- original_auc-O

### caret implementation of optimism corrected bootstrapping

test3 <- data.frame(test2)
test3$labelsa <- as.factor(test3$labelsa)
test3$labelsa <- make.names(test3$labelsa)

ctrl <- trainControl(method = 'optimism_boot',
summaryFunction=twoClassSummary,
classProbs=T,
savePredictions = T,
verboseIter = T)
fit1 <- train(as.formula( paste( 'labelsa', '~', '.' ) ), data=test3,
method="glmnet", # preProc=c("center", "scale")
trControl=ctrl, metric = "ROC")

### sanity check, using cv

ctrl <- trainControl(method = 'cv',
summaryFunction=twoClassSummary,
classProbs=T,
savePredictions = T,
verboseIter = T)
fit2 <- train(as.formula( paste( 'labelsa', '~', '.' ) ), data=test3,
method="glmnet", # preProc=c("center", "scale")
trControl=ctrl, metric = "ROC")

Here is a plot repeating the Caret results, made with a simple loop that iteratively adds noise only features and calculates the corrected AUC at each iteration, but this time using my own implementation instead of Caret. You can try this yourself by wrapping the code above in a FOR loop that collects the 'corrected' AUC. You can see the same misleading trend as in the last post.

bias_in_optimism_corrected_bootstrapping2.png

(X axis = number of noise only features added to random labels)
(Y axis = ‘optimism corrected bootstrap AUC’)

People are currently using this method because it is published, and they somehow think published methods cannot have fundamental problems, which is a widespread misconception.


Optimism corrected bootstrapping: a problematic method

There are lots of ways to assess how predictive a model is while correcting for overfitting. In Caret the main methods I use are leave one out cross validation, when we have relatively few samples, and k fold cross validation when we have more. There is also another method called ‘optimism corrected bootstrapping’, which attempts to save statistical power by first getting the overfitted result, then trying to correct it by bootstrapping the data to estimate the degree of optimism. A few simple tests in Caret can demonstrate this method is bunk.

The test is very straightforward: just add random variables from a normal distribution to the ground truth iris labels. We should find our AUC (area under the ROC curve) is about 0.5. Yet optimism corrected bootstrapping gives a positive result regardless of whether the predictors are just noise or not. Let’s just run that test:

This is called a sensitivity analysis, for the uninitiated: I am just increasing the number of random noise features (z) and binding them to the real labels in an iterative manner.

library(caret)
library(ggplot2)
allresults <- matrix(ncol=2,nrow=200)
i = 0
for (z in seq(10,2000,10)){

  i = i + 1

  # select only two species
  iris <- subset(iris, iris$Species == 'versicolor' | iris$Species == 'virginica')
  iris$Species <- droplevels(iris$Species)
  # generate random data
  test <- matrix(rnorm(100*z, mean = 0, sd = 1),
                 nrow = 100, ncol = z, byrow = TRUE)
  # bind random data
  iris <- cbind(iris,test)
  # remove real data
  iris <- iris[,-1:-4]

  # cross validation
  ctrl <- trainControl(method = 'cv',
                       summaryFunction=twoClassSummary,
                       classProbs=T,
                       savePredictions = T,
                       verboseIter = T)
  fit3 <- train(as.formula( paste( 'Species', '~', '.' ) ), data=iris,
                method="glmnet", # preProc=c("center", "scale")
                trControl=ctrl, metric = "ROC") #
  allresults[i,1] <- max(fit3$results$ROC)

  # optimism corrected bootstrapping
  ctrl <- trainControl(method = 'optimism_boot',
                       summaryFunction=twoClassSummary,
                       classProbs=T,
                       savePredictions = T,
                       verboseIter = T)
  fit4 <- train(as.formula( paste( 'Species', '~', '.' ) ), data=iris,
                method="glmnet", # preProc=c("center", "scale")
                trControl=ctrl, metric = "ROC") #
  allresults[i,2] <- max(fit4$results$ROC)

  rm(iris)
}

df <- data.frame(allresults)
colnames(df) <- c('cross_validation','optimism_corrected_boot')
df2 <- reshape2::melt(df)
df2$N <- c(seq(10,2000,10),seq(10,2000,10))

# do the plot
p1 <- ggplot(df2, aes(x=N, y=value, group=variable)) +
  geom_line(aes(colour=variable))
png('bias_in_optimism_corrected_bootstrapping.png', height = 15, width = 27, units = 'cm',
    res = 900, type = 'cairo')
print(p1)
dev.off()

So there we have it, a method that is being used in publications as we speak, yet it is horribly biased. This is science for you, full of misinformation and lies. Be warned, there is a darkness out there, R readers, not just in Washington.

bias_in_optimism_corrected_bootstrapping.png

N (x axis) refers to the number of features here (all of them just noise). The AUC (y axis) should be about 0.5 when there is no real predictive power.

We can see that cross validation performs as expected, although it is a bit unstable; it would be better to use repeated cross validation, I think.
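For reference, repeated cross validation is just another resampling method in Caret's trainControl (a minimal sketch; 10 folds and 5 repeats are arbitrary choices):

## sketch: repeated k fold cross validation in Caret
library(caret)
ctrl <- trainControl(method = 'repeatedcv', number = 10, repeats = 5,
                     summaryFunction = twoClassSummary, classProbs = TRUE,
                     savePredictions = TRUE)
## then pass ctrl to train() exactly as in the code above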

My recommendations for Rbloggers readers who don’t know much about supervised machine learning, but want to use it on their data in R:

  • Don’t start by implementing cross validation loops yourself and calculating AUC etc; Caret makes life a lot easier and it is bug tested by thousands of people around the world, like myself.
  • Stick with tried and tested mainstream approaches when in doubt, like LOOCV or repeated k fold CV.
  • If implementing your own method, or if you are skeptical of the reliability of the method you are using, use rnorm to perform a negative control to test your statistical method for bias. This is a good check to perform, as is a positive control (see the sketch after this list).
  • Make sure your data is in the correct order.
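As a concrete example of the negative and positive controls mentioned in the third point (a minimal sketch using a single feature and AUC; the planted effect size of 2 is arbitrary):

## sketch: negative and positive controls for an AUC based analysis
library(pROC)
set.seed(123)
y <- factor(rep(c('A', 'B'), each = 50))
noise  <- rnorm(100)                           # negative control: pure noise feature
signal <- rnorm(100) + ifelse(y == 'B', 2, 0)  # positive control: feature with a planted effect
as.numeric(roc(y, noise, levels = c('A', 'B'), direction = '<')$auc)   # about 0.5, as it should be
as.numeric(roc(y, signal, levels = c('A', 'B'), direction = '<')$auc)  # well above 0.5, a real signal is detected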


Simulating NXN dimensional Gaussian clusters in R

Gaussian clusters are found in a range of fields, and simulating them is important because we often want to test a given class discovery tool’s performance under conditions where the ground truth is known (e.g. K=6). However, a flexible simulator for Gaussian clusters with defined variance, spacing, and size did not exist. This is why we made ‘clusterlab’, a new CRAN package (https://cran.r-project.org/web/packages/clusterlab/index.html). It was initially based on the Scikit-learn make_blobs function, but is now much more sophisticated.

Clusterlab works in 2D space initially, because this is easy to work in mathematically. First, single 2D coordinates are placed on the perimeter of a circle by the algorithm, then they may be ‘pushed’ outwards through multiplication with scalars. Multiple rings of coordinates may be generated, which allows the user to create complex patterns through rotations of the individual rings. By this point, we have the cluster center coordinates created in 2D space. Next, clusters are formed by adding Gaussian noise to the cluster centers and concatenating the new sample coordinates to our set of coordinates. Finally, to create high dimensional datasets, these 2D coordinates are treated as principal component scores and projected into higher dimensions using fixed eigenvectors that are themselves simulated from a Gaussian distribution.
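The core of this procedure can be sketched in a few lines of base R. This is a simplified illustration of the steps described above, not the clusterlab source code; the radius, noise level, and number of features are arbitrary choices.

## sketch: cluster centers on a circle, Gaussian noise, then projection into high dimensions
set.seed(123)
k <- 6; n_per <- 50; radius <- 8; p <- 500
theta   <- seq(0, 2 * pi, length.out = k + 1)[1:k]          # equally spaced angles
centers <- cbind(radius * cos(theta), radius * sin(theta))  # 2D cluster center coordinates
scores  <- do.call(rbind, lapply(1:k, function(i)
  cbind(rnorm(n_per, centers[i, 1]), rnorm(n_per, centers[i, 2]))))  # add Gaussian noise per cluster
eig     <- matrix(rnorm(2 * p), nrow = 2)                   # fixed random 'eigenvectors'
synthetic <- scores %*% eig                                 # treat the 2D coordinates as PC scores
plot(scores, col = rep(1:k, each = n_per), pch = 19,
     xlab = 'PC1', ylab = 'PC2', main = 'simulated 2D cluster scores')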

Let’s take a look at generating a six cluster dataset with clusterlab. The plots show the data in 2D principal component space.

library(clusterlab)
synthetic <- clusterlab(centers=6,centralcluster=TRUE)

sixclustersolution.png

Next, we will change the variance of the clusters using the ‘sdvec’ parameter, which allows the user to set the variance of each cluster individually.

synthetic <- clusterlab(centers=6,sdvec=c(0.5,1,1.5,2,2,2))

sixclustersolution_sd_demo.png

We can also modify the ‘alphas’ parameter to push each cluster away from its central coordinate by a specified degree (a scalar), as follows:

synthetic <- clusterlab(centers=8,alphas=c(0,1,2,3,4,5,6,7))

eightclustersolution_alphas_demo.png

Let’s also try simulating a more complex multi-ringed structure we talked about earlier. In this code we set the ‘ringthetas’ parameter to rotate the individual rings by a specified number of degrees (theta).

synthetic <- clusterlab(centers=5,r=7,sdvec=c(6,6,6,6,6),
alphas=c(2,2,2,2,2),centralcluster=FALSE,
numbervec=c(50,50,50,50),rings=5,ringalphas=c(2,4,6,8,10,12),
ringthetas = c(30,90,180,0,0,0), seed=123)

multiringedsolution.png

As a last demonstration, we will find the optimal K of the six cluster solution using the Bioconductor M3C package (https://www.bioconductor.org/packages/devel/bioc/html/M3C.html), which is a consensus clustering algorithm I use. We are going to use a smaller test dataset because I am on my laptop, which is quite old and does not have many cores, and we are going to set the Monte Carlo iterations parameter to 50 and use Ward’s hierarchical clustering, because that should speed things up.

synthetic <- clusterlab(centers=6,centralcluster=TRUE,features=1000,
numbervec=c(30,30,30,30,30,30))
library(M3C)
d <- synthetic$synthetic_data
res <- M3C(d,iters=50,printres = TRUE,clusteralg = 'hc')

RCSI.png

So as there is a maximal Relative Cluster Stability Index (RCSI) for K=6, M3C confirms our simulation. I think this is pretty neat. The clusterlab and M3C methods are described in our bioRxiv preprint (https://www.biorxiv.org/content/early/2018/07/25/377002).

How to perform consensus clustering without overestimation of K and reject the null hypothesis

The Monti consensus clustering algorithm is one of the most widely used class discovery techniques in the genome sciences and is commonly used to cluster transcriptomic, epigenetic, proteomic, and a range of other types of data. It can automatically decide the number of classes (K) by resampling the data and, for each K (e.g. 2-10), calculating a consensus rate of how frequently each pair of samples is clustered together by a given clustering algorithm, e.g. PAM. These consensus rates form a consensus matrix. A perfect consensus matrix for any given K would consist only of 0s and 1s, because all samples either always or never cluster together over all resampling iterations, whereas values around 0.5 indicate the clustering is less clear. However, as Șenbabaoğlu et al. recently pointed out, consensus clustering is subject to false discoveries, like many clustering algorithms, because the null hypothesis K=1 (no structure) is usually not tested. We also found that the consensus clustering algorithm overestimates K, which is why we developed M3C, inspired by the GAP-statistic (https://www.biorxiv.org/content/10.1101/377002v1).
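To make the consensus matrix idea concrete, here is a minimal sketch of the resampling step for a single K, using PAM on a matrix with samples in rows (for the microarray data below you would transpose first). This illustrates the general procedure only, it is not the M3C source code, and the subsample fraction and iteration count are arbitrary.

## sketch: consensus matrix for a single K via subsampling and PAM (samples in rows)
library(cluster)
consensus_matrix <- function(data, K, iters = 100, frac = 0.8){
  n <- nrow(data)
  together <- matrix(0, n, n)   # times each pair of samples clustered together
  sampled  <- matrix(0, n, n)   # times each pair of samples was resampled together
  for (i in seq_len(iters)){
    idx  <- sample(n, floor(frac * n))                    # resample the samples
    cl   <- pam(data[idx, ], k = K, cluster.only = TRUE)  # cluster the resample
    same <- outer(cl, cl, '==') * 1                       # 1 if a pair fell in the same cluster
    sampled[idx, idx]  <- sampled[idx, idx] + 1
    together[idx, idx] <- together[idx, idx] + same
  }
  together / pmax(sampled, 1)   # consensus rates: values near 0 or 1 indicate stable clustering
}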

M3C (https://bioconductor.org/packages/release/bioc/html/M3C.html) uses a multicore Monte Carlo simulation, which maintains the correlation structure of the features of the input data, to generate multivariate Gaussian references for consensus clustering and eliminate this intrinsic bias. We use the Relative Cluster Stability Index (RCSI), which is calculated relative to the reference, and p values to choose K and reject the null hypothesis K=1. The package also contains a variety of other functions and features for investigating data, such as PCA, t-SNE, automatic clinical data analysis, and a selection of clustering algorithms we have tested, such as PAM, k-means, self-tuning spectral clustering, and Ward’s hierarchical clustering.
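As an illustration of the reference idea only (this is not the M3C source code), one way to simulate a Gaussian reference that respects the correlation structure of the features is to simulate principal component scores with the observed variances and project them back into feature space, in the spirit of the PCA-based reference used by the GAP statistic:

## sketch: simulate a Gaussian reference dataset with a similar feature correlation structure
simulate_reference <- function(data){
  # data: samples in rows, features in columns
  pc     <- prcomp(data, center = TRUE, scale. = FALSE)
  scores <- apply(pc$x, 2, function(s) rnorm(length(s), mean = 0, sd = sd(s)))  # simulated PC scores
  ref    <- scores %*% t(pc$rotation)  # project the simulated scores back into feature space
  sweep(ref, 2, pc$center, '+')        # restore the original feature means
}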

Let’s take a look at a demonstration and show how it removes the limitations of the method. This is with the development version of M3C, which comes with a glioblastoma cancer dataset with tumor classes (from histology) labeled in a description object.

library(M3C)
dim(mydata) # we have 1740 features and 50 samples (microarray data)
[1] 1740   50
head(desx) # we have an annotation file with tumour types labelled
class                  ID
TCGA.02.0266.01A.01         TCGA.02.0266.01A.01
TCGA.08.0353.01A.01   Proneural TCGA.08.0353.01A.01
TCGA.06.0175.01A.01 Mesenchymal TCGA.06.0175.01A.01
TCGA.08.0346.01A.01 Mesenchymal TCGA.08.0346.01A.01
TCGA.02.0114.01A.01   Proneural TCGA.02.0114.01A.01
TCGA.02.0048.01A.01   Proneural TCGA.02.0048.01A.01

So now that we know something about the data we are working with, we can go ahead and run M3C. This code just uses the default algorithm, PAM, and in this case we are setting the ‘doanalysis’ and ‘analysistype’ parameters to get M3C to analyse the clinical data we are inputting. Since the variable we want to test is categorical, we can do a chi-squared test. The ‘printres’ parameter will print all our results into the current directory.

res <- M3C(mydata,des=desx,cores=2,doanalysis=TRUE,analysistype='chi',
printres = TRUE)

This plot shows the PAC score, a measure of consensus matrix stability, for each value of K. A PAC score of 0 indicates perfect clustering. As we can see, there is a trend towards lower values of K, which is an internal bias. This is commonly observed on real datasets with the PAC score, although it performs better than the original delta K. Delta K is still commonly used to this day, despite performing poorly, simply because so many publications have used it in the past. The PAC score bias can be eliminated by using the RCSI, which takes the reference into account.
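For reference, the PAC (proportion of ambiguous clustering) score is simply the fraction of consensus matrix entries that fall into an intermediate 'ambiguous' window. A minimal sketch, assuming the commonly used 0.1-0.9 window (M3C's internal settings may differ):

## sketch: proportion of ambiguous clustering (PAC) from a consensus matrix
pac_score <- function(cm, lower = 0.1, upper = 0.9){
  vals <- cm[lower.tri(cm)]                        # consensus rates for each unique pair of samples
  sum(vals > lower & vals < upper) / length(vals)  # fraction falling in the ambiguous middle range
}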

PACscore.png

We can see below that the RCSI removes the bias, and a maximum (which indicates the optimal K) is found at K=4.

RCSI.png

M3C also calculates empirical p values from the reference distributions to test the null hypothesis K=1. We can see below that for both K=4 and K=5 we can reject the null hypothesis K=1.

pscore.png

These are the results quantified in a table. The beta p value comes from using the N Monte Carlo simulations to estimate the parameters of a theoretical beta distribution, from which p values can be derived that may go lower than those obtainable from 100 Monte Carlo iterations alone. We chose a beta distribution because the reference distributions for K=2 often exhibit skew and kurtosis, and because the PAC score lies in the range 0-1, the same support as the beta distribution. The chi-squared test in the table below tells us that K=4 and K=5 are the two optimal values for the relationship between the clinical data and the discovered classes. There are some NA values due to incomplete data.
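A rough sketch of that calculation (my paraphrase using a method of moments fit, not the M3C source code; it assumes the variance of the reference PAC scores is below m(1-m)):

## sketch: beta distribution p value from Monte Carlo reference PAC scores
beta_p_value <- function(real_pac, ref_pacs){
  m <- mean(ref_pacs); v <- var(ref_pacs)
  a <- m * (m * (1 - m) / v - 1)        # method of moments estimate of the first shape parameter
  b <- (1 - m) * (m * (1 - m) / v - 1)  # and of the second shape parameter
  pbeta(real_pac, a, b)                 # probability of a reference PAC this low or lower
}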

We can see from the significant relationship of the clinical data, particularly with the M3C optimal K, that the framework works.

res$scores
K  PAC_REAL   PAC_REF         RCSI MONTECARLO_P     BETA_P   P_SCORE
1  2 0.6391837 0.5722939 -0.110539221   0.69306931 0.65102440 0.1864027
2  3 0.4326531 0.5602776  0.258496125   0.09900990 0.09587818 1.0182802
3  4 0.3273469 0.4847918  0.392699013   0.02970297 0.01060788 1.9743716
4  5 0.3208163 0.4149796  0.257360575   0.04950495 0.02856676 1.5441390
5  6 0.2955102 0.3602776  0.198171342   0.05940594 0.05211127 1.2830683
6  7 0.2800000 0.3150857  0.118055107   0.15841584 0.16558430 0.7809809
7  8 0.2677551 0.2796735  0.043549978   0.30693069 0.35994732 0.4437610
8  9 0.2563265 0.2523020 -0.015825197   0.55445545 0.56868102 0.2451313
9 10 0.2302041 0.2282449 -0.008547061   0.49504950 0.54129058 0.2665695

> res$clinicalres
K          chi
1  2 8.977979e-07
2  3 2.482780e-11
3  4 1.806709e-13
4  5 7.124742e-13
5  6 1.294080e-11
6  7          NaN
7  8          NaN
8  9          NaN
9 10          NaN

Summary

We have seen in this blog post how the M3C package provides a powerful approach to identifying classes using consensus clustering in high dimensional continuous data. I suggest trying the approach and comparing it with others for yourself. We have also included a simple data simulator in the package to encourage user testing of the class discovery methods used in cancer research and stratified medicine. A more complex data simulator, ‘clusterlab’, has been made in parallel with M3C and is available on CRAN (https://cran.r-project.org/web/packages/clusterlab/index.html). It offers precise control over variance, spacing, outliers, and cluster size.