Part 4: Why does bias occur in optimism corrected bootstrapping?

In the previous parts of the series we demonstrated a positive results bias in optimism corrected bootstrapping by simply adding random features to our labels. This problem is due to an ‘information leak’ in the algorithm, meaning the training and test datasets are not kept seperate when estimating the optimism. Due to this, the optimism, under some conditions, can be very under estimated. Let’s analyse the code, it is pretty straightforward to understand then we can see where the problem originates.

  1. Fit a model M to entire data S and estimate predictive ability C. ## this part is our overfitted estimation of performance (can be AUC, accuracy, etc)
  2. Iterate from b=1…B: ## now we are doing resampling of the data to estimate the error
    1. Take a resample from the original data, S*
    2. Fit the bootstrap model M* to S* and get predictive ability, C_boot ## this will clearly give us another overfitted model performance, which is fine
    3. Use the bootstrap model M* to get predictive ability on S, C_orig ## we are using the same data (samples) used to train the model to test it, therefore it is not surprising that we have values with better performance than expected. C_orig values will be too high.
  3. Optimism O is calculated as mean(C_boot – C_orig)
  4. Calculate optimism corrected performance as C-O.

One way of correcting for this would be changing step 3 of the bootstrap, instead of testing on the original data, to test on the left out (unseen) data in the bootstrap. This way the training and test data is kept entirely seperate in terms of samples, thus eliminating our bias towards inflated model performance on null datasets with high features. There is probably no point of doing anyway this when we have methods such as LOOCV and K fold cross validation.

As p (features) >> N (samples) we are going to get better and better ability to get good model performance using the bootstrapped data on the original data. Why? Because the original data contains the same samples as the bootstrap and as we get more features, greater the chance we are going to get some randomly correlating with our response variable. When we test the bootstrap on the original data (plus more samples) it retains some of this random ability to predict the real labels. This is a typical overfitting problem when we have higher numbers of features, and the procedure is faulty.

Let’s take another experimental look at the problem, this code can be directly copied and pasted into R for repeating the analyses and plots. We have two implementations of the method, the first by me for glmnet (lasso logisitic regression), the second for glm (logisitic regression) from this website ( Feel free to try different machine learning algorithms and play with the parameters.


### TEST 1: bootstrap optimism with glmnet
cc <- c()
for (zz in seq(2,100,1)){
## set up test data
test <- matrix(rnorm(100*zz, mean = 0, sd = 1),
nrow = 100, ncol = zz, byrow = TRUE)
labelsa <- as.factor(c(rep('A',50),rep('B',50)))
colnames(test) <- paste('Y',seq(1,zz),sep='')
row.names(test) <- paste('Z',seq(1,100),sep='')

### my own implementation of optimism corrected bootstrapping

## 1. fit model to entire test data (i.e. overfit it)

orig <- glmnet(test,y=labelsa,alpha=1,family = "binomial")
preds <- predict(orig,newx=test,type='response',s=0.01)
auc <- roc(labelsa,as.vector(preds))
original_auc <- as.numeric(auc$auc)

## 2. take resample of data and try to estimate optimism

test2 <- cbind(test,labelsa)
B <- 50
results <- matrix(ncol=2,nrow=B)
for (b in seq(1,B)){
## get the bootstrapped resampled data
boot <- test2[sample(row.names(test2),50),]
labelsb <- boot[,ncol(boot)]
boot <- boot[,-ncol(boot)]
## use the bootstrapped model to predict its own labels
bootf <- glmnet(boot,y=labelsb,alpha=1,family = "binomial")
preds <- predict(bootf,newx=boot,type='response',s=0.01)
auc <- roc(labelsb,as.vector(preds))
boot_auc <- as.numeric(auc$auc)
## use bootstrap model to predict original labels
preds <- predict(bootf,newx=test,type='response',s=0.01)
auc <- roc(labelsa,as.vector(preds))
boot_original_auc <- as.numeric(auc$auc)
## add the data to results
results[b,1] <- boot_auc
results[b,2] <- boot_original_auc
## calculate the optimism
O <- mean(results[,1]-results[,2])
## calculate optimism corrected measure of prediction (AUC)
corrected <- original_auc-O
cc <- c(cc,corrected)
## print it
df <- data.frame(p=seq(2,100,1),optimism_corrected_boot_AUC=cc)
p1 <- ggplot(df, aes(x=p, y=optimism_corrected_boot_AUC)) +
geom_line() + ggtitle('glmnet - random data only gives positive result with optimism corrected bootstrap')
png('glmnet_test_upto100.png', height = 15, width = 27, units = 'cm',
res = 900, type = 'cairo')

So here are the results, as number of noise only features on the x axis increase our 'corrected' estimate of AUC (on y axis) also increases when we start getting enough to allow that noise to randomly predict the labels. So this shows the problem starts about 40-50 features, then gets worse until about 75+. This is with the 'glmnet' function.


Let’s look at the method using glm, we find the same trend, different implementation.

## TEST2
auc.adjust <- function(data, fit, B){
fit.model <- fit
data$pred.prob <- fitted(fit.model)
# get overfitted AUC <- roc(data[,1], data$pred.prob, data=data)$auc # require 'pROC'
auc.boot <- vector (mode = "numeric", length = B)
auc.orig <- vector (mode = "numeric", length = B)
o <- vector (mode = "numeric", length = B)
for(i in 1:B){
boot.sample <- sample.rows(data, nrow(data), replace=TRUE) # require 'kimisc'
fit.boot <- glm(formula(fit.model), data = boot.sample, family = "binomial")
boot.sample$pred.prob <- fitted(fit.boot)
# get bootstrapped AUC
auc.boot[i] <- roc(boot.sample[,1], boot.sample$pred.prob, data=boot.sample)$auc
# get original data boot AUC
data$pred.prob.back <- predict.glm(fit.boot, newdata=data, type="response")
auc.orig[i] <- roc(data[,1], data$pred.prob.back, data=data)$auc
# calculated optimism corrected version
o[i] <- auc.boot[i] - auc.orig[i]
auc.adj <- - (sum(o)/B)
cc <- c()
for (zz in seq(2,100,1)){
## set up test data
test <- matrix(rnorm(100*zz, mean = 0, sd = 1),
nrow = 100, ncol = zz, byrow = TRUE)
labelsa <- as.factor(c(rep('A',50),rep('B',50)))
colnames(test) <- paste('Y',seq(1,zz),sep='')
row.names(test) <- paste('Z',seq(1,100),sep='')
test2 <- data.frame(cbind(labelsa,test))
test2$labelsa <- as.factor(test2$labelsa)
## 1. make overfitted model
model <- glm(labelsa ~., data = test2, family = "binomial")
## 2. estimate optimism and correct
d <- auc.adjust(test2, model, B=200)
cc <- c(cc,d)
## print it
df <- data.frame(p=seq(2,100,1),optimism_corrected_boot_AUC=cc)
p2 <- ggplot(df, aes(x=p, y=optimism_corrected_boot_AUC)) +
geom_line() + ggtitle('glm - random data only gives positive result with optimism corrected bootstrap')
png('glmt_test_upto100.png', height = 15, width = 27, units = 'cm',
res = 900, type = 'cairo')

So there we have it, the method has a problem with it and should not be used with greater than about 40 features. This method unfortunately is currently being used for datasets with higher than this number of dimensions (40+) because people think it is a published method it is safe, unfortunately this is not how the world works R readers. Remember, the system is corrupt, science and statistics is full of lies, and if in doubt, do your own tests on positive and negative controls.

This is with the 'glm' function. Random features added on x axis, the corrected AUC on y axis.


What if you don’t believe it? I mean this is a text book method. Well R readers if that is so I suggest code it your self and try the code here, run the experiments on null (random data only) datasets with increasing features.

This is the last part in the series on debunking the optimism corrected bootstrap method. I consider it, debunked.

Posted in R

12 thoughts on “Part 4: Why does bias occur in optimism corrected bootstrapping?

    • chris2016

      The second part is someone else’s code, but replace is set to TRUE not FALSE:
      boot.sample <- sample.rows(data, nrow(data), replace=TRUE)
      I can't see what is wrong with that?
      This is the correct implementation I think, as described here: Smith, Gordon CS, et al. “Correcting for optimistic prediction in small data sets.” American journal of epidemiology 180.3 (2014): 318-324.
      As the Harrell method.

      Although my code isn't with replacement your right, so it is a bit wrong 😛
      Another issue is as far as I can see standard method is to not reduce N when doing the bootstrap.

      Regardless, of these points, the bias still is there… under these conditions.

  1. Alan Mainwaring

    About twenty years ago I was thrown into teaching classical statistics. Coming from a Pure Mathematics background, I was totally unsettled by this prospect. Why because mathematical proof of many of the techniques, terminology (eg Variate vs Random variable) was virtually impossible to find or in my case follow. So what do we do? teach the statistical techniques like we teach life scientists to use a microscope. We hope that the people who have developed the statistical packages and got it right and just use the statistical packages without much thought about how they work. When I heard about “boot strapping” I simply could not accept it. It seemed you were getting something for nothing I still don’t get it. Of all the disciplines I have studied mathematical statistics is the most difficult discipline to work in and proving the foundation concepts via pure mathematics is extremely difficult, I look back on the likes of R A Fisher and Gossett, how the hell they derived the t-distribution is beyond me. Even if we prove something mathematically the real test is does it work in the real world ?

    • ecoquant

      I have read that an indicator of an important theorem in Mathematics is that when you read it, before seeing the proof, you find it difficult to believe possible.

      While there are many examples, such as the Bayesian analysis of the Monte Hall Problem, or Stein’s Result (which sets m.l.e. theory on it’s head for dinensions 3 or greater) the recent such for me is the Johnson- Lindenstrauss Lemma

      The bootstrap is one of those, and while it won’t work for all estimators, particularly some data dependent ones, it works for an amazingly large set of estimators. There’s really no reason not to understand it as the number of good textbooks available on it is large. There are even great and deep theoretical treatments, like Hall’s Bootstrap and Edgeworth Expansion and many practical ones, like that by Davison and Hinkley and the one by Chernick and the one by Lahiri.

      I daresay someone who hadn’t mastered the bootstrap along with Monte Carlo methods like MCMC or importance sampling or slice sampling or stochastic search methods or the LASSO, does not know modern statistics, especially computational ones. And of course as in all numerical work, e.g., numerical linear algebra, using these wuthout understanding theiry is asking for trouble, even if that’s not sufficient.

      Classical frequentist techniques turn out to be not so great when seen through the lens of big actual problems and many dimensions. One of the most disappointing, to me, is the Method of Naximum Likelihood.

  2. ecoquant

    Fix the above second paragraph to:

    “While there are many examples, such as the Bayesian analysis of the Monte Hall Problem, or Stein’s Result (which sets m.l.e. theory on it’s head for dinensions 3 or greater) the recent such for me is the Johnson- Lindenstrauss Lemma.”

    I continue to be challenged by trying to type on (even) a Google Pixel 2.

  3. Noman Dormosh

    I came across this post by chance when I wanted to explore the limitations of the use of bootstrapping for validation. I decided to use the code to reproduce the issue as you suggested.

    At first glance, the behavior of your second experiment with logistic regression (using glm package) did not surprise me as overfitting is a well-known problem especially when you add features. However, using regularized logistic regression (L1 regularization in your example or LASSO) should take care of the overfitting to some extent. I was wondering why did it show inflated optimism corrected AUC? I realized then that regularization was not performed in the correct way in your code. The optimal tuning parameter (Lambda) is usually chosen by cross-validation and by selecting lambda that gives the minimum mean squared error (MSE). Your code lacks this part of lambda tuning and you used consistently the penalty value of 0.01. The magnitude of the shrinkage of the features is highly dependent on the penalty value. As lambda gets larger, the bias is unchanged but the variance drops.
    Therefore, I changed your implementation of your first experiment (LASSO logistic regression) to determine this value using cross-validation, to get the correct estimations of optimism correct AUC.

    Here are the results:
    Your code (lambda=0.01): optimism corrected AUC = 0.714 (CI 95%; 0.692 – 0.738)
    10 x cross-validation: optimism corrected AUC = 0.446 (CI 95%; 0.418 – 0.474)
    3 x cross-validation: optimism corrected AUC = 0.427 (CI 95%; 0.405 – 0.449)

    When I manually selected an arbitrary higher lambda (let’s say s=0.07), then the optimism corrected AUC became 0.588 (CI 95%; 0.574 – 0.602).

    Obviously, the correct use of the regularization parameter did the magic and as you can see from the results, the optimism corrected AUC is not far away from 0.5!

    Lines of code to change:
    orig <- cv.glmnet(test,y=labelsa,alpha=1,family = "binomial", nfolds=3)
    preds <- predict(orig,newx=test,type='response',s="lambda.min")

    Inside the bootstrap loop:
    bootf <- cv.glmnet(boot,y=labelsb,alpha=1,family = "binomial", nfolds=3)
    preds <- predict(bootf,newx=boot,type='response',s="lambda.min")
    preds <- predict(bootf,newx=test,type='response',s="lambda.min")

    Kind Regards,

    • chris2016

      Hi Noman

      I don’t have time to carry on working on this at the moment. But,

      Have you seen the Caret implementation shows an identical problem? This is not a co incidence and their glmnet code is tuned over lambda and alpha.

      You have also seen the glm function with a standard logistic regression shows the same problem… and you imply this model will cause inflated AUC inherently with any cross validation method? No, actually if you run a standard cross validation loop using logistic regression or any machine learning model, and hide your test data in each iteration, this overestimation of the AUC problem does not occur, you can try it yourself when p >> s using noise only features. I suggest using Caret to standardise your comparisons with well tested code.

      This problem occurs because of not hiding the training data when estimating the optimism in this dodgy method, I have shown this clearly here:

      If you disagree, send me an email with the experiments using the Caret implementation and we can talk further.

      Best wishes,


      • chris2016

        Try this:

        allresults <- matrix(ncol=2,nrow=200)
        i = 0
        for (z in seq(200,1000,100)){

        i = i + 1

        # select only two species
        iris <- subset(iris, iris$Species == 'versicolor' | iris$Species == 'virginica')
        iris$Species <- droplevels(iris$Species)
        # generate random data
        test <- matrix(rnorm(100*z, mean = 0, sd = 1),
        nrow = 100, ncol = z, byrow = TRUE)
        # bind random data
        iris <- cbind(iris,test)
        # remove real data
        iris <- iris[,-1:-4]

        # cross validation
        ctrl <- trainControl(method = 'cv',
        savePredictions = T,
        verboseIter = T)
        fit3 <- train(as.formula( paste( 'Species', '~', '.' ) ), data=iris,
        method="glm", family="binomial", # preProc=c("center", "scale")
        trControl=ctrl, metric = "ROC") #
        allresults[i,1] <- max(fit3$results$ROC)

        # optimism corrected bootstrapping
        ctrl <- trainControl(method = 'optimism_boot',
        savePredictions = T,
        verboseIter = T)
        fit4 <- train(as.formula( paste( 'Species', '~', '.' ) ), data=iris,
        method="glm", family="binomial", # preProc=c("center", "scale")
        trControl=ctrl, metric = "ROC") #
        allresults[i,2] <- max(fit4$results$ROC)


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s