In the previous parts of this series we demonstrated a positive results bias in optimism corrected bootstrapping simply by adding random features to our labels. The problem is due to an ‘information leak’ in the algorithm: the training and test datasets are not kept separate when estimating the optimism, so under some conditions the optimism can be badly underestimated. Let’s analyse the code; it is straightforward to understand, and then we can see where the problem originates.

- Fit a model M to the entire data S and estimate predictive ability C.

**## this part is our overfitted estimation of performance (can be AUC, accuracy, etc)**

- Iterate from b=1…B:

**## now we are doing resampling of the data to estimate the error**

- Take a resample from the original data, S*
- Fit the bootstrap model M* to S* and get predictive ability, C_boot

**## this will clearly give us another overfitted model performance, which is fine**

- Use the bootstrap model M* to get predictive ability on S, C_orig

**## we are using the same data (samples) used to train the model to test it, therefore it is not surprising that the performance is better than expected. C_orig values will be too high.**

- Optimism O is calculated as mean(C_boot – C_orig)
- Calculate optimism corrected performance as C-O.

One way of correcting for this would be to change **step 3 of the bootstrap**: instead of testing on the original data, test on the left out (unseen) data in the bootstrap. This way the training and test data are kept entirely separate in terms of samples, eliminating our bias towards inflated model performance on null datasets with many features. There is probably no point in doing this anyway when we have methods such as LOOCV and K fold cross validation.
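A minimal sketch of that out-of-bag modification, using base R's glm and a hand-rolled rank-sum AUC (the helper auc_fun and all variable names here are mine, not from the code further down):

```r
# Hand-rolled AUC via the rank-sum identity (all names here are mine)
auc_fun <- function(labels, scores) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

set.seed(1)
n <- 100
p <- 10
X <- matrix(rnorm(n * p), n, p)          # pure noise features
y <- rep(0:1, each = n / 2)              # null labels
d <- data.frame(y = y, X)

B <- 50
oob_auc <- numeric(B)
for (b in 1:B) {
  idx <- sample(n, n, replace = TRUE)    # full-size bootstrap resample
  oob <- setdiff(seq_len(n), idx)        # left-out (unseen) samples only
  fit <- glm(y ~ ., data = d[idx, ], family = binomial)
  preds <- predict(fit, newdata = d[oob, ], type = "response")
  oob_auc[b] <- auc_fun(d$y[oob], preds)
}
mean(oob_auc)  # stays near 0.5 on null data, as it should
```

Because the out-of-bag samples never enter the bootstrap fit, the estimate cannot inherit the spurious correlations the model memorised.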

As p (features) >> N (samples), the bootstrapped model gets better and better at predicting the original data. Why? Because the original data contains the same samples as the bootstrap, and as we add more features, the greater the chance that some of them randomly correlate with our response variable. When we test the bootstrap model on the original data (plus extra samples) it retains some of this random ability to predict the real labels. This is a typical overfitting problem when we have high numbers of features, and the procedure is faulty.
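This effect is easy to demonstrate directly. A base-R sketch (variable names are mine) computing the largest absolute correlation between null labels and a growing pool of pure-noise features:

```r
set.seed(1)
n <- 100
y <- rep(0:1, each = n / 2)  # null labels, no real signal

# For each feature count p, find the best spurious correlation with y
max_cor <- sapply(c(5, 50, 500, 5000), function(p) {
  X <- matrix(rnorm(n * p), n, p)  # pure noise features
  max(abs(cor(X, y)))
})
round(max_cor, 2)  # the maximum spurious correlation grows with p
```

The more noise features we screen, the stronger the best spurious predictor looks, which is exactly what the bootstrap model exploits when it is tested on its own training samples.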

Let’s take another experimental look at the problem. This code can be directly copied and pasted into R to repeat the analyses and plots. We have two implementations of the method: the first, by me, for glmnet (lasso logistic regression); the second, for glm (logistic regression), from this website (http://cainarchaeology.weebly.com/r-function-for-optimism-adjusted-auc.html). Feel free to try different machine learning algorithms and play with the parameters.

library(glmnet)
library(pROC)
library(caret)
library(ggplot2)
library(kimisc)

### TEST 1: bootstrap optimism with glmnet
cc <- c()
for (zz in seq(2,100,1)){
  print(zz)
  ## set up test data
  test <- matrix(rnorm(100*zz, mean = 0, sd = 1),
                 nrow = 100, ncol = zz, byrow = TRUE)
  labelsa <- as.factor(c(rep('A',50),rep('B',50)))
  colnames(test) <- paste('Y',seq(1,zz),sep='')
  row.names(test) <- paste('Z',seq(1,100),sep='')
  ### my own implementation of optimism corrected bootstrapping
  ## 1. fit model to entire test data (i.e. overfit it)
  orig <- glmnet(test,y=labelsa,alpha=1,family = "binomial")
  preds <- predict(orig,newx=test,type='response',s=0.01)
  auc <- roc(labelsa,as.vector(preds))
  original_auc <- as.numeric(auc$auc)
  ## 2. take resample of data and try to estimate optimism
  test2 <- cbind(test,labelsa)
  B <- 50
  results <- matrix(ncol=2,nrow=B)
  for (b in seq(1,B)){
    ## get the bootstrapped resampled data
    boot <- test2[sample(row.names(test2),50),]
    labelsb <- boot[,ncol(boot)]
    boot <- boot[,-ncol(boot)]
    ## use the bootstrapped model to predict its own labels
    bootf <- glmnet(boot,y=labelsb,alpha=1,family = "binomial")
    preds <- predict(bootf,newx=boot,type='response',s=0.01)
    auc <- roc(labelsb,as.vector(preds))
    boot_auc <- as.numeric(auc$auc)
    ## use bootstrap model to predict original labels
    preds <- predict(bootf,newx=test,type='response',s=0.01)
    auc <- roc(labelsa,as.vector(preds))
    boot_original_auc <- as.numeric(auc$auc)
    ## add the data to results
    results[b,1] <- boot_auc
    results[b,2] <- boot_original_auc
  }
  ## calculate the optimism
  O <- mean(results[,1]-results[,2])
  ## calculate optimism corrected measure of prediction (AUC)
  corrected <- original_auc-O
  cc <- c(cc,corrected)
  print(cc)
}
## plot it
df <- data.frame(p=seq(2,100,1),optimism_corrected_boot_AUC=cc)
p1 <- ggplot(df, aes(x=p, y=optimism_corrected_boot_AUC)) +
  geom_line() +
  ggtitle('glmnet - random data only gives positive result with optimism corrected bootstrap')
print(p1)
png('glmnet_test_upto100.png', height = 15, width = 27, units = 'cm', res = 900, type = 'cairo')
print(p1)
dev.off()

So here are the results: as the number of noise-only features on the x axis increases, our 'corrected' estimate of AUC (on the y axis) also increases, once there are enough features to allow that noise to randomly predict the labels. The problem starts at about 40-50 features, then gets worse until about 75+. This is with the 'glmnet' function.

Let’s look at the method using glm; we find the same trend with a different implementation.

## TEST2
auc.adjust <- function(data, fit, B){
  fit.model <- fit
  data$pred.prob <- fitted(fit.model)
  # get overfitted AUC
  auc.app <- roc(data[,1], data$pred.prob, data=data)$auc # require 'pROC'
  auc.boot <- vector(mode = "numeric", length = B)
  auc.orig <- vector(mode = "numeric", length = B)
  o <- vector(mode = "numeric", length = B)
  for(i in 1:B){
    boot.sample <- sample.rows(data, nrow(data), replace=TRUE) # require 'kimisc'
    fit.boot <- glm(formula(fit.model), data = boot.sample, family = "binomial")
    boot.sample$pred.prob <- fitted(fit.boot)
    # get bootstrapped AUC
    auc.boot[i] <- roc(boot.sample[,1], boot.sample$pred.prob, data=boot.sample)$auc
    # get original data boot AUC
    data$pred.prob.back <- predict.glm(fit.boot, newdata=data, type="response")
    auc.orig[i] <- roc(data[,1], data$pred.prob.back, data=data)$auc
    # calculate optimism
    o[i] <- auc.boot[i] - auc.orig[i]
  }
  auc.adj <- auc.app - (sum(o)/B)
  return(auc.adj)
}

cc <- c()
for (zz in seq(2,100,1)){
  print(zz)
  ## set up test data
  test <- matrix(rnorm(100*zz, mean = 0, sd = 1),
                 nrow = 100, ncol = zz, byrow = TRUE)
  labelsa <- as.factor(c(rep('A',50),rep('B',50)))
  colnames(test) <- paste('Y',seq(1,zz),sep='')
  row.names(test) <- paste('Z',seq(1,100),sep='')
  test2 <- data.frame(cbind(labelsa,test))
  test2$labelsa <- as.factor(test2$labelsa)
  ## 1. make overfitted model
  model <- glm(labelsa ~., data = test2, family = "binomial")
  ## 2. estimate optimism and correct
  d <- auc.adjust(test2, model, B=200)
  cc <- c(cc,d)
}
## plot it
df <- data.frame(p=seq(2,100,1),optimism_corrected_boot_AUC=cc)
p2 <- ggplot(df, aes(x=p, y=optimism_corrected_boot_AUC)) +
  geom_line() +
  ggtitle('glm - random data only gives positive result with optimism corrected bootstrap')
print(p2)
png('glmt_test_upto100.png', height = 15, width = 27, units = 'cm', res = 900, type = 'cairo')
print(p2)
dev.off()

So there we have it: the method has a problem and should not be used with greater than about 40 features. Unfortunately this method is currently being used for datasets with more dimensions than this (40+), because people assume that since it is a published method it is safe. That is not how the world works, R readers. Remember, the system is corrupt, science and statistics are full of lies, and if in doubt, do your own tests on positive and negative controls.

This is with the 'glm' function. Random features added on x axis, the corrected AUC on y axis.

What if you don’t believe it? I mean, this is a text book method. Well, R readers, if that is so, I suggest you code it yourself and try the code here: run the experiments on null (random data only) datasets with increasing features.

This is the last part in the series on debunking the optimism corrected bootstrap method. I consider it debunked.

Your bootstrap here is STILL wrong, since

`replace = FALSE`

in your call to “sample”. You need to answer why the bootstrap fails here and wins in other applications.

The second part is someone else’s code, but replace is set to TRUE not FALSE:

boot.sample <- sample.rows(data, nrow(data), replace=TRUE)

I can't see what is wrong with that?

This is the correct implementation I think, as described here: Smith, Gordon CS, et al. “Correcting for optimistic prediction in small data sets.” American journal of epidemiology 180.3 (2014): 318-324.

As the Harrell method.

Although my code isn’t with replacement, you’re right, so it is a bit wrong 😛

Another issue is that, as far as I can see, the standard method is to not reduce N when doing the bootstrap.

Regardless of these points, the bias is still there… under these conditions.
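For reference, the standard nonparametric bootstrap draws a resample of the same size N with replacement, unlike the sample(row.names(test2),50) call in the glmnet code above, which takes 50 of 100 rows without replacement. A base-R sketch (the data frame d is a made-up example):

```r
set.seed(1)
d <- data.frame(x = rnorm(10))

# same N as the original data, sampled with replacement
idx <- sample(nrow(d), nrow(d), replace = TRUE)
boot <- d[idx, , drop = FALSE]

nrow(boot) == nrow(d)   # N is not reduced
anyDuplicated(idx) > 0  # duplicated rows are expected in a bootstrap resample
```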

About twenty years ago I was thrown into teaching classical statistics. Coming from a pure mathematics background, I was totally unsettled by this prospect. Why? Because mathematical proof of many of the techniques, and the terminology (e.g. variate vs random variable), was virtually impossible to find or, in my case, follow. So what do we do? Teach the statistical techniques like we teach life scientists to use a microscope. We hope that the people who developed the statistical packages got it right, and just use the packages without much thought about how they work. When I heard about “bootstrapping” I simply could not accept it. It seemed you were getting something for nothing; I still don’t get it. Of all the disciplines I have studied, mathematical statistics is the most difficult to work in, and proving the foundational concepts via pure mathematics is extremely difficult. I look back on the likes of R. A. Fisher and Gosset; how the hell they derived the t-distribution is beyond me. Even if we prove something mathematically, the real test is: does it work in the real world?

I have read that an indicator of an important theorem in Mathematics is that when you read it, before seeing the proof, you find it difficult to believe possible.

While there are many examples, such as the Bayesian analysis of the Monty Hall Problem, or Stein’s Result (which sets m.l.e. theory on its head for dimensions 3 or greater), the most recent such example for me is the Johnson–Lindenstrauss Lemma

The bootstrap is one of those, and while it won’t work for all estimators, particularly

some data-dependent ones, it works for an amazingly large set of estimators. There’s really no reason not to understand it, as the number of good textbooks available on it is large. There are even great and deep theoretical treatments, like Hall’s Bootstrap and Edgeworth Expansion, and many practical ones, like that by Davison and Hinkley, the one by Chernick, and the one by Lahiri. I daresay someone who hasn’t mastered the bootstrap along with Monte Carlo methods like MCMC or importance sampling or slice sampling or stochastic search methods or the LASSO does not know modern statistics, especially computational statistics. And of course, as in all numerical work, e.g. numerical linear algebra, using these without understanding the theory is asking for trouble, even if that’s not sufficient.

Classical frequentist techniques turn out to be not so great when seen through the lens of big actual problems and many dimensions. One of the most disappointing, to me, is the Method of Maximum Likelihood.

Fix the above second paragraph to:

“While there are many examples, such as the Bayesian analysis of the Monty Hall Problem, or Stein’s Result (which sets m.l.e. theory on its head for dimensions 3 or greater), the most recent such example for me is the Johnson–Lindenstrauss Lemma.”

I continue to be challenged by trying to type on (even) a Google Pixel 2.

Done.

I came across this post by chance when I wanted to explore the limitations of the use of bootstrapping for validation. I decided to use the code to reproduce the issue as you suggested.

At first glance, the behavior of your second experiment with logistic regression (using the glm function) did not surprise me, as overfitting is a well-known problem, especially when you add features. However, using regularized logistic regression (L1 regularization in your example, i.e. LASSO) should take care of the overfitting to some extent. So I was wondering why it showed an inflated optimism corrected AUC. I then realized that regularization was not performed correctly in your code. The optimal tuning parameter (lambda) is usually chosen by cross-validation, selecting the lambda that gives the minimum mean cross-validated error. Your code lacks this lambda tuning and consistently uses a penalty value of 0.01. The magnitude of the shrinkage of the features is highly dependent on the penalty value: as lambda gets larger, the bias increases but the variance drops.

Therefore, I changed the implementation of your first experiment (LASSO logistic regression) to determine this value using cross-validation, to get correct estimates of the optimism corrected AUC.

Here are the results:

Your code (lambda=0.01): optimism corrected AUC = 0.714 (CI 95%; 0.692 – 0.738)

10 x cross-validation: optimism corrected AUC = 0.446 (CI 95%; 0.418 – 0.474)

3 x cross-validation: optimism corrected AUC = 0.427 (CI 95%; 0.405 – 0.449)

When I manually selected an arbitrary higher lambda (let’s say s=0.07), then the optimism corrected AUC became 0.588 (CI 95%; 0.574 – 0.602).

Obviously, the correct use of the regularization parameter did the magic and as you can see from the results, the optimism corrected AUC is not far away from 0.5!

Lines of code to change:

orig <- cv.glmnet(test,y=labelsa,alpha=1,family = "binomial", nfolds=3)

preds <- predict(orig,newx=test,type='response',s="lambda.min")

Inside the bootstrap loop:

bootf <- cv.glmnet(boot,y=labelsb,alpha=1,family = "binomial", nfolds=3)

preds <- predict(bootf,newx=boot,type='response',s="lambda.min")

preds <- predict(bootf,newx=test,type='response',s="lambda.min")

Kind Regards,

Noman

Hi Noman

I don’t have time to carry on working on this at the moment. But,

Have you seen that the caret implementation shows an identical problem? This is not a coincidence, and their glmnet code is tuned over lambda and alpha.

You have also seen that the glm function with a standard logistic regression shows the same problem… and you imply this model will inherently cause inflated AUC with any cross validation method? No: if you run a standard cross validation loop using logistic regression or any machine learning model, and hide your test data in each iteration, this overestimation of the AUC does not occur. You can try it yourself when p >> N using noise-only features. I suggest using caret to standardise your comparisons with well tested code.
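A base-R sketch of such a loop (the helper auc_fun and all names are mine): plain k-fold cross validation on noise-only features, with the test fold hidden from the fit in each iteration, keeps the held-out AUC near 0.5:

```r
# Hand-rolled AUC via the rank-sum identity (all names here are mine)
auc_fun <- function(labels, scores) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

set.seed(1)
n <- 100
p <- 50                                    # noise-only features, p large relative to N
X <- matrix(rnorm(n * p), n, p)
y <- rep(0:1, each = n / 2)
d <- data.frame(y = y, X)

k <- 10
folds <- sample(rep(1:k, length.out = n))  # random fold assignment
cv_auc <- sapply(1:k, function(i) {
  # the test fold is completely hidden from the model fit
  # (suppressWarnings: glm may warn about separation on null data with many features)
  fit <- suppressWarnings(glm(y ~ ., data = d[folds != i, ], family = binomial))
  preds <- predict(fit, newdata = d[folds == i, ], type = "response")
  auc_fun(d$y[folds == i], preds)
})
mean(cv_auc)  # near 0.5: no inflated performance on null data
```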

This problem occurs because the training and test samples are not kept separate when estimating the optimism in this dodgy method. I have shown this clearly here: https://intobioinformatics.wordpress.com/2018/12/28/part-4-more-bias-and-why-does-bias-occur-in-optimism-corrected-bootstrapping/.

If you disagree, send me an email with the experiments using the Caret implementation and we can talk further.

Best wishes,

Chris

Try this:

library(caret)

allresults <- matrix(ncol=2,nrow=200)
i = 0
for (z in seq(200,1000,100)){
  i = i + 1
  # select only two species
  iris <- subset(iris, iris$Species == 'versicolor' | iris$Species == 'virginica')
  iris$Species <- droplevels(iris$Species)
  # generate random data
  test <- matrix(rnorm(100*z, mean = 0, sd = 1),
                 nrow = 100, ncol = z, byrow = TRUE)
  # bind random data
  iris <- cbind(iris,test)
  # remove real data
  iris <- iris[,-1:-4]
  # cross validation
  ctrl <- trainControl(method = 'cv',
                       summaryFunction=twoClassSummary,
                       classProbs=T,
                       savePredictions = T,
                       verboseIter = T)
  fit3 <- train(as.formula(paste('Species','~','.')), data=iris,
                method="glm", family="binomial", # preProc=c("center", "scale")
                trControl=ctrl, metric = "ROC")
  allresults[i,1] <- max(fit3$results$ROC)
  # optimism corrected bootstrapping
  ctrl <- trainControl(method = 'optimism_boot',
                       summaryFunction=twoClassSummary,
                       classProbs=T,
                       savePredictions = T,
                       verboseIter = T)
  fit4 <- train(as.formula(paste('Species','~','.')), data=iris,
                method="glm", family="binomial", # preProc=c("center", "scale")
                trControl=ctrl, metric = "ROC")
  allresults[i,2] <- max(fit4$results$ROC)
  rm(iris)
}