The truth is out there, R readers, but often it is not what we have been led to believe. The previous post examined the strong positive results bias in optimism corrected bootstrapping (a method of assessing a machine learning model’s predictive power) as p increases (completely random features). This bias is real and is affecting publications. However, of the two implementations of the method given in the previous post, the first had a slight error and the second seems fine. The trend is still the same with the corrected code, but the problem with the earlier code was that I did not set ‘replace=TRUE’ in the call to the ‘sample’ function. The problem inherent to the ‘optimism corrected bootstrapping’ method is more about estimating error using the same samples for training and testing, than about how we resample the data, so this change does not make a large difference to the results. We will go into a little more detail about bootstrapping in this article and repeat the analyses with the corrected code. Once again, you are welcome to re-run the code written here and also to test using the independent Caret implementation shown on the other page.

For the uninformed reader, I have also shown the problem using Caret (https://intobioinformatics.wordpress.com/2018/12/25/optimism-corrected-bootstrapping-a-problematic-method/), and where it originates from in the method here (it is obvious with a bit of statistical knowledge) (https://intobioinformatics.wordpress.com/2018/12/28/part-4-more-bias-and-why-does-bias-occur-in-optimism-corrected-bootstrapping/). The test I have done is very simple (simulated null data-sets with increasing numbers of features) and shows the problem with the method very clearly.

Thanks to ECOQUANT for pointing out to me that the ‘sample’ function should have been called with the additional parameter ‘replace=TRUE’.

Let’s just recap **what bootstrapping is** and **what optimism corrected bootstrapping is** before we redo the experiments:

According to Jason’s excellent blog (https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/), bootstrapping works as follows:

- Choose a number of bootstrap samples to perform
- Choose a sample size
- For each bootstrap sample (b = 1 … B):
  - Draw a sample with replacement of the chosen size
  - Calculate the statistic on the sample
- Calculate the mean of the calculated sample statistics.

The “with replacement” part means each drawn observation is put back into the pool before the next draw within the bth bootstrap iteration. Thus, a bootstrap sample of the data usually contains duplicate observations.
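The steps above can be sketched in a few lines of base R. This is a minimal illustration, with arbitrary choices of statistic (the median), data, and B:

```r
# Minimal sketch of the basic bootstrap: estimate the mean of a
# statistic (here, the sample median) over B resamples drawn with
# replacement. The data and B are illustrative choices only.
set.seed(1)
x <- rnorm(100)   # original sample
B <- 200          # number of bootstrap samples
stats <- replicate(B, {
  xb <- sample(x, length(x), replace = TRUE)  # resample with replacement
  median(xb)                                  # statistic on the resample
})
mean(stats)       # bootstrap estimate of the statistic
```

Note that without `replace = TRUE`, each “resample” would just be a permutation of the original data, which is the error corrected in this post.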

This is the optimism corrected bootstrapping algorithm:

1. Fit a model M to the entire data S and estimate predictive ability C.
2. Iterate from b = 1 … B:
   1. Take a resample of the original data, S*
   2. Fit the bootstrap model M* to S* and get predictive ability, C_boot
   3. Use the bootstrap model M* to get predictive ability on S, C_orig
3. Optimism O is calculated as mean(C_boot – C_orig).
4. Calculate the optimism corrected performance as C – O.

Since we use the same data in step 3 of the bootstrap to both train and test the model (an information leak), we would expect increasing bias (C_orig is too high, thus O too small) as more and more random features are added. See the previous post for more explanation on this. Another point is that the optimism corrected bootstrap is usually done with a resample size of N rather than just a fraction of N. I found the following quote to support this:
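The algorithm can be sketched generically in R before we get to the full experiments. This is a minimal stand-in using accuracy of a logistic regression (glm) as the performance statistic C; the data and variable names here are illustrative, not the code from the experiments below:

```r
# Minimal, generic sketch of optimism corrected bootstrapping, using
# accuracy of a logistic regression as the performance statistic C.
# The data are null (labels unrelated to the feature).
set.seed(1)
n <- 100
d <- data.frame(y = rbinom(n, 1, 0.5), x = rnorm(n))
score <- function(fit, newdata) {                 # C: classification accuracy
  p <- predict(fit, newdata = newdata, type = "response")
  mean((p > 0.5) == newdata$y)
}
fit_full <- glm(y ~ x, data = d, family = "binomial")
C <- score(fit_full, d)        # apparent (overfitted) performance on S
B <- 50
opt <- replicate(B, {
  db <- d[sample(nrow(d), nrow(d), replace = TRUE), ]  # resample S*
  fb <- glm(y ~ x, data = db, family = "binomial")     # bootstrap model M*
  score(fb, db) - score(fb, d)                         # C_boot - C_orig
})
O <- mean(opt)                 # optimism
C_corrected <- C - O           # optimism corrected performance
```

The information leak is visible in the last line of the `replicate` body: the same full data set `d` is used both to define the apparent performance and as the “test set” for every bootstrap model.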

“The small data set was repeatedly re-sampled to produce b replicated data sets, each the same size as the original. We used b = 200. The predictive model was fitted to each of the b replicated data sets in turn. Each fitted model was then applied both to the resampled data set from which it was generated and to the original data set.”

Smith, Gordon CS, et al. “Correcting for optimistic prediction in small data sets.” *American journal of epidemiology* 180.3 (2014): 318-324.

I have tried reducing the re-sampling size to a fraction of N, which reduces the bias somewhat, but it is still there. This makes sense given the information leak in this method, which results in an underestimation of the optimism (O).

You’re welcome to experiment with this code yourselves. If you are thinking of using this method, I recommend first simulating null datasets with the same number of dimensions as your data to check how biased your AUC/model performance will be. When we have high numbers of features, using this method is clearly a serious mistake.
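A null data set of the same shape as your real data can be simulated like this (n, p, and the class split are placeholders to be matched to your own data):

```r
# Simulate a null data set with the same dimensions as your real data:
# p Gaussian features with no relationship to the labels. Any AUC
# clearly above 0.5 on such data is bias from the validation procedure.
set.seed(1)
n <- 100                                  # match your sample size
p <- 50                                   # match your feature count
null_x <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(null_x) <- paste0("Y", seq_len(p))
null_y <- factor(rep(c("A", "B"), each = n / 2))  # arbitrary labels
```

Running your full model-building and validation pipeline on `null_x`/`null_y` gives a direct estimate of the bias for your dimensionality.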

This code can be directly copied and pasted into R to repeat the experiments.

**Experiment 1: my implementation – glmnet (lasso logistic regression)**

```r
library(glmnet)
library(pROC)
library(caret)
library(ggplot2)
library(kimisc)

### TEST 1: bootstrap optimism with glmnet
cc <- c()
for (zz in seq(2,100,1)){
  print(zz)
  ## set up test data
  test <- matrix(rnorm(100*zz, mean = 0, sd = 1),
                 nrow = 100, ncol = zz, byrow = TRUE)
  labelsa <- as.factor(c(rep('A',50),rep('B',50)))
  colnames(test) <- paste('Y',seq(1,zz),sep='')
  row.names(test) <- paste('Z',seq(1,100),sep='')
  ### my own implementation of optimism corrected bootstrapping
  ## 1. fit model to entire test data (i.e. overfit it)
  orig <- glmnet(test,y=labelsa,alpha=1,family = "binomial")
  preds <- predict(orig,newx=test,type='response',s=0.01)
  auc <- roc(labelsa,as.vector(preds))
  original_auc <- as.numeric(auc$auc)
  ## 2. take resample of data and try to estimate optimism
  test2 <- cbind(test,labelsa)
  B <- 50
  results <- matrix(ncol=2,nrow=B)
  for (b in seq(1,B)){
    ## get the bootstrapped resampled data
    boot <- test2[sample(row.names(test2),100,replace=TRUE),]
    labelsb <- boot[,ncol(boot)]
    boot <- boot[,-ncol(boot)]
    ## use the bootstrapped model to predict its own labels
    bootf <- glmnet(boot,y=labelsb,alpha=1,family = "binomial")
    preds <- predict(bootf,newx=boot,type='response',s=0.01)
    auc <- roc(labelsb,as.vector(preds))
    boot_auc <- as.numeric(auc$auc)
    ## use bootstrap model to predict original labels
    preds <- predict(bootf,newx=test,type='response',s=0.01)
    auc <- roc(labelsa,as.vector(preds))
    boot_original_auc <- as.numeric(auc$auc)
    ## add the data to results
    results[b,1] <- boot_auc
    results[b,2] <- boot_original_auc
  }
  ## calculate the optimism
  O <- mean(results[,1]-results[,2])
  ## calculate optimism corrected measure of prediction (AUC)
  corrected <- original_auc-O
  ##
  cc <- c(cc,corrected)
  print(cc)
}
## print it
df <- data.frame(p=seq(2,100,1),optimism_corrected_boot_AUC=cc)
p1 <- ggplot(df, aes(x=p, y=optimism_corrected_boot_AUC)) +
  geom_line() +
  ggtitle('glmnet - random data only gives positive result with optimism corrected bootstrap')
print(p1)

png('glmnet_test_upto100.png', height = 15, width = 27, units = 'cm',
    res = 900, type = 'cairo')
print(p1)
dev.off()
```

Here are the results with 100 samples and 50 bootstrap iterations from 2 to 100 random features from a Gaussian distribution. We are re-sampling using the original sample size (N=100).

Random features are added iteratively on the X axis; on the Y axis we have the AUC. The AUC should be 0.5, reflecting that the data have no real predictive power, but it is highly inflated.

**Experiment 2: another implementation – glm (logistic regression)**
```r
## TEST2
auc.adjust <- function(data, fit, B){
  fit.model <- fit
  data$pred.prob <- fitted(fit.model)
  # get overfitted AUC
  auc.app <- roc(data[,1], data$pred.prob, data=data)$auc # require 'pROC'
  auc.boot <- vector(mode = "numeric", length = B)
  auc.orig <- vector(mode = "numeric", length = B)
  o <- vector(mode = "numeric", length = B)
  for(i in 1:B){
    boot.sample <- sample.rows(data, nrow(data), replace=TRUE) # require 'kimisc'
    fit.boot <- glm(formula(fit.model), data = boot.sample, family = "binomial")
    boot.sample$pred.prob <- fitted(fit.boot)
    # get bootstrapped AUC
    auc.boot[i] <- roc(boot.sample[,1], boot.sample$pred.prob, data=boot.sample)$auc
    # get original data boot AUC
    data$pred.prob.back <- predict.glm(fit.boot, newdata=data, type="response")
    auc.orig[i] <- roc(data[,1], data$pred.prob.back, data=data)$auc
    # calculate optimism
    o[i] <- auc.boot[i] - auc.orig[i]
  }
  auc.adj <- auc.app - (sum(o)/B)
  return(auc.adj)
}

cc <- c()
for (zz in seq(2,100,1)){
  print(zz)
  ## set up test data
  test <- matrix(rnorm(100*zz, mean = 0, sd = 1),
                 nrow = 100, ncol = zz, byrow = TRUE)
  labelsa <- as.factor(c(rep('A',50),rep('B',50)))
  colnames(test) <- paste('Y',seq(1,zz),sep='')
  row.names(test) <- paste('Z',seq(1,100),sep='')
  test2 <- data.frame(cbind(labelsa,test))
  test2$labelsa <- as.factor(test2$labelsa)
  ## 1. make overfitted model
  model <- glm(labelsa ~., data = test2, family = "binomial")
  ## 2. estimate optimism and correct
  d <- auc.adjust(test2, model, B=200)
  cc <- c(cc,d)
}
## print it
df <- data.frame(p=seq(2,100,1),optimism_corrected_boot_AUC=cc)
p2 <- ggplot(df, aes(x=p, y=optimism_corrected_boot_AUC)) +
  geom_line() +
  ggtitle('glm - random data only gives positive result with optimism corrected bootstrap')
print(p2)

png('glmt_test_upto100.png', height = 15, width = 27, units = 'cm',
    res = 900, type = 'cairo')
print(p2)
dev.off()
```

I haven’t made a bigger deal about all this because, as a Bayesian, I’m amused by what I necessarily believe to be an excessive reliance upon a diagnostic construct, the ROC, which in itself can have problems. To me the most striking one is that, even in a hypothesis testing framework and view, the ROC construction dance makes one believe there’s an inherent coupling between probability of rejecting the null and probability of accepting the alternative. In fact, those are picked out of the air, and to make them identical implies things about the cost of making an error in one way or the other, and the shape of those loss densities.

At core this is because, as a Bayesian, there’s no such thing to me as a classical “test”, so a “false alarm” or a “false negative” don’t mean a lot. If there’s a Bernoulli outcome involved, I’m interested in

π(θ | y) ∝ L(y | θ) π(θ),

which is a posterior probability space having as many dimensions as θ. π denotes a prior. L denotes a conditional probability. I don’t see the point of collapsing that much farther, because it’s pretty irreducible, unless a prediction is wanted, which presumably is going to inform some kind of decision. Then loss functions are appropriate, as well as priors on θ.

So, the idea of basing a choice of a model as superior or not on whether some index of an ROC is better or worse seems misguided. I know this is done all the time.

But there are other things to think about, such as:

- Misspecification error, something which afflicts Bayesian modeling as well as frequentist, and is worth bounding in some way. In fact, there are conceptual problems here for dimension reduction. By rights, re-choosing a set of predictors ought to be repeated for the held-out part of a dataset, and might end up with a different set of predictors than the training one; but if that is done, apples and oranges are being compared and the result is pretty meaningless. If re-choosing predictors is not repeated, then it’s possible the model picked based upon training data might not be appropriate for the hold-out, so test performance looks worse than it otherwise should. Indeed, a simpler regularized model then often works better on the entire dataset, even if it works more poorly on the training set.
- There are other issues as well, like balance and overall m.s.e., which is a function of both bias and variance, and the key thing to remember is that there are plenty of situations where accepting some bias is better for the overall result than none, per James-Stein.
- And when ROC is done, there are choices on how it’s done: there is such a thing as a Bayesian ROC calculation, although, per the above, I’d need to think through what that really means before diving in to use it.

This came up elsewhere in the Exchange, based upon an insight, but kind of missing the entire point of a Bayesian approach. It also received comments from a Favorite Professor of mine. (His *Regression Modeling Strategies* is a must-study, as well as his software, which is some of the best done and maintained on CRAN.) Finally, note that these kinds of prediction errors are also affected by sample size.

Interesting, I have yet to look more into Bayesian statistics. Hopefully I will get the chance sometime.