Part 6: How not to validate your model with optimism corrected bootstrapping

When evaluating a machine learning model, using the same data to train and test it results in overfitting. The model then appears to perform much better than it would on completely new data, because it has learned from random noise within the training data. New data contains different noise, so the overfitted model cannot reproduce its apparent accuracy on data it has not seen.

Keeping the training and test datasets completely separate throughout the machine learning workflow, from feature selection to model fitting, mitigates this problem. This is the basis of cross validation, which iteratively splits the data into different chunks and rotates over them so that no sample is used for training and testing at the same time. Normally, we use 100 times repeated 10 fold cross validation for medium to large datasets and leave-one-out cross validation (LOOCV) for small datasets.

There is another technique intended to correct for the “optimism” that results from fitting a model to the same data used to test it, called optimism corrected bootstrapping. However, it is based on fundamentally unsound statistical principles that can introduce major bias into the results under certain conditions.

This is the optimism corrected bootstrapping method:

  1. Fit a model M to entire data S and estimate predictive ability C.
  2. Iterate from b=1…B:
    1. Take a bootstrap sample from the original data, S*
    2. Fit the bootstrap model M* to S* and get predictive ability, C_boot
    3. Use the bootstrap model M* to get predictive ability on S, C_orig
  3. Optimism O is calculated as mean(C_boot – C_orig)
  4. Calculate optimism corrected performance as C-O.
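
To make the steps concrete, here is a minimal schematic of the procedure in R; fit_fun and perf_fun are hypothetical placeholders for whatever model fitting routine and performance metric (e.g. AUC) you use, not functions from a specific package. The full working implementations follow below.

optimism_corrected <- function(data, fit_fun, perf_fun, B = 50) {
  M <- fit_fun(data)
  C <- perf_fun(M, data)                                 # apparent performance on all of S
  optimism <- replicate(B, {
    S_star <- data[sample(nrow(data), replace = TRUE), ] # bootstrap sample S*
    M_star <- fit_fun(S_star)
    C_boot <- perf_fun(M_star, S_star)                   # bootstrap model tested on S*
    C_orig <- perf_fun(M_star, data)                     # bootstrap model tested on S
    C_boot - C_orig
  })
  C - mean(optimism)                                     # optimism corrected performance
}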

As I have stated previously, because the same data is used for training and testing in step 3 of the bootstrapping loop, there is an information leak: the data used to fit M* includes many of the same samples used to test it when it is applied to S. This leads to C_orig being too high, so the optimism (O) is expected to be too low, and the corrected performance from step 4 ends up too high. This can be a major problem.

The strength of this bias depends strongly on: 1) the number of variables used, 2) the number of samples used, and 3) the machine learning model used.

I can show the bias using glmnet and random forest in these new experiments. This code differs from Part 5 because I am changing the ROC calculation to use my own evaluation package (https://cran.r-project.org/web/packages/MLeval/index.html), and also examining random forest instead of just logistic regression.

I am going to use a low N, high p simulation, because that is what I am used to working with. N is the number of observations and p the number of features. Such dimensionalities are very common in the health care and bioinformatics industries and in academia, and in this setting the method can give highly erroneous results, some of which end up in publications.

I am simulating null datasets here with random features sampled from a Gaussian distribution, so we should not expect any predictive ability; the area under the ROC curve should be about 0.5 (chance expectation).

Experiment 1: my implementation.

N=100 and p=5 to 500 by 50. Random forest can be selected by using ‘ranger’ or lasso by using ‘glmnet’.

## my implementation optimism corrected bootstrapping

library(ggplot2)
library(MLeval)
library(ranger)
library(glmnet)

N <- 100
alg <- 'ranger'

## loop over increasing number features
cc <- c()
for (zz in seq(5,500,50)){

print(zz)

## set up test data
test <- matrix(rnorm(N*zz, mean = 0, sd = 1),
nrow = N, ncol = zz, byrow = TRUE)
labelsa <- as.factor(sample(c(rep('A',N/2),rep('B',N/2))))
colnames(test) <- paste('Y',seq(1,zz),sep='')
row.names(test) <- paste('Z',seq(1,N),sep='')
test <- data.frame(test)
test[,zz+1] <- labelsa
colnames(test)[zz+1] <- 'XXX'
print(dim(test))

## 1. fit model to entire data and predict labels on same data
if (alg == 'glmnet'){
#orig <- glm(XXX~.,data=test,family='binomial')
X <- test[,1:zz]
Y <- labelsa
orig <- glmnet(as.matrix(X),Y,family='binomial')
preds <- predict(orig,newx=as.matrix(test[,-(zz+1)]),type='response',s=0.01)
}else{
orig <- ranger(XXX~.,data=test,probability=TRUE)
preds <- predict(orig,data=test[,-(zz+1)],type='response')
}

## 2. use MLeval to get auc
if (alg == 'ranger'){
preds <- data.frame(preds$predictions)
}else{
preds <- data.frame(B=preds[,1],A=1-preds[,1])
}
preds$obs <- labelsa
x <- evalm(preds,silent=TRUE,showplots=FALSE)
x <- x$stdres
auc <- x$Group1[13,1] # AUC-ROC sits in row 13 of the MLeval standard results table
original_auc <- auc
print(paste('original:',auc))

## 3. bootstrapping to estimate optimism
B <- 50
results <- matrix(ncol=2,nrow=B)

for (b in seq(1,B)){

# get the bootstrapped data
boot <- test[sample(row.names(test),N,replace=TRUE),]
labelsb <- boot[,ncol(boot)]

# use the bootstrapped model to predict bootstrapped data labels
if (alg == 'ranger'){
bootf <- ranger(XXX~.,data=boot,probability=TRUE)
preds <- predict(bootf,data=boot[,-(zz+1)],type='response')
}else{
#bootf <- glm(XXX~.,data=boot,family='binomial')
X <- boot[,1:zz]
Y <- labelsb
bootf <- glmnet(as.matrix(X),Y,family='binomial')
preds <- predict(bootf,newx=as.matrix(boot[,-(zz+1)]),type='response',s=0.01)
}

# get auc of boot on boot
if (alg == 'ranger'){
preds <- data.frame(preds$predictions)
}else{
preds <- data.frame(B=preds[,1],A=1-preds[,1])
}
preds$obs <- labelsb
x <- evalm(preds,silent=TRUE,showplots=FALSE)
x <- x$stdres
boot_auc <- x$Group1[13,1]

# boot on test/ entire data
if (alg == 'ranger'){
## use bootstrap model to predict original labels
preds <- predict(bootf,data=test[,-(zz+1)],type='response')
}else{
preds <- predict(bootf,newx=as.matrix(test[,-(zz+1)]),type='response',s=0.01)
}

# get auc
if (alg == 'ranger'){
preds <- data.frame(preds$predictions)
}else{
preds <- data.frame(B=preds[,1],A=1-preds[,1])
}
preds$obs <- labelsa
x <- evalm(preds,silent=TRUE,showplots=FALSE)
x <- x$stdres
boot_original_auc <- x$Group1[13,1]
#

## add the data to results
results[b,1] <- boot_auc
results[b,2] <- boot_original_auc
}

## calculate the optimism
O <- mean(results[,1]-results[,2])

## calculate optimism corrected measure of prediction (AUC)
corrected <- original_auc-O

## append
cc <- c(cc,corrected)
print(cc)
}

[Figure: optimism corrected AUC versus number of random features, my implementation]

We can see that even with random features, optimism corrected bootstrapping can give very inflated performance metrics. Random data should give an AUC of about 0.5. This experiment shows the value of simulating null datasets and testing your machine learning pipeline on data with similar characteristics to the data you will actually be using.

Below is code and results from the Caret implementation of this method (https://cran.r-project.org/web/packages/caret/index.html). Note that the results are very similar to mine.

Caret will tune over a range of lambda for lasso by default (results below), whereas above I just selected a single value of this parameter. The bias of this method is much worse for models such as random forest and gbm, I suspect because they use decision trees, which are non-linear and so can more easily build accurate-looking models out of noise when tested on the same data used to train them.
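
If you want Caret to mirror the single penalty value used in my implementation instead of tuning lambda, a minimal sketch of how to pin it (lambda = 0.01 here simply echoes the s = 0.01 used above):

library(caret)
## lasso (alpha = 1) with one fixed penalty rather than caret's default tuning grid
grid <- expand.grid(alpha = 1, lambda = 0.01)
## then pass tuneGrid = grid into train(..., method = 'glmnet', trControl = ctrl, metric = 'ROC')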

You can experiment easily with different types of cross validation and bootstrapping on null datasets using Caret. I invite you to do this: if you change the cross validation method below to, e.g., LOOCV or repeatedcv (a sketch of the change follows this paragraph), you will see that these methods are not subject to this major optimism bias. They have their own issues, but I have found optimism corrected bootstrapping can give completely the wrong conclusion very easily, and it seems not everyone knows about this problem before using it.
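
As a minimal sketch of that swap, assuming the same simulated test data frame used in the Caret code below (the number of repeats here is just an illustrative choice):

library(caret)
ctrl <- trainControl(method = 'repeatedcv', number = 10, repeats = 10,
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE, savePredictions = TRUE)
## then call train() exactly as in the code below, passing this ctrl in place of the
## optimism_boot control object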

I recommend using Caret for machine learning. The model type and cross validation method can be changed with very little effort. If ranger is changed to gbm in the below code we get a similar bias to using ranger.

Experiment 2: Caret implementation.

N=100 and p=5 to 500 by 50. Random forest can be selected by using ‘ranger’ or lasso by using ‘glmnet’.

## caret optimism corrected bootstrapping test

library(caret)

cc <- c()
i = 0
N <- 100

## add features iterate
for (zz in seq(5,500,50)){

i = i + 1

# simulate data
test <- matrix(rnorm(N*zz, mean = 0, sd = 1),
nrow = N, ncol = zz, byrow = TRUE)
labelsa <- as.factor(sample(c(rep('A',N/2),rep('B',N/2))))
colnames(test) <- paste('Y',seq(1,zz),sep='')
row.names(test) <- paste('Z',seq(1,N),sep='')
test <- data.frame(test)
test[,zz+1] <- labelsa
colnames(test)[zz+1] <- 'XXX'

# optimism corrected bootstrapping with caret
ctrl <- trainControl(method = 'optimism_boot',
summaryFunction=twoClassSummary,
classProbs=T,
savePredictions = T,
verboseIter = F)
fit4 <- train(as.formula( paste( 'XXX', '~', '.' ) ), data=test,
method="ranger", # preProc=c("center", "scale")
trControl=ctrl, metric = "ROC") #

cc <- c(cc, max(fit4$results$ROC))
print(max(fit4$results$ROC))
}

[Figure: optimism corrected AUC versus number of random features, Caret implementation]

Code and png files zipped here: optimism_boot_experiment.

Further reading:

https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/


Consensus clustering in R

The logic behind the Monti consensus clustering algorithm is that, in the face of resampling, the ideal clusters should be stable; thus any pair of samples should either always or never cluster together. We can use this principle to infer the optimal number of clusters (K). This works by examining cluster stability from K=2 to K=10 during resampling. For every K, we calculate the consensus rate for each sample pair, which is the fraction of times the pair clusters together given that both samples were resampled. This gives us a consensus matrix for every K, which is symmetrical and in the range 0 to 1. A matrix entirely of 1s and 0s represents perfect stability for a given K. We can then compare the stability of these matrices from K=2 to K=10 to determine the optimal K.
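
To make the consensus rate concrete, here is a minimal sketch in R of how such a matrix could be assembled; it assumes subsampling without replacement (unique sample IDs in each run) and is not the exact code used by M3C or ConsensusClusterPlus.

consensus_matrix_sketch <- function(assignments, sample_ids) {
  ## assignments: a list where each element is a named vector of cluster labels from one resampling run
  n <- length(sample_ids)
  together <- matrix(0, n, n, dimnames = list(sample_ids, sample_ids)) # times a pair clustered together
  sampled  <- matrix(0, n, n, dimnames = list(sample_ids, sample_ids)) # times a pair was resampled together
  for (cl in assignments) {
    ids <- names(cl)
    sampled[ids, ids]  <- sampled[ids, ids] + 1
    together[ids, ids] <- together[ids, ids] + outer(cl, cl, "==") * 1
  }
  together / pmax(sampled, 1) # consensus rate for each sample pair
}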

This is a very nice concept and the consensus matrix gives us a helpful visualisation tool. However, a major problem with this approach is that on random data the consensus matrix converges towards a matrix of perfect stability simply by chance. A second problem, which is general to cluster analysis, is that we do not test the null hypothesis K=1. We developed M3C to correct for the internal bias of consensus clustering by using a Monte Carlo simulation, and M3C can also test the null hypothesis K=1.

We have recently had the M3C paper published (https://www.nature.com/articles/s41598-020-58766-1), and this blog post demonstrates a new objective function included in the M3C algorithm (https://bioconductor.org/packages/devel/bioc/html/M3C.html) which did not get the chance to get into the paper. I’m going to do this without math notation because this is pretty informal.

Essentially we are dealing with probabilities in the consensus matrix, where each element is the probability of samples i and j clustering together, given they were resampled together. Seeing the consensus matrix in this way naturally leads to entropy as a very appropriate function to describe the stability (or uncertainty) of each consensus matrix. We can simply calculate the binary information entropy of each matrix and minimise it to determine the best K. Entropy, in this context, measures the degree of surprisal, and if these are indeed good clusters we want the minimum surprise.
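
As a minimal sketch of this idea (not the exact M3C internals), the binary entropy of a consensus matrix can be computed like so, and the K that minimises it preferred:

consensus_entropy <- function(cm) {
  p <- cm[upper.tri(cm)]                      # pairwise consensus probabilities
  p <- pmin(pmax(p, 1e-10), 1 - 1e-10)        # clamp to avoid log(0)
  mean(-p * log2(p) - (1 - p) * log2(1 - p))  # mean binary information entropy
}
## e.g. entropies <- sapply(consensus_matrices, consensus_entropy)
## the minimum entropy suggests the best K, subject to the chance correction discussed below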

Let’s do some analyses with the upgraded version of M3C on some glioblastoma cancer data (GBM).

## load package and test data
library(M3C)
dim(mydata) # we have 1740 features and 50 samples (gbm data)

## test code
test <- M3C(mydata)

Below is the CDF plot for every consensus matrix from K=2 to 10; it describes the distribution of numerical values in each matrix. A flat line would indicate a matrix of all 1s and 0s. We can see a problem here because, as K increases, the line converges towards a matrix of perfect stability. This also happens on random data with one cluster.

I think the CDF is a nice visualisation tool, but entropy does not require a CDF to work. Some metrics like the PAC score and delta K are calculated from the CDF, but I think it is more parsimonious to work directly with the consensus matrix probabilities. None of these metrics correct for the convergence-at-chance problem of this method; for that we need the Monte Carlo simulation.

[Figure: consensus matrix CDF plot, K=2 to 10]

This next plot is entropy for every consensus matrix from 2 … 10.

Again, we can see a problem here because, although there is an elbow at K=4, the most certain or stable consensus matrix is for K=10. This is because even on random data, the algorithm generates consensus matrices that become more stable as K increases (see our paper for more details on this).

[Figure: entropy for each consensus matrix, K=2 to 10]

Remember, information entropy describes the uncertainty in the system; for the best clustering we want to minimise uncertainty (or surprisal) as much as possible despite perturbation. However, the random expectation must be taken into account.

The next plot is the RCSI, which corrects entropy for the chance expectation using a reference distribution derived using a Monte Carlo simulation and calculates 95% confidence intervals. We can see the bias has been corrected, and now we are just looking for a maximum value instead of elbows or last values before the floor, or other subjective approaches. This is nice.

[Figure: RCSI with 95% confidence intervals, K=2 to 10]

Four clusters is well supported by the histological data in which there are four subtypes for GBM, so I am satisfied with the decision made here. M3C also calculates p values for each K to test the null hypothesis K=1, see our vignette for more details.

Here are some general pointers on M3C and cluster analysis on Omic data.

  • When deciding K using M3C it is best to examine entropy, the RCSI, and the p values.
  • For more accurate p values on some datasets, it is good to increase the number of Monte Carlo simulations, which give us the null distribution. The default is 25; this could be increased to 100.
  • Inner replications are set to 100, which is usually fine; however, on some datasets it is better to increase this to 250.
  • It is preferable to run SigClust on all pairs of clusters to check the significance.
  • It is preferable to run another method to support the K decision, we have found CLEST works well, a classic algorithm with some similarities to consensus clustering. This is implemented in the RSKC package.
  • The silhouette width does not work well on high dimensional cancer data, in our experience and in that of Șenbabaoğlu et al. in their 2014 paper. A possible alternative is to run t-SNE first and then the silhouette width on the data with reduced dimensionality. Estimating K on complex omic data is a hard problem to do well because of the high dimensionality and noise.

For a demonstration of M3C on real TCGA data, see the supplementary information of this paper, in Table 1. The p value method was used to select K when running M3C here.

https://doi.org/10.1093/bioinformatics/btz704

For demonstration of M3C on simulated data and more TCGA data, this is the main paper that includes the method, but we used the PAC score instead of entropy as the objective function to minimise to find the best K. We still include the PAC score in the Bioconductor software because it is pretty good.

https://doi.org/10.1038/s41598-020-58766-1

Thanks for reading.

 

 

 

How to make a precision recall curve in R

Precision recall (PR) curves are useful for machine learning model evaluation when there is an extreme imbalance in the data and the analyst is particularly interested in one class. A good example is credit card fraud, where the instances of fraud are extremely few compared with non fraud. Here are some facts about PR curves.

  • PR curves are sensitive to which class is defined as positive, unlike the ROC curve which is a more balanced method. So we will get a completely different result depending on which class is set as positive.
  • PR curves are sensitive to class distribution, so if the ratio of positives to negatives changes across different analyses, then the PR curve cannot be used to compare performance between them. This is because the chance line varies between datasets with different class distributions.
  • PR curves, because they use precision instead of specificity (like ROC), can pick up false positives in the predicted positive fraction. This is very helpful when negatives >> positives. In these cases, the ROC is pretty insensitive and can be misleading, whereas PR curves reign supreme.
  • The area under the PR curve does not have a probabilistic interpretation like ROC.

The PR gain curve was designed to deal with some of the above problems with PR curves, although it is still intended for extreme class imbalance situations. The main difference is that the PR gain curve has a universal baseline, as precision is corrected according to chance expectation. We will see this in an example.
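
To make the baseline point concrete: the PR chance line sits at the positive class prevalence (pi), and the gain versions rescale precision and recall so that this baseline maps to 0 and perfect performance to 1. A minimal sketch, using the precision and recall gain definitions from Flach and Kull (2015) as I understand them:

precision_gain <- function(prec, pi) (prec - pi) / ((1 - pi) * prec)
recall_gain    <- function(rec,  pi) (rec  - pi) / ((1 - pi) * rec)

## e.g. with 100 positives out of 2000 samples, pi = 0.05, so the PR baseline is 0.05
precision_gain(0.05, 0.05)  # 0: performing at chance
precision_gain(1.00, 0.05)  # 1: perfect precision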

Note, we are using non repeated cross validation with Caret to save time, but in general it should be repeated to reduce variance and bias. Also, for imbalanced data in Caret we need to switch to the ‘prSummary’ function, which uses the area under the PR curve to select the best model.

MLeval (https://cran.r-project.org/web/packages/MLeval/index.html) makes curves and calculates metrics, and will automatically extract the best model parameters and data from the Caret results.

## load libraries required for analysis
library(MLeval)
library(caret)

## simulate data
im <- twoClassSim(2000, intercept = -25, linearVars = 20)
table(im$Class)

## run caret
fitControl <- trainControl(method = "cv",summaryFunction=prSummary,
classProbs=T,savePredictions = T,verboseIter = F)
im_fit <- train(Class ~ ., data = im,method = "ranger",metric = "AUC",
trControl = fitControl)
im_fit2 <- train(Class ~ ., data = im,method = "xgbTree",metric = "AUC",
trControl = fitControl)

## run MLeval
x <- evalm(list(im_fit,im_fit2))

## curves and metrics are in the 'x' list object

We can see in the below analyses that both models are above chance and xgbTree does slightly better than random forest. Pretty cool.

This is the PR curve; the grey line is the baseline chance expectation.

pr.png

This is the PR gain curve, where baseline is zero.

prg.png

How to easily make a ROC curve in R

A typical task in evaluating the results of machine learning models is making a ROC curve; this plot tells the analyst how well a model can discriminate one class from another. We developed MLeval (https://cran.r-project.org/web/packages/MLeval/index.html), an evaluation package for R, to make ROC curves, PR curves, PR gain curves, and calibration curves. The plots all use ggplot2, and the package also yields performance metrics such as Matthew’s correlation coefficient, specificity, and sensitivity, with confidence intervals.

MLeval aims to make life as simple as possible. It can be run directly on a data frame of predicted probabilities and ground truth labels, or on the output of the Caret ‘train’ function, which performs cross validation to avoid overfitting. It also makes it easy to compare different models. Let’s see an example.
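
A minimal sketch of the first mode with toy data; it assumes evalm is given one probability column per class (named to match the class labels) plus an 'obs' column of ground truth labels, mirroring how it is called in Part 6 above:

library(MLeval)
set.seed(1)
obs   <- factor(sample(c('A', 'B'), 50, replace = TRUE))
pA    <- ifelse(obs == 'A', rbeta(50, 4, 2), rbeta(50, 2, 4)) # toy class probabilities
probs <- data.frame(A = pA, B = 1 - pA, obs = obs)
res   <- evalm(probs, silent = TRUE, showplots = FALSE)
res$stdres # table of performance metrics with confidence intervals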

## load libraries required for analysis
library(MLeval)
library(caret)

Run Caret on the Sonar data with 3 different models, then evaluate by passing the results objects as a list into ‘evalm’.


## load data and run Caret
data(Sonar)
ctrl <- trainControl(method="cv", summaryFunction=twoClassSummary, classProbs=T,
                     savePredictions = T)
fit1 <- train(Class ~ .,data=Sonar,method="rf",trControl=ctrl)
ctrl <- trainControl(method="cv", summaryFunction=twoClassSummary, classProbs=T,
                     savePredictions = T)
fit2 <- train(Class ~ .,data=Sonar,method="gbm",trControl=ctrl)
ctrl <- trainControl(method="cv", summaryFunction=twoClassSummary, classProbs=T,
                     savePredictions = T)
fit3 <- train(Class ~ .,data=Sonar,method="nb",trControl=ctrl)

## run MLeval
res <- evalm(list(fit1,fit2,fit3),gnames=c('rf','gbm','nb'))

The results (metrics and plots) can be accessed through the list object 'evalm' produces. We can see below that random forest and gbm perform about the same, whereas naive bayes does not do as well, falling behind the others in the two discrimination tests (ROC and PRG). However, the calibration curves show that all models are quite well calibrated, demonstrating that good calibration does not always imply good discrimination.

The PRG curve standardises precision to the baseline, whereas the PR curve has a variable baseline, making it unsuitable for comparing between data with different class distributions. The PR plot will also change depending on which class is defined as positive, which is a deficiency of precision recall for tasks that are not extremely imbalanced. Credit card fraud is an example where positives << negatives and precision recall becomes more appropriate.

In the first two plots the analysis performed is the same: the probabilities are ranked from high to low, then a sensitivity analysis is performed over the probability cut-off used to define a positive. For each cut-off, the true positive rate and false positive rate are calculated and plotted in the case of the ROC; for PRG it is precision gain vs recall gain. In the last plot, we plot predicted vs real probabilities (in bins) and the aim is for them to match as closely as possible (grey diagonal line = perfect). See our vignette for more information (https://cran.r-project.org/web/packages/MLeval/vignettes/introduction.pdf). Code is hosted here (https://github.com/crj32/MLeval).
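
For the ROC, here is a minimal sketch of that cut-off sweep, written from the description above rather than taken from MLeval's source:

roc_points <- function(prob_pos, obs, positive = 'A') {
  thresholds <- sort(unique(prob_pos), decreasing = TRUE)
  t(sapply(thresholds, function(th) {
    pred <- prob_pos >= th # call a positive when the probability reaches the cut-off
    c(FPR = sum(pred & obs != positive) / sum(obs != positive),
      TPR = sum(pred & obs == positive) / sum(obs == positive))
  }))
}
## plotting TPR against FPR over all cut-offs traces out the ROC curve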

[Figures: ROC curves and PRG curves for the three models]

CC.png

Fast adaptive spectral clustering in R: brain cancer RNA-seq

Spectral clustering refers to a family of algorithms that cluster the eigenvectors derived from a matrix representing the input data’s graph. An important step in this method is the kernel function that is applied to the input data to generate an N x N similarity matrix or graph (where N is the number of input observations). Subsequent steps include computing the normalised graph Laplacian from this similarity matrix, getting the eigensystem of this graph, and lastly applying k-means to the top K eigenvectors to get the K clusters. Clustering in this way adds flexibility in the range of data that may be analysed, and spectral clustering will often outperform k-means. It is an excellent option for image and bioinformatic cluster analysis, including single platform and multi-omics.
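
For orientation, here is a minimal sketch of the textbook recipe just described, with a plain Gaussian kernel and a fixed bandwidth; it is not Spectrum's adaptive kernel or tuning, just the bare steps:

spectral_sketch <- function(X, K, sigma = 1) {
  A <- exp(-as.matrix(dist(X))^2 / (2 * sigma^2)) # kernel -> N x N similarity matrix
  diag(A) <- 0
  Dh <- diag(1 / sqrt(rowSums(A)))
  L  <- Dh %*% A %*% Dh                           # normalised graph (D^-1/2 A D^-1/2)
  U  <- eigen(L, symmetric = TRUE)$vectors[, 1:K] # top K eigenvectors
  U  <- U / sqrt(rowSums(U^2))                    # row normalise (Ng et al. style)
  kmeans(U, centers = K, nstart = 20)$cluster     # final clustering (Spectrum uses GMM here)
}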

‘Spectrum’ is a fast adaptive spectral clustering algorithm for R, programmed by myself at QMUL and David Watson at Oxford. It contains both novel methodological advances and implementations of pre-existing methods. Spectrum has the following features: 1) a new density-aware kernel that increases similarity between observations that share common nearest neighbours, 2) a tensor product graph data integration and noise reduction system, 3) the eigengap method to decide on the number of clusters, 4) Gaussian mixture modelling for the final clustering of the eigenvectors, 5) an implementation of fast approximate spectral clustering for very big datasets, and 6) data imputation for multi-view analyses. Spectrum has recently been published as an article in Bioinformatics. It is available to download from CRAN: https://cran.r-project.org/web/packages/Spectrum/index.html.

In this demonstration, we are going to use Spectrum to cluster brain cancer RNA-seq to find distinct patient groups with different survival times. This is the link, braincancer_test_data, to download the test data for the analysis. The data can also be accessed through Synapse (https://www.synapse.org/#!Synapse:syn18911542/files/).

## load libraries required for analysis
library(Spectrum)
library(plot3D)
library(survival)
library(survminer)
library(scales)

The next block of code runs Spectrum, then does a t-sne analysis to visualise the clusters embedded in a lower dimensional space.

I have found 3D t-SNE can work well for seeing the complex patterns of Gaussian clusters that can be found in omic data (as in this cancer data), compared with PCA, which is better for more straightforward structure and outlier detection.

## run Spectrum
r <- Spectrum(brain[[1]])
## do t-sne analysis of results
y <- Rtsne::Rtsne(t(brain[[1]]),dims=3) # 'dims' is the Rtsne argument for the output dimensionality
scatter3D(y$Y[,1],y$Y[,2],y$Y[,3], phi = 0, #bty = "g", ex = 2,
          ticktype = "detailed", colvar = r$assignments,
          col = gg.col(100), pch=20, cex=2, #type = 'h',
          xlab='Dim1', ylab='Dim2', zlab='Dim3')

Each eigenvector of the graph Laplacian numerically indicates the membership of an observation to a block (cluster) in the data's graph, and each eigenvalue represents how disconnected that cluster is from the other clusters. In the ideal case, if we had eigenvalues reading 1, 1, 1, 1, 0.2 in the below plot, that would tell us we have 4 completely disconnected clusters, followed by one that is much more connected (undesirable when trying to find separate clusters). This principle can be extended to look for the greatest gap in the eigenvalues to find the number of clusters (K).
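
A minimal sketch of that rule, using toy eigenvalues that echo the example above (assuming they are sorted in decreasing order, as in Spectrum's plot):

evals <- c(1, 1, 1, 1, 0.2, 0.15, 0.1) # toy eigenvalues of the graph Laplacian
gaps  <- -diff(evals)                  # drop between each successive pair of eigenvalues
which.max(gaps)                        # 4: the largest gap sits after the 4th eigenvalue, so K = 4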

This first plot, produced automatically by Spectrum for the brain cancer data, shows the eigenvalues for each eigenvector of the graph Laplacian. Here the biggest gap is between the 4th and 5th eigenvectors, corresponding to K=4.

egap_brain.png

This next plot shows the 3D t-sne of the brain cancer RNA-seq clusters. We can see clear separation of the groups. On this type of complex data, Spectrum tends to excel due to its prioritisation of local similarities and its ability to reduce noise by making the k nearest neighbour graph and diffusing on it.

brain_tsne_rnaseq.png

Next, we can run a survival analysis using a Cox proportional hazards model, with the log rank test used to test whether there is a difference between the survival times of the different groups (clusters).

## do test
clinicali <- brain[[2]]
clinicali$Death <- as.numeric(as.character(clinicali$Death))
coxFit <- coxph(Surv(time = Time, event = Death) ~ as.factor(r$assignments), data = clinicali, ties = "exact")
coxresults <- summary(coxFit)
print(coxresults$logtest[3])

## survival curve
survival_p <- coxresults$logtest[3]
fsize <- 18
clinicalj <- clinicali
if (0 %in% r$assignments){
  r$assignments <- r$assignments + 1 # shift cluster labels up if they are zero-based
}
clinicalj$cluster <- as.factor(r$assignments)
fit = survfit(Surv(Time,Death)~cluster, data=clinicalj)

## plot
gg <- ggsurvplot(fit, data = clinicalj, legend = 'right', font.x =  fsize, font.y =  fsize, font.tickslab = fsize,
                 palette = gg.col(4),
                 legend.title = 'Cluster', #3 palette = "Dark2",
                 ggtheme = theme_classic2(base_family = "Arial", base_size = fsize))
gg$plot + theme(legend.text = element_text(size = fsize, color = "black")) + theme_bw() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.text.y = element_text(size = fsize, colour = 'black'),
        axis.text.x = element_text(size = fsize, colour = 'black'),
        axis.title.x = element_text(size = fsize),
        axis.title.y = element_text(size = fsize),
        legend.title = element_text(size = fsize),
        legend.text = element_text(size = fsize))

This is the survival curve for the analysis. The code above also gives us the p value (4.03E-22) for the log rank test which is highly significant, suggesting Spectrum yields good results. More extensive comparisons are included in our manuscript.

brain_survival.png

Spectrum is effective at reducing noise on a single-view (like here) as well as multi-view because it performs a tensor product graph diffusion operation in either case, and it uses a k nearest neighbour graph. Multi-omic clustering examples are included in the package vignette. Spectrum will also be well suited to a broad range of different types of data beyond this demonstration given today.

The density adaptive kernel enhances the intra-cluster similarity which helps to improve the quality of the similarity matrix. A good quality similarity matrix (that represents the data’s graph) is key to the performance of spectral clustering, where intra-cluster similarity should be maximised and inter-cluster similarity minimised.

Spectrum is a fast new algorithm for spectral clustering in R. It is largely based on work by Zelnik-Manor, Ng, and Zhang, and includes implementations of pre-existing methods as well as new innovations. This is the second of the pair of clustering tools we developed for precision medicine, the first being M3C, a consensus clustering algorithm which is on the Bioconductor. In many ways, spectral clustering is a more elegant and powerful method than consensus clustering type algorithms, but both are useful to examine data from a different perspective.

Spectrum can be downloaded here: https://cran.r-project.org/web/packages/Spectrum/index.html

Source code is available here: https://github.com/crj32/Spectrum

Reference

Christopher R John, David Watson, Michael R Barnes, Costantino Pitzalis, Myles J Lewis, Spectrum: fast density-aware spectral clustering for single and multi-omic data, Bioinformatics, btz704, https://doi.org/10.1093/bioinformatics/btz704

Running UMAP for data visualisation in R

UMAP is a non-linear dimensionality reduction algorithm in the same family as t-SNE. In the first phase of UMAP a weighted k nearest neighbour graph is computed; in the second, a low dimensional layout of this graph is calculated. The embedded data points can then be visualised in the new space and compared with other variables of interest.

It can be used for the analysis of many types of data, including single cell RNA-seq and cancer omic data. One easy way to run UMAP on your data and visualise the results is to make a wrapper function that uses the umap R package and ggplot2. This is easy to do yourself, but in this post we are going to look at the one included in the M3C package (https://bioconductor.org/packages/devel/bioc/html/M3C.html). The code for this simple function can be found here (https://github.com/crj32/M3C/blob/master/R/umap.R).

M3C is a clustering algorithm for finding the number of clusters in a given dataset, but it also has several other useful functions that can be used more generally. I don’t use M3C for single cell RNA-seq clustering; it is better suited to other types of omic data, such as RNA-seq and standard omic platforms. However, its visualisation functions are useful for a number of different analyses I do. Otherwise I end up copying and pasting lengthy blocks of ggplot2 code again and again.

There have been two previous posts on the related PCA and t-SNE functions contained in the M3C package which both contain the same helpful parameters for extra labelling of the data points.

Let’s load some single cell RNA-seq data; this is the pollen_test_data we used in the last post when running t-SNE.

So this is just a basic UMAP plot. In this data the samples within clusters are very similar, so they appear very compact in these plots.

# load your omic data here as mydata
library(M3C)
umap(pollen$data,colvec=c('skyblue'))

UMAP.png

This is for labelling the sample points with our ground truth labels, i.e. cell types. The colour scale is set to be controlled manually, and here this uses some internal colours inside the function.

umap(pollen$data,labels=as.factor(pollen$celltypes),controlscale=TRUE,scale=3)

UMAPlabeled.png

This is for looking at an individual gene’s expression across the samples, so we are just labelling the points with a continuous variable, in this case GAGE4, because its variance is among the highest when standardised to the mean.

umap(pollen$data, labels=scale(as.numeric(pollen$data[row.names(pollen$data)=='GAGE4',])),
controlscale = TRUE,scale=2,legendtitle = 'GAGE4')

UMAPlabeled.png

Let’s just cluster this data with M3C and then display the clustering assignments on the UMAP plot.

r <- M3C(pollen$data,method=2)
umap(pollen$data,labels=as.factor(r$assignments),printres = TRUE,printwidth = 24)

UMAPlabeled.png

So that’s pretty neat, simple functions like this can be quite a lot of fun.

Let’s do one more plot: we will make up a continuous variable that could be a co-variate we are interested in and display that on the plot. I’ll show the code for printing the plot here, which is done internally, to simplify things.

Scale=1 means we use the spectral palette for colouring the data points. We have to set controlscale=TRUE to use custom colours; it is also possible to set ‘high’ and ‘low’ colours manually by changing the scale parameter to 2. Type ?umap for the relevant documentation.

# make up a continuous co-variate to display
var <- seq(1,ncol(pollen$data))
# do plot
umap(pollen$data,labels=var,controlscale=TRUE,scale=1,
legendtitle='var',printres = TRUE,printwidth = 24)

UMAPlabeled.png

Note that if you want to control the various parameters that UMAP uses internally, rather than doing a quick analysis like this, you’ll have to make your own function wrapping ggplot2 and umap (https://cran.r-project.org/web/packages/umap/index.html) or just run them as individual steps.

So, that brings us to the end of this little blog series on the M3C data visualisation functions. Hope you all enjoyed it!

Quick and easy t-SNE analysis in R

t-SNE is a useful nonlinear dimensionality reduction method that allows you to visualise data embedded in a lower number of dimensions, e.g. 2, in order to see patterns and trends in the data. It can deal with more complex non-linear patterns of Gaussian clusters in multidimensional space than PCA, so it is good for single cell RNA-seq analysis. It is not suited to finding outliers, though, because how the samples are arranged does not directly represent distance as in PCA; instead, t-SNE preserves local distance at the expense of global distance.

An easy way to run t-SNE on your data is to use a pre-made wrapper function that uses the Rtsne package and ggplot2, like the one that comes with the M3C package (https://bioconductor.org/packages/devel/bioc/html/M3C.html). You can also take this code from GitHub and remake it for your own specific needs if necessary (https://github.com/crj32/M3C); it is quite straightforward. The package also has the equivalent functions for PCA and UMAP.

Let’s load some single cell RNA-seq data and demonstrate this function.

This is the Pollen et al. single cell RNA-seq data, but you can use a different kind of omic data, or non omic data. Let’s just take a look at that data.

# load your omic data here as mydata
library(M3C)
tsne(pollen$data,colvec=c('gold'))

TSNE1.png

So, we can see this data is neatly grouped into clusters for us. Let’s compare these with the cell type labels. This is just using the default colours; for changing to custom colours we can use the ‘controlscale’, ‘scale’, and ‘colvec’ parameters. You can read about these in the documentation using ?tsne.

library(M3C)
tsne(pollen$data,labels=as.factor(pollen$celltypes))

TSNElabeled.png

So now we might want to display the expression of a specific gene on the t-SNE plot, which we can easily do with this function. The code below can be modified for any given gene; this is for gene ‘HBG1’, and we can change this as needed. It just pulls a numerical vector out of the gene expression matrix for us and z-score scales it for display.

I chose HBG1 because I had run the ‘featurefilter’ function from M3C, which collects statistics about all genes in the data frame, such as the coefficient of variation and its second order derivative. Use ?featurefilter to see more information. This gene was the most variable one in the data. Basically this function is for filtering the data according to variance and producing summary statistics per gene.
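
The underlying idea can also be sketched by hand (this is not the featurefilter API, just the calculation it is based on): compute the coefficient of variation per gene and rank the genes by it.

m  <- rowMeans(pollen$data)
s  <- apply(pollen$data, 1, sd)
cv <- s / m                       # coefficient of variation for each gene
head(sort(cv, decreasing = TRUE)) # the most variable genes sit at the top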

Here we choose to control the colour scale; setting this to 2 means low is grey and high is red on the scale. However, the ‘low’ and ‘high’ colours may be user specified.

tsne(pollen$data,
labels=scale(as.numeric(pollen$data[row.names(pollen$data)=='HBG1',])),
controlscale = TRUE,scale=2)

TSNElabeled.png

Lastly, I want to find out which cell type the HBG1 expression is coming from. We can do this using the PCA function, because the dots are too close together to see the text using t-SNE. To do this we have to add a text variable to the pca function.

pca(pollen$data,labels=scale(as.numeric(pollen$data[row.names(pollen$data)=='HBG1',])),
controlscale=TRUE,scale=2,text=pollen$celltypes,
textlabelsize = 2)

PCAlabeled.png

So, basically we can zoom in on the printed image and see the red cluster is all the K562 cell type. The text parameter is primarily for finding outliers from PCA, but it can be used in other ways too.

M3C is not for clustering single cell RNA-seq data, because of the high complexity of the algorithm and the type of consensus clustering it does, but we can use its functions for other purposes anyway. It is good for clustering cancer single-omic data, for example. It is essentially an enhanced version of the Monti consensus clustering algorithm, originally implemented in the ConsensusClusterPlus package.

For an algorithm that can handle single cell RNA-seq better, as well as a wide range of omic and non omic data, Spectrum is a good option (https://cran.r-project.org/web/packages/Spectrum/index.html). It is faster and in many ways more sophisticated than the consensus clustering approach. Also good for single cell RNA-seq clustering are SC3 and Seurat.

 

Easy quick PCA analysis in R

Principal component analysis (PCA) is very useful for doing some basic quality control (e.g. looking for batch effects) and for assessing how the data is distributed (e.g. finding outliers). A straightforward way is to make your own wrapper function for prcomp and ggplot2; another way is to use the one that comes with M3C (https://bioconductor.org/packages/devel/bioc/html/M3C.html) or another package. M3C is a consensus clustering tool that makes some major modifications to the Monti et al. (2003) algorithm so that it behaves better, and it also provides functions for data visualisation.

Let’s have a go on an example cancer microarray dataset.

# M3C loads an example dataset mydata with metadata desx
library(M3C)
# do PCA
pca(mydata,colvec=c('gold'),printres=TRUE)

PCApriorclustering.png

So, what prcomp has done is extract the eigenvectors of the data’s covariance matrix, then project the original data samples onto them via linear combinations. This yields the PC scores, which are plotted on PC1 and PC2 here (eigenvectors 1 and 2). The eigenvectors are sorted by the variance they explain, and these first two contain the highest variation in the data, but it might be a good idea to look at additional PCs, which is beyond this simple blog post and function.
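
Under the hood this is straightforward to reproduce; a minimal sketch, assuming mydata has features in rows and samples in columns as in the M3C example data (the manual eigendecomposition matches prcomp up to the sign of each component):

pr     <- prcomp(t(mydata)) # prcomp expects samples in rows, so transpose
scores <- pr$x[, 1:2]       # PC scores: the data projected onto the first two eigenvectors

## the same projection via the covariance matrix directly
X       <- scale(t(mydata), center = TRUE, scale = FALSE)
ev      <- eigen(cov(X))
scores2 <- X %*% ev$vectors[, 1:2] # equals pr$x[, 1:2] up to sign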

You can see above there are no obvious outliers for removal here, which is good for cluster analysis. However, were there outliers, we can label all samples with their names using the ‘text’ parameter.

library(M3C)
pca(mydata,colvec=c('gold'),printres=TRUE,text=colnames(mydata))

PCApriorclustering.png

Now other objectives would be comparing samples with batch to make sure we do not have batch effects driving the variation in our data, and comparing with other variables such as gender and age. Since the metadata only contains tumour class we are going to use that next to see how it is related to these PC scores.

This is a categorical variable, so the ‘scale’ parameter should be set to 3, ‘controlscale’ is set to TRUE because I want to choose the colours, and ‘labels’ parameter passes the metadata tumour class into the function. I am increasing the ‘printwidth’ from its default value because the legend is quite wide.

For more information see the function documentation using ?pca.

# first reformat meta to replace NA with Unknown
desx$class <- as.character(desx$class)
desx$class[is.na(desx$class)] <- 'Unknown'
# next do the plot
pca(mydata,colvec=c('gold','skyblue','tomato','midnightblue','grey'),
labels=desx$class,controlscale=TRUE,scale=3,printres=TRUE,printwidth=25)

PCAlabeled.png

So, now it appears that the variation that governs these two PCs is indeed related to tumour class, which is good. But what if the variable is continuous, and we wanted to compare read mapping rate, read duplication percentage, or age with our data? A simple change of the parameters allows this too. Let’s make up a continuous variable, then add it to the plot. In this case we change the ‘scale’ parameter to reflect that we are using continuous data, and the spectral palette is used for the colours by default.

randomvec <- seq(1,50)
pca(mydata,labels=randomvec,controlscale=TRUE,scale=1,legendtitle='Random',
printres=TRUE,printwidth=25)

PCAlabeled.png

So, since this is a random variable, we can see it has no relationship with our data. Let’s just define a custom scale now. So we change ‘scale’ to 2, then use the ‘low’ and ‘high’ parameters to define the scale colour range.

pca(mydata,labels=randomvec,controlscale=TRUE,scale=2,legendtitle='Random',
printres=TRUE,printwidth=25,low='grey',high='red')

PCAlabeled.png

Super easy, yet pretty cool. The idea here is just to minimise the amount of code hanging around for doing basic analyses like these. You can rip the code from here: https://github.com/crj32/M3C/blob/master/R/pca.R. Remember if your features are on different scales, the data needs transforming to be comparable beforehand.

M3C is part of a series of cluster analysis tools that we are releasing, the next is ‘Spectrum’, a fast adaptive spectral clustering tool, which is already on CRAN (https://cran.r-project.org/web/packages/Spectrum/index.html). Perhaps there will be time to blog about that in the future.

For quantitative analysis of the drivers of variation in the data, I recommend checking out David Watson’s great function in the ‘bioplotr’ package called ‘plot_drivers’ (https://github.com/dswatson/bioplotr). This compares the PCs with categorical and continuous variables and performs univariate statistical analysis on them, producing a very nice plot.

Using clusterlab to benchmark clustering algorithms

Clusterlab is a CRAN package (https://cran.r-project.org/web/packages/clusterlab/index.html) for the routine testing of clustering algorithms. It can simulate positive controls (datasets with >1 clusters) and negative controls (datasets with 1 cluster). Why test clustering algorithms? Because they often fail to identify the true K in practice, published algorithms are not always well tested, and we need to know about ones that have strange behaviour. I’ve found in many of my own experiments that the algorithms many people are using are not necessarily the ones that provide the most sensible results. I can give a good example below.
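
As a minimal sketch of the negative control idea (the positive control, clusterlab(centers=5), appears in Experiment 1 below):

library(clusterlab)
k1 <- clusterlab(centers = 1) # null dataset: a single cluster
## feed k1$synthetic_data into the clustering algorithm under test and check
## whether it wrongly reports K > 1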

I was interested to see that clusterExperiment, a relatively new clustering package on Bioconductor, cannot detect the ground truth K in my testing so far, instead yielding solutions with many more clusters than there are in reality. On the other hand, the package I developed with David Watson here at QMUL does work rather well. It is a derivative of the Monti et al. (2003) consensus clustering algorithm, fitted with a Monte Carlo reference procedure to eliminate overfitting. We called this algorithm M3C.

library(clusterExperiment)
library(clusterlab)
library(M3C)

Experiment 1: Simulate a positive control dataset (K=5)

PCAlabeled_real.png

k5 <- clusterlab(centers=5)

Experiment 2: Test RSEC (https://bioconductor.org/packages/release/bioc/html/clusterExperiment.html)

RSEC_test.png

rsec_test <- RSEC(as.matrix(k5$synthetic_data),ncores=10)
assignments <- primaryCluster(rsec_test)
pca(k5$synthetic_data,labels=assignments)

Experiment 3: Test M3C (https://bioconductor.org/packages/release/bioc/html/M3C.html)

PCAlabeled_M3C.png

M3C_test <- M3C(k5$synthetic_data,iters=10,cores=10)
optk <- which.max(M3C_test$scores$RCSI)+1
pca(M3C_test$realdataresults[[optk]]$ordered_data,
labels=M3C_test$realdataresults[[optk]]$ordered_annotation$consensuscluster,printres=TRUE)

Interesting, isn’t it, R readers?

Well all I can say is I recommend comparing different machine learning methods and if in doubt, run your own control data-sets (positive and negative controls) to test various methods. In the other post we showed a remarkable bias in optimism corrected bootstrapping compared with LOOCV under certain conditions, simply by simulating null data-sets and passing them into the method.

If you are clustering omic data from a single platform (RNA-seq, protein, methylation, etc.) I have tested and recommend the following packages:

CLEST: https://www.rdocumentation.org/packages/RSKC/versions/2.4.2/topics/Clest
M3C: https://bioconductor.org/packages/release/bioc/html/M3C.html

And to be honest, that is about it. I have also tested PINSplus, GAP-statistic, and SNF, but they did not provide satisfactory results in my experiments on single platform clustering (currently unpublished data). Multi-omic and single cell RNA-seq analysis is another story; there will be more on that to come in the future, R readers.

Remember there is a darkness out there R readers, not just in Washington, but in the world of statistics. It is there because of the positive results bias in science, because of people not checking methods and comparing them with one another, and because of several other reasons I can’t even be bothered to mention.

 

Part 5: Code corrections to optimism corrected bootstrapping series

The truth is out there R readers, but often it is not what we have been led to believe. The previous post examined the strong positive results bias in optimism corrected bootstrapping (a method of assessing a machine learning model’s predictive power) with increasing p (completely random features). This bias is real and is affecting publications. However, of the 2 implementations of the method given in the previous post, the first had a slight error and the second seems fine. The trend is still the same with the corrected code; the problem with the earlier code is that I did not set ‘replace=TRUE’ in the call to the ‘sample’ function. The problem inherent to the ‘optimism corrected bootstrapping’ method is more about estimating error using the same samples for training and testing than about how we resample the data, so this change does not make a large difference to the results. We will go into a little more detail about bootstrapping in this article and repeat the analyses with the corrected code. Once again, you are welcome to re-run the code written here and also test using the independent Caret implementation shown on the other page.

For the uninformed reader, I have also shown the problem using Caret (https://intobioinformatics.wordpress.com/2018/12/25/optimism-corrected-bootstrapping-a-problematic-method/), and where it originates from in the method here (it is obvious with a bit of statistical knowledge) (https://intobioinformatics.wordpress.com/2018/12/28/part-4-more-bias-and-why-does-bias-occur-in-optimism-corrected-bootstrapping/). It is a very simple test I have done (using simulated null data-sets with increasing features) and shows the problem with the method very clearly.

Thanks to ECOQUANT for pointing out to me that the ‘sample’ function should have been called with the additional replace=TRUE parameter.

Let’s just recap what bootstrapping is and what optimism corrected bootstrapping is before we redo the experiments:

This is from Jason’s excellent blog (https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/), bootstrapping is:

  1. Choose a number of bootstrap samples to perform
  2. Choose a sample size
  3. For each bootstrap sample (b=1 … B)
    1. Draw a sample with replacement with the chosen size
    2. Calculate the statistic on the sample
  4. Calculate the mean of the calculated sample statistics.

The ‘with replacement’ part means we put each individual sample back before drawing the next one when building the bth bootstrap sample. Thus, we usually have duplicate samples in our bootstrap sample of the data.
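
A tiny illustration of what drawing with replacement looks like in R:

set.seed(1)
b <- sample(1:10, size = 10, replace = TRUE)
sort(b) # some indices appear more than once, others not at all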

This is the optimism corrected bootstrapping algorithm:

  1. Fit a model M to entire data S and estimate predictive ability C.
  2. Iterate from b=1…B:
    1. Take a resample from the original data, S*
    2. Fit the bootstrap model M* to S* and get predictive ability, C_boot
    3. Use the bootstrap model M* to get predictive ability on S, C_orig
  3. Optimism O is calculated as mean(C_boot – C_orig)
  4. Calculate optimism corrected performance as C-O.

Since we use the same data in step 3 of the bootstrap to train and test the model (an information leak), we would expect increasing bias (C_orig should be too high, thus O too small) as more and more random features are added. See the previous post for more explanation of this. Another point is that the optimism corrected bootstrap is usually done with a resample size of N, instead of just a fraction of N. I found the following quote to support this:

“The small data set was repeatedly re-sampled to produce b replicated data sets, each the same size as the original. We used b = 200. The predictive model was fitted to each of the b replicated data sets in turn. Each fitted model was then applied both to the resampled data set from which it was generated and to the original data set.”

Smith, Gordon CS, et al. “Correcting for optimistic prediction in small data sets.” American journal of epidemiology 180.3 (2014): 318-324.

I have tried reducing the re-sampling size to a fraction of N, which reduces the bias somewhat, but it is still there. This makes sense given the information leak in this method, which results in an underestimation of the optimism (O).
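
For reference, a minimal sketch of that change (0.7 is an arbitrary fraction chosen for illustration; the code below keeps the full N of 100):

N    <- 100
frac <- 0.7
idx  <- sample(seq_len(N), size = round(frac * N), replace = TRUE)
length(idx) # 70 bootstrap indices instead of 100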

You’re welcome to experiment with this code yourselves. If you are thinking of using this method, I recommend first simulating null datasets with the same number of dimensions to check how biased your AUC/model performance will be. When we have high numbers of features, using this method is clearly a serious mistake.

This code can be directly copied and pasted into R to repeat the experiments.

Experiment 1: my implementation – glmnet (lasso logistic regression)


library(glmnet)
library(pROC)
library(caret)
library(ggplot2)
library(kimisc)

### TEST 1: bootstrap optimism with glmnet
cc <- c()
for (zz in seq(2,100,1)){
print(zz)
## set up test data
test <- matrix(rnorm(100*zz, mean = 0, sd = 1),
nrow = 100, ncol = zz, byrow = TRUE)
labelsa <- as.factor(c(rep('A',50),rep('B',50)))
colnames(test) <- paste('Y',seq(1,zz),sep='')
row.names(test) <- paste('Z',seq(1,100),sep='')

### my own implementation of optimism corrected bootstrapping

## 1. fit model to entire test data (i.e. overfit it)

orig <- glmnet(test,y=labelsa,alpha=1,family = "binomial")
preds <- predict(orig,newx=test,type='response',s=0.01)
auc <- roc(labelsa,as.vector(preds))
original_auc <- as.numeric(auc$auc)

## 2. take resample of data and try to estimate optimism

test2 <- cbind(test,labelsa)
B <- 50
results <- matrix(ncol=2,nrow=B)
for (b in seq(1,B)){
## get the bootstrapped resampled data
boot <- test2[sample(row.names(test2),100,replace=TRUE),]
labelsb <- boot[,ncol(boot)]
boot <- boot[,-ncol(boot)]
## use the bootstrapped model to predict its own labels
bootf <- glmnet(boot,y=labelsb,alpha=1,family = "binomial")
preds <- predict(bootf,newx=boot,type='response',s=0.01)
auc <- roc(labelsb,as.vector(preds))
boot_auc <- as.numeric(auc$auc)
## use bootstrap model to predict original labels
preds <- predict(bootf,newx=test,type='response',s=0.01)
auc <- roc(labelsa,as.vector(preds))
boot_original_auc <- as.numeric(auc$auc)
## add the data to results
results[b,1] <- boot_auc
results[b,2] <- boot_original_auc
}
## calculate the optimism
O <- mean(results[,1]-results[,2])
## calculate optimism corrected measure of prediction (AUC)
corrected <- original_auc-O
##
cc <- c(cc,corrected)
print(cc)
}

## print it
df <- data.frame(p=seq(2,100,1),optimism_corrected_boot_AUC=cc)
p1 <- ggplot(df, aes(x=p, y=optimism_corrected_boot_AUC)) +
geom_line() + ggtitle('glmnet - random data only gives positive result with optimism corrected bootstrap')
print(p1)
png('glmnet_test_upto100.png', height = 15, width = 27, units = 'cm',
res = 900, type = 'cairo')
print(p1)
dev.off()

Here are the results with 100 samples and 50 bootstrap iterations from 2 to 100 random features from a Gaussian distribution. We are re-sampling using the original sample size (N=100).

Random features are being added iteratively on the X axis, and on the Y, we have AUC. The AUC should be 0.5 to reflect the data has no real predictive power, but it is highly inflated.

glmnet_test_upto100

Experiment 2: another implementation – glm (logistic regression)


## TEST2
auc.adjust <- function(data, fit, B){
fit.model <- fit
data$pred.prob <- fitted(fit.model)
# get overfitted AUC
auc.app <- roc(data[,1], data$pred.prob, data=data)$auc # require 'pROC'
auc.boot <- vector (mode = "numeric", length = B)
auc.orig <- vector (mode = "numeric", length = B)
o <- vector (mode = "numeric", length = B)
for(i in 1:B){
boot.sample <- sample.rows(data, nrow(data), replace=TRUE) # require 'kimisc'
fit.boot <- glm(formula(fit.model), data = boot.sample, family = "binomial")
boot.sample$pred.prob <- fitted(fit.boot)
# get bootstrapped AUC
auc.boot[i] <- roc(boot.sample[,1], boot.sample$pred.prob, data=boot.sample)$auc
# get original data boot AUC
data$pred.prob.back <- predict.glm(fit.boot, newdata=data, type="response")
auc.orig[i] <- roc(data[,1], data$pred.prob.back, data=data)$auc
# calculated optimism corrected version
o[i] <- auc.boot[i] - auc.orig[i]
}
auc.adj <- auc.app - (sum(o)/B)
return(auc.adj)
}
cc <- c()
for (zz in seq(2,100,1)){
print(zz)
## set up test data
test <- matrix(rnorm(100*zz, mean = 0, sd = 1),
nrow = 100, ncol = zz, byrow = TRUE)
labelsa <- as.factor(c(rep('A',50),rep('B',50)))
colnames(test) <- paste('Y',seq(1,zz),sep='')
row.names(test) <- paste('Z',seq(1,100),sep='')
test2 <- data.frame(cbind(labelsa,test))
test2$labelsa <- as.factor(test2$labelsa)
## 1. make overfitted model
model <- glm(labelsa ~., data = test2, family = "binomial")
## 2. estimate optimism and correct
d <- auc.adjust(test2, model, B=200)
cc <- c(cc,d)
}
## print it
df <- data.frame(p=seq(2,100,1),optimism_corrected_boot_AUC=cc)
p2 <- ggplot(df, aes(x=p, y=optimism_corrected_boot_AUC)) +
geom_line() + ggtitle('glm - random data only gives positive result with optimism corrected bootstrap')
print(p2)
png('glmt_test_upto100.png', height = 15, width = 27, units = 'cm',
res = 900, type = 'cairo')
print(p2)
dev.off()

glmt_test_upto100.png