Part 6: How not to validate your model with optimism corrected bootstrapping

When evaluating a machine learning model if the same data is used to train and test the model this results in overfitting. So the model performs much better in predictive ability  than it would if it was applied on completely new data, this is because the model uses random noise within the data to learn from and make predictions. However, new data will have different noise and so it is hard for the overfitted model to predict accurately just from noise on data it has not seen.

Keeping the training and test datasets completely seperate in machine learning from feature selection to fitting a model mitigates this problem, this is the basis for cross validation which iteratively splits the data into different chunks and iterates over them to avoid the problem. Normally, we use 100 times repeated 10 fold cross validation for medium to large datasets and leave-one-out cross validation (LOOCV) for small datasets.

There is another technique used to correct for the “optimism” resulting from fitting a model to the same data used to test it on, this is called optimism corrected bootstrapping. However, it is based on fundamentally unsound statistical principles that can introduce major bias into the results under certain conditions.

This is the optimism corrected bootstrapping method:

  1. Fit a model M to entire data S and estimate predictive ability C.
  2. Iterate from b=1…B:
    1. Take a bootstrap sample from the original data, S*
    2. Fit the bootstrap model M* to S* and get predictive ability, C_boot
    3. Use the bootstrap model M* to get predictive ability on S, C_orig
  3. Optimism O is calculated as mean(C_boot – C_orig)
  4. Calculate optimism corrected performance as C-O.

As I have stated previously, because the same data is used for training and testing in step 3 of the bootstrapping loop, there is an information leak here. Because the data used to make M* includes many of the same data that is used to test it, when applied on S. This can lead to C_orig being too high. Thus, the optimism (O) is expected to be too low, leading to the corrected performance from step 4 being too high. This can be a major problem.

The strength of this bias for this method strongly depends on: 1) the number of variables used, 2) the number of samples used, and, 3) the machine learning model used.

I can show the bias using glmnet and random forest in these new experiments. This code is different than in Part 5 because I am changing the ROC calculation to use my own evaluation package (https://cran.r-project.org/web/packages/MLeval/index.html), and also examining random forest instead of just logistic regression.

I am going to use a low N high p simulation, because that is what I am used to working with. N is the number of observations and p the number of features. Such dimensionalities are very common in the health care and bioinformatics industry and academia, using this method can lead to highly erroneous results, some of which end up in publications.

I am simulating null datasets here with random features sampled from a Gaussian distribution, so we should not expect any predictive ability thus the area under the ROC curve should be about 0.5 (chance expectation).

Experiment 1: my implementation.

N=100 and p=5 to 500 by 50. Random forest can be selected by using ‘ranger’ or lasso by using ‘glmnet’.

## my implementation optimism corrected bootstrapping

library(ggplot2)
library(MLeval)
library(ranger)
library(glmnet)

N <- 100
alg <- 'ranger'

## loop over increasing number features
cc <- c()
for (zz in seq(5,500,50)){

print(zz)

## set up test data
test <- matrix(rnorm(N*zz, mean = 0, sd = 1),
nrow = N, ncol = zz, byrow = TRUE)
labelsa <- as.factor(sample(c(rep('A',N/2),rep('B',N/2))))
colnames(test) <- paste('Y',seq(1,zz),sep='')
row.names(test) <- paste('Z',seq(1,N),sep='')
test <- data.frame(test)
test[,zz+1] <- labelsa
colnames(test)[zz+1] <- 'XXX'
print(dim(test))

## 1. fit model to entire data and predict labels on same data
if (alg == 'glmnet'){
#orig <- glm(XXX~.,data=test,family='binomial')
X <- test[,1:zz]
Y <- labelsa
orig <- glmnet(as.matrix(X),Y,family='binomial')
preds <- predict(orig,newx=as.matrix(test[,-(zz+1)]),type='response',s=0.01)
}else{
orig <- ranger(XXX~.,data=test,probability=TRUE)
preds <- predict(orig,data=test[,-(zz+1)],type='response')
}

## 2. use MLeval to get auc
if (alg == 'ranger'){
preds <- data.frame(preds$predictions)
}else{
preds <- data.frame(B=preds[,1],A=1-preds[,1])
}
preds$obs <- labelsa
x <- evalm(preds,silent=TRUE,showplots=FALSE)
x <- x$stdres
auc <- x$Group1[13,1]
original_auc <- auc
print(paste('original:',auc))

## 3. bootstrapping to estimate optimism
B <- 50
results <- matrix(ncol=2,nrow=B)

for (b in seq(1,B)){

# get the bootstrapped data
boot <- test[sample(row.names(test),N,replace=TRUE),]
labelsb <- boot[,ncol(boot)]

# use the bootstrapped model to predict bootstrapped data labels
if (alg == 'ranger'){
bootf <- ranger(XXX~.,data=boot,probability=TRUE)
preds <- predict(bootf,data=boot[,-(zz+1)],type='response')
}else{
#bootf <- glm(XXX~.,data=boot,family='binomial')
X <- boot[,1:zz]
Y <- labelsb
bootf <- glmnet(as.matrix(X),Y,family='binomial')
preds <- predict(bootf,newx=as.matrix(boot[,-(zz+1)]),type='response',s=0.01)
}

# get auc of boot on boot
if (alg == 'ranger'){
preds <- data.frame(preds$predictions)
}else{
preds <- data.frame(B=preds[,1],A=1-preds[,1])
}
preds$obs <- labelsb
x <- evalm(preds,silent=TRUE,showplots=FALSE)
x <- x$stdres
boot_auc <- x$Group1[13,1]

# boot on test/ entire data
if (alg == 'ranger'){
## use bootstrap model to predict original labels
preds <- predict(bootf,data=test[,-(zz+1)],type='response')
}else{
preds <- predict(bootf,newx=as.matrix(test[,-(zz+1)]),type='response',s=0.01)
}

# get auc
if (alg == 'ranger'){
preds <- data.frame(preds$predictions)
}else{
preds <- data.frame(B=preds[,1],A=1-preds[,1])
}
preds$obs <- labelsa
x <- evalm(preds,silent=TRUE,showplots=FALSE)
x <- x$stdres
boot_original_auc <- x$Group1[13,1]
#

## add the data to results
results[b,1] <- boot_auc
results[b,2] <- boot_original_auc
}

## calculate the optimism
O <- mean(results[,1]-results[,2])

## calculate optimism corrected measure of prediction (AUC)
corrected <- original_auc-O

## append
cc <- c(cc,corrected)
print(cc)
}

my_imp_done

We can see that even with random features, optimism corrected bootstrapping can give very inflated performance metrics. Random data should give an AUC of about 0.5, this experiment shows the value of simulating null datasets and testing your machine learning pipeline with data with similar characteristics to that you will be using.

Below is code and results from the Caret implementation of this method (https://cran.r-project.org/web/packages/caret/index.html). Note, that the results are very similar to mine.

Caret will tune over a range of lambda for lasso by default (results below), whereas above I just selected a single value of this parameter. The bias for this method is much worse for models such as random forest and gbm, I suspect because they use decision trees which are non-linear so can more easily make accurate models out of noise if tested on the same data used to train with.

You can experiment easily with different types of cross validation and bootstrapping using null datasets with Caret. I invite you to do this, and if you change the cross validation method below, to, e.g. LOOCV or repeatedcv, you will see, these methods are not subject to this major overly optimistic results bias. They have their own issues, but I have found optimism corrected bootstrapping can give completely the wrong conclusion very easily, it seems not everyone knows of this problem before they are using it.

I recommend using Caret for machine learning. The model type and cross validation method can be changed simply without much effort. If ranger is changed to gbm in the below code we get a similar bias to using ranger.

Experiment 2: Caret implementation.

N=100 and p=5 to 500 by 50. Random forest can be selected by using ‘ranger’ or lasso by using ‘glmnet’.

## caret optimism corrected bootstrapping test

library(caret)

cc <- c()
i = 0
N <- 100

## add features iterate
for (zz in seq(5,500,50)){

i = i + 1

# simulate data
test <- matrix(rnorm(N*zz, mean = 0, sd = 1),
nrow = N, ncol = zz, byrow = TRUE)
labelsa <- as.factor(sample(c(rep('A',N/2),rep('B',N/2))))
colnames(test) <- paste('Y',seq(1,zz),sep='')
row.names(test) <- paste('Z',seq(1,N),sep='')
test <- data.frame(test)
test[,zz+1] <- labelsa
colnames(test)[zz+1] <- 'XXX'

# optimism corrected bootstrapping with caret
ctrl <- trainControl(method = 'optimism_boot',
summaryFunction=twoClassSummary,
classProbs=T,
savePredictions = T,
verboseIter = F)
fit4 <- train(as.formula( paste( 'XXX', '~', '.' ) ), data=test,
method="ranger", # preProc=c("center", "scale")
trControl=ctrl, metric = "ROC") #

cc <- c(cc, max(fit4$results$ROC))
print(max(fit4$results$ROC))
}

caret_imp

Code and png files zipped here: optimism_boot_experiment.

Further reading:

https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/

Consensus clustering in R

The logic behind the Monti consensus clustering algorithm is that in the face of resampling the ideal clusters should be stable, thus any pair of samples should either always or never cluster together. We can use this principle to infer the optimal number of clusters (K). This works by examining cluster stability from K=2 to K=10 during resampling. To do this for every K, we calculate the consensus rates for all sample pairs, which is the fraction of times a pair of samples cluster together. This gives us a consensus matrix for every K, which is symmetrical and in the range 0 to 1. A matrix entirely of 1s and 0s will represent perfect stability for a given K. We can simply compare stability of these matrices from K=2 to K=10 to determine the optimal K.

This is a very nice concept and the consensus matrix gives us a helpful visualisation tool. However, a major problem with this approach is that on random data the consensus matrix converges towards a matrix of perfect stability simply by chance. A second problem, is one general to cluster analysis, and is that we do not test the null hypothesis K=1. We developed M3C to correct for the internal bias of consensus clustering by using a Monte Carlo simulation, M3C can also test the null hypothesis K=1.

We have recently had the M3C paper published (https://www.nature.com/articles/s41598-020-58766-1), and this blog post is to demonstrate a new objective function included in the M3C algorithm (https://bioconductor.org/packages/devel/bioc/html/M3C.html) which did not get the chance the get into the paper. I’m going to do this without math notation because this is pretty informal.

Essentially we are in fact dealing with probabilities in the consensus matrix, where each element is the probability of samples i and j clustering together, given they were resampled together. Seeing the consensus matrix in this way, naturally leads to entropy as a very appropiate function to describe the stability (or uncertainty) of each consensus matrix. We simply can calculate entropy (binary information entropy) of each matrix and minimise it to determine the best K. Entropy, in this context, is the degree of surprisal, and we want the minimum surprise, if these are indeed good clusters.

Let’s do some analyses with the upgraded version of M3C on some glioblastoma cancer data (GBM).

## load package and test data
library(M3C)
dim(mydata) # we have 1740 features and 50 samples (gbm data)

## test code
test <- M3C(mydata)

Below is the CDF plot for every consensus matrix from 2 … 10, and it describes the distribution of numerical values in each matrix. A flat line would indicate all 1s and 0s. We can see a problem here because as K increases the line is converging towards a matrix of perfect stability. This also happens on random data with one cluster.

I think the CDF is a nice visualisation tool, but entropy does not require a CDF to work. Some metrics like the PAC score and delta k are calculated from the CDF, but I think it is more parsimonious to work directly with the consensus matrix probabilities. None of these methods correct for the convergence at chance problem of this method, for that we need to do Monte Carlo.

CDF

This next plot is entropy for every consensus matrix from 2 … 10.

Again, we can see a problem here because, although there is an elbow at K=4, the most certain or stable consensus matrix is for K=10. This is because even on random data, the algorithm generates consensus matrices that become more stable as K increases (see our paper for more details on this).

S

Remember, information entropy is describing uncertainty in the system, for the best clustering we want to minimise uncertainty (or surprisal) as much as possible despite perturbation. However, the random expectation must be taken into account.

The next plot is the RCSI, which corrects entropy for the chance expectation using a reference distribution derived using a Monte Carlo simulation and calculates 95% confidence intervals. We can see the bias has been corrected, and now we are just looking for a maximum value instead of elbows or last values before the floor, or other subjective approaches. This is nice.

RCSI

Four clusters is well supported by the histological data in which there are four subtypes for GBM, so I am satisfied with the decision made here. M3C also calculates p values for each K to test the null hypothesis K=1, see our vignette for more details.

Here are some general pointers on M3C and cluster analysis on Omic data.

  • When deciding K using M3C it is best to examine entropy, the RCSI, and the p values.
  • For more accurate p values, on some datasets, it is good to increase the Monte Carlo simulation number, the Monte Carlo simulation gives us the null distribution. Default is 25, this could be increased to 100.
  • Inner replications are set to 100, this is usually fine, however, on some datasets it is better to increase this to 250.
  • It is preferable to run SigClust on all pairs of clusters to check the significance.
  • It is preferable to run another method to support the K decision, we have found CLEST works well, a classic algorithm with some similarities to consensus clustering. This is implemented in the RSKC package.
  • The silhouette width does not work well on high dimensional cancer data in our experience and that of Șenbabaoğlu described in the 2014 paper. A possible alternative is to run t-SNE first before the silhouette width on the data with reduced dimensionality. Estimating K on complex Omic data is a hard problem to do well because of the high dimensionality and noise.

For demonstration of M3C on real TCGA data, see the supplementary information of this paper, in Table 1. The p value method was used in select K when running M3C here.

https://doi.org/10.1093/bioinformatics/btz704

For demonstration of M3C on simulated data and more TCGA data, this is the main paper that includes the method, but we used the PAC score instead of entropy as the objective function to minimise to find the best K. We still include the PAC score in the Bioconductor software because it is pretty good.

https://doi.org/10.1038/s41598-020-58766-1

Thanks for reading.

 

 

 

How to make a precision recall curve in R

Precision recall (PR) curves are useful for machine learning model evaluation when there is an extreme imbalance in the data and the analyst is interested particuarly in one class. A good example is credit card fraud, where the instances of fraud are extremely few compared with non fraud. Here are some facts about PR curves.

  • PR curves are sensitive to which class is defined as positive, unlike the ROC curve which is a more balanced method. So we will get a completely different result depending on which class is set as positive.
  • PR curves are sensitive to class distribution, so if the ratio of positives to negatives changes across different analyses, then the PR curve cannot be used to compare performance between them. This is because the chance line varies between datasets with different class distributions.
  • PR curves, because they use precision, instead of specificity (like ROC) can pick up false positives in the predicted positive fraction. This is very helpful when negatives >> positives. In these cases, the ROC is pretty insensitive and can be misleading, whereas PR curves reign supreme.
  • The area under the PR curve does not have a probabilistic interpretation like ROC.

The PR gain curve was made to deal with some of the above problems with PR curves, although it still is intended for extreme class imbalance situations. The main difference is the PR gain curve has a universal baseline as precision is corrected according to chance expectation. We will see this in an example.

Note, we are using non repeated cross validation with Caret to save time, but it requires repeating generally to reduce the variance and bias. Also, for imbalanced data in Caret we need to change to the ‘prSummary’ function that uses area under the PR curve to select the best model.

MLeval (https://cran.r-project.org/web/packages/MLeval/index.html) makes curves and calculates metrics, and will automatically extract the best model parameters and data from the Caret results.

## load libraries required for analysis
library(MLeval)
library(caret)

## simulate data
im <- twoClassSim(2000, intercept = -25, linearVars = 20)
table(im$Class)

## run caret
fitControl <- trainControl(method = "cv",summaryFunction=prSummary,
classProbs=T,savePredictions = T,verboseIter = F)
im_fit <- train(Class ~ ., data = im,method = "ranger",metric = "AUC",
trControl = fitControl)
im_fit2 <- train(Class ~ ., data = im,method = "xgbTree",metric = "AUC",
trControl = fitControl)

## run MLeval
x <- evalm(list(im_fit,im_fit2))

## curves and metrics are in the 'x' list object

We can see in the below analyses, both models are above chance and xgbTree does slightly better than random forest. Pretty cool.

This is the PR curve, see the grey line is baseline chance expectation.

pr.png

This is the PR gain curve, where baseline is zero.

prg.png

How to easily make a ROC curve in R

A typical task in evaluating the results of machine learning models is making a ROC curve, this plot can inform the analyst how well a model can discriminate one class from a second. We developed MLeval (https://cran.r-project.org/web/packages/MLeval/index.html), a evaluation package for R, to make ROC curves, PR curves, PR gain curves, and calibration curves. These plots are all using ggplot2 and it also yields performance metrics such as, Matthew’s correlation coefficient, specificity, sensitivity, and includes confidence intervals.

MLeval is aimed to make life as simple as possible. It can be run directly on a data frame of predicted probabilities and ground truth probabilities (labels), or on the Caret ‘train’ function output which performs cross validation to avoid overfitting. It also makes it easy to compare different models together. Let’s see an example.

## load libraries required for analysis
library(MLeval)
library(caret)

Run Caret on the Sonar data with 3 different models, then evaluate by passing the results objects as a list into ‘evalm’.


## load data and run Caret
data(Sonar)
ctrl <- trainControl(method="cv", summaryFunction=twoClassSummary, classProbs=T,
                     savePredictions = T)
fit1 <- train(Class ~ .,data=Sonar,method="rf",trControl=ctrl)
ctrl <- trainControl(method="cv", summaryFunction=twoClassSummary, classProbs=T,
                     savePredictions = T)
fit2 <- train(Class ~ .,data=Sonar,method="gbm",trControl=ctrl)
ctrl <- trainControl(method="cv", summaryFunction=twoClassSummary, classProbs=T,
                     savePredictions = T)
fit3 <- train(Class ~ .,data=Sonar,method="nb",trControl=ctrl)

## run MLeval
res <- evalm(list(fit1,fit2,fit3),gnames=c('rf','gbm','nb'))

The results (metrics and plots) can be accessed through the list object 'evalm' produces. We can see below that random forest and gbm perform the same, whereas naive bayes does not do as well falling behind the others in the two discrimination tests (ROC and PRG). However, in the calibration curves we can see all models are quite well calibrated, showing that being good at calibration does not always imply good discrimination.

The PRG curve standardises precision to the baseline, whereas the PR curve has a variable baseline, making it unsuitable to compare between data with different class distributions. This plot will change depending on which class is defined as positive, and is a deficiency of precision recall for non extremely imbalanced tasks. Credit card fraud is an example of where positives << negatives and it becomes more appropiate.

In the first two plots the analysis performed is the same, the probabilities are ranked from high to low then a sensitivity analysis is performed of the probability cut-off parameter to define a positive. For each iteration true positive rate vs true negative rate are calculated and plotted in the case of the ROC, for PRG it is precision gain vs recall gain. In the last plot, we plot predicted vs real probabilities (in bins) and the aim is for them to match as closely as possible (grey diagonal line = perfect). See our vignette for more information (https://cran.r-project.org/web/packages/MLeval/vignettes/introduction.pdf). Code is hosted here (https://github.com/crj32/MLeval).

ROC.pngPRG.png

CC.png

Fast adaptive spectral clustering in R: brain cancer RNA-seq

Spectral clustering refers to a family of algorithms that cluster eigenvectors derived from the matrix that represents the input data’s graph. An important step in this method is running the kernel function that is applied on the input data to generate a NXN similarity matrix or graph (where N is our number of input observations). Subsequent steps include computing the normalised graph Laplacian from this similarity matrix, getting the eigensystem of this graph, and lastly applying k-means on the top K eigenvectors to get the K clusters. Clustering in this way adds flexibility in the range of data that may be analysed and spectral clustering will often outperform k-means. It is an excellent option for image and bioinformatic cluster analysis including single platform and multi-omics.

‘Spectrum’ is a fast adaptive spectral clustering algorithm for R programmed by myself from QMUL and David Watson from Oxford. It contains both novel methodological advances and implementation of pre-existing methods. Spectrum has the following features; 1) A new density-aware kernel that increases similarity between observations that share common nearest neighbours, 2) A tensor product graph data integration and noise reduction system, 3) The eigengap method to decide on the number of clusters, 4) Gaussian Mixture Modelling for the final clustering of the eigenvectors, 5) Implementation of a Fast Approximate Spectral Clustering method for very big datasets, 6) Data imputation for multi-view analyses. Spectrum has been recently published as an article in Bioinformatics. It is available to download from CRAN: https://cran.r-project.org/web/packages/Spectrum/index.html.

In this demonstration, we are going to use Spectrum to cluster brain cancer RNA-seq to find distinct patient groups with different survival times. This is the link, braincancer_test_data, to download the test data for the analysis. The data can also be accessed through Synapse (https://www.synapse.org/#!Synapse:syn18911542/files/).

## load libraries required for analysis
library(Spectrum)
library(plot3D)
library(survival)
library(survminer)
library(scales)

The next block of code runs Spectrum, then does a t-sne analysis to visualise the clusters embedded in a lower dimensional space.

I have found 3D t-sne can work well to see complex patterns of Gaussian clusters which can be found in omic data (as in this cancer data), compared to PCA which is better for a more straightforward structure and outlier detection.

## run Spectrum
r <- Spectrum(brain[[1]])
## do t-sne analysis of results
y <- Rtsne::Rtsne(t(brain[[1]]),dim=3)
scatter3D(y$Y[,1],y$Y[,2],y$Y[,3], phi = 0, #bty = "g", ex = 2,
          ticktype = "detailed", colvar = r$assignments,
          col = gg.col(100), pch=20, cex=2, #type = 'h',
          xlab='Dim1', ylab='Dim2', zlab='Dim3')

Each eigenvector of the graph Laplacian numerically indicates membership of an observation to a block (cluster) in the data's graph Laplacian and each eigenvalue represents how disconnected that cluster is with the other clusters. In the ideal case, if we had eigenvalues reading 1, 1, 1, 1, 0.2 in the below plot, that would tell us we have 4 completely disconnected clusters, followed by one that is much more connected (undesirable when trying to find separate clusters). This principal can be extended to look for the greatest gap in the eigenvalues to find the number of clusters (K).

This first plot, produced automatically by Spectrum of the brain cancer data, shows the eigenvalues for each eigenvector of the graph Laplacian. Here is biggest gap is between the 4th and 5th eigenvectors, thus corresponding to K=4.

egap_brain.png

This next plot shows the 3D t-sne of the brain cancer RNA-seq clusters. We can see clear separation of the groups. On this type of complex data, Spectrum tends to excel due to its prioritisation of local similarities and its ability to reduce noise by making the k nearest neighbour graph and diffusing on it.

brain_tsne_rnaseq.png

Next, we can run a survival analysis using a Cox proportional hazards model and the log rank test is used to test whether there is a difference between the survival times of different groups (clusters).

## do test
clinicali <- brain[[2]]
clinicali$Death <- as.numeric(as.character(clinicali$Death))
coxFit <- coxph(Surv(time = Time, event = Death) ~ as.factor(r$assignments), data = clinicali, ties = "exact")
coxresults <- summary(coxFit)
print(coxresults$logtest[3])

## survival curve
survival_p <- coxresults$logtest[3]
fsize <- 18
clinicalj <- clinicali
if (0 %in% pr$cluster){
  res$cluster <- res$cluster+1
}
clinicalj$cluster <- as.factor(r$assignments)
fit = survfit(Surv(Time,Death)~cluster, data=clinicalj)

## plot
gg <- ggsurvplot(fit, data = clinicalj, legend = 'right', font.x =  fsize, font.y =  fsize, font.tickslab = fsize,
                 palette = gg.col(4),
                 legend.title = 'Cluster', #3 palette = "Dark2",
                 ggtheme = theme_classic2(base_family = "Arial", base_size = fsize))
gg$plot + theme(legend.text = element_text(size = fsize, color = "black")) + theme_bw() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        axis.text.y = element_text(size = fsize, colour = 'black'),
        axis.text.x = element_text(size = fsize, colour = 'black'),
        axis.title.x = element_text(size = fsize),
        axis.title.y = element_text(size = fsize),
        legend.title = element_text(size = fsize),
        legend.text = element_text(size = fsize))

This is the survival curve for the analysis. The code above also gives us the p value (4.03E-22) for the log rank test which is highly significant, suggesting Spectrum yields good results. More extensive comparisons are included in our manuscript.

brain_survival.png

Spectrum is effective at reducing noise on a single-view (like here) as well as multi-view because it performs a tensor product graph diffusion operation in either case, and it uses a k nearest neighbour graph. Multi-omic clustering examples are included in the package vignette. Spectrum will also be well suited to a broad range of different types of data beyond this demonstration given today.

The density adaptive kernel enhances the intra-cluster similarity which helps to improve the quality of the similarity matrix. A good quality similarity matrix (that represents the data’s graph) is key to the performance of spectral clustering, where intra-cluster similarity should be maximised and inter-cluster similarity minimised.

Spectrum is a fast new algorithm for spectral clustering in R. It is largely based on work by Zelnik-Manor, Ng, and Zhang, and includes implementations of pre-existing methods as well as new innovations. This is the second of the pair of clustering tools we developed for precision medicine, the first being M3C, a consensus clustering algorithm which is on the Bioconductor. In many ways, spectral clustering is a more elegant and powerful method than consensus clustering type algorithms, but both are useful to examine data from a different perspective.

Spectrum can be downloaded here: https://cran.r-project.org/web/packages/Spectrum/index.html

Source code is available here: https://github.com/crj32/Spectrum

Reference

Christopher R John, David Watson, Michael R Barnes, Costantino Pitzalis, Myles J Lewis, Spectrum: fast density-aware spectral clustering for single and multi-omic data, Bioinformatics, , btz704, https://doi.org/10.1093/bioinformatics/btz704

Running UMAP for data visualisation in R

UMAP is a non linear dimensionality reduction algorithm in the same family as t-SNE. In the first phase of UMAP a weighted k nearest neighbour graph is computed, in the second a low dimensionality layout of this is then calculated. Then the embedded data points can be visualised in a new space and compared with other variables of interest.

It can be used for the analysis of many types of data, including, single cell RNA-seq and cancer omic data. One easy way to run UMAP on your data and visualise the results is to make a wrapper function that uses the umap R package and ggplot2, this is easy to do yourself, but in this post we are going to have a look at the one included in the M3C package (https://bioconductor.org/packages/devel/bioc/html/M3C.html). The code for this simple function can be found here (https://github.com/crj32/M3C/blob/master/R/umap.R).

M3C is a clustering algorithm for finding the number of clusters in a given dataset, but it also has several other useful functions that can be used more generally. I don’t use M3C for single cell RNA-seq clustering it is better for other types of omic data, such as RNA-seq and standard omic platforms, however, its visualisation functions are useful for a number of different analyses I do. Otherwise I end up copy and pasting lengthy blocks of ggplot2 code again and again.

There have been two previous posts on the related PCA and t-SNE functions contained in the M3C package which both contain the same helpful parameters for extra labelling of the data points.

Let’s load some single cell RNA-seq data, this is the pollen_test_data we used in the last post when running t-SNE.

So this is just a basic UMAP plot. So in this data within clusters the samples are very similar so they are very compact in these plots.

# load your omic data here as mydata
library(M3C)
umap(pollen$data,colvec=c('skyblue'))

UMAP.png

This is for labelling the sample points with our ground truth labels, i.e. cell types. The colour scale is set to be controlled manually and this using some internal colours inside the function.

umap(pollen$data,labels=as.factor(pollen$celltypes),controlscale=TRUE,scale=3)

UMAPlabeled.png

This is for looking at an individual genes expression across the samples, so we are just labelling the points with a continuous variable, in this case GAGE4 because its variance is among the highest standardised to the mean.

umap(pollen$data, labels=scale(as.numeric(pollen$data[row.names(pollen$data)=='GAGE4',])),
controlscale = TRUE,scale=2,legendtitle = 'GAGE4')

UMAPlabeled.png

Let’s just cluster this data with M3C and then display the clustering assignments on the UMAP plot.

r <- M3C(pollen$data,method=2)
umap(pollen$data,labels=as.factor(r$assignments),printres = TRUE,printwidth = 24)

UMAPlabeled.png

So that’s pretty neat, simple functions like this can be quite a lot of fun.

Let’s do one more plot, we will make up a continuous variable that could be a co-variate we are interested in and display that on the plot. I’ll show the code for printing the plot here which is done internally to simplify things.

Scale=1 means we use the spectral palette for colouring the data points. We have to set controlscale=TRUE for using custom colours, it is also possible to set a ‘high’ and ‘low’ colour manually through changing the scale parameter to 2. Type ?umap for the relevant documentation.

# make random data
var <- seq(1,ncol(pollen$data))
# do plot
umap(pollen$data,labels=var,controlscale=TRUE,scale=1,
legendtitle='var',printres = TRUE,printwidth = 24)

UMAPlabeled.png

Note that if you want to control the various parameters that UMAP uses internally rather than a quick analysis like this, you’ll have to make your own function to wrap ggplot2 and umap  (https://cran.r-project.org/web/packages/umap/index.html) or just run them as individual steps.

So, that bring us to the end of this little blog series on the M3C data visualisation functions. Hope you all enjoyed it!

Quick and easy t-SNE analysis in R

t-SNE is a useful nonlinear dimensionality reduction method that allows you to visualise data embedded in a lower number of dimensions, e.g. 2, in order to see patterns and trends in the data. It can deal with more complex nonlinear patterns of Gaussian clusters in multidimensional space compared to PCA so it is good for single cell RNA-seq analysis. Although is not suited to finding outliers because how the samples are arranged does not directly represent distance, like in PCA, instead t-SNE preserves local distance at the expensive of global distance.

An easy way to run t-SNE on your data is to use a pre-made wrapper function that uses the Rtsne package and ggplot2. Like the one that comes with the M3C package (https://bioconductor.org/packages/devel/bioc/html/M3C.html). You can also just take this code from the github and remake it for your own specific needs if necessary (https://github.com/crj32/M3C), it is quite straightforward. The package also has the equivalent functions for PCA and UMAP.

Let’s load some single cell RNA-seq data and demonstrate this function.

This is the Pollen et al. single cell RNA-seq data, but you can use a different kind of omic data, or non omic data. Let’s just take a look at that data.

# load your omic data here as mydata
library(M3C)
tsne(pollen$data,colvec=c('gold'))

TSNE1.png

So, we can see this data is neatly grouped into clusters for us. Let’s compare these with the cell type labels. This is just using the default colours, for changing to custom colours we can use the ‘controlscale, ‘scale’, and ‘colvec’ parameters. You can read about these in the documentation using ?tsne.

library(M3C)
tsne(pollen$data,labels=as.factor(pollen$celltypes))

TSNElabeled.png

So now we might want to display the expression of a specific gene onto the t-SNE plot, we can easily do this using this function. So the code below can be modified for any given gene, this is for gene ‘HBG1’, we can change this as needed. It just pulls a numerical vector out of the gene expression matrix for us, and z-score scales it for display.

I choose HBG1 because I had run the ‘featurefilter’ function from M3C that collects statistics about all genes in the data frame, such as the coefficient of variation and its second order derivative. Use ?featurefilter to see more information. This gene was the most variable one in the data. Basically this function is for filtering the data according to variance and producing summary statistics per gene.

Here we choose to control the colour scale and setting this to 2 means low is grey and high is red on the scale. However, ‘low’ and ‘high’ may be user specified.

tsne(pollen$data,
labels=scale(as.numeric(pollen$data[row.names(pollen$data)=='HBG1',])),
controlscale = TRUE,scale=2)

TSNElabeled.png

Lastly, I want to find out which cell type that HBG1 expression is coming from, we can do this using the PCA function because the dots are too close together to see the text using t-SNE. To do this we have to add a text variable to the pca function.

pca(pollen$data,labels=scale(as.numeric(pollen$data[row.names(pollen$data)=='HBG1',])),
controlscale=TRUE,scale=2,text=pollen$celltypes,
textlabelsize = 2)

PCAlabeled.png

So, basically we can zoom in on the printed image and see the red cluster is all the K562 cell type. The text parameter is primarily for finding outliers from PCA, but it can be used in other ways too.

M3C is not for clustering single cell RNA-seq data because of the high complexity of the algorithm and the type of consensus clustering it does, but we can use its functions for other purposes anyway. It is good for clustering cancer single-omic data though, for example. It is essentially an enhanced version of the Monti consensus clustering algorithm, implemented originally in the consensusclusterplus package.

For an algorithm that can handle single cell RNA-seq better and also a wide range of omic and non omic data, Spectrum is a good option (https://cran.r-project.org/web/packages/Spectrum/index.html). It is faster and more sophisticated in many ways to the consensus clustering approach. Also good for single cell RNA-seq clustering are SC3 and Seurat.