UMAP is a non linear dimensionality reduction algorithm in the same family as t-SNE. In the first phase of UMAP a weighted k nearest neighbour graph is computed, in the second a low dimensionality layout of this is then calculated. Then the embedded data points can be visualised in a new space and compared with other variables of interest.

It can be used for the analysis of many types of data, including, single cell RNA-seq and cancer omic data. One easy way to run UMAP on your data and visualise the results is to make a wrapper function that uses the umap R package and ggplot2, this is easy to do yourself, but in this post we are going to have a look at the one included in the M3C package (https://bioconductor.org/packages/devel/bioc/html/M3C.html). The code for this simple function can be found here (https://github.com/crj32/M3C/blob/master/R/umap.R).

M3C is a clustering algorithm for finding the number of clusters in a given dataset, but it also has several other useful functions that can be used more generally. I don’t use M3C for single cell RNA-seq clustering it is better for other types of omic data, such as RNA-seq and standard omic platforms, however, its visualisation functions are useful for a number of different analyses I do. Otherwise I end up copy and pasting lengthy blocks of ggplot2 code again and again.

There have been two previous posts on the related PCA and t-SNE functions contained in the M3C package which both contain the same helpful parameters for extra labelling of the data points.

Let’s load some single cell RNA-seq data, this is the pollen_test_data we used in the last post when running t-SNE.

So this is just a basic UMAP plot. So in this data within clusters the samples are very similar so they are very compact in these plots.

# load your omic data here as mydata library(M3C) umap(pollen$data,colvec=c('skyblue'))

This is for labelling the sample points with our ground truth labels, i.e. cell types. The colour scale is set to be controlled manually and this using some internal colours inside the function.

umap(pollen$data,labels=as.factor(pollen$celltypes),controlscale=TRUE,scale=3)

This is for looking at an individual genes expression across the samples, so we are just labelling the points with a continuous variable, in this case GAGE4 because its variance is among the highest standardised to the mean.

umap(pollen$data, labels=scale(as.numeric(pollen$data[row.names(pollen$data)=='GAGE4',])), controlscale = TRUE,scale=2,legendtitle = 'GAGE4')

Let’s just cluster this data with M3C and then display the clustering assignments on the UMAP plot.

r <- M3C(pollen$data,method=2) umap(pollen$data,labels=as.factor(r$assignments),printres = TRUE,printwidth = 24)

So that’s pretty neat, simple functions like this can be quite a lot of fun.

Let’s do one more plot, we will make up a continuous variable that could be a co-variate we are interested in and display that on the plot. I’ll show the code for printing the plot here which is done internally to simplify things.

Scale=1 means we use the spectral palette for colouring the data points. We have to set controlscale=TRUE for using custom colours, it is also possible to set a ‘high’ and ‘low’ colour manually through changing the scale parameter to 2. Type ?umap for the relevant documentation.

# make random data var <- seq(1,ncol(pollen$data)) # do plot umap(pollen$data,labels=var,controlscale=TRUE,scale=1, legendtitle='var',printres = TRUE,printwidth = 24)

Note that if you want to control the various parameters that UMAP uses internally rather than a quick analysis like this, you’ll have to make your own function to wrap ggplot2 and umap (https://cran.r-project.org/web/packages/umap/index.html) or just run them as individual steps.

So, that bring us to the end of this little blog series on the M3C data visualisation functions. Hope you all enjoyed it!

FYI – https://github.com/jlmelville/uwot is another implememtation of UMAP in R that I’ve had good experiences with.