A package for dimensionality reduction of large data
Motivation
Note: Recently, two new UMAP R packages have appeared. These new packages provide more features than umapr does and they are more actively developed. These packages are:
umap
, which provides the same Python wrapping function as umapr and also an R implementation, removing the need for the Python version to be installed. It is available on CRAN.uwot
, which also provides an R implementation, removing the need for the Python version to be installed.
A few weeks ago, as part of the rOpenSci Unconference, a group of us (Sean Hughes, Malisa Smith, Angela Li, Ju Kim, and Ted Laderas) decided to work on making the UMAP algorithm accessible within R. UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique that allows the user to reduce high dimensional data (multiple columns) into a smaller number of columns for visualization purposes (github, arxiv). It is similar to both Principal Components Analysis (PCA) and t-SNE, which are techniques often used in the single-cell omics (such as genomics, flow cytometry, proteomics) world to visualize high dimensional data. t-SNE is actually quite a slow algorithm; one of the advantages of UMAP is that it runs faster than t-SNE. Because the data.frames
that are typically run with these algorithms can run into millions of rows, efficiency is important.
We decided to start working on the umapr
package to make this technique accessible within R. As with most rOpenSci Unconf projects, this started with an issue entry in the rOpenSci unconf repo:
I recently read about a new non-linear dimensionality reduction algorithm called UMAP, which is much faster than t-SNE, while producing two-dimensional visualizations that share many characteristics with t-SNE. I initially found out about it in the context of use on high-dimensional single-cell data in this paper.
….
My thought is that the ideal would be a package focused on UMAP specifically, implemented in R or Rcpp. Unfortunately I am not at all an expert in this topic or familiar with the mathematics involved, so the best I would be able to do is try to translate the Python implementation into R.
We all met at the unconference the first day and decided that this was a project worth working on. Since t-SNE is so used in the single cell and flow-cytometry community, we thought that having an alternative that was just as good, but faster to run would be helpful.
Making a Development Plan
Rather than recreate the UMAP code completely from scratch in R, we decided to use the reticulate
package to call the implementation in Python from R. It was tempting to just wrap the function’s arguments with ...
and let the user refer to the python documentation. However, we didn’t really think that was in the spirit of the unconf. We wanted to make UMAP much more accessible.
Learning about Package Building, Testing, and Documentation
Although our package only really has one main function (umap()
), we felt it was important to have good documentation and unit tests. We spent some time learning about roxygen
for function documentation and testthat
for unit testing, and setting up our package with Travis-CI for continuous integration testing. This included unit tests on each argument and including examples varying the essential parameters.
We spent a lot of time learning more about the specifics of package building and vignette building in R. We were definitely excited by all of the available tools and built a vignette profiling the performance of the UMAP algorithm versus other dimensionality reduction techniques, such as t-SNE. Our vignette can be read here: https://github.com/ropenscilabs/umapr#basic-use
Profiling umapr
using different datasets
Part of the appeal of UMAP is that it is faster than t-SNE. So we profiled the performance of UMAP on a number of different datasets: iris
(of course!), the BreastCancer
dataset from the mlbench
package, a Soybean
dataset from mlbench
, and finally, a single cell RNA dataset. You can see our results in our readme file.
Thankfully, UMAP does run faster than t-SNE on these datasets, showing a reduction of 66% compared to both versions of t-SNE for the Soybean
dataset, and reduced memory usage for all of the datasets, except for the single cell RNA dataset (see above figure).
Exploring the Results with Shiny
We built a small Shiny app that lets people explore their embedding vectors (the dimensionally reduced vectors) and how they separate the data into different groupings in the 2D space. The app is simple, but allows users to immediately assess the results of the UMAP algorithm in differentiating groupings in the data by coloring the umap
result by the different variables included in the analysis.
Final Results: Get umapr
umapr
is currently available in the ropenscilabs
organization, and can be installed with the following commands, after the python modules are installed.
install.packages("devtools")
devtools::install_github("ropenscilabs/umapr")
As a group, we learned a lot by building the umapr
package. More importantly, I think we’ll work together on future projects. It was great to work together, and we are talking about having a hackathon between our multiple groups to improve some current open source flow cytometry tools. This was a really fun project and we’re excited to do more!