What is UMAP and why is it useful for Data Scientist? UMAP (Un | Big Data Science

What is UMAP and why is it useful for Data Scientist?
UMAP (Uniform Manifold Approximation and Projection) is a universal manifold learning and dimensionality reduction algorithm. It is designed to be compatible with scikit-learn, uses the same API, and can be added to sklearn pipelines. As a stochastic algorithm, UMAP uses randomization to speed up the approximation and optimization steps. This means that different UMAP runs may produce different results. Although the UMAP is relatively stable, ideally the difference between runs should be relatively small, but it is. To ensure accurate reproduction of results, UMAP allows the user to set a random initial state.
Since version 0.4, UMAP also supports multithreading to improve performance, and when optimized, race conditions between threads are allowed at certain stages. The randomness in the UMAP output for the multithreaded case depends not only on the input random seed, but also on race conditions between threads during optimization, which is impossible to control. Therefore, multithreaded UMAP results cannot be explicitly reproduced.
UMAP can be used as an efficient preprocessing step to improve the performance of density-based clustering. But UMAP, like t-SNE, does not completely preserve density and can create false discontinuities in clusters. Compared to t-SNE, UMAP maintains a more global structure, creating more meaningful clusters. And thanks to the support for arbitrary embed sizes, UMAP allows you to work with large dimensional spaces.
Due to the active use of the nearest neighbors method, for some datasets, UMAP can consume excessive memory. Setting low_memory to True will help to switch to a slower, but less intensive approach to calculating nearest neighbors. It's also important to know that when run without a random seed, UMAP will use a parallel implementation of NUMBA to multithread and consume CPU cores. By default, it will use as many cores as available. You can limit the number of threads Numba uses by using the NUMBA_NUM_THREADS environment variable. Also due to the nature of Numba, UMAP does not support 32-bit Windows.
Despite some disadvantages, UMAP can be used in the following cases:
• exploratory data analysis (EDA);
• interactive visualization of the analysis results;
• processing of sparse matrices;
• detection of malicious programs based on behavioral data;
• preprocessing vectors of phrases for clustering;
• preprocessing of image embeddings (Inception) for clustering.
https://github.com/lmcinnes/umap
https://umap-learn.readthedocs.io/en/latest/index.html

Big Data Science

🎅 1.44K
Technologies

Big Data Science channel gathers together all interesting facts about Data Science. For cooperation: a.chernobrovov@gmail.com. 💼 — https://t.me/bds_job — channel about Data Science jobs and car...

Join
▲ Vote (1)

What is UMAP and why is it useful for Data Scientist? UMAP (Un | Big Data Science

Login