Request: more computational details for supervised/semi-supervised dimension reduction #624

tinagmar · 2021-03-19T10:33:20Z

Hi,
Would it be possible to get more information about the supervised/semi-supervised version of UMAP with labels on the read the docs page? The example is already good but an explanation of the computational details of what the algorithm does with the labels (and an idea of the math behind it) would be great.
Thanks a lot!

lmcinnes · 2021-03-22T14:52:42Z

The principle is to treat the labels as merely a different view of the data which has a different associated metric. For categorical labels we simply use a categorical metric (distance 1 if labels are different, distance 0 if they are the same, and potentially distance 0.5 in the semi-supervised case where one or more of the labels in the pair is not defined), but you can use different metrics using the target_metric keyword argument (including, I believe Levenshtein on string data).

Given two metric spaces for the same underlying data we can construct fuzzy simplicial sets for each, as per the "How UMAP works" documentation. The catch now is to combine these two distinct structures into just one. We do this via take an intersection of the fuzzy simplicial sets. Considering them as graphs, where the fuzzy membership value is interpreted as the probability that the edge between two datapoints exists, you can think of this as creating a new graph where the probability of an edge between two points is the probability that the edge exists in both the original graphs. Thus we are asking that the new combined graph respect both the metric spaces -- the data space, but also the label space.

At this point, with the combined graph, the rest of the UMAP algorithm can proceed as before, with an initialization and optimization of a layout based on this graph, as described in the "How UMAP works" documentation.

From a practical standpoint of implementing this the largest constraint is that while the fuzzy simplicial set of the data space has few simplices, the fuzzy simplicial set of the label space may have far to many simplices to work with easily. The code shortcuts this by making use of the fact that simplices in the data fss with strength 0 will result in strength 0 edges in the combined graph, so we only need to compute strengths of simplices in the label space fss where the data space fss has non-zero strength.

CRISTIANJULIOCESAR · 2022-01-23T23:48:12Z

Hi, @lmcinnes
This gimme a way better picture of how it works supervised UMAP, However, I want to understand this, both simplicial sets are computed in the same traditional UMAP way(Riemannian metric)? and the difference is only in the distance metric (categorical one for the labels and for original data another metric)?

What do you mean by "potentially distance" 0.5, the distance measures of 0, 1, and 0.5 are by default?

Thanks a lot!

dillondaudert mentioned this issue Sep 15, 2023

V0.2 dev dillondaudert/UMAP.jl#55

Draft

20 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Request: more computational details for supervised/semi-supervised dimension reduction #624

Request: more computational details for supervised/semi-supervised dimension reduction #624

tinagmar commented Mar 19, 2021

lmcinnes commented Mar 22, 2021

CRISTIANJULIOCESAR commented Jan 23, 2022

Request: more computational details for supervised/semi-supervised dimension reduction #624

Request: more computational details for supervised/semi-supervised dimension reduction #624

Comments

tinagmar commented Mar 19, 2021

lmcinnes commented Mar 22, 2021

CRISTIANJULIOCESAR commented Jan 23, 2022