Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Request: more computational details for supervised/semi-supervised dimension reduction #624

Open
tinagmar opened this issue Mar 19, 2021 · 2 comments

Comments

@tinagmar
Copy link

Hi,
Would it be possible to get more information about the supervised/semi-supervised version of UMAP with labels on the read the docs page? The example is already good but an explanation of the computational details of what the algorithm does with the labels (and an idea of the math behind it) would be great.
Thanks a lot!

@lmcinnes
Copy link
Owner

The principle is to treat the labels as merely a different view of the data which has a different associated metric. For categorical labels we simply use a categorical metric (distance 1 if labels are different, distance 0 if they are the same, and potentially distance 0.5 in the semi-supervised case where one or more of the labels in the pair is not defined), but you can use different metrics using the target_metric keyword argument (including, I believe Levenshtein on string data).

Given two metric spaces for the same underlying data we can construct fuzzy simplicial sets for each, as per the "How UMAP works" documentation. The catch now is to combine these two distinct structures into just one. We do this via take an intersection of the fuzzy simplicial sets. Considering them as graphs, where the fuzzy membership value is interpreted as the probability that the edge between two datapoints exists, you can think of this as creating a new graph where the probability of an edge between two points is the probability that the edge exists in both the original graphs. Thus we are asking that the new combined graph respect both the metric spaces -- the data space, but also the label space.

At this point, with the combined graph, the rest of the UMAP algorithm can proceed as before, with an initialization and optimization of a layout based on this graph, as described in the "How UMAP works" documentation.

From a practical standpoint of implementing this the largest constraint is that while the fuzzy simplicial set of the data space has few simplices, the fuzzy simplicial set of the label space may have far to many simplices to work with easily. The code shortcuts this by making use of the fact that simplices in the data fss with strength 0 will result in strength 0 edges in the combined graph, so we only need to compute strengths of simplices in the label space fss where the data space fss has non-zero strength.

@CRISTIANJULIOCESAR
Copy link

Hi, @lmcinnes
This gimme a way better picture of how it works supervised UMAP, However, I want to understand this, both simplicial sets are computed in the same traditional UMAP way(Riemannian metric)? and the difference is only in the distance metric (categorical one for the labels and for original data another metric)?

What do you mean by "potentially distance" 0.5, the distance measures of 0, 1, and 0.5 are by default?

Thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants