Request: more computational details for supervised/semi-supervised dimension reduction #624
Comments
The principle is to treat the labels as merely a different view of the data which has a different associated metric. For categorical labels we simply use a categorical metric (distance 1 if labels are different, distance 0 if they are the same, and potentially distance 0.5 in the semi-supervised case where one or more of the labels in the pair is not defined), but other metrics can be used for the label space.

Given two metric spaces for the same underlying data, we can construct fuzzy simplicial sets for each, as per the "How UMAP works" documentation. The catch now is to combine these two distinct structures into just one. We do this by taking an intersection of the fuzzy simplicial sets. Considering them as graphs, where the fuzzy membership value is interpreted as the probability that the edge between two datapoints exists, you can think of this as creating a new graph where the probability of an edge between two points is the probability that the edge exists in both of the original graphs. Thus we are asking that the new combined graph respect both metric spaces: the data space, but also the label space.

At this point, with the combined graph, the rest of the UMAP algorithm can proceed as before, with an initialization and optimization of a layout based on this graph, as described in the "How UMAP works" documentation.

From a practical standpoint, the largest constraint in implementing this is that while the fuzzy simplicial set (fss) of the data space has few simplices, the fuzzy simplicial set of the label space may have far too many simplices to work with easily. The code shortcuts this by making use of the fact that simplices in the data fss with strength 0 will result in strength 0 edges in the combined graph, so we only need to compute strengths of simplices in the label space fss where the data space fss has non-zero strength.
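To make the intersection step concrete, here is a minimal sketch in plain Python. It is not the library's implementation. It assumes the two graphs are independent, so the probability that an edge exists in both is the product of the two membership strengths. The helper names, the use of `-1` as the marker for an undefined label, and the conversion from label distance to membership strength (`exp(-scale * d)` with a made-up `scale`) are all assumptions chosen for illustration.

```python
import math

UNKNOWN = -1  # assumed marker for an undefined label (illustrative)


def categorical_distance(label_a, label_b, unknown_dist=0.5):
    """Categorical metric: 0 if equal, 1 if different, 0.5 if either is undefined."""
    if label_a == UNKNOWN or label_b == UNKNOWN:
        return unknown_dist
    return 0.0 if label_a == label_b else 1.0


def label_membership(distance, scale=5.0):
    """Hypothetical mapping from label distance to a membership strength in (0, 1]."""
    return math.exp(-scale * distance)


def intersect_with_labels(data_edges, labels):
    """Combine a sparse data-space graph with the label-space graph.

    data_edges maps (i, j) pairs to membership strengths. Label-space
    strengths are only computed where the data-space strength is non-zero,
    since a zero there forces a zero in the combined graph.
    """
    combined = {}
    for (i, j), strength in data_edges.items():
        if strength == 0.0:
            continue  # product would be zero; no edge in the combined graph
        d = categorical_distance(labels[i], labels[j])
        # Probability the edge exists in both graphs, assuming independence.
        combined[(i, j)] = strength * label_membership(d)
    return combined


# Four points; point 3 has no label.
labels = [0, 0, 1, UNKNOWN]
data_edges = {(0, 1): 0.8, (0, 2): 0.5, (1, 3): 0.0, (2, 3): 0.4}
combined = intersect_with_labels(data_edges, labels)
```

In this toy example the edge between the two same-label points keeps its full data-space strength, the edge between differently labelled points is strongly down-weighted, and the edge touching the unlabelled point lands in between.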
Hi @lmcinnes, what do you mean by "potentially distance 0.5"? Are the distance values of 0, 1, and 0.5 the defaults? Thanks a lot!
Hi,
Would it be possible to get more information about the supervised/semi-supervised version of UMAP with labels on the Read the Docs page? The example is already good, but an explanation of the computational details of what the algorithm does with the labels (and an idea of the math behind it) would be great.
Thanks a lot!