V0.2 dev #55

Draft
wants to merge 59 commits into
base: master
59 commits
3b80c20
Add loss.jl docs initial
dillondaudert Oct 29, 2019
2606036
Update docs
dillondaudert Oct 30, 2019
75e3129
Merge branch 'master' into v0.2-dev
dillondaudert Dec 28, 2019
b0b3160
Merge branch 'master' into v0.2-dev
May 24, 2020
0a36fc5
Start draft of neighbors.jl
dillondaudert May 25, 2020
d12eda7
Merge branch 'master' into v0.2-dev
dillondaudert May 29, 2020
285ce2a
Update new knn_search draft to use named tuples
dillondaudert May 29, 2020
e6fd11d
knn_search with fit tests; started simplicial_sets.jl
dillondaudert Jun 5, 2020
312924a
More simplicial set impl and initial tests
dillondaudert Jun 9, 2020
fae8aec
Add coalesce_views and very simple test
dillondaudert Jun 10, 2020
7669ca4
Add docs to knn_search and refine
dillondaudert Jun 11, 2020
ba3976a
Add docs to fuzzy_simplicial_set and refine
dillondaudert Jun 11, 2020
edd5579
Begin initialize_embedding
dillondaudert Jun 11, 2020
ad47b0d
Begin pluto notebook for advanced usage
Dec 8, 2020
b3c4c38
Add membership_fn.jl; move fit_ab to it
Dec 8, 2020
43b4694
Update target params membership params; draft fit.jl
Dec 10, 2020
802d452
Begin optimize_new, add Setfield
Dec 10, 2020
4abee0e
Fix partial function def
Dec 15, 2020
45887cb
Initial optimize_new.jl implementation with target_metric fucntion
Dec 15, 2020
4f94fc2
Correclty parameterize target_metric _EuclideanManifold
Dec 15, 2020
cb67cbe
Fix var name self_reference
Dec 21, 2020
ee734b6
Initial fit.jl; rename optimize -> optimize_old.jl
Dec 21, 2020
108b3f0
Rename optimize_new to optimize.jl
Dec 21, 2020
4f85b01
Begin basic fit API.
Dec 21, 2020
116b743
Add config constructors and docstrings; expand fit() functionality to…
Dec 21, 2020
8d5e809
Delete old src files.
Dec 21, 2020
e4a658f
Remove old import
Dec 21, 2020
5abf0c6
Fix gradient coefficient
Dec 23, 2020
df8ceca
Fix typo
Dec 23, 2020
6a0404b
Fix epoch iterator bug
Dec 23, 2020
f117c0c
Update mnist.jl Pluto notebook
Dec 24, 2020
b3e0cf2
Fix type instability in optimize.jl grad_coef
Dec 28, 2020
8e98724
Specialize optimize_embedding for sqeuclidean on Euclidean manifold; …
Dec 28, 2020
98f4e02
Initial reimpl transform
Dec 28, 2020
fef77c8
Add missing end
Dec 28, 2020
42ef438
Include transform.jl
Dec 28, 2020
6a4df84
Update pluto doc notebooks
Dec 29, 2020
dc76771
Add knn_search transform tests
Dec 30, 2020
928ee2b
Add simple transform simplicial set test
Dec 30, 2020
441c233
Remove old exports
Dec 30, 2020
2e19066
v0.2-dev: Update pluto notebooks
Apr 22, 2022
1e03a66
Merge branch 'master' into v0.2-dev
Apr 22, 2022
de75769
v0.2-dev: Remove unintended merge conflict added data
Apr 22, 2022
e05feb5
v0.2: Add version compat for new package Setfield
Apr 22, 2022
2c4dd2a
v0.2-dev: Fix neighbors tests to permit transform with matrices
Apr 22, 2022
114dec3
v0.2-dev: Add the exisitng utils tests to test suite
Apr 22, 2022
9a3f12a
v0.2-dev: Work on general simpl set intersection/union; reset local c…
Apr 24, 2022
680de58
v0.2-dev: Implement reset_local_connectivity; need tests
Apr 25, 2022
0bf580d
Add tests for simplicial set utils
May 4, 2022
1a6dd8b
Change fit default global mix weight to 0.5
May 4, 2022
af85fb2
Initial trustworthiness; need testing
May 8, 2022
ddb3aa7
Update .gitignore
dillondaudert Sep 18, 2023
32e950d
Properly check nonzero minimum in fset intersection; tests
dillondaudert Sep 18, 2023
e18d794
Update version to v0.2; support latest Julia 1.9
Sep 19, 2023
d7246fc
Fix norm_sparse to properly norm columns
Sep 19, 2023
3b9ba15
Undo change to _norm_sparse
Sep 19, 2023
d1103e8
Update advanced usage
Sep 19, 2023
b9ba44b
Use KNNGraph as PrecomputedNeighbors (only fit makes sense)
Sep 20, 2023
7126d24
Update advanced_usage.jl
Sep 20, 2023
1 change: 1 addition & 0 deletions .gitignore
@@ -6,3 +6,4 @@ deps/deps.jl
*.ipynb_checkpoints

Manifest.toml
+.vscode/settings.json
6 changes: 4 additions & 2 deletions Project.toml
@@ -1,7 +1,7 @@
name = "UMAP"
uuid = "c4f8c510-2410-5be4-91d7-4fbaeb39457e"
authors = ["Dillon Daudert <[email protected]>"]
-version = "0.1.9"
+version = "0.2.0"

[deps]
Arpack = "7d9fca2a-8960-54d3-9f78-7d1dccf2cb97"
@@ -10,14 +10,16 @@ LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
LsqFit = "2fda8390-95c7-5789-9bda-21331edee243"
NearestNeighborDescent = "dd2c4c9e-a32f-5b2f-b342-08c2f244fce8"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
+Setfield = "efcf1570-3423-57d1-acb7-fd33fddbac46"
SparseArrays = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"

[compat]
Arpack = "0.4, 0.5"
Distances = "0.8, 0.9, 0.10"
LsqFit = "0.6, 0.7, 0.8, 0.9, 0.10, 0.11, 0.12"
NearestNeighborDescent = "0.3"
-julia = "1.6, 1.7"
+Setfield = "0.8"
+julia = "1.6, 1.7, 1.8, 1.9"

[extras]
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
9 changes: 9 additions & 0 deletions docs/examples/Project.toml
@@ -0,0 +1,9 @@
[deps]
Distances = "b4f34e82-e78d-54a5-968a-f98e89d6e8f7"
MLDatasets = "eb30cadb-4394-5ae3-aed4-317e484a6458"
NearestNeighborDescent = "dd2c4c9e-a32f-5b2f-b342-08c2f244fce8"
PlotlyJS = "f0f68f2c-4968-5e81-91da-67840de0976a"
Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80"
Pluto = "c3e4b0f8-55cb-11ea-2926-15256bba5781"
StringDistances = "88034a9c-02f8-509d-84a9-84ec65e18404"
UMAP = "c4f8c510-2410-5be4-91d7-4fbaeb39457e"
320 changes: 320 additions & 0 deletions docs/examples/advanced_usage.jl
@@ -0,0 +1,320 @@
### A Pluto.jl notebook ###
# v0.19.27

using Markdown
using InteractiveUtils

# ╔═╡ 75d1e5a1-5468-4c82-b074-36e3a6c6f4ec
import Pkg

# ╔═╡ b9dd81e8-193e-45ad-8db9-885d59f02f1b
Pkg.activate(@__DIR__)

# ╔═╡ dcd32c80-398b-11eb-2e05-456e126db257
using UMAP

# ╔═╡ 0028c794-398c-11eb-3464-55d473eb6584
using Distances

# ╔═╡ 2467fefe-398c-11eb-3bc8-997aa34112d3
using StringDistances

# ╔═╡ 279cd6ba-398c-11eb-3726-017ceb9dea5c
md"""
# Advanced Usage
"""

# ╔═╡ 72b19124-3996-11eb-37cb-4184976b0d9b
md"""
## Algorithm
At a high level, the UMAP algorithm proceeds in the following steps:

```julia
knns_dists = knn_search(data, knn_params)
fuzzy_sets = fuzzy_simplicial_set(knns_dists, knn_params, src_view_params)
umap_graph = coalesce_views(fuzzy_sets, src_global_params)
embedding = initialize_embedding(umap_graph, tgt_params)
optimize_embedding!(embedding, umap_graph, tgt_params, opt_params)
```
"""

# ╔═╡ 2e7552d4-398c-11eb-2f64-63b8af73b208
md"""
## KNN Search
In a typical workflow, the first step of the UMAP algorithm is to find an (approximate) k-nearest-neighbor graph of the data.
"""

# ╔═╡ ecee2216-398c-11eb-0903-4d55ae073c58
md"""
### Example: Approximate neighbors for vector data
A very simple example of this is to find 4 approximate nearest neighbors for vectors in R^n using the Euclidean metric:
"""

# ╔═╡ 8f641ed6-398c-11eb-1678-1d25fec1110e
xs = [rand(10) for _ in 1:10];

# ╔═╡ d8424a74-398c-11eb-2caa-07c75477d11e
knn_params = UMAP.DescentNeighbors(4, Euclidean())

# ╔═╡ c1410072-398c-11eb-1398-47403e535012
UMAP.knn_search(xs, knn_params)

# ╔═╡ 9279de48-398d-11eb-1e07-1136141af11e
md"""
The result in this case is a tuple of two 4x10 (`n_neighbors` x `n_points`) matrices: the first holds the indices of the nearest neighbors and the second holds the corresponding distances.

e.g. `knn_search(xs, knn_params) -> indices, distances`
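
The shape convention above can be illustrated without UMAP.jl. The following is a minimal sketch that assumes an exact brute-force search in place of `NearestNeighborDescent.jl`. `brute_force_knn` is a hypothetical helper, not part of the package, and leaving each point out of its own neighbor list is also an assumption.

```julia
using LinearAlgebra: norm

# Hypothetical helper (not UMAP.jl API): exact k-nearest neighbors by brute force.
# Returns two k x n matrices, one column per point.
function brute_force_knn(points::AbstractVector{<:AbstractVector}, k::Integer)
    n = length(points)
    indices = Matrix{Int}(undef, k, n)
    distances = Matrix{Float64}(undef, k, n)
    for j in 1:n
        # Euclidean distance from point j to every point
        d = [norm(points[i] - points[j]) for i in 1:n]
        d[j] = Inf                        # assumption: a point is not its own neighbor
        nearest = partialsortperm(d, 1:k) # indices of the k smallest distances, in order
        indices[:, j] = nearest
        distances[:, j] = d[nearest]
    end
    return indices, distances
end

points = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [7.0, 0.0]]
indices, distances = brute_force_knn(points, 2)
# indices   == [2 1 2 3; 3 3 1 2]
# distances == [1.0 1.0 2.0 4.0; 3.0 2.0 3.0 6.0]
```

Column `j` of each matrix describes point `j`, and rows are ordered from nearest to farthest.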
"""

# ╔═╡ be4537e8-398d-11eb-22ba-b51e1aa3dee8
md"""
The knn parameter struct `DescentNeighbors` uses `NearestNeighborDescent.jl` to find the approximate knns of the data. It also allows passing keyword arguments to `nndescent`:
"""

# ╔═╡ 0862883a-398e-11eb-3188-255fe8d4d14f
knn_params_kw = UMAP.DescentNeighbors(4, Euclidean(), (max_iters=15,));

# ╔═╡ 3067ce2e-398e-11eb-3cde-97c95b306cef
UMAP.knn_search(xs, knn_params_kw)

# ╔═╡ 469b60a4-398e-11eb-13a2-4dab74b0f1bb
md"""
### Example: Precomputed distances
Alternatively, a precomputed distance matrix can be passed in if the pairwise distances are already known. This is done by using the `PrecomputedNeighbors` knn parameter struct (note that `n_neighbors` is still required in order to later construct the fuzzy simplicial set, and for transforming new data):
"""

# ╔═╡ cd0c3398-398e-11eb-3486-cdf107f27159
distances = [0. 2 1;
             2  0 3;
             1  3 0];

# ╔═╡ 89729294-398e-11eb-2d30-fbed1c13ce51
knn_params_pre = UMAP.PrecomputedNeighbors(2, distances)

# ╔═╡ f3d66db6-398e-11eb-2432-8d2a4804d4c5
UMAP.knn_search(nothing, knn_params_pre)

# ╔═╡ 143aac84-398f-11eb-29f4-05727bb571de
md"""
### Example: Multiple views
One key feature of UMAP is combining multiple, heterogeneous views of the same dataset. For the knn search step, this is set up by passing a named tuple of data views and a corresponding named tuple of knn parameter structs. The `knn_search` function then broadcasts over each (data, knn_param) pair and returns a named tuple of (indices, distances) results with the same keys as the input.
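
The pairing of views by name can be sketched with plain Julia. This is an illustration only: `describe_view` is a hypothetical stand-in for a per-view search, and the parameter tuples are made up for the example. It relies on the fact that `map` over named tuples with identical keys returns a named tuple with those same keys.

```julia
# Hypothetical stand-in for a per-view search; it only reports what it was given.
describe_view(data, params) = (n_points = length(data), n_neighbors = params.n_neighbors)

views = (view_1 = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]],
         view_2 = ["ABC", "ABD", "XYZ"])
params = (view_1 = (n_neighbors = 2,),
          view_2 = (n_neighbors = 1,))

# `map` pairs the entries by position and keeps the keys of the inputs.
results = map(describe_view, views, params)
results.view_2  # (n_points = 3, n_neighbors = 1)
```
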

For example, in addition to the vector data `xs` we might also have string data:
"""

# ╔═╡ 16633022-3990-11eb-1e5a-7f96e5fca442
xs_str = [join(rand('A':'Z', 10), "") for _ in 1:10];

# ╔═╡ 650f8432-3990-11eb-0217-37f490ec414c
knn_params_str = UMAP.DescentNeighbors(4, RatcliffObershelp());

# ╔═╡ aeef6b0e-398f-11eb-3aac-d1d6dc7210d5
data_views = (view_1=xs,
view_2=xs_str)

# ╔═╡ a92040d0-3990-11eb-063b-ed48334893db
knn_params_views = (view_1=knn_params,
view_2=knn_params_str)

# ╔═╡ c81b8454-3990-11eb-3376-9faac9aa5987
UMAP.knn_search(data_views, knn_params_views)

# ╔═╡ 6cac2bc2-3991-11eb-098b-15e21571c561
md"""
## Fuzzy Simplicial Sets
Once we have one or more sets of knns for our data (one for each view), we can construct a global fuzzy simplicial set. This is done via the function

`fuzzy_simplicial_set(...) -> umap_graph::SparseMatrixCSC`

A global fuzzy simplicial set is constructed **for each view** of the data, with construction parameterized by the `SourceViewParams` struct. If there is more than one view, their results are combined to return a single fuzzy simplicial set (represented as a weighted, undirected graph).
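
To see how directed neighbor weights can become an undirected graph, here is a minimal sketch of the fuzzy union and intersection described in the UMAP paper. `combine_directed` and its `set_operation_ratio` argument are assumptions made for illustration; they are not confirmed to match the code on this branch.

```julia
using SparseArrays

# Hypothetical helper (not UMAP.jl API): symmetrize directed membership weights.
# A[i, j] is the membership strength of the directed edge between i and j.
function combine_directed(A::SparseMatrixCSC, set_operation_ratio::Real)
    At = sparse(A')                       # materialized transpose
    fuzzy_union = A + At - A .* At        # probabilistic t-conorm
    fuzzy_intersection = A .* At          # product t-norm
    return set_operation_ratio .* fuzzy_union .+
           (1 - set_operation_ratio) .* fuzzy_intersection
end

A = sparse([1, 2, 3], [2, 1, 1], [0.5, 1.0, 0.2], 3, 3)
G = combine_directed(A, 1.0)   # pure fuzzy union
G == sparse(G')                # true: the result is symmetric
```

With a ratio of 1 an edge survives if either direction is present; with a ratio of 0 only mutual edges survive.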
"""

# ╔═╡ 0009ca40-3993-11eb-3b0b-db3511e3d9a7
md"""
### Example: Fuzzy simplicial set - one view
To create a fuzzy simplicial set for our original dataset of vectors:
"""

# ╔═╡ 32f95f38-3993-11eb-38e5-216e944d198e
src_view_params = UMAP.SourceViewParams(1, 1, 1)

# ╔═╡ 49d1c3c6-3993-11eb-2e2a-c7c5d3a1de18
knns_dists = UMAP.knn_search(xs, knn_params)

# ╔═╡ 844c0f66-3993-11eb-3a40-992da37a639e
UMAP.fuzzy_simplicial_set(knns_dists, knn_params, src_view_params)

# ╔═╡ fa5d0822-3993-11eb-1152-a5e1333fe70f
md"""
### Example: Fuzzy simplicial set - multiple views
As before, multiple views can be passed to `fuzzy_simplicial_set` - each parameterized by its own `SourceViewParams` - and combined into a single, global fuzzy simplicial set.

Using our combination of vector and string data:
"""

# ╔═╡ 307f647c-3994-11eb-042d-47ae4e4779d2
knns_dists_views = UMAP.knn_search(data_views, knn_params_views)

# ╔═╡ 554503a2-3994-11eb-032d-9122491e1d55
src_view_params_views = (view_1=src_view_params,
view_2=src_view_params)

# ╔═╡ 83b5aa0c-3994-11eb-1ede-7f3440865586
fsset_views = UMAP.fuzzy_simplicial_set(knns_dists_views, knn_params_views, src_view_params_views)

# ╔═╡ 0ce99b28-3995-11eb-08c8-7fe41c8e5ff6
md"""
### Example: Combining views' fuzzy simplicial sets
We need a single umap graph (i.e. global fuzzy simplicial set) in order to perform optimization, so if there are multiple dataset views we must combine their sets.

The views' fuzzy sets are combined left-to-right according to `mix_ratio`:
"""

# ╔═╡ 8f21a1b8-3995-11eb-189d-9b0dd552f1c8
src_gbl_params = UMAP.SourceGlobalParams(0.5)

# ╔═╡ fb400814-3995-11eb-0fc8-372701323b2c
_graph = UMAP.coalesce_views(fsset_views, src_gbl_params)

# ╔═╡ d2915640-3998-11eb-22c5-adc30c539cd6
md"""
## Initialize and optimize target embedding
- initialize target space membership function and gradient functions
- initialize target space embedding
- optimize target embedding
"""

# ╔═╡ 8cb95f52-3b0d-11eb-209a-c5a6d17a89e7
md"""
## Initialize target embedding
The target space and initialization method can be parameterized by the `TargetParams` struct:

```julia
struct TargetParams{M, D, I, F}
    manifold::M
    metric::D
    init::I
    memb_params::F
end
```

The struct specifies the target `manifold`, a distance `metric` in the target space, an initialization method `init`, and the membership function parameters `memb_params`.

The default target space is d-dimensional Euclidean space, with the squared Euclidean distance metric. Two initialization methods are provided: random and spectral layout.
"""

# ╔═╡ be8795d0-3b0d-11eb-18fd-9f7d61210ae2
md"""
### Example: Initializing vectors in R^2
"""

# ╔═╡ cbdb30a2-3b0d-11eb-1167-dbf6e87d80bd
tgt_params = UMAP.TargetParams(UMAP._EuclideanManifold{2}(), SqEuclidean(), UMAP.UniformInitialization(), nothing)

# ╔═╡ 415b2066-3b0f-11eb-37c7-6fa74b7282b1
umap_graph = UMAP.fuzzy_simplicial_set(knns_dists, knn_params, src_view_params);

# ╔═╡ 880b3960-3b0f-11eb-33a5-a12e8248f6fe
xs_embed = UMAP.initialize_embedding(umap_graph, tgt_params)

# ╔═╡ 544c3254-43a8-11eb-2645-3d26e34bd982
md"""
### MembershipFnParams
These parameters control the layout of points embedded in the target space by adjusting the membership function. *TO DO*.

```julia
struct MembershipFnParams
    min_dist
    spread
    a
    b
end
```
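
The role of `a` and `b` can be sketched as follows, assuming the formulation from the UMAP paper: a smooth curve `1 / (1 + a * d^(2b))` is fitted to a target that equals 1 up to `min_dist` and decays exponentially with scale `spread` beyond it. The names `target_curve`, `membership` and `grid_fit_ab` are hypothetical, and the coarse grid search below is a stand-in for a proper least-squares fit, not the method used by `fit_ab`.

```julia
# Target curve: full membership within min_dist, exponential decay beyond it.
target_curve(d, min_dist, spread) = d <= min_dist ? 1.0 : exp(-(d - min_dist) / spread)

# Smooth, differentiable approximation controlled by a and b.
membership(d, a, b) = 1 / (1 + a * d^(2 * b))

# Hypothetical helper: coarse grid search minimizing the squared error
# between the two curves on 300 sample distances in [0, 3 * spread].
function grid_fit_ab(min_dist, spread)
    ds = range(0, 3 * spread; length = 300)
    best = (a = NaN, b = NaN, err = Inf)
    for a in 0.1:0.05:3.0, b in 0.1:0.05:2.0
        err = sum(abs2, membership(d, a, b) - target_curve(d, min_dist, spread) for d in ds)
        if err < best.err
            best = (a = a, b = b, err = err)
        end
    end
    return best
end

fit = grid_fit_ab(0.1, 1.0)
fit.a, fit.b
```

A smaller `min_dist` lets embedded points pack more tightly, while a larger `spread` stretches the distance over which membership decays.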
"""

# ╔═╡ b8481434-43a9-11eb-3902-6d5426beda92
a, b = UMAP.fit_ab(1, 1)

# ╔═╡ 011969ec-43aa-11eb-1545-6b63b42277fe
full_tgt_params = UMAP.TargetParams(UMAP._EuclideanManifold{2}(), SqEuclidean(), UMAP.UniformInitialization(), UMAP.MembershipFnParams(1., 1., a, b))

# ╔═╡ b7ea70da-43a8-11eb-35b6-d1836e6849c5
md"""
## Optimize target embedding
The embedding is optimized by minimizing the fuzzy set cross entropy loss between the
two fuzzy set representations of the data.
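
As a rough illustration of this loss, the sketch below evaluates the fuzzy set cross entropy from the UMAP paper over a list of edges, where `p` holds membership strengths in the source graph and `q` the strengths implied by the embedding. `fuzzy_cross_entropy` and its `tol` clamp are hypothetical; the optimizer on this branch works with sampled gradients and is not shown here.

```julia
# Hypothetical helper (not UMAP.jl API): fuzzy set cross entropy over edges.
# p[e] is the source membership of edge e, q[e] the embedding membership.
function fuzzy_cross_entropy(p::AbstractVector, q::AbstractVector; tol = 1e-12)
    total = 0.0
    for (pe, qe) in zip(p, q)
        pc = clamp(pe, tol, 1 - tol)   # keep the logarithms finite
        qc = clamp(qe, tol, 1 - tol)
        attract = pc * log(pc / qc)                    # pulls connected points together
        repel = (1 - pc) * log((1 - pc) / (1 - qc))    # pushes unconnected points apart
        total += attract + repel
    end
    return total
end

p = [0.9, 0.2, 0.5]
fuzzy_cross_entropy(p, p)                # zero when the two representations agree
fuzzy_cross_entropy(p, [0.5, 0.5, 0.5])  # positive when they differ
```
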
"""

# ╔═╡ 0e8db138-43a9-11eb-1dba-dbddcfdd10f7
md"""
### Example: Optimize one epoch
The optimization process is parameterized by the struct `OptimizationParams`:

```julia
struct OptimizationParams
    n_epochs            # number of epochs to perform optimization
    lr                  # learning rate
    repulsion_strength  # weight to give negative samples
    neg_sample_rate     # number of negative samples per positive sample
end
```
"""

# ╔═╡ a3d69b10-43a9-11eb-1dba-03059f2afcb0
opt_params = UMAP.OptimizationParams(1, 1., 1., 5)

# ╔═╡ afcb98a0-43a9-11eb-2bc6-cbd18349b749
UMAP.optimize_embedding!(xs_embed, umap_graph, full_tgt_params, opt_params)

# ╔═╡ Cell order:
# ╠═75d1e5a1-5468-4c82-b074-36e3a6c6f4ec
# ╠═b9dd81e8-193e-45ad-8db9-885d59f02f1b
# ╠═dcd32c80-398b-11eb-2e05-456e126db257
# ╠═0028c794-398c-11eb-3464-55d473eb6584
# ╠═2467fefe-398c-11eb-3bc8-997aa34112d3
# ╟─279cd6ba-398c-11eb-3726-017ceb9dea5c
# ╟─72b19124-3996-11eb-37cb-4184976b0d9b
# ╟─2e7552d4-398c-11eb-2f64-63b8af73b208
# ╟─ecee2216-398c-11eb-0903-4d55ae073c58
# ╠═8f641ed6-398c-11eb-1678-1d25fec1110e
# ╠═d8424a74-398c-11eb-2caa-07c75477d11e
# ╠═c1410072-398c-11eb-1398-47403e535012
# ╟─9279de48-398d-11eb-1e07-1136141af11e
# ╟─be4537e8-398d-11eb-22ba-b51e1aa3dee8
# ╠═0862883a-398e-11eb-3188-255fe8d4d14f
# ╠═3067ce2e-398e-11eb-3cde-97c95b306cef
# ╟─469b60a4-398e-11eb-13a2-4dab74b0f1bb
# ╠═cd0c3398-398e-11eb-3486-cdf107f27159
# ╠═89729294-398e-11eb-2d30-fbed1c13ce51
# ╠═f3d66db6-398e-11eb-2432-8d2a4804d4c5
# ╟─143aac84-398f-11eb-29f4-05727bb571de
# ╠═16633022-3990-11eb-1e5a-7f96e5fca442
# ╠═650f8432-3990-11eb-0217-37f490ec414c
# ╠═aeef6b0e-398f-11eb-3aac-d1d6dc7210d5
# ╠═a92040d0-3990-11eb-063b-ed48334893db
# ╠═c81b8454-3990-11eb-3376-9faac9aa5987
# ╟─6cac2bc2-3991-11eb-098b-15e21571c561
# ╟─0009ca40-3993-11eb-3b0b-db3511e3d9a7
# ╠═32f95f38-3993-11eb-38e5-216e944d198e
# ╠═49d1c3c6-3993-11eb-2e2a-c7c5d3a1de18
# ╠═844c0f66-3993-11eb-3a40-992da37a639e
# ╟─fa5d0822-3993-11eb-1152-a5e1333fe70f
# ╠═307f647c-3994-11eb-042d-47ae4e4779d2
# ╠═554503a2-3994-11eb-032d-9122491e1d55
# ╠═83b5aa0c-3994-11eb-1ede-7f3440865586
# ╟─0ce99b28-3995-11eb-08c8-7fe41c8e5ff6
# ╠═8f21a1b8-3995-11eb-189d-9b0dd552f1c8
# ╠═fb400814-3995-11eb-0fc8-372701323b2c
# ╠═d2915640-3998-11eb-22c5-adc30c539cd6
# ╟─8cb95f52-3b0d-11eb-209a-c5a6d17a89e7
# ╟─be8795d0-3b0d-11eb-18fd-9f7d61210ae2
# ╠═cbdb30a2-3b0d-11eb-1167-dbf6e87d80bd
# ╠═415b2066-3b0f-11eb-37c7-6fa74b7282b1
# ╠═880b3960-3b0f-11eb-33a5-a12e8248f6fe
# ╟─544c3254-43a8-11eb-2645-3d26e34bd982
# ╠═b8481434-43a9-11eb-3902-6d5426beda92
# ╠═011969ec-43aa-11eb-1545-6b63b42277fe
# ╟─b7ea70da-43a8-11eb-35b6-d1836e6849c5
# ╟─0e8db138-43a9-11eb-1dba-dbddcfdd10f7
# ╠═a3d69b10-43a9-11eb-1dba-03059f2afcb0
# ╠═afcb98a0-43a9-11eb-2bc6-cbd18349b749