TODO (forked from cemoody/lda2vec)
Try giving the prior higher importance
Change prob model to just model prob of word in topic
Add super simple explanatory models
Hook up RTD to docstrings
Remove spacy dep
Add an example script with 20 newsgroups -- LDA
Add visualization for topic-word
Add examples of specific documents
Add an example script with wiki word2vec data -- w2v
Add an example script with PTB word2vec data -- w2v
Add an example script with HN data -- LDA & w2v & ratings
Add better README
Add model saving
Add model predicting
Add bigramming
Change EmbedMixture naming to n_possible_values and n_latent_factors
Print out topics while training
Add doctests to lda2vec main classes
Randomize chunking order on fit
Add loss tracking and reporting classes to code
Finish filling out docstrings
Add multiple targets for one component
Implement skipgram contexts
Prevent mixing between documents
Add temperature to perplexity measurements
Add temperature to viz
Add convergence criterion
Add visualization for biclustering topics
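The temperature items above (perplexity and viz) can be sketched as a temperature-scaled softmax; the function name here is hypothetical, not part of lda2vec:

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    # Dividing logits by T sharpens (T < 1) or flattens (T > 1)
    # the resulting topic-word distribution.
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

p_sharp = softmax_with_temperature([2.0, 1.0, 0.5], temperature=0.5)
p_flat = softmax_with_temperature([2.0, 1.0, 0.5], temperature=5.0)
# Low temperature concentrates probability mass on the top word
assert p_sharp[0] > p_flat[0]
```

Reporting perplexity at several temperatures would show how peaked the learned topics are.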
Add docs on:
    Installation
    HN Tutorial
        Parse document into vector
        Setup LDA for document
        Measure perplexity
        Visualize topics
        Add supervised component
        Measure perplexity
        Visualize topics
        Add another component for time
        Measure perplexity
        Visualize topics
        Visualize topics, changing temperature
    Data formats
        Loose
        Compact
        Flat
    Contexts
        Categorical contexts
        Other contexts TBA
    Targets
        RMSE
        Logistic
        Softmax
    Advanced
        Options
        GPU
        Gradient Clipping
        Online learning, fraction argument
        Logging progress
        Perplexity
        Model saving, prediction
        Dropout fractions
Nomenclature
    Categorical Feature
        Each categorical feature has n_possible_values
        Each feature has n_latent_factors
        Each feature has a single target
    Components
        Each component defines the total number of documents and number of topics
        Each component may also have supervised targets
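The nomenclature above can be sketched as a minimal mixture embedding; this is an illustrative reading (class and attribute names are hypothetical), not the EmbedMixture implementation:

```python
import numpy as np

class CategoricalFeature:
    # A feature with n_possible_values categories; each category mixes
    # over n_latent_factors topic vectors to produce an embedding.
    def __init__(self, n_possible_values, n_latent_factors, n_dim, seed=0):
        rng = np.random.default_rng(seed)
        # One row of unnormalized topic weights per category value
        self.weights = rng.normal(size=(n_possible_values, n_latent_factors))
        # One embedding vector per latent factor (topic)
        self.factors = rng.normal(size=(n_latent_factors, n_dim))

    def embedding(self, value):
        # Softmax over the row gives a mixture over latent factors
        w = np.exp(self.weights[value] - self.weights[value].max())
        w /= w.sum()
        return w @ self.factors  # shape: (n_dim,)

feat = CategoricalFeature(n_possible_values=100, n_latent_factors=20, n_dim=300)
vec = feat.embedding(42)
assert vec.shape == (300,)
```

Under this reading, a document ID is one such categorical feature: n_possible_values documents, n_latent_factors topics.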
Done:
Add BoW mode
Add logger
Add fake data generator
Add perplexity measurements
Add tracking utility
Add utilities for converting corpora
Put license
Add masks / skips / pads
Add reindexing on the fly
Convert docstrings to numpy format
Implement corpus loose to dense and vice versa
Add fit function for all data at once
Add CI & coverage & license icons
Add readthedocs support
Add examples to CI
Add dropout
Change component naming to 'categorical feature'
Add linear layers between input latent and output context
Merge skipgram branch
Add topic numbers to topic print out
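The loose/compact/flat corpus formats named above can be sketched as follows; this is my reading of the terms with hypothetical helper names, and the actual Corpus API may differ:

```python
from collections import Counter
import numpy as np

# loose   -- word IDs are arbitrary (e.g. hash-based) and sparse
# compact -- word IDs are reindexed 0..V-1 by descending frequency
# flat    -- all documents concatenated into one long ID array

def loose_to_compact(docs_loose):
    # Count frequencies across all documents, then map the most
    # frequent word to the smallest compact index
    counts = Counter(w for doc in docs_loose for w in doc)
    mapping = {w: i for i, (w, _) in enumerate(counts.most_common())}
    return [[mapping[w] for w in doc] for doc in docs_loose], mapping

def compact_to_flat(docs_compact):
    # Concatenate documents; keep a doc-id array aligned token-for-token
    flat = np.concatenate([np.asarray(d) for d in docs_compact])
    doc_ids = np.concatenate(
        [np.full(len(d), i) for i, d in enumerate(docs_compact)])
    return flat, doc_ids

docs = [[901, 17, 901], [17, 5004]]
compact, mapping = loose_to_compact(docs)
flat, doc_ids = compact_to_flat(compact)
```

Keeping the doc_ids array aligned with the flat token array is what lets skipgram windows be masked so context never mixes between documents.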