updated docs to reflect PEP8 changes
* also fixed and updated several doc strings and comments, esp. docsim.py
piskvorky committed Jun 13, 2011
1 parent 482c73f commit 88f2a3b
Showing 68 changed files with 686 additions and 917 deletions.
1 change: 0 additions & 1 deletion docs/_sources/apiref.txt
@@ -13,7 +13,6 @@ Modules:
matutils
corpora/bleicorpus
corpora/dictionary
corpora/dmlcorpus
corpora/lowcorpus
corpora/mmcorpus
corpora/svmlightcorpus
6 changes: 3 additions & 3 deletions docs/_sources/dist_lda.txt
@@ -22,7 +22,7 @@ Run LDA like you normally would, but turn on the `distributed=True` constructor
parameter::

>>> # extract 100 LDA topics, using default parameters
>>> lda = LdaModel(corpus=mm, id2word=id2word, numTopics=100, distributed=True)
>>> lda = LdaModel(corpus=mm, id2word=id2word, num_topics=100, distributed=True)
using distributed version with 4 workers
running online LDA training, 100 topics, 1 passes over the supplied corpus of 3199665 documents, updating model once every 40000 documents
..
@@ -36,7 +36,7 @@ with `ATLAS <http://math-atlas.sourceforge.net/>`_), the wallclock time taken dr
To run standard batch LDA (no online updates of mini-batches) instead, you would similarly
call::

>>> lda = LdaModel(corpus=mm, id2word=id2token, numTopics=100, update_every=0, passes=20, distributed=True)
>>> lda = LdaModel(corpus=mm, id2word=id2token, num_topics=100, update_every=0, passes=20, distributed=True)
using distributed version with 4 workers
running batch LDA training, 100 topics, 20 passes over the supplied corpus of 3199665 documents, updating model once every 3199665 documents
initializing workers
@@ -52,7 +52,7 @@ and then, some two days later::

::

>>> lda.printTopics(20)
>>> lda.print_topics(20)
topic #0: 0.007*disease + 0.006*medical + 0.005*treatment + 0.005*cells + 0.005*cell + 0.005*cancer + 0.005*health + 0.005*blood + 0.004*patients + 0.004*drug
topic #1: 0.024*king + 0.013*ii + 0.013*prince + 0.013*emperor + 0.008*duke + 0.008*empire + 0.007*son + 0.007*china + 0.007*dynasty + 0.007*iii
topic #2: 0.031*film + 0.017*films + 0.005*movie + 0.005*directed + 0.004*man + 0.004*episode + 0.003*character + 0.003*cast + 0.003*father + 0.003*mother
20 changes: 10 additions & 10 deletions docs/_sources/dist_lsi.txt
@@ -45,9 +45,9 @@ won't be doing much with CPU most of the time, but pick a computer with ample m

And that's it! The cluster is set up and running, ready to accept jobs. To remove
a worker later on, simply terminate its `lsi_worker` process. To add another worker, run another
`lsi_worker` (this will not affect a computation that is already running). If you terminate
`lsi_dispatcher`, you won't be able to run computations until you run it again
(surviving workers can be re-used though).
`lsi_worker` (this will not affect a computation that is already running; worker additions/deletions are not dynamic).
If you terminate `lsi_dispatcher`, you won't be able to run computations until you run it again
(surviving worker processes can be re-used though).
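
For reference, a minimal sketch of what starting the pieces looks like from the shell
(assuming `gensim` is installed on every node and Pyro is configured as described above;
the exact invocation may differ on your setup)::

python -m gensim.models.lsi_worker &       # start one worker per node (repeat for more workers)
python -m gensim.models.lsi_dispatcher &   # start the single dispatcher, on any one machine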


Running LSA
@@ -65,7 +65,7 @@ our choice is incidental) and try::
>>> corpus = corpora.MmCorpus('/tmp/deerwester.mm') # load a corpus of nine documents, from the Tutorials
>>> id2word = corpora.Dictionary.load('/tmp/deerwester.dict')

>>> lsi = models.LsiModel(corpus, id2word=id2word, numTopics=200, chunks=1, distributed=True) # run distributed LSA on nine documents
>>> lsi = models.LsiModel(corpus, id2word=id2word, num_topics=200, chunks=1, distributed=True) # run distributed LSA on nine documents

This uses the corpus and feature-token mapping created in the :doc:`tut1` tutorial.
If you look at the log in your Python session, you should see a line similar to::
@@ -76,7 +76,7 @@ which means all went well. You can also check the logs coming from your worker a
processes --- this is especially helpful in case of problems.
To check the LSA results, let's print the first two latent topics::

>>> lsi.printTopics(numTopics=2, numWords=5)
>>> lsi.print_topics(num_topics=2, num_words=5)
topic #0(3.341): -0.644*"system" + -0.404*"user" + -0.301*"eps" + -0.265*"time" + -0.265*"response"
topic #1(2.542): -0.623*"graph" + -0.490*"trees" + -0.451*"minors" + -0.274*"survey" + 0.167*"system"

@@ -89,9 +89,9 @@ So let's run LSA on **one million documents** instead::
>>> # inflate the corpus to 1M documents, by repeating its documents over&over
>>> corpus1m = utils.RepeatCorpus(corpus, 1000000)
>>> # run distributed LSA on 1 million documents
>>> lsi1m = models.LsiModel(corpus1m, id2word=id2word, numTopics=200, chunks=10000, distributed=True)
>>> lsi1m = models.LsiModel(corpus1m, id2word=id2word, num_topics=200, chunks=10000, distributed=True)

>>> lsi1m.printTopics(numTopics=2, numWords=5)
>>> lsi1m.print_topics(num_topics=2, num_words=5)
topic #0(1113.628): 0.644*"system" + 0.404*"user" + 0.301*"eps" + 0.265*"time" + 0.265*"response"
topic #1(847.233): 0.623*"graph" + 0.490*"trees" + 0.451*"minors" + 0.274*"survey" + -0.167*"system"
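
Once trained, the distributed model is used exactly like a locally trained one. A quick
sketch (the query string below is made up for illustration)::

>>> doc_bow = id2word.doc2bow("graph trees computer survey".lower().split())
>>> print lsi1m[doc_bow]  # fold the query document into the 200-dimensional LSI space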

@@ -122,7 +122,7 @@ the corpus iterator with::
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

>>> # load id->word mapping (the dictionary)
>>> id2word = gensim.corpora.Dictionary.loadFromText('wiki_en_wordids.txt')
>>> id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
>>> # load corpus iterator
>>> mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
>>> # mm = gensim.corpora.MmCorpus(bz2.BZ2File('wiki_en_tfidf.mm.bz2')) # use this if you compressed the TFIDF output
@@ -133,10 +133,10 @@ the corpus iterator with::
Now we're ready to run distributed LSA on the English Wikipedia::

>>> # extract 400 LSI topics, using a cluster of nodes
>>> lsi = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, numTopics=400, chunks=20000, distributed=True)
>>> lsi = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=400, chunks=20000, distributed=True)

>>> # print the most contributing words (both positively and negatively) for each of the first ten topics
>>> lsi.printTopics(10)
>>> lsi.print_topics(10)
2010-11-03 16:08:27,602 : INFO : topic #0(200.990): -0.475*"delete" + -0.383*"deletion" + -0.275*"debate" + -0.223*"comments" + -0.220*"edits" + -0.213*"modify" + -0.208*"appropriate" + -0.194*"subsequent" + -0.155*"wp" + -0.117*"notability"
2010-11-03 16:08:27,626 : INFO : topic #1(143.129): -0.320*"diff" + -0.305*"link" + -0.199*"image" + -0.171*"www" + -0.162*"user" + 0.149*"delete" + -0.147*"undo" + -0.144*"contribs" + -0.122*"album" + 0.113*"deletion"
2010-11-03 16:08:27,651 : INFO : topic #2(135.665): -0.437*"diff" + -0.400*"link" + -0.202*"undo" + -0.192*"user" + -0.182*"www" + -0.176*"contribs" + 0.168*"image" + -0.109*"added" + 0.106*"album" + 0.097*"copyright"
2 changes: 1 addition & 1 deletion docs/_sources/distributed.txt
@@ -30,7 +30,7 @@ much communication going on), so the network is allowed to be of relatively high

To see what BLAS and LAPACK you are using, type into your shell::

python -c 'import scipy; scipy.show_config()'
python -c 'import numpy; numpy.show_config()'

Prerequisites
-----------------
14 changes: 7 additions & 7 deletions docs/_sources/index.txt
@@ -10,10 +10,10 @@ Gensim -- Python Framework for Vector Space Modelling

* 26/4/2011: version 0.7.8 is out! `CHANGELOG <https://github.com/piskvorky/gensim/blob/master/CHANGELOG.txt>`_
* 12/2/2011: faster and leaner **Latent Semantic Indexing (LSI)** and **Latent Dirichlet Allocation (LDA)**:

* :doc:`Processing the English Wikipedia <wiki>`, 3.2 million documents (`NIPS workshop paper <http://arxiv.org/abs/1102.5597>`_)
* :doc:`dist_lsi` & :doc:`dist_lda`

* 12/2/2011: Input corpus iterators can come from a compressed file (**bzip2**, **gzip**, ...), to save disk space when dealing with
very large corpora.

@@ -23,7 +23,7 @@ For **installation** and **troubleshooting**, see the :doc:`installation <instal

For **examples** on how to use it, try the :doc:`tutorials <tutorial>`.

When **citing** `gensim` in academic papers, please use
When **citing** `gensim` in academic papers, please use
`this BibTeX entry <http://nlp.fi.muni.cz/projekty/gensim/bibtex_gensim.bib>`_.


@@ -36,20 +36,20 @@ Quick Reference Example
>>> corpus = corpora.MmCorpus('/path/to/corpus.mm')
>>>
>>> # initialize a transformation (Latent Semantic Indexing with 200 latent dimensions)
>>> lsi = models.LsiModel(corpus, numTopics=200)
>>> lsi = models.LsiModel(corpus, num_topics=200)
>>>
>>> # convert another corpus to the latent space and index it
>>> index = similarities.MatrixSimilarity(lsi[another_corpus])
>>>
>>> # perform similarity query of a query in LSI space against the whole corpus
>>>
>>> # determine similarity of a query document against each document in the index
>>> sims = index[query]



.. toctree::
:hidden:
:maxdepth: 1

intro
install
tutorial
20 changes: 10 additions & 10 deletions docs/_sources/tut1.txt
@@ -6,7 +6,7 @@ Corpora and Vector Spaces
Don't forget to set

>>> import logging
>>> logging.root.setLevel(logging.INFO) # will suppress DEBUG level events
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

if you want to see logging events.

@@ -42,9 +42,9 @@ as well as words that only appear once in the corpus:
>>> for document in documents]
>>>
>>> # remove words that appear only once
>>> allTokens = sum(texts, [])
>>> tokensOnce = set(word for word in set(allTokens) if allTokens.count(word) == 1)
>>> texts = [[word for word in text if word not in tokensOnce]
>>> all_tokens = sum(texts, [])
>>> tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
>>> texts = [[word for word in text if word not in tokens_once]
>>> for text in texts]
>>>
>>> print texts
@@ -97,8 +97,8 @@ To see the mapping between words and their ids:

To actually convert tokenized documents to vectors:

>>> newDoc = "Human computer interaction"
>>> newVec = dictionary.doc2bow(newDoc.lower().split())
>>> new_doc = "Human computer interaction"
>>> new_vec = dictionary.doc2bow(new_doc.lower().split())
>>> print new_vec # the word "interaction" does not appear in the dictionary and is ignored
[(0, 1), (1, 1)]

@@ -132,7 +132,7 @@ Corpus Streaming -- One Document at a Time
Note that `corpus` above resides fully in memory, as a plain Python list.
In this simple example, it doesn't matter much, but just to make things clear,
let's assume there are millions of documents in the corpus. Storing all of them in RAM won't do.
Instead, the documents are stored in a file on disk, one document per line. Gensim
Instead, let's assume the documents are stored in a file on disk, one document per line. Gensim
only requires that a corpus must be able to return one document vector at a time::

>>> class MyCorpus(object):
@@ -152,8 +152,8 @@ then convert the tokens via a dictionary to their ids and yield the resulting sp
>>> print corpus_memory_friendly
<__main__.MyCorpus object at 0x10d5690>

Corpus is now an object. We didn't define any way to print it, so print just outputs address
of the object in memory. Not very useful. To see the constituent vectors, let's
Corpus is now an object. We didn't define any way to print it, so `print` just outputs the address
of the object in memory. Not very useful. To see the constituent vectors, let's
iterate over the corpus and print each document vector (one at a time)::

>>> for vector in corpus_memory_friendly: # load one vector into memory at a time
@@ -180,7 +180,7 @@ Similarly, to construct the dictionary without loading all texts into memory::
>>> stop_ids = [dictionary.token2id[stopword] for stopword in stoplist
>>> if stopword in dictionary.token2id]
>>> once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.iteritems() if docfreq == 1]
>>> dictionary.filterTokens(stop_ids + once_ids) # remove stop words and words that appear only once
>>> dictionary.filter_tokens(stop_ids + once_ids) # remove stop words and words that appear only once
>>> dictionary.compactify() # remove gaps in id sequence after words that were removed
>>> print dictionary
Dictionary(12 unique tokens)
29 changes: 15 additions & 14 deletions docs/_sources/tut2.txt
@@ -7,7 +7,7 @@ Topics and Transformations
Don't forget to set

>>> import logging
>>> logging.root.setLevel(logging.INFO) # will suppress DEBUG level events
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

if you want to see logging events.

@@ -28,7 +28,7 @@ into another. This process serves two goals:

1. To bring out hidden structure in the corpus, discover relationships between
words and use them to describe the documents in a new and
(hopefully) more realistic way.
(hopefully) more semantic way.
2. To make the document representation more compact. This both improves efficiency
(new representation consumes less resources) and efficacy (marginal data
trends are ignored, noise-reduction).
@@ -41,7 +41,7 @@ a :dfn:`training corpus`:

>>> tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model

We used our old corpus to initialize (train) the transformation model. Different
We used our old corpus from tutorial 1 to initialize (train) the transformation model. Different
transformations may require different initialization parameters; in case of TfIdf, the
"training" consists simply of going through the supplied corpus once and computing document frequencies
of all its features. Training other models, such as Latent Semantic Analysis or Latent Dirichlet
Expand Down Expand Up @@ -101,14 +101,14 @@ folding-in for LSA, by topic inference for LDA etc.

Transformations can also be serialized, one on top of another, in a sort of chain:

>>> lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, numTopics=2) # initialize an LSI transformation
>>> lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2) # initialize an LSI transformation
>>> corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi

Here we transformed our Tf-Idf corpus via `Latent Semantic Indexing <http://en.wikipedia.org/wiki/Latent_semantic_indexing>`_
into a latent 2-D space (2-D because we set ``numTopics=2``). Now you're probably wondering: what do these two latent
dimensions stand for? Let's inspect with :func:`models.LsiModel.printTopics`:
into a latent 2-D space (2-D because we set ``num_topics=2``). Now you're probably wondering: what do these two latent
dimensions stand for? Let's inspect with :func:`models.LsiModel.print_topics`:

>>> lsi.printTopics(2)
>>> lsi.print_topics(2)
topic #0(1.594): -0.703*"trees" + -0.538*"graph" + -0.402*"minors" + -0.187*"survey" + -0.061*"system" + -0.060*"response" + -0.060*"time" + -0.058*"user" + -0.049*"computer" + -0.035*"interface"
topic #1(1.476): -0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"response" + -0.320*"time" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"
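
Since the wrappers are evaluated lazily, a simple way to inspect the documents in the new
latent space is to iterate over the chained corpus; a sketch (the bow->tfidf->lsi conversions
happen on the fly, one document at a time)::

>>> for doc in corpus_lsi:  # both bow->tfidf and tfidf->lsi transformations are executed here, on the fly
>>>     print doc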

@@ -166,24 +166,25 @@ Gensim implements several popular Vector Space Model algorithms:
2 latent dimensions, but on real corpora, target dimensionality of 200--500 is recommended
as a "golden standard" [1]_.

>>> model = lsimodel.LsiModel(tfidf_corpus, id2word=dictionary, numTopics=300)
>>> model = lsimodel.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300)

LSI training is unique in that we can continue "training" at any point, simply
by providing more training documents. This is done by incremental updates to
the underlying model, in a process called `online training`. Because of this feature, the
input document stream may even be infinite -- just keep feeding LSI new documents
as they arrive, while using the computed transformation model as read-only in the meanwhile!

>>> model.addDocuments(another_tfidf_corpus) # now LSI has been trained on tfidf_corpus + another_tfidf_corpus
>>> model.add_documents(another_tfidf_corpus) # now LSI has been trained on tfidf_corpus + another_tfidf_corpus
>>> lsi_vec = model[tfidf_vec] # convert some new document into the LSI space, without affecting the model
>>> ...
>>> model.addDocuments(more_documents) # tfidf_corpus + another_tfidf_corpus + more_documents
>>> model.add_documents(more_documents) # tfidf_corpus + another_tfidf_corpus + more_documents
>>> lsi_vec = model[tfidf_vec]
>>> ...

See the :mod:`gensim.models.lsimodel` documentation for details on how to make
LSI gradually "forget" old observations in infinite streams and how to tweak parameters
affecting speed vs. memory footprint vs. numerical precision of the algorithm.
LSI gradually "forget" old observations in infinite streams. If you want to get dirty,
there are also parameters you can tweak that affect speed vs. memory footprint vs. numerical
precision of the LSI algorithm.

`gensim` uses a novel online incremental streamed distributed training algorithm (quite a mouthful!),
which I published in [5]_. `gensim` also executes a stochastic multi-pass algorithm
Expand All @@ -197,7 +198,7 @@ Gensim implements several popular Vector Space Model algorithms:
CPU-friendly) approach to approximating TfIdf distances between documents, by throwing in a little randomness.
Recommended target dimensionality is again in the hundreds/thousands, depending on your dataset.

>>> model = rpmodel.RpModel(tfidf_corpus, numTopics=500)
>>> model = rpmodel.RpModel(tfidf_corpus, num_topics=500)

* `Latent Dirichlet Allocation, LDA <http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation>`_
is yet another transformation from bag-of-words counts into a topic space of lower
Expand All @@ -206,7 +207,7 @@ Gensim implements several popular Vector Space Model algorithms:
just like with LSA, inferred automatically from a training corpus. Documents
are in turn interpreted as a (soft) mixture of these topics (again, just like with LSA).

>>> model = ldamodel.LdaModel(bow_corpus, id2word=dictionary, numTopics=100)
>>> model = ldamodel.LdaModel(bow_corpus, id2word=dictionary, num_topics=100)

`gensim` uses a fast implementation of online LDA parameter estimation based on [2]_,
modified to run in :doc:`distributed mode <distributed>` on a cluster of computers.