updated docs to reflect PEP8 changes
* also fixed and updated several doc strings and comments, esp. docsim.py
piskvorky committed Jun 13, 2011
1 parent 482c73f commit 88f2a3b
Showing 68 changed files with 686 additions and 917 deletions.
1 change: 0 additions & 1 deletion docs/_sources/apiref.txt
@@ -13,7 +13,6 @@ Modules:
matutils
corpora/bleicorpus
corpora/dictionary
corpora/dmlcorpus
corpora/lowcorpus
corpora/mmcorpus
corpora/svmlightcorpus
6 changes: 3 additions & 3 deletions docs/_sources/dist_lda.txt
@@ -22,7 +22,7 @@ Run LDA like you normally would, but turn on the `distributed=True` constructor
parameter::

>>> # extract 100 LDA topics, using default parameters
>>> lda = LdaModel(corpus=mm, id2word=id2word, numTopics=100, distributed=True)
>>> lda = LdaModel(corpus=mm, id2word=id2word, num_topics=100, distributed=True)
using distributed version with 4 workers
running online LDA training, 100 topics, 1 passes over the supplied corpus of 3199665 documents, updating model once every 40000 documents
..
@@ -36,7 +36,7 @@ with `ATLAS <http://math-atlas.sourceforge.net/>`_), the wallclock time taken dr
To run standard batch LDA (no online updates of mini-batches) instead, you would similarly
call::

>>> lda = LdaModel(corpus=mm, id2word=id2token, numTopics=100, update_every=0, passes=20, distributed=True)
>>> lda = LdaModel(corpus=mm, id2word=id2token, num_topics=100, update_every=0, passes=20, distributed=True)
using distributed version with 4 workers
running batch LDA training, 100 topics, 20 passes over the supplied corpus of 3199665 documents, updating model once every 3199665 documents
initializing workers
@@ -52,7 +52,7 @@ and then, some two days later::

::

>>> lda.printTopics(20)
>>> lda.print_topics(20)
topic #0: 0.007*disease + 0.006*medical + 0.005*treatment + 0.005*cells + 0.005*cell + 0.005*cancer + 0.005*health + 0.005*blood + 0.004*patients + 0.004*drug
topic #1: 0.024*king + 0.013*ii + 0.013*prince + 0.013*emperor + 0.008*duke + 0.008*empire + 0.007*son + 0.007*china + 0.007*dynasty + 0.007*iii
topic #2: 0.031*film + 0.017*films + 0.005*movie + 0.005*directed + 0.004*man + 0.004*episode + 0.003*character + 0.003*cast + 0.003*father + 0.003*mother
20 changes: 10 additions & 10 deletions docs/_sources/dist_lsi.txt
@@ -45,9 +45,9 @@ won't be doing much with CPU most of the time, but pick a computer with ample m

And that's it! The cluster is set up and running, ready to accept jobs. To remove
a worker later on, simply terminate its `lsi_worker` process. To add another worker, run another
`lsi_worker` (this will not affect a computation that is already running). If you terminate
`lsi_dispatcher`, you won't be able to run computations until you run it again
(surviving workers can be re-used though).
`lsi_worker` (this will not affect a computation that is already running; worker additions/deletions are not dynamic).
If you terminate `lsi_dispatcher`, you won't be able to run computations until you run it again
(surviving worker processes can be re-used though).
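
For reference, a minimal sketch of what starting the pieces looks like from the shell
(assuming `gensim` is installed on every node and Pyro is configured as described above;
the exact invocation may differ on your setup)::

python -m gensim.models.lsi_worker &       # start one worker per node (repeat for more workers)
python -m gensim.models.lsi_dispatcher &   # start the single dispatcher, on any one machine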


Running LSA
@@ -65,7 +65,7 @@ our choice is incidental) and try::
>>> corpus = corpora.MmCorpus('/tmp/deerwester.mm') # load a corpus of nine documents, from the Tutorials
>>> id2word = corpora.Dictionary.load('/tmp/deerwester.dict')

>>> lsi = models.LsiModel(corpus, id2word=id2word, numTopics=200, chunks=1, distributed=True) # run distributed LSA on nine documents
>>> lsi = models.LsiModel(corpus, id2word=id2word, num_topics=200, chunks=1, distributed=True) # run distributed LSA on nine documents

This uses the corpus and feature-token mapping created in the :doc:`tut1` tutorial.
If you look at the log in your Python session, you should see a line similar to::
@@ -76,7 +76,7 @@ which means all went well. You can also check the logs coming from your worker a
processes --- this is especially helpful in case of problems.
To check the LSA results, let's print the first two latent topics::

>>> lsi.printTopics(numTopics=2, numWords=5)
>>> lsi.print_topics(num_topics=2, num_words=5)
topic #0(3.341): -0.644*"system" + -0.404*"user" + -0.301*"eps" + -0.265*"time" + -0.265*"response"
topic #1(2.542): -0.623*"graph" + -0.490*"trees" + -0.451*"minors" + -0.274*"survey" + 0.167*"system"

@@ -89,9 +89,9 @@ So let's run LSA on **one million documents** instead::
>>> # inflate the corpus to 1M documents, by repeating its documents over&over
>>> corpus1m = utils.RepeatCorpus(corpus, 1000000)
>>> # run distributed LSA on 1 million documents
>>> lsi1m = models.LsiModel(corpus1m, id2word=id2word, numTopics=200, chunks=10000, distributed=True)
>>> lsi1m = models.LsiModel(corpus1m, id2word=id2word, num_topics=200, chunks=10000, distributed=True)

>>> lsi1m.printTopics(numTopics=2, numWords=5)
>>> lsi1m.print_topics(num_topics=2, num_words=5)
topic #0(1113.628): 0.644*"system" + 0.404*"user" + 0.301*"eps" + 0.265*"time" + 0.265*"response"
topic #1(847.233): 0.623*"graph" + 0.490*"trees" + 0.451*"minors" + 0.274*"survey" + -0.167*"system"
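
Once trained, the distributed model is used exactly like a locally trained one. A quick
sketch (the query string below is made up for illustration)::

>>> doc_bow = id2word.doc2bow("graph trees computer survey".lower().split())
>>> print lsi1m[doc_bow]  # fold the query document into the 200-dimensional LSI space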

@@ -122,7 +122,7 @@ the corpus iterator with::
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

>>> # load id->word mapping (the dictionary)
>>> id2word = gensim.corpora.Dictionary.loadFromText('wiki_en_wordids.txt')
>>> id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
>>> # load corpus iterator
>>> mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
>>> # mm = gensim.corpora.MmCorpus(bz2.BZ2File('wiki_en_tfidf.mm.bz2')) # use this if you compressed the TFIDF output
@@ -133,10 +133,10 @@ the corpus iterator with::
Now we're ready to run distributed LSA on the English Wikipedia::

>>> # extract 400 LSI topics, using a cluster of nodes
>>> lsi = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, numTopics=400, chunks=20000, distributed=True)
>>> lsi = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=400, chunks=20000, distributed=True)

>>> # print the most contributing words (both positively and negatively) for each of the first ten topics
>>> lsi.printTopics(10)
>>> lsi.print_topics(10)
2010-11-03 16:08:27,602 : INFO : topic #0(200.990): -0.475*"delete" + -0.383*"deletion" + -0.275*"debate" + -0.223*"comments" + -0.220*"edits" + -0.213*"modify" + -0.208*"appropriate" + -0.194*"subsequent" + -0.155*"wp" + -0.117*"notability"
2010-11-03 16:08:27,626 : INFO : topic #1(143.129): -0.320*"diff" + -0.305*"link" + -0.199*"image" + -0.171*"www" + -0.162*"user" + 0.149*"delete" + -0.147*"undo" + -0.144*"contribs" + -0.122*"album" + 0.113*"deletion"
2010-11-03 16:08:27,651 : INFO : topic #2(135.665): -0.437*"diff" + -0.400*"link" + -0.202*"undo" + -0.192*"user" + -0.182*"www" + -0.176*"contribs" + 0.168*"image" + -0.109*"added" + 0.106*"album" + 0.097*"copyright"
2 changes: 1 addition & 1 deletion docs/_sources/distributed.txt
@@ -30,7 +30,7 @@ much communication going on), so the network is allowed to be of relatively high

To see what BLAS and LAPACK you are using, type into your shell::

python -c 'import scipy; scipy.show_config()'
python -c 'import numpy; numpy.show_config()'

Prerequisites
-----------------
14 changes: 7 additions & 7 deletions docs/_sources/index.txt
@@ -10,10 +10,10 @@ Gensim -- Python Framework for Vector Space Modelling

* 26/4/2011: version 0.7.8 is out! `CHANGELOG <https://github.com/piskvorky/gensim/blob/master/CHANGELOG.txt>`_
* 12/2/2011: faster and leaner **Latent Semantic Indexing (LSI)** and **Latent Dirichlet Allocation (LDA)**:

* :doc:`Processing the English Wikipedia <wiki>`, 3.2 million documents (`NIPS workshop paper <http://arxiv.org/abs/1102.5597>`_)
* :doc:`dist_lsi` & :doc:`dist_lda`

* 12/2/2011: Input corpus iterators can come from a compressed file (**bzip2**, **gzip**, ...), to save disk space when dealing with
very large corpora.

@@ -23,7 +23,7 @@ For **installation** and **troubleshooting**, see the :doc:`installation <instal

For **examples** on how to use it, try the :doc:`tutorials <tutorial>`.

When **citing** `gensim` in academic papers, please use
When **citing** `gensim` in academic papers, please use
`this BibTeX entry <http://nlp.fi.muni.cz/projekty/gensim/bibtex_gensim.bib>`_.


@@ -36,20 +36,20 @@ Quick Reference Example
>>> corpus = corpora.MmCorpus('/path/to/corpus.mm')
>>>
>>> # initialize a transformation (Latent Semantic Indexing with 200 latent dimensions)
>>> lsi = models.LsiModel(corpus, numTopics=200)
>>> lsi = models.LsiModel(corpus, num_topics=200)
>>>
>>> # convert another corpus to the latent space and index it
>>> index = similarities.MatrixSimilarity(lsi[another_corpus])
>>>
>>> # perform similarity query of a query in LSI space against the whole corpus
>>>
>>> # determine similarity of a query document against each document in the index
>>> sims = index[query]



.. toctree::
:hidden:
:maxdepth: 1

intro
install
tutorial
20 changes: 10 additions & 10 deletions docs/_sources/tut1.txt
@@ -6,7 +6,7 @@ Corpora and Vector Spaces
Don't forget to set

>>> import logging
>>> logging.root.setLevel(logging.INFO) # will suppress DEBUG level events
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

if you want to see logging events.

@@ -42,9 +42,9 @@ as well as words that only appear once in the corpus:
>>> for document in documents]
>>>
>>> # remove words that appear only once
>>> allTokens = sum(texts, [])
>>> tokensOnce = set(word for word in set(allTokens) if allTokens.count(word) == 1)
>>> texts = [[word for word in text if word not in tokensOnce]
>>> all_tokens = sum(texts, [])
>>> tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
>>> texts = [[word for word in text if word not in tokens_once]
>>> for text in texts]
>>>
>>> print texts
@@ -97,8 +97,8 @@ To see the mapping between words and their ids:

To actually convert tokenized documents to vectors:

>>> newDoc = "Human computer interaction"
>>> newVec = dictionary.doc2bow(newDoc.lower().split())
>>> new_doc = "Human computer interaction"
>>> new_vec = dictionary.doc2bow(new_doc.lower().split())
>>> print new_vec # the word "interaction" does not appear in the dictionary and is ignored
[(0, 1), (1, 1)]

@@ -132,7 +132,7 @@ Corpus Streaming -- One Document at a Time
Note that `corpus` above resides fully in memory, as a plain Python list.
In this simple example, it doesn't matter much, but just to make things clear,
let's assume there are millions of documents in the corpus. Storing all of them in RAM won't do.
Instead, the documents are stored in a file on disk, one document per line. Gensim
Instead, let's assume the documents are stored in a file on disk, one document per line. Gensim
only requires that a corpus must be able to return one document vector at a time::

>>> class MyCorpus(object):
@@ -152,8 +152,8 @@ then convert the tokens via a dictionary to their ids and yield the resulting sp
>>> print corpus_memory_friendly
<__main__.MyCorpus object at 0x10d5690>

Corpus is now an object. We didn't define any way to print it, so print just outputs address
of the object in memory. Not very useful. To see the constituent vectors, let's
Corpus is now an object. We didn't define any way to print it, so `print` just outputs the address
of the object in memory. Not very useful. To see the constituent vectors, let's
iterate over the corpus and print each document vector (one at a time)::

>>> for vector in corpus_memory_friendly: # load one vector into memory at a time
@@ -180,7 +180,7 @@ Similarly, to construct the dictionary without loading all texts into memory::
>>> stop_ids = [dictionary.token2id[stopword] for stopword in stoplist
>>> if stopword in dictionary.token2id]
>>> once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.iteritems() if docfreq == 1]
>>> dictionary.filterTokens(stop_ids + once_ids) # remove stop words and words that appear only once
>>> dictionary.filter_tokens(stop_ids + once_ids) # remove stop words and words that appear only once
>>> dictionary.compactify() # remove gaps in id sequence after words that were removed
>>> print dictionary
Dictionary(12 unique tokens)
29 changes: 15 additions & 14 deletions docs/_sources/tut2.txt
@@ -7,7 +7,7 @@ Topics and Transformations
Don't forget to set

>>> import logging
>>> logging.root.setLevel(logging.INFO) # will suppress DEBUG level events
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

if you want to see logging events.

@@ -28,7 +28,7 @@ into another. This process serves two goals:

1. To bring out hidden structure in the corpus, discover relationships between
words and use them to describe the documents in a new and
(hopefully) more realistic way.
(hopefully) more semantic way.
2. To make the document representation more compact. This both improves efficiency
(new representation consumes less resources) and efficacy (marginal data
trends are ignored, noise-reduction).
@@ -41,7 +41,7 @@ a :dfn:`training corpus`:

>>> tfidf = models.TfidfModel(corpus) # step 1 -- initialize a model

We used our old corpus to initialize (train) the transformation model. Different
We used our old corpus from tutorial 1 to initialize (train) the transformation model. Different
transformations may require different initialization parameters; in case of TfIdf, the
"training" consists simply of going through the supplied corpus once and computing document frequencies
of all its features. Training other models, such as Latent Semantic Analysis or Latent Dirichlet
Expand Down Expand Up @@ -101,14 +101,14 @@ folding-in for LSA, by topic inference for LDA etc.

Transformations can also be serialized, one on top of another, in a sort of chain:

>>> lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, numTopics=2) # initialize an LSI transformation
>>> lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2) # initialize an LSI transformation
>>> corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi

Here we transformed our Tf-Idf corpus via `Latent Semantic Indexing <http://en.wikipedia.org/wiki/Latent_semantic_indexing>`_
into a latent 2-D space (2-D because we set ``numTopics=2``). Now you're probably wondering: what do these two latent
dimensions stand for? Let's inspect with :func:`models.LsiModel.printTopics`:
into a latent 2-D space (2-D because we set ``num_topics=2``). Now you're probably wondering: what do these two latent
dimensions stand for? Let's inspect with :func:`models.LsiModel.print_topics`:

>>> lsi.printTopics(2)
>>> lsi.print_topics(2)
topic #0(1.594): -0.703*"trees" + -0.538*"graph" + -0.402*"minors" + -0.187*"survey" + -0.061*"system" + -0.060*"response" + -0.060*"time" + -0.058*"user" + -0.049*"computer" + -0.035*"interface"
topic #1(1.476): -0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"response" + -0.320*"time" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"
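
Since the wrappers are evaluated lazily, a simple way to inspect the documents in the new
latent space is to iterate over the chained corpus; a sketch (the bow->tfidf->lsi conversions
happen on the fly, one document at a time)::

>>> for doc in corpus_lsi:  # both bow->tfidf and tfidf->lsi transformations are executed here, on the fly
>>>     print doc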

@@ -166,24 +166,25 @@ Gensim implements several popular Vector Space Model algorithms:
2 latent dimensions, but on real corpora, target dimensionality of 200--500 is recommended
as a "golden standard" [1]_.

>>> model = lsimodel.LsiModel(tfidf_corpus, id2word=dictionary, numTopics=300)
>>> model = lsimodel.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300)

LSI training is unique in that we can continue "training" at any point, simply
by providing more training documents. This is done by incremental updates to
the underlying model, in a process called `online training`. Because of this feature, the
input document stream may even be infinite -- just keep feeding LSI new documents
as they arrive, while using the computed transformation model as read-only in the meanwhile!

>>> model.addDocuments(another_tfidf_corpus) # now LSI has been trained on tfidf_corpus + another_tfidf_corpus
>>> model.add_documents(another_tfidf_corpus) # now LSI has been trained on tfidf_corpus + another_tfidf_corpus
>>> lsi_vec = model[tfidf_vec] # convert some new document into the LSI space, without affecting the model
>>> ...
>>> model.addDocuments(more_documents) # tfidf_corpus + another_tfidf_corpus + more_documents
>>> model.add_documents(more_documents) # tfidf_corpus + another_tfidf_corpus + more_documents
>>> lsi_vec = model[tfidf_vec]
>>> ...

See the :mod:`gensim.models.lsimodel` documentation for details on how to make
LSI gradually "forget" old observations in infinite streams and how to tweak parameters
affecting speed vs. memory footprint vs. numerical precision of the algorithm.
LSI gradually "forget" old observations in infinite streams. If you want to get dirty,
there are also parameters you can tweak that affect speed vs. memory footprint vs. numerical
precision of the LSI algorithm.

`gensim` uses a novel online incremental streamed distributed training algorithm (quite a mouthful!),
which I published in [5]_. `gensim` also executes a stochastic multi-pass algorithm
Expand All @@ -197,7 +198,7 @@ Gensim implements several popular Vector Space Model algorithms:
CPU-friendly) approach to approximating TfIdf distances between documents, by throwing in a little randomness.
Recommended target dimensionality is again in the hundreds/thousands, depending on your dataset.

>>> model = rpmodel.RpModel(tfidf_corpus, numTopics=500)
>>> model = rpmodel.RpModel(tfidf_corpus, num_topics=500)

* `Latent Dirichlet Allocation, LDA <http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation>`_
is yet another transformation from bag-of-words counts into a topic space of lower
Expand All @@ -206,7 +207,7 @@ Gensim implements several popular Vector Space Model algorithms:
just like with LSA, inferred automatically from a training corpus. Documents
are in turn interpreted as a (soft) mixture of these topics (again, just like with LSA).

>>> model = ldamodel.LdaModel(bow_corpus, id2word=dictionary, numTopics=100)
>>> model = ldamodel.LdaModel(bow_corpus, id2word=dictionary, num_topics=100)

`gensim` uses a fast implementation of online LDA parameter estimation based on [2]_,
modified to run in :doc:`distributed mode <distributed>` on a cluster of computers.