diff --git a/CHANGELOG.md b/CHANGELOG.md index 70cdbb2694..143f669e96 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -22,6 +22,7 @@ Changes * [#3125](https://github.com/RaRe-Technologies/gensim/pull/3125): Improve & unify docs for dirichlet priors, by [@jonaschn](https://github.com/jonaschn) * [#3133](https://github.com/RaRe-Technologies/gensim/pull/3133): Update link to Hoffman paper (online VB LDA), by [@jonaschn](https://github.com/jonaschn) * [#3141](https://github.com/RaRe-Technologies/gensim/pull/3141): Update link for online LDA paper, by [@dymil](https://github.com/dymil) +* [#3148](https://github.com/RaRe-Technologies/gensim/pull/3148): Fix broken link in documentation, by [@rohit901](https://github.com/rohit901) * [#3155](https://github.com/RaRe-Technologies/gensim/pull/3155): Correct parameter name in documentation of fasttext.py, by [@bizzyvinci](https://github.com/bizzyvinci) ## 4.0.1, 2021-04-01 diff --git a/docs/src/auto_examples/core/images/sphx_glr_run_corpora_and_vector_spaces_001.png b/docs/src/auto_examples/core/images/sphx_glr_run_corpora_and_vector_spaces_001.png index 807a84c4de..5c86a24471 100644 Binary files a/docs/src/auto_examples/core/images/sphx_glr_run_corpora_and_vector_spaces_001.png and b/docs/src/auto_examples/core/images/sphx_glr_run_corpora_and_vector_spaces_001.png differ diff --git a/docs/src/auto_examples/core/images/thumb/sphx_glr_run_corpora_and_vector_spaces_thumb.png b/docs/src/auto_examples/core/images/thumb/sphx_glr_run_corpora_and_vector_spaces_thumb.png index 5a8e564326..bab9aec4a4 100644 Binary files a/docs/src/auto_examples/core/images/thumb/sphx_glr_run_corpora_and_vector_spaces_thumb.png and b/docs/src/auto_examples/core/images/thumb/sphx_glr_run_corpora_and_vector_spaces_thumb.png differ diff --git a/docs/src/auto_examples/core/run_corpora_and_vector_spaces.ipynb b/docs/src/auto_examples/core/run_corpora_and_vector_spaces.ipynb index 40a3324206..875db7b507 100644 --- a/docs/src/auto_examples/core/run_corpora_and_vector_spaces.ipynb +++ b/docs/src/auto_examples/core/run_corpora_and_vector_spaces.ipynb @@ -15,7 +15,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\nCorpora and Vector Spaces\n=========================\n\nDemonstrates transforming text into a vector space representation.\n\nAlso introduces corpus streaming and persistence to disk in various formats.\n\n" + "\n# Corpora and Vector Spaces\n\nDemonstrates transforming text into a vector space representation.\n\nAlso introduces corpus streaming and persistence to disk in various formats.\n" ] }, { @@ -33,7 +33,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "First, let\u2019s create a small corpus of nine short documents [1]_:\n\n\nFrom Strings to Vectors\n------------------------\n\nThis time, let's start from documents represented as strings:\n\n\n" + "First, let\u2019s create a small corpus of nine short documents [1]_:\n\n\n## From Strings to Vectors\n\nThis time, let's start from documents represented as strings:\n\n\n" ] }, { @@ -141,7 +141,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "By now it should be clear that the vector feature with ``id=10`` stands for the question \"How many\ntimes does the word `graph` appear in the document?\" and that the answer is \"zero\" for\nthe first six documents and \"one\" for the remaining three.\n\n\nCorpus Streaming -- One Document at a Time\n-------------------------------------------\n\nNote that `corpus` above resides fully in memory, as a plain Python list.\nIn this simple example, it 
doesn't matter much, but just to make things clear,\nlet's assume there are millions of documents in the corpus. Storing all of them in RAM won't do.\nInstead, let's assume the documents are stored in a file on disk, one document per line. Gensim\nonly requires that a corpus must be able to return one document vector at a time:\n\n\n" + "By now it should be clear that the vector feature with ``id=10`` stands for the question \"How many\ntimes does the word `graph` appear in the document?\" and that the answer is \"zero\" for\nthe first six documents and \"one\" for the remaining three.\n\n\n## Corpus Streaming -- One Document at a Time\n\nNote that `corpus` above resides fully in memory, as a plain Python list.\nIn this simple example, it doesn't matter much, but just to make things clear,\nlet's assume there are millions of documents in the corpus. Storing all of them in RAM won't do.\nInstead, let's assume the documents are stored in a file on disk, one document per line. Gensim\nonly requires that a corpus must be able to return one document vector at a time:\n\n\n" ] }, { @@ -152,7 +152,7 @@ }, "outputs": [], "source": [ - "from smart_open import open # for transparently opening remote files\n\n\nclass MyCorpus:\n def __iter__(self):\n for line in open('https://radimrehurek.com/gensim/mycorpus.txt'):\n # assume there's one document per line, tokens separated by whitespace\n yield dictionary.doc2bow(line.lower().split())" + "from smart_open import open # for transparently opening remote files\n\n\nclass MyCorpus:\n def __iter__(self):\n for line in open('https://radimrehurek.com/mycorpus.txt'):\n # assume there's one document per line, tokens separated by whitespace\n yield dictionary.doc2bow(line.lower().split())" ] }, { @@ -177,7 +177,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Download the sample `mycorpus.txt file here <./mycorpus.txt>`_. The assumption that\neach document occupies one line in a single file is not important; you can mold\nthe `__iter__` function to fit your input format, whatever it is.\nWalking directories, parsing XML, accessing the network...\nJust parse your input to retrieve a clean list of tokens in each document,\nthen convert the tokens via a dictionary to their ids and yield the resulting sparse vector inside `__iter__`.\n\n" + "Download the sample `mycorpus.txt file here <https://radimrehurek.com/mycorpus.txt>`_. 
The assumption that\neach document occupies one line in a single file is not important; you can mold\nthe `__iter__` function to fit your input format, whatever it is.\nWalking directories, parsing XML, accessing the network...\nJust parse your input to retrieve a clean list of tokens in each document,\nthen convert the tokens via a dictionary to their ids and yield the resulting sparse vector inside `__iter__`.\n\n" ] }, { @@ -224,14 +224,14 @@ }, "outputs": [], "source": [ - "# collect statistics about all tokens\ndictionary = corpora.Dictionary(line.lower().split() for line in open('https://radimrehurek.com/gensim/mycorpus.txt'))\n# remove stop words and words that appear only once\nstop_ids = [\n dictionary.token2id[stopword]\n for stopword in stoplist\n if stopword in dictionary.token2id\n]\nonce_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]\ndictionary.filter_tokens(stop_ids + once_ids) # remove stop words and words that appear only once\ndictionary.compactify() # remove gaps in id sequence after words that were removed\nprint(dictionary)" + "# collect statistics about all tokens\ndictionary = corpora.Dictionary(line.lower().split() for line in open('https://radimrehurek.com/mycorpus.txt'))\n# remove stop words and words that appear only once\nstop_ids = [\n dictionary.token2id[stopword]\n for stopword in stoplist\n if stopword in dictionary.token2id\n]\nonce_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]\ndictionary.filter_tokens(stop_ids + once_ids) # remove stop words and words that appear only once\ndictionary.compactify() # remove gaps in id sequence after words that were removed\nprint(dictionary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "And that is all there is to it! At least as far as bag-of-words representation is concerned.\nOf course, what we do with such a corpus is another question; it is not at all clear\nhow counting the frequency of distinct words could be useful. As it turns out, it isn't, and\nwe will need to apply a transformation on this simple representation first, before\nwe can use it to compute any meaningful document vs. document similarities.\nTransformations are covered in the next tutorial\n(`sphx_glr_auto_examples_core_run_topics_and_transformations.py`),\nbut before that, let's briefly turn our attention to *corpus persistency*.\n\n\nCorpus Formats\n---------------\n\nThere exist several file formats for serializing a Vector Space corpus (~sequence of vectors) to disk.\n`Gensim` implements them via the *streaming corpus interface* mentioned earlier:\ndocuments are read from (resp. stored to) disk in a lazy fashion, one document at\na time, without the whole corpus being read into main memory at once.\n\nOne of the more notable file formats is the `Market Matrix format `_.\nTo save a corpus in the Matrix Market format:\n\ncreate a toy corpus of 2 documents, as a plain Python list\n\n" + "And that is all there is to it! At least as far as bag-of-words representation is concerned.\nOf course, what we do with such a corpus is another question; it is not at all clear\nhow counting the frequency of distinct words could be useful. As it turns out, it isn't, and\nwe will need to apply a transformation on this simple representation first, before\nwe can use it to compute any meaningful document vs. 
document similarities.\nTransformations are covered in the next tutorial\n(`sphx_glr_auto_examples_core_run_topics_and_transformations.py`),\nbut before that, let's briefly turn our attention to *corpus persistency*.\n\n\n## Corpus Formats\n\nThere exist several file formats for serializing a Vector Space corpus (~sequence of vectors) to disk.\n`Gensim` implements them via the *streaming corpus interface* mentioned earlier:\ndocuments are read from (resp. stored to) disk in a lazy fashion, one document at\na time, without the whole corpus being read into main memory at once.\n\nOne of the more notable file formats is the `Matrix Market format <https://math.nist.gov/MatrixMarket/>`_.\nTo save a corpus in the Matrix Market format:\n\ncreate a toy corpus of 2 documents, as a plain Python list\n\n" ] }, { @@ -357,7 +357,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In this way, `gensim` can also be used as a memory-efficient **I/O format conversion tool**:\njust load a document stream using one format and immediately save it in another format.\nAdding new formats is dead easy, check out the `code for the SVMlight corpus\n`_ for an example.\n\nCompatibility with NumPy and SciPy\n----------------------------------\n\nGensim also contains `efficient utility functions `_\nto help converting from/to numpy matrices\n\n" + "In this way, `gensim` can also be used as a memory-efficient **I/O format conversion tool**:\njust load a document stream using one format and immediately save it in another format.\nAdding new formats is dead easy, check out the `code for the SVMlight corpus\n`_ for an example.\n\n## Compatibility with NumPy and SciPy\n\nGensim also contains `efficient utility functions `_\nto help converting from/to numpy matrices\n\n" ] }, { @@ -393,7 +393,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "What Next\n---------\n\nRead about `sphx_glr_auto_examples_core_run_topics_and_transformations.py`.\n\nReferences\n----------\n\nFor a complete reference (Want to prune the dictionary to a smaller size?\nOptimize converting between corpora and NumPy/SciPy arrays?), see the `apiref`.\n\n.. [1] This is the same corpus as used in\n `Deerwester et al. (1990): Indexing by Latent Semantic Analysis `_, Table 2.\n\n" + "## What Next\n\nRead about `sphx_glr_auto_examples_core_run_topics_and_transformations.py`.\n\n## References\n\nFor a complete reference (Want to prune the dictionary to a smaller size?\nOptimize converting between corpora and NumPy/SciPy arrays?), see the `apiref`.\n\n.. [1] This is the same corpus as used in\n `Deerwester et al. (1990): Indexing by Latent Semantic Analysis `_, Table 2.\n\n" ] }, { @@ -424,7 +424,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.8.5" } }, "nbformat": 4, diff --git a/docs/src/auto_examples/core/run_corpora_and_vector_spaces.py b/docs/src/auto_examples/core/run_corpora_and_vector_spaces.py index 0a49614123..983a9d1235 100644 --- a/docs/src/auto_examples/core/run_corpora_and_vector_spaces.py +++ b/docs/src/auto_examples/core/run_corpora_and_vector_spaces.py @@ -138,7 +138,7 @@ class MyCorpus: def __iter__(self): - for line in open('https://radimrehurek.com/gensim/mycorpus.txt'): + for line in open('https://radimrehurek.com/mycorpus.txt'): # assume there's one document per line, tokens separated by whitespace yield dictionary.doc2bow(line.lower().split()) @@ -154,7 +154,7 @@ def __iter__(self): # in RAM at once. You can even create the documents on the fly! 
############################################################################### -# Download the sample `mycorpus.txt file here <./mycorpus.txt>`_. The assumption that +# Download the sample `mycorpus.txt file here <https://radimrehurek.com/mycorpus.txt>`_. The assumption that # each document occupies one line in a single file is not important; you can mold # the `__iter__` function to fit your input format, whatever it is. # Walking directories, parsing XML, accessing the network... @@ -180,7 +180,7 @@ def __iter__(self): # Similarly, to construct the dictionary without loading all texts into memory: # collect statistics about all tokens -dictionary = corpora.Dictionary(line.lower().split() for line in open('https://radimrehurek.com/gensim/mycorpus.txt')) +dictionary = corpora.Dictionary(line.lower().split() for line in open('https://radimrehurek.com/mycorpus.txt')) # remove stop words and words that appear only once stop_ids = [ dictionary.token2id[stopword] diff --git a/docs/src/auto_examples/core/run_corpora_and_vector_spaces.py.md5 b/docs/src/auto_examples/core/run_corpora_and_vector_spaces.py.md5 index 935e0357af..174fe2a139 100644 --- a/docs/src/auto_examples/core/run_corpora_and_vector_spaces.py.md5 +++ b/docs/src/auto_examples/core/run_corpora_and_vector_spaces.py.md5 @@ -1 +1 @@ -6b98413399bca9fd1ed8fe420da85692 \ No newline at end of file +55a8a886f05e5005c5f66d57569ee79d \ No newline at end of file diff --git a/docs/src/auto_examples/core/run_corpora_and_vector_spaces.rst b/docs/src/auto_examples/core/run_corpora_and_vector_spaces.rst index 7f8d25cfec..3cc549dd65 100644 --- a/docs/src/auto_examples/core/run_corpora_and_vector_spaces.rst +++ b/docs/src/auto_examples/core/run_corpora_and_vector_spaces.rst @@ -1,12 +1,21 @@ + +.. DO NOT EDIT. +.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. +.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: +.. "auto_examples/core/run_corpora_and_vector_spaces.py" +.. LINE NUMBERS ARE GIVEN BELOW. + .. only:: html .. note:: :class: sphx-glr-download-link-note - Click :ref:`here <sphx_glr_download_auto_examples_core_run_corpora_and_vector_spaces.py>` to download the full example code - .. rst-class:: sphx-glr-example-title + Click :ref:`here <sphx_glr_download_auto_examples_core_run_corpora_and_vector_spaces.py>` + to download the full example code + +.. rst-class:: sphx-glr-example-title - .. _sphx_glr_auto_examples_core_run_corpora_and_vector_spaces.py: +.. _sphx_glr_auto_examples_core_run_corpora_and_vector_spaces.py: Corpora and Vector Spaces @@ -16,6 +25,7 @@ Demonstrates transforming text into a vector space representation. Also introduces corpus streaming and persistence to disk in various formats. +.. GENERATED FROM PYTHON SOURCE LINES 9-13 .. code-block:: default @@ -30,6 +40,8 @@ Also introduces corpus streaming and persistence to disk in various formats. +.. GENERATED FROM PYTHON SOURCE LINES 14-23 First, let’s create a small corpus of nine short documents [1]_: .. _second example: @@ -40,6 +52,7 @@ From Strings to Vectors This time, let's start from documents represented as strings: +.. GENERATED FROM PYTHON SOURCE LINES 23-35 .. code-block:: default @@ -62,11 +75,14 @@ This time, let's start from documents represented as strings: +.. GENERATED FROM PYTHON SOURCE LINES 36-40 + This is a tiny corpus of nine documents, each consisting of only a single sentence. First, let's tokenize the documents, remove common words (using a toy stoplist) as well as words that only appear once in the corpus: +.. GENERATED FROM PYTHON SOURCE LINES 40-64 .. code-block:: default @@ -117,6 +133,8 @@ as well as words that only appear once in the corpus: +.. 
GENERATED FROM PYTHON SOURCE LINES 65-87 + Your way of processing the documents will likely vary; here, I only split on whitespace to tokenize, followed by lowercasing each word. In fact, I use this particular (simplistic and inefficient) setup to mimic the experiment done in Deerwester et al.'s @@ -140,6 +158,7 @@ a question-answer pair, in the style of: It is advantageous to represent the questions only by their (integer) ids. The mapping between the questions and ids is called a dictionary: +.. GENERATED FROM PYTHON SOURCE LINES 87-93 .. code-block:: default @@ -159,21 +178,25 @@ between the questions and ids is called a dictionary: .. code-block:: none - 2020-10-28 00:52:02,550 : INFO : adding document #0 to Dictionary(0 unique tokens: []) - 2020-10-28 00:52:02,550 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions) - 2020-10-28 00:52:02,550 : INFO : saving Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) under /tmp/deerwester.dict, separately None - 2020-10-28 00:52:02,552 : INFO : saved /tmp/deerwester.dict + 2021-06-01 10:34:56,824 : INFO : adding document #0 to Dictionary(0 unique tokens: []) + 2021-06-01 10:34:56,824 : INFO : built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions) + 2021-06-01 10:34:56,834 : INFO : Dictionary lifecycle event {'msg': "built Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) from 9 documents (total 29 corpus positions)", 'datetime': '2021-06-01T10:34:56.825003', 'gensim': '4.1.0.dev0', 'python': '3.8.5 (default, Jan 27 2021, 15:41:15) \n[GCC 9.3.0]', 'platform': 'Linux-5.4.0-73-generic-x86_64-with-glibc2.29', 'event': 'created'} + 2021-06-01 10:34:56,834 : INFO : Dictionary lifecycle event {'fname_or_handle': '/tmp/deerwester.dict', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2021-06-01T10:34:56.834300', 'gensim': '4.1.0.dev0', 'python': '3.8.5 (default, Jan 27 2021, 15:41:15) \n[GCC 9.3.0]', 'platform': 'Linux-5.4.0-73-generic-x86_64-with-glibc2.29', 'event': 'saving'} + 2021-06-01 10:34:56,834 : INFO : saved /tmp/deerwester.dict Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) +.. GENERATED FROM PYTHON SOURCE LINES 94-99 + Here we assigned a unique integer id to all words appearing in the corpus with the :class:`gensim.corpora.dictionary.Dictionary` class. This sweeps across the texts, collecting word counts and relevant statistics. In the end, we see there are twelve distinct words in the processed corpus, which means each document will be represented by twelve numbers (ie., by a 12-D vector). To see the mapping between words and their ids: +.. GENERATED FROM PYTHON SOURCE LINES 99-102 .. code-block:: default @@ -195,8 +218,11 @@ To see the mapping between words and their ids: +.. GENERATED FROM PYTHON SOURCE LINES 103-104 + To actually convert tokenized documents to vectors: +.. GENERATED FROM PYTHON SOURCE LINES 104-109 .. code-block:: default @@ -220,12 +246,15 @@ To actually convert tokenized documents to vectors: +.. GENERATED FROM PYTHON SOURCE LINES 110-115 + The function :func:`doc2bow` simply counts the number of occurrences of each distinct word, converts the word to its integer word id and returns the result as a sparse vector. 
The sparse vector ``[(0, 1), (1, 1)]`` therefore reads: in the document `"Human computer interaction"`, the words `computer` (id 0) and `human` (id 1) appear once; the other ten dictionary words appear (implicitly) zero times. +.. GENERATED FROM PYTHON SOURCE LINES 115-120 .. code-block:: default @@ -244,16 +273,18 @@ therefore reads: in the document `"Human computer interaction"`, the words `comp .. code-block:: none - 2020-10-28 00:52:02,830 : INFO : storing corpus in Matrix Market format to /tmp/deerwester.mm - 2020-10-28 00:52:02,832 : INFO : saving sparse matrix to /tmp/deerwester.mm - 2020-10-28 00:52:02,832 : INFO : PROGRESS: saving document #0 - 2020-10-28 00:52:02,834 : INFO : saved 9x12 matrix, density=25.926% (28/108) - 2020-10-28 00:52:02,834 : INFO : saving MmCorpus index to /tmp/deerwester.mm.index + 2021-06-01 10:34:57,074 : INFO : storing corpus in Matrix Market format to /tmp/deerwester.mm + 2021-06-01 10:34:57,075 : INFO : saving sparse matrix to /tmp/deerwester.mm + 2021-06-01 10:34:57,075 : INFO : PROGRESS: saving document #0 + 2021-06-01 10:34:57,076 : INFO : saved 9x12 matrix, density=25.926% (28/108) + 2021-06-01 10:34:57,076 : INFO : saving MmCorpus index to /tmp/deerwester.mm.index [[(0, 1), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(2, 1), (5, 1), (7, 1), (8, 1)], [(1, 1), (5, 2), (8, 1)], [(3, 1), (6, 1), (7, 1)], [(9, 1)], [(9, 1), (10, 1)], [(9, 1), (10, 1), (11, 1)], [(4, 1), (10, 1), (11, 1)]] +.. GENERATED FROM PYTHON SOURCE LINES 121-136 + By now it should be clear that the vector feature with ``id=10`` stands for the question "How many times does the word `graph` appear in the document?" and that the answer is "zero" for the first six documents and "one" for the remaining three. @@ -270,6 +301,7 @@ Instead, let's assume the documents are stored in a file on disk, one document p only requires that a corpus must be able to return one document vector at a time: +.. GENERATED FROM PYTHON SOURCE LINES 136-145 .. code-block:: default @@ -278,7 +310,7 @@ only requires that a corpus must be able to return one document vector at a time class MyCorpus: def __iter__(self): - for line in open('https://radimrehurek.com/gensim/mycorpus.txt'): + for line in open('https://radimrehurek.com/mycorpus.txt'): # assume there's one document per line, tokens separated by whitespace yield dictionary.doc2bow(line.lower().split()) @@ -289,11 +321,14 @@ only requires that a corpus must be able to return one document vector at a time +.. GENERATED FROM PYTHON SOURCE LINES 146-150 + The full power of Gensim comes from the fact that a corpus doesn't have to be a ``list``, or a ``NumPy`` array, or a ``Pandas`` dataframe, or whatever. Gensim *accepts any object that, when iterated over, successively yields documents*. +.. GENERATED FROM PYTHON SOURCE LINES 150-156 .. code-block:: default @@ -310,13 +345,16 @@ documents* -Download the sample `mycorpus.txt file here <./mycorpus.txt>`_. The assumption that +.. GENERATED FROM PYTHON SOURCE LINES 157-163 + +Download the sample `mycorpus.txt file here <https://radimrehurek.com/mycorpus.txt>`_. The assumption that each document occupies one line in a single file is not important; you can mold the `__iter__` function to fit your input format, whatever it is. Walking directories, parsing XML, accessing the network... Just parse your input to retrieve a clean list of tokens in each document, then convert the tokens via a dictionary to their ids and yield the resulting sparse vector inside `__iter__`. +.. GENERATED FROM PYTHON SOURCE LINES 163-167 .. 
code-block:: default @@ -334,15 +372,18 @@ then convert the tokens via a dictionary to their ids and yield the resulting sp .. code-block:: none - <__main__.MyCorpus object at 0x11e77bb38> + <__main__.MyCorpus object at 0x7f389b5f8520> +.. GENERATED FROM PYTHON SOURCE LINES 168-171 + Corpus is now an object. We didn't define any way to print it, so `print` just outputs address of the object in memory. Not very useful. To see the constituent vectors, let's iterate over the corpus and print each document vector (one at a time): +.. GENERATED FROM PYTHON SOURCE LINES 171-175 .. code-block:: default @@ -373,18 +414,21 @@ iterate over the corpus and print each document vector (one at a time): +.. GENERATED FROM PYTHON SOURCE LINES 176-181 + Although the output is the same as for the plain Python list, the corpus is now much more memory friendly, because at most one vector resides in RAM at a time. Your corpus can now be as large as you want. Similarly, to construct the dictionary without loading all texts into memory: +.. GENERATED FROM PYTHON SOURCE LINES 181-195 .. code-block:: default # collect statistics about all tokens - dictionary = corpora.Dictionary(line.lower().split() for line in open('https://radimrehurek.com/gensim/mycorpus.txt')) + dictionary = corpora.Dictionary(line.lower().split() for line in open('https://radimrehurek.com/mycorpus.txt')) # remove stop words and words that appear only once stop_ids = [ dictionary.token2id[stopword] @@ -406,13 +450,16 @@ Similarly, to construct the dictionary without loading all texts into memory: .. code-block:: none - 2020-10-28 00:52:04,241 : INFO : adding document #0 to Dictionary(0 unique tokens: []) - 2020-10-28 00:52:04,243 : INFO : built Dictionary(42 unique tokens: ['abc', 'applications', 'computer', 'for', 'human']...) from 9 documents (total 69 corpus positions) + 2021-06-01 10:34:58,466 : INFO : adding document #0 to Dictionary(0 unique tokens: []) + 2021-06-01 10:34:58,467 : INFO : built Dictionary(42 unique tokens: ['abc', 'applications', 'computer', 'for', 'human']...) from 9 documents (total 69 corpus positions) + 2021-06-01 10:34:58,467 : INFO : Dictionary lifecycle event {'msg': "built Dictionary(42 unique tokens: ['abc', 'applications', 'computer', 'for', 'human']...) from 9 documents (total 69 corpus positions)", 'datetime': '2021-06-01T10:34:58.467454', 'gensim': '4.1.0.dev0', 'python': '3.8.5 (default, Jan 27 2021, 15:41:15) \n[GCC 9.3.0]', 'platform': 'Linux-5.4.0-73-generic-x86_64-with-glibc2.29', 'event': 'created'} Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...) +.. GENERATED FROM PYTHON SOURCE LINES 196-219 + And that is all there is to it! At least as far as bag-of-words representation is concerned. Of course, what we do with such a corpus is another question; it is not at all clear how counting the frequency of distinct words could be useful. As it turns out, it isn't, and @@ -437,6 +484,7 @@ To save a corpus in the Matrix Market format: create a toy corpus of 2 documents, as a plain Python list +.. GENERATED FROM PYTHON SOURCE LINES 219-223 .. code-block:: default @@ -454,19 +502,22 @@ create a toy corpus of 2 documents, as a plain Python list .. 
code-block:: none - 2020-10-28 00:52:04,368 : INFO : storing corpus in Matrix Market format to /tmp/corpus.mm - 2020-10-28 00:52:04,370 : INFO : saving sparse matrix to /tmp/corpus.mm - 2020-10-28 00:52:04,370 : INFO : PROGRESS: saving document #0 - 2020-10-28 00:52:04,370 : INFO : saved 2x2 matrix, density=25.000% (1/4) - 2020-10-28 00:52:04,370 : INFO : saving MmCorpus index to /tmp/corpus.mm.index + 2021-06-01 10:34:58,603 : INFO : storing corpus in Matrix Market format to /tmp/corpus.mm + 2021-06-01 10:34:58,604 : INFO : saving sparse matrix to /tmp/corpus.mm + 2021-06-01 10:34:58,604 : INFO : PROGRESS: saving document #0 + 2021-06-01 10:34:58,604 : INFO : saved 2x2 matrix, density=25.000% (1/4) + 2021-06-01 10:34:58,604 : INFO : saving MmCorpus index to /tmp/corpus.mm.index +.. GENERATED FROM PYTHON SOURCE LINES 224-227 + Other formats include `Joachim's SVMlight format `_, `Blei's LDA-C format `_ and `GibbsLDA++ format `_. +.. GENERATED FROM PYTHON SOURCE LINES 227-233 .. code-block:: default @@ -486,22 +537,25 @@ Other formats include `Joachim's SVMlight format .. code-block:: none - 2020-10-28 00:52:04,425 : INFO : converting corpus to SVMlight format: /tmp/corpus.svmlight - 2020-10-28 00:52:04,426 : INFO : saving SvmLightCorpus index to /tmp/corpus.svmlight.index - 2020-10-28 00:52:04,427 : INFO : no word id mapping provided; initializing from corpus - 2020-10-28 00:52:04,427 : INFO : storing corpus in Blei's LDA-C format into /tmp/corpus.lda-c - 2020-10-28 00:52:04,427 : INFO : saving vocabulary of 2 words to /tmp/corpus.lda-c.vocab - 2020-10-28 00:52:04,427 : INFO : saving BleiCorpus index to /tmp/corpus.lda-c.index - 2020-10-28 00:52:04,481 : INFO : no word id mapping provided; initializing from corpus - 2020-10-28 00:52:04,481 : INFO : storing corpus in List-Of-Words format into /tmp/corpus.low - 2020-10-28 00:52:04,482 : WARNING : List-of-words format can only save vectors with integer elements; 1 float entries were truncated to integer value - 2020-10-28 00:52:04,482 : INFO : saving LowCorpus index to /tmp/corpus.low.index + 2021-06-01 10:34:58,653 : INFO : converting corpus to SVMlight format: /tmp/corpus.svmlight + 2021-06-01 10:34:58,654 : INFO : saving SvmLightCorpus index to /tmp/corpus.svmlight.index + 2021-06-01 10:34:58,654 : INFO : no word id mapping provided; initializing from corpus + 2021-06-01 10:34:58,654 : INFO : storing corpus in Blei's LDA-C format into /tmp/corpus.lda-c + 2021-06-01 10:34:58,654 : INFO : saving vocabulary of 2 words to /tmp/corpus.lda-c.vocab + 2021-06-01 10:34:58,654 : INFO : saving BleiCorpus index to /tmp/corpus.lda-c.index + 2021-06-01 10:34:58,707 : INFO : no word id mapping provided; initializing from corpus + 2021-06-01 10:34:58,708 : INFO : storing corpus in List-Of-Words format into /tmp/corpus.low + 2021-06-01 10:34:58,708 : WARNING : List-of-words format can only save vectors with integer elements; 1 float entries were truncated to integer value + 2021-06-01 10:34:58,708 : INFO : saving LowCorpus index to /tmp/corpus.low.index + +.. GENERATED FROM PYTHON SOURCE LINES 234-235 Conversely, to load a corpus iterator from a Matrix Market file: +.. GENERATED FROM PYTHON SOURCE LINES 235-238 .. code-block:: default @@ -518,15 +572,18 @@ Conversely, to load a corpus iterator from a Matrix Market file: .. 
code-block:: none - 2020-10-28 00:52:04,538 : INFO : loaded corpus index from /tmp/corpus.mm.index - 2020-10-28 00:52:04,540 : INFO : initializing cython corpus reader from /tmp/corpus.mm - 2020-10-28 00:52:04,540 : INFO : accepted corpus with 2 documents, 2 features, 1 non-zero entries + 2021-06-01 10:34:58,756 : INFO : loaded corpus index from /tmp/corpus.mm.index + 2021-06-01 10:34:58,757 : INFO : initializing cython corpus reader from /tmp/corpus.mm + 2021-06-01 10:34:58,757 : INFO : accepted corpus with 2 documents, 2 features, 1 non-zero entries + +.. GENERATED FROM PYTHON SOURCE LINES 239-240 Corpus objects are streams, so typically you won't be able to print them directly: +.. GENERATED FROM PYTHON SOURCE LINES 240-243 .. code-block:: default @@ -548,8 +605,11 @@ Corpus objects are streams, so typically you won't be able to print them directl +.. GENERATED FROM PYTHON SOURCE LINES 244-245 + Instead, to view the contents of a corpus: +.. GENERATED FROM PYTHON SOURCE LINES 245-249 .. code-block:: default @@ -572,8 +632,11 @@ Instead, to view the contents of a corpus: +.. GENERATED FROM PYTHON SOURCE LINES 250-251 + or +.. GENERATED FROM PYTHON SOURCE LINES 251-256 .. code-block:: default @@ -598,11 +661,14 @@ or +.. GENERATED FROM PYTHON SOURCE LINES 257-261 + The second way is obviously more memory-friendly, but for testing and development purposes, nothing beats the simplicity of calling ``list(corpus)``. To save the same Matrix Market document stream in Blei's LDA-C format, +.. GENERATED FROM PYTHON SOURCE LINES 261-264 .. code-block:: default @@ -619,14 +685,16 @@ To save the same Matrix Market document stream in Blei's LDA-C format, .. code-block:: none - 2020-10-28 00:52:04,921 : INFO : no word id mapping provided; initializing from corpus - 2020-10-28 00:52:04,922 : INFO : storing corpus in Blei's LDA-C format into /tmp/corpus.lda-c - 2020-10-28 00:52:04,923 : INFO : saving vocabulary of 2 words to /tmp/corpus.lda-c.vocab - 2020-10-28 00:52:04,923 : INFO : saving BleiCorpus index to /tmp/corpus.lda-c.index + 2021-06-01 10:34:59,085 : INFO : no word id mapping provided; initializing from corpus + 2021-06-01 10:34:59,086 : INFO : storing corpus in Blei's LDA-C format into /tmp/corpus.lda-c + 2021-06-01 10:34:59,087 : INFO : saving vocabulary of 2 words to /tmp/corpus.lda-c.vocab + 2021-06-01 10:34:59,087 : INFO : saving BleiCorpus index to /tmp/corpus.lda-c.index +.. GENERATED FROM PYTHON SOURCE LINES 265-275 + In this way, `gensim` can also be used as a memory-efficient **I/O format conversion tool**: just load a document stream using one format and immediately save it in another format. Adding new formats is dead easy, check out the `code for the SVMlight corpus @@ -638,6 +706,7 @@ Compatibility with NumPy and SciPy Gensim also contains `efficient utility functions `_ to help converting from/to numpy matrices +.. GENERATED FROM PYTHON SOURCE LINES 275-282 .. code-block:: default @@ -655,8 +724,11 @@ to help converting from/to numpy matrices +.. GENERATED FROM PYTHON SOURCE LINES 283-284 + and from/to `scipy.sparse` matrices +.. GENERATED FROM PYTHON SOURCE LINES 284-290 .. code-block:: default @@ -673,6 +745,8 @@ and from/to `scipy.sparse` matrices +.. GENERATED FROM PYTHON SOURCE LINES 291-304 + What Next --------- @@ -687,6 +761,7 @@ Optimize converting between corpora and NumPy/SciPy arrays?), see the :ref:`apir .. [1] This is the same corpus as used in `Deerwester et al. (1990): Indexing by Latent Semantic Analysis `_, Table 2. +.. 
GENERATED FROM PYTHON SOURCE LINES 304-310 .. code-block:: default @@ -710,9 +785,9 @@ Optimize converting between corpora and NumPy/SciPy arrays?), see the :ref:`apir .. rst-class:: sphx-glr-timing - **Total running time of the script:** ( 0 minutes 4.010 seconds) + **Total running time of the script:** ( 0 minutes 3.242 seconds) -**Estimated memory usage:** 40 MB +**Estimated memory usage:** 48 MB .. _sphx_glr_download_auto_examples_core_run_corpora_and_vector_spaces.py: diff --git a/docs/src/auto_examples/core/sg_execution_times.rst b/docs/src/auto_examples/core/sg_execution_times.rst index 9e36b38b09..da5c34f485 100644 --- a/docs/src/auto_examples/core/sg_execution_times.rst +++ b/docs/src/auto_examples/core/sg_execution_times.rst @@ -5,10 +5,10 @@ Computation times ================= -**00:04.010** total execution time for **auto_examples_core** files: +**00:03.242** total execution time for **auto_examples_core** files: +--------------------------------------------------------------------------------------------------------------+-----------+---------+ -| :ref:`sphx_glr_auto_examples_core_run_corpora_and_vector_spaces.py` (``run_corpora_and_vector_spaces.py``) | 00:04.010 | 39.8 MB | +| :ref:`sphx_glr_auto_examples_core_run_corpora_and_vector_spaces.py` (``run_corpora_and_vector_spaces.py``) | 00:03.242 | 48.2 MB | +--------------------------------------------------------------------------------------------------------------+-----------+---------+ | :ref:`sphx_glr_auto_examples_core_run_core_concepts.py` (``run_core_concepts.py``) | 00:00.000 | 0.0 MB | +--------------------------------------------------------------------------------------------------------------+-----------+---------+ diff --git a/docs/src/auto_examples/index.rst b/docs/src/auto_examples/index.rst index 1fa9eeca12..d15cccd8e6 100644 --- a/docs/src/auto_examples/index.rst +++ b/docs/src/auto_examples/index.rst @@ -71,7 +71,7 @@ Understanding this functionality is vital for using gensim effectively. .. raw:: html -
+
.. only:: html @@ -92,7 +92,7 @@ Understanding this functionality is vital for using gensim effectively. .. raw:: html -
+
.. only:: html @@ -169,7 +169,7 @@ Learning-oriented lessons that introduce a particular gensim feature, e.g. a mod .. raw:: html -
+
.. only:: html @@ -190,7 +190,7 @@ Learning-oriented lessons that introduce a particular gensim feature, e.g. a mod .. raw:: html -
+
.. only:: html @@ -288,7 +288,7 @@ These **goal-oriented guides** demonstrate how to **solve a specific problem** u .. raw:: html -
+
.. only:: html @@ -309,7 +309,7 @@ These **goal-oriented guides** demonstrate how to **solve a specific problem** u .. raw:: html -
+
.. only:: html @@ -426,13 +426,13 @@ Blog posts, tutorial videos, hackathons and other useful Gensim resources, from .. container:: sphx-glr-download sphx-glr-download-python - :download:`Download all examples in Python source code: auto_examples_python.zip ` + :download:`Download all examples in Python source code: auto_examples_python.zip ` .. container:: sphx-glr-download sphx-glr-download-jupyter - :download:`Download all examples in Jupyter notebooks: auto_examples_jupyter.zip ` + :download:`Download all examples in Jupyter notebooks: auto_examples_jupyter.zip ` .. only:: html diff --git a/docs/src/gallery/core/run_corpora_and_vector_spaces.py b/docs/src/gallery/core/run_corpora_and_vector_spaces.py index 0a49614123..983a9d1235 100644 --- a/docs/src/gallery/core/run_corpora_and_vector_spaces.py +++ b/docs/src/gallery/core/run_corpora_and_vector_spaces.py @@ -138,7 +138,7 @@ class MyCorpus: def __iter__(self): - for line in open('https://radimrehurek.com/gensim/mycorpus.txt'): + for line in open('https://radimrehurek.com/mycorpus.txt'): # assume there's one document per line, tokens separated by whitespace yield dictionary.doc2bow(line.lower().split()) @@ -154,7 +154,7 @@ def __iter__(self): # in RAM at once. You can even create the documents on the fly! ############################################################################### -# Download the sample `mycorpus.txt file here <./mycorpus.txt>`_. The assumption that +# Download the sample `mycorpus.txt file here <https://radimrehurek.com/mycorpus.txt>`_. The assumption that # each document occupies one line in a single file is not important; you can mold # the `__iter__` function to fit your input format, whatever it is. # Walking directories, parsing XML, accessing the network... @@ -180,7 +180,7 @@ def __iter__(self): # Similarly, to construct the dictionary without loading all texts into memory: # collect statistics about all tokens -dictionary = corpora.Dictionary(line.lower().split() for line in open('https://radimrehurek.com/gensim/mycorpus.txt')) +dictionary = corpora.Dictionary(line.lower().split() for line in open('https://radimrehurek.com/mycorpus.txt')) # remove stop words and words that appear only once stop_ids = [ dictionary.token2id[stopword]