Merge pull request #1415 from microsoft/zhangya
Zhangya
miguelgfierro authored Jun 15, 2021
2 parents a566e9f + e5cc9aa commit 85d696c
Showing 78 changed files with 1,875 additions and 1,434 deletions.
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -23,7 +23,7 @@ Here are the basic steps to get started with your first contribution. Please rea
5. Install development requirements. `pip install -r dev-requirements.txt`
6. Create a test that replicates the issue.
7. Make code changes.
8. Ensure unit tests pass and code style / formatting is consistent (see [wiki](https://github.com/Microsoft/Recommenders/wiki/Coding-Guidelines#python-and-docstrings-style) for more details).
8. Ensure that unit tests pass and code style / formatting is consistent (see the [coding guidelines](https://github.com/Microsoft/Recommenders/wiki/Coding-Guidelines#python-and-docstrings-style) for more details). In particular, make sure that there is a docstring for every function and class you add and that it conforms to the [Google style](http://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html); a minimal docstring sketch is shown below.
9. Create a pull request against **staging** branch.

Once the features included in a [milestone](https://github.com/microsoft/recommenders/milestones) are completed, we will merge staging into main. See the wiki for more detail about our [merge strategy](https://github.com/microsoft/recommenders/wiki/Strategy-to-merge-the-code-to-main-branch).
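For reference, a minimal sketch of a Google-style docstring of the kind the guidelines ask for (the function itself is hypothetical):

    def hit_ratio(hits, total_users):
        """Compute the fraction of users with at least one relevant recommendation.

        Args:
            hits (int): Number of users who received at least one relevant item.
            total_users (int): Total number of users evaluated.

        Returns:
            float: Hit ratio in the range [0, 1].

        Raises:
            ValueError: If ``total_users`` is not positive.
        """
        if total_users <= 0:
            raise ValueError("total_users must be positive")
        return hits / total_users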
11 changes: 11 additions & 0 deletions docs/.readthedocs.yaml
@@ -0,0 +1,11 @@
version: 2

# Build from the docs/ directory with Sphinx
sphinx:
  configuration: docs/source/conf.py

# Explicitly set the version of Python and its requirements
python:
  version: 3.7
  install:
    - requirements: docs/requirements.txt
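As a quick local sanity check before pushing, the configuration can be parsed and its declared keys verified; a minimal sketch, assuming PyYAML is installed and the script is run from the repository root:

    # Sketch only: parse docs/.readthedocs.yaml and verify the declared keys.
    import yaml  # PyYAML, pinned in the docs requirements as pyyaml>=5.4.1,<6

    with open("docs/.readthedocs.yaml") as f:
        cfg = yaml.safe_load(f)

    assert cfg["version"] == 2
    assert cfg["sphinx"]["configuration"] == "docs/source/conf.py"
    assert cfg["python"]["version"] == 3.7
    print("Read the Docs configuration parsed OK")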
3 changes: 3 additions & 0 deletions docs/README.md
@@ -2,7 +2,9 @@

To set up the documentation, first you need to install the dependencies of the full environment. To do so, please follow [SETUP.md](../SETUP.md). Then type:

conda create -n reco_full python=3.6 cudatoolkit=10.0 "cudnn>=7.6"
conda activate reco_full
pip install .[all]
pip install sphinx_rtd_theme


@@ -11,3 +13,4 @@ To build the documentation as HTML:
cd docs
make html

To contribute to this repository, please follow our [coding guidelines](https://github.com/Microsoft/Recommenders/wiki/Coding-Guidelines). See also the [reStructuredText documentation](https://www.sphinx-doc.org/en/master/usage/restructuredtext/index.html) for the syntax of docstrings.
34 changes: 34 additions & 0 deletions docs/requirements.txt
@@ -0,0 +1,34 @@
numpy>=1.14
pandas>1.0.3,<2
scipy>=1.0.0,<2
tqdm>=4.31.1,<5
matplotlib>=2.2.2,<4
scikit-learn>=0.22.1,<1
numba>=0.38.1,<1
lightfm>=1.15,<2
lightgbm>=2.2.1,<3
memory_profiler>=0.54.0,<1
nltk>=3.4,<4
pydocumentdb>=2.3.3,<3
pymanopt>=0.2.5,<1
seaborn>=0.8.1,<1
transformers>=2.5.0,<5
bottleneck>=1.2.1,<2
category_encoders>=1.3.0,<2
jinja2>=2,<3
pyyaml>=5.4.1,<6
requests>=2.0.0,<3
cornac>=1.1.2,<2
scikit-surprise>=0.19.1,<=1.1.1
retrying>=1.3.3
azure.mgmt.cosmosdb>=0.8.0,<1
hyperopt>=0.1.2,<1
ipykernel>=4.6.1,<5
jupyter>=1,<2
locust>=1,<2
papermill>=2.1.2,<3
scrapbook>=0.5.0,<1.0.0
nvidia-ml-py3>=7.352.0
tensorflow-gpu>=1.15.0,<2
torch==1.2.0
fastai>=1.0.46,<2
19 changes: 0 additions & 19 deletions docs/source/azureml.rst

This file was deleted.

7 changes: 7 additions & 0 deletions docs/source/common.rst
@@ -18,6 +18,13 @@ GPU utilities
:members:


Kubernetes utilities
===============================

.. automodule:: reco_utils.common.k8s_utils
:members:


Notebook utilities
===============================

113 changes: 102 additions & 11 deletions docs/source/dataset.rst
@@ -1,41 +1,132 @@
.. _dataset:

Dataset module
**************************
##############

Recommendation datasets and related utilities

Recommendation datasets
===============================
***********************

.. automodule:: reco_utils.dataset.movielens
Amazon Reviews
==============

`Amazon Reviews dataset <https://snap.stanford.edu/data/web-Amazon.html>`_ consists of reviews from Amazon.
The data span a period of 18 years, including ~35 million reviews up to March 2013. Reviews include product and user
information, ratings, and a plaintext review.

:Citation:

J. McAuley and J. Leskovec, "Hidden factors and hidden topics: understanding rating dimensions with review text",
RecSys, 2013.

.. automodule:: reco_utils.dataset.amazon_reviews
:members:

CORD-19
=======

`COVID-19 Open Research Dataset (CORD-19) <https://azure.microsoft.com/en-us/services/open-datasets/catalog/covid-19-open-research/>`_ is a full-text
and metadata dataset of COVID-19 and coronavirus-related scholarly articles optimized
for machine readability and made available for use by the global research community.

In response to the COVID-19 pandemic, the Allen Institute for AI has partnered with leading research groups
to prepare and distribute the COVID-19 Open Research Dataset (CORD-19), a free resource of
over 47,000 scholarly articles, including over 36,000 with full text, about COVID-19 and the
coronavirus family of viruses for use by the global research community.

This dataset is intended to mobilize researchers to apply recent advances in natural language processing
to generate new insights in support of the fight against this infectious disease.

:Citation:

Wang, L.L., Lo, K., Chandrasekhar, Y., Reas, R., Yang, J., Eide, D.,
Funk, K., Kinney, R., Liu, Z., Merrill, W. and Mooney, P. "Cord-19: The COVID-19 Open Research Dataset.", 2020.


.. automodule:: reco_utils.dataset.covid_utils
:members:

Criteo
======

`Criteo dataset <https://www.kaggle.com/c/criteo-display-ad-challenge/overview>`_, released by Criteo Labs, is an online advertising dataset that contains feature values and click feedback
for millions of display ads. Every ad has 40 attributes: the first is the label, where a value of 1 indicates
that the ad was clicked and 0 that it was not. The remaining attributes consist of 13 integer columns and
26 categorical columns.

.. automodule:: reco_utils.dataset.criteo
:members:
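As a quick illustration, a minimal loading sketch; it assumes the module exposes ``load_pandas_df`` with a ``size`` argument ("sample" or "full"), as the repository's Criteo notebooks do, which may differ in other versions:

    # Sketch only: load_pandas_df and the "sample" size are assumed from the
    # Criteo example notebooks; check reco_utils.dataset.criteo for the exact API.
    from reco_utils.dataset.criteo import load_pandas_df

    df = load_pandas_df(size="sample")  # "sample" is a small subset; "full" downloads the complete data
    print(df.shape)  # columns: label, 13 integer features, 26 categorical features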

MIND
====

`MIcrosoft News Dataset (MIND) <https://msnews.github.io/>`_ is a large-scale dataset for news recommendation research. It was collected from
anonymized behavior logs of the Microsoft News website.

MIND contains about 160k English news articles and more than 15 million impression logs generated by 1 million users.
Every news article contains rich textual content including title, abstract, body, category and entities.
Each impression log contains the click events, non-clicked events and historical news click behaviors of the user before
that impression. To protect user privacy, each user was de-linked from the production system when securely hashed into an anonymized ID.

:Citation:

Fangzhao Wu, Ying Qiao, Jiun-Hung Chen, Chuhan Wu, Tao Qi, Jianxun Lian, Danyang Liu, Xing Xie, Jianfeng Gao, Winnie Wu
and Ming Zhou, "MIND: A Large-scale Dataset for News Recommendation", ACL, 2020.



.. automodule:: reco_utils.dataset.mind
:members:
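For illustration only, a hedged sketch of downloading the small variant; ``download_mind`` and its signature are taken from the repository's MIND example notebooks and may differ here:

    # Sketch only: download_mind is assumed from the MIND example notebooks;
    # it is expected to return the paths of the downloaded train and validation zips.
    from reco_utils.dataset.mind import download_mind

    train_zip, valid_zip = download_mind(size="small", dest_path="./mind_small")
    print(train_zip, valid_zip)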

MovieLens
=========

The `MovieLens datasets <https://grouplens.org/datasets/movielens/>`_, first released in 1998,
describe people's expressed preferences
for movies. These preferences take the form of `<user, item, rating, timestamp>` tuples,
each the result of a person expressing a preference (a 0-5 star rating) for a movie
at a particular time.

The collection comes in several sizes:

* MovieLens 100k: 100,000 ratings from 1000 users on 1700 movies.
* MovieLens 1M: 1 million ratings from 6000 users on 4000 movies.
* MovieLens 10M: 10 million ratings from 72000 users on 10000 movies.
* MovieLens 20M: 20 million ratings from 138000 users on 27000 movies.

:Citation:

F. M. Harper and J. A. Konstan. "The MovieLens Datasets: History and Context".
ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19,
DOI=http://dx.doi.org/10.1145/2827872, 2015.

.. automodule:: reco_utils.dataset.movielens
:members:
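A minimal usage sketch; ``load_pandas_df`` and its ``size``/``header`` arguments are assumed from the repository's quick-start notebooks:

    # Sketch only: load_pandas_df is assumed from the MovieLens quick-start notebooks;
    # it downloads the chosen MovieLens size and returns a pandas DataFrame.
    from reco_utils.dataset.movielens import load_pandas_df

    df = load_pandas_df(
        size="100k",  # one of "100k", "1m", "10m", "20m"
        header=["userID", "itemID", "rating", "timestamp"],
    )
    print(df.head())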

Download utilities
===============================
******************

.. automodule:: reco_utils.dataset.download_utils
:members:


Cosmos CLI
===============================
Cosmos CLI utilities
*********************

.. automodule:: reco_utils.dataset.cosmos_cli
:members:


Pandas dataframe utils
===============================
Pandas dataframe utilities
***************************

.. automodule:: reco_utils.dataset.pandas_df_utils
:members:


Splitter utilities
===============================
******************

.. automodule:: reco_utils.dataset.python_splitters
:members:
@@ -48,14 +139,14 @@ Splitter utilities


Sparse utilities
===============================
****************

.. automodule:: reco_utils.dataset.sparse
:members:


Knowledge graph utilities
===============================
*************************

.. automodule:: reco_utils.dataset.wikidata
:members:
1 change: 0 additions & 1 deletion docs/source/index.rst
@@ -11,7 +11,6 @@ evaluating recommender systems.
:maxdepth: 1
:caption: Contents:

AzureML <azureml>
Common <common>
Dataset <dataset>
Evaluation <evaluation>
