Movielens20m dataset #1336

zkid18 · 2021-10-27T11:22:53Z

Description

Movielens 20M dataset for RecSys.
Current implementation progress:

[x] download dataset
[x] parser and pre-processing for ratings.csv
[x] user-based train/test splitter
[ ] interaction matrix generation
[ ] seq dataset generation
[ ] upload dataset in sparse format in PyTorch.

I am stuck on choosing a proper final representation of the dataset.
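One possible answer to the representation question, as a minimal sketch rather than what this PR implements: keep the interactions as COO triplets (row index, column index, value) after remapping raw ids to contiguous indices. The function name `build_coo` and the toy rows are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Rating = Tuple[int, int, float]  # (userId, movieId, rating)


def build_coo(ratings: List[Rating]):
    """Remap raw ids to contiguous indices and return COO triplets.

    Hypothetical helper for illustration; not part of catalyst.
    """
    user_index: Dict[int, int] = {}
    item_index: Dict[int, int] = {}
    rows: List[int] = []
    cols: List[int] = []
    values: List[float] = []
    for user_id, movie_id, rating in ratings:
        # setdefault assigns the next free index the first time an id is seen
        rows.append(user_index.setdefault(user_id, len(user_index)))
        cols.append(item_index.setdefault(movie_id, len(item_index)))
        values.append(rating)
    shape = (len(user_index), len(item_index))
    return rows, cols, values, shape


# Toy data (made up): two users, two movies
toy = [(1, 2, 3.5), (1, 29, 3.5), (7, 2, 4.0)]
rows, cols, values, shape = build_coo(toy)
```

If PyTorch is available, these lists could then be passed to `torch.sparse_coo_tensor([rows, cols], values, size=shape)` to obtain a sparse users-by-items tensor.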

A few words on the MovieLens 20M dataset.
Files 1-5 in the list below can be considered context information, while ratings.csv (file 6) provides the user interaction data. Hence we can omit files 1-5 for now.

    The data are contained in six files: 
    1. genome-scores.csv 
    2. genome-tags.csv 
    3. links.csv 
    4. movies.csv
    5. tags.csv  
    6. ratings.csv

First five rows of ratings.csv:

   userId  movieId  rating   timestamp
0       1        2     3.5  1112486027
1       1       29     3.5  1112484676
2       1       32     3.5  1112484819
3       1       47     3.5  1112484727
4       1       50     3.5  1112484580

So far I have split the dataset randomly by user. I need to better understand what the dataset for sequential RecSys should look like.
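A rough sketch of one common shape for sequential data, under stated assumptions: each user's movies are ordered by timestamp into a single sequence, and whole users are assigned at random to train or test. The helpers `build_sequences` and `split_users` are hypothetical names, not the code in this PR, and the rows for user 2 are made up so the split has something to divide.

```python
import random
from collections import defaultdict

# (userId, movieId, rating, timestamp); user 1 rows are the ratings.csv sample
ratings = [
    (1, 2, 3.5, 1112486027),
    (1, 29, 3.5, 1112484676),
    (1, 32, 3.5, 1112484819),
    (1, 47, 3.5, 1112484727),
    (1, 50, 3.5, 1112484580),
    (2, 3, 4.0, 974820889),   # made-up rows for a second user
    (2, 62, 5.0, 974820598),
]


def build_sequences(rows):
    """Group interactions by user and order each user's movies by timestamp."""
    by_user = defaultdict(list)
    for user_id, movie_id, _rating, timestamp in rows:
        by_user[user_id].append((timestamp, movie_id))
    return {user: [movie for _, movie in sorted(events)]
            for user, events in by_user.items()}


def split_users(user_ids, test_fraction=0.2, seed=42):
    """Randomly assign whole users to train or test (user-based split)."""
    users = sorted(set(user_ids))
    random.Random(seed).shuffle(users)
    n_test = max(1, int(len(users) * test_fraction))
    return set(users[n_test:]), set(users[:n_test])


sequences = build_sequences(ratings)
train_users, test_users = split_users(sequences.keys())
```

With sequences in this form, a sequential model can be trained to predict the next movie from the preceding ones; whether to additionally hold out each user's last item is a separate design choice.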

Related Issue

Type of Change

Examples / docs / tutorials / contributors update
Bug fix (non-breaking change which fixes an issue)
Improvement (non-breaking change which improves an existing feature)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Checklist

Have you updated tests for the new functionality?
Have you added your new classes/functions to the docs?
Have you updated the CHANGELOG?
Have you run colab minimal CI/CD with latest and minimal requirements?
Have you checked XLA integration with single and multiple processes?

pep8speaks · 2021-10-27T11:22:56Z

Hello @zkid18! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-12-27 05:25:19 UTC

Scitator

it would be great if we could correct the code style and add a few usage examples for the dataset

tmp_data/MovieLens20M/raw/ml-20m/README.txt

mergify · 2021-11-07T18:24:21Z

This pull request is now in conflicts. @zkid18, could you fix it? 🙏

zkid18 · 2021-11-08T10:47:04Z

https://colab.research.google.com/drive/1l6mpAWBKbkEWILnz9ThwpAoTQDCXkG-M?authuser=1#scrollTo=HkzzfT1moccj&uniqifier=1

Colab to test MovieLens20m

CLAassistant · 2021-11-20T19:37:16Z

All committers have signed the CLA.

mergify · 2021-12-18T07:20:11Z

This pull request is now in conflicts. @zkid18, could you fix it? 🙏

Scitator

todo: torch>=1.3.0 ;)

Pull request has been modified.

requirements/requirements-cv.txt

Scitator · 2021-12-20T16:43:37Z

requirements/requirements-ml.txt

@@ -1,4 +1,4 @@
 scipy>=1.4.1
 matplotlib>=3.1.0
-pandas>=0.25.0
+pandas>=1.1.1


1.0? or 0.25.0

Pull request has been modified.

Scitator · 2021-12-24T14:47:00Z

tests/catalyst/contrib/datasets/test_movielens_20m.py

+    """
+    Test movielense download
+    """
+    MovieLens20M("./tmp_data", download=True, sample=True)


could we use TemporaryDirectory? like https://github.com/catalyst-team/catalyst/blob/e1d78b7dd568c4d5cd94be417d6d8b93ef27b2e5/tests/catalyst/runners/test_train_flags.py
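A minimal sketch of the pattern being suggested, using only the standard library. `fake_download` is a hypothetical stand-in for `MovieLens20M(root, download=True, sample=True)` so the example runs without catalyst or network access; the directory layout it writes is also an assumption.

```python
import os
import tempfile


def fake_download(root: str) -> str:
    """Hypothetical stand-in for MovieLens20M(root, download=True, sample=True)."""
    raw_dir = os.path.join(root, "MovieLens20M", "raw")
    os.makedirs(raw_dir)
    path = os.path.join(raw_dir, "ratings.csv")
    with open(path, "w") as f:
        f.write("userId,movieId,rating,timestamp\n")
    return path


def run_download_in_tmp_dir() -> str:
    """Run the download inside a directory that is removed on exit."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = fake_download(tmp_dir)
        assert os.path.isfile(path)
    # tmp_dir and everything under it have been deleted at this point
    return tmp_dir


leftover = run_download_in_tmp_dir()
```

Compared with a fixed `./tmp_data` path, this leaves nothing behind in the repository after the test finishes.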

Scitator

comments ;)

Pull request has been modified.

zkid18 added 2 commits October 26, 2021 21:21

wip movielnes parsing

eabaebe

stuck with format

3908dd2

zkid18 requested review from bagxi, ditwoo and Scitator as code owners October 27, 2021 11:22

zkid18 changed the title ~~Movielens20m~~ [WIP] Movielens20m Oct 27, 2021

Scitator reviewed Oct 28, 2021

train-test splitter

aee834b

zkid18 added 2 commits November 8, 2021 10:19

add comments

0e759d0

fix MovieLens100k error

02318a7

zkid18 added 2 commits November 8, 2021 10:47

remove tmp_data

8b42c24

add tests

ea10abb

zkid18 added 12 commits November 21, 2021 17:50

add tests; add user/item filtering algorithm

0b2e5ac

codestyle wip

86a1bff

codestyle fix

7aa388c

movielines codestyle

eb2e55c

movielnes codestyle

37564c2

movielnes codestyle

dadc98c

merge with master

109302c

fixed tests

fd9359e

codestyle minors

fa5331f

codestyle minors

e1da558

minor fixes

04ace76

codestyle minor fix

1d9151e

zkid18 changed the title ~~[WIP] Movielens20m~~ Movielens20m dataset Nov 26, 2021

fix movielens tests

b4643bb

zkid18 added 7 commits December 16, 2021 14:53

change scipy to 1.4.1

34b59f1

changed pytorch version

cddb12d

removed serrialization param

81c2e17

torchvision 0.5.0

c9d482c

update torch 1.6.0

69a3b18

update torch 1.7.0

b2fe8f8

updated torchvision 0.8.0

3e415c0

Scitator previously requested changes Dec 20, 2021

add version validation

3516bbe

Scitator previously requested changes Dec 20, 2021

version check

fd49c75

zkid18 added 10 commits December 20, 2021 19:02

check requerements

72dbf32

cganged torchvisionm version

13e3bf8

Changelog

a4f0b07

pandas server

4bae487

changed init

7b10d4d

changed init

0f8d8de

change import logic

195b12e

codestyle

d5c984e

removed parse

4c2de99

removed parse

c5c2fe2

Scitator reviewed Dec 24, 2021

Scitator previously requested changes Dec 24, 2021

Update requirements-ml.txt

bb65567

Update test_movielens_20m.py

15d5c20

Scitator merged commit 1d0a0b2 into catalyst-team:master Dec 27, 2021
