Movielens20m dataset #1336

Merged
merged 49 commits into from
Dec 27, 2021

Conversation

zkid18
Contributor

@zkid18 zkid18 commented Oct 27, 2021

Pull Request FAQ

Description

Movielens 20M dataset for RecSys.
Current implementation progress:

[x] download dataset
[x] parser and pre-processing for ratings.csv
[x] user-based train/test splitter
[ ] interaction matrix generation
[ ] seq dataset generation
[ ] upload dataset in sparse format in PyTorch.
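
For the open interaction-matrix item, one possible shape is a user-by-item matrix in COO form. The sketch below is only an illustration, not the code in this PR: `build_coo` is a hypothetical helper, and it assumes raw `userId`/`movieId` values are remapped to contiguous 0-based indices in order of first appearance.

```python
from typing import Dict, List, Tuple

# (userId, movieId, rating) triples; values mirror the first rows of ratings.csv.
ratings: List[Tuple[int, int, float]] = [
    (1, 2, 3.5),
    (1, 29, 3.5),
    (1, 32, 3.5),
    (1, 47, 3.5),
    (1, 50, 3.5),
]


def build_coo(triples):
    """Hypothetical helper: map raw ids to contiguous indices, return COO parts."""
    user_index: Dict[int, int] = {}
    item_index: Dict[int, int] = {}
    rows, cols, values = [], [], []
    for user_id, movie_id, rating in triples:
        # setdefault assigns the next free index the first time an id is seen.
        rows.append(user_index.setdefault(user_id, len(user_index)))
        cols.append(item_index.setdefault(movie_id, len(item_index)))
        values.append(rating)
    shape = (len(user_index), len(item_index))
    return rows, cols, values, shape


rows, cols, values, shape = build_coo(ratings)
```

If PyTorch is available, these parts should fit `torch.sparse_coo_tensor([rows, cols], values, size=shape)`; the same triplets also match what `scipy.sparse.coo_matrix((values, (rows, cols)), shape=shape)` expects.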

I am stuck on choosing a proper final representation of the dataset.

A few words on the MovieLens 20M dataset.
Files 1-5 in the list below can be considered context information, while ratings.csv provides the user interaction data. Hence we can omit files 1-5 for now.

    The data are contained in six files: 
    1. genome-scores.csv 
    2. genome-tags.csv 
    3. links.csv 
    4. movies.csv
    5. tags.csv  
    6. ratings.csv

First five rows of ratings.csv:

   userId  movieId  rating   timestamp
0       1        2     3.5  1112486027
1       1       29     3.5  1112484676
2       1       32     3.5  1112484819
3       1       47     3.5  1112484727
4       1       50     3.5  1112484580

So far I have split the dataset randomly by users. I need to better understand what the dataset for a sequential RecSys should look like.
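
One common way to frame the sequential case is to order each user's interactions by timestamp and hold out the last item as the target. The sketch below is an assumption about what that could look like, not the representation this PR settled on: `build_sequences`, `split_users`, and `leave_last_out` are hypothetical names, and the rows for user 2 are made up for illustration.

```python
import random
from collections import defaultdict

# (userId, movieId, rating, timestamp). User 1 mirrors the ratings.csv sample;
# user 2 is invented purely so the split has more than one user.
ratings = [
    (1, 2, 3.5, 1112486027),
    (1, 29, 3.5, 1112484676),
    (1, 32, 3.5, 1112484819),
    (1, 47, 3.5, 1112484727),
    (1, 50, 3.5, 1112484580),
    (2, 3, 4.0, 974820889),
    (2, 62, 5.0, 974820598),
]


def build_sequences(rows):
    """Group interactions by user and order each user's movies by timestamp."""
    by_user = defaultdict(list)
    for user_id, movie_id, _rating, timestamp in rows:
        by_user[user_id].append((timestamp, movie_id))
    return {user: [movie for _, movie in sorted(events)]
            for user, events in by_user.items()}


def split_users(user_ids, test_fraction=0.2, seed=0):
    """Random user-based split: each user lands entirely in train or in test."""
    users = sorted(user_ids)
    random.Random(seed).shuffle(users)
    n_test = max(1, int(len(users) * test_fraction))
    return set(users[n_test:]), set(users[:n_test])


def leave_last_out(sequence):
    """Use all but the last item as the input and the last item as the target."""
    return sequence[:-1], sequence[-1]


sequences = build_sequences(ratings)
train_users, test_users = split_users(sequences)
history, target = leave_last_out(sequences[1])
```

Note that the sample rows for user 1 are not stored in chronological order, so the explicit sort by timestamp matters here.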

Related Issue

Type of Change

  • Examples / docs / tutorials / contributors update
  • Bug fix (non-breaking change which fixes an issue)
  • Improvement (non-breaking change which improves an existing feature)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Checklist

  • Have you updated tests for the new functionality?
  • Have you added your new classes/functions to the docs?
  • Have you updated the CHANGELOG?
  • Have you run colab minimal CI/CD with latest and minimal requirements?
  • Have you checked XLA integration with single and multiple processes?

@pep8speaks

pep8speaks commented Oct 27, 2021

Hello @zkid18! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-12-27 05:25:19 UTC

@zkid18 zkid18 changed the title from "Movielens20m" to "[WIP] Movielens20m" on Oct 27, 2021
Member

@Scitator Scitator left a comment


it would be great if we could correct the code style and add a few usage examples for the dataset

tmp_data/MovieLens20M/raw/ml-20m/README.txt (outdated, resolved)
@mergify

mergify bot commented Nov 7, 2021

This pull request is now in conflicts. @zkid18, could you fix it? 🙏

@zkid18
Contributor Author

zkid18 commented Nov 8, 2021

@CLAassistant

CLAassistant commented Nov 20, 2021

CLA assistant check
All committers have signed the CLA.

@zkid18 zkid18 changed the title from "[WIP] Movielens20m" to "Movielens20m dataset" on Nov 26, 2021
@mergify

mergify bot commented Dec 18, 2021

This pull request is now in conflicts. @zkid18, could you fix it? 🙏

Member

@Scitator Scitator left a comment


todo: torch>=1.3.0 ;)

@mergify mergify bot dismissed Scitator’s stale review December 20, 2021 16:33

Pull request has been modified.

requirements/requirements-cv.txt (outdated, resolved)
@@ -1,4 +1,4 @@
 scipy>=1.4.1
 matplotlib>=3.1.0
-pandas>=0.25.0
+pandas>=1.1.1
Member


1.0?

Member


1.0? or 0.25.0

@mergify mergify bot dismissed Scitator’s stale review December 20, 2021 17:27

Pull request has been modified.

"""
Test MovieLens download
"""
MovieLens20M("./tmp_data", download=True, sample=True)
Member


Member

@Scitator Scitator left a comment


comments ;)

@mergify mergify bot dismissed Scitator’s stale review December 27, 2021 04:10

Pull request has been modified.

@Scitator Scitator merged commit 1d0a0b2 into catalyst-team:master Dec 27, 2021