Movielens20m dataset #1336
It would be great if we could correct the code style and add a few usage examples for the dataset.
This pull request is now in conflicts. @zkid18, could you fix it? 🙏
todo: torch>=1.3.0
;)
Pull request has been modified.
requirements/requirements-ml.txt
Outdated
@@ -1,4 +1,4 @@
 scipy>=1.4.1
 matplotlib>=3.1.0
-pandas>=0.25.0
+pandas>=1.1.1
1.0?
1.0? or 0.25.0
Pull request has been modified.
"""
Test movielense download
"""
MovieLens20M("./tmp_data", download=True, sample=True)
Could we use TemporaryDirectory here instead of a hard-coded "./tmp_data"? Like in https://github.com/catalyst-team/catalyst/blob/e1d78b7dd568c4d5cd94be417d6d8b93ef27b2e5/tests/catalyst/runners/test_train_flags.py
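A rough sketch of what that could look like, using only the standard library. `fake_download` is a hypothetical stand-in for the `MovieLens20M(root, download=True, sample=True)` call so the snippet runs without network access; the file name and CSV columns are assumptions, not taken from this PR.

```python
import tempfile
from pathlib import Path


def fake_download(root: str) -> Path:
    """Hypothetical stand-in for MovieLens20M(root, download=True, sample=True)."""
    target = Path(root) / "ratings_sample.csv"  # assumed file name
    target.write_text("userId,movieId,rating,timestamp\n1,2,3.5,1112486027\n")
    return target


def test_download_in_tmp_dir() -> None:
    # The directory and everything inside it is deleted when the block exits,
    # so the test leaves nothing behind in the working tree.
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = fake_download(tmp_dir)
        assert path.exists()
    assert not path.exists()
```

In the real test the body of the `with` block would call the dataset class with `tmp_dir` as its root.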
comments ;)
Pull request has been modified.
Description
MovieLens 20M dataset for RecSys.
Current implementation progress:
[x] download dataset
[x] parser and pre-processing for the rating.csv
[x] user-based train/test splitter
[ ] interaction matrix generation
[ ] seq dataset generation
[ ] upload dataset in sparse format in PyTorch.
I am stuck on choosing a proper final representation of the dataset.
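One candidate for the interaction matrix, sketched under assumptions: keep the ratings as COO triplets (row indices, column indices, values) with user and movie ids remapped to contiguous indices. The `(user_id, movie_id, rating)` tuple layout and the `build_coo` helper are hypothetical and not part of this PR.

```python
from typing import Dict, List, Tuple

Rating = Tuple[int, int, float]  # (user_id, movie_id, rating), assumed layout


def build_coo(ratings: List[Rating]):
    """Map raw ids to contiguous indices and return COO triplets plus shape."""
    user_index: Dict[int, int] = {}
    item_index: Dict[int, int] = {}
    rows: List[int] = []
    cols: List[int] = []
    values: List[float] = []
    for user_id, movie_id, rating in ratings:
        # setdefault assigns the next free index the first time an id is seen.
        rows.append(user_index.setdefault(user_id, len(user_index)))
        cols.append(item_index.setdefault(movie_id, len(item_index)))
        values.append(rating)
    shape = (len(user_index), len(item_index))
    return rows, cols, values, shape


rows, cols, values, shape = build_coo(
    [(10, 100, 4.0), (10, 200, 3.5), (20, 100, 5.0)]
)
```

If this layout fits, the triplets could be handed to `torch.sparse_coo_tensor([rows, cols], values, shape)` to get a sparse tensor in PyTorch.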
A few words on the MovieLens 20M dataset.
Files 1-5 can be considered context-based information, while rating provides the user interaction data. Hence we can omit 1-5 for now.
rating.csv head -5
So far I have split the dataset randomly by users. I need to better understand what the dataset for a sequential RecSys should look like.
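To make the question concrete, here is a minimal sketch of both steps, assuming each rating row is a `(user_id, movie_id, rating, timestamp)` tuple. `split_users` and `build_sequences` are hypothetical helpers, and ordering each user's movies by timestamp is only one common way to build input for sequential recommenders, not necessarily the layout this PR should end up with.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Assumed row layout: (user_id, movie_id, rating, timestamp)
Event = Tuple[int, int, float, int]


def split_users(
    user_ids: Iterable[int], test_fraction: float = 0.2, seed: int = 42
) -> Tuple[Set[int], Set[int]]:
    """Randomly assign whole users to train or test (user-based split)."""
    users = sorted(set(user_ids))
    random.Random(seed).shuffle(users)  # seeded for reproducibility
    n_test = int(len(users) * test_fraction)
    return set(users[n_test:]), set(users[:n_test])


def build_sequences(events: Iterable[Event]) -> Dict[int, List[int]]:
    """Group events by user and order each user's movies by timestamp."""
    by_user: Dict[int, List[Tuple[int, int]]] = defaultdict(list)
    for user_id, movie_id, _rating, timestamp in events:
        by_user[user_id].append((timestamp, movie_id))
    return {user: [m for _, m in sorted(pairs)] for user, pairs in by_user.items()}


events = [(1, 10, 4.0, 300), (1, 20, 3.0, 100), (2, 30, 5.0, 200)]
sequences = build_sequences(events)
train_users, test_users = split_users(range(5))
```

With the sample events above, user 1's sequence is `[20, 10]` because the rating of movie 20 has the earlier timestamp.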
Related Issue
Type of Change
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.
Checklist
latest and minimal requirements?