Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update readme for wikidata KG #883

Merged
merged 1 commit into from
Aug 8, 2019
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 6 additions & 4 deletions notebooks/01_prepare_data/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,14 +6,16 @@ data preparation tasks witnessed in recommendation system development.

| Notebook | Description |
| --- | --- |
| [data_split](data_split.ipynb) | Details on splitting data (randomly, chronologically, etc).
| [data_transform](data_transform.ipynb) | Guidance on how to transform (implicit / explicit) data for building collaborative filtering typed recommender.
| [data_split](data_split.ipynb) | Details on splitting data (randomly, chronologically, etc). |
| [data_transform](data_transform.ipynb) | Guidance on how to transform (implicit / explicit) data for building collaborative filtering typed recommender. |
| [wikidata knowledge graph](wikidata_KG.ipynb) | Details on how to create a knowledge graph using Wikidata |

### Data split

** Data split
Three methods of splitting the data for training and testing are demonstrated in this notebook. Each supports both Spark and pandas DataFrames.
1. Random Split: this is the simplest way to split the data, it randomly assigns entries to either the train set or the test set based on the allocation ratio desired.
2. Chronological Split: in many cases accounting for temporal variations when evaluating your model can provide more realistic measures of performance. This approach will split the train and test set based on timestamps by user or item.
3. Stratified Split: it may be preferable to ensure the same set of users or items are in the train and test sets, this method of splitting will ensure that is the case.

** Data transform
### Data transform
Data transformation techniques which are commonly used in various recommendation scenarios are introduced and reviewed.