Wikidata #778
Conversation
Check out this pull request on ReviewNB: https://app.reviewnb.com/microsoft/recommenders/pull/778
Thanks for the work!
In general, it would be great to have:
- More text and background introduction about knowledge graphs. Since this can be a very big topic, it would be good to focus the notebook on something that relates directly to recommendation (e.g., DKN), with entity linking as a core part of the notebook.
- In the repository, we have developed a DevOps pipeline that helps test code in the utility functions and notebooks. Try to understand how it works and how to write good unit tests for your functions. Good examples can be found in the tests folder.
reco_utils/dataset/wikidata.py
Outdated
```python
        entityID: wikidata entityID corresponding to the title string.
            'entityNotFound' will be returned if no page is found
    """
    url = "https://en.wikipedia.org/w/api.php?action=query&prop=pageprops&format=json&titles="
```
Hardcoded string may not be desirable. Make it an input variable or constant.
hi, sorry for the delay in the answer. Since the request in the new query (I had to make some changes) is a concatenation of strings and a variable:

```python
requests.get("https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=" + name + "&format=json&prop=pageprops&ppprop=wikibase_item")
```

how do you suggest I do this? Should I make the two substrings constants and concatenate them in the function?
here's an approach:

```python
# defined at beginning of module
import urllib.parse

import requests

API_URL = "https://en.wikipedia.org/w/api.php"


def find_wikidataID(name):
    url_opts = "&".join([
        "action=query",
        "list=search",
        "srsearch={}".format(urllib.parse.quote(name)),
        "format=json",
        "prop=pageprops",
        "ppprop=wikibase_item",
    ])
    return requests.get("{url}?{opts}".format(url=API_URL, opts=url_opts))
```
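A possible variation on the same idea: requests can assemble and URL-encode the query string itself through its `params` argument, which avoids the manual quote/join. This is only a sketch; the function name below is illustrative, not part of the PR.

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"


def build_search_request(name):
    """Build (but do not send) the page-title search request.

    requests handles URL-encoding of each parameter value, so no manual
    urllib.parse.quote call is needed.
    """
    return requests.Request(
        "GET",
        API_URL,
        params={
            "action": "query",
            "list": "search",
            "srsearch": name,
            "format": "json",
            "prop": "pageprops",
            "ppprop": "wikibase_item",
        },
    ).prepare()
```

Sending it is then `requests.Session().send(build_search_request(name))`, or simply `requests.get(API_URL, params=...)` in one step.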
I implemented this here: a5338c0 Thanks a lot for the tip!
reco_utils/dataset/wikidata.py
Outdated
```python
    try:
        entityID = r.json()["query"]["pages"][entityID]["pageprops"]["wikibase_item"]
    except:
        entityID = "entityNotFound"
```
why returning a string vs raising an exception?
I return a string for the cases where the function is used in a loop, so we can have a record of that entity not having a response and the code can keep running. Do you recommend doing something else?
this can vary a bit, but often I find it is easier to raise the exception here and let the calling function catch it and handle it as needed. This removes the need to check the output for a specific string defined here and so there is looser coupling across functions and more flexibility in handling errors upstream.
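A hedged sketch of that pattern (names and the response shape are simplified stand-ins for the real code in reco_utils/dataset/wikidata.py): the lookup raises on a miss, and the calling loop decides how each failure is recorded, so no sentinel string is needed.

```python
def find_wikidata_id(name, response):
    """Return the wikidata entity ID for a title, raising if none is found.

    `response` stands in for the parsed API JSON (a dict) so the sketch
    stays offline; the real code would issue the HTTP request here.
    """
    try:
        return response["query"]["pages"]["pageprops"]["wikibase_item"]
    except KeyError as e:
        raise ValueError("no entity found for {}".format(name)) from e


def link_titles(titles, responses):
    # The caller catches the exception, so one miss does not stop the loop
    results = {}
    for title in titles:
        try:
            results[title] = find_wikidata_id(title, responses[title])
        except ValueError:
            results[title] = None  # or log it, or skip the entry
    return results
```

This keeps the error-handling policy (record, skip, retry) in the caller rather than baked into the lookup function.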
@almudenasanz let me know if you need help with this. There is information here: https://github.com/microsoft/recommenders/tree/master/tests
Hi @almudenasanz, we were talking internally. @chenhuims is going to work on KG networks with you. He will probably do CKE, which will complement your work on Ripple and KGCN. We thought it would be interesting if the final output of the notebook you are doing could be the knowledge graph of Movielens as a dumped file (in the format of the networks), which we would save in a blob. Then the notebooks of the KG networks would start from that saved file, and after that you could work in parallel. We would need a KG for each of the 4 ML datasets (100k, 1M, 10M, 20M). @Leavingseason was wondering if there is a restriction on the number of requests in the Wikidata API
One note on ways of working for @almudenasanz and @chenhuims. In other situations where there are several people working on the same issue, they will agree on the tasks to work on (depending on the bandwidth) and then either push to the same branch (in your case it would be
@Leavingseason @miguelgfierro In the Query Limits section of the documentation (https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual) they mention: "the service is limited to 5 parallel queries per IP". Seems like the only limit is on parallel queries per IP, not on sequential queries
just curious if this would be made simpler by leveraging an existing package like: https://pypi.org/project/Wikidata/
The latest release is from 2017; is this maintained?
hmm, there are comments from the maintainer on issues that are more recent (Dec 2018); it's possible the API hasn't changed enough to warrant any updates?
Hi @gramhagen, I looked into the package, but for the specific implementations that I use in the notebook:
I did not find implementations in the package that are simpler than the ones I implemented. But happy to discuss any suggestions!
@miguelgfierro I added a new section to the notebook that extracts the entities for the 100k Movielens version. I had to reimplement the find_wikidataID method a bit, because I was matching strings to exact Wikipedia page titles. Some of the movie titles did not exactly match the Wikipedia title, so I added a simple text query that retrieves the first matching page title, and now it works well for all movies. I need to ask you where to put the output file of the MovieLens KG
ok, no problem, wasn't sure if it would be helpful to leverage that, but sounds like it's not in this case. Thanks for checking into it
we are changing the blobs of recommenders, can you send me the file somehow? Then I'll upload it to the correct place
hey @almudenasanz @chenhuims, do you have an update on the state of this PR and the work you are doing with movielens+wikidata? Please let me know if you have any blockers
this looks good @almudenasanz. One question: how long does it take to compute the small KG with movielens 100k? Can you add a test for the notebook? Depending on the time, it will go in the unit tests or maybe the smoke tests. Here is info on how to add the test: https://github.com/microsoft/recommenders/tree/master/tests#how-to-create-tests-on-notebooks-with-papermill
this looks good @almudenasanz. One question, how long does it take to compute the small KG with movielens 100k?
It takes 1-2 seconds per entry (combining text query to wikidata entity ID and finding related entities), so I was able to query the 100K Movielens dataset containing 1682 movies in 45 mins. The time it will take for the other KGs will depend on the number of movies reviewed. According to this link https://grouplens.org/datasets/movielens/ the amounts are:
- Movielens 100K: ~1,700 movies
- Movielens 1M: ~4,000 movies
- Movielens 10M: ~10,000 movies

The API supports up to 5 parallel queries from the same IP, so we could reduce the time roughly by a factor of 5.
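Since the documented limit is on parallel queries, a thread pool capped at five workers is one way to approach that 5x speedup without exceeding the per-IP limit. A rough sketch, with a stub standing in for the actual Wikidata request:

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_entity(title):
    # Stand-in for the real API call; replace the body with the actual
    # request (e.g., the find_wikidataID lookup) in practice
    return "entity-for-" + title


def fetch_all(titles, max_workers=5):
    # Cap workers at 5 to respect Wikidata's per-IP parallel query limit;
    # pool.map preserves the input order of titles in its results
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_entity, titles))
```

This assumes the per-request latency (1-2 s) is dominated by network wait, so threads overlap well; it does not handle retries or rate-limit errors, which the real code would need.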
Can you add a test for the notebook? Depending on the time, they will go in unit test or maybe smoke tests
Here info on how to add the test https://github.com/microsoft/recommenders/tree/master/tests#how-to-create-tests-on-notebooks-with-papermill
I will look into the tests
wow that's a lot. Maybe we can think of a way of running the first part (without movielens) in the unit tests, and then do some queries (not all) in the movielens part. Ideally we want the unit test to be less than one min and integration less than 15-20 min. Do you have any idea on how to execute the notebook under the times I mentioned?
I can create a parameter to run the tests only on a sample of the movielens dataset, would that work?
yeah that's reasonable. I think a good way of doing this would be to have a unit test that just checks that the first part of the notebook runs, similar to this example. Then we can have what you suggested in the integration test and check some outputs programmatically, like in this example. Maybe check that the first rows of the KG of movielens are correctly created, similar to what we are doing in the criteo tests. Under this schema, the integration tests should take less than 5 min.
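As a sketch of how that could be wired up with papermill, following the repo's notebook-test pattern: a small helper picks the sample size per test tier and papermill injects it into the notebook. The notebook path and parameter names here are hypothetical, not the ones in the PR.

```python
def notebook_parameters(test_tier, sample_size=100):
    """Pick notebook parameters per test tier (names are hypothetical).

    Unit tests run on a tiny sample so the notebook finishes well under a
    minute; integration tests use a larger sample and verify outputs after.
    """
    if test_tier == "unit":
        sample_size = min(sample_size, 5)
    return {"MOVIELENS_SAMPLE": True, "MOVIELENS_SAMPLE_SIZE": sample_size}


def run_notebook(notebook_path, output_path, test_tier):
    # Deferred import so the helper above is usable without papermill
    import papermill as pm

    pm.execute_notebook(
        notebook_path,
        output_path,
        parameters=notebook_parameters(test_tier),
    )
```

The integration test would then open the executed output notebook (or the dumped KG file) and assert on its contents.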
Great, I already started working on them, and querying the 1M MovieLens dataset. I will commit them when I finish
I added the tests, and sent you the 1M file by email. I introduced sample parameters so both the unit and integration tests can run in under 1 min, and the integration test checks that the output file has the expected number of responses
@almudenasanz, I solved the conflicts but one test failed, there is a small error in the code:
just fixed it
@almudenasanz there is a time out:
I have changed the default parameters of the notebook to do the sampling, and the tests have passed. It seems like the injection of the parameters for the test was not working.
* new file with wikidata functions
* fix in json extraction
* new notebook with wikidata use examples
* retry request with lowercase in case of failure
* WIP: example creating KG from movielens entities
* introduced new step to retrieve first page title from a text query in wikipedia
* updated movielens links extraction using wikidata
* adapted docstrings for sphinx and removed parenthesis from output
* added description and labels to nodes to graph preview
* #778 (comment) new format for queries
* raising exceptions in requests and using get() to retrieve dict values
* moved imports to first cell and movielens size as a parameter
* output file name as parameter
* DATA: update sum check
* adding unit test for sum to 1 issue
* improved description and adapted to tests
* improved Exception descriptions
* integration tests
* unit tests
* added wikidata_KG to conftest
* changed name notebook
* NOTE: Adding shows the computation time of all tests.
* imports up
* Update wikidata.py
* changed default parameter of sample for tests
* Add sphinx documentation for wikidata
* modified parameter extraction for tests
* added parameters tag to cell
* changed default sampling to test parameters in test
* notebook cleaned cells output
* Docker Support (#718)
  * DOCKER: add pyspark docker file
  * DOCKER: remove unused line
  * DOCKER: remove old file
  * DOCKER: add SETUP text
  * DOCKER: add azureml
  * DOCKER: update dockerfile
  * DOCKER: use a branch of the repo
  * SETUP: update setup
  * DOCKER: update dockerfile
  * DOC: update setup
  * DOCKER: one that binds all
  * SETUP: update docker use
  * DOCKER: move to top level
  * SETUP: use a different base name
  * DOCKER: use the same keywords in the repo for environment arg
  * SETUP: update environment variable names
  * updating dockerfile to use multistage build and adding readme
  * adding full stage
  * fixing documentation
  * adding info for running full env
  * README: update notes for exporting environment on certain platform
  * README: updated with example on Windows
  * README: fix typo
Description
The final objective is to use Wikidata as a new Knowledge Graph for recommendation algorithms, and to extract entity descriptions so that new datasets (like Movielens) can be used with DKN. This is the first step in that direction. I have implemented:
New utility functions to run specific queries in Wikidata:
To test the functions created, I have added a new notebook. The first section consists of creating a Knowledge Graph from the linked entities in Wikidata and visualising the resulting KG. The second part tests enriching an entity name with its description and list of related entities; the goal is to use this enrichment for new datasets (like Movielens) with DKN.
Related Issues
#525
Checklist: