-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Using JSONLD to retrieve information #43
Conversation
I am a fan of the If you swap there argument order around, if has a really elegant way of writing it, You can define it as
Demo:
|
Codecov Report
@@ Coverage Diff @@
## master #43 +/- ##
==========================================
+ Coverage 95.03% 95.49% +0.45%
==========================================
Files 15 18 +3
Lines 302 355 +53
==========================================
+ Hits 287 339 +52
- Misses 15 16 +1
Continue to review full report at Codecov.
|
Right, I was thinking of adding |
That is what the code I posted does. It recursively peals off the first key until it is found, or until there is no keys left. |
We should probably check that the JSON-LD has the key-value pair
And if it doesn't erroring out; or at least I believe On the other hand we could just not worry about it: Garbage-in => garbage out |
abstract type JSONLD <: DataRepo | ||
end | ||
|
||
export JSONLD_Web, JSONLD_DOI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All exports go in src/DataDepsGenerators.jl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is how I used to do for abstracted types like DataOnev2
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well it isn't a big deal and it is trivial to change later. So I guess either is good.
src/JSONLD/JSONLD.jl
Outdated
function get_license(mainpage) | ||
license = handle_keys("license", "", mainpage) | ||
if license != nothing | ||
if isa(license, String) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isa
is available in infix notation
license isa String
src/JSONLD/JSONLD.jl
Outdated
end | ||
end | ||
|
||
function handle_keys(key1::String, key2::String, json) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do want this changed to the have collection first then keys as a vararg.
Preference arguement order: JuliaLang/julia#19150 (comment)
Collection then Key.
Varargs:
https://docs.julialang.org/en/v0.6.1/manual/functions/#Varargs-Functions-1
src/JSONLD/JSONLD.jl
Outdated
end | ||
end | ||
|
||
function get_urls(repo::JSONLD, page) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no need for both initializing this and for having an else statement.
Do one or the other.
src/JSONLD/JSONLD.jl
Outdated
urls | ||
end | ||
|
||
function get_checksums(repo::JSONLD, page) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should just define:
get_checksums(::DataRepo) = nothing
And then only overwrite it when we have something.
Rather than repeating it for all types.
src/JSONLD/JSONLD.jl
Outdated
end | ||
|
||
function mainpage_url(repo::JSONLD, dataname) | ||
#We are making it work for both figshare id or doi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this comment doesn't belong here.
Also shouldn't this method be on JSONLD_Web
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have this method in JSONLD_Web
already.
So delete this one.
src/JSONLD/JSONLD_DOI.jl
Outdated
if match_doi(dataname) != nothing | ||
url = joinpath("https://data.datacite.org/", match_doi(dataname)) | ||
resp = HTTP.get(url, ["Accept"=>"application/vnd.schemaorg.ld+json"]; forwardheaders=true) | ||
json = JSON.parse(resp.body |> String |> strip) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are going to use the pipe operator, use it completely, or not at all.
resp.body |> String |> strip) |> JSON.parse
Or
JSON.parse(strip(String(resp.body)))
Also, is the strip
required?
test/JSONLD/JSONLD.jl
Outdated
using ReferenceTests | ||
|
||
@testset "JSONLD test" begin | ||
@test_reference "../references/JSONLD_Web Kaggle.txt" generate(JSONLD_Web(), "https://zenodo.org/record/1287281") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The title says Kaggle
but the URL says zenodo
.
We want to test both Kaggle, and Zenodo,
and DataVerse,
and FigShare via Web
License: https://creativecommons.org/licenses/by/4.0/ | ||
Date: December 20, 2016 | ||
|
||
<p>Prepared by the Research Group on Earthquake Geology in Greece (http://eqgeogr.weebly.com/)</p> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we need to strip any HTML out of the description.
We have Gumbo to parse HTML and the text_only
method already.
So should be easy to add a function in utils.jl
for that
src/utils.jl
Outdated
@@ -18,6 +23,15 @@ text_only(doc::HTMLDocument) = text_only(doc.root) | |||
text_only(frag) = join([replace(text(leaf), "\r","") for leaf in Leaves(frag) if leaf isa HTMLText], " ") | |||
text_only(frags::Vector) = join(text_only.(frags), " ") | |||
|
|||
function filter_html(random) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Split this into two methods.
filter_html(::Void)=""
and filter_html(content) = ...
Also random
is a terrible name, content
or text
is better.
Splitting it is two avoids using isa
in favor of dispatch,
which is more idiomatic julia
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was debugging and forgot to update 😅
Dataset: Phenotypic evaluation data of medium duration Pigeonpea advanced varieties trial | ||
Website: http://dataverse.icrisat.org/dataset.xhtml?persistentId=doi:10.21421/D2/ZS6XX1 | ||
Author: Sameer Kumar, CV, Anupama Hingane | ||
License: <img src ="https://licensebuttons.net/l/by/4.0/88x31.png"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like stripping the HTML is also required on this field.
And since license is potentially long,
it should probably be always last, after things like date
src/JSONLD/JSONLD.jl
Outdated
elseif authors isa Dict | ||
return [handle_keys(authors, "name")] | ||
else | ||
return [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add an @assert(authors==nothing
if that is what you are expecting here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't quite understand this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under what circumstances does the else trigger?
src/JSONLD/JSONLD.jl
Outdated
try | ||
return Dates.format(Dates.DateTime(rawdate), "U d, yyyy") | ||
catch error | ||
if error isa MethodError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this to handle rawdate==nothing
If so better to handle it if an if rawdate==nothing
than with exception handling.
Exception handling can be the answer to this kind of thing I think.
But much more if there is a sequence of functions that might fail at anypoint,
and you want to treat them all the same.
For just one guarding against it with an if
is better.
Julia (unlike say python
) generally prefers avoiding exceptions to handling them.,
Also you forgot the else rethrow()
in the catch block's if
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to handle the difference in the incoming date formats. Some have just the year and some have different formats of date.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not good to catch all errors.
And when not clear (like this is to me) it is good to comment with what kind of thing us causing the error.
handle_keys(json, otherkeys...) | ||
end | ||
|
||
handle_keys(json) = nothing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this missing
,
which will have to be done anyway later (when we move most things to be allowed to be missing and to properate missings everywhere)
and it will let skipmissing
be used on URL's too.
A bunch of nothing
conditions will need to become missing
conditions, but that is ok and will need to be done anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How will I handle if url_list != nothing
if I am using missing
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll make a PR on this and show you.
This is mergable right now I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, it will be helpful if you can.
src/JSONLD/JSONLD.jl
Outdated
urls = [] | ||
url_list = handle_keys(page, "distribution") | ||
if url_list != nothing | ||
urls = [handle_keys(ii, "contentUrl") for ii in url_list if handle_keys(ii, "contentUrl") != nothing] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use skipmissing
as discussed above
src/utils.jl
Outdated
@@ -8,6 +8,11 @@ downloads and parses the page from the URL | |||
""" | |||
getpage(url) = parsehtml(String(read(quiet_download(url)))) | |||
|
|||
# function parsehtml(junk::String) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delete this commented out method?
src/utils.jl
Outdated
if random isa Void | ||
return "" | ||
end | ||
if ismatch(r"<(\"[^\"]*\"|'[^']*'|[^'\">])*>", random) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this idea, that you check if it possibly could be HTML, using regex, before parsing it and stripping tags.
It makes for a bit faster then parsing unnecessarily.
It would be good to add a comment to this line saying: "Check if might be HTML"
so it is clear that is what is going on.
Kaggle for some reason is down. Will restart the tests once that works. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very cool.
I think we have technically cracked the 150 million "datasets" mark with this.
I say technically because 97 million of them would be from CrossRef DOIs, which are not data as most people would consider them
No description provided.