Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using JSONLD to retrieve information #43

Merged
merged 8 commits into from
Jul 6, 2018

Conversation

SebastinSanty
Copy link
Collaborator

No description provided.

@oxinabox
Copy link
Owner

oxinabox commented Jul 4, 2018

I am a fan of the handle_key method.

If you swap there argument order around, if has a really elegant way of writing it,
using the splatting arguments. (I think they say varadic function?)
(and follows the preferred order of collections before keys)

You can define it as

handle_key(json, key, otherkeys...) = get(json,  key) do  # do block is called when they is not found
     handle_key(json, otherkeys...) # This key missed, so try the others
end
handle_key(json) = nothing

Demo:

julia> foo = Dict("a"=>2)
Dict{String,Int64} with 1 entry:
  "a" => 2

julia> handle_key(foo, "a")  |> println
2

julia> handle_key(foo, "b") |> println
nothing

julia> handle_key(foo, "b", "a") |> println
2

@codecov-io
Copy link

codecov-io commented Jul 4, 2018

Codecov Report

Merging #43 into master will increase coverage by 0.45%.
The diff coverage is 96.29%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #43      +/-   ##
==========================================
+ Coverage   95.03%   95.49%   +0.45%     
==========================================
  Files          15       18       +3     
  Lines         302      355      +53     
==========================================
+ Hits          287      339      +52     
- Misses         15       16       +1
Impacted Files Coverage Δ
src/JSONLD/JSONLD_DOI.jl 100% <100%> (ø)
src/DataDepsGenerators.jl 91.42% <100%> (ø) ⬆️
src/utils.jl 100% <100%> (+10%) ⬆️
src/JSONLD/JSONLD_Web.jl 90% <90%> (ø)
src/JSONLD/JSONLD.jl 96.96% <96.96%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c0b6dfc...a4929ff. Read the comment docs.

@SebastinSanty
Copy link
Collaborator Author

Right, I was thinking of adding kwargs like multiple arguments for handle_keys, which can check in the given order till the correct key is found.

@oxinabox
Copy link
Owner

oxinabox commented Jul 4, 2018

which can check in the given order till the correct key is found.

That is what the code I posted does. It recursively peals off the first key until it is found, or until there is no keys left.
It just does so via dispatch and splatting

@oxinabox
Copy link
Owner

oxinabox commented Jul 4, 2018

We should probably check that the JSON-LD has the key-value pair

"@context": "http://schema.org",

And if it doesn't erroring out; or at least warning

I believe @context is JSON-LD for namespaces. Or something like that
So if it doesn't have that we can't know that it is using the names we expect.

On the other hand we could just not worry about it: Garbage-in => garbage out

abstract type JSONLD <: DataRepo
end

export JSONLD_Web, JSONLD_DOI
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All exports go in src/DataDepsGenerators.jl

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how I used to do for abstracted types like DataOnev2.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well it isn't a big deal and it is trivial to change later. So I guess either is good.

function get_license(mainpage)
license = handle_keys("license", "", mainpage)
if license != nothing
if isa(license, String)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isa is available in infix notation
license isa String

end
end

function handle_keys(key1::String, key2::String, json)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do want this changed to the have collection first then keys as a vararg.
Preference arguement order: JuliaLang/julia#19150 (comment)
Collection then Key.

Varargs:
https://docs.julialang.org/en/v0.6.1/manual/functions/#Varargs-Functions-1

end
end

function get_urls(repo::JSONLD, page)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need for both initializing this and for having an else statement.
Do one or the other.

urls
end

function get_checksums(repo::JSONLD, page)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should just define:
get_checksums(::DataRepo) = nothing
And then only overwrite it when we have something.
Rather than repeating it for all types.

end

function mainpage_url(repo::JSONLD, dataname)
#We are making it work for both figshare id or doi
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment doesn't belong here.

Also shouldn't this method be on JSONLD_Web ?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have this method in JSONLD_Web already.
So delete this one.

if match_doi(dataname) != nothing
url = joinpath("https://data.datacite.org/", match_doi(dataname))
resp = HTTP.get(url, ["Accept"=>"application/vnd.schemaorg.ld+json"]; forwardheaders=true)
json = JSON.parse(resp.body |> String |> strip)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are going to use the pipe operator, use it completely, or not at all.
resp.body |> String |> strip) |> JSON.parse

Or

JSON.parse(strip(String(resp.body)))
Also, is the strip required?

using ReferenceTests

@testset "JSONLD test" begin
@test_reference "../references/JSONLD_Web Kaggle.txt" generate(JSONLD_Web(), "https://zenodo.org/record/1287281")
Copy link
Owner

@oxinabox oxinabox Jul 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The title says Kaggle but the URL says zenodo.
We want to test both Kaggle, and Zenodo,
and DataVerse,
and FigShare via Web

License: https://creativecommons.org/licenses/by/4.0/
Date: December 20, 2016

<p>Prepared by the Research Group on Earthquake Geology in Greece (http://eqgeogr.weebly.com/)</p>
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we need to strip any HTML out of the description.
We have Gumbo to parse HTML and the text_only method already.
So should be easy to add a function in utils.jl for that

src/utils.jl Outdated
@@ -18,6 +23,15 @@ text_only(doc::HTMLDocument) = text_only(doc.root)
text_only(frag) = join([replace(text(leaf), "\r","") for leaf in Leaves(frag) if leaf isa HTMLText], " ")
text_only(frags::Vector) = join(text_only.(frags), " ")

function filter_html(random)
Copy link
Owner

@oxinabox oxinabox Jul 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split this into two methods.
filter_html(::Void)=""
and filter_html(content) = ...
Also random is a terrible name, content or text is better.

Splitting it is two avoids using isa in favor of dispatch,
which is more idiomatic julia

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was debugging and forgot to update 😅

Dataset: Phenotypic evaluation data of medium duration Pigeonpea advanced varieties trial
Website: http://dataverse.icrisat.org/dataset.xhtml?persistentId=doi:10.21421/D2/ZS6XX1
Author: Sameer Kumar, CV, Anupama Hingane
License: <img src ="https://licensebuttons.net/l/by/4.0/88x31.png">
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like stripping the HTML is also required on this field.

And since license is potentially long,
it should probably be always last, after things like date

elseif authors isa Dict
return [handle_keys(authors, "name")]
else
return []
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add an @assert(authors==nothing
if that is what you are expecting here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't quite understand this

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under what circumstances does the else trigger?

try
return Dates.format(Dates.DateTime(rawdate), "U d, yyyy")
catch error
if error isa MethodError
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this to handle rawdate==nothing
If so better to handle it if an if rawdate==nothing than with exception handling.

Exception handling can be the answer to this kind of thing I think.
But much more if there is a sequence of functions that might fail at anypoint,
and you want to treat them all the same.
For just one guarding against it with an if is better.
Julia (unlike say python) generally prefers avoiding exceptions to handling them.,

Also you forgot the else rethrow() in the catch block's if

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to handle the difference in the incoming date formats. Some have just the year and some have different formats of date.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not good to catch all errors.
And when not clear (like this is to me) it is good to comment with what kind of thing us causing the error.

handle_keys(json, otherkeys...)
end

handle_keys(json) = nothing
Copy link
Owner

@oxinabox oxinabox Jul 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this missing,
which will have to be done anyway later (when we move most things to be allowed to be missing and to properate missings everywhere)
and it will let skipmissing be used on URL's too.

A bunch of nothing conditions will need to become missing conditions, but that is ok and will need to be done anyway.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will I handle if url_list != nothing if I am using missing?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make a PR on this and show you.
This is mergable right now I think

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, it will be helpful if you can.

urls = []
url_list = handle_keys(page, "distribution")
if url_list != nothing
urls = [handle_keys(ii, "contentUrl") for ii in url_list if handle_keys(ii, "contentUrl") != nothing]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use skipmissing as discussed above

src/utils.jl Outdated
@@ -8,6 +8,11 @@ downloads and parses the page from the URL
"""
getpage(url) = parsehtml(String(read(quiet_download(url))))

# function parsehtml(junk::String)
Copy link
Owner

@oxinabox oxinabox Jul 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete this commented out method?

src/utils.jl Outdated
if random isa Void
return ""
end
if ismatch(r"<(\"[^\"]*\"|'[^']*'|[^'\">])*>", random)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this idea, that you check if it possibly could be HTML, using regex, before parsing it and stripping tags.
It makes for a bit faster then parsing unnecessarily.
It would be good to add a comment to this line saying: "Check if might be HTML"
so it is clear that is what is going on.

@SebastinSanty
Copy link
Collaborator Author

Kaggle for some reason is down. Will restart the tests once that works.

Copy link
Owner

@oxinabox oxinabox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very cool.
I think we have technically cracked the 150 million "datasets" mark with this.
I say technically because 97 million of them would be from CrossRef DOIs, which are not data as most people would consider them

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants