Using JSONLD to retrieve information #43

SebastinSanty · 2018-07-04T12:14:00Z

No description provided.

oxinabox · 2018-07-04T12:55:31Z

I am a fan of the handle_key method.

If you swap the argument order around, there is a really elegant way of writing it
using splatted arguments (I think the term is variadic function?).
It also follows the preferred order of collection before keys.

You can define it as

handle_key(json, key, otherkeys...) = get(json, key) do  # the do block is called when the key is not found
    handle_key(json, otherkeys...) # This key missed, so try the others
end
handle_key(json) = nothing

Demo:

julia> foo = Dict("a"=>2)
Dict{String,Int64} with 1 entry:
  "a" => 2

julia> handle_key(foo, "a")  |> println
2

julia> handle_key(foo, "b") |> println
nothing

julia> handle_key(foo, "b", "a") |> println
2

codecov-io · 2018-07-04T12:56:10Z

Codecov Report

Merging #43 into master will increase coverage by 0.45%.
The diff coverage is 96.29%.

@@            Coverage Diff             @@
##           master      #43      +/-   ##
==========================================
+ Coverage   95.03%   95.49%   +0.45%     
==========================================
  Files          15       18       +3     
  Lines         302      355      +53     
==========================================
+ Hits          287      339      +52     
- Misses         15       16       +1

Impacted Files	Coverage Δ
src/JSONLD/JSONLD_DOI.jl	`100% <100%> (ø)`
src/DataDepsGenerators.jl	`91.42% <100%> (ø)`	⬆️
src/utils.jl	`100% <100%> (+10%)`	⬆️
src/JSONLD/JSONLD_Web.jl	`90% <90%> (ø)`
src/JSONLD/JSONLD.jl	`96.96% <96.96%> (ø)`

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c0b6dfc...a4929ff.

SebastinSanty · 2018-07-04T12:59:53Z

Right, I was thinking of adding kwargs like multiple arguments for handle_keys, which can check in the given order till the correct key is found.

oxinabox · 2018-07-04T13:02:41Z

which can check in the given order till the correct key is found.

That is what the code I posted does. It recursively peels off the first key until one is found, or until there are no keys left.
It just does so via dispatch and splatting.

oxinabox · 2018-07-04T13:04:47Z

We should probably check that the JSON-LD has the key-value pair

"@context": "http://schema.org",

And if it doesn't, we should error out, or at least warn.

I believe @context is how JSON-LD declares namespaces, or something like that.
So if it doesn't have that, we can't know that it is using the names we expect.

On the other hand we could just not worry about it: Garbage-in => garbage out
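A minimal sketch of the warning variant, in Julia 1.x syntax. The helper name `has_schema_org_context` is hypothetical and not part of this PR, and it assumes `json` is the `Dict` returned by `JSON.parse`. It only recognises a plain string `@context`; array or object contexts are treated as unknown.

```julia
# Hypothetical helper (not in the PR): check that a parsed JSON-LD document
# declares schema.org as its @context, and warn when it does not.
function has_schema_org_context(json::AbstractDict)
    context = get(json, "@context", nothing)
    # Accept http or https, with or without a trailing slash.
    # Non-string contexts (arrays, nested objects) count as "unknown" here.
    ok = context isa AbstractString &&
         occursin(r"^https?://schema\.org/?$", context)
    if !ok
        @warn "JSON-LD @context is not schema.org; keys may not mean what we expect" context
    end
    return ok
end
```

Returning a `Bool` leaves the caller free to choose between erroring out and carrying on (garbage in, garbage out).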

oxinabox · 2018-07-04T13:09:21Z

src/JSONLD/JSONLD.jl

+abstract type JSONLD <: DataRepo
+end
+
+export JSONLD_Web, JSONLD_DOI


All exports go in src/DataDepsGenerators.jl

This is how I used to do it for abstract types like DataOnev2.

Well, it isn't a big deal and it is trivial to change later. So I guess either is good.

oxinabox · 2018-07-04T13:17:26Z

src/JSONLD/JSONLD.jl

+function get_license(mainpage)
+    license = handle_keys("license", "", mainpage)
+    if license != nothing
+        if isa(license, String)


isa is available in infix notation
license isa String

oxinabox · 2018-07-04T13:26:27Z

src/JSONLD/JSONLD.jl

+    end
+end
+
+function handle_keys(key1::String, key2::String, json)


I do want this changed to have the collection first, then the keys as a vararg.
Preferred argument order: JuliaLang/julia#19150 (comment)
Collection, then key.

Varargs:
https://docs.julialang.org/en/v0.6.1/manual/functions/#Varargs-Functions-1

oxinabox · 2018-07-04T13:28:41Z

src/JSONLD/JSONLD.jl

+    end
+end
+
+function get_urls(repo::JSONLD, page)


There is no need for both initializing this and for having an else statement.
Do one or the other.

oxinabox · 2018-07-04T13:29:45Z

src/JSONLD/JSONLD.jl

+    urls
+end
+
+function get_checksums(repo::JSONLD, page)


I think we should just define:
get_checksums(::DataRepo, page) = nothing
And then only override it when we have something,
rather than repeating it for all types.
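A sketch of how that fallback plays out under dispatch, in Julia 1.x syntax. The type definitions here are stand-ins so the snippet runs on its own. `ExampleRepo` and its `"md5"` key are hypothetical and only illustrate a repository that does have checksums.

```julia
# Stand-in type hierarchy so the snippet is self-contained.
abstract type DataRepo end
abstract type JSONLD <: DataRepo end
struct JSONLD_Web <: JSONLD end
struct ExampleRepo <: DataRepo end  # hypothetical repo that provides checksums

# Generic fallback: most repositories have no checksum information.
get_checksums(::DataRepo, page) = nothing

# More specific method: only defined where there is something to return.
get_checksums(::ExampleRepo, page) = get(page, "md5", nothing)
```

With this in place, `get_checksums(JSONLD_Web(), page)` falls through to the generic method without any JSONLD-specific definition.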

oxinabox · 2018-07-04T13:30:58Z

src/JSONLD/JSONLD.jl

+end
+
+function mainpage_url(repo::JSONLD, dataname)
+    #We are making it work for both figshare id or doi


This comment doesn't belong here.

Also, shouldn't this method be on JSONLD_Web?

You have this method in JSONLD_Web already.
So delete this one.

oxinabox · 2018-07-04T13:35:22Z

src/JSONLD/JSONLD_DOI.jl

+    if match_doi(dataname) != nothing
+        url = joinpath("https://data.datacite.org/", match_doi(dataname))
+        resp = HTTP.get(url, ["Accept"=>"application/vnd.schemaorg.ld+json"]; forwardheaders=true)
+        json = JSON.parse(resp.body |> String |> strip)


If you are going to use the pipe operator, use it completely, or not at all.
resp.body |> String |> strip |> JSON.parse

Or

JSON.parse(strip(String(resp.body)))
Also, is the strip required?

oxinabox · 2018-07-04T13:41:17Z

test/JSONLD/JSONLD.jl

+using ReferenceTests
+
+@testset "JSONLD test" begin
+    @test_reference "../references/JSONLD_Web Kaggle.txt" generate(JSONLD_Web(), "https://zenodo.org/record/1287281")


The title says Kaggle but the URL says zenodo.
We want to test both Kaggle, and Zenodo,
and DataVerse,
and FigShare via Web

oxinabox · 2018-07-04T13:45:01Z

test/references/JSONLD_Web Kaggle.txt

+	License: https://creativecommons.org/licenses/by/4.0/
+	Date: December 20, 2016
+
+	<p>Prepared by the Research Group on Earthquake Geology in Greece (http://eqgeogr.weebly.com/)</p>


Looks like we need to strip any HTML out of the description.
We have Gumbo to parse HTML and the text_only method already.
So should be easy to add a function in utils.jl for that

oxinabox · 2018-07-05T03:31:55Z

src/utils.jl

@@ -18,6 +23,15 @@ text_only(doc::HTMLDocument) = text_only(doc.root)
 text_only(frag) = join([replace(text(leaf), "\r","") for leaf in Leaves(frag) if leaf isa HTMLText], " ")
 text_only(frags::Vector) = join(text_only.(frags), " ")

+function filter_html(random)


Split this into two methods:
filter_html(::Void) = ""
and filter_html(content) = ...
Also, random is a terrible name; content or text is better.

Splitting it in two avoids using isa in favor of dispatch,
which is more idiomatic Julia.

Yes, I was debugging and forgot to update 😅
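A sketch of the two-method split, written for Julia 1.x, where `Nothing` replaces `Void` and `occursin` replaces `ismatch`. The regex is the one from the diff. The tag stripping uses a plain regex `replace` as a stand-in so the snippet runs without dependencies; the review suggests Gumbo and `text_only` for the real thing.

```julia
# Tag-matching regex taken from the diff under review.
const HTML_TAG = r"<(\"[^\"]*\"|'[^']*'|[^'\">])*>"

# A missing field becomes an empty string; dispatch replaces the isa check.
filter_html(::Nothing) = ""

function filter_html(content::AbstractString)
    # Check if might be HTML; plain text is returned untouched.
    occursin(HTML_TAG, content) || return content
    # Stand-in for parsing with Gumbo: drop anything that looks like a tag.
    return strip(replace(content, HTML_TAG => ""))
end
```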

oxinabox · 2018-07-05T03:35:20Z

test/references/JSONLD_Web ICRISAT.txt

+    	Dataset: Phenotypic evaluation data of medium duration Pigeonpea advanced varieties trial
+	Website: http://dataverse.icrisat.org/dataset.xhtml?persistentId=doi:10.21421/D2/ZS6XX1
+	Author: Sameer Kumar, CV, Anupama Hingane
+	License: <img src ="https://licensebuttons.net/l/by/4.0/88x31.png">


Looks like stripping the HTML is also required on this field.

And since license is potentially long,
it should probably be always last, after things like date

oxinabox · 2018-07-05T05:41:14Z

src/JSONLD/JSONLD.jl

+    elseif authors isa Dict
+        return [handle_keys(authors, "name")]
+    else
+        return []


Maybe add an @assert(authors == nothing),
if that is what you are expecting here.

Didn't quite understand this

Under what circumstances does the else trigger?

oxinabox · 2018-07-05T05:47:14Z

src/JSONLD/JSONLD.jl

+    try
+        return Dates.format(Dates.DateTime(rawdate), "U d, yyyy")
+    catch error
+        if error isa MethodError


Is this to handle rawdate == nothing?
If so, it is better to handle it with an if rawdate == nothing than with exception handling.

Exception handling can be the answer to this kind of thing, I think,
but much more so when there is a sequence of functions that might fail at any point
and you want to treat them all the same.
For just one, guarding against it with an if is better.
Julia (unlike, say, Python) generally prefers avoiding exceptions to handling them.

Also you forgot the else rethrow() in the catch block's if

This is to handle the difference in the incoming date formats. Some have just the year and some have different formats of date.

It is not good to catch all errors.
And when it is not clear (as this is not to me), it is good to add a comment saying what kind of thing is causing the error.
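A sketch combining those points, for Julia 1.x. The function name `format_date` is hypothetical. It assumes the error raised by an unparseable date string is an `ArgumentError` (the diff checks for `MethodError`, which is what passing `nothing` would raise), and that falling back to the raw string is acceptable.

```julia
using Dates

# Guard the "no date" case with dispatch instead of catching a MethodError.
format_date(::Nothing) = ""

function format_date(rawdate::AbstractString)
    try
        return Dates.format(DateTime(rawdate), "U d, yyyy")
    catch err
        # Incoming dates vary in format; a string DateTime cannot parse
        # raises ArgumentError. Anything else is unexpected, so rethrow it.
        err isa ArgumentError || rethrow()
        return rawdate  # assumption: pass the unparsed date through unchanged
    end
end
```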

oxinabox · 2018-07-05T06:27:44Z

src/JSONLD/JSONLD.jl

+    handle_keys(json, otherkeys...)
+end
+
+handle_keys(json) = nothing


Make this missing,
which will have to be done anyway later (when we move most things to be allowed to be missing and propagate missings everywhere),
and it will let skipmissing be used on URLs too.

A bunch of nothing conditions will need to become missing conditions, but that is OK and will need to be done anyway.

How will I handle if url_list != nothing if I am using missing?

I'll make a PR on this and show you.
This is mergeable right now, I think.

Sure, it will be helpful if you can.

oxinabox · 2018-07-05T06:28:03Z

src/JSONLD/JSONLD.jl

+    urls = []
+    url_list = handle_keys(page, "distribution")
+    if url_list != nothing
+        urls = [handle_keys(ii, "contentUrl") for ii in url_list if handle_keys(ii, "contentUrl") != nothing]


use skipmissing as discussed above
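A sketch of what that could look like, for Julia 1.x (where `missing` is built in). It is not the code from the follow-up PR. `get_urls` here takes only the page, and `handle_keys` is redefined locally so the snippet runs on its own. Note that `url_list != missing` evaluates to `missing` rather than `true` or `false`, so the check has to be `ismissing(url_list)`.

```julia
# Try each key in order; fall through to `missing` when none is present.
handle_keys(json, key, otherkeys...) = get(json, key) do
    handle_keys(json, otherkeys...)
end
handle_keys(json) = missing

function get_urls(page)
    url_list = handle_keys(page, "distribution")
    # `if url_list != missing` would throw, since the comparison yields `missing`.
    ismissing(url_list) && return String[]
    # Look up each entry once and drop the ones without a contentUrl.
    return collect(skipmissing([handle_keys(item, "contentUrl") for item in url_list]))
end
```

This also removes the double lookup of `contentUrl` in the original comprehension, along with the initialise-then-overwrite of `urls`.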

oxinabox · 2018-07-05T06:30:13Z

src/utils.jl

@@ -8,6 +8,11 @@ downloads and parses the page from the URL
 """
 getpage(url) = parsehtml(String(read(quiet_download(url))))

+# function parsehtml(junk::String)


Delete this commented out method?

oxinabox · 2018-07-05T06:32:24Z

src/utils.jl

+    if random isa Void
+        return ""
+    end
+    if ismatch(r"<(\"[^\"]*\"|'[^']*'|[^'\">])*>", random)


I like this idea of using a regex to check whether the text could possibly be HTML before parsing it and stripping tags.
It is a bit faster than parsing unnecessarily.
It would be good to add a comment to this line saying "Check if might be HTML",
so it is clear that is what is going on.

SebastinSanty · 2018-07-05T14:50:39Z

Kaggle for some reason is down. Will restart the tests once that works.

oxinabox

very cool.
I think we have technically cracked the 150 million "datasets" mark with this.
I say technically because 97 million of them would be from CrossRef DOIs, which are not data as most people would consider them

SebastinSanty added 2 commits July 4, 2018 18:10

Using JSONLD to retrieve information

a0ec390

Add to JSONLD runtests

63640a2

SebastinSanty force-pushed the jsonldhtml branch from 67ea43b to 63640a2 Compare July 4, 2018 12:41

Use get() instead of try for dict keys

04a9709

oxinabox reviewed Jul 4, 2018

Make necessary changes

921f137

oxinabox reviewed Jul 5, 2018

Necessary changes II

60483e3

Resolve mistake

6f0b52e

SebastinSanty added 2 commits July 6, 2018 06:46

Make necessary changes III

7fd9f56

Strip HTML from license

a4929ff

oxinabox approved these changes Jul 6, 2018

oxinabox merged commit 69da784 into oxinabox:master Jul 6, 2018

This was referenced Jul 6, 2018

Use linked data to handle (nearly) arbitrary websites? #30

Closed

Support *all* DOIs? (CrossRef and DataCite, at least), via Content Negotiation for RDF? #29

Closed

Handle JSON-LD URLs and Authors with Missing #44

Merged
