Pass Accept header in contrib.utils.download #1491
Comments
Thanks @GraemeWatt. This is important, so we're quite happy you're bringing this (back) to our attention. From my understanding of what you've shown here, the recommendation from DataCite is that "DOIs should resolve to a landing page, not directly to the content", that "The DOI should be appropriately tagged (so that machines can read it)", and "can retrieve additional information about the item that might not be easily retrievable from the item itself." But as you've said, there's no way to get access to the actual data products associated with that particular DOI, so I'm not clear on what purpose the DOI has if it is just the metadata. The section "The landing page should provide a way to access the item" only makes explicit mention of humans, as opposed to humans and machines. So does this mean that DOIs are becoming human use only, and that accessing a data product associated with a DOI is necessarily a two-step process (get the DOI and then, from the DOI landing page, the data product download URL)? I am perhaps missing something obvious about all of this. If so, an explicit example would be great to see.
Hey @mfenner - can you help here? I think it should be possible to programmatically query the DOI and get the location of the underlying object, then fetch it. Is this correct? Is there any code available that demonstrates this?
Just wanted to follow up on this if @mfenner has time for input. Any thoughts here are appreciated!
Unfortunately DOIs routinely point to landing pages and not the content, as mentioned in the comments above. There are a number of reasons why this makes sense, e.g. access restrictions and different file formats, but that makes automated machine access very hard. A new DOI metadata field Metadata are specific to each DOI registration agency, so these things might work slightly differently for Crossref or any of the other DOI registration agencies. If schema.org metadata are available (via the landing page), one can use the |
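To make the schema.org route concrete, here is a minimal sketch of pulling download links out of JSON-LD embedded in a landing page. It assumes the page carries a `<script type="application/ld+json">` block describing a schema.org Dataset with `distribution` entries that have a `contentUrl`; whether a given landing page actually exposes these properties is an assumption, and `JsonLdCollector` and `content_urls` are hypothetical names used only for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the parsed bodies of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        # Only JSON-LD script blocks are of interest; other scripts are ignored.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.documents.append(json.loads("".join(self._chunks)))
            self._in_json_ld = False


def content_urls(landing_page_html):
    """Return every distribution contentUrl found in the page's JSON-LD."""
    parser = JsonLdCollector()
    parser.feed(landing_page_html)
    parser.close()

    urls = []
    for document in parser.documents:
        if not isinstance(document, dict):
            continue  # this sketch only handles a single top-level object
        distribution = document.get("distribution", [])
        if isinstance(distribution, dict):
            # schema.org allows a single object as well as a list
            distribution = [distribution]
        urls.extend(
            entry["contentUrl"] for entry in distribution if "contentUrl" in entry
        )
    return urls
```

This makes the two-step process explicit: fetch the landing page the DOI resolves to, then call `content_urls` on its HTML to find the file to download.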
I've been investigating three options to directly return content (i.e. the
with requests.get(archive_url, headers={'Accept': 'application/x-tar'}) as response:
Some other suggestions for improvements to this code:
Making these changes should not break the functionality with the current situation (where https://doi.org/10.17182/hepdata.89408.v1/r2 returns the tarball directly). I'd therefore recommend you make them ASAP before the next |
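For anyone following along, the Accept-header change can be sketched roughly as below. This is not the pyhf implementation: it swaps `requests` for the standard library's `urllib` so it runs without extra dependencies, and `build_archive_request`, `extract_archive`, and `download_archive` are hypothetical helper names. Only the `Accept: application/x-tar` header comes from the snippet above.

```python
import io
import tarfile
from urllib.request import Request, urlopen


def build_archive_request(archive_url):
    """Build a GET request that explicitly asks the server for a tarball."""
    return Request(archive_url, headers={"Accept": "application/x-tar"})


def extract_archive(payload, output_directory):
    """Unpack tar bytes (gzip-compressed or not) into output_directory.

    Only use this on archives from a source you trust, since tar members
    can carry unexpected paths.
    """
    with tarfile.open(mode="r:*", fileobj=io.BytesIO(payload)) as archive:
        archive.extractall(output_directory)


def download_archive(archive_url, output_directory):
    """Fetch archive_url with the Accept header set and unpack the response."""
    with urlopen(build_archive_request(archive_url)) as response:
        extract_archive(response.read(), output_directory)
```

A call would look like `download_archive("https://doi.org/10.17182/hepdata.89408.v1/r2", "workspace")`. Because the header is only a hint, a server that already returns the tarball directly should behave the same as before.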
I agree with your analysis. The DataCite media API was deprecated as it doesn't really fit into the outlined model. And content negotiation for |
(Very nice analysis). One slight concern I have with this is that the HistFactory JSON should not be treated as the only kind of JSON-like item that would be uploaded to HEPData -- is this taking into account a way to request a particular item as such, or would this be downloading all JSON items in a record?
@kratsg, it seems you misunderstood, so let me try to clarify. Solutions 1. to 3. above are ways to download a resource file (e.g. a The Schema.org JSON-LD referred to in solution 1. is a way of embedding metadata in a web page so it can be indexed by search engines (see "Understand how structured data works" from Google). This has nothing to do with the |
Thanks for this excellent analysis and summary @GraemeWatt — truly appreciated! 🚀 I'll get this in right away and then we can make additional improvements.
These are all excellent as well. I'll make these a new issue for |
I'm copying a comment here that I made in the HEPData Zulip chat on 16th October 2020.
Regarding the issue (HEPData/hepdata#162) to mint DOIs for all local resource files attached to a submission, if we do eventually get around to addressing it, we would probably redirect the DOI to a landing page for the resource file, rather than to the resource file itself (e.g. the pyhf tarball). This would follow the DataCite Best Practices for DOI Landing Pages, e.g. "DOIs should resolve to a landing page, not directly to the content", which I'm currently breaking for the two manually minted DOIs. In the issue (HEPData/hepdata#162) I mentioned the possibility of using DataCite Content Negotiation to redirect to the resource file itself, but the linked page now says "Custom content types are no longer supported since January 1st, 2020". I thought maybe content negotiation could be used to return the .tar.gz file directly, but the intended purpose is to retrieve DOI metadata in different formats, not to provide the content itself. In anticipation of possible future changes, I'd recommend that you use the URL directly rather than the DOI in pyhf download scripts and documentation (e.g. revert #1109).
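To illustrate the distinction drawn above, here is a small sketch of what DOI content negotiation is meant for: asking for metadata about the DOI in a particular format rather than for the file itself. It assumes the DataCite JSON media type `application/vnd.datacite.datacite+json`; `build_doi_metadata_request` is a hypothetical helper, and the request is only constructed here, not sent.

```python
from urllib.request import Request

# Media type for DataCite JSON metadata (assumed to be supported for this DOI).
DATACITE_JSON = "application/vnd.datacite.datacite+json"


def build_doi_metadata_request(doi):
    """Build a request asking the DOI resolver for metadata, not content."""
    return Request(f"https://doi.org/{doi}", headers={"Accept": DATACITE_JSON})
```

Passing the result to `urllib.request.urlopen` would be expected to return a metadata record describing the DOI, which is why a download script that needs the tarball is better served by the direct URL.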