Analyze all Jupyter notebooks mentioned in PubMed Central #25
Comments
A few notes... With EuropePMC, a search for ipynb OR jupyter gives 107 results. I find it extremely interesting that EuropePMC has the full text for 102 of these 107 articles/preprints, i.e. (jupyter OR ipynb) AND (HAS_FT:Y), which suggests that Jupyter/IPython notebooks are almost exclusively associated with open-friendly journals(?) Or perhaps this is a bias introduced by the legally enforced inability to do full-text searches on 'closed' journals, where jupyter/ipynb might be mentioned but cannot be found by EuropePMC because it is not allowed to index them.
R code to get bibliographic metadata on each of those 107 hits from EuropePMC:
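(The R snippet itself was linked rather than inlined and is not reproduced here. As a rough equivalent, here is a minimal Python sketch against the Europe PMC REST search endpoint; the JSON field names are as I understand the public API, and the output filename is just a placeholder.)

```python
# Minimal sketch: fetch bibliographic metadata for the (jupyter OR ipynb) hits
# from the Europe PMC REST API and write them to a CSV.
# Field names and the output filename are assumptions; adjust as needed.
import csv
import requests

URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"
params = {
    "query": "(jupyter OR ipynb) AND (HAS_FT:Y)",
    "format": "json",
    "pageSize": 200,  # comfortably more than the ~107 hits
}
results = requests.get(URL, params=params).json()["resultList"]["result"]

with open("jupyter_pmc_hits.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["pmcid", "title", "authors", "journal", "year"])
    for r in results:
        writer.writerow([
            r.get("pmcid", ""),
            r.get("title", ""),
            r.get("authorString", ""),
            r.get("journalTitle", ""),
            r.get("pubYear", ""),
        ])
```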
I've also made the resulting CSV available as an editable spreadsheet via Google Docs. Perhaps with this sheet we can assign who takes responsibility for which papers?
That's a great starting point — thanks!
+1 from me. Interested to contribute and to see the output.
We've taken Ross' spreadsheet and added some columns for
The "Code in problem cell" column documents the notebook code causing the first problem, and the "Problem" column gives more details. So far, basically none of the notebooks ran through: We normally stopped after the first such error and went on to the next notebook, but for one rather complex notebook, we tried to go through to the end, which we have not reached yet. |
I've also added a column for the PMC URL to reduce the fiddling with URLs.
I notified the Jupyter mailing list: https://groups.google.com/forum/#!topic/jupyter/6pQIarRmrsc .
Here's a write-up of our efforts: https://markwoodbridge.com/2017/03/05/jupyter-reproducible-science.html Many thanks to @Daniel-Mietchen for the original idea, and for all the help over the weekend!
@mrw34 Thanks - I'll go right into it.
I found one that actually ran through, albeit after a warning about an old kernel. To celebrate the event, I introduced color coding to the spreadsheet: red for cases where the run resulted in an error, green when it did not.
Here's a notebook shared only as a screenshot, from a paper about reproducibility. I've just added yellow to the spreadsheet for cases like this, where the notebook neither produced errors nor ran through, and where "n/a" (no notebook at all) does not apply.
There is a nice "Ten simple rules" series in PLOS Computational Biology: they already have Ten Simple Rules for Reproducible Computational Research and Ten Simple Rules for Cultivating Open Science and Collaborative R&D, as well as other somewhat related articles, but none of them seems to touch upon Jupyter notebooks.
Some comments relevant here are also in #41 (comment).
The above close was just part of the wrap-up of the doathon. I will keep working on it and document my progress over at Daniel-Mietchen/ideas#2 .
@mrw34 @Daniel-Mietchen excellent write-up! If licensing allows, could you upload somewhere all the .ipynb notebooks you found that were related to those 107 papers?
Mark's write-up is now up at https://markwoodbridge.com/2017/03/05/jupyter-reproducible-science.html .
There is a validator tool for Jupyter notebooks: https://github.com/jupyter/nbformat/blob/master/nbformat/validator.py
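(A minimal sketch of using that validator from Python; the notebook filename is a placeholder, and this only checks conformance to the notebook format, not whether the notebook runs.)

```python
# Minimal sketch: schema-validate a notebook file with nbformat.
# "example.ipynb" is a placeholder path; this checks format conformance only.
import nbformat

nb = nbformat.read("example.ipynb", as_version=4)
try:
    nbformat.validate(nb)
    print("Notebook conforms to the nbformat schema.")
except nbformat.ValidationError as err:
    print(f"Invalid notebook: {err}")
```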
I am thinking of submitting this to JupyterCon — submission deadline March 14. Anyone in?
@Daniel-Mietchen I'd be happy to help you prepare the abstract submission and do a bit more analysis, but I can't go to the meeting :) Does that count as 'in'?
That's "in enough" for my taste. I don't know whether I can go either, but the point is to reach out to the Jupyter community and to help do something about these issues, e.g. by refining the recommendations and perhaps offering some validation mechanism (think Schematron for XML).
There is a JupyterCon talk about citing Jupyter notebooks. I have contacted the speakers.
At the WikiCite 2017 hackathon today, we made some further progress in terms of making this analysis itself more reproducible — a Jupyter notebook that runs the Jupyter notebooks listed in our Google spreadsheet and spits out the first error message: http://paws-public.wmflabs.org/paws-public/995/WikiCite%20notebook%20validator.ipynb . @mpacer - yes, it makes use of nbformat.read(). We also looked at Jupyter notebooks cited from Wikipedia — notes at https://meta.wikimedia.org/wiki/WikiCite_2017/Jupyter_notebooks_on_Wikimedia_sites .
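(The linked PAWS notebook is the authoritative version; in case it goes stale, here is a rough sketch of the same "run it and report the first error" idea using nbconvert's ExecutePreprocessor. The filename and kernel name are placeholders.)

```python
# Rough sketch of the "run it and report the first error" approach.
# Path and kernel name are placeholders; the real analysis iterated over
# the notebooks listed in the Google spreadsheet.
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor, CellExecutionError

def first_error(path, kernel="python3", timeout=600):
    nb = nbformat.read(path, as_version=4)
    ep = ExecutePreprocessor(timeout=timeout, kernel_name=kernel)
    try:
        ep.preprocess(nb, {"metadata": {"path": "."}})
        return None  # ran through without errors
    except CellExecutionError as err:
        return str(err)  # message from the first failing cell

print(first_error("example.ipynb") or "Notebook ran through")
```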
Hello! I'm part of the team that's working on beta.mybinder.org and related stuff, and am extremely interested in the idea of a 'badge that represents that your code is reproducible, and has been reproduced by CI'. Funnily, I also built the PAWS stuff, which is awesome to find in completely unrelated contexts :D

The part of the stack I've been attacking right now is the 'how do we reproduce the environment that the analysis took place in', as part of the mybinder work. You can see the project used for that here: https://github.com/jupyter/repo2docker. It takes a git repository and converts it into a Docker image, using conventions that should be easy to use for most people (and does not require them to understand or use Docker unless they want to). It's what powers the building bits of mybinder :)

As part of the CI for that project, you can see that we also build and validate some external repositories that are popular! We just represent these as YAML files here: https://github.com/jupyter/repo2docker/tree/master/tests/external and have them auto-test on push so we make sure we can keep building them. This can be inverted too - in the repo's CI they can use repo2docker to make sure their changes don't break the build.

The part where we haven't made much progress yet is in actual validation. nbval mentioned here seems to be the one I like most - it integrates into pytest! We can possibly integrate repo2docker into pytest too, and use that to easily validate repos? Lots of possible avenues to work towards :)

One of the things I'd love to have is something like what https://www.ssllabs.com/ssltest/analyze.html?d=beta.mybinder.org does for HTTPS on websites - scores you on a bunch of factors, with clear ways of improving it. Doing something like that for git repos with notebooks would be great, and I believe we can do a fair amount of work towards it now.

I'll also be at JupyterCon giving a few talks, and would love to meet up if any of you are going to be there! /ccing @choldgraf who also does a lot of these things with me :)
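(For the nbval route mentioned above, a small sketch of driving it through pytest; it assumes pytest and nbval are installed, and the notebook filename is a placeholder.)

```python
# Sketch: run nbval's pytest plugin from Python (assumes pytest + nbval installed).
# --nbval-lax only requires that every cell executes without error;
# plain --nbval additionally compares outputs against those stored in the file.
import pytest

exit_code = pytest.main(["--nbval-lax", "example.ipynb"])
print("reproduced" if exit_code == 0 else "failed")
```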
Hi! @Daniel-Mietchen pointed me at this thread/project yesterday, and it seems quite interesting. I wonder if it makes sense to think about short-term and long-term reproducibility for notebooks?

By short term, I mean that the notebook might depend on a Python package that has to be installed, which could be done by pip before running the notebook, and this step could perhaps be automated by a notebook launcher. And by long term I mean that at some point the dependency will not work, pip will be replaced by something new, etc., and the only way to solve this is to capture the full environment. This seems similar to what @yuvipanda describes, and what @TanuMalik is trying to do a bit differently in https://arxiv.org/abs/1707.05731 (though I don't think her code is available). And long term here might still have OS requirements, so maybe I really mean medium term.

Also, I thought I would cc some other people who I think will be interested in this topic, and could perhaps point to other work done in the context of making notebooks reproducible: @fperez @labarba @jennybc @katyhuff
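(A crude illustration of the "short term" end of that spectrum: snapshotting the package versions a notebook actually ran against, e.g. as a final cell, so that a pip-installable environment can at least be approximated later. This is only a sketch; it does nothing about OS-level or long-term dependencies, and the output filename is a placeholder.)

```python
# Crude illustration of "short-term" environment capture: record the exact
# package versions present when the notebook ran, e.g. as a final notebook cell.
# Does not address OS-level dependencies or long-term archival.
from importlib.metadata import distributions

with open("environment-snapshot.txt", "w") as fh:
    for dist in sorted(distributions(), key=lambda d: d.metadata["Name"].lower()):
        fh.write(f"{dist.metadata['Name']}=={dist.version}\n")
```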
@khinsen - sorry I lost track of this thread back in March... Yes, I think @Chroxvi's reproducibility can be carved out of the Magni package and I actually hope to do that sometime this fall. I hope to do that along with @ppeder08's validation (https://github.com/SIP-AAU/Magni/tree/master/magni/utils/validation), which can be used for in-/output-validation of, for example, somewhat more abstract "data types" than Python's built-in types.
Hi Dan,

Our code is available from https://bitbucket.org/geotrust/sciunit-cli and current documentation is available from http://geotrusthub.org/geotrust_html/GeoTrust.html .

Yes, ours is restricted to Linux OS for now. We have a bare-bones version for Mac OS X that is not in production use.

We are currently working on enabling reproducibility for workflows and Jupyter notebooks through application virtualization. We need some more work to capture standard I/O.

Tanu
Thanks for the additional comments. I have proposed to work on this further during the Wikimania hackathon: https://phabricator.wikimedia.org/T172848 .
I got sick during the hackathon and haven't fully recovered, but JupyterCon is just days away, so I have started to distill the discussion here into an outline for the talk next Friday:
After chatting with @Daniel-Mietchen about this idea, we've implemented a web app to autorun notebooks mentioned in a paper: just add a list of papers' URLs. It is a pre-pre-pre-alpha version done for fun and in the name of reproducibility. Please report all issues and suggest improvements. The current setup might require additional horsepower to consume bigger datasets. We also plan to implement whole-repo autodeployment; at the moment, too many failures are due to the lack of this feature. List of current issues: https://github.com/sciAI/exe/blob/master/Executability%20issues.md Validator code: https://github.com/sciAI/exe All the credit goes to the sci.AI team and especially @AlexanderPashuk. Alex, thank you for the effort and the fights with library compatibility.
@yuvipanda, nice job. High five! Dan mentioned you are at the conference now, right? If you are interested, we can combine efforts.
We're both at the conference now and will be at the hack sessions tomorrow!
I was traveling during the hackathon - had heard about it too late. In any case, I hope we can combine efforts with the Binder team. For those interested, the talk sits at
Seen at JupyterCon:
Apologies for reviving a closed issue. I am also interested in the reproducibility badge (which is not necessarily the same as the (Also cc'ing @yuvipanda as he has been involved in
@cgpu Hi! At the time
Hi @mwoodbri, thank you for the prompt response and the background information! I am really fond of the idea of having a binary
Thanks once again!
@cgpu Here's a version converted to use GitHub Actions: https://github.com/mwoodbri/jupyter-ci
@mwoodbri thank you! Time for me to test now :) @Daniel-Mietchen thank you for providing the reproducibility cafe space for further discussions 2 years after the start of the initiative. Feel free to close this.
Hello everyone. It has been a while since the last post in this thread, but I am happy to report that there is now a preprint that reports on a reproducibility analysis of the Jupyter notebooks associated with publications available via PubMed Central: Computational reproducibility of Jupyter notebooks from biomedical publications — joint work with @Sheeba-Samuel . Here is the abstract:
For data and code, see https://doi.org/10.5281/zenodo.6802158 . I'll keep this thread open until the paper is formally published, and invite your comments in the meantime. Extra pings to some of you who have contributed to this thread before: @mwoodbri @khinsen @yuvipanda @rossmounce @tompollard @RomanGurinovich @choldgraf @JosephMcArthur .
Congrats @Daniel-Mietchen, excellent! I look forward to reading the paper, and will be sure to include it in my reading list for next year's reproducible research course I teach at Berkeley! cc @facusapienza21.
Lots of Twitter interest in this preprint. Might be good to dig a bit deeper into all those unknown dependency resolution issues - or perhaps just feature a couple of examples as a panel. There were some comments regarding Docker.
Thanks @Daniel-Mietchen for the update! The preprint is on my e-book reader.
Holy shit this is AWESOME!
We're still working on the revision of the paper but here are the slides of our JupyterCon 2023 talk: https://doi.org/10.5281/zenodo.7854503 . Slide 23 — How you can get involved — asks for community input along a number of dimensions, which I am copying here.
We recently submitted the revision of the paper — see https://arxiv.org/abs/2308.07333 for the latest preprint, which describes a complete re-run of our pipeline and provides some more contextualization. In the discussion, we also briefly touch upon scaling issues with such reproducibility studies, mentioning ReScience (pinging @khinsen) as an example. We are keen on putting this dataset to use, e.g. in educational settings (cc @fperez). As always, comments etc. are welcome.
Dear all, The paper on this (with @Sheeba-Samuel) was published yesterday: Computational reproducibility of Jupyter notebooks from biomedical publications, https://doi.org/10.1093/gigascience/giad113 . We remain interested in
@mwoodbri @fperez @khinsen @yuvipanda @rossmounce @tompollard @RomanGurinovich @choldgraf @JosephMcArthur . With that, I am closing this ticket after nearly 7 years - feel free to open up new ones in relevant places to discuss potential follow-ups.
Extremely impressive feat! Well done @Sheeba-Samuel & @Daniel-Mietchen !
Jupyter notebooks are a popular vehicle these days to share data science workflows. To get an idea of best practices in this regard, it would be good to analyze a good number of them in terms of their reproducibility and other aspects of usability (e.g. documentation, ease of reuse).
A search in PubMed Central (PMC) reveals the following results:
With currently just 102 hits, a systematic reanalysis seems entirely doable and could perhaps itself be documented by way of reproducible notebooks that might eventually end up being mentioned in PMC.
A good starting point here could be An open RNA-Seq data analysis pipeline tutorial with an example of reprocessing data from a recent Zika virus study, for which both a Jupyter notebook and a Docker image are available.
I plan to give a lightning talk on this. Some background is in this recent news piece.