Add --clean option to --to notebook to strip all metadata and output? #637
Comments
Something like this already exists as a separate tool, called nbstripout: Personally, I'd say that having it as a separate tool is neater than adding another option to nbconvert. |
Thanks, I'm aware of nbstripout - it's one of the options mentioned in the article I linked to. The pypi page says that "[nbstripout] does mostly the same thing as the Clear All Output command in the notebook UI". I'm really looking for more than that. (Actually, it seems that nbstripout does remove more than the output, but it does not remove all metadata. I don't know if that's because it can't keep up with jupyter notebook changes, or because the project has a different intention such as being about stripping output?)
The documentation for
So from my point of view (as a naive user), I guess I don't see why it would be neater to use an external tool than to have a |
It would fit OK, but I'd rather encourage an ecosystem of small tools which can be pieced together rather than one giant tool with loads of options. It's quite easy to modify notebooks programmatically, especially if you use the |
I'm frustrated as a long-time user of notebooks by it being "quite easy to modify notebooks programmatically", and there being a lot of "little tools that do exactly what you want". I've seen dozens of such tools come and go over the years, each one doing precisely what that author wanted to do, at that time, and then getting stale and unmaintained and unused. I've written two or three of them myself, in different projects for different employers in different years, as have at least a half-dozen people I know personally, and yet we are still here today asking this question. Each one of those tools fails to strip out some new category of metadata added later in some Jupyter version or by some Jupyter extension. I would think that if there were one single main tool, always available, stripping out everything that was not put there by the user, then people could use that consistently and maintain that to strip out any new metadata that crops up. There may be people who want lots of configuration, and want detailed control, but it seems like they can be the ones to write specialized little tools and maintain them. Instead what seems much more crucial to offer directly from the Jupyter project is something simple and reliable that we can always ask our users to do, that we can consistently use ourselves, and that simply makes a minimal version of the notebook consisting of what was actually typed in and nothing else -- no output, no widget state, no pygments version, nothing but the notebook format version and the actual contents of the code and markdown cells. That's what I want to archive as the source code for a notebook, and I don't think there is actually all that wide a diversity of opinions about this. |
If what you want to do is literally strip all metadata, that's very easy to do reliably. The complication and inconsistency is because lots of people want to strip out some metadata while leaving other bits. |
Right. So can Jupyter offer a supported, maintained way to strip all metadata that's not needed for re-running the notebook? Those lots of people can indeed use lots of tools to handle their specific needs, but I strongly believe there is a core, simple need shared across a wide range of projects and people for simply stripping everything that doesn't constitute "source code". Personally, I'd even want a button for this in the interface. |
@jbednar What metadata do you consider necessary to run the notebook? Partially this will depend on what contexts do you consider "count" as far as running a notebook (e.g., the live notebook server, nbconvert --execute, nbsphinx, &c.). The different contexts require different pieces of information which is why people will disagree about what should be left in. |
I don't have any opinion about nbconvert or nbsphinx, which both seem relevant almost exclusively to rendered, non-cleared notebooks, and here we're talking about cleared notebooks suitable for archiving as source code. Basically, when I open a new notebook with Jupyter 4.3.0, a file is created that briefly consists only of (1):
That's nice and clean -- an empty list of cells, plus some numbers declaring the notebook format, which is clearly important to record. It's beautiful source code that I am very happy to archive, preserve, and diff. Let's say I add a single source code cell containing "2+3". As far as I the user am concerned, I've added three characters of data, plus a few more to record the fact that I opened a new code cell. So I would hope to see a file like (2):
I.e., the original skeleton plus the information that there is one more cell, that it is a code cell, and that it contains the three characters that I typed. That's what I deliberately added as a user. Unfortunately, this minimal version is not considered legal by Jupyter, which requires a bunch of metadata to be present whether or not it captures anything the user did (3):
What's worse is that this minimal legal version isn't what's stored by Jupyter, even after "clear output". Instead we get (4):
Not only does version (4) completely obscure what I actually typed as source code, it also varies across jupyter versions, varies depending on what extensions are installed, and varies in ways that seem very unpredictable as notebooks are run, used, and cleared, with the result that it's nearly impossible to detect what the user actually typed. Even when only one character is changed, it's often impossible to detect that the character changed in a diff. It's completely diff-incompatible, and seemingly impossible to treat as source code, even though just about the only thing the users are actually supplying is source code. So, version (4) is what we get now, but I want version (2), both for "clear output" in Jupyter and for |
Oops, I just noticed that version 4 I had pasted above didn't actually include the source code "2+3". I guess that just underscores the point -- The actual change I meant to make was completely obscured by all the extra metadata. Edited to fix that above. |
A few things:
jq --indent 1 \
'
(.cells[] | select(has("outputs")) | .outputs) = []
| (.cells[] | select(has("execution_count")) | .execution_count) = null
| .metadata = {"language_info": {"name":"python", "pygments_lexer": "ipython3"}}
| .cells[].metadata = {}
' notebook.ipynb That specifically includes information relevant to
What you seem to be asking for would be easier to implement (and faster to run) by modifying the script you gave as your example jq --indent 1 \
'
(.cells[] | select(has("outputs")) | .outputs) = []
| (.cells[] | select(has("execution_count")) | .execution_count) = null
| .metadata = {}
| .cells[].metadata = {}
' notebook.ipynb Crucially, I don't think nbconvert is the right tool for this job. You don't want something that is carefully attending to the content of the notebook, and nbconvert incurs a lot of overhead costs by trying to carefully attend to the content of the notebook. A quick, context-free script like above (for command line use, and a similar JSON manipulation script for in-browser use) would seem to be much closer to the use case you're aiming at than having this built into nbconvert. However, having something that allows you to clean some kinds of metadata with nbconvert in an intelligent, configurable way would allow you to do something like what you're asking by setting the configuration in a certain manner. I'd imagine |
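For readers without jq, the second filter above can be approximated in Python with only the standard library. This is a minimal sketch, not an official Jupyter tool: the function name `strip_notebook` and the sample notebook (including the "testenv" kernel name borrowed from later in the thread) are assumptions made for illustration.

```python
import json


def strip_notebook(nb: dict) -> dict:
    """Return a copy of a notebook dict with outputs and all metadata removed.

    Mirrors the jq filter: empty outputs, null execution counts,
    empty notebook-level metadata, empty per-cell metadata.
    """
    stripped = json.loads(json.dumps(nb))  # cheap deep copy for JSON-only data
    stripped["metadata"] = {}
    for cell in stripped.get("cells", []):
        cell["metadata"] = {}
        if "outputs" in cell:
            cell["outputs"] = []
        if "execution_count" in cell:
            cell["execution_count"] = None
    return stripped


# Hypothetical notebook with one executed code cell.
example = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {"scrolled": True},
            "outputs": [
                {
                    "output_type": "execute_result",
                    "data": {"text/plain": ["5"]},
                    "execution_count": 3,
                    "metadata": {},
                }
            ],
            "source": ["2+3"],
        }
    ],
    "metadata": {"kernelspec": {"name": "testenv", "language": "python"}},
    "nbformat": 4,
    "nbformat_minor": 1,
}

clean = strip_notebook(example)
print(json.dumps(clean, indent=1))
```

Like the jq version, this is context-free: it never inspects what the metadata means, so it keeps working when new metadata keys appear.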
Just to be clear, that was Chris's request, not mine! I'm just dogpiling. :-)
Any notebook that we store in a github repo as source code ends up getting edited by lots of different people who have different Python versions, different Anaconda environments, etc., each with their own kernel. As you say, the user chooses a kernel, but that's because they must choose a kernel in order to be able to run it to test anything. And storing that information with the output is very important as a record of what was run. But it doesn't change what the source code was. I could imagine wanting to declare and persist that it's Python2 or Python3 or R, but really only if the user declares that, not just because it was auto-filled in by running it. And I definitely do not want to see "acon2", "testenv", or any of the other huge variety of kernels that we run across, and that change every time someone wants to edit any of our notebooks. I'm feeling tired just thinking about all the work we have to do every time anyone wants to edit anything! Anyway, I don't have any strong opinion about whether nbconvert is the right place for this functionality, but I do strongly believe that there should be a simple way to strip out everything that isn't needed to capture what the user put in. |
Sorry I wasn't clear in my original request by giving a distracting example! What I really want is a standard way for people to save a notebook containing only the input cell source and type. A save called "inputs only" or "source only" or something like that, I guess. Nothing user- or system-specific would be included, so nothing about which kernel they used to run the notebook, or which cells they ran in what order, or how they viewed cells in the notebook, etc. I would really want this "source only" save to be standardized in some way (I mean, to be defined by jupyter - not to replace the default type of save) because although I have written my own tools to strip everything but inputs and whatever is necessary to allow jupyter to open notebooks over the years, I have to explain to every contributor why running some weird extra tool is necessary and how to install it, plus I have to keep the tool up to date with changes in jupyter.
I made this request on nbconvert because it's part of jupyter, thus is standard and everyone has it (I think?), plus it already allows notebook to notebook conversion, so I thought an option on that might fit easily and use code already in nbconvert. I would definitely love a 'source only' save button to be in everybody's jupyter notebook viewers, but I figured that would be a more controversial and difficult request (e.g. because of potentially confusing users).
Yes, sorry I wasn't clear there. I regret using that example script because it was distracting; I see now that the author added something to help a particular tool, which I hadn't noticed before. I want only the input source code (or anything that is a property of the input source code itself, like the type of source code it is).
I am ok with that. I can always select the kernel I want to use in my viewer (or specify a default for my viewer outside of any particular notebook), and that choice will subsequently be saved into the notebook via the 'normal' save button. If I do a "source only" save, I would expect that information to be removed and to have to specify it again. Or, if using nbconvert to execute a "source only" saved notebook, I can specify the kernel on the commandline. (I'm not sure now, but at least in the past it was usually necessary to specify the kernel option anyway, because the kernel saved in the notebook by someone else was unlikely to exist on my system.) If there were standard metadata that said the source is python or julia or R or whatever i.e. was a property of the input source code itself, I would be ok with that remaining. It would not change just because someone opened the notebook and was using a different kernel from me - it would only change if there is a meaningful change to the source code. |
Exactly! I woke up thinking about precisely that point (sad, I know). A given notebook is either in R, Python2, Python3, or compatible with both 2 and 3, for the most part. It would be great if the notebook could declare which of those it is, as source code. Separately, it's great that the rendered notebook (with outputs) records which kernel was actually used when running. But those are entirely separate things -- declaring which language the notebook is written in, and recording which kernel was used in a run. A "source only" version of a notebook should be able to preserve any explicit declaration the user made about the language being used, without changing simply because the user actually ran the code and then cleared the cells. Right now those two very different things (declaring and recording) are being conflated, which is confusing every single time someone wants to edit a notebook in an environment even slightly different from the one in which it was created. At a more fundamental level, I think it is incontrovertible that:
|
As part of the git repo, you should set up clean and smudge filters that remove the information on commit and re-add it on checkout. Yes, someone should write a blog post that shows how to do that. I have contributed to several repositories (not Python) that do similar things, and it is (usually) part of the setup of the project – equivalent of Now I'm not saying that we couldn't make that easier, I'm just saying that right now, if that's your pain point, it can be fixed. The other possibility – which I should work on – is a GitHub hook that can fix user-submitted pull requests. If "Allow edits from maintainers" is checked, it should not be too hard to automatically amend a user PR.
The issue is that the notebook server application does need some of this information. Unfortunately, so many of our users are used to "a web application autosaves", even without realizing it, that we have to autosave and embed this information. There are also a number of cases where this information is expected by default when users are sharing notebooks. The notion of "Save as " with a different format that would be stripped is complex for many users. It should get better with JupyterLab, which allows attaching kernels to any document type, but it's not there yet.
"Source only" means a lot of different things in different contexts. Is that input only for you? If that's the case, then ipynb might not be the best format to store things. We haven't yet developed something that could work, but a notebook-like interface to edit a format like Rmarkdown would be possible. There are only 24h/day, unfortunately. But we'll come to it.
It's funny how we used to have only the language a couple of years ago – we even had a language per cell, but that was a footgun – and evolved to use kernelspec per user request. Astonishingly, "only the language" is not enough to reproduce results for a wide majority of our users. We got requests for the language version as well; then some people run things on PyPy and not CPython, so they want an extra field. In the end there are always some people who want more, and some people who want less. Let's not speak about the people who actually ask us to also store the hardware spec in the notebook, because they need to know ahead of time whether a notebook requires a GPU or not. So it's a tough spot, and we are working on making things easier, but we are between a rock and a hard place.
I think this is not a question for nbconvert; I believe the notebook format in itself is not suitable here. That being said, it does not mean that the notebook interface cannot be used for a newly designed format. There have been a number of explorations (like nbexplode) that make that way easier. Nbdiff is working as well. Now, these things are far from having the quality necessary to be in core, and it would be nice to have them cleaned up a bit before inclusion.
In general we are trying to make the core contain as few things as possible. Being in the core does not mean that something will be better maintained. We do try to provide as many hooks as possible to make the core extensible. Technically, you should not need to commit something to core nbconvert. You should be able to have your own package with an entry point. Once something works, is regularly used, and we get help with maintenance, we can decide to move it into core. One of the issues will be reaching consensus on what the minimal necessary things are. I've been involved for ~5 years in designing the notebook format, and usually people ask us for more rather than less. It also seems that the "cleaning" part is not directly a feature request for nbconvert, but a feature request to make collaboration easier. We may want to take a different approach for that. |
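As a rough illustration of the clean-filter idea mentioned in this comment: git runs a clean filter as a command that receives the working-tree file on standard input and writes the version to be committed on standard output. The sketch below shows the core of such a filter in Python. The names `clean_filter` and `run_filter` and the sample notebook are hypothetical, and the choice of which fields to strip is an assumption, not a Jupyter-endorsed definition of "clean".

```python
import io
import json


def clean_filter(stream_in, stream_out) -> None:
    """Read a notebook from stream_in and write a stripped copy to stream_out."""
    nb = json.load(stream_in)
    nb["metadata"] = {}
    for cell in nb.get("cells", []):
        cell["metadata"] = {}
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    # Stable key order and indentation keep diffs small between commits.
    json.dump(nb, stream_out, indent=1, sort_keys=True, ensure_ascii=False)
    stream_out.write("\n")


def run_filter(text: str) -> str:
    """Helper for demonstration: apply the filter to a string."""
    out = io.StringIO()
    clean_filter(io.StringIO(text), out)
    return out.getvalue()


# Hypothetical notebook as saved after running one cell.
dirty_text = json.dumps(
    {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": 7,
                "metadata": {"collapsed": False},
                "outputs": [
                    {"output_type": "stream", "name": "stdout", "text": ["5\n"]}
                ],
                "source": ["print(2+3)"],
            }
        ],
        "metadata": {"kernelspec": {"name": "acon2"}},
        "nbformat": 4,
        "nbformat_minor": 1,
    }
)

once = run_filter(dirty_text)
twice = run_filter(once)
print(once)
```

In a real setup the script would call `clean_filter(sys.stdin, sys.stdout)` and be registered with something like `git config filter.nbclean.clean "python nbclean.py"` plus a `.gitattributes` line `*.ipynb filter=nbclean`; the filter name `nbclean` and the script name are placeholders. A clean filter should be idempotent, which is why the example applies it twice.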
What I'm asking for would be a very good candidate for a git filter. What is currently available is not a good candidate for a git filter, because all of the available scripts like that break all of the time, whenever new metadata is added somewhere. Some of the people commenting or mentioned above have indeed set up these filters, but have then backed away because they turn out to be a lot of trouble in practice, over time. A fully supported, maintained way to make a truly clean notebook would be a great step in this direction. Being able to apply this on demand to a user contribution PR would be really useful too, but is more than I am currently dreaming of!
The more you autosave and embed useful stuff like that, the more crucial it is that there be a supported, obvious way to clear out this information. They go hand in hand -- if you are taking it on yourself to doctor and augment what the user provides, then there should be some clear way for the user to revert those changes when it's not appropriate to preserve them. If you weren't adding the extra metadata, then a clean operation wouldn't be needed, but it is.
I'm not sure what you mean by that, but what I mean is just that we want to archive and capture the actual textual contributions from someone explicitly editing a notebook, precisely in the same way as a text editor captures the explicit textual contributions from someone editing a text file. The text editor doesn't record the sequence of menu items the user selects, or the way the mouse moved, it just saves the actual text. The same is needed here -- some way to save just the non-auto-generated stuff, such that it can be re-run as a notebook but which one can always go back to. People use notebooks to write source code, and we need a way to archive and maintain that source code without having it be drowned out by auto-generated fluff.
Sure. I'm happy for Jupyter to cram the rendered notebooks with as much metadata as you like, recording everything. Great! The more the merrier. But that doesn't change the need for there to be a way to get back to the underlying, user-generated content in a cleared notebook. If you ask a selection of serious developers who use notebooks and actually care about their content, with multiple people editing it, storing it, and maintaining it over time, I think you will hear very similar woes. E.g. see the blog post above and any number of people who have written such stripping and cleaning utilities, all of whom are only the tip of the iceberg of people affected (since most people just curse and move on). You'll also find a lot of people who decide that notebooks are entirely inappropriate for content that is maintained over time, but I think that's only for these incidental reasons that can be fixed, not inherent to the notebook format.
Oh no! Please don't treat this as a request for some additional, separate format. While I do happen to hate JSON, and would much rather notebooks be some human-readable/editable format like YAML, trying to go down that path seems like an infinite distraction that won't ever solve current problems for the very large and ever-increasing installed userbase. The current JSON format is actually fairly reasonable if the extra fluff is just eliminated (see version 2 above). And it's very portable, with people everywhere able to work with it already, unlike variants that have been created that have to first be converted into JSON. We just need a way to get to the currently possible minimal (or at least predictable and repeatable) specification reliably.
I have no opinion about whether nbconvert is the right way to create a "source only" version. I only plead for (a) there to be a supported, proactively maintained command-line tool for generating such a version given a saved file with output and metadata, and if possible (b) to be able to generate such a file from within the interface itself, which will often but not always be more convenient than a command-line tool. (a) is far more important than (b), since a command-line tool is automatable, but both are helpful. |
I am involved in developing a course that will make extensive use of Jupyter notebooks, both for making study material available to students and for collecting work from students. The study material is developed in a (Git) repository. I have been surprised at the amount of hard-to-control clutter generated inside notebooks as people modify them. My search for a way to 'normalize' notebooks has been unsatisfactory so far. That is why we, reluctantly, started developing our own tool set. Out-of-the-box support for this would be appreciated and would help acceptance of Jupyter notebooks as long-lasting carriers of information. Notebook (top-level) metadata, such as The JSON schemas in the To help inspect notebooks quickly, we have
For cleaning up we have
Options to remove all non-standard (notebook/cell) metadata keys are coming. |
I'm pretty new to Jupyter and surprised that this is not a solved problem. It's certainly already a headache for me even on a personal project using git, as it makes my git diffs impossible to navigate. My option seems to be to gitignore the notebook and try to remember to check it in explicitly on occasion. From my point of view, source code and metadata like output and formatting should be in separate, user-specific locations that can be gitignored and otherwise not shared. Any of those that a user wants included in the project files should be explicitly imported into the project file. |
nbstripout can be configured as a git hook. You can also integrate nbdime with git for nicer diff/merge. |
Also the |
Hey, I just created a pull request (#805) that creates a preprocess that clears all the metadata present on the code cells. I believe that this preprocess, together with ClearOutput, does almost the same thing as nbstripout. |
Hi, I found this issue and wanted to say that I agree with @jbednar in #637 (comment)
@takluyver - you wrote above in #637 (comment) :
As a user, it's really hard to know which package to use (apparently https://github.com/kynan/nbstripout would be a good choice here) and to have to install it separately, and then my experience over the past years is also like what @jbednar said: these little packages often are not well-maintained, and one has to find and learn a new one after a year or two. So +1 to put more features into the core packages like In the project I work on @Bultako wrote this 10-line function and expose it via the command line to make stripping notebooks easy for our contributors working on notebooks. Basically I'd prefer if we didn't have that code and it were built-in in |
Just throwing this out as an option -- although certainly a more complicated one to implement. In a perfect world it'd be nice if jupyter notebooks were broken down into smaller pieces -- the input file as one entity (*), the various bits of metadata stored elsewhere, and the output stored elsewhere. (*) You might even want program code split out separately -- imagine your python code parts of a notebook just existing as python files that all your normal python tooling could operate on. The root notebook file might then just be a set of references to the other files that are required for Jupyter to assemble something that is managed in the presentation layer as a single coherent document. Then you could decide which parts you want in your version control. Certainly the markdown and python input is critical. Maybe your metadata state is important, but the things that are generated as a result of executing other things stored in the repository isn't. |
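The split described in this comment can be sketched in a few lines of Python. This is only an illustration of the idea, not how Jupyter or any existing tool stores notebooks: the function names `split_notebook` and `merge_notebook` are hypothetical, and keying the saved outputs by cell position is a simplifying assumption that would break if cells were reordered between split and merge.

```python
import copy


def split_notebook(nb: dict):
    """Separate a notebook into a source-only notebook and its execution results."""
    source = copy.deepcopy(nb)
    results = {}
    for index, cell in enumerate(source.get("cells", [])):
        if cell.get("cell_type") == "code":
            # Remember what execution produced, then blank it in the source copy.
            results[index] = {
                "outputs": cell.get("outputs", []),
                "execution_count": cell.get("execution_count"),
            }
            cell["outputs"] = []
            cell["execution_count"] = None
    return source, results


def merge_notebook(source: dict, results: dict) -> dict:
    """Reassemble a full notebook from the two pieces."""
    merged = copy.deepcopy(source)
    for index, saved in results.items():
        merged["cells"][index].update(copy.deepcopy(saved))
    return merged


# Hypothetical notebook with one markdown cell and one executed code cell.
original = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["5\n"]}
            ],
            "source": ["print(2+3)"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 1,
}

source_only, results = split_notebook(original)
roundtrip = merge_notebook(source_only, results)
```

Only `source_only` would go into version control under this scheme; `results` could live in an ignored file next to it.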
I landed here by chance (I was actually searching for the new
That is exactly the idea behind Jupytext: the code and markdown cells are stored in a Markdown file (or a script). Only the selected metadata is stored in that file. This Markdown file comes as an addition to the original notebook, in which the full metadata and outputs are preserved. Both files can be edited. When the notebook is refreshed in Jupyter, Jupytext reads the input cells from the text file, while the outputs and the filtered metadata are loaded from the ipynb file.
With Jupytext you can add
Well, sorry... Jupytext does implement a few separate formats for Jupyter notebooks, as e.g. GitHub/VScode compliant Markdown. But maybe you will appreciate these alternative formats - especially if you need to manually edit a notebook... or merge contributions... You'll tell me! |
Jupytext is great, but it's a separate thing, as already argued above, because it is not the single, shared, fully interchangeable format. Supporting a clean form of .ipynb benefits every possible tool and usage of Jupyter, because anything Jupyter-related can use .ipynb, but only some use cases are addressed by a non-ipynb format. So jupytext is orthogonal to this request; it being available is great, but doesn't mean the requested functionality is needed any less. |
Just to clarify since I wasn't involved in these original threads much. Is there anything beyond |
@MSeal sorry I didn't see your response until someone pointed it out to me now. Thanks very much for following up! I ran:
The cell outputs have been removed, but the "user-specific" notebook metadata is still there (e.g. my quite personal kernel name,
The nbconvert version and the notebooks are below the line, at the end of this post. Incidentally, the first few times I tried I had a case mistake (
nb.ipynb:
nbclean.ipynb:
Trying to use
|
@mwouts I use jupytext a lot, and I wish everyone else used it, but they don't :( ("they" being users of notebooks I maintain, maintainers of notebooks I use, github, vscode, etc etc...) |
Yes sorry about that -- I don't particularly like how the config loads commands options either but the backlog of things to fix for 6.0 includes fixing that option at least (PRs to help on some of those 6.0 tasks would be really appreciated if others have time they can contribute).
So the metadata clearing does only clear cell metadata. If you cleared the notebook level kernelspec metadata the notebook would not be valid and unable to load in UIs. It has to have https://github.com/jupyter/nbformat/blob/master/nbformat/v4/nbformat.v4.4.schema.json#L8 defined to be valid so clearing it generically doesn't quite do the trick. I'd be ok with adding additional options to the |
@MSeal, I have not found that to be true in practice. See my "version 3" example above, which does not specify the kernelspec. From at least classic Jupyter notebook 4.3.0 (latest when this issue was filed) up to my current installation of notebook 6.0.3, the UI starts and runs just fine without a kernelspec or any warnings. For JupyterLab 2.1.4, the user is presented with a modal requestor: After selecting a kernel, again there are no errors or warnings. So I don't currently see any problem with clearing the output down to "version 3", and in fact that's what we normally do in practice using our own separate tools. As discussed above, it would be great if the language (Python 2, 3, R, etc.) could be declared separately from recording the kernel name, to avoid having to guess Python 3, but in the meantime the lack of a reliable and supported way for ordinary users to clear all the output and non-user-derived metadata is still a serious issue. |
Ahh I had deleted too much in my test. I stand corrected that the metadata can be blank, even though nested fields inside it are required if present, metadata itself is not explicitly required.
As I stated, I'm fine doing ☝️. @SylvainCorlay @t-makaro @maartenbreddels since you all have been contributing the most for the 6.0 release, do you think we should change the default behavior for this preprocessor or add a new option to preserve backwards compatibility? |
Really, the issue about the metadata field is that it is not split into
With such a split, we would have an easy means to clean up the notebook metadata. |
Agreed! In fact I do want to preserve the slide-related metadata (for RISE, etc.), precisely for that reason; it is a user declaration about the contents of this cell, and not a recording of incidental state from notebook execution. It's actually been a major pain for me that the slide-related metadata gets cleared by our own custom clearing scripts, but I didn't want to bring that up, to avoid complicating things further. Still, I'd be wary of trying to introduce a change in the spec to make such a distinction now, though, because it would result in different notebook formats that would be painful for users. In the absence of such an incompatible structural change to the metadata format, it seems like the alternative is for the tools to maintain a list of which bits of metadata represent user input, and clear the rest. |
Yeah, metadata isn't separated into output and input in the schema, and some fields are even sometimes inputs and sometimes outputs of particular processing patterns. One thing we could consider is clearing all non-spec'd metadata (anything that doesn't have a schema for the given key in the format)? |
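The "maintain a list of user-declared metadata and clear the rest" approach discussed in the last two comments could look roughly like the sketch below. The allow-list contents are an assumption for illustration (here only the `slideshow` cell key used by slide tools such as RISE), and `keep_user_metadata` is a hypothetical name, not an nbconvert or nbformat function.

```python
import copy

# Assumed allow-lists: keys treated as user declarations rather than recorded state.
USER_CELL_KEYS = frozenset({"slideshow"})
USER_NOTEBOOK_KEYS = frozenset()


def keep_user_metadata(nb, cell_keys=USER_CELL_KEYS, notebook_keys=USER_NOTEBOOK_KEYS):
    """Return a copy of nb whose metadata contains only allow-listed keys."""
    result = copy.deepcopy(nb)
    result["metadata"] = {
        key: value
        for key, value in result.get("metadata", {}).items()
        if key in notebook_keys
    }
    for cell in result.get("cells", []):
        cell["metadata"] = {
            key: value
            for key, value in cell.get("metadata", {}).items()
            if key in cell_keys
        }
    return result


# Hypothetical notebook mixing a user declaration with incidental state.
noisy = {
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {"slideshow": {"slide_type": "slide"}, "scrolled": True},
            "source": ["# Intro"],
        }
    ],
    "metadata": {"kernelspec": {"name": "testenv"}},
    "nbformat": 4,
    "nbformat_minor": 1,
}

filtered = keep_user_metadata(noisy)
print(filtered["cells"][0]["metadata"])
```

An allow-list fails safe when new metadata keys appear: anything unknown is removed, which addresses the complaint earlier in the thread about stripping tools going stale.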
#1314 merged and is going out in the 6.0 release this week. It provides more granular control of metadata removal. Specifically, these fields now give better control over metadata removal:
With this I am going to close the issue, as I think the required asks are now met with the better controls in place. |
On projects where multiple people contribute to "source code only" notebooks stored in revision control, system-specific or user-specific metadata in the notebooks are often unwanted sources of diffs and conflicts. I would love to be able to tell contributors how to remove such metadata before submitting changes, without them having to download and configure external software/scripts. Apart from the extra steps involved in downloading, installing, and configuring external scripts, such scripts may not be up to date with the changing notebook format.
I think http://timstaley.co.uk/posts/making-git-and-jupyter-notebooks-play-nice/ describes my problem in more detail (although that article also covers git integration, which I'm not looking for here). I understand that many users of notebooks do want to keep all the metadata and output, but I guess where the notebooks are used more like e.g. python source code (which doesn't normally specify the version of python, or store any of the output of the program, or any metadata about the system that last opened the file), the metadata is always annoying. Right now we have to explain to people how to remove widget metadata, scroll state, kernel spec, etc, or we have to look at diffs on github that include all that stuff and deal with conflicts about things that don't matter to us.
My first question is, would it be an acceptable feature for --to notebook to offer a --clean option to remove everything other than the inputs and the minimum required metadata? Something that does the equivalent of the following, from the article above:
My second question is, how much work would it be to implement such a feature (if that's easy to answer)?
Thanks!