
Add --clean option to --to notebook to strip all metadata and output? #637

Closed
ceball opened this issue Aug 2, 2017 · 34 comments

@ceball

ceball commented Aug 2, 2017

On projects where multiple people contribute to "source code only" notebooks stored in revision control, system-specific or user-specific metadata in the notebooks are often unwanted sources of diffs and conflicts. I would love to be able to tell contributors how to remove such metadata before submitting changes, without them having to download and configure external software/scripts. Apart from the extra steps involved in downloading, installing, and configuring external scripts, such scripts may not be up to date with the changing notebook format.

I think http://timstaley.co.uk/posts/making-git-and-jupyter-notebooks-play-nice/ describes my problem in more detail (although that article also covers git integration, which I'm not looking for here). I understand that many users of notebooks do want to keep all the metadata and output, but I guess where the notebooks are used more like e.g. python source code (which doesn't normally specify the version of python, or store any of the output of the program, or any metadata about the system that last opened the file), the metadata is always annoying. Right now we have to explain to people how to remove widget metadata, scroll state, kernel spec, etc, or we have to look at diffs on github that include all that stuff and deal with conflicts about things that don't matter to us.

My first question is, would it be an acceptable feature for --to notebook to offer a --clean option to remove everything other than the inputs and the minimum required metadata? Something that does the equivalent of the following, from the article above:

jq --indent 1 \
    '
    (.cells[] | select(has("outputs")) | .outputs) = []
    | (.cells[] | select(has("execution_count")) | .execution_count) = null
    | .metadata = {"language_info": {"name":"python", "pygments_lexer": "ipython3"}}
    | .cells[].metadata = {}
    ' notebook.ipynb

My second question is, how much work would it be to implement such a feature (if that's easy to answer)?

Thanks!

@takluyver
Member

Something like this already exists as a separate tool, called nbstripout:
https://pypi.python.org/pypi/nbstripout

Personally, I'd say that having it as a separate tool is neater than adding another option to nbconvert.

@ceball
Author

ceball commented Aug 2, 2017

Something like this already exists as a separate tool, called nbstripout:
https://pypi.python.org/pypi/nbstripout

Thanks, I'm aware of nbstripout - it's one of the options mentioned in the article I linked to. The pypi page says that "[nbstripout] does mostly the same thing as the Clear All Output command in the notebook UI". I'm really looking for more than that. (Actually, it seems that nbstripout does remove more than the output, but it does not remove all metadata. I don't know if that's because it can't keep up with jupyter notebook changes, or because the project has a different intention such as being about stripping output?)

a separate tool is neater than adding another option to nbconvert

The documentation for --to notebook at https://nbconvert.readthedocs.io/en/stable/usage.html#convert-notebook says:

This doesn’t convert a notebook to a different format per se, instead it allows the running of nbconvert preprocessors on a notebook, and/or conversion to other notebook formats. For example:

jupyter nbconvert --to notebook --execute mynotebook.ipynb

This will open the notebook, execute it, capture new output, and save the result in mynotebook.nbconvert.ipynb.

[...]

The following command:

jupyter nbconvert --to notebook --nbformat 3 mynotebook

will create a copy of mynotebook.ipynb in mynotebook.v3.ipynb in version 3 of the notebook format.

So from my point of view (as a naive user), I guess I don't see why it would be neater to use an external tool than to have a --clean/--strip-metadata or similar kind of option in nbconvert itself. Could you explain a little bit more why such an option would not fit well into nbconvert? Thanks!

@takluyver
Member

Could you explain a little bit more why such an option would not fit well into nbconvert?

It would fit OK, but I'd rather encourage an ecosystem of small tools which can be pieced together than one giant tool with loads of options. It's quite easy to modify notebooks programmatically, especially if you use the nbformat Python library to read and write them, so it should be pretty simple to write a little tool that does exactly what you want.
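For illustration, a minimal sketch of such a "little tool" using only the standard library (nbformat would be the more robust choice since it validates and upgrades notebooks, but plain JSON manipulation shows the idea; the name strip_notebook is hypothetical):

```python
import json
import sys

def strip_notebook(nb):
    """Return a copy of a v4 notebook dict with all outputs and all
    notebook- and cell-level metadata removed."""
    nb = json.loads(json.dumps(nb))  # deep copy via a JSON round-trip
    nb["metadata"] = {}
    for cell in nb.get("cells", []):
        cell["metadata"] = {}
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb

if __name__ == "__main__":
    # Rewrite each notebook given on the command line in place.
    for path in sys.argv[1:]:
        with open(path) as f:
            nb = json.load(f)
        with open(path, "w") as f:
            json.dump(strip_notebook(nb), f, indent=1)
            f.write("\n")
```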

@jbednar

jbednar commented Aug 2, 2017

I'm frustrated as a long-time user of notebooks by it being "quite easy to modify notebooks programmatically", and there being a lot of "little tools that do exactly what you want". I've seen dozens of such tools come and go over the years, each one doing precisely what that author wanted to do, at that time, and then getting stale and unmaintained and unused. I've written two or three of them myself, in different projects for different employers in different years, as have at least a half-dozen people I know personally, and yet we are still here today asking this question. Each one of those tools fails to strip out some new category of metadata added later in some Jupyter version or by some Jupyter extension.

I would think that if there were one single main tool, always available, stripping out everything that was not put there by the user, then people could use that consistently and maintain that to strip out any new metadata that crops up. There may be people who want lots of configuration, and want detailed control, but it seems like they can be the ones to write specialized little tools and maintain them.

Instead what seems much more crucial to offer directly from the Jupyter project is something simple and reliable that we can always ask our users to do, that we can consistently use ourselves, and that simply makes a minimal version of the notebook consisting of what was actually typed in and nothing else -- no output, no widget state, no pygments version, nothing but the notebook format version and the actual contents of the code and markdown cells. That's what I want to archive as the source code for a notebook, and I don't think there is actually all that wide a diversity of opinions about this.

@takluyver
Member

If what you want to do is literally strip all metadata, that's very easy to do reliably. The complication and inconsistency is because lots of people want to strip out some metadata while leaving other bits.

@jbednar

jbednar commented Aug 2, 2017

Right. So can Jupyter offer a supported, maintained way to strip all metadata that's not needed for re-running the notebook? Those lots of people can indeed use lots of tools to handle their specific needs, but I strongly believe there is a core, simple need shared across a wide range of projects and people for simply stripping everything that doesn't constitute "source code". Personally, I'd even want a button for this in the interface.

@mpacer
Member

mpacer commented Aug 3, 2017

@jbednar What metadata do you consider necessary to run the notebook?

Partially this will depend on which contexts you consider to "count" as far as running a notebook goes (e.g., the live notebook server, nbconvert --execute, nbsphinx, &c.). The different contexts require different pieces of information, which is why people will disagree about what should be left in.

@jbednar

jbednar commented Aug 3, 2017

I don't have any opinion about nbconvert or nbsphinx, which both seem relevant almost exclusively to rendered, non-cleared notebooks, and here we're talking about cleared notebooks suitable for archiving as source code.

Basically, when I open a new notebook with Jupyter 4.3.0, a file is created that briefly consists only of (1):

{
 "cells": [],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 2
}

That's nice and clean -- an empty list of cells, plus some numbers declaring the notebook format, which is clearly important to record. It's beautiful source code that I am very happy to archive, preserve, and diff.

Let's say I add a single source code cell containing "2+3". As far as I the user am concerned, I've added three characters of data, plus a few more to record the fact that I opened a new code cell. So I would hope to see a file like (2):

{
 "cells": [
  {
   "cell_type": "code",
   "source": ["2+3"]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 2
}

I.e., the original skeleton plus the information that there is one more cell, that it is a code cell, and that it contains the three characters that I typed. That's what I deliberately added as a user.

Unfortunately, this minimal version is not considered legal by Jupyter, which requires a bunch of metadata to be present whether or not it captures anything the user did (3):

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "2+3"
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 2
}

What's worse is that this minimal legal version isn't what's stored by Jupyter, even after "clear output". Instead we get (4):

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "2+3"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:ds]",
   "language": "python",
   "name": "conda-env-ds-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}

Not only does version (4) completely obscure what I actually typed as source code, it also varies across jupyter versions, varies depending on what extensions are installed, and varies in ways that seem very unpredictable as notebooks are run, used, and cleared, with the result that it's nearly impossible to detect what the user actually typed. Even when only one character is changed, it's often impossible to detect that the character changed in a diff. It's completely diff-incompatible, and seemingly impossible to treat as source code, even though just about the only thing the users are actually supplying is source code.

So, version (4) is what we get now, but I want version (2), both for "clear output" in Jupyter and for nbconvert --clean *.ipynb. Unfortunately, version (2) is not legal, so I'd settle for version (3).
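Going from version (4) back to the minimal legal version (3) is mechanical: keep only the keys the v4 schema requires on each cell and empty everything else. A sketch (the function name to_minimal_legal is hypothetical):

```python
def to_minimal_legal(nb):
    """Reduce a v4 notebook dict to version (3) above: the user's cells plus
    only the keys the format requires, with empty/null placeholder values."""
    minimal = {
        "cells": [],
        "metadata": {},
        "nbformat": nb["nbformat"],
        "nbformat_minor": nb["nbformat_minor"],
    }
    for cell in nb["cells"]:
        out = {
            "cell_type": cell["cell_type"],
            "metadata": {},          # required key, but emptied
            "source": cell["source"],
        }
        if cell["cell_type"] == "code":
            out["execution_count"] = None  # required on code cells
            out["outputs"] = []            # required on code cells
        minimal["cells"].append(out)
    return minimal
```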

@jbednar

jbednar commented Aug 3, 2017

Oops, I just noticed that the version (4) I had pasted above didn't actually include the source code "2+3". I guess that just underscores the point -- the actual change I meant to make was completely obscured by all the extra metadata. Edited to fix that above.

@mpacer
Member

mpacer commented Aug 3, 2017

A few things:

  1. In your request, you asked for something equivalent to the following:
jq --indent 1 \
    '
    (.cells[] | select(has("outputs")) | .outputs) = []
    | (.cells[] | select(has("execution_count")) | .execution_count) = null
    | .metadata = {"language_info": {"name":"python", "pygments_lexer": "ipython3"}}
    | .cells[].metadata = {}
    ' notebook.ipynb

That specifically includes information relevant to nbsphinx, since for the author of that blog post, notebooks as part of documentation were part of the story. It also means nbconvert needs to handle that information without running into issues. This is what @takluyver was saying about different people wanting different things.

  2. You would need to specify the kernel when you run your bare notebook… that is where a lot of the metadata you say doesn't come from a user action is being added. It is added in response to your action of choosing a kernel.

  3. You asked to clear metadata, but the distinction between (2) and (3) doesn't have anything to do with metadata, except that in one case there is no metadata key inside the cell and in the other there is (with an empty object as its value).

What you seem to be asking for would be easier to implement (and faster to run) by modifying the script you gave as your example

jq --indent 1 \
    '
    (.cells[] | select(has("outputs")) | .outputs) = []
    | (.cells[] | select(has("execution_count")) | .execution_count) = null
    | .metadata = {}
    | .cells[].metadata = {}
    ' notebook.ipynb

Crucially, I don't think nbconvert is the right tool for this job. You don't want something that is carefully attending to the content of the notebook, and nbconvert incurs a lot of overhead costs by trying to carefully attend to the content of the notebook. A quick, context-free script like above (for command line use, and a similar JSON manipulation script for in-browser use) would seem to be much closer to the use case you're aiming at than having this built into nbconvert.

However, having something that allows you to clean some kinds of metadata with nbconvert in an intelligent, configurable way would allow you to do something like what you're asking by setting the configuration in a certain manner. I'd imagine --clean running the CleanFooMetadataPreprocessors and the ClearOutputPreprocessor adhering to those customised settings whenever it's run.
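One way to read "intelligent, configurable" cleaning is a whitelist: the user declares which metadata keys are meaningful to them and everything else is dropped. A hedged sketch in plain Python (the default keep sets here are illustrative, not a proposed standard; a real nbconvert preprocessor would expose these as traitlets configuration):

```python
def clean_metadata(nb, keep_notebook=("language_info",), keep_cell=()):
    """Drop every metadata key that is not explicitly whitelisted.

    keep_notebook / keep_cell name the keys to preserve at each level;
    passing empty tuples strips everything, as in the "source only" case.
    Modifies the notebook dict in place and returns it.
    """
    nb["metadata"] = {k: v for k, v in nb.get("metadata", {}).items()
                      if k in keep_notebook}
    for cell in nb.get("cells", []):
        cell["metadata"] = {k: v for k, v in cell.get("metadata", {}).items()
                            if k in keep_cell}
    return nb
```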

@jbednar

jbednar commented Aug 4, 2017

In your request, you asked for something equivalent to the following:

Just to be clear, that was Chris's request, not mine! I'm just dogpiling. :-)

a lot of the metadata you are saying doesn't occur in response to a user-action is being added. It is in response to your action of choosing a kernel.

Any notebook that we store in a github repo as source code ends up getting edited by lots of different people who have different Python versions, different Anaconda environments, etc., each with their own kernel. As you say, the user chooses a kernel, but that's because they must choose a kernel in order to be able to run it to test anything. And storing that information with the output is very important as a record of what was run. But it doesn't change what the source code was. I could imagine wanting to declare and persist that it's Python2 or Python3 or R, but really only if the user declares that, not just because it was auto-filled in by running it. And I definitely do not want to see "acon2", "testenv", or any of the other huge variety of kernels that we run across, and that change every time someone wants to edit any of our notebooks. I'm feeling tired just thinking about all the work we have to do every time anyone wants to edit anything!

Anyway, I don't have any strong opinion about whether nbconvert is the right place for this functionality, but I do strongly believe that there should be a simple way to strip out everything that isn't needed to capture what the user put in.

@ceball
Author

ceball commented Aug 4, 2017

Sorry I wasn't clear in my original request by giving a distracting example! What I really want is a standard way for people to save a notebook containing only the input cell source and type. A save called "inputs only" or "source only" or something like that, I guess. Nothing user- or system-specific would be included, so nothing about which kernel they used to run the notebook, or which cells they ran in what order, or how they viewed cells in the notebook, etc.

I would really want this "source only" save to be standardized in some way (I mean, to be defined by jupyter - not to replace the default type of save) because although over the years I have written my own tools to strip everything but inputs (and whatever is necessary to allow jupyter to open notebooks), I have to explain to every contributor why running some weird extra tool is necessary and how to install it, plus I have to keep the tool up to date with changes in jupyter.

Crucially, I don't think nbconvert is the right tool for this job. [...] A quick, context-free script like above (for command line use, and a similar JSON manipulation script for in-browser use) would seem to be much closer to the use case you're aiming at than having this built into nbconvert.

I made this request on nbconvert because it's part of jupyter, thus is standard and everyone has it (I think?), plus it already allows notebook to notebook conversion, so I thought an option on that might fit easily and use code already in nbconvert. I would definitely love a 'source only' save button to be in everybody's jupyter notebook viewers, but I figured that would be a more controversial and difficult request (e.g. because of potentially confusing users).

  1. In your request, you asked for something equivalent to [a script that] specifically includes information relevant to nbsphinx

Yes, sorry I wasn't clear there. I regret using that example script because it was distracting; I see now that the author added something to help a particular tool, which I hadn't noticed before. I want only the input source code (or anything that is a property of the input source code itself, like the type of source code it is).

  1. You would need to specify the kernel when you run your bare notebook

I am ok with that. I can always select the kernel I want to use in my viewer (or specify a default for my viewer outside of any particular notebook), and that choice will subsequently be saved into the notebook via the 'normal' save button. If I do a "source only" save, I would expect that information to be removed and to have to specify it again. Or, if using nbconvert to execute a "source only" saved notebook, I can specify the kernel on the commandline. (I'm not sure now, but at least in the past it was usually necessary to specify the kernel option anyway, because the kernel saved in the notebook by someone else was unlikely to exist on my system.)

If there were standard metadata that said the source is python or julia or R or whatever i.e. was a property of the input source code itself, I would be ok with that remaining. It would not change just because someone opened the notebook and was using a different kernel from me - it would only change if there is a meaningful change to the source code.

@jbednar

jbednar commented Aug 4, 2017

If there were standard metadata that said the source is python or julia or R or whatever i.e. was a property of the input source code itself, I would be ok with that remaining. It would not change just because someone opened the notebook and was using a different kernel from me - it would only change if there is a meaningful change to the source code.

Exactly! I woke up thinking about precisely that point (sad, I know). A given notebook is either in R, Python2, Python3, or compatible with both 2 and 3, for the most part. It would be great if the notebook could declare which of those it is, as source code. Separately, it's great that the rendered notebook (with outputs) records which kernel was actually used when running. But those are entirely separate things -- declaring which language the notebook is written in, and recording which kernel was used in a run. A "source only" version of a notebook should be able to preserve any explicit declaration the user made about the language being used, without changing simply because the user actually ran the code and then cleared the cells. Right now those two very different things (declaring and recording) are being conflated, which is confusing every single time someone wants to edit a notebook in an environment even slightly different from the one in which it was created.

At a more fundamental level, I think it is incontrovertible that:

  1. The Jupyter Notebook is being used by untold numbers of people.
  2. Many of those users are single users developing one-off, throwaway things that don't need to be maintained, tracked, or archived over time, and these users are very well supported by Jupyter as it stands, by preserving the rendered uncleared notebooks.
  3. Many other users, or those same users at different times, are developing substantial, meaningful things in notebooks, i.e., original artifacts that are worth preserving and maintaining over time---things that act like source code.
  4. There is not currently any reliable, maintainable way for users to extract, preserve, and maintain their contributions to these notebooks, in a form that behaves like source code (i.e., only changes when the user explicitly changes it).
  5. This failure to provide a reliable, universal, predictable way to work with the source code is causing untold daily misery in the Jupyter community, is a huge drain on our productivity, and is very actively discouraging user contributions to group-maintained and community-maintained notebooks.
  6. Supporting this large, important class of Jupyter usage is entirely feasible, and should not be overlooked just because there are lots of other somewhat similar things that various stakeholders might also want to do.

@Carreau
Member

Carreau commented Aug 4, 2017

Any notebook that we store in a github repo as source code ends up getting edited by lots of different people who have different Python versions, different Anaconda environments, etc., each with their own kernel. As you say, the user chooses a kernel, but that's because they must choose a kernel in order to be able to run it to test anything. And storing that information with the output is very important as a record of what was run. But it doesn't change what the source code was. I could imagine wanting to declare and persist that it's Python2 or Python3 or R, but really only if the user declares that, not just because it was auto-filled in by running it. And I definitely do not want to see "acon2", "testenv", or any of the other huge variety of kernels that we run across, and that change every time someone wants to edit any of our notebooks. I'm feeling tired just thinking about all the work we have to do every time anyone wants to edit anything!

You should, as part of the git repo, set up a clean and smudge filter that would remove the information on commit and re-add it on checkout. Yes, someone should write a blog post that shows how to do that. I have contributed to several repositories (not Python) that do similar things, and it is (usually) part of the setup phase of the project – the equivalent of pip install -e ..

Now I'm not saying that we couldn't make that easier; I'm just saying that right now, if that's your pain point, it can be fixed. The other possibility – which I should work on – is a GitHub hook that can fix user-submitted pull requests. If "Allow edits from maintainers" is checked, it should not be too hard to automatically amend a user's PR.
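As a sketch of the clean-filter approach (the filter name nbclean and the exact set of stripped fields are illustrative; nbstripout packages essentially this): git pipes the staged file through the command's stdin and stdout, so the script only has to rewrite the JSON it is given. The hypothetical wiring is shown in the comments.

```python
#!/usr/bin/env python3
# nbclean.py -- a git "clean" filter for notebooks. Hypothetical wiring:
#   git config filter.nbclean.clean "python3 nbclean.py"
#   echo '*.ipynb filter=nbclean' >> .gitattributes
import json
import sys

def clean(text):
    """Strip outputs, execution counts, and all metadata from notebook JSON."""
    nb = json.loads(text)
    nb["metadata"] = {}
    for cell in nb.get("cells", []):
        cell["metadata"] = {}
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return json.dumps(nb, indent=1) + "\n"

if __name__ == "__main__":
    data = sys.stdin.read()
    if data.strip():
        sys.stdout.write(clean(data))
```

A matching smudge filter is optional; with only a clean filter configured, the working-tree copy keeps its metadata while the committed copy stays stripped.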

Anyway, I don't have any strong opinion about whether nbconvert is the right place for this functionality, but I do strongly believe that there should be a simple way to strip out everything that isn't needed to capture what the user put in.

The issue is that the notebook server application does need some of this information. Unfortunately, so many of our users are used to "a web application autosaves", often without even realizing it, that we have to autosave and embed this information. There are also a number of cases where this information is expected by default when users are sharing notebooks. The notion of "Save as" with a different, stripped format is complex for many users. It should get better with JupyterLab, which allows attaching kernels to any document type, but it's not there yet.

I would really want this "source only" save to be standardized in some way (I mean, to be defined by jupyter - not to replace the default type of save) because although I have written my own tools to strip everything but inputs and whatever is necessary to allow jupyter to open notebooks over the years, I have to explain to every contributor why running some weird extra tool is necessary and how to install it, plus I have to keep the tool up to date with changes in jupyter.

"Source only" means a lot of different things in different contexts. Is that input only for you? If that's the case, then ipynb might not be the best format to store things in. We haven't yet developed something that could work, but a notebook-like interface to edit a format like Rmarkdown would be possible. There are only 24h/day unfortunately. But we'll come to it.

Exactly! I woke up thinking about precisely that point (sad, I know). A given notebook is either in R, Python2, Python3, or compatible with both 2 and 3, for the most part. It would be great if the notebook could declare which of those it is, as source code. Separately, it's great that the rendered notebook (with outputs) records which kernel was actually used when running. But those are entirely separate things -- declaring which language the notebook is written in, and recording which kernel was used in a run. A "source only" version of a notebook should be able to preserve any explicit declaration the user made about the language being used, without changing simply because the user actually ran the code and then cleared the cells. Right now those two very different things (declaring and recording) are being conflated, which is confusing every single time someone wants to edit a notebook in an environment even slightly different from the one in which it was created.

It's funny how we used to have only the language a couple of years ago – we even had a language per cell, but that was a footgun – and evolved to use kernelspec per user request. Astonishingly, "only the language" is not enough for reproducibility for a wide majority of our users. We got requests for the language version as well; then some people run things on PyPy rather than CPython, so they want an extra field. In the end there are always some people that want more, and some people that want less. Let's not even speak about the people who actually request that we also store hardware specs in the notebook, because they need to know ahead of time whether a notebook requires a GPU or not. So it's a tough spot, and we are working on making things easier, but we are between a rock and a hard place.

There is not currently any reliable, maintainable way for users to extract, preserve, and maintain their contributions to these notebooks, in a form that behaves like source code (i.e., only changes when the user explicitly changes it).
This failure to provide a reliable, universal, predictable way to work with the source code is causing untold daily misery in the Jupyter community, is a huge drain on our productivity, and is very actively discouraging user contributions to group-maintained and community-maintained notebooks.

I think this is not a question for nbconvert; I believe this is a case where the notebook format in itself is not suitable. That being said, it does not mean the notebook interface cannot be used for a newly designed format. There have been a number of explorations (like nbexplode) that make that way easier. Nbdiff is working as well. Now, these things are far from having the quality necessary to be in core, and it would be nice to have them cleaned up a bit before inclusion.

Supporting this large, important class of Jupyter usage is entirely feasible, and should not be overlooked just because there are lots of other somewhat similar things that various stakeholders might also want to do.

In general we are trying to make the core contain as few things as possible. It is not because something is in the core that it will be better maintained, though we try to provide as many hooks as possible to make the core extensible. Technically you should not need to commit something to core nbconvert; you should be able to have your own package with an entry point. Once something works and is regularly used, and we get help for maintenance, we can decide to move it into core. One of the issues will be reaching consensus on what the minimal necessary things are. I've been involved for ~5 years in designing the notebook format, and usually people ask us for more, not less.

It also seems that the "cleaning" part is not directly a feature request for nbconvert, but a feature request to make collaboration easier. We may want to take a different approach for that.

@jbednar

jbednar commented Aug 9, 2017

You should as part of the git repo set up a clean and smudge filter that would remove the information on commits / re-add it on checkout

What I'm asking for would be a very good candidate for a git filter. What is currently available is not a good candidate for a git filter, because all of the available scripts like that break all of the time, whenever new metadata is added somewhere. Some of the people commenting or mentioned above have indeed set up these filters, but have then backed away because they turn out to be a lot of trouble in practice, over time. A fully supported, maintained way to make a truly clean notebook would be a great step in this direction. Being able to apply this on demand to a user contribution PR would be really useful too, but is more than I am currently dreaming of!

Unfortunately so many of our users are used to "a web application autosaves" even without realizing that, that we have to autosave and embed this information.

The more you autosave and embed useful stuff like that, the more crucial it is that there be a supported, obvious way to clear out this information. They go hand in hand -- if you are taking it on yourself to doctor and augment what the user provides, then there should be some clear way for the user to revert those changes when it's not appropriate to preserve them. If you weren't adding the extra metadata, then a clean operation wouldn't be needed, but it is.

"Source only" means a lot of different things in different context. Is that input only for you ?

I'm not sure what you mean by that, but what I mean is just that we want to archive and capture the actual textual contributions from someone explicitly editing a notebook, precisely in the same way as a text editor captures the explicit textual contributions from someone editing a text file. The text editor doesn't record the sequence of menu items the user selects, or the way the mouse moved, it just saves the actual text. The same is needed here -- some way to save just the non-auto-generated stuff, such that it can be re-run as a notebook but which one can always go back to. People use notebooks to write source code, and we need a way to archive and maintain that source code without having it be drowned out by auto-generated fluff.

there are always some people that want more, and some people that want less.

Sure. I'm happy for Jupyter to cram the rendered notebooks with as much metadata as you like, recording everything. Great! The more the merrier. But that doesn't change the need for there to be a way to get back to the underlying, user-generated content in a cleared notebook. If you ask a selection of serious developers who use notebooks and actually care about their content, with multiple people editing it, storing it, and maintaining it over time, I think you will hear very similar woes. E.g. see the blog post above and any number of people who have written such stripping and cleaning utilities, all of whom are only the tip of the iceberg of people affected (since most people just curse and move on). You'll also find a lot of people who decide that notebooks are entirely inappropriate for content that is maintained over time, but I think that's only for these incidental reasons that can be fixed, not inherent to the notebook format.

I believe that is where the notebook format in itself is not suitable.

Oh no! Please don't treat this as a request for some additional, separate format. While I do happen to hate JSON, and would much rather notebooks be some human-readable/editable format like YAML, trying to go down that path seems like an infinite distraction that won't ever solve current problems for the very large and ever-increasing installed userbase. The current JSON format is actually fairly reasonable if the extra fluff is just eliminated (see version 2 above). And it's very portable, with people everywhere able to work with it already, unlike variants that have been created that have to first be converted into JSON. We just need a way to get to the currently possible minimal (or at least predictable and repeatable) specification reliably.

the "Cleaning" part is not directly a feature request for nbconvert, but a feature request to make collaboration easier. We may want to take a different approach for that.

I have no opinion about whether nbconvert is the right way to create a "source only" version. I only plead for (a) there to be a supported, proactively maintained command-line tool for generating such a version given a saved file with output and metadata, and if possible (b) to be able to generate such a file from within the interface itself, which will often but not always be more convenient than a command-line tool. (a) is far more important than (b), since a command-line tool is automatable, but both are helpful.

@wstomv

wstomv commented Aug 31, 2017

I am involved in developing a course that will make extensive use of Jupyter notebooks, both for making study material available to students and for collecting work from students. The study material is developed in a (Git) repository. I have been surprised at the amount of hard-to-control clutter generated inside notebooks as people modify them.

My search for a way to 'normalize' notebooks has been unsatisfactory so far. That is why we, reluctantly, started developing our own tool set. Out-of-the-box support for this would be appreciated and would help acceptance of Jupyter notebooks as long-lasting carriers of information.

Notebook (top-level) metadata, such as celltoolbar, anaconda-cloud, nbpresent, toc, gets copied around and sticks like a virus. Cell metadata, such as collapsed, scrolled (still referred to as autoscroll in the nbformat docs), tags, nbpresent, slideshow, etc., jumps around. It depends in part on the extensions that the individual user has installed, and it is difficult to 'reset' (the Jupyter Notebook Editor currently only supports 'toggling' of collapsed and scrolled).
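
The "non-standard key" detection that nbstats reports can be sketched with nothing but the standard library. This is a minimal, hypothetical sketch, not the actual nbstats code: the `KNOWN_NB_KEYS` and `KNOWN_CELL_KEYS` allowlists are an approximation of what the nbformat v4 schema documents, and `extra_metadata_keys` is a name invented here for illustration.

```python
import json

# Keys documented in the nbformat v4 schema (an approximation; consult the
# JSON schema in the nbformat project for the authoritative list).
KNOWN_NB_KEYS = {"kernelspec", "language_info", "authors", "title"}
KNOWN_CELL_KEYS = {"collapsed", "scrolled", "deletable", "editable",
                   "format", "name", "tags", "jupyter", "execution"}

def extra_metadata_keys(nb):
    """Return (notebook-level, cell-level) non-standard metadata keys."""
    nb_extra = set(nb.get("metadata", {})) - KNOWN_NB_KEYS
    cell_extra = set()
    for cell in nb.get("cells", []):
        cell_extra |= set(cell.get("metadata", {})) - KNOWN_CELL_KEYS
    return nb_extra, cell_extra

nb = json.loads("""{
 "cells": [{"cell_type": "code", "metadata": {"collapsed": true, "nbpresent": {}},
            "source": [], "outputs": [], "execution_count": null}],
 "metadata": {"kernelspec": {"name": "python3"}, "toc": {}},
 "nbformat": 4, "nbformat_minor": 4
}""")

print(extra_metadata_keys(nb))  # ({'toc'}, {'nbpresent'})
```

Reporting the stray keys first, rather than deleting blindly, makes it easier to decide which extension metadata (toc, nbpresent, anaconda-cloud, ...) is actually safe to strip.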

The JSON schemas in the nbformat project clearly indicate what is required and what is 'known-optional'.

To help inspect notebooks quickly, we have nbstats. Here is example output with option --all:

Comprehensions and Generators.ipynb
Valid notebook structure:
   True
Notebook metadata:
    4.1 format version
  python3 kernel
  python 3.6.1 language
Other notebook metadata keys:
        nbpresent
        anaconda-cloud
        toc
Cell types:
     98 Total
     45 code
     53 markdown
Cell metadata:
      9 collapsed
      8 scrolled
Code cells:
     87 In[#] maximum
     42 Total outputs
     27 execute_result
     15 stream
     15 stream stdout
  False All code cells were executed, in order

For cleaning up we have nbclean:

usage: nbclean.py [-h] [-n] [-g NOTEBOOK_METADATA_KEYS]
                  [-m CELL_METADATA_KEYS] [-t TAGS] [-c] [-o] [-v]
                  nb.ipynb [nb.ipynb ...]

Clean up notebooks.

positional arguments:
  nb.ipynb              notebooks to clean up

optional arguments:
  -h, --help            show this help message and exit
  -n, --no-execute      dry run
  -g KEY, --global KEY
                        key to remove from metadata of notebook
  -m KEY, --metadata KEY
                        key to remove from metadata of all cells
  -t TAG, --tag TAG     tag to remove from all cells (use '-m tags' to remove
                        all tags)
  -c, --clear-output    clear all outputs from all code cells
  -o, --overwrite       overwrite original notebooks
  -v, --verbose         verbose mode

Options to remove all non-standard (notebook/cell) metadata keys are coming.

@crystalneth

I'm pretty new to Jupyter and surprised that this is not a solved problem. It's certainly already a headache for me, even on a personal project using git, as it makes my git diffs impossible to navigate. My only option seems to be to gitignore the notebook and try to remember to check it in explicitly on occasion.

From my point of view, source code and metadata like output and formatting should live in separate locations, with the user-specific parts gitignored and otherwise not shared. Anything a user wants included in the project files should be explicitly imported into the project file.

@takluyver
Member

nbstripout can be configured as a git hook. You can also integrate nbdime with git for nicer diff/merge.

@wstomv

wstomv commented Apr 10, 2018

Also the clean tool of nbtoolset can be used to 'normalize' notebooks before committing them to a configuration management repository.

@henriqueribeiro
Contributor

Hey, I just created a pull request (#805) that adds a preprocessor that clears all the metadata present on the code cells. I believe that this preprocessor, together with ClearOutput, does almost the same thing as nbstripout.

@cdeil

cdeil commented Nov 6, 2018

Hi, I found this issue and wanted to say that I agree with @jbednar in #637 (comment)

I'm frustrated as a long-time user of notebooks by it being "quite easy to modify notebooks programmatically", and there being a lot of "little tools that do exactly what you want". I've seen dozens of such tools come and go over the years.

@takluyver - you wrote above in #637 (comment) :

I'd rather encourage an ecosystem of small tools which can be pieced together rather than one giant tool with loads of options.

As a user, it's really hard to know which package to use (apparently https://github.com/kynan/nbstripout would be a good choice here) and to have to install it separately, and then my experience over the past years is also like what @jbednar said: these little packages often are not well-maintained, and one has to find and learn a new one after a year or two.

So +1 to put more features into the core packages like nbconvert. Or alternatively, if you don't want that, somewhere clearly recommend which of the many small packages to use for which task, e.g. for this case I guess https://github.com/kynan/nbstripout .

In the project I work on, @Bultako wrote this 10-line function and exposed it via the command line to make stripping notebooks easy for our contributors working on notebooks.

https://github.com/gammapy/gammapy/blob/089d552885256c560c3febdb4610b98b4e708bf0/gammapy/scripts/jupyter.py#L70

Basically, I'd prefer not to have that code at all, and for nbconvert to support this task built-in from the CLI, the way it's already exposed via the "strip output" option in the Jupyter web interface.

@nickurak

Just throwing this out as an option -- although certainly a more complicated one to implement.

In a perfect world it'd be nice if jupyter notebooks were broken down into smaller pieces -- the input file as one entity (*), the various bits of metadata stored elsewhere, and the output stored elsewhere.

(*) You might even want program code split out separately -- imagine your python code parts of a notebook just existing as python files that all your normal python tooling could operate on.

The root notebook file might then just be a set of references to the other files that are required for Jupyter to assemble something that is managed in the presentation layer as a single coherent document.

Then you could decide which parts you want in your version control. Certainly the markdown and python input is critical. Maybe your metadata state is important too, but the things generated by executing other things stored in the repository aren't.

@mwouts

mwouts commented Jun 25, 2019

I landed here by chance (I was actually searching for the new --no-input option of nbconvert). May I ask, @ceball , @jbednar and @crystalneth if you've tried Jupytext ?

what was actually typed in and nothing else -- no output, no widget state, no pygments version, nothing but the notebook format version and the actual contents of the code and markdown cells

That is exactly the idea behind Jupytext: the code and markdown cells are stored in a Markdown file (or a script). Only the selected metadata is stored in that file.

This Markdown file comes as an addition to the original notebook, in which the full metadata and outputs are preserved. Both files can be edited. When the notebook is refreshed in Jupyter, Jupytext reads the input cells from the text file, while the outputs and the filtered metadata are loaded from the ipynb file.

From my point of view, source code and metadata like output and formatting should be in separate user specific location that can be gitignored and otherwise not shared

With Jupytext you can add *.ipynb to your .gitignore, unless of course you want to version the outputs. If missing, the .ipynb file will be reconstructed automatically the next time the notebook is opened (from the Markdown file) and saved.

Please don't treat this as a request for some additional, separate format.

Well, sorry... Jupytext does implement a few separate formats for Jupyter notebooks, e.g. GitHub/VS Code-compliant Markdown. But maybe you will appreciate these alternative formats - especially if you need to manually edit a notebook... or merge contributions... You'll tell me!

@jbednar

jbednar commented Jun 26, 2019

Jupytext is great, but it's a separate thing, as already argued above, because it is not the single, shared, fully interchangeable format. Supporting a clean form of .ipynb benefits every possible tool and usage of Jupyter, because anything Jupyter-related can use .ipynb, but only some use cases are addressed by a non-ipynb format. So jupytext is orthogonal to this request; it being available is great, but doesn't mean the requested functionality is needed any less.

@MSeal
Contributor

MSeal commented Sep 4, 2019

Just to clarify, since I wasn't involved in these original threads much: is there anything beyond --ClearMetadataPreprocessor.enabled=True --ClearOutput.enabled=True --to=notebook needed for nbconvert to meet the requested functionality? If not, could we close the issue? If so, what's remaining? While I agree that nbstripout is the nicer tool for git hook integration, I see the desire to be able to clean notebooks with nbconvert and want to make sure this is easily achieved.

@ceball
Author

ceball commented Jun 17, 2020

@MSeal sorry I didn't see your response until someone pointed it out to me now. Thanks very much for following up!

I ran:

$ jupyter nbconvert --to=notebook --ClearOutputPreprocessor.enabled=True --ClearMetadataPreprocessor.enabled=True --output=nbclean nb.ipynb

The cell outputs have been removed, but the "user-specific" notebook metadata is still there (e.g. my quite personal kernel name, python37664bitcelltestsuiconda549e14be7f784a2fbde278c18c8be829).

ClearMetadataPreprocessor appears to be about removing only cell-level metadata, unless I have misunderstood.

The nbconvert version and the notebooks are below the line, at the end of this post.

Incidentally, the first few times I tried, I had a casing mistake (clearOutputPreprocessor.enabled=True), so output clearing didn't work. There's no warning about a bad argument, so it took me a while to realize what was going on.

I also tried --clear-output but I could not seem to get that to work (attempts shown at the end, below the line). #822

$ jupyter nbconvert --help
[...]
--clear-output
    Clear output of current file and save in place, 
    overwriting the existing notebook.

$ jupyter nbconvert --version
5.6.1

nb.ipynb:

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.7.6 64-bit ('celltestsui': conda)",
   "language": "python",
   "name": "python37664bitcelltestsuiconda549e14be7f784a2fbde278c18c8be829"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}

nbclean.ipynb:

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.7.6 64-bit ('celltestsui': conda)",
   "language": "python",
   "name": "python37664bitcelltestsuiconda549e14be7f784a2fbde278c18c8be829"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}

Trying to use --clear-out (in each case, the resulting notebook still contained output; same size in bytes as the original):

$ ls -l nb.ipynb 
-rw-rw-r-- 1 sefkw sefkw 932 Jun 17 08:50 nb.ipynb

$ jupyter nbconvert --clear-out  nb.ipynb 
[NbConvertApp] Converting notebook nb.ipynb to notebook
[NbConvertApp] Writing 932 bytes to nb.ipynb

$ jupyter nbconvert --clear-out --inplace nb.ipynb 
[NbConvertApp] Converting notebook nb.ipynb to notebook
[NbConvertApp] Writing 932 bytes to nb.ipynb

$ jupyter nbconvert --clear-out --to=notebook --inplace nb.ipynb 
[NbConvertApp] Converting notebook nb.ipynb to notebook
[NbConvertApp] Writing 932 bytes to nb.ipynb

$ jupyter nbconvert --clear-out --to=notebook --output=nbclean.ipynb nb.ipynb 
[NbConvertApp] Converting notebook nb.ipynb to notebook
[NbConvertApp] Writing 932 bytes to nbclean.ipynb

@ceball
Author

ceball commented Jun 17, 2020

@mwouts I use jupytext a lot, and I wish everyone else used it, but they don't :( ("they" being users of notebooks I maintain, maintainers of notebooks I use, github, vscode, etc etc...)

@MSeal
Contributor

MSeal commented Jun 17, 2020

There's no warning about a bad argument, so it took me a while to realize what was going on.

Yes, sorry about that -- I don't particularly like how the config loads command options either, but the backlog of things to fix for 6.0 includes fixing that at least (PRs to help with some of those 6.0 tasks would be really appreciated if others have time to contribute).

The cell outputs have been removed, but the "user-specific" notebook metadata is still there (e.g. my quite personal kernel name, python37664bitcelltestsuiconda549e14be7f784a2fbde278c18c8be829).

So the metadata clearing only clears cell metadata. If you cleared the notebook-level kernelspec metadata, the notebook would not be valid and would fail to load in UIs. It has to have https://github.com/jupyter/nbformat/blob/master/nbformat/v4/nbformat.v4.4.schema.json#L8 defined to be valid, so clearing it generically doesn't quite do the trick. I'd be ok with adding additional options to the ClearMetadataPreprocessor to clear non-required keys in the notebook-level metadata, but the kernelspec requires a name field of some sort. You can make your own preprocessor to change the kernel name by following https://nbconvert.readthedocs.io/en/latest/nbconvert_library.html?highlight=preprocessor#Custom-Preprocessors
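
The custom-preprocessor idea suggested above can be sketched without nbconvert at all, operating on the notebook as a plain dict. This is a hypothetical illustration, not an actual nbconvert preprocessor: `normalize_kernelspec` is a name invented here, and the generic kernelspec it writes (python3 / Python) is an assumption about what a "clean" kernel name should look like.

```python
def normalize_kernelspec(nb):
    """Replace a user-specific kernelspec with a generic one.

    Keeps the notebook valid (the kernelspec needs a `name`) while
    dropping machine-specific kernel names like `python37664bit...`.
    """
    lang = nb.get("metadata", {}).get("language_info", {}).get("name", "python")
    nb.setdefault("metadata", {})["kernelspec"] = {
        "name": "python3" if lang == "python" else lang,
        "display_name": lang.capitalize(),
        "language": lang,
    }
    return nb

nb = {"metadata": {"kernelspec": {"name": "python37664bitcelltestsui..."},
                   "language_info": {"name": "python"}},
      "cells": [], "nbformat": 4, "nbformat_minor": 4}
normalize_kernelspec(nb)
print(nb["metadata"]["kernelspec"]["name"])  # python3
```

Wrapping the same logic in a `Preprocessor` subclass, per the nbconvert docs linked above, would let it run as part of a `--to notebook` conversion.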

@jbednar

jbednar commented Jun 17, 2020

If you cleared the notebook level kernelspec metadata the notebook would not be valid and unable to load in UIs.

@MSeal, I have not found that to be true in practice. See my "version 3" example above, which does not specify the kernelspec. From at least classic Jupyter notebook 4.3.0 (latest when this issue was filed) up to my current installation of notebook 6.0.3, the UI starts and runs just fine without a kernelspec or any warnings. For JupyterLab 2.1.4, the user is presented with a modal requestor:

[screenshot: JupyterLab "Select Kernel" dialog]

After selecting a kernel, again there are no errors or warnings. So I don't currently see any problem with clearing the output down to "version 3", and in fact that's what we normally do in practice using our own separate tools.

As discussed above, it would be great if the language (Python 2, 3, R, etc.) could be declared separately from recording the kernel name, to avoid having to guess Python 3, but in the meantime the lack of a reliable and supported way for ordinary users to clear all the output and non-user-derived metadata is still a serious issue.

@MSeal
Contributor

MSeal commented Jun 22, 2020

Ahh, I had deleted too much in my test. I stand corrected: the metadata can be blank. Even though nested fields inside it are required if present, metadata itself is not explicitly required.

I'd be ok with adding additional options to the ClearMetadataPreprocessor to clear non-required metadata in the notebook level metadata

As I stated, I'm fine doing ☝️. @SylvainCorlay @t-makaro @maartenbreddels since you all have been contributing the most for the 6.0 release, do you think we should change the default behavior for this preprocessor or add a new option to preserve backwards compatibility?

@SylvainCorlay
Member

Really, the issue about the metadata field is that it is not split into

  • input metadata, which is really user input, such as tags defining whether a cell is a slide or should be skipped in a slideshow.
  • output metadata, generated by the notebook execution, such as widget state.

With such a split, we would have an easy means to clean up the notebook metadata.

@jbednar

jbednar commented Jul 2, 2020

Agreed! In fact I do want to preserve the slide-related metadata (for RISE, etc.), precisely for that reason; it is a user declaration about the contents of this cell, and not a recording of incidental state from notebook execution. It's actually been a major pain for me that the slide-related metadata gets cleared by our own custom clearing scripts, but I didn't want to bring that up, to avoid complicating things further.

Still, I'd be wary of trying to introduce a change in the spec to make such a distinction now, though, because it would result in different notebook formats that would be painful for users. In the absence of such an incompatible structural change to the metadata format, it seems like the alternative is for the tools to maintain a list of which bits of metadata represent user input, and clear the rest.

@MSeal
Contributor

MSeal commented Jul 2, 2020

Yeah, metadata isn't separated into output and input in the schema, and some fields are sometimes inputs and sometimes outputs of particular processing patterns. One thing we could consider is clearing all non-spec'd metadata (anything that doesn't have a schema entry for the given key in the format)?

@MSeal
Contributor

MSeal commented Sep 8, 2020

#1314 merged and is going out in the 6.0 release this week. It does provide more granular control of metadata removal.

Specifically, these fields now give better control over metadata removal:

    clear_cell_metadata = Bool(True,
        help=("Flag to choose if cell metadata is to be cleared "
              "in addition to notebook metadata.")).tag(config=True)
    clear_notebook_metadata = Bool(True,
        help=("Flag to choose if notebook metadata is to be cleared "
              "in addition to cell metadata.")).tag(config=True)
    preserve_nb_metadata_mask = Set([('language_info', 'name')],
        help=("Indicates the key paths to preserve when deleting metadata "
               "across both cells and notebook metadata fields. Tuples of "
               "keys can be passed to preserved specific nested values")).tag(config=True)
    preserve_cell_metadata_mask = Set(
        help=("Indicates the key paths to preserve when deleting metadata "
               "across both cells and notebook metadata fields. Tuples of "
               "keys can be passed to preserved specific nested values")).tag(config=True)

With this I am going to close the issue, as I think the required asks are now met with better controls in place.

@MSeal MSeal closed this as completed Sep 8, 2020