Support exporting to JSON Table Schema #14386

Closed
rgbkrk opened this issue Oct 10, 2016 · 18 comments
Labels
Enhancement IO JSON read_json, to_json, json_normalize
Milestone

Comments

@rgbkrk
Contributor

rgbkrk commented Oct 10, 2016

For Jupyter based frontends, we would love to see a common tabular format in JSON that we can render (in addition to or in lieu of the current HTML). This would provide us the flexibility to style and format according to data type, as well as have better hooks for theming of tabular data on frontends. Everyone has an opinion, let's give them flexibility to apply it.

It's important to us to support a common JSON format so that R, Julia, and other languages can also display their DataFrames with similar formatting and styling out of the box.

The best one I've seen so far, with a great amount of discussion and collaboration, is the JSON Table Schema.

Update: In order to include both data + schema, we're using the data resource format, which has the media type application/vnd.dataresource+json.

/cc @captainsafia @ellisonbg @jreback @TomAugspurger

@jreback
Contributor

jreback commented Oct 10, 2016

xref #9146, #9166

@jreback jreback added API Design IO JSON read_json, to_json, json_normalize Compat pandas objects compatability with Numpy or Python functions Needs Discussion Requires discussion from core team before further action labels Oct 10, 2016
@TomAugspurger
Contributor

I'll dig into the schema later, but just to make sure: the basic idea is for pandas to publish multiple outputs (application/html, application/json) wherever we publish just the HTML right now?
More concretely, what changes do we need to make to Series / DataFrames / Indexes to support this? IIRC there isn't a _repr_json_ equivalent of _repr_html_.
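
A minimal sketch of the mechanism being asked about: IPython looks up per-format `_repr_*_` methods on an object (a later comment in this thread confirms that `_repr_json_` is one of them), so a table-like class could offer JSON next to HTML. The `TinyTable` class and its payload layout are hypothetical stand-ins, not pandas code.

```python
import html


class TinyTable:
    """Hypothetical stand-in for a DataFrame-like object (not pandas code)."""

    def __init__(self, columns, rows):
        self.columns = list(columns)
        self.rows = [list(r) for r in rows]

    def _repr_html_(self):
        # Rich HTML representation, the kind notebooks already show for tables.
        head = "".join(f"<th>{html.escape(str(c))}</th>" for c in self.columns)
        body = "".join(
            "<tr>" + "".join(f"<td>{html.escape(str(v))}</td>" for v in row) + "</tr>"
            for row in self.rows
        )
        return f"<table><tr>{head}</tr>{body}</table>"

    def _repr_json_(self):
        # JSON-able representation: returns a dict, not an encoded string.
        return {"columns": self.columns, "rows": self.rows}


table = TinyTable(["a", "b"], [[3, 3], [0, 2]])
```

Both methods can be called directly to inspect what a frontend would receive.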

@rgbkrk
Contributor Author

rgbkrk commented Oct 10, 2016

Interesting - I just noticed they wrote a wrapper for pandas: https://github.com/frictionlessdata/jsontableschema-pandas-py

On the JupyterLab, notebook, and nteract side, we'd have https://github.com/frictionlessdata/jsontableschema-js to lean on.

@rgbkrk
Contributor Author

rgbkrk commented Oct 10, 2016

the basic idea is for pandas to publish multiple outputs (application/html, application/json) wherever we publish just the HTML right now?

Yes. The media type (mime type in Jupyter parlance) would be something like application/vnd.table-schema.v1+json.

@rgbkrk
Contributor Author

rgbkrk commented Oct 10, 2016

While there's not a repr for arbitrary media types in IPython (we can evolve that as a result of this discussion), there is a way to display raw messages with IPython.display.display:

IPython.display.display({
    'application/json': releases
}, raw=True)

Which shows up in nteract as:

[Image: screenshot, 2016-10-10 10:39 AM]

@pwalsh

pwalsh commented Oct 10, 2016

Hi. I'm one of the authors of JSON Table Schema, and also part of the team working on reference implementations for this and the related family of specs. The JavaScript implementation is just a little behind the Python one, and probably also of relevance here.

Happy to help.

edit: added link to the JavaScript implementation, in addition to the Python one previously linked.

@rgbkrk
Contributor Author

rgbkrk commented Oct 10, 2016

By the way, on the nteract and jupyterlab side, it's pretty easy for us to iterate with new renderers and media types.

@TomAugspurger
Contributor

I don't really see a reason not to add this in pandas; the additional code shouldn't be too much of a burden.

Would clients expect to receive the entire DataFrame, and do their own truncation? I worry a bit about the overhead of publishing huge DataFrames. I would say follow the options in pd.options.display.max_rows, etc. and only ship over some of the DataFrame (but need some way of saying that there's more...)
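
One way the truncation idea could look, as a sketch only: cap the rows sent to the frontend and record that more exist. The `truncate_records` helper and the `truncated` / `total_rows` keys are made up for illustration; neither pandas nor the spec defines them. The default of 60 mirrors the default value of `pd.options.display.max_rows`.

```python
def truncate_records(records, max_rows=60):
    """Return at most max_rows records plus a flag saying whether rows were dropped.

    Hypothetical helper: the 'truncated' and 'total_rows' keys are illustrative.
    """
    total = len(records)
    return {
        "data": records[:max_rows],      # keep only the leading rows
        "truncated": total > max_rows,   # tells the client there is more
        "total_rows": total,
    }


records = [{"a": i} for i in range(100)]
payload = truncate_records(records, max_rows=5)
```

This keeps only the head of the data for simplicity; a real implementation might keep head and tail, as the text repr does.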

A few things directly related to the spec that pandas might have trouble with:

  • field descriptors: in principle _metadata should carry this, but IIRC we don't have a good story on propagating that through operations, so it's liable to be dropped
  • field types: shouldn't have any problems here
  • primary key: Typically this would be the (multi)Index, but we don't require uniqueness on that.
  • field names: Somewhat rare, but we can have MultiIndexes in the columns, so we could have "multiple rows" of field names; these can be collapsed down to tuples.

@pwalsh

pwalsh commented Oct 11, 2016

We are very happy to make any changes needed to https://github.com/frictionlessdata/jsontableschema-pandas-py in order to support this smoothly, and especially in reference to things like streaming data out of a DataFrame, or limiting the rows from a frame for preview, and so forth.

@TomAugspurger
Contributor

TomAugspurger commented Dec 12, 2016

Started on this here: master...TomAugspurger:json-schema (very early, one test, no docs :D)

Some design things I'd like to nail down before submitting a PR:

  • The actual message published to the jupyter channel will be
{'schema': schema, 'data': data}

where schema is a valid JSON table schema and data is like pd.DataFrame.to_json(orient='records')

{
  "data": "[{\"a\":3,\"b\":3},{\"a\":0,\"b\":2},{\"a\":1,\"b\":1},{\"a\":3,\"b\":0},{\"a\":3,\"b\":1}]",
  "schema": {
    "fields": [
      {
        "type": "integer",
        "name": "a"
      },
      {
        "type": "integer",
        "name": "b"
      }
    ]
  }
}

Does that sound right?

  • Truncation: I think we'll follow pd.options.display.max_rows and only send that many rows; we'll need to think about what to do if people have set their display.large_repr to info...
  • Name: I've called it _repr_json_ for now, thoughts on what that should be? IIUC this won't be special like _repr_html_ is and called automatically. We'll have to publish this ourselves, and we can choose the name?
  • @jreback all this stuff I'm doing here, do we already have a simpler way of going from type to a "base" type? I don't want to have to worry about int16 vs int32, etc.
  • Speaking of types, pandas doesn't have a string type, so right now we send those over as "any". :( Do we want to do a bit of inference to maybe send those as strings, or leave that to the client? pandas 2 will have a string type, but that'll be a bit.
  • Indexes: When should we send them?
    1. Always
    2. When any (or all) of the levels are named

@jreback
Contributor

jreback commented Dec 12, 2016

@TomAugspurger

don't put this in core/generic.py (the actual table creation), instead pandas.formats.json might be appropriate (but make it clear this is an export only format).

so we already have all of the accessors, you can simply use your translation function.

In [5]: from pandas.types.common import is_integer_dtype, is_timedelta64_dtype, is_string_dtype

In [6]: is_integer_dtype(np.float)
Out[6]: False

In [7]: is_integer_dtype(np.integer)
Out[7]: True

In [8]: is_integer_dtype(np.dtype('m8[ns]'))
Out[8]: False

In [9]: is_timedelta64_dtype(np.dtype('m8[ns]'))
Out[9]: True

In [10]: is_string_dtype(np.dtype('O'))
Out[10]: True

In [11]: is_string_dtype(pandas.types.dtypes.CategoricalDtype())
Out[11]: True
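
A rough sketch of the translation function this suggests, written against the current public location of these helpers, `pandas.api.types` (the `pandas.types.common` path in the session above is the 2016 layout). The `base_type` name and the type labels chosen are assumptions for illustration, not the code that was merged.

```python
import numpy as np
from pandas.api.types import (
    is_bool_dtype,
    is_datetime64_any_dtype,
    is_float_dtype,
    is_integer_dtype,
    is_timedelta64_dtype,
)


def base_type(dtype):
    """Collapse a concrete dtype (int16, int32, ...) to a coarse type label."""
    if is_bool_dtype(dtype):
        return "boolean"
    if is_integer_dtype(dtype):
        return "integer"
    if is_float_dtype(dtype):
        return "number"
    if is_datetime64_any_dtype(dtype):
        return "datetime"
    if is_timedelta64_dtype(dtype):
        return "duration"
    # Fallback for object dtype and anything else not handled above.
    return "any"


names = ["int16", "int32", "float64", "bool", "m8[ns]", "M8[ns]", "O"]
labels = {name: base_type(np.dtype(name)) for name in names}
```

The point of the design is that int16 and int32 land on the same label, so the caller never has to enumerate concrete widths.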

@rgbkrk
Contributor Author

rgbkrk commented Dec 13, 2016

Does the data field have to be double encoded? We can handle raw JSON across the jupyter messaging spec.
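
To make the double-encoding question concrete, a small sketch with the standard `json` module: embedding an already-serialized string (as in the example message earlier in the thread) forces the client to decode twice, while embedding the list itself does not. Variable names are illustrative.

```python
import json

rows = [{"a": 3, "b": 3}, {"a": 0, "b": 2}]
schema = {"fields": [{"name": "a", "type": "integer"},
                     {"name": "b", "type": "integer"}]}

# Double encoded: "data" is a JSON string nested inside the JSON message.
double = json.dumps({"schema": schema, "data": json.dumps(rows)})

# Raw: "data" is a plain list, serialized once with the rest of the message.
raw = json.dumps({"schema": schema, "data": rows})

decoded_double = json.loads(double)["data"]  # still a str; needs a second loads
decoded_raw = json.loads(raw)["data"]        # already a list of dicts
```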

Name: I've called it _repr_json_ for now, thoughts on what that should be? IIUC this won't be special like _repr_html_ is and called automatically. We'll have to publish this ourselves, and we can choose the name?

_repr_json_ will tell the frontend to render application/json which in nteract and in the soon to be released notebook provides a tree view of a JSON structure:

[Image: screenshot, 2016-12-12 5:31 PM]

I'd like to see this table get published with a custom mimetype. To demonstrate, I took the liberty of taking parts of your function, a fake mimetype (not sure what the official is), and creating a little React component (style would get better after):

[Image: screenshot, 2016-12-12 5:29 PM]

The mimetype I used is application/vnd.tableschema.v1+json and I published it via IPython.display rather than a repr function since we don't have a precedent for this table type yet.

/cc @minrk @takluyver

@pwalsh

pwalsh commented Dec 13, 2016

Hi @rgbkrk

Addressing some points above and raised in our Gitter channel

(I'm one of the authors of JSON Table Schema and related specs)

  1. Mime types: See my notes in here. I'm working on this right now (meaning, making the submission for the new mime types today). We'll be submitting application/tableschema+json
  2. jsontableschema-js is npm installable, has feature parity with jsontableschema-py
  3. Just FYI, I'm currently on a sprint to close a range of issues and publish v1 of all our specs before end of year, and IETF RFC submissions follow immediately. There are other aspects there that are relevant here (e.g.: "Tabular Data Resource" specification), but I can go over them with you (if you like) after we release v1

@takluyver
Contributor

I've called it _repr_json_ for now, thoughts on what that should be? IIUC this won't be special like _repr_html_ is and called automatically.

We do actually look for _repr_json_:

https://github.com/ipython/ipython/blob/5.1.0/IPython/core/formatters.py#L782

@minrk
Contributor

minrk commented Dec 14, 2016

We currently only support single method name:mime-type mapping. This doesn't extend to custom mime-types, though the protocol allows it. I've been planning to add a _repr_mime_, where the method returns the mime-keyed dict(s), but haven't gotten to it. I thought I opened an issue for it years ago, but maybe only in my brain. I just opened ipython/ipython#10090 for this.
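
A sketch of the idea described here: one method returning a dict keyed by mime type, so that custom types can travel alongside standard ones. The method below is named `_repr_mimebundle_`, on the assumption that this is the name later IPython releases adopted for the hook; the `_repr_mime_` name in this comment was a proposal. `SchemaTable` is hypothetical, and the mime type is the placeholder used earlier in the thread.

```python
class SchemaTable:
    """Hypothetical object that publishes several representations at once."""

    MIME = "application/vnd.tableschema.v1+json"  # placeholder from this thread

    def __init__(self, schema, data):
        self.schema = schema
        self.data = data

    def _repr_mimebundle_(self, include=None, exclude=None):
        # Return a dict keyed by mime type; the frontend picks what it can render.
        return {
            self.MIME: {"schema": self.schema, "data": self.data},
            "text/plain": f"<SchemaTable with {len(self.data)} rows>",
        }


bundle = SchemaTable(
    {"fields": [{"name": "a", "type": "integer"}]},
    [{"a": 3}, {"a": 0}],
)._repr_mimebundle_()
```

Including a `text/plain` entry gives frontends without a renderer for the custom type something to fall back on.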

@rgbkrk
Contributor Author

rgbkrk commented Dec 14, 2016

I did open a similarly worded issue in ipython/ipython#10058. 😉 Either way, I would love to have the ability to return mime bundles for a repr.

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Dec 17, 2016
Lays the groundwork for pandas-dev#14386
This handles the schema part of the request there. We'll still need to
do the work to publish the data to the frontend, but that can be done
as a followup.
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jan 14, 2017
Lays the groundwork for pandas-dev#14386
This handles the schema part of the request there. We'll still need to
do the work to publish the data to the frontend, but that can be done
as a followup.

DOC: More notes in prose docs

Move files

use isoformat

updates

Moved to to_json
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jan 16, 2017
Lays the groundwork for pandas-dev#14386
This handles the schema part of the request there. We'll still need to
do the work to publish the data to the frontend, but that can be done
as a followup.

DOC: More notes in prose docs

Move files

use isoformat

updates

Moved to to_json
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Feb 2, 2017
Lays the groundwork for pandas-dev#14386
This handles the schema part of the request there. We'll still need to
do the work to publish the data to the frontend, but that can be done
as a followup.

DOC: More notes in prose docs

Move files

use isoformat

updates

Moved to to_json
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Feb 5, 2017
Lays the groundwork for pandas-dev#14386
This handles the schema part of the request there. We'll still need to
do the work to publish the data to the frontend, but that can be done
as a followup.

DOC: More notes in prose docs

Move files

use isoformat

updates

Moved to to_json

json_table

no config

refactor with classes

Added duration tests

more timedelta

Change default orient

Series test

fixup docs

JSON Table -> Table

doc

Change to table orient

added version

Handle Categorical

Many more tests
@jreback
Contributor

jreback commented Feb 6, 2017

@TomAugspurger I think this will close #9166 if you make build_table_schema accessible, e.g.

pandas.io.json.table.build_schema, certainly not publicly broadcast, but accessible

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Feb 7, 2017
Lays the groundwork for pandas-dev#14386
This handles the schema part of the request there. We'll still need to
do the work to publish the data to the frontend, but that can be done
as a followup.

DOC: More notes in prose docs

Move files

use isoformat

updates

Moved to to_json

json_table

no config

refactor with classes

Added duration tests

more timedelta

Change default orient

Series test

fixup docs

JSON Table -> Table

doc

Change to table orient

added version

Handle Categorical

Many more tests
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Feb 16, 2017
Lays the groundwork for pandas-dev#14386
This handles the schema part of the request there. We'll still need to
do the work to publish the data to the frontend, but that can be done
as a followup.

DOC: More notes in prose docs

Move files

use isoformat

updates

Moved to to_json

json_table

no config

refactor with classes

Added duration tests

more timedelta

Change default orient

Series test

fixup docs

JSON Table -> Table

doc

Change to table orient

added version

Handle Categorical

Many more tests
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Feb 18, 2017
Lays the groundwork for pandas-dev#14386
This handles the schema part of the request there. We'll still need to
do the work to publish the data to the frontend, but that can be done
as a followup.

DOC: More notes in prose docs

Move files

use isoformat

updates

Moved to to_json

json_table

no config

refactor with classes

Added duration tests

more timedelta

Change default orient

Series test

fixup docs

JSON Table -> Table

doc

Change to table orient

added version

Handle Categorical

Many more tests
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Feb 23, 2017
Lays the groundwork for pandas-dev#14386
This handles the schema part of the request there. We'll still need to
do the work to publish the data to the frontend, but that can be done
as a followup.

DOC: More notes in prose docs

Move files

use isoformat

updates

Moved to to_json

json_table

no config

refactor with classes

Added duration tests

more timedelta

Change default orient

Series test

fixup docs

JSON Table -> Table

doc

Change to table orient

added version

Handle Categorical

Many more tests
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Mar 1, 2017
Lays the groundwork for pandas-dev#14386
This handles the schema part of the request there. We'll still need to
do the work to publish the data to the frontend, but that can be done
as a followup.

DOC: More notes in prose docs

Move files

use isoformat

updates

Moved to to_json

json_table

no config

refactor with classes

Added duration tests

more timedelta

Change default orient

Series test

fixup docs

JSON Table -> Table

doc

Change to table orient

added version

Handle Categorical

Many more tests
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Mar 3, 2017
Lays the groundwork for pandas-dev#14386
This handles the schema part of the request there. We'll still need to
do the work to publish the data to the frontend, but that can be done
as a followup.

DOC: More notes in prose docs

Move files

use isoformat

updates

Moved to to_json

json_table

no config

refactor with classes

Added duration tests

more timedelta

Change default orient

Series test

fixup docs

JSON Table -> Table

doc

Change to table orient

added version

Handle Categorical

Many more tests
jorisvandenbossche pushed a commit that referenced this issue Mar 4, 2017
Lays the groundwork for #14386
This handles the schema part of the request there. We'll still need to
do the work to publish the data to the frontend, but that can be done
as a followup.

Added publish to dataframe repr
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
Lays the groundwork for pandas-dev#14386
This handles the schema part of the request there. We'll still need to
do the work to publish the data to the frontend, but that can be done
as a followup.

Added publish to dataframe repr
@TomAugspurger
Contributor

Closed by #14904
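
For readers arriving from search, a short sketch of how the merged feature can be used, assuming pandas 0.20 or later: `DataFrame.to_json(orient="table")` emits the schema and the data together, and `pandas.io.json.build_table_schema` returns the schema alone. The frame contents are made up.

```python
import json

import pandas as pd
from pandas.io.json import build_table_schema

df = pd.DataFrame({"a": [3, 0, 1], "x": [0.5, 1.5, 2.5]})

# One JSON document with two top-level keys: "schema" and "data".
message = json.loads(df.to_json(orient="table"))

# Schema only, as a plain dict; the index is included as a field by default.
schema = build_table_schema(df)
field_types = {f["name"]: f["type"] for f in schema["fields"]}
```

Note that here `data` is a list of records inside the same JSON document, not a separately encoded string.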

@jorisvandenbossche jorisvandenbossche added this to the 0.20.0 milestone Apr 16, 2017
@jorisvandenbossche jorisvandenbossche added Enhancement and removed Needs Discussion Requires discussion from core team before further action API Design Compat pandas objects compatability with Numpy or Python functions labels Apr 16, 2017
7 participants