
Added numpy npy format #183

Open

KOLANICH wants to merge 2 commits into master
Conversation

KOLANICH
Contributor

No description provided.

@KOLANICH
Contributor Author

@GreyCat

xref:
  wikidata: Q197520
doc: |
  Serialization format used by numpy. An antipattern, because:
Contributor

I think it would be good to put a paragraph break between the first sentence and the list of problems with the format.

Perhaps the list of complaints should also be shortened a little, since they are also explained in detail elsewhere in the spec.

* uses dicts in python syntax:
  * in order to parse the array one has to have a parser for Python language syntax. This cripples interoperability: no one wants to create a parser for Python syntax just to parse a numpy array.
  * it is tempting to `exec` these dicts, which could have been a security issue.
  * they had to use `ast.parse` and manually walk the tree to mitigate that. Fortunately, this is now built into Python as `ast.literal_eval`.
Contributor

I would reorder/restructure these two points a little:

  • Talk about ast.literal_eval first, since it's the correct way to read the dict if you're already using Python
  • Warn against using exec afterwards
  • Remove the part about ast.parse completely, as it's no longer relevant in current Python versions (ast.literal_eval has existed since Python 2.6, which was released in 2008)

I would also mention somewhere that the header is specified to be a dict literal, so ast.literal_eval (or equivalent) is always sufficient - you never have to execute arbitrary code to read the header.
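The "ast.literal_eval is always sufficient" point can be sketched as follows. This is a hypothetical minimal reader, not numpy's implementation: the in-memory file below is hand-built, and the header alignment padding the real format requires is omitted for brevity.

```python
import ast
import io
import struct

def read_npy_header(f):
    # Magic and version, as described in the format discussion above.
    magic = f.read(6)
    assert magic == b"\x93NUMPY", "not an .npy file"
    major, minor = f.read(2)
    # Version 1.x stores the header length as a 2-byte little-endian integer.
    (header_len,) = struct.unpack("<H", f.read(2))
    header_text = f.read(header_len).decode("latin-1")
    # The header is specified to be a dict literal, so ast.literal_eval is
    # always sufficient; there is never a reason to exec() it.
    return ast.literal_eval(header_text)

# Hand-built synthetic version 1.0 file (real files pad the header; omitted).
payload = b"{'descr': '<i8', 'fortran_order': False, 'shape': (3,), }\n"
blob = b"\x93NUMPY\x01\x00" + struct.pack("<H", len(payload)) + payload
header = read_npy_header(io.BytesIO(blob))
```

Since the literal is evaluated rather than executed, a malicious header can at worst fail to parse, not run code.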

* in order to parse the array one has to have a parser for Python language syntax. This cripples interoperability: no one wants to create a parser for Python syntax just to parse a numpy array.
* it is tempting to `exec` these dicts, which could have been a security issue.
* they had to use `ast.parse` and manually walk the tree to mitigate that. Fortunately, this is now built into Python as `ast.literal_eval`.
The schema of the dicts is fixed and the values in them are integers, so a better solution would have been to store these numbers as standardized variable-length integers.
Contributor

The reason for this is explained in the numpy docs - it makes it easier to reverse-engineer the format from scratch in the future, because a structured text representation is easier for humans to understand than packed integer fields. It also makes it easier to add more fields to the header in the future.

Contributor Author

KOLANICH commented Jul 24, 2019

Yes, it makes reverse engineering easier. Though there is an extremely widespread and natural pattern in formats: provide the length first, then an array of values. Arrays of integers, their lengths, and their endiannesses are easy to distinguish because of leading zeros. If the numbers do not make sense as integers, the next thing to try is floats. So if they had followed this pattern, the format would have been almost as easy to reverse-engineer, and it would be far more convenient to implement in other languages.
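A purely hypothetical sketch of the length-prefixed layout described above, using fixed-width little-endian integers for simplicity rather than the variable-length ones suggested. This is not part of the real .npy format:

```python
import struct

def pack_shape(shape):
    # Hypothetical layout: ndim first, then each dimension as a
    # little-endian 64-bit unsigned integer.
    return struct.pack("<I", len(shape)) + struct.pack("<%dQ" % len(shape), *shape)

def unpack_shape(blob):
    (ndim,) = struct.unpack_from("<I", blob, 0)
    return struct.unpack_from("<%dQ" % ndim, blob, 4)

packed = pack_shape((3, 4, 5))
shape = unpack_shape(packed)
```

Such a header needs only `struct`-style primitives, which every language has, so no Python-syntax parser would be required.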

* it is tempting to `exec` these dicts, which could have been a security issue.
* they had to use `ast.parse` and manually walk the tree to mitigate that. Fortunately, this is now built into Python as `ast.literal_eval`.
The schema of the dicts is fixed and the values in them are integers, so a better solution would have been to store these numbers as standardized variable-length integers.
* it uses pickle, which enables remote code execution. The better solution is to say that serializing anything except arrays of literal types is out of scope of the format. If one needs pickle, he is already fucked up, so he should use pickle instead.
Contributor

dgelessus commented Jul 23, 2019

It's worth noting that pickle is only used when serializing arrays of Python objects. Arrays of primitive types are safe, as they are always stored in a flat binary representation. Recent versions of numpy also disable pickle support by default when reading (see CVE-2019-6446).

Also, maybe we should try not swearing in specs? Apparently we don't have a COC or any other official rules about this here, but I think swearing in specs is unnecessary and out of place.
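The "flat binary representation" point can be illustrated with a sketch that reads a primitive-dtype (`<i8`) array body using only the standard library, with no pickle involved anywhere. This is a hypothetical minimal reader operating on a hand-built in-memory file; the real format's header padding is omitted.

```python
import ast
import io
import struct

def load_npy_int64(f):
    # Header handling as in the format discussion above.
    assert f.read(6) == b"\x93NUMPY"
    f.read(2)  # version; 1.0 assumed in this sketch
    (hlen,) = struct.unpack("<H", f.read(2))
    header = ast.literal_eval(f.read(hlen).decode("latin-1"))
    assert header["descr"] == "<i8" and not header["fortran_order"]
    count = 1
    for dim in header["shape"]:
        count *= dim
    # The body of a primitive-dtype array is a flat binary dump,
    # so plain struct unpacking suffices -- pickle never enters the picture.
    return list(struct.unpack("<%dq" % count, f.read(8 * count)))

hdr = b"{'descr': '<i8', 'fortran_order': False, 'shape': (3,), }\n"
blob = (b"\x93NUMPY\x01\x00" + struct.pack("<H", len(hdr)) + hdr
        + struct.pack("<3q", 10, 20, 30))
values = load_npy_int64(io.BytesIO(blob))
```

Only object-dtype arrays fall back to pickle, which is exactly what `allow_pickle=False` refuses to load.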

Contributor Author

> It's worth noting that pickle is only used when serializing arrays of Python objects. Arrays of primitive types are safe, as they are always stored in a flat binary representation. Recent versions of numpy also disable pickle support by default when reading (see CVE-2019-6446).

IMHO it was a mistake to rely on pickle in the format. It can be disabled in an implementation, but that doesn't mean the format itself was designed right.

* they had to use `ast.parse` and manually walk the tree to mitigate that. Fortunately, this is now built into Python as `ast.literal_eval`.
The schema of the dicts is fixed and the values in them are integers, so a better solution would have been to store these numbers as standardized variable-length integers.
* it uses pickle, which enables remote code execution. The better solution is to say that serializing anything except arrays of literal types is out of scope of the format. If one needs pickle, he is already fucked up, so he should use pickle instead.
Never design formats like this!
Contributor

Again, we don't have any official rules about this, but I don't think this kind of opinion belongs in the spec. I think it's completely okay (and sometimes important) to point out limitations/problems/risks in a format and what they mean in practice when implementing it, but I don't think it's helpful to say "this is a bad format, never do something like this".

serialization/numpy_npy.ksy (outdated, resolved)
serialization/numpy_npy.ksy (outdated, resolved)
serialization/numpy_npy.ksy (outdated, resolved)
common/ieee754_float/float.ksy (outdated, resolved)
common/ieee754_float/float.ksy (outdated, resolved)
serialization/numpy_npy.ksy (resolved)
serialization/numpy_npy.ksy (resolved)
Contributor

dgelessus commented Jul 30, 2019

Strange, GitHub didn't notify me about the force-pushes even though I'm subscribed to this PR - I only got a notification for the comment where you pinged me.

With complex dtypes in mind, it makes more sense why the header is stored as a Python literal instead of a binary format. Complex dtypes can be nested arbitrarily deeply, and such a structure is less straightforward to serialize in a packed binary form. (It's possible of course, but it would make reverse-engineering more complex again, because it's no longer just length-prefixed strings.)

And while today it would make sense to use JSON for the text representation of the data, the npy format was designed in 2007. The current Python version at that point was Python 2.5, which had no built-in support for JSON or similar formats (the standard json module was added in Python 2.6, released in 2008). The only real alternative supported by Python at that point would have been XML, and you can be glad that they didn't choose that :P
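The arbitrarily nested dtypes mentioned above can be illustrated with a hand-written structured-dtype header (an illustrative example, not actual numpy output): however deeply `descr` nests, the whole header remains a single literal that one `ast.literal_eval` call handles.

```python
import ast

# Hypothetical header for a record array whose 'pos' field is itself
# a nested record of two floats.
header_text = (
    "{'descr': [('pos', [('x', '<f8'), ('y', '<f8')]), ('id', '<i4')], "
    "'fortran_order': False, 'shape': (10,), }"
)
header = ast.literal_eval(header_text)
inner = header["descr"][0][1]  # the nested ('x', 'y') sub-record
```

Serializing such a recursive structure as packed binary would need its own nesting scheme, which is presumably why a structured text form was chosen.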

Contributor Author

KOLANICH commented Jul 31, 2019

Good point, though that could be solved by a dependency on a third-party JSON library, probably even a native one. For example, I usually use ujson when the data only occasionally needs to be shown to humans (ujson has no pretty-printing with tabs).
