-
Notifications
You must be signed in to change notification settings - Fork 208
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added numpy npy format #183
base: master
Are you sure you want to change the base?
Conversation
serialization/numpy_npy.ksy
Outdated
xref: | ||
wikidata: Q197520 | ||
doc: | | ||
Serialization format used by numpy. An antipattern, because: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to put a paragraph break between the first sentence and the list of problems with the format.
Perhaps the list of complaints should also be shortened a little, since they are also explained in detail elsewhere in the spec.
serialization/numpy_npy.ksy
Outdated
* uses dicts in python syntax: | ||
* in order to parse the array one has to have a parser for python language syntax. This cripples interoperability: noone wants to create a parser for python syntax to just parse a numpy array. | ||
* it is tempting to `exec` these dicts, which could have been a security issue. | ||
* they had to use `ast.parse` and manual walking of the tree in order to mitigate that. Fortunately now it is built into python as `ast.literal_eval`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would reorder/restructure these two points a little:
- Talk about
ast.literal_eval
first, since it's the correct way to read the dict if you're already using Python - Warn against using
exec
afterwards - Remove the part about
ast.parse
completely, as it's no longer relevant in current Python versions (ast.literal_eval
has existed since Python 2.6, which was released in 2008)
I would also mention somewhere that the header is specified to be a dict literal, so ast.literal_eval
(or equivalent) is always sufficient - you never have to execute arbitrary code to read the header.
serialization/numpy_npy.ksy
Outdated
* in order to parse the array one has to have a parser for python language syntax. This cripples interoperability: noone wants to create a parser for python syntax to just parse a numpy array. | ||
* it is tempting to `exec` these dicts, which could have been a security issue. | ||
* they had to use `ast.parse` and manual walking of the tree in order to mitigate that. Fortunately now it is built into python as `ast.literal_eval`. | ||
The schema of the dicts is fixed, the values in the dicts are integers, the better solution would be to put these numbers as standardized variable-length integers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for this is explained in the numpy docs - it makes it easier to reverse-engineer the format from scratch in the future, because a structured text representation is easier for humans to understand than packed integer fields. It also makes it easier to add more fields to the header in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it makes reverse engineering easier. Though there is an extremily widespread and natural pattern in formats: providing length first and then an array of values. Arrays of integers and their lengths and endiannesses are easy to distinguish because of leading zeros. If the numbers do not make sense as integers, the next try are floats. So if they followed this pattern, it would have been almost as easy to RE that format and it would be far more convenient to implement it in other languages.
serialization/numpy_npy.ksy
Outdated
* it is tempting to `exec` these dicts, which could have been a security issue. | ||
* they had to use `ast.parse` and manual walking of the tree in order to mitigate that. Fortunately now it is built into python as `ast.literal_eval`. | ||
The schema of the dicts is fixed, the values in the dicts are integers, the better solution would be to put these numbers as standardized variable-length integers. | ||
* it uses pickle, which is a remote code execution. The better solution is to say that serializing anything except arrays of literal types is out of scope of the format. If one needs pickle, he is already fucked up, so he should use pickle instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's worth noting that pickle is only used when serializing arrays of Python objects. Arrays of primitive types are safe, as they are always stored in a flat binary representation. Recent versions of numpy also disable pickle support by default when reading (see CVE-2019-6446).
Also, maybe we should try not swearing in specs? Apparently we don't have a COC or any other official rules about this here, but I think swearing in specs is unnecessary and out of place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's worth noting that pickle is only used when serializing arrays of Python objects. Arrays of primitive types are safe, as they are always stored in a flat binary representation. Recent versions of numpy also disable pickle support by default when reading (see CVE-2019-6446).
IMHO it was a mistake to rely on pickle in the format. It can be disabled in an impl, but it doesn't mean that the format itself was designed right.
serialization/numpy_npy.ksy
Outdated
* they had to use `ast.parse` and manual walking of the tree in order to mitigate that. Fortunately now it is built into python as `ast.literal_eval`. | ||
The schema of the dicts is fixed, the values in the dicts are integers, the better solution would be to put these numbers as standardized variable-length integers. | ||
* it uses pickle, which is a remote code execution. The better solution is to say that serializing anything except arrays of literal types is out of scope of the format. If one needs pickle, he is already fucked up, so he should use pickle instead. | ||
Never design formats like this! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, we don't have any official rules about this, but I don't think this kind of opinion belongs into the spec. I think it's completely okay (and sometimes important) to point out limitations/problems/risks in a format and what they mean in practice when implementing it, but I don't think it's helpful to say "this is a bad format, never do something like this".
Strange, GitHub didn't notify me about the force-pushes even though I'm subscribed to this PR - I only got a notification for the comment where you pinged me. With complex dtypes in mind, it makes more sense why the header is stored as a Python literal instead of a binary format. Complex dtypes can be nested arbitrarily deeply, and such a structure is less straightforward to serialize in a packed binary form. (It's possible of course, but it would make reverse-engineering more complex again, because it's no longer just length-prefixed strings.) And while today it would make sense to use JSON for the text representation of the data, the npy format was designed in 2007. The current Python version at that point was Python 2.5, which had no built-in support for JSON or similar formats (the standard |
Good point, though that can be solved by a dependency on a third-party JSON library, probably even a native one (for example I usually use |
No description provided.