
Added numpy npy format #183

Open

KOLANICH wants to merge 2 commits into master
Conversation

KOLANICH
Contributor

No description provided.

@KOLANICH
Contributor Author

@GreyCat

xref:
  wikidata: Q197520
doc: |
  Serialization format used by numpy. An antipattern, because:
Contributor

I think it would be good to put a paragraph break between the first sentence and the list of problems with the format.

Perhaps the list of complaints should also be shortened a little, since they are also explained in detail elsewhere in the spec.

* uses dicts in python syntax:
  * in order to parse the array one has to have a parser for Python language syntax. This cripples interoperability: no one wants to create a parser for Python syntax just to parse a numpy array.
  * it is tempting to `exec` these dicts, which could have been a security issue.
  * they had to use `ast.parse` and manually walk the tree to mitigate that. Fortunately, this is now built into Python as `ast.literal_eval`.
Contributor

I would reorder/restructure these two points a little:

  • Talk about ast.literal_eval first, since it's the correct way to read the dict if you're already using Python
  • Warn against using exec afterwards
  • Remove the part about ast.parse completely, as it's no longer relevant in current Python versions (ast.literal_eval has existed since Python 2.6, which was released in 2008)

I would also mention somewhere that the header is specified to be a dict literal, so ast.literal_eval (or equivalent) is always sufficient - you never have to execute arbitrary code to read the header.
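The "ast.literal_eval is always sufficient" point can be sketched as follows. This is a hypothetical minimal reader, not numpy's implementation: the in-memory file below is hand-built, and the header alignment padding the real format requires is omitted for brevity.

```python
import ast
import io
import struct

def read_npy_header(f):
    # Magic and version, as described in the format discussion above.
    magic = f.read(6)
    assert magic == b"\x93NUMPY", "not an .npy file"
    major, minor = f.read(2)
    # Version 1.x stores the header length as a 2-byte little-endian integer.
    (header_len,) = struct.unpack("<H", f.read(2))
    header_text = f.read(header_len).decode("latin-1")
    # The header is specified to be a dict literal, so ast.literal_eval is
    # always sufficient; there is never a reason to exec() it.
    return ast.literal_eval(header_text)

# Hand-built synthetic version 1.0 file (real files pad the header; omitted).
payload = b"{'descr': '<i8', 'fortran_order': False, 'shape': (3,), }\n"
blob = b"\x93NUMPY\x01\x00" + struct.pack("<H", len(payload)) + payload
header = read_npy_header(io.BytesIO(blob))
```

Since the literal is evaluated rather than executed, a malicious header can at worst fail to parse, not run code.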

* in order to parse the array one has to have a parser for Python language syntax. This cripples interoperability: no one wants to create a parser for Python syntax just to parse a numpy array.
* it is tempting to `exec` these dicts, which could have been a security issue.
* they had to use `ast.parse` and manually walk the tree to mitigate that. Fortunately, this is now built into Python as `ast.literal_eval`.
The schema of the dicts is fixed and the values in them are integers, so a better solution would have been to store these numbers as standardized variable-length integers.
Contributor

The reason for this is explained in the numpy docs - it makes it easier to reverse-engineer the format from scratch in the future, because a structured text representation is easier for humans to understand than packed integer fields. It also makes it easier to add more fields to the header in the future.

Contributor Author

KOLANICH commented Jul 24, 2019

Yes, it makes reverse engineering easier. Though there is an extremely widespread and natural pattern in formats: provide the length first, then an array of values. Arrays of integers, their lengths, and their endiannesses are easy to distinguish because of leading zeros. If the numbers do not make sense as integers, the next thing to try is floats. So if they had followed this pattern, the format would have been almost as easy to reverse-engineer, and it would be far more convenient to implement in other languages.
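A purely hypothetical sketch of the length-prefixed layout described above, using fixed-width little-endian integers for simplicity rather than the variable-length ones suggested. This is not part of the real .npy format:

```python
import struct

def pack_shape(shape):
    # Hypothetical layout: ndim first, then each dimension as a
    # little-endian 64-bit unsigned integer.
    return struct.pack("<I", len(shape)) + struct.pack("<%dQ" % len(shape), *shape)

def unpack_shape(blob):
    (ndim,) = struct.unpack_from("<I", blob, 0)
    return struct.unpack_from("<%dQ" % ndim, blob, 4)

packed = pack_shape((3, 4, 5))
shape = unpack_shape(packed)
```

Such a header needs only `struct`-style primitives, which every language has, so no Python-syntax parser would be required.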

* it is tempting to `exec` these dicts, which could have been a security issue.
* they had to use `ast.parse` and manually walk the tree to mitigate that. Fortunately, this is now built into Python as `ast.literal_eval`.
The schema of the dicts is fixed and the values in them are integers, so a better solution would have been to store these numbers as standardized variable-length integers.
* it uses pickle, which enables remote code execution. The better solution is to say that serializing anything except arrays of literal types is out of scope of the format. If one needs pickle, he is already fucked up, so he should use pickle instead.
Contributor

dgelessus commented Jul 23, 2019

It's worth noting that pickle is only used when serializing arrays of Python objects. Arrays of primitive types are safe, as they are always stored in a flat binary representation. Recent versions of numpy also disable pickle support by default when reading (see CVE-2019-6446).

Also, maybe we should try not swearing in specs? Apparently we don't have a COC or any other official rules about this here, but I think swearing in specs is unnecessary and out of place.
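The "flat binary representation" point can be illustrated with a sketch that reads a primitive-dtype (`<i8`) array body using only the standard library, with no pickle involved anywhere. This is a hypothetical minimal reader operating on a hand-built in-memory file; the real format's header padding is omitted.

```python
import ast
import io
import struct

def load_npy_int64(f):
    # Header handling as in the format discussion above.
    assert f.read(6) == b"\x93NUMPY"
    f.read(2)  # version; 1.0 assumed in this sketch
    (hlen,) = struct.unpack("<H", f.read(2))
    header = ast.literal_eval(f.read(hlen).decode("latin-1"))
    assert header["descr"] == "<i8" and not header["fortran_order"]
    count = 1
    for dim in header["shape"]:
        count *= dim
    # The body of a primitive-dtype array is a flat binary dump,
    # so plain struct unpacking suffices -- pickle never enters the picture.
    return list(struct.unpack("<%dq" % count, f.read(8 * count)))

hdr = b"{'descr': '<i8', 'fortran_order': False, 'shape': (3,), }\n"
blob = (b"\x93NUMPY\x01\x00" + struct.pack("<H", len(hdr)) + hdr
        + struct.pack("<3q", 10, 20, 30))
values = load_npy_int64(io.BytesIO(blob))
```

Only object-dtype arrays fall back to pickle, which is exactly what `allow_pickle=False` refuses to load.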

Contributor Author

> It's worth noting that pickle is only used when serializing arrays of Python objects. Arrays of primitive types are safe, as they are always stored in a flat binary representation. Recent versions of numpy also disable pickle support by default when reading (see CVE-2019-6446).

IMHO it was a mistake to rely on pickle in the format. It can be disabled in an implementation, but that doesn't mean the format itself was designed right.

* they had to use `ast.parse` and manually walk the tree to mitigate that. Fortunately, this is now built into Python as `ast.literal_eval`.
The schema of the dicts is fixed and the values in them are integers, so a better solution would have been to store these numbers as standardized variable-length integers.
* it uses pickle, which enables remote code execution. The better solution is to say that serializing anything except arrays of literal types is out of scope of the format. If one needs pickle, he is already fucked up, so he should use pickle instead.
Never design formats like this!
Contributor

Again, we don't have any official rules about this, but I don't think this kind of opinion belongs in the spec. I think it's completely okay (and sometimes important) to point out limitations/problems/risks in a format and what they mean in practice when implementing it, but I don't think it's helpful to say "this is a bad format, never do something like this".

serialization/numpy_npy.ksy (outdated, resolved)
serialization/numpy_npy.ksy (outdated, resolved)
serialization/numpy_npy.ksy (outdated, resolved)
common/ieee754_float/float.ksy (outdated, resolved)
common/ieee754_float/float.ksy (outdated, resolved)
serialization/numpy_npy.ksy (resolved)
serialization/numpy_npy.ksy (resolved)
Contributor

dgelessus commented Jul 30, 2019

Strange, GitHub didn't notify me about the force-pushes even though I'm subscribed to this PR - I only got a notification for the comment where you pinged me.

With complex dtypes in mind, it makes more sense why the header is stored as a Python literal instead of a binary format. Complex dtypes can be nested arbitrarily deeply, and such a structure is less straightforward to serialize in a packed binary form. (It's possible of course, but it would make reverse-engineering more complex again, because it's no longer just length-prefixed strings.)

And while today it would make sense to use JSON for the text representation of the data, the npy format was designed in 2007. The current Python version at that point was Python 2.5, which had no built-in support for JSON or similar formats (the standard json module was added in Python 2.6, released in 2008). The only real alternative supported by Python at that point would have been XML, and you can be glad that they didn't choose that :P
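The arbitrarily nested dtypes mentioned above can be illustrated with a hand-written structured-dtype header (an illustrative example, not actual numpy output): however deeply `descr` nests, the whole header remains a single literal that one `ast.literal_eval` call handles.

```python
import ast

# Hypothetical header for a record array whose 'pos' field is itself
# a nested record of two floats.
header_text = (
    "{'descr': [('pos', [('x', '<f8'), ('y', '<f8')]), ('id', '<i4')], "
    "'fortran_order': False, 'shape': (10,), }"
)
header = ast.literal_eval(header_text)
inner = header["descr"][0][1]  # the nested ('x', 'y') sub-record
```

Serializing such a recursive structure as packed binary would need its own nesting scheme, which is presumably why a structured text form was chosen.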

Contributor Author

KOLANICH commented Jul 31, 2019

Good point, though that could be solved by a dependency on a third-party JSON library, probably even a native one. For example, I usually use ujson when the data only occasionally needs to be shown to humans (ujson has no pretty-printing with tabs).
