Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON (de)serialization #49

Merged
merged 42 commits into from
Aug 6, 2024
Merged

Conversation

librolibro
Copy link
Contributor

@librolibro librolibro commented Jun 30, 2024

Closes #32

This one works without extra boilerplate code (some tests included) and I kinda like the solution, but still ...

To get things done I reimplemented a part of source code (see innerParse function in std/json/static.zig, the part with the .Struct case body). Since few internal non-public was needed i just copied it (eww).

Zig is on its early stages and standard library often changes - and that's why I partly don't like this PR because there might be lots or refactors (for example, standard token parser might change)

Feel free to judge it)

@Arwalk
Copy link
Owner

Arwalk commented Jul 2, 2024

Thanks again for everything. I'll do my best to check it out ASAP.

@Arwalk
Copy link
Owner

Arwalk commented Jul 12, 2024

Just saying that i haven't forgotten this, just trying to find some time to do it.

@Arwalk
Copy link
Owner

Arwalk commented Jul 22, 2024

Started reviewing it. You're still not forgotten :)

Copy link
Owner

@Arwalk Arwalk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very good start for something great. There are a few comments already placed in the review here, but i'll add this:

  • How is OneOf case handled? Could you add a test (and code to make it work) ?
  • I'm not a huge fan of using std.json "leaky" code without it being explicit to the end user. I think we need to use non-leaky methods and return a Parsed(T), and at worst propose leaky json_decode on a separate interface (that would require ArenaAllocator instead of just Allocator).

Overall this is very good and my requests do not change the fact that i'm very thankful for your time and investment in this.

src/protobuf.zig Show resolved Hide resolved
src/protobuf.zig Show resolved Hide resolved
src/protobuf.zig Outdated Show resolved Hide resolved
src/protobuf.zig Show resolved Hide resolved
src/protobuf.zig Show resolved Hide resolved
@librolibro
Copy link
Contributor Author

Made some commits.

All unions are handled by std.json internally, see here and here - it just worked out of the box, made some tests to prove it.

About using "leaky" versions of std.json (de)serialization functions - the only reason I chose them is because I didn't get it to work with non-leaky alternatives) I just remember than some tests blamed about memory leaks (std.testing.allocator is able to catch leaks) and some code didn't even compile, don't remember why tbh. I can try to explore it again with a fresh look.

And thanks for feedback, even the late one)

@Arwalk
Copy link
Owner

Arwalk commented Jul 22, 2024

This is great. I'll have some fun with it and might even add some suggestions before approving it, but overall i'm positive this will be merged soon. Thanks again for your work.

@librolibro
Copy link
Contributor Author

LUL i just switched to a non-leaky function for JSON decoding and everything works :)

There are no memory leaks or compile errors that I've encountered before.

@Arwalk Arwalk self-assigned this Jul 24, 2024
@Arwalk
Copy link
Owner

Arwalk commented Jul 24, 2024

@librolibro I added a few commits directly, i'll let you look at them and tell me if they're ok for you. According to the json spec for protobuf bytes types are supposed to be base64 encoded.

We did not have a specific information for bytes, we had a shortcut to handle it just like strings. These last commits do not change the zig-facing interface of using ManagedStrings for bytes fields, but add the information needed in the _desc_table so that json encoding/decoding know when to do the the base64 soup.

I'll look tomorrow into the Map type, because i'm pretty sure we are not correct with the spec too (we handle it as a list of entries).

@librolibro
Copy link
Contributor Author

librolibro commented Jul 25, 2024

Everything looks clean at the first glance, good job at that! (@embedfile usage was nice, didn't know about it :))

Another thing that comes to my mind is the ability to not add fields into JSON string if their value is equal to default - e.g. Python's implementation of google.protobuf.json_format.MessageToJson comes with including_default_value_fields argument which if False by default. While specs say that

A proto3 JSON implementation MAY provide the following options

and it's not a strict requirement it may be worth implementing this and since the message field iteration was already made in this code and not in std.json depths it shouldn't be that hard to do.

P.S. Here's even more things that might be worth implementing

librolibro and others added 16 commits July 25, 2024 08:38
Since this project uses ManagedString structs (instead of []const u8),
  I also need to add jsonParse/jsonStringify method for it
Previously it was made by expecting the type name to be
"ArrayListAligned" which was not good since user-defined
message could have the same name as built-in ArrayList() struct.
The problem is that std.meta.eql can't properly compare all types
(e.g. if struct contains pointer it just compares pointer
value itself - that's how it compares ArrayLists).

With this "ultimate comparer" function tests looks more clean and less -
although the function itself is not that simple and might some tests :)
Maybe it'll be useful somewhere outside tests_json.zig (e.g. in other tests)
@Arwalk Arwalk force-pushed the json_encoding_and_decoding branch from bf06693 to ad352b9 Compare July 25, 2024 06:39
@Arwalk
Copy link
Owner

Arwalk commented Jul 25, 2024

I took the liberty to rebase after #51 was merged, there was a small conflict.

@Arwalk Arwalk added this to the v1.1.0 milestone Jul 25, 2024
@Arwalk
Copy link
Owner

Arwalk commented Jul 25, 2024

We need to set ourselves some clear goals for this PR.

Required TODO

  • In json parsing/writing, Map fields should be written/read as json objects. Making it an actual map on the zig side will be another subject (Make Map fields a HashMap instead of a list of entries #52)
  • Use proto field name instead of lowerCamelCase name: according to the spec, we should convert field names to lowerCamelCase when writing, and accept lowerCamelCase and actual name when parsing.

@librolibro if you start any of these subjects, tell me so we avoid doing it twice.

Worth analyzing in separate issues (that will be created after the merge of this PR)

  • Check support for mapping of types defined by google (Timestamp, Duration...)
  • Ensure special values for floats are properly handled (NaN, Infinity..)
  • Pass options to the parser
    • to ignore unknown fields
    • to always emits fields without presence
  • Pass options to the writer
    • to support writing actual name and not the camelCase name
    • to support emitting enum values instead of names

@james-callahan
Copy link
Contributor

I'm gonna check if there are ways to do that, but i'm not sure i have access to the slices in the json encoding api. In the end, it's not dramatic if this stays as is, if you're already working with json encoding/decoding you're probably not too worried about a pair of temporary allocations.

Edit: nevermind, found a way in 1c8ab07 . That require rewriting some of the code inside the json writer, we might have to make a request to make some methods public in the std.

In theory you don't need to: a base64 encoder should be able to work on (multiples of) 3-byte pieces.
So e.g. do:

  const str = value.getSlice();
  try jws.valueStart();
  try jws.stream.writeByte('\"');
  var temp_out: [5]u8;
  var i=0;
  while (i < str.len) : (i+=3) {
    const s = base64.standard.Encoder.encode(&temp_out, str[i..i+3]);
    try jws.stream.writeAll(s);  
  }
  try jws.stream.writeByte('\"');
  jws.valueDone();

@Arwalk
Copy link
Owner

Arwalk commented Jul 31, 2024

I'm gonna check if there are ways to do that, but i'm not sure i have access to the slices in the json encoding api. In the end, it's not dramatic if this stays as is, if you're already working with json encoding/decoding you're probably not too worried about a pair of temporary allocations.
Edit: nevermind, found a way in 1c8ab07 . That require rewriting some of the code inside the json writer, we might have to make a request to make some methods public in the std.

In theory you don't need to: a base64 encoder should be able to work on (multiples of) 3-byte pieces. So e.g. do:

  const str = value.getSlice();
  try jws.valueStart();
  try jws.stream.writeByte('\"');
  var temp_out: [5]u8;
  var i=0;
  while (i < str.len) : (i+=3) {
    const s = base64.standard.Encoder.encode(&temp_out, str[i..i+3]);
    try jws.stream.writeAll(s);  
  }
  try jws.stream.writeByte('\"');
  jws.valueDone();

Could have worked indeed. I managed to make it in one pass, that should be good enough.

@librolibro
Copy link
Contributor Author

librolibro commented Aug 1, 2024

repeated bytes, oneof=[NaN or +/-inf] tests fails as expected (NaN in arrays and oneof=.Bytes would probably fail as well), I'll make some changes to support these, for now there are failing tests to not cover this problem.

P.S. For every type we're handling by itself (except ManagedString and MessageMixins) we need to paste almost the same code in three places: single field handling, repeated field handling and also oneof handling.

P.P.S. Is there a way to disable workflow runs temporarily? I certainly know that tests will fail because I wrote them to fail :) Or maybe I shouldn't push until I won't make it good

@Arwalk
Copy link
Owner

Arwalk commented Aug 2, 2024

P.P.S. Is there a way to disable workflow runs temporarily? I certainly know that tests will fail because I wrote them to fail :) Or maybe I shouldn't push until I won't make it good

It's actually me that's starting them manually sometimes, just tell me when you feel they should run.

Looks like all types are being handled correctly whether
it's single/repeated/oneof field (need more tests for sure)
@librolibro
Copy link
Contributor Author

librolibro commented Aug 3, 2024

For now it looks like all (supported ATM) types are processed as expected. I think the current JSON code base is messy IMO - especially in case of function naming. I think it's better to refactor all JSON code according to your code style and not to Zig's one (snake_case for functions instead of camelCase etc.) to avoid more refactor for now. Maybe it's time to move JSON code to separate file because there's a lot of things copied from std.json + things implemented by you and me (400+ lines of code).

Also more tests ... For things like deserializing integer/float from JSON string (whether it's single, repeated of inside of oneof) etc.

P.S. Since it's your project it's for you to decide whether you need to refactor all of your code to completely follow Zig style guide or not. It's better when all parts of the code are homogeneous, how it's a bit messy)

@librolibro
Copy link
Contributor Author

Please run another workflow for this PR. I want to check if tests will pass now of Windows (there were strange fails, it was probably about different line endings, but I'm not 100% sure)

@Arwalk
Copy link
Owner

Arwalk commented Aug 4, 2024

Looks like it's not fixed yet. Don't bother too much, i'll take the time to look into it tuesday. Thank you for all your work already.

@librolibro
Copy link
Contributor Author

Now .gitattributes work as expected (for some reason leading ./ was the problem and I missed it). If Windows build failures are about line endings it might fix the problem now :)

$ git check-attr --all -- tests/json_data/fixed_sizes/camelCase_1.json
tests/json_data/fixed_sizes/camelCase_1.json: text: set
tests/json_data/fixed_sizes/camelCase_1.json: eol: lf

@librolibro
Copy link
Contributor Author

Last major thing wasn't made yet is Always emit fields without presence described here + tests for cases when not all fields are present in the input JSON string (I think there's MessageMixins.init call missed and these tests would fail).

For now I don't know how to vary parser's behavior (omit default fields or not, serialize using camelCase or snake_case etc.), there's no able to pass extra state into std.json inner functions and hence into our custom jsonParse/jsonStringify implementations. I have couple "possibly good" ideas how to make it, I would try these after default fields workaround, maybe in the weekend.

@Arwalk
Copy link
Owner

Arwalk commented Aug 6, 2024

This will be good enough for a good "basic" json encoding/decoding. I'll create issues for everything mentionned in #49 (comment) So we can do bit by bit everything right.

Thanks a lot for your involvment !

@Arwalk
Copy link
Owner

Arwalk commented Aug 6, 2024

@Arwalk Arwalk merged commit 9115a73 into Arwalk:master Aug 6, 2024
5 checks passed
@librolibro
Copy link
Contributor Author

I was on the way to push some commits :) Earlier I thought that Github PR's can be reopened with merge up-to-date master with fork's branch, and then do it again for next issue (does Github allows this?)... But separate issues/PRs might be a cleaner solution.

@Arwalk
Copy link
Owner

Arwalk commented Aug 6, 2024

Woops sorry! I'd really prefer if we split it up, it will be cleaner to handle.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Json serialization/parsing
3 participants