
Rewrite wasmparser's module parser and validator #40

Merged 9 commits on Jul 13, 2020

Conversation

alexcrichton
Member

@alexcrichton alexcrichton commented Jul 9, 2020

This commit is a major refactoring (bordering on rewrite) of
wasmparser's top-level wasm module parsing and validation. Lots of code
from the previous validator has been reused but it's moved around so
much that it looks more like a rewrite. At a high level this commit
removes the old Parser, ModuleReader, and ValidatingParser types.
These are replaced with a new Parser, Validator, and
FuncValidator. There are a number of points motivating this rewrite:

  • Parsing a module currently requires the entire module to be resident
    in-memory. There is no way to incrementally parse a module (for
    example as it arrives from the network).

  • Validating a module is an all-or-nothing operation. This, like parsing
    above, means it's not friendly for incrementally acquired wasm
    binaries.

  • Validation does not expose its results, nor does it provide a way
    for doing so. This means that you can validate a wasm blob in its
    entirety but you can't retroactively ask what the type of function 3
    was. More concretely, if you're implementing a code translator you
    have to track a lot of state the validator was already keeping for you.

  • Validation did not easily provide the ability to parse, validate, and
    possibly compile wasm functions in parallel. The single monolithic
    Validator would be difficult to interleave application-specific
    details into, such as parallelism.

These are all deep architectural issues in how the code is organized
today, so the approach taken in this commit is to rewrite these pieces
rather than bolt the new capabilities on as features. Much of this work was
motivated after recent refactorings for the module linking proposal. The
amount of bookkeeping needed to keep track of type aliases and such was a
big enough job for validation that I didn't want to have to redo it all
again in wasmtime later on!

The new Parser and Validator types are designed to be used both in
high-level and low-level contexts. Handling a WebAssembly module
efficiently can often involve a lot of moving pieces at runtime which
are very application-specific, and it seems like overkill at best, or
impossible at worst, to try to encapsulate all these patterns in wasmparser. Instead
the intention here is that the lowest level bits are able to be reused
regardless of how you're parsing wasm, and the higher level bits are as
straightforward to reimplement and use as possible. This ideally means
that if you can't use some high-level conveniences in wasmparser it
should be obvious how you can rewrite them locally to work in your own
application.

Detailed design of the new APIs added here is best learned by reading
the rustdoc documentation, the examples, or tests. At a high-level
though, the way these new types operate is:

  • Parser is fed chunks of data, and it will return one chunk of parsed
    data which is a view into the input buffer. If it can't parse a chunk
    then it will tell the application it needs to wait for more data to be
    available.

  • Most sections are parsed as-a-whole, meaning that they need to be
    entirely resident in memory before being parsed. For example this
    rewrite does not support incrementally parsing the type section. This
    is done for ease with the expectation that most sections are
reasonably small and have no reason to be incrementally
    processed beyond the section-at-a-time level.

  • Parser, however, will allow incremental downloads of the code and
    module code sections. This means that it supports parsing a singular
    function or a singular module at a time. This allows functions to be
    validated/processed immediately as they're received, without having to
    wait for the next function to be available.

  • The Validator type receives as input the payloads returned by
    Parser. The Validator type is intended to be persistently living
    adjacent to a Parser which it receives input from.

  • Validation is intended to eventually expose information about the
    module and operators as necessary. For example methods can be added to
    Validator to query what types are and the results of operators in
    functions. It's envisioned that you'd use a Parser to parse a module
    and then Validator would be used before you execute
    application-specific code per-section/item.

At this time operator/function validation is not changed. The operator
validator is only very lightly touched; a deeper refactoring there is
expected to come later. I would like to effectively add a
method-per-opcode to FuncValidator so engines can, for example,
validate a call instruction, get back the type of the call, and then
use that type to iterate over the types on the stack and return values.
None of this is supported yet, but I'm hoping to make, for example,
cranelift-wasm lean much more heavily on the wasmparser Validator.


In terms of next steps there are some other bits worth mentioning. My original intention was to start discussing this change through the https://github.com/bytecodealliance/rfcs process, but I basically got excited about implementing it and went ahead. This change has a large impact on likely all consumers of wasmparser, and I would very much like to get feedback from stakeholders before actually merging it. I do not expect the APIs to get merged literally as-is in the initial state of this PR; I'm hoping that feedback can help evolve them over time!

Other items I'd like to perhaps consider investigating and/or implementing would be:

  • Fuzzing the incremental parser. There are code paths to handle "I've got the whole module here" vs "I need to wait for more data", and while I tried to architect these in a way that reduces the likelihood of bugs, this would be a fantastic thing to fuzz. Basically, incrementally parsing a module should always give the exact same semantic results as having the whole module in memory from the beginning, and there should likely be a fuzzer which asserts this.

  • Optimizing the parser and validator. Some local large modules take nearly 400ms to parse and validate right now, which seems a bit large even for a 40MB wasm file. I think there's likely a lot of low-hanging fruit to be had to optimize validation, and this is something I think is worth looking into to ensure we're able to address performance bottlenecks with this API design.

  • Experience integrating this into Wasmtime. My intention is to make cranelift-wasm much simpler and lean much more heavily on Validator in wasmparser, but I think it's worth proving this out. The long-term goal is to make the implementation of module-linking much simpler in cranelift-wasm by ensuring that everyone's using the same few functions to handle type aliases and count modules and such.

And I'm sure at least one other thing will come up during review!

@alexcrichton
Member Author

One last thing I think is worth pointing out: my goal here is to start the ball rolling on this new API design. This singular PR/commit is not what I intend the final state to look like. For example, FuncValidator is only a shell of what I'd like it to be, but changing it requires more invasive changes to the operator validation that already exists, and I'd prefer to minimize the amount of wasmparser touched as part of this PR.

Basically this is to say that where it makes sense I've left out possible future improvements, but I intend to tackle them as subsequent PRs once this looks good to everyone.

@fitzgen
Member

fitzgen commented Jul 10, 2020

I haven't looked at the code yet, but just wanted to note that I think this is a good direction, specifically from the angle of simplifying cranelift-wasm. For example, the ref.is_null instruction no longer has a type immediate, so to provide that information to the FuncEnvironment if the env needs it for translation, cranelift-wasm would have to duplicate type checking. (We were almost bitten by this for reference types in wasmtime, but since we can check whether the cranelift type is i64 or r64, we avoided it.)

Regarding the eventual validator trait, here are some random thoughts about what we want from it:

  • The ability to add arbitrary associated data to the stack alongside the valtypes (e.g. cranelift-wasm would have cranelift_codegen::ir::Values that it is building up in translation, walrus would have its AST nodes, a pure validator would use ())

  • Some sort of hooks for control flow joins, to insert phis and all that. Maybe this is the same as hooking end instructions. Need to revisit that part of cranelift-wasm to remember.

  • We don't want to require that every implementor of the validator trait reimplement type validation themselves, even if they are overriding an opcode method to record extra data. So I view this less as a validator trait (since validation is always handled for you), and more as a callback trait.

Going to take a look at the API docs and then the code itself next.

@fitzgen
Member

fitzgen commented Jul 10, 2020

* Fuzzing the incremental parser.

Here's a sketch of an idea for fuzzing the incremental parser. Basically, do differential fuzzing between giving the parser data in increments and giving it the data all at once. This should exercise the incremental code paths and check the correctness of the results. Without any regard to the actual APIs, here is some pseudocode:

fuzz_target!(|pair: (usize, Vec<u8>)| {
    let (index, data) = pair;

    // Parse incrementally, by splitting data at index.
    let mut incr_parser = Parser::new();
    incr_parser.parse(&data[..index]);
    incr_parser.parse(&data[index..]);
    let incr_result = incr_parser.result();

    // Parse all of data in one go.
    let complete_result = Parser::new().parse(&data).result();

    assert_eq!(incr_result, complete_result);
});

@fitzgen
Member

fitzgen commented Jul 10, 2020

Alternatively, rather than split once at index, feed many chunks of a given size:

fuzz_target!(|pair: (usize, Vec<u8>)| {
    let (chunk_size, data) = pair;
    // Guard: `chunks(0)` would panic.
    let chunk_size = chunk_size.max(1);

    // Parse incrementally, by feeding chunks of size chunk_size.
    let mut incr_parser = Parser::new();
    for chunk in data.chunks(chunk_size) {
        incr_parser.parse(chunk);
    }
    let incr_result = incr_parser.result();

    // Parse all of data in one go.
    let complete_result = Parser::new().parse(&data).result();

    assert_eq!(incr_result, complete_result);
});

@fitzgen fitzgen left a comment

Okay, I looked at the API docs and everything makes sense / is as expected. The examples for using Parser were helpful! 👍 I skimmed the code, didn't read it in excruciating detail, and it looked good. As you mentioned, it was familiar, since there was a bit of code motion here as well.

Random note: this is an existing issue, but OperatorsIteratorWithOffsets is not exported from the crate, despite being returned by a public method. We should export it.

#[derive(Debug, Clone)]
pub struct Parser {
    state: State,
    offset: u64,
Member

It would be nice to have a method that exposes this value (or maybe have the offset in Chunk::Parsed or Payload?). We need it for processing DWARF.

I know it can be reconstructed from looking at Chunk::Parsed::consumed, but (a) that seems quite tedious, and (b) might not have the granularity required for the DWARF stuff (offset per instruction).

Member

Nevermind: just saw OperatorsReader::read_with_offset etc.

Member Author

Ah yeah, FWIW I also want to make sure that every single byte in a wasm file can be accounted for using this API (one way or another, at least). The wasmparser-dump crate (which previously lived at tests/dump.rs) ensures this too.

Comment on lines 403 to 404
// TODO: thread `offset: u64` through to `BinaryReader`, remove
// the cast here.
Member

👀

Member Author

This is actually something I forgot to mention too. Lots of stuff in wasmparser is geared towards usize in terms of offsets, but I think we'll probably want to switch to u64 now that offsets are not always relative to an in-memory stream but may instead be relative to a stream of bytes elsewhere.

crates/wasmparser/src/validator.rs Outdated Show resolved Hide resolved
Collaborator

@Robbepop Robbepop left a comment

Just reviewed a tiny fraction of the whole PR.


[dependencies]
anyhow = "1"
wasmparser = { path = "../wasmparser" }
@Robbepop Robbepop Jul 10, 2020

Not sure exactly what the dump crate is good for (it prints Wasm modules?), but this is a cyclic dependency back to wasmparser, which has the dump crate as a dev-dependency. I usually try to avoid those if possible. Not sure if it is bad in this case though - I guess not?

Member Author

Ah, FWIW I do this relatively commonly, and it's intended. Cargo handles this (intentionally), and it's basically a way to define core functionality which is tested with other pieces of functionality, defined elsewhere, that build on the core.

Collaborator

Okay thanks for elaborating!

Comment on lines +208 to +216
let mut ret = Validator::new();
ret.wasm_reference_types(true)
.wasm_multi_value(true)
.wasm_simd(true)
.wasm_module_linking(true)
.wasm_bulk_memory(true)
.wasm_threads(true)
.wasm_tail_call(true);
return ret;
Collaborator

Couldn't you chain this, since it has &mut self receivers? If so: if this is a common way to instantiate Validator instances, we might prefer a builder pattern here.

Member Author

My personal style is to use &mut self -> &mut Self for builders rather than self -> Self. A downside is that you have to bind a local variable every now and then, though.

Collaborator

Both work, yeah. Whatever you prefer. The advantage of self -> Self is that you can more easily use type-level guarantees to protect against wrong usage, which is why I generally prefer it, but that's not really needed here.

@@ -119,70 +193,39 @@ fn collect_benchmark_inputs() -> Vec<BenchmarkInput> {
fn it_works_benchmark(c: &mut Criterion) {
Collaborator

The two remaining benchmarks are nicely written. However, since this PR adds support for incremental parsing, we should also have benchmarks that take this into account. So far the two existing benchmarks only cover parsing where the full input is known ahead of time.

Member Author

Ah yeah, the benchmarking isn't super rigorous right now, and I think this could benefit from some more benchmarks. For now, though, I'm not sure it's worth adding a benchmark for incremental parsing: if a slowdown happens between incremental chunks, you typically have a lot going on in between anyway (e.g. waiting for data from the network).

Contributor

@yurydelendik yurydelendik left a comment

Nice! I agree with the API changes.

crates/wasmparser/src/parser.rs Outdated Show resolved Hide resolved
crates/wasmparser/src/validator.rs Outdated Show resolved Hide resolved
@alexcrichton
Member Author

@fitzgen

Here's a sketch of an idea for fuzzing the incremental parser

Thanks for the inspiration! I started with the examples you gave and ended up realizing we could also do data: Vec<Vec<u8>>, which seems like a good, easy way to set up fuzzing.

Fuzzing was able to nigh-instantaneously find "bugs" when I was writing the fuzz program; I'm always impressed at how fast it can find preexisting bugs...

@Robbepop Robbepop left a comment

Still reviewing this PR in chunks. I am very happy about all the new comments and generally the code looks clean and fresh. I already learned a lot by reading it. :)

crates/wasmparser/src/parser.rs Outdated Show resolved Hide resolved
crates/wasmparser/src/parser.rs Outdated Show resolved Hide resolved
Comment on lines 290 to 292
/// Parse errors are returned as an `Err`. Errors can happen when the
/// structure of the module is expected, or if sections are too large for
/// example. Note that errors are not returned for malformed *contents* of
Collaborator

What does it mean that the "structure of the module is expected"?

Member Author

Oops, that's supposed to be s/expected/unexpected/

@alexcrichton
Member Author

Ok thanks all for reviewing! I'm gonna go ahead and merge this and it sounds like iteration can continue in-tree.

@alexcrichton alexcrichton merged commit db2ef19 into bytecodealliance:main Jul 13, 2020
@alexcrichton alexcrichton deleted the incremental-parser branch July 13, 2020 14:12
alexcrichton added a commit to alexcrichton/wasm-tools that referenced this pull request Jul 13, 2020
This fixes issues as pointed out [here]. All of the items in each
submodule are intended to be reexported at the root anyway, so it's a
bit easier to reexport via globs rather than each name individually.

[here]: bytecodealliance#40 (review)
alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Jul 13, 2020
This commit is intended to update wasmparser to 0.59.0. This primarily
includes bytecodealliance/wasm-tools#40 which is a large update to how
parsing and validation works. The impact on Wasmtime is pretty small at
this time, but over time I'd like to refactor the internals here to lean
more heavily on that upstream wasmparser refactoring.

For now, though, the intention is to get on the train of wasmparser's
latest `main` branch to ensure we get bug fixes and such.

As part of this update a few other crates and such were updated. This is
primarily to handle the new encoding of `ref.is_null` where the type is
not part of the instruction encoding any more.
dhil added a commit to dhil/wasm-tools that referenced this pull request Dec 14, 2023