
Rewrite wasmparser's module parser and validator #40

Merged 9 commits on Jul 13, 2020

Conversation

alexcrichton
Member

@alexcrichton alexcrichton commented Jul 9, 2020

This commit is a major refactoring (bordering on rewrite) of
wasmparser's top-level wasm module parsing and validation. Lots of code
from the previous validator has been reused but it's moved around so
much that it looks more like a rewrite. At a high level this commit
removes the old Parser, ModuleReader, and ValidatingParser types.
These are replaced with a new Parser, Validator, and
FuncValidator. There are a number of points motivating this rewrite:

  • Parsing a module currently requires the entire module to be resident
    in-memory. There is no way to incrementally parse a module (for
    example as it arrives from the network).

  • Validating a module is an all-or-nothing operation. This, like parsing
    above, means it's not friendly for incrementally acquired wasm
    binaries.

  • Validation does not expose its results, nor does it provide a way
    for doing so. This means that you can validate a wasm blob in its
    entirety but you can't retroactively ask what the type of function 3
    was. More concretely, if you're implementing a code translator you
    have to track a lot of state the validator was already keeping for you.

  • Validation did not easily provide the ability to parse, validate, and
    possibly compile wasm functions in parallel. The single monolithic
    Validator would be difficult to interleave application-specific
    details into, such as parallelism.

These are all deep architectural issues in how the code is organized
today, so the approach taken in this commit is to rewrite these pieces
rather than bolt the new capabilities on as features. Much of this work was
motivated after recent refactorings for the module linking proposal. The
amount of bookkeeping needed to keep track of type aliases and such was a
big enough job for validation that I didn't want to have to redo it all
again in wasmtime later on!

The new Parser and Validator types are designed to be used both in
high-level and low-level contexts. Handling a WebAssembly module
efficiently can often involve a lot of moving pieces at runtime which
are very application-specific, and it seems like overkill at best, or
impossible at worst, to try to encapsulate all these patterns in wasmparser. Instead
the intention here is that the lowest level bits are able to be reused
regardless of how you're parsing wasm, and the higher level bits are as
straightforward to reimplement and use as possible. This ideally means
that if you can't use some high-level conveniences in wasmparser it
should be obvious how you can rewrite them locally to work in your own
application.

Detailed design of the new APIs added here is best learned by reading
the rustdoc documentation, the examples, or tests. At a high-level
though, the way these new types operate is:

  • Parser is fed chunks of data, and it will return one chunk of parsed
    data which is a view into the input buffer. If it can't parse a chunk
    then it will tell the application it needs to wait for more data to be
    available.

  • Most sections are parsed as-a-whole, meaning that they need to be
    entirely resident in memory before being parsed. For example this
    rewrite does not support incrementally parsing the type section. This
    is done for ease with the expectation that most sections are
reasonably small and have no reason to be incrementally
    processed beyond the section-at-a-time level.

  • Parser, however, will allow incremental downloads of the code and
    module code sections. This means that it supports parsing a singular
    function or a singular module at a time. This allows functions to be
    validated/processed immediately as they're received, without having to
    wait for the next function to be available.

  • The Validator type receives as input the payloads returned by
    Parser. The Validator type is intended to be persistently living
    adjacent to a Parser which it receives input from.

  • Validation is intended to eventually expose information about the
    module and operators as necessary. For example methods can be added to
    Validator to query what types are and the results of operators in
    functions. It's envisioned that you'd use a Parser to parse a module
    and then Validator would be used before you execute
    application-specific code per-section/item.

At this time operator/function validation is not changed. The operator
validator is only very lightly touched; a deeper refactoring there is
expected to come later. I would like to effectively add a
method-per-opcode to FuncValidator so engines can, for example,
validate a call instruction, get back the type of the call, and then
use that type to iterate over the types on the stack and return values.
None of this is supported yet, but I'm hoping to make, for example,
cranelift-wasm lean much more heavily on the wasmparser Validator.


In terms of next steps there are some other bits worth mentioning. My original intention was to start discussing this change through the https://github.com/bytecodealliance/rfcs process, but I basically got excited about implementing it and went ahead. This change has a large impact on likely all consumers of wasmparser, and I would very much like to get feedback from stakeholders before actually merging it. I do not expect the APIs to get merged literally as-is in the initial state of this PR; I'm hoping that feedback can help evolve them over time!

Other items I'd like to perhaps consider investigating and/or implementing would be:

  • Fuzzing the incremental parser. There are code paths to handle "I've got the whole module here" vs "I need to wait for more data", and while I tried to architect these in a way that reduces the likelihood of bugs, this would be a fantastic thing to fuzz. Basically, incrementally parsing a module should always give the exact same semantic results as having the whole module in memory from the beginning, and there should likely be a fuzzer which asserts this.

  • Optimizing the parser and validator. Some local large modules take nearly 400ms to parse and validate right now, which seems a bit large even for a 40MB wasm file. I think there's likely a lot of low-hanging fruit to be had to optimize validation, and this is something I think is worth looking into to ensure we're able to address performance bottlenecks with this API design.

  • Experience integrating this into Wasmtime. My intention is to make cranelift-wasm much simpler and lean much more heavily on Validator in wasmparser, but I think it's worth proving this out. The long-term goal is to make the implementation of module-linking much simpler in cranelift-wasm by ensuring that everyone's using the same few functions to handle type aliases and count modules and such.

And I'm sure at least one other thing will come up during review!

@alexcrichton
Member Author

One last thing I think is worth pointing out: my goal here is to start the ball rolling on this new API design. This singular PR/commit is not what I intend the final state to look like. For example, FuncValidator is only a shell of what I'd like it to be, but changing it requires more invasive changes to the operator validation that already exists, and I'd prefer to minimize the amount of wasmparser touched as part of this PR.

Basically this is to say that where it makes sense I've left out possible future improvements, but I intend to tackle them as subsequent PRs once this looks good to everyone.

@fitzgen
Member

fitzgen commented Jul 10, 2020

I haven't looked at the code yet, but just wanted to note that I think this is a good direction, specifically from the angle of simplifying cranelift-wasm. For example, the ref.is_null instruction no longer has a type immediate, so to provide that information to the FuncEnvironment if the env needs it for translation, cranelift-wasm would have to duplicate type checking. (We were almost bitten by this for reference types in wasmtime, but since we can check whether the cranelift type is i64 or r64, we avoided it.)

Regarding the eventual validator trait, here are some random thoughts about what we want from it:

  • The ability to add arbitrary associated data to the stack alongside the valtypes (e.g. cranelift-wasm would have cranelift_codegen::ir::Values that it is building up in translation, walrus would have its AST nodes, a pure validator would use ())

  • Some sort of hooks for control flow joins, to insert phis and all that. Maybe this is the same as hooking end instructions. Need to revisit that part of cranelift-wasm to remember.

  • We don't want to require that every implementor of the validator trait reimplement type validation themselves, even if they are overriding an opcode method to record extra data. So I view this less as a validator trait (since validation is always handled for you), and more as a callback trait.

Going to take a look at the API docs and then the code itself next.

@fitzgen
Member

fitzgen commented Jul 10, 2020

* Fuzzing the incremental parser.

Here's a sketch of an idea for fuzzing the incremental parser. Basically, do differential fuzzing between giving the parser data in increments and giving it the data all at once. This should exercise the incremental code paths and check the correctness of the results. Without any regard to the actual APIs, here is some pseudocode:

fuzz_target!(|pair: (usize, Vec<u8>)| {
    let (index, data) = pair;

    // Parse incrementally, by splitting data at index.
    let mut incr_parser = Parser::new();
    incr_parser.parse(&data[..index]);
    incr_parser.parse(&data[index..]);
    let incr_result = incr_parser.result();

    // Parse all of data in one go.
    let complete_result = Parser::new().parse(&data).result();

    assert_eq!(incr_result, complete_result);
});

@fitzgen
Member

fitzgen commented Jul 10, 2020

Alternatively, rather than split once at index, feed many chunks of a given size:

fuzz_target!(|pair: (usize, Vec<u8>)| {
    let (chunk_size, data) = pair;
    // Guard: `chunks(0)` would panic.
    let chunk_size = chunk_size.max(1);

    // Parse incrementally, by feeding chunks of size chunk_size.
    let mut incr_parser = Parser::new();
    for chunk in data.chunks(chunk_size) {
        incr_parser.parse(chunk);
    }
    let incr_result = incr_parser.result();

    // Parse all of data in one go.
    let complete_result = Parser::new().parse(&data).result();

    assert_eq!(incr_result, complete_result);
});

@fitzgen fitzgen left a comment

Okay, I looked at the API docs and everything makes sense / is as expected. The examples for using Parser were helpful! 👍 I skimmed the code, didn't read it in excruciating detail, and it looked good. As you mentioned, it was familiar, since there was a bit of code motion here as well.

Random note: this is an existing issue, but OperatorsIteratorWithOffsets is not exported from the crate, despite being returned by a public method. We should export it.

#[derive(Debug, Clone)]
pub struct Parser {
    state: State,
    offset: u64,
Member

It would be nice to have a method that exposes this value (or maybe have the offset in Chunk::Parsed or Payload?). We need it for processing DWARF.

I know it can be reconstructed from looking at Chunk::Parsed::consumed, but (a) that seems quite tedious, and (b) might not have the granularity required for the DWARF stuff (offset per instruction).

Member

Nevermind: just saw OperatorsReader::read_with_offset etc.

Member Author

Ah yeah, FWIW I also want to make sure that every single byte in a wasm file can be accounted for using this API (one way or another, at least). The wasmparser-dump crate (which previously lived at tests/dump.rs) ensures this too.

Comment on lines 403 to 404
// TODO: thread `offset: u64` through to `BinaryReader`, remove
// the cast here.
Member

👀

Member Author

This is actually something I forgot to mention too. Lots of stuff in wasmparser is geared towards usize in terms of offsets, but I think we'll probably want to switch to u64 now that offsets are not always relative to an in-memory stream but may instead be relative to a stream of bytes elsewhere.

crates/wasmparser/src/validator.rs Outdated Show resolved Hide resolved
Collaborator

@Robbepop Robbepop left a comment

Just reviewed a tiny fraction of the whole PR.


[dependencies]
anyhow = "1"
wasmparser = { path = "../wasmparser" }
@Robbepop Robbepop Jul 10, 2020

Not sure exactly what the dump crate is good for (it prints Wasm modules?), but this is a cyclic dependency back to wasmparser, which has the dump crate as a dev-dependency. I usually try to avoid those if possible. Not sure if it is bad in this case though - I guess not?

Member Author

Ah, FWIW I do this relatively commonly, and it's intended. Cargo handles this (intentionally), and it's basically a way to define core functionality which is tested with other pieces of functionality, defined elsewhere, that build on the core.

Collaborator

Okay thanks for elaborating!

Comment on lines +208 to +216
let mut ret = Validator::new();
ret.wasm_reference_types(true)
.wasm_multi_value(true)
.wasm_simd(true)
.wasm_module_linking(true)
.wasm_bulk_memory(true)
.wasm_threads(true)
.wasm_tail_call(true);
return ret;
Collaborator

Couldn't you chain this, since it has &mut self receivers? If so: if this is a common way to instantiate Validator instances, we might prefer a builder pattern here.

Member Author

My personal style is to use &mut self -> &mut Self for builders rather than self -> Self. A downside is that you have to bind a local variable every now and then, though.

Collaborator

Both work, yeah. Whatever you prefer. The advantage of self -> Self is that you can more easily use type-level guarantees to protect against wrong usage, which is why I generally prefer it, but that's not really needed here.

@@ -119,70 +193,39 @@ fn collect_benchmark_inputs() -> Vec<BenchmarkInput> {
fn it_works_benchmark(c: &mut Criterion) {
Collaborator

The two remaining benchmarks are nicely written. However, since this PR adds support for incremental parsing, we should also have benchmarks that take this into account. So far the two existing benchmarks only cover parsing where the full input is known ahead of time.

Member Author

Ah yeah, the benchmarking isn't super rigorous right now, and I think this could benefit from some more benchmarks. For now, though, I'm not sure it's worth adding a benchmark for incremental parsing: if a slowdown happens between incremental chunks, you typically have a lot going on in between anyway (e.g. waiting for data from the network).

Contributor

@yurydelendik yurydelendik left a comment

Nice! I agree with the API changes.

crates/wasmparser/src/parser.rs Outdated Show resolved Hide resolved
crates/wasmparser/src/validator.rs Outdated Show resolved Hide resolved
@alexcrichton
Member Author

@fitzgen

Here's a sketch of an idea for fuzzing the incremental parser

Thanks for the inspiration! I started with the examples you gave and ended up realizing we could also do data: Vec<Vec<u8>>, which seems like a good, easy way to set up fuzzing.

Fuzzing was able to nigh-instantaneously find "bugs" when I was writing the fuzz program; I'm always impressed at how fast it can find preexisting bugs...

@Robbepop Robbepop left a comment

Still reviewing this PR in chunks. I am very happy about all the new comments and generally the code looks clean and fresh. I already learned a lot by reading it. :)

crates/wasmparser/src/parser.rs Outdated Show resolved Hide resolved
crates/wasmparser/src/parser.rs Outdated Show resolved Hide resolved
Comment on lines 290 to 292
/// Parse errors are returned as an `Err`. Errors can happen when the
/// structure of the module is expected, or if sections are too large for
/// example. Note that errors are not returned for malformed *contents* of
Collaborator

What does it mean that the "structure of the module is expected"?

Member Author

Oops, that's supposed to be s/expected/unexpected/

@alexcrichton
Member Author

Ok thanks all for reviewing! I'm gonna go ahead and merge this and it sounds like iteration can continue in-tree.

@alexcrichton alexcrichton merged commit db2ef19 into bytecodealliance:main Jul 13, 2020
@alexcrichton alexcrichton deleted the incremental-parser branch July 13, 2020 14:12
alexcrichton added a commit to alexcrichton/wasm-tools that referenced this pull request Jul 13, 2020
This fixes issues as pointed out [here]. All of the items in each
submodule are intended to be reexported at the root anyway, so it's a
bit easier to reexport via globs rather than each name individually.

[here]: bytecodealliance#40 (review)
alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Jul 13, 2020
This commit is intended to update wasmparser to 0.59.0. This primarily
includes bytecodealliance/wasm-tools#40 which is a large update to how
parsing and validation works. The impact on Wasmtime is pretty small at
this time, but over time I'd like to refactor the internals here to lean
more heavily on that upstream wasmparser refactoring.

For now, though, the intention is to get on the train of wasmparser's
latest `main` branch to ensure we get bug fixes and such.

As part of this update a few other crates and such were updated. This is
primarily to handle the new encoding of `ref.is_null` where the type is
not part of the instruction encoding any more.
dhil added a commit to dhil/wasm-tools that referenced this pull request Dec 14, 2023