Rewrite wasmparser's module parser and validator #40
Conversation
One last thing I think is worth pointing out is that my goal here is to start the ball rolling on this new API design. This singular PR/commit is not what I intend the final state to look like. Basically, where it makes sense I've left out possible future improvements, but I intend to tackle them as subsequent PRs once this looks good to everyone.
I haven't looked at the code yet, but just wanted to note that I think this is a good direction, specifically from the angle of simplifying things. I also have some random thoughts about what we want from the eventual validator trait.
Going to take a look at the API docs and then the code itself next.
Here's a sketch of an idea for fuzzing the incremental parser: do differential fuzzing between giving the parser data in increments and giving it the data all at once. This should exercise the incremental code paths and check the correctness of the results. Without any regard to the actual APIs, here is some pseudocode:

fuzz_target!(|pair: (usize, Vec<u8>)| {
    let (index, data) = pair;
    // Clamp the split point so it always falls within the data.
    let index = index.min(data.len());

    // Parse incrementally, by splitting data at index.
    let mut incr_parser = Parser::new();
    incr_parser.parse(&data[..index]);
    incr_parser.parse(&data[index..]);
    let incr_result = incr_parser.result();

    // Parse all of data in one go.
    let complete_result = Parser::new().parse(&data).result();

    // Both ways of parsing should agree.
    assert_eq!(incr_result, complete_result);
});
Alternatively, rather than split once at a single index, feed the data in fixed-size chunks:

fuzz_target!(|pair: (usize, Vec<u8>)| {
    let (chunk_size, data) = pair;
    // Avoid a zero chunk size, which `chunks` does not allow.
    let chunk_size = chunk_size.max(1);

    // Parse incrementally, by feeding chunks of size chunk_size.
    let mut incr_parser = Parser::new();
    for chunk in data.chunks(chunk_size) {
        incr_parser.parse(chunk);
    }
    let incr_result = incr_parser.result();

    // Parse all of data in one go.
    let complete_result = Parser::new().parse(&data).result();

    // Both ways of parsing should agree.
    assert_eq!(incr_result, complete_result);
});
Okay, I looked at the API docs and everything makes sense / is expected. The examples for using `Parser` were helpful! 👍 I skimmed the code, didn't read it in excruciating detail, and it looked good. As you mentioned, it was familiar, since there was a bit of code motion here as well.
Random note: this is an existing issue, but `OperatorsIteratorWithOffsets` is not exported from the crate, despite being returned by a public method. We should export it.
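(As a side note, one lightweight way to fix this class of problem, and what the follow-up commits linked at the bottom of this page do, is to glob-re-export each submodule at the crate root. A minimal sketch; the `readers` module name is just a placeholder, not the crate's actual layout:)

```rust
// lib.rs (sketch): declare the submodule and glob-re-export its public items so that types
// returned by public methods, like `OperatorsIteratorWithOffsets`, are reachable from the
// crate root. (`readers` is a placeholder module name.)
mod readers;

pub use crate::readers::*;
```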
#[derive(Debug, Clone)]
pub struct Parser {
    state: State,
    offset: u64,
It would be nice to have a method that exposes this value (or maybe have the offset in `Chunk::Parsed` or `Payload`?). We need it for processing DWARF.
I know it can be reconstructed from looking at `Chunk::Parsed::consumed`, but (a) that seems quite tedious, and (b) might not have the granularity required for the DWARF stuff (offset per instruction).
Nevermind: just saw `OperatorsReader::read_with_offset` etc.
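For reference, with `read_with_offset` / `into_iter_with_offsets` the per-instruction offsets come out roughly like this (a sketch; the signatures are assumed from the method names rather than checked against this exact revision):

```rust
use wasmparser::{FunctionBody, Result};

// Sketch: walk a function body and record the byte offset of each operator, which is the
// granularity DWARF-style tooling needs. Assumes the iterator yields (operator, offset)
// pairs as its name suggests.
fn dump_operator_offsets(body: FunctionBody<'_>) -> Result<()> {
    for item in body.get_operators_reader()?.into_iter_with_offsets() {
        let (op, offset) = item?;
        println!("{offset:#x}: {op:?}");
    }
    Ok(())
}
```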
Ah yeah, FWIW I also want to make sure that every single byte in a wasm file can be accounted for using this API (one way or another at least). The `wasmparser-dump` crate (which previously lived at `tests/dump.rs`) ensures this too.
crates/wasmparser/src/parser.rs
Outdated
// TODO: thread through `offset: u64` to `BinaryReader`, remove
// the cast here.
👀
This is actually something I forgot to mention too. Lots of stuff in wasmparser is geared towards `usize` in terms of offsets, but I think that we'll probably want to switch to `u64` now that they're not always relative to an in-memory stream but may instead be relative to a stream of bytes elsewhere.
Just reviewed a tiny fraction of the whole PR.
[dependencies]
anyhow = "1"
wasmparser = { path = "../wasmparser" }
Not sure exactly what the `dump` crate is good for (it prints Wasm modules?), but this is a cyclic dependency back to `wasmparser`, which has the `dump` crate as a dev-dependency. I usually try to avoid those if possible. Not sure if it is bad in this case though - I guess not?
Ah FWIW I do this relatively commonly, and it's intended. This works out (intentionally) in Cargo and is basically a way to define core functionality which is tested with other pieces of functionality defined elsewhere that build on the core.
Okay thanks for elaborating!
let mut ret = Validator::new();
ret.wasm_reference_types(true)
    .wasm_multi_value(true)
    .wasm_simd(true)
    .wasm_module_linking(true)
    .wasm_bulk_memory(true)
    .wasm_threads(true)
    .wasm_tail_call(true);
return ret;
Couldn't you chain this directly off `Validator::new()`, or is that not possible because the methods have `&mut self` receivers? If this is a common way to instantiate `Validator` instances, we might prefer a builder pattern here.
My personal style is to use `&mut self -> &mut Self` for builder style rather than `self -> Self`. A downside is that you have to bind a local variable every now and then, though.
Both work, yeah. Whatever you prefer. The advantage of `self -> Self` is that you can more easily use type-level tricks to protect against wrong usage, which is why I generally prefer it, but that's not really needed here.
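To spell the two styles out with a toy example (illustration only; these are made-up types, not the actual `Validator` API):

```rust
// Illustration of the two builder styles discussed above, on a made-up config type.
#[derive(Default, Debug)]
struct Config {
    simd: bool,
    threads: bool,
}

// Style 1: `&mut self -> &mut Self`. Chainable, but to end up with an owned,
// fully-configured value you typically bind a local first (as in the snippet above).
#[derive(Default)]
struct BuilderMut {
    config: Config,
}

impl BuilderMut {
    fn simd(&mut self, on: bool) -> &mut Self {
        self.config.simd = on;
        self
    }
    fn threads(&mut self, on: bool) -> &mut Self {
        self.config.threads = on;
        self
    }
}

// Style 2: `self -> Self`. The whole construction is a single expression, and it's easier
// to layer typestate tricks on top to rule out invalid configurations at compile time.
#[derive(Default)]
struct BuilderOwned {
    config: Config,
}

impl BuilderOwned {
    fn simd(mut self, on: bool) -> Self {
        self.config.simd = on;
        self
    }
    fn threads(mut self, on: bool) -> Self {
        self.config.threads = on;
        self
    }
}

fn main() {
    // &mut self style: bind a local, then configure it.
    let mut a = BuilderMut::default();
    a.simd(true).threads(true);

    // self style: configure in one expression.
    let b = BuilderOwned::default().simd(true).threads(true);

    println!("{:?} {:?}", a.config, b.config);
}
```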
@@ -119,70 +193,39 @@ fn collect_benchmark_inputs() -> Vec<BenchmarkInput> {
fn it_works_benchmark(c: &mut Criterion) {
The 2 remaining benchmarks are nicely written. However, since this PR adds support for incremental parsing, we should actually also have benchmarks that take this into account. So far the 2 existing benchmarks only cover parsing where the full input is known ahead of time.
Ah yeah, the benchmarking isn't super rigorous right now and I think this could benefit from adding some more benchmarks. For now, though, I'm not sure it's worth adding a benchmark for incremental parsing, since if the slowdown happens between incremental chunks you typically have a lot going on between them anyway (e.g. waiting for data from the network).
Nice! I agree with the API changes.
Force-pushed from acce7b0 to 631ba82.
Thanks for the inspiration! I started with the examples you gave and ended up taking them a bit further. Fuzzing was able to nigh-instantaneously find "bugs" while I was writing the fuzz program; I'm always impressed at how fast it can find preexisting bugs...
Force-pushed from fe27cdb to a983f48.
Still reviewing this PR in chunks. I am very happy about all the new comments and generally the code looks clean and fresh. I already learned a lot by reading it. :)
crates/wasmparser/src/parser.rs
Outdated
/// Parse errors are returned as an `Err`. Errors can happen when the
/// structure of the module is expected, or if sections are too large for
/// example. Note that errors are not returned for malformed *contents* of
What does it mean that the "structure of the module is expected"?
Oops, that's supposed to be s/expected/unexpected/
Ok thanks all for reviewing! I'm gonna go ahead and merge this, and it sounds like iteration can continue in-tree.
This fixes issues as pointed out [here]. All of the items in each submodule are intended to be reexported at the root anyway, so it's a bit easier to reexport via globs rather than each name individually. [here]: bytecodealliance#40 (review)
This commit is intended to update wasmparser to 0.59.0. This primarily includes bytecodealliance/wasm-tools#40 which is a large update to how parsing and validation works. The impact on Wasmtime is pretty small at this time, but over time I'd like to refactor the internals here to lean more heavily on that upstream wasmparser refactoring. For now, though, the intention is to get on the train of wasmparser's latest `main` branch to ensure we get bug fixes and such. As part of this update a few other crates and such were updated. This is primarily to handle the new encoding of `ref.is_null` where the type is not part of the instruction encoding any more.
Merge with upstream
This commit is a major refactoring (bordering on a rewrite) of wasmparser's top-level wasm module parsing and validation. Lots of code from the previous validator has been reused, but it's moved around so much that it looks more like a rewrite. At a high level this commit removes the old `Parser`, `ModuleReader`, and `ValidatingParser` types. These are replaced with a new `Parser`, `Validator`, and `FuncValidator`. There are a number of points motivating this rewrite:

* Parsing a module currently requires the entire module to be resident in memory. There is no way to incrementally parse a module (for example as it arrives from the network).

* Validating a module is an all-or-nothing operation. This, like parsing above, means it's not friendly for incrementally acquired wasm binaries.

* Validation does not expose its results, nor does it provide a way for doing so. This means that you can validate a wasm blob in its entirety but you can't retroactively ask what the type of function 3 was. More concretely, if you're implementing a code translator you have to track a lot of state the validator was already keeping for you.

* Validation did not easily provide the ability to parse, validate, and possibly compile wasm functions in parallel. The single monolithic `Validator` would be difficult to interleave application-specific details into, such as parallelism.

These are all deep architectural issues in how the code is organized today, so the approach taken in this commit is to rewrite these pieces as opposed to adding the missing capabilities on as features. Much of this work was motivated by recent refactorings for the module linking proposal. The amount of bookkeeping needed to keep track of type aliases and such was a big enough job for validation that I didn't want to have to redo it all again in wasmtime later on!

The new `Parser` and `Validator` types are designed to be used both in high-level and low-level contexts. Handling a WebAssembly module efficiently can often involve a lot of moving pieces at runtime which are very application-specific, and it seems like overkill at best, or impossible at worst, to try to encapsulate all these patterns in wasmparser. Instead the intention here is that the lowest-level bits can be reused regardless of how you're parsing wasm, and the higher-level bits are as straightforward to reimplement and use as possible. This ideally means that if you can't use some high-level convenience in wasmparser it should be obvious how to rewrite it locally to work in your own application.

The detailed design of the new APIs added here is best learned by reading the rustdoc documentation, the examples, or the tests. At a high level, though, the way these new types operate is (a rough sketch of the resulting driving loop follows this list):

* `Parser` is fed chunks of data, and it will return one chunk of parsed data which is a view into the input buffer. If it can't parse a chunk then it will tell the application it needs to wait for more data to be available.

* Most sections are parsed as a whole, meaning that they need to be entirely resident in memory before being parsed. For example this rewrite does not support incrementally parsing the type section. This is done for ease of implementation, with the expectation that most sections are reasonably small and have no reason to be incrementally processed beyond the section-at-a-time level.

* `Parser`, however, allows incremental downloads of the code and module code sections. This means that it supports parsing a single function or a single module at a time, which allows functions to be validated/processed immediately as they're received, without having to wait for the next function to be available.

* The `Validator` type receives as input the payloads returned by `Parser`. The `Validator` is intended to live persistently adjacent to a `Parser` from which it receives input.

* Validation is intended to eventually expose information about the module and operators as necessary. For example, methods can be added to `Validator` to query what types are and what the results of operators in functions are. It's envisioned that you'd use a `Parser` to parse a module and that `Validator` would be consulted before you execute application-specific code per section/item.
At this time operator/function validation is not changed. The operator validator is only very lightly touched; otherwise it's expected that this will be a future refactoring. I would like to effectively add a method-per-opcode to `FuncValidator` so engines can, for example, validate a `call` instruction, get back the type of the call, and then use that type to iterate over the types on the stack and return values. None of this is supported yet, but I'm hoping to make, for example, cranelift-wasm lean much more heavily on the wasmparser `Validator`.
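Purely to illustrate what "method-per-opcode" could look like, here is a hypothetical sketch; none of these names exist in this PR or in wasmparser:

```rust
// Hypothetical sketch of a method-per-opcode validation surface; all types and method
// names here are made up for illustration.
enum ValType {
    I32,
    I64,
    F32,
    F64,
}

struct FuncType {
    params: Vec<ValType>,
    results: Vec<ValType>,
}

trait OpValidator {
    // Validate a `call` to `function_index` and, on success, hand back the callee's type so
    // an engine can pop arguments and push results while it translates the instruction.
    fn op_call(&mut self, function_index: u32) -> Result<FuncType, String>;
}
```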
In terms of next steps there are some other bits worth mentioning. My original intention was to start discussing this change through the https://github.com/bytecodealliance/rfcs process, but I basically got excited about implementing this so I went ahead and did it. This change has a large impact on likely all consumers of `wasmparser` and I would very much like to get feedback from stakeholders before actually merging this. I do not expect the APIs literally as-is in the initial state of this PR to get merged; I'm hoping that feedback can help evolve them over time!

Other items I'd like to perhaps consider investigating and/or implementing would be:

* Fuzzing the incremental parser. There are code bits to handle "I've got the whole module here" vs "I need to wait for more data", and while I tried to architect these in a way that reduces the likelihood of bugs, this would be a fantastic thing to fuzz. Basically, incrementally parsing a module should always give the exact same semantic results as having the whole module in memory at the beginning, and there should likely be a fuzzer which asserts this.

* Optimizing the parser and validator. Some local large modules take nearly 400ms to parse and validate right now, which seems a bit large even for a 40MB wasm file. I think there's likely a lot of low-hanging fruit to be had to optimize validation, and this is something I think is worth looking into to ensure we're able to address performance bottlenecks with this API design.

* Experience integrating this into Wasmtime. My intention is to make cranelift-wasm much simpler and lean much more heavily on `Validator` in wasmparser, but I think it's worth proving this out. The long-term goal is to make the implementation of module-linking much simpler in cranelift-wasm by ensuring that everyone's using the same few functions to handle type aliases and count modules and such.

And I'm sure at least one other thing will come up during review!