
Wasi Preview 2 Async Race Condition #6811

Closed
rylev opened this issue Aug 7, 2023 · 3 comments · Fixed by #6877
Labels
bug Incorrect behavior in the current implementation that needs fixing

Comments

@rylev
Contributor

rylev commented Aug 7, 2023

Moving streams to async has introduced a race condition: the AsyncWriteStream often gets dropped before the receiver inside the spawned task has had time to receive a message.

This makes sense, since there's nothing tying the spawned task's completion to the Store that contains the AsyncWriteStream, so it's very easy to drop the Store before the spawned handlers have run to completion (sketched below).
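For illustration, here is a minimal sketch of the abort-on-drop pattern described above. The names (SketchWriteStream, the channel-based worker) are illustrative assumptions, not Wasmtime's actual implementation:

use tokio::io::{self, AsyncWriteExt};
use tokio::sync::mpsc;
use tokio::task::JoinHandle;

// A handle that forwards bytes over a channel to a spawned worker task and
// aborts that task when the handle is dropped.
struct SketchWriteStream {
    sender: mpsc::Sender<Vec<u8>>,
    join_handle: JoinHandle<()>,
}

impl SketchWriteStream {
    fn new() -> Self {
        let (sender, mut receiver) = mpsc::channel::<Vec<u8>>(16);
        // The worker owns the real writer and drains the channel.
        let join_handle = tokio::spawn(async move {
            let mut stdout = io::stdout();
            while let Some(bytes) = receiver.recv().await {
                let _ = stdout.write_all(&bytes).await;
                let _ = stdout.flush().await;
            }
        });
        Self { sender, join_handle }
    }

    async fn write(&self, bytes: Vec<u8>) {
        let _ = self.sender.send(bytes).await;
    }
}

impl Drop for SketchWriteStream {
    fn drop(&mut self) {
        // If the worker hasn't pulled the last message off the channel yet,
        // aborting here loses the output -- this is the race in question.
        self.join_handle.abort();
    }
}

#[tokio::main]
async fn main() {
    let stream = SketchWriteStream::new();
    stream.write(b"Hello, World!\n".to_vec()).await;
    // `stream` drops here; whether the worker wrote the bytes before the
    // abort is a race.
}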

Test Case

Take the following "hello world" case:

use wasmtime::{
    component::{Component, Linker},
    Config, Engine, Store,
};
use wasmtime_wasi::preview2::{Table, WasiCtx, WasiCtxBuilder, WasiView};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Enable the component model and async support.
    let mut config = Config::new();
    config.wasm_component_model(true).async_support(true);
    let engine = Engine::new(&config)?;
    let component = Component::from_file(&engine, "example.wasm")?;

    // Build a WASI ctx that inherits the host's stdio, so guest prints
    // should end up on the host's stdout.
    let mut table = Table::new();
    let ctx = WasiCtxBuilder::new().inherit_stdio().build(&mut table)?;
    let mut store = Store::new(&engine, Data { table, ctx });

    // Wire up the preview2 command world and call the `hello` export.
    let mut linker = Linker::new(&engine);
    wasmtime_wasi::preview2::command::add_to_linker(&mut linker)?;
    let instance = linker.instantiate_async(&mut store, &component).await?;
    let run = instance.get_typed_func::<(), ()>(&mut store, "hello")?;
    run.call_async(&mut store, ()).await?;
    // When main returns, the Store (and the AsyncWriteStream it contains) is
    // dropped, racing the spawned task that actually writes to stdout.
    Ok(())
}

struct Data {
    table: Table,
    ctx: WasiCtx,
}

impl WasiView for Data {
    fn table(&self) -> &wasmtime_wasi::preview2::Table {
        &self.table
    }

    fn table_mut(&mut self) -> &mut wasmtime_wasi::preview2::Table {
        &mut self.table
    }

    fn ctx(&self) -> &wasmtime_wasi::preview2::WasiCtx {
        &self.ctx
    }

    fn ctx_mut(&mut self) -> &mut wasmtime_wasi::preview2::WasiCtx {
        &mut self.ctx
    }
}

The example.wasm file is a simple wasm component with a single export, hello, which uses Rust's println! to print "Hello, World!" to stdout.

This technically works, but only by coincidence: if we observe when "Hello, World" actually gets flushed to stdout, it happens after the spawned task handling the IO has been aborted in the Drop handler.

Steps to Reproduce

You can reproduce this by running the above example against a local checkout of wasmtime. Simply add a logging statement to the drop handler of AsyncWriteStream and you'll see that the drop (and thus the cancellation of the spawned task) happens before the output reaches stdout.

I've found this to be reliably reproducible in a user-visible way if the wasm binary prints more than one line of text to stdout: the first line usually gets flushed successfully, but the second does not. Adding something that slows down the runtime can eliminate the bug; e.g., adding tokio::time::sleep(std::time::Duration::from_millis(1)).await; after the function call (as shown below) gets rid of the race condition.
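Concretely, that workaround would sit at the end of the test case's main above, something like:

    run.call_async(&mut store, ()).await?;
    // Workaround: yield briefly so the spawned IO task can drain its channel
    // before the Store (and the AsyncWriteStream inside it) is dropped.
    tokio::time::sleep(std::time::Duration::from_millis(1)).await;
    Ok(())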

I first caught this bug when upgrading Spin to the latest version of wasmtime. The exported function would complete and the containing Store would be dropped before the underlying AsyncWriteStream's spawned task had a chance to receive the bytes to be written to the inner AsyncWrite type. Only every ~10th invocation would see the spawned task actually process the bytes before the AsyncWriteStream's drop handler canceled the task.

Expected Results

I would expect some mechanism for ensuring that ongoing async writes complete before the Store is dropped.

Versions and Environment

This is the case on the latest version of wasmtime as of this writing.

rylev added the bug (Incorrect behavior in the current implementation that needs fixing) label on Aug 7, 2023
pchickey self-assigned this on Aug 7, 2023
@pchickey
Contributor

pchickey commented Aug 7, 2023

Thanks for the report! @alexcrichton and @elliottt and I had talked about this hypothetical scenario, but I hadn't realized how trivial it is to trip over. We'll work on a mechanism to hold onto the JoinHandles created inside a Store beyond the lifetime of the Store.

@rylev
Contributor Author

rylev commented Aug 8, 2023

Thinking about this more, I think the right thing to do is to simply remove the call to abort inside AsyncWriteStream's drop impl. I don't see why an immediate abort would be necessary, and if there is such a use case, it seems like it should be opt-in.

If we remove the call to abort, the spawned task will still exit cleanly: either the call to recv returns None because the channel is empty and the sender has been dropped, or a normal, expected exit occurs here or here (sketched below).
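As a sketch of that clean shutdown, again assuming the channel-based worker from the earlier sketch rather than the actual AsyncWriteStream internals: once every Sender is dropped, recv() returns None and the task finishes on its own.

async fn worker(mut receiver: tokio::sync::mpsc::Receiver<Vec<u8>>) {
    use tokio::io::AsyncWriteExt;
    let mut stdout = tokio::io::stdout();
    while let Some(bytes) = receiver.recv().await {
        if stdout.write_all(&bytes).await.is_err() {
            // A write error is the other normal, expected exit path.
            return;
        }
        let _ = stdout.flush().await;
    }
    // Channel closed and drained: the task returns without needing an abort.
}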

@alexcrichton
Member

The main consequence of letting things run in the background, though, was that each wasm guest could leave work sitting in the background for an arbitrarily long time, meaning that work could accidentally accumulate over time before it's all flushed out and finished. The thinking was that we could add the ability for a host to explicitly opt in to daemonizing the background work to keep it going, but additionally allow hosts to learn about the background work via some object too.

In light of this we may want to revisit that default, though, since I think it's true that almost all of the time having background work won't harm anyone or anything. So while I think it'd still be good to package up the background work for an embedder to get access to, @pchickey what do you think about switching to not-abort-by-default?
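One shape the opt-in could take, purely as a sketch of the design space (the abort_on_drop flag is a hypothetical illustration, not an existing Wasmtime API):

// Hypothetical: a per-stream flag choosing between abort-on-drop and
// daemonizing the background task. Not an existing Wasmtime API.
struct SketchStreamHandle {
    join_handle: tokio::task::JoinHandle<()>,
    abort_on_drop: bool,
}

impl Drop for SketchStreamHandle {
    fn drop(&mut self) {
        if self.abort_on_drop {
            // Today's default behavior: kill the background task immediately.
            self.join_handle.abort();
        }
        // Otherwise the task is "daemonized": it keeps running until its
        // channel drains and closes, then exits on its own.
    }
}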

pchickey pushed a commit that referenced this issue Aug 8, 2023
This method flushes output and terminates background tasks. Background
tasks still terminate as part of Drop!

The problem with the current implementation is that there is no way to
wait for output buffered in host background tasks to flush before
aborting those tasks as part of dropping the Store/Table. This means that
e.g. for a trivial component that prints "hello world\n" to stdout and
returns, if the Store&Table drop immediately after execution of the
component completes, there is a race and the output may not happen at
all.

I don't really love this design, but we got backed into a corner because
all of the alternatives we could think up are worse:

* We can't just get rid of the abort-on-drop completely ("daemonize" the
tasks) because that means that streams connected to e.g. a stalled
client connection will consume resources forever, which is not
acceptable in some embeddings.
* We can't ensure flushing on drop of a table/store because it requires
an await, and Rust does not have async drop.
* We can't add an explicit destructor to a table/store which will
terminate tasks, and if this destructor is not called tasks will
"daemonize", because that means cancellation of the future executing
a component before the explicit destructor is called will end up
daemonizing the task.
* We could configure all AsyncWriteStreams (and any other stream impls
that spawn a task) at creation, or at insertion to the table, with
whether they should daemonize on drop or not. This would mean plumbing a
bunch of config into places it currently is not.

Therefore, the only behavior we could come up with was to keep
the abort-on-drop behavior for background tasks, and add methods
to ensure that background tasks are joined (finished) gracefully.
This means that both sync and async users of WasiCtx will need to
call the appropriate method to wait on background tasks. This is
easy enough for users to miss, but we decided that the alternatives are
worse.

Closes #6811
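For embedders, the resulting usage is roughly the shape below; flush_and_join is a placeholder name standing in for whichever method #6877 actually adds (the real name is not shown in this excerpt), used only to illustrate where the call goes:

    // Placeholder sketch: `flush_and_join` is a hypothetical name, not the
    // real Wasmtime API. The point is that the embedder must explicitly wait
    // on background tasks before dropping the Store.
    run.call_async(&mut store, ()).await?;
    store.data_mut().ctx_mut().flush_and_join().await; // hypothetical method
    drop(store); // by now buffered guest output has been flushed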
alexcrichton pushed a commit to alexcrichton/wasmtime that referenced this issue Aug 10, 2023