Tasking for Emscripten/Wasm target #32532
Conversation
Add it to our priority list and I'll get to it in the fall sometime after 1.3 is branched. |
Current status: Things basically work with one task at a time, but when I try to interleave tasks it gets confused. E.g.:
so it's getting the right task, but printing the wrong |
One thing the Emscripten C implementation is missing (for coroutines) is saving and restoring the C stack, could be part of the same issue here (Bysyncify itself isn't aware of the C stack, just the wasm native call stack). |
Yes, I think I'm restoring that here (after saving it on the corresponding deschedule routine): could easily be doing it wrong though. I'll look into it some more. |
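For readers following along, here is a minimal sketch of what "saving and restoring the C stack" around a task switch could look like. The helper names and the task struct are assumptions for illustration (stand-ins for Emscripten's stack-pointer helpers and for whatever fields the PR actually uses), not the PR's code:

/* Illustrative sketch only -- assumed names, not the code in this PR.
 * Asyncify/Bysyncify handles the wasm call stack; the C "shadow" stack
 * that lives in linear memory has to be saved and restored separately. */
extern void *c_stack_save(void);          /* assumed helper: read the current C stack pointer */
extern void  c_stack_restore(void *sp);   /* assumed helper: reset the C stack pointer */

typedef struct task {
    void *c_sp;   /* C stack pointer captured when the task was descheduled */
    /* ... unwind buffer, task state, etc. ... */
} task_t;

static void deschedule(task_t *t)
{
    t->c_sp = c_stack_save();   /* save before unwinding the wasm stack */
}

static void reschedule(task_t *t)
{
    c_stack_restore(t->c_sp);   /* restore before rewinding the wasm stack */
}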
This seems to be working well now. |
For people following along, the status here is that this works well, though the bysyncify pass still seems to have a significant amount of overhead for our use case (as well as causing Chrome to complain about its stack being exceeded). I'm fairly confident that that can be addressed by being a bit more careful about which functions are transformed.
How much overhead are you seeing? Aside from the issue of picking which functions to transform (which I saw you opened an issue on, good idea!), I'd be surprised if this increases code size more than say 50% or so, and especially I'm surprised it overflows the native call stack in browsers. Maybe you can send me a testcase to look at? |
Alright, I've pushed a kf/bysyncifybench branch to https://github.com/Keno/julia-wasm. The hello.js and hello.wasm files in that directory are the build outputs with bysyncify enabled. The -no-bysyncify variants of those files are as the name implies. Local benchmarks: Bysyncify expands the .wasm file by 3.4x. Time benchmarks:
Without Bysyncify:
With Bysyncify:
The julia runtime itself was compiled with O3 (which is what the time benchmarks are measuring), but I re-used dependencies from an O0 build, so some of the size expansion may be easily explained by that. |
Thanks @Keno! Optimizations do seem to explain most of the code size issues here:
Based on my own testing, the size and speed overheads tend to be correlated, so I'm hopeful that building with optimizations will give a much smaller slowdown - hopefully just around 25% or so. (I wanted to try to measure speed myself, but wasn't sure how? When I open a webserver to |
Ugh. That .wasm file did come right out of emscripten, but I think I forgot to put the
Sorry, the frontend is a hacked up version of a frontend we had many years ago and is super janky (I just threw it together in 20 minutes). The place to go to is |
Ok, so with optimizations, I do see the small file size for the .wasm file, but unfortunately the timings are still in the ~350s range. In more encouraging news though it doesn't blow up the browsers anymore, so that's progress. |
Interestingly Chrome seems to be quite a bit better (now that the stack doesn't blow up anymore), clocking in at 81s on the optimized .wasm. |
Thanks @Keno, now I see the repl and I can enter commands, but what do I enter to run the benchmark? (I tried to enter the two lines I saw before, but the first already errors.) (81s vs 26s (over 3x slower) is surprisingly bad, so I'd like to do some profiling to investigate this.)
Sorry about that, forgot to push a change. Try with Keno/julia-wasm@52944b4 (on the same branch). |
Ok, that gets me past the first error:
julia> GC.enable(false)
true
julia> show(stdout, MIME"text/html", methods(gcd))
MethodError(f=typeof(Base.show)(), args=(Core.CoreSTDOUT(), Base.Multimedia.MIME{Symbol("text/html")}, Base.MethodList(ms=Array{Method, (8,)}[
Base.GMP.gcd(...),
Base.gcd(...),
Base.gcd(...),
Base.gcd(...),
Base.gcd(...),
Base.gcd(...),
Base.gcd(...),
Dates.gcd(...)], mt=Core.MethodTable(name=:gcd, defs=Core.TypeMapEntry(sig=Tuple{typeof(Base.gcd), Base.GMP.BigInt, Base.GMP.BigInt}, simplesig=nothing, guardsigs=svec(), min_world=0x0000312b, max_world=0xffffffff, func=Base.GMP.gcd(...), isleafsig=true, issimplesig=true, va=false, next=↩︎
Core.TypeMapEntry(sig=Tuple{typeof(Base.gcd), T, T} where T<:Union{Int128, Int16, Int32, Int64, Int8, UInt128, UInt16, UInt32, UInt64, UInt8}, simplesig=nothing, guardsigs=svec(), min_world=0x000022b9, max_world=0xffffffff, func=Base.gcd(...), isleafsig=false, issimplesig=false, va=false, next=↩︎
Core.TypeMapEntry(sig=Tuple{typeof(Base.gcd), Integer}, simplesig=nothing, guardsigs=svec(), min_world=0x000022bc, max_world=0xffffffff, func=Base.gcd(...), isleafsig=false, issimplesig=false, va=false, next=↩︎
Core.TypeMapEntry(sig=Tuple{typeof(Base.gcd), T, T} where T<:Integer, simplesig=nothing, guardsigs=svec(), min_world=0x000022b8, max_world=0xffffffff, func=Base.gcd(...), isleafsig=false, issimplesig=false, va=false, next=↩︎
Core.TypeMapEntry(sig=Tuple{typeof(Base.gcd), Integer, Integer}, simplesig=nothing, guardsigs=svec(), min_world=0x000022be, max_world=0xffffffff, func=Base.gcd(...), isleafsig=false, issimplesig=false, va=false, next=↩︎
Core.TypeMapEntry(sig=Tuple{typeof(Base.gcd), Integer, Vararg{Integer, N} where N}, simplesig=nothing, guardsigs=svec(), min_world=0x000022c0, max_world=0xffffffff, func=Base.gcd(...), isleafsig=false, issimplesig=false, va=true, next=↩︎
Core.TypeMapEntry(sig=Tuple{typeof(Base.gcd), AbstractArray{#s66, N} where N where #s66<:Integer}, simplesig=nothing, guardsigs=svec(), min_world=0x000022c3, max_world=0xffffffff, func=Base.gcd(...), isleafsig=false, issimplesig=false, va=false, next=↩︎
Core.TypeMapEntry(sig=Tuple{typeof(Base.gcd), P, P} where P<:Dates.Period, simplesig=nothing, guardsigs=svec(), min_world=0x00003ac7, max_world=0xffffffff, func=Dates.gcd(...), isleafsig=false, issimplesig=false, va=false, next=↩︎
nothing)))))))), cache=nothing, max_args=3, kwsorter=#<null>, module=Base, backedges=#<null>, =-222198320, =-222207920, offs=0x01, =0x00))), world=0x000042bb)
#<null>
error during run:
nothing
julia> |
Sorry, missing parens on the MIME call, should be: |
BTW, I'm hoping things should work with that, but if you run into any more trouble, I'm happy to get on a call (or stop by the julialang slack - https://slackinvite.julialang.org/) for more real-time communication. I know the setup is a bit janky at the moment. |
It works now, thanks! (and no problem, I understand things are messy atm) Ok, I measure a Chrome slowdown of 39% with bysyncify. That's more than the 25% I was hoping for, but still pretty reasonable I think. I'm hopeful we can improve that further without adding new options as I do more general optimization work (but I'm not sure by how much). |
What version of Chrome are you measuring with? I still fairly consistently see 300% overhead (more with firefox). I also realized that we may be using different bysyncify import lists. The one I'm using is
EDIT: Also, I'm transforming functions that have indirect calls, of course.
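To illustrate why indirect calls matter here (a generic example, not code from this PR): the Asyncify pass cannot see through a call via a function pointer, so any function containing one has to be instrumented conservatively, since the eventual callee might unwind.

/* Generic illustration (not from this PR): a function-pointer call.
 * The pass cannot know at compile time which callee this resolves to,
 * so it must assume the callee could unwind the stack and therefore
 * instruments walk_all() as well. */
typedef void (*visit_fn)(int item);

static void walk_all(const int *items, int n, visit_fn visit)
{
    for (int i = 0; i < n; i++)
        visit(items[i]);   /* indirect call: callee unknown statically */
}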
I've just run
What's your entire emcc command for linking? It should contain that bysyncify imports list plus
Alright, I've pushed updated files to the branch. |
Oh, hmm. The first time I loaded the new no-bysyncify page it took 37s, the next time it only took 22s. Perhaps there is some caching in Chrome that's been throwing off my measurements? |
I don't see the same behavior with bysyncify enabled however. There it seems to consistently take 54s whether or not it's the first time I'm loading the page. |
Also the thing I'm measuring is just the last invocation, since that's the thing that actually runs the wasm code for any appreciable amount of time. |
@kripken do you see the performance disparity with the new files I pushed? |
Caching can make things tricky to measure, yeah. May need to run multiple times after the first time to get something more stable. It may be good to figure out the code size issue first, as that is fully deterministic. I saw a 25% increase when running bysyncify on the optimized wasm. Is the 50% difference you see on optimized builds (-O3, say) for both bysyncify and not? One big difference in what we are doing is that I just ran the pass at the end, after optimizing the final wasm. If you send me the link command + input files for that, I can try to reproduce the entire build. |
@kripken Sorry for the delay. I've updated the julia-wasm repo with up-to-date instructions for building the whole thing. I've also put together a docker container that has everything from top to bottom here: https://drive.google.com/open?id=1RDz18kLMU9jgrt6G4u0aSPZuvw-n7zwo (see the rebuild_js.sh script in /root/julia-wasm to modify the final link line). Hope that works for you to reproduce.
Ok, after offline discussion I got it working locally, and can confirm the ~50% code size increase. That's in the expected range more or less, based on https://kripken.github.io/blog/assets/bysize.svg (from https://kripken.github.io/blog/wasm/2019/07/16/asyncify.html). Speed-wise, the noise makes it hard, but the
One possible noise issue: I see Chrome runs a CPU or two at max speed after loading the page. I believe it's compiling with TurboFan on a background thread. That means that if you test this immediately, you'll be measuring the baseline, and if you wait long enough, you'll be testing fully optimized code. So it's important to measure the same one in both cases. Disabling "WebAssembly baseline compiler" temporarily in chrome://flags might help.
With the baseline compiler disabled, I get about 22s for the no-asyncify version and 41s for the asyncify version (after waiting for it to finish compiling). That seems quite a bit worse than what you're measuring (in particular, your no-asyncify version seems a lot slower).
Interesting, maybe it depends on the browser? On Chrome 75 I see pretty consistent numbers like I posted before (testing a few more runs now, I see a 15% difference, with both around 50s). Are we perhaps measuring differently? Here's my full method: set that chrome flag to disable baseline ("WebAssembly baseline compiler"); load the page (takes longer on the asyncify build, I need to file a v8 bug on that); start measuring time; enter
But yeah, could just be a browser difference - a register allocator improvement between versions could easily explain the large difference between our results. In any case, definitely we want to improve this, and code size too. I have a few asyncify optimizations I want to finish, and after those we may add a list-based approach for fine-tuning. But I hope that meanwhile this isn't a blocking problem for landing this, even if it is a 2x slowdown?
Trying to run profilers on this, if the firefox profiler is to be believed then most of the time spent is in |
Hello, coroutines have been implemented in LLVM since 2016. Wouldn't it be simpler to handle this at the LLVM level?
No. LLVM does not support co-routines in any way on wasm. Furthermore, julia does not use LLVM's coroutine support even on platforms where it is supported, so it certainly wouldn't be simpler ;). |
That's my methodology too, except that I input the command first, and then start timing and hit enter at the same time, stopping it when it prints. It's still odd, since I'm also using Chrome 75 (on Mac OS), and especially the no-asyncify build is quite a bit faster.
Yes, that seems about right. I've also tried emscripten-core/emscripten#9094 and I do get some decent improvement (42s down to 34s), but the no-asyncify build is of course still at 22s. That's just Chrome, though. The difference is MUCH worse on firefox.
(in case anybody wants to reproduce, the blacklist I tried was
) |
src/task.c
@@ -319,9 +326,11 @@ static void ctx_switch(jl_ptls_t ptls, jl_task_t **pt)
        jl_swap_fiber(lastt_ctx, &t->ctx);
    }
    else {
#ifndef JL_HAVE_BYSYNCIFY
Should this be:
#ifdef COPY_STACKS
yes
With a compiled system image, the performance of the benchmark is now acceptable (about a second or so) even with ASYNCIFY enabled, so I think we're ok to merge it and turn it on by default. That said, ASYNCIFY does still introduce a noticeable delay, so we should of course keep working on performance. |
This is an implementation of Julia's coroutine/tasking system on top of the binaryen Asyncify transform [1] recently implemented by Alon Zakai. The wasm target is unusual in several ways:
1. It has an implicitly managed call stack that we may not modify directly (i.e. is only modified through calls/returns).
2. The event loop is inverted, in that the browser runs the main event loop and we get callbacks for events, rather than Julia being the main event loop.
Asyncify takes care of the first problem by providing a mechanism to explicitly unwind and rewind the implicitly managed stack (essentially copying it to an explicitly managed stack - see the linked implementation for more information). For the second, I am currently using the ptls root_task to represent the browser event loop, i.e. yielding to that task will return control back to the browser and this is the task in which functions called by javascript will run unless they explicitly construct a task to run. As a result, julia code executed in the main task may not perform any blocking operations (though this is currently not enforced). I think this is a sensible setup since the main task will want to run some minor julia code (e.g. to introspect some data structures), but the bulk of the code will run in their own tasks (e.g. the REPL backend task).
[1] https://github.com/WebAssembly/binaryen/blob/master/src/passes/Asyncify.cpp
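As a rough illustration of the mechanism described above, here is a hedged sketch of the per-task state an Asyncify-based switch needs. The struct layout and helper below are illustrative assumptions, not this PR's exact fields; the two-pointer header followed by a buffer follows the layout described in the Asyncify write-up linked above.

/* Illustrative sketch (assumed names/layout, not this PR's exact code).
 * Asyncify unwinds the implicitly managed wasm call stack into an
 * explicitly managed buffer, so each Julia task owns one such buffer. */
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t *unwound_top;   /* where the saved wasm frames currently end */
    uint8_t *buffer_end;    /* end of the buffer, to detect overflow     */
    uint8_t  buffer[];      /* saved call frames and locals live here    */
} asyncify_ctx_t;

typedef struct wasm_task {
    asyncify_ctx_t *ctx;    /* NULL until the task is first descheduled  */
    void *c_stack_sp;       /* C (linear-memory) stack pointer, saved
                               separately since Asyncify does not track it */
    struct wasm_task *next;
} wasm_task_t;

static asyncify_ctx_t *asyncify_ctx_alloc(size_t size)
{
    asyncify_ctx_t *ctx = malloc(sizeof(*ctx) + size);
    if (!ctx)
        return NULL;
    ctx->unwound_top = ctx->buffer;
    ctx->buffer_end  = ctx->buffer + size;
    return ctx;
}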
Rebased, addressed @vchuravy's review comments above, so this is good to go from my perspective. |
Just a couple small comments for my own understanding, nothing blocking.
LGTM
It would have been nice to be given a chance to review this before merging, but since it's only adding inactive code (and fixing a few conditional guards), it doesn't appear to hurt anything.
    // maybe check the kernel for new messages too
    if (jl_atomic_load(&jl_uv_n_waiters) == 0)
        jl_process_events(jl_global_event_loop());
#else
    // Yield back to browser event loop
    return ptls->root_task;
this seems unfortunate, since it appears to reflect that root_task means something different for jsvm and everyone else. it's also adding code to a part of the code that should be unreachable from wasm currently (it's only active in a threads-enabled build).
Well, root_task is the one that we are in when we start up the runtime and happens to be the one that runs the browser event loop. I don't think this is too different from the meaning of root_task for other platforms.
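As a concrete illustration of that constraint, a hedged sketch of an entry point called from JavaScript: it runs in root_task (the task standing in for the browser event loop), so it should only do short, non-blocking work and leave anything that might yield to a proper Julia task. EMSCRIPTEN_KEEPALIVE and jl_eval_string are real APIs, but the function name and the Julia-side condition variable are illustrative assumptions, not code from this PR.

#include <emscripten.h>
#include <julia.h>

/* Hypothetical entry point, exported so JS can call it directly.
 * It executes in ptls->root_task, i.e. the task that represents the
 * browser event loop, so it must not block or yield. */
EMSCRIPTEN_KEEPALIVE
void on_repl_input(void)
{
    /* Do only short, non-blocking work here: poke a Julia-side task
     * (e.g. a REPL backend task) and return to the browser immediately.
     * `Main.input_ready` is an assumed name for illustration. */
    jl_eval_string("isdefined(Main, :input_ready) && notify(Main.input_ready)");
}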
One drive-by comment: I wonder in what way coroutines would be unsupported in LLVM for WebAssembly because I've been using them successfully for about a year now with WebAssembly. It's my understanding that the entire coroutine lowering is target independent. |
Maybe see #32712 (comment)
This is an implementation of Julia's coroutine/tasking system on top of
the binaryen Bysyncify transform [1] recently implemented by @kripken.
The wasm target is unusual in several ways:
1. It has an implicitly managed call stack that we may not modify directly
   (i.e. is only modified through calls/returns).
2. The event loop is inverted, in that the browser runs the main event loop and
   we get callbacks for events, rather than Julia being the main event loop.
Bysyncify takes care of the first problem by providing a mechanism to explicitly
unwind and rewind the implicitly managed stack (essentially copying it to an
explicitly managed stack - see the linked implementation for more information).
For the second, I am currently using the ptls root_task to represent the browser
event loop, i.e. yielding to that task will return control back to the browser
and this is the task in which functions called by javascript will run unless they
explicitly construct a task to run. As a result, julia code executed in the main
task may not perform any blocking operations (though this is currently not enforced).
I think this is a sensible setup since the main task will want to run some minor julia
code (e.g. to introspect some data structures), but the bulk of the code will run in
their own tasks (e.g. the REPL backend task).
[1] https://github.com/WebAssembly/binaryen/blob/master/src/passes/Bysyncify.cpp