Guard pages + stack caching #10460
Conversation
/// On POSIX, this can be used to specify the default flags passed to `mmap`. By default it uses
/// `MAP_PRIVATE` and, if not using `MapFd`, `MAP_ANON`. This will override both of those. This
/// is platform-specific (the exact values used) and unused on Windows.
MapUnsupportedFlags(c_int),
`MapUnknownFlags`? `Unsupported` makes it sound a little like mmap doesn't support them, rather than Rust just not having them hardcoded. (Minor nitpick.)
The idea is that they are entirely non-standard. But, every platform we support supports some of them!
`MapNonStandardFlags`, then?
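For context, the doc comment above describes an escape hatch for platform-specific mmap flags. A minimal sketch of how such an override might behave, using hypothetical names and direct libc constants (the PR's actual API may differ):

```rust
use libc::{c_int, MAP_ANON, MAP_PRIVATE};

// Hypothetical helper: if the caller supplies non-standard, platform-specific
// flags, they replace the defaults entirely; otherwise fall back to a plain
// private anonymous mapping.
fn mmap_flags(non_standard: Option<c_int>) -> c_int {
    match non_standard {
        Some(flags) => flags,
        None => MAP_PRIVATE | MAP_ANON,
    }
}

fn main() {
    // e.g. on Linux one might pass MAP_PRIVATE | MAP_ANON | MAP_STACK here.
    assert_eq!(mmap_flags(None), MAP_PRIVATE | MAP_ANON);
}
```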
So I'm ambivalent about this PR. On one side, it brings us in line with how pthreads works, avoids malloc overhead, and gives us a first pass at guard page stack safety. On the other hand, it hurts the microbenchmarks on Linux using glibc: this is about 2x slower on all of the simple "spawn a task as fast as we can then stop" benchmarks using glibc. Whatever its allocator is doing, it's extremely friendly to these sorts of benchmarks (it's basically just repeatedly allocating and freeing a 2M vector). But we are faster than jemalloc and, I assume, other platforms' mallocs. Plus, we're caching stacks.

Are we willing to take the hit on these benchmarks? Is there a better test case than spawnalot for overall perf impact?
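To make the "repeatedly allocating and freeing a 2M vector" comparison concrete, here is a rough sketch (not from the PR) of what the spawn benchmark reduces to from the allocator's point of view when stacks come from malloc:

```rust
// Roughly the allocation pattern glibc sees when task stacks come from malloc:
// allocate and free the same 2 MiB block over and over, which its caching
// handles extremely well.
fn main() {
    for _ in 0..100_000 {
        let stack = vec![0u8; 2 << 20]; // stand-in for a 2 MiB task stack
        std::hint::black_box(&stack);   // keep the allocation from being optimized out
        drop(stack);
    }
}
```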
/// A task's stack. The name "StackSegment" is a vestige of segmented stacks.
Can we s/Segment// rather than add this comment?
Sure
Interesting! I'm curious if we're orders of magnitude slower than just calling mmap in a loop from C? If so, then there's probably a hidden problem somewhere. Other than that, this all looks pretty good to me. I'm not a big fan of the failure on out-of-memory, but it seems easy enough to push upwards by returning

I'm not sure if we want to set the cache limit to 50 stacks, because this cache is a per-scheduler thing, and by default you have a scheduler per core, so the cache could become fairly large fairly quickly, maybe? It may also just want a limit of perhaps 10 instead of 50.

Overall, this looks good, but I would like to investigate the performance numbers more before landing. If we're slower, then how come? Is it a fundamental problem with using mmap? How much faster is malloc? How can malloc be faster? etc, etc.
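As a baseline for the "mmap in a loop from C" comparison, something like the following sketch (assuming the `libc` crate; the 2 MiB stack size is an assumption) measures the raw mmap/munmap cost without any caching:

```rust
use libc::{mmap, munmap, MAP_ANON, MAP_FAILED, MAP_PRIVATE, PROT_READ, PROT_WRITE};
use std::ptr;
use std::time::Instant;

fn main() {
    // 2 MiB, an assumed default task stack size for this sketch.
    const STACK_SIZE: usize = 2 << 20;
    let iterations = 100_000u32;
    let start = Instant::now();
    for _ in 0..iterations {
        unsafe {
            // Map a fresh anonymous region and immediately unmap it, which is
            // what an uncached stack allocation costs before any guard page.
            let p = mmap(ptr::null_mut(), STACK_SIZE,
                         PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
            assert!(p != MAP_FAILED, "mmap failed");
            munmap(p, STACK_SIZE);
        }
    }
    println!("{:?} per mmap/munmap pair", start.elapsed() / iterations);
}
```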
@@ -10,14 +10,15 @@
//! Runtime environment settings

use from_str::FromStr;
use option::{Some, None};
use prelude::*;
Could you leave this as not a glob import?
@alexcrichton we're faster than jemalloc. We were a touch slower than pthreads, which is almost just the mmap/mprotect loop (it adds a clone in there), though I think the simpler caching will bring that down. Apparently glibc does very heavy caching. Investigating more after the build finishes.
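For illustration, the per-scheduler caching being discussed amounts to something like the following sketch (names and the limit are illustrative, not the PR's exact code):

```rust
// A stack with its backing memory; allocation/freeing elided in this sketch.
struct Stack {
    buf: *mut u8,
    size: usize,
}

// Per-scheduler cache: keep freed stacks around (up to a limit) so most
// spawns skip mmap entirely.
struct StackPool {
    cached: Vec<Stack>,
    max_cached: usize, // e.g. 10, per the suggestion above
}

impl StackPool {
    fn take(&mut self, size: usize) -> Stack {
        // Reuse a cached stack of the right size if we have one...
        if let Some(i) = self.cached.iter().position(|s| s.size == size) {
            return self.cached.swap_remove(i);
        }
        // ...otherwise fall back to a fresh guard-paged mmap (not shown here).
        unimplemented!("allocate a new guard-paged stack")
    }

    fn give_back(&mut self, stack: Stack) {
        if self.cached.len() < self.max_cached {
            self.cached.push(stack); // keep it warm for the next spawn
        }
        // else: drop it, which would munmap(stack.buf) in a real implementation
    }
}
```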
Does this affect the maximum number of tasks we can spawn? I was under the impression there were limits to the number of mmapped regions we could create.
On my system (x86-64 debian unstable):
I'm running
Unfortunately that is a hard limit if we want guard pages... we could make
A 2MiB stack isn't enough for
@cmr: I can definitely spawn way more POSIX threads than that with 4GiB of memory. They set a 1 page guard at the end of the allocation.
Yeah, nevermind, that limit doesn't apply to individual mprotects within a
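The limit being discussed is Linux's per-process mapping count. A quick way to check it (a sketch, not part of this PR) is to read it from procfs:

```rust
use std::fs;

// Read the kernel's per-process mapping limit, which bounds how many distinct
// mmap'd regions (and thus guard-paged stacks) a process can hold on Linux.
fn max_map_count() -> Option<u64> {
    fs::read_to_string("/proc/sys/vm/max_map_count")
        .ok()?
        .trim()
        .parse()
        .ok()
}

fn main() {
    println!("vm.max_map_count = {:?}", max_map_count());
}
```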
I cannot explain this behavior at all.

where silly-test-spawn.rs is:

```rust
fn main() {
    for _ in range(1, 100_000) {
        do spawn { }
    }
}
```

Why would 4M stacks be 5x slower than 8M stacks?
(With the mmap code, it maps in a single stack and just keeps using that, due to the caching) |
Well, the base unit in tcmalloc is 2 pages and so is the default Linux stack size. Perhaps there's a good reason for using it. :)
Erm... actually, it seems that some rust builds just kicked off in the background. Probably should have killed those first! New numbers:
@cmr: what if you raise
Well, it makes them all, but segfaults in
I'm still not understanding how there's a path forward with this. Even if we do use jemalloc, it looks like the system malloc is always faster than jemalloc (even without mprotect)?
@alexcrichton: It's faster than

If we want stack safety, we have to pay the cost of calling
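For reference, the per-stack cost being discussed is roughly one mmap plus one mprotect. A minimal sketch of a guard-paged stack allocation, assuming the `libc` crate and a 4 KiB page size:

```rust
use libc::{mmap, mprotect, MAP_ANON, MAP_FAILED, MAP_PRIVATE, PROT_NONE, PROT_READ, PROT_WRITE};
use std::ptr;

const PAGE: usize = 4096; // assumption; real code should query sysconf(_SC_PAGESIZE)

/// Map `size` bytes of stack plus one extra page, then revoke access to the
/// lowest page so an overflow faults instead of silently corrupting memory.
unsafe fn alloc_guarded_stack(size: usize) -> *mut u8 {
    let total = size + PAGE;
    let base = mmap(ptr::null_mut(), total, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANON, -1, 0);
    assert!(base != MAP_FAILED, "mmap failed");
    assert_eq!(mprotect(base, PAGE, PROT_NONE), 0, "mprotect failed");
    // Usable stack memory starts just past the guard page.
    (base as *mut u8).add(PAGE)
}
```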
@thestinger is there any way to quantify fragmentation?

One problem with this is that it's not easy to decide when we need to allocate a huge block to put stacks on. By the time we know we need to be using large allocations, it's too late. We could depend on overcommit and just allocate huge (say, 1G) blocks and slice them up per stack, but at that point we're rewriting malloc. Or, maybe, get the map limit at runtime and, when we get to limit/2, start allocating 2 stacks at once; at limit/4, 4 at once; and so forth. Or something. Some careful juggling is needed if we want small stacks.

But I'm not sure those are really possible without reworking tons of things, and I'm not even certain this sort of scheme would be workable for tiny tasks. What does Go do here?
The segmented stack prelude is a 2-3% performance hit overall and makes it impossible to use freestanding Rust without building
@cmr: Try using detached threads in C to match what Rust is doing, or synchronizing the Rust tasks to prevent them from exiting immediately if it's not already doing that.
I have rebased this. I don't know enough about either side of this issue to effectively debug the benchmark. I don't particularly mind it, though.

Moving forward, we could add some sort of heuristic that, after allocating X segments, decides "there are probably going to be lots more spawns soon!" and allocates a chunk and dishes out stacks from it with mprotects between them (as sketched below). We should also probably add an interface to the Runtime to say "make a big stack buffer, you're gonna need it", to get around the (probably flawed) heuristic.

IMO this is mergeable.
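The "allocate a chunk and dish out stacks with mprotects between" idea could look roughly like this (a sketch under the same `libc` assumptions as above; names are hypothetical):

```rust
use libc::{c_void, mmap, mprotect, MAP_ANON, MAP_FAILED, MAP_PRIVATE,
           PROT_NONE, PROT_READ, PROT_WRITE};
use std::ptr;

const PAGE: usize = 4096; // assumed page size

/// Map one large region and carve it into `count` stacks of `stack_size`
/// bytes, each preceded by a single PROT_NONE guard page.
unsafe fn carve_stacks(stack_size: usize, count: usize) -> Vec<*mut u8> {
    let slot = stack_size + PAGE;
    let base = mmap(ptr::null_mut(), slot * count, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANON, -1, 0);
    assert!(base != MAP_FAILED);
    (0..count).map(|i| {
        let guard = (base as *mut u8).add(i * slot);
        // Each slot's lowest page becomes the guard for the stack above it.
        assert_eq!(mprotect(guard as *mut c_void, PAGE, PROT_NONE), 0);
        guard.add(PAGE) // usable stack begins just past the guard
    }).collect()
}
```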
I have put this on the next meeting agenda.
We discussed this at the meeting today, and it seems like we would want to merge this, but then I remembered that this does not protect against overflow in native tasks (that's not guaranteed on all platforms; Windows may already have that guarantee). I'm unsure of whether we can protect against this with native tasks, and if we don't protect against it in native tasks, we probably shouldn't try to protect against it in green tasks (especially because native threads will probably become the default).
Why wouldn't this protect native tasks?
@alexcrichton: Native threads already have guard pages as set by
This is only allocating stacks for green tasks (in libgreen); stack allocation for native tasks happens through pthreads, and in pthreads I don't think we're guaranteed to have a guard page at the end of the thread's stack.
Oh interesting! It appears that there is
It's been part of POSIX since the early revisions, so it should exist almost everywhere. Windows has full stack safety by default, much like gcc with
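For the native-thread side, the pthreads knob referenced above (`pthread_attr_setguardsize`) is used roughly like this (a sketch assuming the `libc` crate; error handling omitted):

```rust
use libc::{pthread_attr_init, pthread_attr_setguardsize, pthread_attr_t,
           pthread_create, pthread_join, pthread_t};
use std::{mem, ptr};

extern "C" fn thread_main(_arg: *mut libc::c_void) -> *mut libc::c_void {
    ptr::null_mut()
}

fn main() {
    unsafe {
        let mut attr: pthread_attr_t = mem::zeroed();
        pthread_attr_init(&mut attr);
        // Ask for (at least) one page of PROT_NONE at the end of the thread's
        // stack, so overflow faults instead of clobbering adjacent memory.
        pthread_attr_setguardsize(&mut attr, 4096);

        let mut tid: pthread_t = mem::zeroed();
        pthread_create(&mut tid, &attr, thread_main, ptr::null_mut());
        pthread_join(tid, ptr::null_mut());
    }
}
```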
Excellent! In that case, @cmr, could you rebase this patch for green threads? I'll open an issue for native threads.
This also fixes up the documentation a bit; it was subtly incorrect.
Also implement caching of stacks.
// FIXME #7767: Putting main into a ~ so it's a thin pointer and can
// be passed to the spawn function. Another unfortunate
// allocation
let start = ~start;
I think this was a rebase that went wrong by a little
Indeed it was. I was wondering why there was a ~~proc slipping in somewhere.
r=me with this touch-up
ping, I can take over this if you'd like.
That'd be nice; I really just don't have the time right now to hunt down bugs on platforms I don't have ready access to.
Closing in favor of #11756