
Guard pages + stack caching #10460

Closed
wants to merge 4 commits into from

Conversation

emberian
Member

No description provided.

/// On POSIX, this can be used to specify the default flags passed to `mmap`. By default it uses
/// `MAP_PRIVATE` and, if not using `MapFd`, `MAP_ANON`. This will override both of those. This
/// is platform-specific (the exact values used) and unused on Windows.
MapUnsupportedFlags(c_int),
Member

MapUnknownFlags? Unsupported makes it sound a little like mmap doesn't support them, rather than Rust just not having them hardcoded. (Minor nitpick.)

Member Author

The idea is that they are entirely non-standard. But, every platform we support supports some of them!

Member

MapNonStandardFlags then?
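
For reference, a hedged C sketch of the semantics the doc comment describes (the helper name is hypothetical; MAP_POPULATE is just one Linux-only example of a flag Rust doesn't hardcode):

#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: the caller-supplied flags replace the default
 * MAP_PRIVATE | MAP_ANON pair wholesale, so platform-specific extras
 * such as Linux's MAP_POPULATE pass straight through to mmap. */
void *map_with_flags(size_t len, int flags) {
    return mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);
}

/* e.g. map_with_flags(len, MAP_PRIVATE | MAP_ANON | MAP_POPULATE) */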

@emberian
Member Author

So I'm ambivalent about this PR. On one hand, it brings us in line with how pthreads works, avoids malloc overhead, and gives us a first pass at guard-page stack safety. On the other hand, it hurts the microbenchmarks on Linux using glibc: it's about 2x slower on all of the simple "spawn a task as fast as we can then stop" benchmarks. Whatever glibc's allocator is doing, it's extremely friendly to these sorts of benchmarks (which basically just repeatedly allocate and free a 2M vector). But we are faster than jemalloc and, I assume, other platforms' mallocs. Plus, we're caching stacks.

Are we willing to take the hit on these benchmarks? Is there a better testcase than spawnalot for overall perf impact?

cc @alexcrichton @pcwalton @brson


/// A task's stack. The name "StackSegment" is a vestige of segmented stacks.
Member

Can we s/Segment// rather than add this comment?

Member Author

Sure

@alexcrichton
Member

Interesting! I'm curious if we're orders of magnitude slower than just calling mmap in a loop from C? If so, then there's probably a hidden problem somewhere.

Other than that, this all looks pretty good to me. I'm not a big fan of the failure on out-of-memory, but it seems easy enough to push upwards by returning Option<Stack>.

I'm not sure we want to set the cache limit to 50 stacks: this cache is a per-scheduler thing, and by default you have a scheduler per core, so the cache could become fairly large fairly quickly. It may want a limit of perhaps 10 instead of 50.

Overall, this looks good, but I would like to investigate more into the performance numbers before landing. If we're slower, then how come? Is it a fundamental problem using mmap? How much faster is malloc? How can malloc be faster? etc, etc.
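
To make both suggestions concrete, here's a hedged C sketch (not the PR's code; the names and the limit of 10 are illustrative) of a bounded per-scheduler cache whose allocation path reports out-of-memory to the caller, the C analogue of returning Option<Stack>:

#include <stddef.h>
#include <sys/mman.h>

#define CACHE_LIMIT 10                    /* per-scheduler cap, per the suggestion above */

struct stack { void *base; size_t size; };

static struct stack cache[CACHE_LIMIT];
static int cached;

/* Hand out a stack, preferring a cached mapping; base is NULL on OOM. */
static struct stack take_stack(size_t size) {
    for (int i = 0; i < cached; i++) {
        if (cache[i].size == size) {
            struct stack s = cache[i];
            cache[i] = cache[--cached];   /* swap-remove from the cache */
            return s;
        }
    }
    struct stack s = { NULL, size };
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON, -1, 0);
    if (p != MAP_FAILED)
        s.base = p;                       /* otherwise base stays NULL */
    return s;
}

/* Retire a stack: keep it warm if there's room, else return it to the OS. */
static void give_stack(struct stack s) {
    if (cached < CACHE_LIMIT)
        cache[cached++] = s;
    else
        munmap(s.base, s.size);
}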

@@ -10,14 +10,15 @@

//! Runtime environment settings

use from_str::FromStr;
use option::{Some, None};
use prelude::*;
Member

Could you leave this as not a glob import?

@emberian
Member Author

@alexcrichton we're faster than jemalloc. We were a touch slower than pthreads, which is almost just the mmap/mprotect loop (it adds a clone in there), though I think the simpler caching will bring that down. Apparently glibc does very heavy caching. Investigating more after the build finishes.

@brson
Contributor

brson commented Nov 14, 2013

Does this affect the maximum number of tasks we can spawn? I was under the impression there were limits to the number of mmapped regions we could create.

@huonw
Member

huonw commented Nov 14, 2013

On my system (x86-64 debian unstable):

$ sudo sysctl vm.max_map_count
vm.max_map_count = 65530

docs (https://www.kernel.org/doc/Documentation/sysctl/vm.txt):

This file contains the maximum number of memory map areas a process may have. Memory map areas are used as a side-effect of calling malloc, directly by mmap and mprotect, and also when loading shared libraries.

@emberian
Member Author

I'm running task-perf-one-million 40 to check, and so forth. It took 16s with the old code, I killed jemalloc after 7 minutes, and with the new code it indeed fails. The highest I could get was ./task-perf-one-million 31, which is 23282 tasks. With 31 instead of 40: old took 0.531s, old w/jemalloc took 5.378s, new took 0.594s.


@emberian
Member Author

Unfortunately that is a hard limit if we want guard pages... we could make
tasks request guard pages manually, but it's hard to know if you're ever
going to be calling FFI at any point.

@thestinger
Contributor

A 2MiB stack isn't enough for rustc so it's not a sane default. The Linux default is 8MiB so we should be comparing with a number like that - jemalloc stops caching at >4MiB and should fall right through to mmap. Large allocations cause silly amounts of fragmentation with glibc so the numbers will get much worse if you let it churn for a while doing something like vector reallocations as it goes.

@thestinger
Contributor

@cmr: I can definitely spawn way more POSIX threads than that with 4GiB of memory. They set a 1 page guard at the end of the allocation.

@emberian
Member Author

Yeah, never mind, that limit doesn't apply to individual mprotects within a range. It's possible to map a very large segment and split the stacks within it.

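A hedged C sketch of that idea (sizes and slot count are illustrative, not from the PR): reserve one large PROT_NONE region up front, then flip each stack's pages accessible as it's handed out, leaving an untouched page between neighbours as the guard:

#include <stddef.h>
#include <sys/mman.h>

#define STACK_SIZE (2 * 1024 * 1024)
#define GUARD_SIZE 4096
#define SLOT_SIZE  (STACK_SIZE + GUARD_SIZE)
#define NUM_SLOTS  256

static char *region;
static int   next_slot;

/* Carve the next stack out of the shared reservation; NULL when exhausted. */
static void *carve_stack(void) {
    if (region == NULL) {
        void *r = mmap(NULL, (size_t)NUM_SLOTS * SLOT_SIZE, PROT_NONE,
                       MAP_PRIVATE | MAP_ANON, -1, 0);
        if (r == MAP_FAILED)
            return NULL;
        region = r;
    }
    if (next_slot == NUM_SLOTS)
        return NULL;
    char *slot = region + (size_t)next_slot++ * SLOT_SIZE;
    /* Everything above the low guard page becomes usable stack. */
    if (mprotect(slot + GUARD_SIZE, STACK_SIZE, PROT_READ | PROT_WRITE) != 0)
        return NULL;
    return slot + SLOT_SIZE;    /* top of the stack; it grows down into the slot */
}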

@emberian
Member Author

I cannot explain this behavior at all.

RUST_MIN_STACK=8388608 ./silly-test-spawn-mmap  1.82s user 1.64s system 173% cpu 1.989 total    
RUST_MIN_STACK=8388608 ./silly-test-spawn  2.21s user 10.52s system 207% cpu 6.143 total
RUST_MIN_STACK=8388608 jemalloc.sh ./silly-test-spawn  2.85s user 25.95s system 111% cpu 25.774 total
RUST_MIN_STACK=4194304 jemalloc.sh ./silly-test-spawn  2.88s user 26.56s system 111% cpu 26.367 total                                                                                                               
RUST_MIN_STACK=4194304 ./silly-test-spawn  2.19s user 10.86s system 233% cpu 5.583 total
RUST_MIN_STACK=4194304 ./silly-test-spawn-mmap  2.01s user 10.30s system 116% cpu 10.598 total

where silly-test-spawn.rs is

fn main() {
    for _ in range(1, 100_000) {
        do spawn { }
    }
}

Why would 4M stacks be 5x slower than 8M stacks?

@emberian
Member Author

(With the mmap code, it maps in a single stack and just keeps using that, due to the caching)

@thestinger
Contributor

Well, the base unit in tcmalloc is 2 pages, and so is the default Linux stack size. Perhaps there's a good reason for using it. :)

@emberian
Member Author

Erm... actually, it seems a rust build had kicked off in the background. Probably should have killed that first! New numbers:

[1:31:17]~/hacking/rust/src/test/bench> for s in $stacks; do for e in $exes; do echo "stacksize=$s exe=$e"; RUST_MIN_STACK=$e /usr/bin/time -f '%E' $e; done done
stacksize=4194304 exe=./silly-test-spawn-mmap
0:01.76
stacksize=4194304 exe=./silly-test-spawn
0:00.92
stacksize=8388608 exe=./silly-test-spawn-mmap
0:01.75
stacksize=8388608 exe=./silly-test-spawn
0:00.89
stacksize=1048576 exe=./silly-test-spawn-mmap
0:01.81
stacksize=1048576 exe=./silly-test-spawn
0:00.89
stacksize=8092 exe=./silly-test-spawn-mmap
0:01.72
stacksize=8092 exe=./silly-test-spawn
0:00.94
[1:31:35]~/hacking/rust/src/test/bench> for s in $stacks; do for e in $exes; do echo "stacksize=$s exe=$e"; RUST_MIN_STACK=$e /usr/bin/time -f '%E' jemalloc.sh $e; done done
stacksize=4194304 exe=./silly-test-spawn-mmap
0:01.82
stacksize=4194304 exe=./silly-test-spawn
0:01.13
stacksize=8388608 exe=./silly-test-spawn-mmap
0:01.80
stacksize=8388608 exe=./silly-test-spawn
0:01.12
stacksize=1048576 exe=./silly-test-spawn-mmap
0:01.78
stacksize=1048576 exe=./silly-test-spawn
0:01.13
stacksize=8092 exe=./silly-test-spawn-mmap
0:01.80
stacksize=8092 exe=./silly-test-spawn
0:01.12

@thestinger
Contributor

@cmr: what if you raise vm.max_map_count and set vm.overcommit_memory to 1 (overcommit) instead of 0 (heuristic overcommit)?
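
(For the record, that would be something like the following; the map-count value here is arbitrary:)

$ sudo sysctl -w vm.max_map_count=262144
$ sudo sysctl -w vm.overcommit_memory=1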

@emberian
Member Author

Well, it makes them all, but segfaults in __deallocate_stack(), called by
pthread_join.

@alexcrichton
Member

I'm still not understanding how there's a path forward with this. Even if we do use jemalloc, it looks like the system malloc is always faster than jemalloc (even without mprotect)?

@thestinger
Contributor

@alexcrichton: It's faster than jemalloc for this size of allocation because it's not worrying about fragmentation and will end up causing pathological performance and memory usage in the long run.

If we want stack safety, we have to pay the cost of calling mmap and mprotect just like pthread_create does. If we want even less stack safety than C threads, then sure, we don't need this. Task spawning performance doesn't matter very much because tasks don't replace thread pools for CPU-bound code.
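
For concreteness, a minimal sketch of that mmap + mprotect recipe, roughly what pthread_create does for a thread stack (not the PR's actual code; assumes a downward-growing stack and a one-page guard):

#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map stack + guard read-write, then revoke access to the lowest page so a
 * stack overflow faults instead of silently corrupting adjacent memory. */
static void *alloc_guarded_stack(size_t stack_size) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t total = stack_size + page;
    void *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANON, -1, 0);
    if (base == MAP_FAILED)
        return NULL;
    if (mprotect(base, page, PROT_NONE) != 0) {   /* the guard page */
        munmap(base, total);
        return NULL;
    }
    return (char *)base + total;    /* top of the stack */
}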

@emberian
Member Author

@thestinger is there any way to quantify fragmentation?

One problem with this is that it's not easy to decide when we need to allocate a huge block to put stacks on. By the time we know we need to be using large allocations, it's too late. We can depend on overcommit and just allocate huge (say, 1G) blocks and slice them up per stack. At this point we're rewriting malloc. Or, maybe, get the map limit at runtime and when we get to limit/2, start allocating 2 stacks at once, limit/4, 4 at once, and so forth. Or something.

Some careful juggling is needed if we want small stacks. But I'm not sure those are really possible without reworking tons of things, and I'm not even certain this sort of scheme would be workable for tiny tasks. What does Go do here?

@thestinger
Contributor

The segmented stack prelude is a 2-3% performance hit overall and makes it impossible to use freestanding Rust without building clang near the same LLVM revision as Rust and compiling the rustc bytecode with it. I think that's a far more important cost to worry about than a micro-benchmark spawning no-op tasks. Windows provides safety from stack overflow by default without the same overhead, and gcc provides it via -fstack-check. It's very sad for Rust to be less safe than C with regard to stack safety while also spending more resources on it.

@thestinger
Contributor

@cmr: Try using detached threads in C to match what Rust is doing, or synchronizing the Rust tasks to prevent them from exiting immediately if it's not already doing that.

@emberian
Member Author

I have rebased this. I don't know enough about either side of this issue to effectively debug the benchmark. I don't particularly mind it, though.

Moving forward, we could add some sort of heuristic: after allocating X segments, decide that there are probably going to be lots more spawns soon, and allocate a chunk and dish out stacks from it with mprotects between. We should also probably add an interface to the Runtime to say "make a big stack buffer, you're gonna need it", to get around the (probably flawed) heuristic.

IMO this is mergeable.

@alexcrichton
Member

I have put this on the next meeting agenda.

@alexcrichton
Member

We discussed this at the meeting today, and it seems like we would want to merge this, but then I remembered that this does not protect against overflow in native tasks (we'd want that guaranteed on all platforms; windows may already have that guarantee).

I'm unsure of whether we can protect against this with native tasks, and if we don't protect against it in native tasks, we probably shouldn't try to protect against it in green tasks (especially because native threads will probably become the default).

@emberian
Member Author

emberian commented Jan 7, 2014

Why wouldn't this protect native tasks?


@thestinger
Contributor

@alexcrichton: Native threads already have guard pages as set by pthread_attr_setguardsize (a default of one page on Linux with glibc).
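
A minimal illustration of that attribute (one page here, matching the glibc default; POSIX treats the value as a lower bound):

#include <pthread.h>

static void *task(void *arg) { (void)arg; return NULL; }

int main(void) {
    pthread_attr_t attr;
    pthread_t tid;
    pthread_attr_init(&attr);
    /* Ask for (at least) a 4096-byte guard region below the thread's stack. */
    pthread_attr_setguardsize(&attr, 4096);
    pthread_create(&tid, &attr, task, NULL);
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}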

@alexcrichton
Member

This is only allocating stacks for green tasks (in libgreen); stack allocation for native tasks happens through pthreads, and in pthreads I don't think we're guaranteed to have a guard page at the end of the thread's stack.

@alexcrichton
Member

Oh interesting! It appears that there is pthread_attr_setguardsize on OSX (although it doesn't have a manpage), and I guess it will probably be on freebsd as well. Are we sure that some equivalent exists on windows/android?

@thestinger
Contributor

It's part of POSIX since the early revisions, so it should exist almost everywhere. Windows has full stack safety by default, much like gcc with -fstack-check on Linux.

@alexcrichton
Member

Excellent! In that case, @cmr, could you rebase this patch for green threads?

I'll open an issue for native threads.

This also fixes up the documentation a bit; it was subtly incorrect.
// FIXME #7767: Putting main into a ~ so it's a thin pointer and can
// be passed to the spawn function. Another unfortunate
// allocation
let start = ~start;
Member

I think this was a rebase that went wrong by a little

Member Author

Indeed it was. I was wondering why there was a ~~proc slipping in somewhere.

Member

r=me with this touch-up

@alexcrichton
Member

ping, I can take over this if you'd like.

@emberian
Member Author

That'd be nice; I really just don't have the time right now to hunt down bugs on platforms I don't have ready access to.

@alexcrichton mentioned this pull request Jan 23, 2014
@alexcrichton
Member

Closing in favor of #11756
