
Guard pages + stack caching #10460

Closed
wants to merge 4 commits into from

Conversation

emberian
Member

No description provided.

/// On POSIX, this can be used to specify the default flags passed to `mmap`. By default it uses
/// `MAP_PRIVATE` and, if not using `MapFd`, `MAP_ANON`. This will override both of those. This
/// is platform-specific (the exact values used) and unused on Windows.
MapUnsupportedFlags(c_int),
Member

MapUnknownFlags? Unsupported makes it sound a little like mmap doesn't support them, rather than Rust just not having them hardcoded. (Minor nitpick.)

Member Author

The idea is that they are entirely non-standard. But, every platform we support supports some of them!

Member

MapNonStandardFlags then?
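
For reference, a hedged C sketch of the semantics the doc comment describes (the helper name is hypothetical; MAP_POPULATE is just one Linux-only example of a flag Rust doesn't hardcode):

#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: the caller-supplied flags replace the default
 * MAP_PRIVATE | MAP_ANON pair wholesale, so platform-specific extras
 * such as Linux's MAP_POPULATE pass straight through to mmap. */
void *map_with_flags(size_t len, int flags) {
    return mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);
}

/* e.g. map_with_flags(len, MAP_PRIVATE | MAP_ANON | MAP_POPULATE) */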

@emberian
Member Author

So I'm ambivalent about this PR. On one hand, it brings us in line with how pthreads works, avoids malloc overhead, and gives us a first pass at guard-page stack safety. On the other hand, it hurts the microbenchmarks on Linux using glibc: it's about 2x slower on all of the simple "spawn a task as fast as we can then stop" benchmarks. Whatever glibc's allocator is doing, it's extremely friendly to these sorts of benchmarks (which basically just repeatedly allocate and free a 2M vector). But we are faster than jemalloc and, I assume, other platforms' mallocs. Plus, we're caching stacks.

Are we willing to take the hit on these benchmarks? Is there a better testcase than spawnalot for overall perf impact?

cc @alexcrichton @pcwalton @brson


/// A task's stack. The name "StackSegment" is a vestige of segmented stacks.
Member

Can we s/Segment// rather than add this comment?

Member Author

Sure

@alexcrichton
Member

Interesting! I'm curious if we're orders of magnitude slower than just calling mmap in a loop from C? If so, then there's probably a hidden problem somewhere.

Other than that, this all looks pretty good to me. I'm not a big fan of the failure on out-of-memory, but it seems easy enough to push upwards by returning Option<Stack>.

I'm not sure we want to set the cache limit to 50 stacks: this cache is a per-scheduler thing, and by default you have a scheduler per core, so the cache could become fairly large fairly quickly. It may want a limit of perhaps 10 instead of 50.

Overall, this looks good, but I would like to investigate more into the performance numbers before landing. If we're slower, then how come? Is it a fundamental problem using mmap? How much faster is malloc? How can malloc be faster? etc, etc.
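
To make both suggestions concrete, here's a hedged C sketch (not the PR's code; the names and the limit of 10 are illustrative) of a bounded per-scheduler cache whose allocation path reports out-of-memory to the caller, the C analogue of returning Option<Stack>:

#include <stddef.h>
#include <sys/mman.h>

#define CACHE_LIMIT 10                    /* per-scheduler cap, per the suggestion above */

struct stack { void *base; size_t size; };

static struct stack cache[CACHE_LIMIT];
static int cached;

/* Hand out a stack, preferring a cached mapping; base is NULL on OOM. */
static struct stack take_stack(size_t size) {
    for (int i = 0; i < cached; i++) {
        if (cache[i].size == size) {
            struct stack s = cache[i];
            cache[i] = cache[--cached];   /* swap-remove from the cache */
            return s;
        }
    }
    struct stack s = { NULL, size };
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON, -1, 0);
    if (p != MAP_FAILED)
        s.base = p;                       /* otherwise base stays NULL */
    return s;
}

/* Retire a stack: keep it warm if there's room, else return it to the OS. */
static void give_stack(struct stack s) {
    if (cached < CACHE_LIMIT)
        cache[cached++] = s;
    else
        munmap(s.base, s.size);
}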

@@ -10,14 +10,15 @@

//! Runtime environment settings

use from_str::FromStr;
use option::{Some, None};
use prelude::*;
Member

Could you leave this as not a glob import?

@emberian
Member Author

@alexcrichton we're faster than jemalloc. We were a touch slower than pthreads, which is almost just the mmap/mprotect loop (it adds a clone in there), though I think the simpler caching will bring that down. Apparently glibc does very heavy caching. Investigating more after the build finishes.

@brson
Contributor

brson commented Nov 14, 2013

Does this affect the maximum number of tasks we can spawn? I was under the impression there were limits to the number of mmapped regions we could create.

@huonw
Member

huonw commented Nov 14, 2013

On my system (x86-64 debian unstable):

$ sudo sysctl vm.max_map_count
vm.max_map_count = 65530

docs (https://www.kernel.org/doc/Documentation/sysctl/vm.txt):

This file contains the maximum number of memory map areas a process may have. Memory map areas are used as a side-effect of calling malloc, directly by mmap and mprotect, and also when loading shared libraries.

@emberian
Member Author

I'm running task-perf-one-million 40 to check, and so forth. It took 16s with the old code, I killed jemalloc after 7 minutes, and with the new code it indeed fails. The highest I could get was ./task-perf-one-million 31, which is 23282 tasks. With 31 instead of 40: old took 0.531s, old w/jemalloc took 5.378s, new took 0.594s.


@emberian
Member Author

Unfortunately that is a hard limit if we want guard pages... we could make
tasks request guard pages manually, but it's hard to know if you're ever
going to be calling FFI at any point.

@thestinger
Contributor

A 2MiB stack isn't enough for rustc so it's not a sane default. The Linux default is 8MiB so we should be comparing with a number like that - jemalloc stops caching at >4MiB and should fall right through to mmap. Large allocations cause silly amounts of fragmentation with glibc so the numbers will get much worse if you let it churn for a while doing something like vector reallocations as it goes.

@thestinger
Contributor

@cmr: I can definitely spawn way more POSIX threads than that with 4GiB of memory. They set a 1 page guard at the end of the allocation.

@emberian
Member Author

Yeah, never mind, that limit doesn't apply to individual mprotects within a range. It's possible to map a very large segment and split the stacks within it.

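A hedged C sketch of that idea (sizes and slot count are illustrative, not from the PR): reserve one large PROT_NONE region up front, then flip each stack's pages accessible as it's handed out, leaving an untouched page between neighbours as the guard:

#include <stddef.h>
#include <sys/mman.h>

#define STACK_SIZE (2 * 1024 * 1024)
#define GUARD_SIZE 4096
#define SLOT_SIZE  (STACK_SIZE + GUARD_SIZE)
#define NUM_SLOTS  256

static char *region;
static int   next_slot;

/* Carve the next stack out of the shared reservation; NULL when exhausted. */
static void *carve_stack(void) {
    if (region == NULL) {
        void *r = mmap(NULL, (size_t)NUM_SLOTS * SLOT_SIZE, PROT_NONE,
                       MAP_PRIVATE | MAP_ANON, -1, 0);
        if (r == MAP_FAILED)
            return NULL;
        region = r;
    }
    if (next_slot == NUM_SLOTS)
        return NULL;
    char *slot = region + (size_t)next_slot++ * SLOT_SIZE;
    /* Everything above the low guard page becomes usable stack. */
    if (mprotect(slot + GUARD_SIZE, STACK_SIZE, PROT_READ | PROT_WRITE) != 0)
        return NULL;
    return slot + SLOT_SIZE;    /* top of the stack; it grows down into the slot */
}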

@emberian
Member Author

I cannot explain this behavior at all.

RUST_MIN_STACK=8388608 ./silly-test-spawn-mmap  1.82s user 1.64s system 173% cpu 1.989 total    
RUST_MIN_STACK=8388608 ./silly-test-spawn  2.21s user 10.52s system 207% cpu 6.143 total
RUST_MIN_STACK=8388608 jemalloc.sh ./silly-test-spawn  2.85s user 25.95s system 111% cpu 25.774 total
RUST_MIN_STACK=4194304 jemalloc.sh ./silly-test-spawn  2.88s user 26.56s system 111% cpu 26.367 total                                                                                                               
RUST_MIN_STACK=4194304 ./silly-test-spawn  2.19s user 10.86s system 233% cpu 5.583 total
RUST_MIN_STACK=4194304 ./silly-test-spawn-mmap  2.01s user 10.30s system 116% cpu 10.598 total

where silly-test-spawn.rs is

fn main() {
    for _ in range(1, 100_000) {
        do spawn { }
    }
}

Why would 4M stacks be 5x slower than 8M stacks?

@emberian
Member Author

(With the mmap code, it maps in a single stack and just keeps using that, due to the caching)

@thestinger
Contributor

Well, the base unit in tcmalloc is 2 pages, and so is the default Linux stack size. Perhaps there's a good reason for using it. :)

@emberian
Member Author

Erm... actually, it seems a rust build had kicked off in the background. Probably should have killed that first! New numbers:

[1:31:17]~/hacking/rust/src/test/bench> for s in $stacks; do for e in $exes; do echo "stacksize=$s exe=$e"; RUST_MIN_STACK=$e /usr/bin/time -f '%E' $e; done done
stacksize=4194304 exe=./silly-test-spawn-mmap
0:01.76
stacksize=4194304 exe=./silly-test-spawn
0:00.92
stacksize=8388608 exe=./silly-test-spawn-mmap
0:01.75
stacksize=8388608 exe=./silly-test-spawn
0:00.89
stacksize=1048576 exe=./silly-test-spawn-mmap
0:01.81
stacksize=1048576 exe=./silly-test-spawn
0:00.89
stacksize=8092 exe=./silly-test-spawn-mmap
0:01.72
stacksize=8092 exe=./silly-test-spawn
0:00.94
[1:31:35]~/hacking/rust/src/test/bench> for s in $stacks; do for e in $exes; do echo "stacksize=$s exe=$e"; RUST_MIN_STACK=$e /usr/bin/time -f '%E' jemalloc.sh $e; done done
stacksize=4194304 exe=./silly-test-spawn-mmap
0:01.82
stacksize=4194304 exe=./silly-test-spawn
0:01.13
stacksize=8388608 exe=./silly-test-spawn-mmap
0:01.80
stacksize=8388608 exe=./silly-test-spawn
0:01.12
stacksize=1048576 exe=./silly-test-spawn-mmap
0:01.78
stacksize=1048576 exe=./silly-test-spawn
0:01.13
stacksize=8092 exe=./silly-test-spawn-mmap
0:01.80
stacksize=8092 exe=./silly-test-spawn
0:01.12

@thestinger
Contributor

@cmr: what if you raise vm.max_map_count and set vm.overcommit_memory to 1 (overcommit) instead of 0 (heuristic overcommit)?
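
(For the record, that would be something like the following; the map-count value here is arbitrary:)

$ sudo sysctl -w vm.max_map_count=262144
$ sudo sysctl -w vm.overcommit_memory=1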

@emberian
Member Author

Well, it makes them all, but segfaults in __deallocate_stack(), called by
pthread_join.

@alexcrichton
Member

I'm still not understanding how there's a path forward with this. Even if we do use jemalloc, it looks like the system malloc is always faster than jemalloc (even without mprotect)?

@thestinger
Contributor

@alexcrichton: It's faster than jemalloc for this size of allocation because it's not worrying about fragmentation and will end up causing pathological performance and memory usage in the long run.

If we want stack safety, we have to pay the cost of calling mmap and mprotect just like pthread_create does. If we want even less stack safety than C threads, then sure, we don't need this. Task spawning performance doesn't matter very much because tasks don't replace thread pools for CPU-bound code.
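
For concreteness, a minimal sketch of that mmap + mprotect recipe, roughly what pthread_create does for a thread stack (not the PR's actual code; assumes a downward-growing stack and a one-page guard):

#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map stack + guard read-write, then revoke access to the lowest page so a
 * stack overflow faults instead of silently corrupting adjacent memory. */
static void *alloc_guarded_stack(size_t stack_size) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t total = stack_size + page;
    void *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANON, -1, 0);
    if (base == MAP_FAILED)
        return NULL;
    if (mprotect(base, page, PROT_NONE) != 0) {   /* the guard page */
        munmap(base, total);
        return NULL;
    }
    return (char *)base + total;    /* top of the stack */
}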

@emberian
Member Author

@thestinger is there any way to quantify fragmentation?

One problem with this is that it's not easy to decide when we need to allocate a huge block to put stacks on. By the time we know we need to be using large allocations, it's too late. We can depend on overcommit and just allocate huge (say, 1G) blocks and slice them up per stack. At this point we're rewriting malloc. Or, maybe, get the map limit at runtime and when we get to limit/2, start allocating 2 stacks at once, limit/4, 4 at once, and so forth. Or something.

Some careful juggling is needed if we want small stacks. But I'm not sure those are really possible without reworking tons of things, and I'm not even certain this sort of scheme would be workable for tiny tasks. What does Go do here?

@thestinger
Contributor

The segmented stack prelude is a 2-3% performance hit overall and makes it impossible to use freestanding Rust without building clang near the same LLVM revision as Rust and compiling the rustc bytecode with it. I think that's a far more important cost to worry about than a micro-benchmark spawning no-op tasks. Windows provides safety from stack overflow by default without the same overhead, and gcc provides it via -fstack-check. It's very sad for Rust to be less safe than C with regard to stack safety while also spending more resources on it.

@thestinger
Contributor

@cmr: Try using detached threads in C to match what Rust is doing, or synchronizing the Rust tasks to prevent them from exiting immediately if it's not already doing that.

@emberian
Member Author

I have rebased this. I don't know enough about either side of this issue to effectively debug the benchmark. I don't particularly mind it, though.

Moving forward, we could add some sort of heuristic: after allocating X segments, decide that there are probably going to be lots more spawns soon, and allocate a chunk and dish out stacks from it with mprotects between. We should also probably add an interface to the Runtime to say "make a big stack buffer, you're gonna need it", to get around the (probably flawed) heuristic.

IMO this is mergeable.

@alexcrichton
Member

I have put this on the next meeting agenda.

@alexcrichton
Member

We discussed this at the meeting today, and it seems like we would want to merge this, but then I remembered that this does not protect against overflow in native tasks (we'd want that guaranteed on all platforms; windows may already have that guarantee).

I'm unsure of whether we can protect against this with native tasks, and if we don't protect against it in native tasks, we probably shouldn't try to protect against it in green tasks (especially because native threads will probably become the default).

@emberian
Member Author

emberian commented Jan 7, 2014

Why wouldn't this protect native tasks?


@thestinger
Contributor

@alexcrichton: Native threads already have guard pages as set by pthread_attr_setguardsize (a default of one page on Linux with glibc).
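
A minimal illustration of that attribute (one page here, matching the glibc default; POSIX treats the value as a lower bound):

#include <pthread.h>

static void *task(void *arg) { (void)arg; return NULL; }

int main(void) {
    pthread_attr_t attr;
    pthread_t tid;
    pthread_attr_init(&attr);
    /* Ask for (at least) a 4096-byte guard region below the thread's stack. */
    pthread_attr_setguardsize(&attr, 4096);
    pthread_create(&tid, &attr, task, NULL);
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}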

@alexcrichton
Member

This is only allocating stacks for green tasks (in libgreen); stack allocation for native tasks happens through pthreads, and in pthreads I don't think we're guaranteed to have a guard page at the end of the thread's stack.

@alexcrichton
Member

Oh interesting! It appears that there is pthread_attr_setguardsize on OSX (although it doesn't have a manpage), and I guess it will probably be on freebsd as well. Are we sure that some equivalent exists on windows/android?

@thestinger
Contributor

It's part of POSIX since the early revisions, so it should exist almost everywhere. Windows has full stack safety by default, much like gcc with -fstack-check on Linux.

@alexcrichton
Member

Excellent! In that case, @cmr, could you rebase this patch for green threads?

I'll open an issue for native threads.

This also fixes up the documentation a bit; it was subtly incorrect.
// FIXME #7767: Putting main into a ~ so it's a thin pointer and can
// be passed to the spawn function. Another unfortunate
// allocation
let start = ~start;
Member

I think this was a rebase that went wrong by a little

Member Author

Indeed it was. I was wondering why there was a ~~proc slipping in somewhere.

Member

r=me with this touch-up

@alexcrichton
Member

ping, I can take over this if you'd like.

@emberian
Member Author

That'd be nice; I really just don't have the time right now to hunt down bugs on platforms I don't have ready access to.

@alexcrichton mentioned this pull request Jan 23, 2014
@alexcrichton
Member

Closing in favor of #11756
