use jemalloc's default prefix #18678
Conversation
By default, jemalloc only uses a prefix on OS X, iOS and Windows where attempting to replace the platform allocator is broken. The lack of a prefix allows jemalloc to override the weak symbols from the C standard library to replace the system allocator. This results in a performance improvement for memory allocation in C along with reduced fragmentation. For example, the time spent on LLVM passes in the Rust compiler on Linux is cut by 10% and peak memory usage is reduced by 15%. Even on platforms like FreeBSD where jemalloc is the system allocator, this eliminates the inevitable external fragmentation and performance hit caused by using two general purpose allocators. Closes #18676
I'm a bit wary of this. Replacing malloc is pretty sneaky and I can imagine it causing unintended side effects (I feel like this has come up before - the first time we integrated jemalloc we did not use a custom prefix and had malloc/free mismatches). I'd like to get some other feedback.
It's explicitly supported by the C standard library implementations, which go out of their way to use weak symbols for the platform allocator. It's the normal way to use jemalloc and tcmalloc and it isn't a hack or sneaky at all.
It defaults to a
Rust is causing undefined behaviour by opting out of replacing the system allocator but not disabling jemalloc's support for dss. It's not well-defined to call `sbrk` while another allocator is also growing the data segment.
I'm also somewhat wary of this, but it has been quite some time since the bugs we had last (cc #9925 and #9933). A lot has changed about the runtime/compiler since then, particularly in the form of how we link all our libraries together. We're also not ourselves calling `malloc` and `free` directly these days.

I'd be willing to give this a shot as it's pretty easy to undo; I'd just want to keep a close eye on the queue for the next few days. Changes like this often have quite surprising fallout. The wins seem significant enough that it may be worth just trying out. @brson, does that sound ok to you?
It seems that all things considered, given that Rust itself will still use jemalloc for allocation under the hood, the key change here is for foreign code linked into a Rust binary? Not suggesting this is bad, but wanted to check I was reading this correctly.
@richo: Yes, this means code using the C allocator API (like LLVM) will make use of jemalloc. Mixed allocator usage is pretty much the definition of external fragmentation as unused capacity in one allocator is not usable by the other. It also means that virtual memory is fragmented into smaller spans managed by either allocator which hurts large allocations / reallocations. Finally, it means that it's not possible to have jemalloc safely use dss (which has the potential to speed up reallocations).
@thestinger If it is a problem for jemalloc to be calling sbrk then we could change it to use mmap if we had to. Do you have a source for your assertion that the current behavior is not ok, and do you know why it is working anyway on all platforms? Are you aware of precedent in other language stacks for overriding the C allocator by default? E.g. does Java override malloc? Does Python? If we were to do this then it would be a Linux-specific behavior for a major subsystem.
It's not okay to call `sbrk` in a multi-threaded process: http://lifecs.likai.org/2010/02/sbrk-is-not-thread-safe.html (or see the man pages)
I'm not aware of a language shipping an alternative allocator implementation like Rust. I'm also not aware of a language implementation providing features like sized deallocation and in-place reallocation.
No, it's not Linux-specific. It will be doing it on all platforms that aren't explicitly blacklisted by jemalloc, like FreeBSD and the Android variant of Linux. It also works on Solaris, AIX and most other *nix operating systems. The inability to replace the Windows allocator is considered an important missing feature upstream and will be fixed at some point - hopefully soon. On OS X, jemalloc replaces the system allocator by registering itself as the default zone allocator. Rust isn't passing the configure flag that disables this.

This is the normal way to make use of jemalloc because mixing general purpose allocators is bad for performance and memory usage. On FreeBSD (and perhaps NetBSD), the system allocator is jemalloc and Rust is causing two instances of it to be used in the same process. It means external fragmentation (2 fully independent allocators), lower performance (loss of data locality, separate caches) and fragmented virtual memory.
Java, V8, and SpiderMonkey certainly do ship their own allocators and don't replace the C allocator, although at least V8 could in theory. I'm not going to declare myself opposed to this, but I would like to understand why Rust is different from those three languages in this regard. Replacing malloc with jemalloc for the whole process seems rather intrusive for e.g. embedding scenarios. Of course it should be possible to do so for the reasons you describe, but I suspect that, for example, a Java user would not expect this behavior by default.
I don't see any problems with this in an embedding scenario. The platform C libraries are designed with support for cleanly replacing the system allocator. Rust already provides the choice between using the C standard library API or the non-standard jemalloc API for allocation via a configure option.

If Rust ever intends to use the system jemalloc as required by Debian / Fedora packaging standards (#13945), then it needs to use the default prefix because distribution packages don't add a non-default prefix to the APIs. Rust mixing two general purpose allocators in the same process increases memory usage through external fragmentation, fragmented spans of virtual memory and poor data locality. If the system allocator is jemalloc, then what Rust is doing is strictly worse than calling into that. I think it's far less intrusive to have jemalloc use the system-provided methods for replacing the platform allocator (weak symbols, OS X zone allocator API) than it is to have it live alongside it and fight with it for resources.
It uses an allocator with the same kind of design and performance characteristics as the platform allocator. Using jemalloc is justified because it provides a consistent performance profile across platforms and allows Rust to leverage advanced API features like sized deallocation, alignment support, in-place reallocation and more. However, that value proposition isn't as strong if jemalloc is going to be hamstrung with a non-default configuration flag that's significantly hurting performance and memory usage.
Garbage collectors have drastically different performance and memory usage characteristics. Both Chromium and Firefox replace the system allocators with TCMalloc / jemalloc despite shipping those garbage collector implementations.
I doubt that many Java users have expectations about whether the system allocator is replaced. It wouldn't really matter if they did, because it's a low-level implementation detail with no API impact. The impact is a performance improvement, reduction in memory usage and far more accurate allocator statistics, since in practice it will cover everything other than stacks (I'll get to that later...) and memory mapped files (might be interesting to manage those via the same chunk allocator).
I'm not disagreeing that we should support this mode, but I'm questioning why it should be the default.
Neither are designed for embedding.
I'm not sure that's true.
What if your C app has two libraries you want to use, one written in Rust and one written in some other language that uses tcmalloc?
Building for Fedora etc. using the system allocator can be done just by disabling jemalloc in the build, so that point at least doesn't seem relevant here. Likewise, I don't understand the point about Rust on systems that already use jemalloc. For such systems wouldn't we want to disable our build of jemalloc and use the system allocator?

You claimed that the current configuration is 'significantly hurting performance and memory usage'. I can imagine the memory usage aspect since there are two active allocators, but how does it hurt performance? Aren't we using jemalloc because it has better performance than the system allocator? Have you done measurements on the increased memory usage we are causing by having two active allocators? ISTM that Rust programs by default should not be calling malloc much (if at all), so the impact may not be great.

I'm still not convinced that it is Rust's place to be messing with C's allocator. If we did this then are others still able to override the allocator in the expected way, or do they have to use Rust's allocator?
The goal would be to build it against the distribution's jemalloc package via the same non-standard API with features like sized deallocation, not the normal system allocator. It's certainly relevant here because the only choice Rust is offering right now is between lower performance in Rust with the system allocator (whether or not it is jemalloc) and inefficient mixed allocator usage.
We would still want to use the non-standard API for the significant performance advantages. I would expect the benefits of sized deallocation to cause a widening of the performance gap between the APIs once jemalloc gains support for arena caches.
Mixing the allocators spreads out data more sparsely, and it's not keeping the thread caches as hot as they would be if the C memory allocations were also hitting it. Since both allocators are grabbing virtual memory via `mmap`, the address space ends up fragmented into smaller spans managed by one allocator or the other.
Rust programs often call `malloc` indirectly via linked C libraries.
It's still possible to override the system allocator by linking a library like TCMalloc before liballoc. However, I don't think it would make sense for someone to do this. Mixing allocators in the same process is not a good idea, and it would make far more sense to build Rust with the system allocator API if the intent is to use another allocator like TCMalloc.
I think it makes sense to offer two choices:

- building against jemalloc's non-standard API, with jemalloc's default (empty) prefix so that it replaces the system allocator where the platform supports it
- building against the platform allocator API
I don't think it makes sense to offer the middle ground that Rust is currently using. It still has the disadvantage of an extra library dependency. It doesn't need to have the disadvantage of mixed allocator usage. I don't see any benefits to avoiding the normal configuration where it replaces the system allocator on platforms with support for doing so. It's a significant performance and memory usage loss relative to a default build.
It could build against the platform allocator by default, but I don't see why this crippled build of jemalloc should be supported.
I don't think significantly hurting performance / memory usage in processes making heavy use of the C allocator is friendly to embedding. Rust supports using the platform allocator but if it's built against the non-standard jemalloc API instead then it makes sense to leverage the ability to replace the system allocator that's provided by the platform.
It will work correctly. Only one of the allocators will replace the system allocator, depending on which was loaded first - that's how weak symbols work. Rust doesn't permit mixing the Rust and C allocator APIs anyway.
Was the main concern about embedding the possibility that jemalloc might replace the system allocator during the runtime of a C program? I am not seeing other potential disadvantages being mentioned, and 20% less memory usage is huge, especially given that we get it by doing nothing, instead of sabotaging jemalloc.
It's not possible to
@eddyb: It's ~10-15% less peak memory usage.
The Rust compiler and LLVM make a lot of small allocations so metadata overhead is a big deal. They also do a lot of repeated vector reallocations, so fragmented virtual memory doesn't help.
I will bring this up during the meeting. My current position is that I am not in favor of doing this by default for libraries because I am concerned about surprising embedders. Sure, jemalloc may well be the best allocator ever, but it's still a somewhat antisocial thing to do. If I'm invited over to your house for dinner, and when I get there I suddenly smash your old CRT TV and replace it with a brand new flat screen, the fact that I upgraded your TV is not going to stop me from getting kicked out of your house and arrested. Executables, however, are a separate story, because it makes sense that Rust "owns" the runtime environment of a Rust executable. So I would be fine with doing this for executables. (Note that this would address the
I don't see how it could just be done for executable crates but not library ones.
Well, we could just get rid of the prefix for jemalloc and only link it in for executable crates (at the top level). That way, library crates would use whatever the system allocator is (which, for Rust executable projects, would be jemalloc; for non-Rust executable projects that would be jemalloc if the embedder explicitly opts in and otherwise would be their native allocator).
Wait, does this mean that it's impossible to
The issue is that that's for the embedder to decide; we should not be making that decision for them if we're a guest in their process.
That won't work because we want to be calling into jemalloc's non-standard API from those libraries.
It's one of several reasons that
Rust doesn't make the decision for them. It has a configure option for using the platform allocator API instead of jemalloc's non-standard API. I don't see any way to make the choice at runtime between the more performant jemalloc-specific API (sized deallocation, etc.) and the platform API.
Can't we make weak symbol shims over jemalloc's nonstandard API that call into the system allocator and are overridden by jemalloc if it's linked in?
Well, I'd prefer to fix that instead of sealing off the possibility entirely.
Having to rebuild Rust is an awfully big sledgehammer to hit this problem with.
@thestinger The problem I see is that the system is already "broken" (fragile may be a better word) even for C and C++, but people are used to it. The concern here is that Rust libraries should fit into the usual brokenness boundaries. With your change the breakage may be worse.

But I realize that I drew the wrong conclusion from your previous post. If the Rust dynamic library uses only the global malloc and free, then it can free anything that it receives from the binary. If the alignment restrictions of malloc are stricter than the requirements of freex, then the binary can free anything it receives from the library. rallocx may fall into the same reasoning category.
This pull request doesn't break anything.
I don't really think we're talking about the same things.
@bill-myers when you are in another language and you
@Thiez Of course the performance of the Rust library will be worse with only malloc/free. The key part is the assumption that it can be done so that the library uses the malloc/free interface to the allocator while the binary uses the mallocx/freex interface to the same allocator and nothing breaks. The other assumption is that when you decide to use a dynamic library, you do not care about performance so much.
It's already documented, and is not something that would be a good idea to permit in any scenario. It's a really bad idea to mix allocators across library boundaries this way on Windows. There is often more than one "platform" allocator in the same process there and it varies across the library boundaries. You need to free stuff with the API provided by the library / language itself. In general, it's not something that should be permitted because it would tie the hands of Rust in the future. It would always need to use an allocator stack fully interchangeable with "the" (which?) platform allocator if that was a guarantee. It would hamstring future improvements by forcing the whole allocator to be based around a legacy API.
Also keep in mind that some platform-specific C libraries might have broken code that relies on unspecified behavior of the system malloc, such as the way it aligns allocations, the way it rounds sizes up and the portion of address space it uses (esp. where it is located relative to the 2GB/4GB limit). If the system allocator is replaced with jemalloc, then jemalloc has to provide those same guarantees, which might possibly result in wasting memory if the system allocator rounds sizes up a lot.
@bill-myers in that case the broken library should be fixed. If it's broken for us it will also be broken for programs that use an alternative allocator, and probably on systems that use jemalloc as the system allocator (e.g. FreeBSD). I see no reason why Rust should suffer to support what is essentially hypothetical wrong third-party code.
Well, for instance the binary-only Adobe Flash on Linux famously used memcpy() on overlapping memory areas and relied on glibc copying in the forward direction. Since the allocator is usually never replaced by normal C programs, I would definitely expect that there is code that does things like allocate 24 bytes and use 32, and that works reliably because the system malloc rounds up to powers of two. Thinking more about it, I think prefixing jemalloc, never replacing the system allocator and thus always having two allocators running is the only thing that is both guaranteed to be bug-free and prevents libraries from depending on free on Boxes working.
All of the relevant information is in the original pull request. It doesn't make any changes to the user-facing semantics that are exposed today. It is an under-the-hood performance improvement, and 95% of the discussion here is off-topic for this pull request. The pull request has nothing to do with the ability to choose the allocator without recompiling Rust and has no impact on the ability to
That's exactly what was intended:
I have to keep addressing misinformation and repeated misrepresentation of my statements, so there is no way that anyone is going to be able to follow the conversation here.
Rust's allocator API is entirely separate from the C allocator API. It is explicitly documented as not being interchangeable with it, for reasons that I have gotten into above. Dynamically loading a library with `dlopen` doesn't clobber the host's existing allocator symbols.

If foreign code is using a function like
Rust already provides the option of the
dss (data storage segment) refers to the heap data section, which is grown via `sbrk`. jemalloc has a runtime option setting the preference of mmap vs. sbrk; the default is `secondary`, meaning sbrk is only used when mmap fails.

The zone allocator API is an API provided by OS X / iOS for the explicit purpose of replacing the system allocator. Rust is currently making use of it, since it doesn't pass the configure flag that disables it.
That's not true. This is the normal way to use the TCMalloc and jemalloc libraries, and C programs already need to cope with endless churn in the design of the glibc allocator and portability to various other allocators.
I don't understand how you've managed to draw that conclusion. The FreeBSD, glibc, Windows and OS X allocators have been drastically changed over the years and it has never resulted in what you're claiming.
Rust is already doing stuff like enabling full ASLR by default and having the linker remove unused sections, which is far more relevant to the issue of breaking code that depends on undefined behaviour. Rust makes the assumption that
It is also trivially easy to define a

My impression from past discussion was that C code being able to
@thestinger After rereading the commit and some key parts of the discussion, my understanding is that the concern is that dynamic Rust libraries become "poisonous" because of the malloc/free. My question is: would it be possible to export unprefixed jemalloc symbols only for Rust binaries and use prefixed symbols in all Rust code? That way Rust programs will benefit from a single memory allocator with low risk of breaking any external code, and on the other side dlopening a Rust library will not change behaviour, at the cost of using an additional allocator alongside the system allocator (but when you decide to use a dynamic library, performance presumably does not matter so much).
The current situation is that liballoc can be compiled with or without jemalloc.
I don't know how linking would work in this case. If the libraries use the prefixed symbols, then those symbols need to be provided by something for linking to complete.
If there are any widely-used large programs (Firefox or Chrome perhaps?) replacing the default allocator on Linux, that would give some confidence that at least the basic system libraries are not broken with non-default allocators.

In addition to the bugs issue, there is also the issue of things starting to depend on malloc() and the Rust allocator being interchangeable. If we really want to allow replacing the system allocator, I think it should be optional, and ideally a choice that can be overridden at runtime with an environment variable. Maybe what could be done is keep jemalloc prefixed or dlopen it with RTLD_LOCAL, and add code to Rust executables (not libraries) to use either glibc malloc hooks or GNU ifuncs to replace the system allocator, depending on both an environment variable and a default setting specified at compile time.
Firefox, Chrome and MariaDB, among others. It's the normal way to use TCMalloc / jemalloc, and putting either of them in `LD_PRELOAD` is common practice.
I'm not experienced in allocation stuff, but I just do not understand why everyone states that this pull request affects dlopen while the pull request author states it does not. Again, I don't know in detail how allocators and dlopen-related functionality work, but it seems very clear to me from @thestinger's explanations that just renaming symbols per this PR does not affect dlopening, and dlopen is broken now anyway due to the jemalloc implementation.
@bill-myers I have added jemalloc to a couple of large in-house developed programs running on Linux without any problems. You can easily google the results of benchmarks that many people did by preloading jemalloc or tcmalloc into different SQL databases, interpreters and other programs.
I don't care enough about Rust's performance to deal with this. That's all folks.
@thestinger Excuse me for asking another question, but the pull request says "The lack of a prefix allows jemalloc to override the weak symbols from the C standard library to replace the system allocator."
It replaces the C allocator if and only if it's linked into the executable at initialization (ignoring `LD_PRELOAD`).
@thestinger You mean dynamically linked at program initialization? What about RTLD_GLOBAL? Wouldn't it replace the C allocator for other dlopened libraries?
It won't clobber the existing symbols without `RTLD_DEEPBIND`.
I think this is the critical piece of information that was missing from the PR, and what most people were confused/wary about. This was mentioned in this thread, but not prominently enough. Very few people will have the energy to carefully read a long-ish thread, then hunt down the rest of the info in the IRC logs, previous PRs, etc. Least of all the Rust core team, who, I imagine, are pretty busy these days.
I hope things look less bleak in the morning :-)
Oh, and regarding the LLVM allocator: can't we link rustc_llvm with
@vadimcn Citing pcwalton from the reddit thread:
This sounds less bleak I think. |