RFC: kernel heap requests on behalf of syscalls #6972
Comments
IMHO, option 3 offers the most flexibility in resource allocation policy while also being relatively simple to use in the trivial cases.
I... still kinda hate this. Not the changes themselves, just the notion of where they're going. Trying to track allocation at a fine-grained level like this is a "last mile" kind of hardening, and we're trying to do that over an API that is already full of holes. I mean, take the k_delayed_work examples: those are functions that run in kernel space! Why do we care about preventing a process from being able to do an OOM DoS using an API that by definition can already do that and much, much more?

IMHO, if we really want to harden Zephyr against userspace misbehavior[1], we MUST start by carefully defining (or even redesigning) a "hardened Zephyr API" and not trying to plug all the holes whack-a-mole style. There are too many, and we'll end up with a bloated mess. Our existing API is filled with things that just aren't meaningfully useful to processes working across MMU boundaries. Other examples:

- You'd never choose a semaphore for that; you'd use something like a futex for controlling blocking.
- Mailboxes and pipes just reduce to something like "file descriptors" in this world.

[1] One big note here: Linux userspace can, by default, trivially OOM the system. It's possible to harden that, of course, but it's non-trivial and non-standard. And even then it doesn't work by tracking every kernel allocation made on behalf of the process, but by limiting the resources granted to it (and its children) by the VM, which is IMHO a simpler regime to work in. Do we really want to be pushing a more fine-grained and more complicated framework into our little RTOS than Linux ever felt it needed?

====

OK, rant part over. All that being said: if I had to pick one choice from the list it would be #3, for robustness.

Note that #1 and #2 require more than just tracking byte counts: you need a big list of all the allocations so that if the thread aborts (strictly: all threads attached to that memory space, I guess) you can clean it up. Otherwise you still have a DoS condition exploitable by starting threads that allocate and bail. Doing this in a separate heap gets you that tracking for free. My guess is that if you add this requirement to the budget for #1 and #2, you'd find that #3 is actually simpler.

If I could suggest a #4: I'd say it would be cleaner still to forget the byte and heap block tracking, and make sure that:
I don't see any reason that couldn't be just as hard from a security point of view, and it doesn't require any complicated tracking beyond counting. It does require some rework to existing data structures to meet the "clamp" requirement though.
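For illustration, a minimal sketch of what that counting-plus-clamping scheme (suggestion #4) might reduce to; the limits, struct, and function names below are invented for this example and are not part of any existing Zephyr API.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-thread cap on live dynamically-allocated kernel objects
 * plus a per-request size clamp.  No byte accounting, no allocation lists. */
#define MAX_DYN_KOBJECTS_PER_THREAD 16
#define MAX_KOBJECT_ALLOC_SIZE      256

struct thread_quota {
	unsigned int dyn_kobjects;   /* live dynamic objects owned by thread */
};

/* Kernel side of a syscall calls this before allocating on the thread's
 * behalf.  Returns 0 on success, a negative errno value otherwise. */
static int kobject_quota_charge(struct thread_quota *q, size_t requested)
{
	if (requested > MAX_KOBJECT_ALLOC_SIZE) {
		return -EINVAL;      /* user mode can't drive arbitrary sizes */
	}
	if (q->dyn_kobjects >= MAX_DYN_KOBJECTS_PER_THREAD) {
		return -ENOMEM;      /* simple count, nothing more to track */
	}
	q->dyn_kobjects++;
	return 0;
}

/* Called when the object is freed, or for each live object on thread exit. */
static void kobject_quota_release(struct thread_quota *q)
{
	if (q->dyn_kobjects > 0) {
		q->dyn_kobjects--;
	}
}
```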
@andyross we want to implement workqueues that run in user mode (#6289). User mode won't have access to workqueues that run in supervisor mode. I had started working on this until I realized that supporting delayed work would present some issues. There will be another RFC for this.
@andyross we surely don't. I'm soliciting ideas here for how to make this simple. My original thinking was in the vein of rlimits, but I'm currently considering your suggestion #4.
OK. Thinking out loud on the implementation details for this. If I understand you correctly:
This I'm having a little trouble with in terms of "what counts as a reference", but if we made it synonymous with "having active permissions on that object" then this is straightforward too, and would be a simple counter member in the struct k_thread.
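Sketching that idea, under the assumption that the existing per-object permission bitfield doubles as the reference count and that the new thread member is just a plain counter (all names below are made up for illustration):

```c
#include <stdint.h>

struct kobject_meta {
	uint32_t perms;           /* bit N set => thread N has permission */
};

struct thread_meta {
	unsigned int kobj_refs;   /* objects this thread holds a reference on */
};

/* Granting permission is taking a reference. */
static void grant(struct kobject_meta *obj, struct thread_meta *t, int tid)
{
	obj->perms |= (1u << tid);
	t->kobj_refs++;
}

/* Revoking permission (explicitly, or implicitly on thread exit) drops the
 * reference.  Returns nonzero when the object has no references left. */
static int revoke(struct kobject_meta *obj, struct thread_meta *t, int tid)
{
	obj->perms &= ~(1u << tid);
	t->kobj_refs--;
	return obj->perms == 0;
}
```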
That all sounds good. FWIW: I'd argue you don't even need the per-object byte counts. All that is really required for security is that the kernel object's allocation be bounded to some predictable level and not drivable to arbitrary sizes by userspace requests. We can trust kernel code. In almost all circumstances I'm sure this is already true. And I was making the same assumption about permission vs. reference. If you can't touch it, it doesn't matter whether it exists or not. I guess that requires that kobj handles never be passed around between processes that don't have permissions to them though, which would require some auditing.
I'm not 100% sure about this. There are some use-cases where the memory allocation might be somewhat large. Or at least, the allocations that objects make are not necessarily on the same order of magnitude across object types. I did sketch out use-cases in the original RFC, but I'm going to take another pass at this and really drill down to the specifics for each API that might involve allocations on behalf of syscalls. It will help to know exactly what we are dealing with, and on further contemplation we may want to say for some of them "no, this is not the right way to go about this" and omit those.
I've spent quite some time on this. After trying out a lot of stuff I have pretty much convinced myself that Option 3 is still the way to go, but let me attempt to justify it. I did seriously consider tracking allocations on a per-object rather than a per-thread basis, but I found things ended up much simpler this way even though it's not how Linux does it. First, the details of what I am specifically trying to do:
The previous 3 items show how I am using the resource pools to solve three problems:
I found with Option 3 that all three of these were trivial to implement; at the end of the day we just call z_thread_malloc(), and later k_free() when done. The fact that these allocations store internally what pool they came from makes things even simpler. However, there's one more thing that needs to be done, and that is reference counting. If an object loses all its references, in almost all cases it should drop to an unused/uninitialized state. I want to make sure of two things:
The model of using the existing permission bitfield also as a reference count works GREAT. I am not sure if I need to rename it, but basically if a thread has permission on an object, that also counts as a reference. A reference is automatically cleared on an object when a thread exits (we had this already). I have added logic such that whenever an object's reference count drops to zero (no thread holds permission on it any more), any object-specific cleanup function (if it exists) is called, and then if the object was dynamically allocated it is itself freed. There are some caveats:
Example lifecycle: User thread A is granted a generously sized resource pool P which is shared with threads B and C which make up some logical application on the microcontroller. There are two other logical applications running on this MCU, each with their own resource pools. User thread A needs a pipe. It calls:
and is returned an uninitialized pipe object, with A being automatically granted permission on it. User thread A initializes the pipe. A needs a fairly large buffer and calls
The return value would equal -ENOMEM if an 8K buffer couldn't be reserved, but this succeeds and the system call returns 0. A then uses the pipe for a while. Then one of two things happens: A either exits, or A intentionally drops its reference to the pipe with k_object_release(p). The reference count on the pipe drops to 0, assuming no other threads had a reference to it. This triggers a cleanup action which first calls k_pipe_cleanup(), which frees the pipe's buffer. Then, since the pipe itself was dynamically allocated, the memory for the pipe object is also returned to the resource pool. This all happens automatically. I have a rough WIP PR that demonstrates most of this: #7049
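Roughly, the flow described above might look like the following from thread A's side. This is only a hedged sketch: the allocation calls k_object_alloc() and k_pipe_alloc_init() are assumed names for "allocate a pipe object from my resource pool" and "reserve a kernel-side buffer"; only k_object_release() and k_pipe_cleanup() are named in the text, and the actual API lives in the WIP PR.

```c
#include <zephyr.h>   /* kernel API, with the proposed calls assumed below */

/* Hypothetical user-side view of the lifecycle above. */
static int make_pipe(struct k_pipe **out)
{
	/* Object memory comes out of thread A's assigned pool P, and A is
	 * automatically granted permission (a reference) on the result. */
	struct k_pipe *p = k_object_alloc(K_OBJ_PIPE);   /* assumed name */

	if (p == NULL) {
		return -ENOMEM;               /* pool P exhausted */
	}

	/* Reserve the 8K buffer, also charged against pool P. */
	int ret = k_pipe_alloc_init(p, 8192);            /* assumed name */

	if (ret != 0) {
		k_object_release(p);          /* drop our only reference */
		return ret;                   /* -ENOMEM if P can't fit 8K */
	}

	*out = p;
	return 0;
}

/* Later, when A exits or calls k_object_release(p) and no other thread
 * holds a reference: k_pipe_cleanup() frees the buffer, then the pipe
 * object's own memory goes back to pool P automatically. */
```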
We've found some situations where we are going to need to allow the kernel side of system calls to reserve memory on some private kernel heap, for short- or long-lived purposes, because user mode cannot be trusted to provide this memory itself. However, if we allow this, we must have some means to prevent user mode threads from DoSing the kernel heap by requesting all of its memory.
Let's go through some examples where this is needed:
So given all this, we still don't want threads to spam syscalls and eat up the entire kernel heap. There need to be some limits on how much kernel heap data a user thread can indirectly consume through these calls. I can think of 3 methods for doing this:
I feel Option 1 doesn't account for divergent sizes of object requests; it would only work well if all requests were generally on the same order of magnitude in size, so I won't discuss Option 1 further.
Option 2 is simple and easy and doesn't require the creation of additional pools. However, it means that child threads would again be able to reserve as much as the parent; it doesn't impose any kind of cap on a group of threads. We don't (yet) have a process model in Zephyr, with multiple threads running in one process. Something like Option 2 would be ideal later if/when we introduce processes, even if only as an abstraction layer for non-MMU systems.
Option 3 is a bit more cumbersome since additional memory pools have to be defined. But it does offer the ability for threads that form some logical "application" on the device to all draw from a single pool. One of the more popular use-cases of Zephyr is to have multiple, more or less independent applications all running on the same MCU; APIs like Memory Domains and so forth are intended for this. So if each application has its own pool for syscall requests, they can't DoS each other. The downside is that since these are separate pools, they all have to be set up, a proper size determined, threads assigned to them, etc.
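To make Option 3 concrete, here is a rough sketch (not the actual implementation) of how the kernel side of a syscall could satisfy an allocation from the calling thread's assigned pool and remember where the block came from so it can later be freed back to the right pool. The thread field, the pool-pointer-in-header trick, and all names here are assumptions for illustration.

```c
#include <stddef.h>

struct mem_pool;                           /* stand-in for a k_mem_pool type */

struct thread {
	struct mem_pool *resource_pool;    /* pool assigned to this "application" */
};

extern struct thread *current_thread;
extern void *pool_malloc(struct mem_pool *pool, size_t size);
extern void pool_free(struct mem_pool *pool, void *block);

/* Allocate kernel-side memory on behalf of the calling user thread. */
void *thread_malloc(size_t size)
{
	struct mem_pool *pool = current_thread->resource_pool;

	if (pool == NULL) {
		return NULL;               /* no pool assigned: request denied */
	}

	/* Over-allocate by one pointer so the block records its pool. */
	struct mem_pool **hdr = pool_malloc(pool, size + sizeof(*hdr));

	if (hdr == NULL) {
		return NULL;               /* pool exhausted -> -ENOMEM upstream */
	}
	*hdr = pool;
	return hdr + 1;                    /* caller sees memory after the header */
}

/* Free a block from thread_malloc(), whichever pool it came from. */
void thread_free(void *ptr)
{
	struct mem_pool **hdr = (struct mem_pool **)ptr - 1;

	pool_free(*hdr, hdr);
}
```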
I'm looking for input on the approach: