
RFC: kernel heap requests on behalf of syscalls #6972

Closed
andrewboie opened this issue Apr 6, 2018 · 8 comments
Labels
area: Memory Protection Feature A planned feature with a milestone RFC Request For Comments: want input from the community

Comments

@andrewboie
Contributor

We've found some situations where we are going to need to allow the kernel side of system calls to reserve memory on some private kernel heap, for short- or long-lived purposes, because user mode cannot be trusted to provide this memory itself. However, if we allow this, we must have some means to prevent user mode threads from DoSing the kernel by requesting all of the heap's memory.

Let's go through some examples where this is needed:

  1. Dynamic allocation of kernel objects, an RFC patch which exists in dynamic kernel objects #6876. We want to allow for kernel objects to be allocated dynamically, not just statically as we have now. It would be very desirable to allow user mode threads to also request kernel objects.
  2. We have some system calls which would like to do some deferred processing. A good example is the k_delayed_work APIs. In order to properly do deferred processing, you need to instantiate a struct _timeout which gets added to system timeout lists. Once the timeout expires, the _timeout can be deleted. The _timeout cannot be in user-accessible memory, so any deferred workqueue requests will need a transient heap allocation to handle the timeout.
  3. Using the workqueue example again, there's a larger struct k_work which is queued when work is requested and de-queued by the worker thread. A user mode compatible workqueue would need temporary allocation of k_work objects until the worker thread consumes them.
  4. In a more general case, we could introduce the notion of unbounded k_queues by allocating the linked list behind the scenes. Right now the only IPC data exchange mechanism we allow user mode to use is pipes, which are fixed size.
  5. k_pipes are themselves currently problematic. We really can't trust user mode to provide the buffer where the pipe data gets queued. If user mode wants to initialize a pipe object we need the pipe's buffer to be allocated on a heap instead.

So given all this, we still don't want threads to spam syscalls and eat up the entire kernel heap. There needs to be some limits on how much kernel heap data a user thread can indirectly consume on behalf of these calls. I can think of 3 methods for doing this:

  1. Limit number of heap requests. Simpler per-thread counter, any given thread cannot make more than N requests before the request fails, where N defaults to some Kconfig and can be adjusted by supervisor mode at runtime. Freeing memory will credit the thread that made the request (even if freed by some other thread or an ISR)
  2. Limit number of bytes of heap requests. Just like #1, but account for the actual number of bytes requested. Any given thread can only reserve N bytes.
  3. Don't do numerical accounting; instead, let supervisor mode assign a k_mem_pool to a thread. Any requests made by that thread will draw from that memory pool. The pool defaults to NULL, meaning no requests are allowed, and a supervisor mode API assigns a pool to a thread. Multiple threads can all use the same pool, and child threads inherit the pool assigned to the parent. If DoS is not a concern, just make this the kernel heap.

I feel Option 1 doesn't account for divergent sizes of object requests; it would only work well if all requests were generally of the same order of magnitude in size, so I won't discuss Option 1 further.

Option 2 is simple and easy and doesn't require the creation of additional pools. However, it means that child threads would again be able to reserve as much memory as the parent; it doesn't impose any kind of cap on a group of threads. We don't (yet) have a process model in Zephyr, with multiple threads running in one process. Something like Option 2 would be ideal later if/when we introduce processes, even if it's only an abstraction layer for non-MMU systems.

Option 3 is a bit more cumbersome since additional memory pools will have to be defined. But it does offer the ability to have threads that form some logical "application" on the device to all draw from a single pool. One of the more popular use-cases of Zephyr is to have multiple, more or less independent applications all running on the same MCU, APIs like Memory Domains and so forth are intended for this. So if each application has its own pool for syscall requests, then they can't DoS each other. The downside is that since these are separate pools, they all have to be set up, a proper size determined, threads assigned to them, etc.

I'm looking for input on approach:

  • Pursue Option 2 by introducing a Zephyr "process" abstraction layer and have the heap limits shared by all threads that belong to that process. This is what I am trending towards. The same process-level APIs will then tie into future work to introduce virtual memory.
  • Pursue Option 2 but disregard potential DoS scenarios where a ton of user threads are created and eat up the heap in concert
  • Pursue Option 3, as it would be fastest to implement even though it is not particularly easy to use.
@andrewboie andrewboie added Feature A planned feature with a milestone area: Memory Protection labels Apr 6, 2018
@stephensmalley
Collaborator

IMHO, option 3 offers the most flexibility in resource allocation policy while also being relatively simple to use in the trivial cases.

@nashif nashif added the RFC Request For Comments: want input from the community label Apr 9, 2018
@andyross
Contributor

andyross commented Apr 9, 2018

I... still kinda hate this. Not the changes themselves, just the notion of where they're going. Trying to track allocation at a fine-grained level like this is a "last mile" kind of hardening, and we're trying to do that over an API that is already full of holes. I mean, take the k_delayed_work examples: those are functions that run in kernel space! Why do we care about preventing a process from being able to do an OOM DoS using an API that by definition can already do that and much, much more?

IMHO, if we really want to harden Zephyr against userspace misbehavior[1], we MUST start by carefully defining (or even redesigning) a "hardened Zephyr API" and not trying to plug all the holes whack-a-mole style. There are too many, and we'll end up with a bloated mess.

Our existing API is filled with things that just aren't meaningfully useful to processes working across MMU boundaries. Other examples: You'd never choose a semaphore for that, you'd use something like a futex for controlling blocking. Mailboxes and pipes just reduce to something like "file descriptors" in this world.

[1] One big note here: Linux userspace can, by default, trivially OOM the system. It's possible to harden that, of course, but non-trivial and non-standard. And even then it doesn't work by tracking every kernel allocation made on behalf of the process but by limiting the resources granted to it (and its children) by the VM, which is IMHO a simpler regime to work in. Do we really want to be pushing a more fine-grained and more complicated framework into our little RTOS that Linux never felt it needed?

====

OK, rant part over.

All that being said: if I had to pick one choice from the list it would be #3, for robustness. Note that #1 and #2 require more than just tracking byte counts: you need a big list of all the allocations so that if the thread aborts (strictly: all threads attached to that memory space I guess) you can clean it up. Otherwise you still have a DoS condition exploitable by starting threads that allocate and bail. Doing this in a separate heap gets you that tracking for free. My guess is that if you add this requirement to the budget for #1 and #2, you'd find that #3 is actually simpler.

If I could suggest a #4: I'd say it would be cleaner still to forget the byte and heap block tracking, make sure that:

  • All heap blocks are "owned" by a kernel object
  • Reference count kernel objects by how many threads can access them
  • Make sure that each object has a sane clamp on memory usage (e.g. the size of a unix file descriptor buffer, etc...)
  • Just limit the number of references to kernel objects a thread is allowed to have, in the same way unix limits the number of open descriptors in a process.

I don't see any reason that couldn't be just as hard from a security point of view, and it doesn't require any complicated tracking beyond counting. It does require some rework to existing data structures to meet the "clamp" requirement though.

@andrewboie
Contributor Author

andrewboie commented Apr 9, 2018

take the k_delayed_work examples: those are functions that run in kernel space! Why do we care about preventing a process from being able to do an OOM DoS using an API that by definition can already do that and much, much more?

@andyross we want to implement workqueues that run in user mode, #6289. User mode won't have access to workqueues that run in supervisor mode. I had started working on this until I realized that supporting delayed work would present some issues. There will be another RFC for this.

@andrewboie
Contributor Author

Do we really want to be pushing a more fine-grained and more complicated framework into our little RTOS that Linux never felt it needed?

@andyross we surely don't. I'm soliciting ideas here for how to make this simple. My original thinking was in the vein of rlimits but I'm currently considering your suggestion #4.

@andrewboie
Contributor Author

andrewboie commented Apr 9, 2018

If I could suggest a #4: I'd say it would be cleaner still to forget the byte and heap block tracking, make sure that:

  • All heap blocks are "owned" by a kernel object
  • Reference count kernel objects by how many threads can access them
  • Make sure that each object has a sane clamp on memory usage (e.g. the size of a unix file descriptor buffer, etc...)
  • Just limit the number of references to kernel objects a thread is allowed to have, in the same way unix limits the number of open descriptors in a process.

I don't see any reason that couldn't be just as hard from a security point of view, and it doesn't require any complicated tracking beyond counting. It does require some rework to existing data structures to meet the "clamp" requirement though.

OK. Thinking out loud on the implementation details for this. If I understand you correctly:

  • Any heap object requested from the kernel heap will track what object it belongs to with some header information pointing back to the source object. This will automatically be handled by the allocation. There will be a special API to free such memory, so that the owning object gets credit for the release of the memory it requested:
+---------------+
| k_mem_block   |
+---------------+
| obj pointer   |
+---------------+
| requested mem |
|               |
+---------------+
  • The clamp on memory usage could be another field in struct _k_object, or we could just use the existing u32_t "data" member, since that is currently unused for all but thread and thread stack objects, to track how much heap memory an object currently has in play. If a request is made on behalf of some object, a check is made to ensure that object hasn't exceeded its limits.

Just limit the number of references to kernel objects a thread is allowed to have, in the same way unix limits the number of open descriptors in a process.

This I'm having a little trouble with in terms of "what counts as a reference" but if we made it synonymous with "having active permissions on that object" then this is straightforward too, and would be a simple counter member in the struct k_thread.

@andyross
Contributor

andyross commented Apr 9, 2018

That all sounds good. FWIW: I'd argue you don't even need the per-object byte counts. All that is really required for security is that the kernel object's allocation be bounded to some predictable level and not drivable to arbitrary sizes by userspace requests. We can trust kernel code. In almost all circumstances I'm sure this is already true.

And I was making the same assumption about permission vs. reference. If you can't touch it it doesn't matter if it exists or not. I guess that requires that kobj handles never be passed around between processes that don't have permissions to them though, which would require some auditing.

@andrewboie
Contributor Author

FWIW: I'd argue you don't even need the per-object byte counts. All that is really required for security is that the kernel object's allocation be bounded to some predictable level and not drivable to arbitrary sizes by userspace requests.

I'm not 100% sure about this. There are some use-cases where the memory allocation might be somewhat large. Or at least, the allocations objects make are not necessarily on the same order of magnitude across object types.

I did sketch out use-cases in the original RFC, but I'm going to take another pass at this and really drill down to the specifics for each API that might involve allocations on behalf of syscalls. It will help to know exactly what we are dealing with, and on further contemplation we may want to say for some of them "no, this is not the right way to go about this" and omit those.

@andrewboie
Contributor Author

andrewboie commented Apr 25, 2018

I've spent quite some time on this. After trying out a lot of stuff I have pretty much convinced myself that Option 3 is still the way to go but let me attempt to justify it. I did seriously consider tracking allocations on a per-object instead of a per-thread basis but I found things ended up much simpler this way even though it's not how Linux does it.

First, the details of what I am specifically trying to do:

  1. I've written a patch to add a k_mem_pool_malloc() API. This works exactly like k_malloc() except a pointer to a memory pool is additionally provided. k_malloc() now just calls this with a pointer to the system heap. k_free() is used to release memory allocated with either k_malloc() or k_mem_pool_malloc() since the allocation itself tracks what pool it came from.

  2. I implemented Option 3, which is that threads can be assigned a resource k_mem_pool. The patch is dead simple, maybe 20 lines of code. By default a thread has no resource pool. A supervisor-only API k_thread_resource_pool_assign() exists to set a thread's resource pool. Children of the thread inherit from the parent. The internal API z_thread_malloc() exists to request memory from the current thread's resource pool, returning NULL if there is no pool or not enough RAM.

  3. The earlier patch I wrote to be able to request entire kernel objects dynamically dynamic kernel objects #6876 now draws from the caller's resource pool instead of the system heap. k_object_alloc() is now exposed as a system call. k_object_free() is NOT exposed as a system call and never will be.

  4. We have open bugs on the k_pipe and k_msgq objects because their init functions take buffers as a parameter which are then used for internal storage by the object. We can't trust user mode to provide such buffers. These init functions have been removed as syscalls and new _alloc_init() functions added which allocate the buffers from the caller's resource pool.

  5. I am currently working on a variant of the k_queue API. When the user passes in a pointer to be added to the queue, these alternate APIs will allocate from the caller's resource pool a struct which will contain the linked list information and the pointer to the data itself. The dequeue function frees this temporary struct before returning the original pointer to the caller. The key difference is that the user is no longer required to leave room in their data for the queue's internal linked list information, and the provided data pointer is never dereferenced at all.

Items 3 through 5 above show how I am using the resource pools to solve three problems:

  • How to allocate some memory to create brand-new kernel objects
  • How to allocate memory for nontrivially large buffers used by certain kernel objects like pipes; these can be pretty large depending on use-case
  • How to allocate variously sized temporary blocks of memory used behind the scenes in system calls so that user data is not mixed with kernel bookkeeping information, and the user doesn't need to directly provide memory to perform such actions.

I found with Option 3 that all three of these were trivial to implement; at the end of the day we just call z_thread_malloc(), and later on k_free() when done. The fact that these allocations store internally what pool they came from makes things even simpler.

However, there's one more thing that needs to be done, and that is reference counting. If an object loses all its references, in almost all cases it should drop to an unused/uninitialized state. I want to make sure of two things:

  • That any buffers allocated for the object, for example the buffer region for a k_pipe(), are automatically freed. The interface is that objects that might need special cleanup like this should expose a cleanup() function which gets called on it when all refs are lost.
  • If the object was itself dynamically allocated with k_object_alloc(), the entire object should then be freed.

The model of using the existing permission bitfield as also a reference count works GREAT. I am not sure if I need to rename it, but basically if a thread has permission on an object, this is also a reference. A reference is automatically cleared on an object when a thread exits (we had this already). I have added logic such that whenever a thread's reference count on an object drops to zero, any object-specific cleanup function (if it exists) is called, and then if the object was dynamically allocated it will be itself freed.

There are some caveats:

  • A thread should never lose a reference on an object while it is inside an API call for that object. This means that I had to remove k_object_access_revoke() as a system call. However, I added a new one, k_object_release(void *obj) which drops a reference to that object for the calling thread. I just can't have user threads revoking access to objects for other threads since they could be in an API call.
  • Supervisor threads are going to need to acquire references to objects when they use them, even though the permission model is ignored, to prevent situations where an object unexpectedly disappears while being used.

Example lifecycle:

User thread A is granted a generously sized resource pool P which is shared with threads B and C which make up some logical application on the microcontroller. There are two other logical applications running on this MCU, each with their own resource pools.

User thread A needs a pipe. It calls:

struct k_pipe *p = k_object_alloc(K_OBJ_PIPE);

and is returned an uninitialized pipe object, with A being automatically granted permission on it.

User thread A initializes the pipe. A needs a fairly large buffer and calls

int ret = k_pipe_alloc_init(p, 8192);

ret would equal -ENOMEM if an 8K buffer couldn't be reserved, but this succeeds and the system call returns 0.

A then uses the pipe for a while. Then one of two things happens: A either exits, or A intentionally drops its reference to the pipe with k_object_release(p). The reference count on the pipe drops to 0, assuming no other threads had a reference to it. This triggers a cleanup action which first calls k_pipe_cleanup(), which frees the pipe's buffer. Then, since the pipe itself was dynamically allocated, the memory for the pipe object itself is also returned to the resource pool. This all happens automatically.

I have a rough WIP PR that demonstrates most of this: #7049
