The performance difference between system write API and liburing API when writing small data. #912
What kernel version is that? Until recently io_uring was offloading buffered writes to a worker thread, which is very slow; that's likely where the difference comes from.
Another aspect here might be the repeated waiting for single operations with io_uring. Does this pattern persist if you batch multiple requests per submission?
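For reference, a minimal sketch of what batching looks like with liburing, assuming an already-initialized ring and an open fd; the helper name write_batch is illustrative, not from the demo:

```c
#include <sys/types.h>
#include <liburing.h>

#define BATCH 16

/* Queue BATCH one-byte writes, then submit them all with a single syscall. */
static int write_batch(struct io_uring *ring, int fd, const char *buf, off_t off)
{
    struct io_uring_cqe *cqe;
    int i, j, ret;

    for (i = 0; i < BATCH; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        if (!sqe)
            break;                 /* SQ ring full: submit what we have */
        io_uring_prep_write(sqe, fd, buf + i, 1, off + i);
    }

    /* one io_uring_enter() for all queued SQEs, waiting for all completions */
    ret = io_uring_submit_and_wait(ring, i);
    if (ret < 0)
        return ret;

    for (j = 0; j < i; j++) {
        ret = io_uring_wait_cqe(ring, &cqe);
        if (ret < 0)
            return ret;
        io_uring_cqe_seen(ring, cqe);
    }
    return 0;
}
```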
I used kernels 6.0.9-060009-generic and 5.15.0-53-generic and got the same result. I'll try a newer kernel and test again.
I can't batch multiple requests, because I want to use the io_uring API to hook the system API in order to hijack the IO path of a DB engine such as WiredTiger, which is used by MongoDB. The WiredTiger engine performs lots of small reads and writes and gets much slower performance that way, so I wrote a simple demo to test it and got that result.
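A hypothetical sketch of such a hook: interpose write(2) via LD_PRELOAD and route it through io_uring. The names hook_ring and hook_init are illustrative, and a real hook for an engine like WiredTiger would also need pwrite, thread safety, and reentrancy handling:

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <unistd.h>
#include <liburing.h>

static struct io_uring hook_ring;            /* illustrative global ring */
static ssize_t (*real_write)(int, const void *, size_t);

__attribute__((constructor))
static void hook_init(void)
{
    real_write = (ssize_t (*)(int, const void *, size_t))
                 dlsym(RTLD_NEXT, "write");
    io_uring_queue_init(64, &hook_ring, 0);
}

ssize_t write(int fd, const void *buf, size_t count)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(&hook_ring);
    struct io_uring_cqe *cqe;
    ssize_t res;

    if (!sqe || !real_write)   /* not initialised or SQ full: fall back */
        return real_write ? real_write(fd, buf, count) : -1;

    /* offset -1 means "use and advance the current file position" */
    io_uring_prep_write(sqe, fd, buf, count, -1);
    io_uring_submit_and_wait(&hook_ring, 1);
    io_uring_wait_cqe(&hook_ring, &cqe);
    res = cqe->res;
    io_uring_cqe_seen(&hook_ring, cqe);
    if (res < 0) {             /* CQE res carries -errno */
        errno = -res;
        return -1;
    }
    return res;
}
```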
I am facing a similar performance difference between normal write system calls and the ones executed through io_uring. Seeing this answer, I updated my system to run the latest kernel, 6.4.11, on Ubuntu 23.04, but the write operation is still handled in a worker thread, since ftrace hits the io_uring_queue_async_work tracepoint. Does the mentioned fast path require any options such as O_DIRECT or IORING_SETUP_IOPOLL?
First of all, batching requests is good for performance, but it doesn't account for a 10x difference as in the original question; it's more in the tens of percent.
Let's start with basic questions:
In short, no: the optimisations I mentioned are for buffered (non-O_DIRECT) writes. O_DIRECT uses a different path, which is already swift enough.
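For illustration, a minimal sketch of what the O_DIRECT path requires, assuming a 4 KiB logical block size: buffer, length, and offset must all be block-aligned, so the benchmark's 1-byte writes can't be issued this way:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <liburing.h>

/* Write one 4 KiB block through io_uring with O_DIRECT. */
int direct_write_block(struct io_uring *ring, const char *path)
{
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    void *buf;
    int fd, res;

    fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return -1;

    /* O_DIRECT needs an aligned buffer and block-sized length/offset */
    if (posix_memalign(&buf, 4096, 4096)) {
        close(fd);
        return -1;
    }
    memset(buf, 'x', 4096);

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_write(sqe, fd, buf, 4096, 0);
    io_uring_submit(ring);

    io_uring_wait_cqe(ring, &cqe);
    res = cqe->res;
    io_uring_cqe_seen(ring, cqe);

    free(buf);
    close(fd);
    return res;
}
```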
Yes, I did run the same experiment as the one mentioned above.
When enabling various io_uring tracepoints, this is the general pattern, showing that the write is added to the async work list.
My current file system is ext4. I have not tried performing file writes using io_uring on top of other file systems.
The writes are performed sequentially and write 1 byte each. I'd be happy to provide additional information if anything is still unclear. Thank you.
I've been talking to Stefan; in short, that optimisation apparently doesn't support ext4. TL;DR: it will depend on ext4 being converted to iomap, which is apparently planned / in progress, but I can't make any prediction for when that lands.
I think it's very important for io_uring to read/write small data as quickly as the system IO API, especially for DB engines such as Berkeley DB and WiredTiger. With io_uring's high performance and async IO, I can hook a DB engine's read/write API and use coroutines to build a high-concurrency service with high performance.
Thank you for the info. Do you happen to know which file systems currently support the write fast path? I would be very curious whether io_uring can outperform the normal write system call when the writes are not handled by worker threads.
@hema2601, AFAIK iomap originated from XFS, so if the io_uring fast path depends on iomap, then XFS should give you better results.
xfs and btrfs
Speaking about buffered writes, the only upper hand io_uring has is syscall batching and maybe registered files, and unless the batching gain is large enough it won't outperform. The task/process/thread will be doing the same amount of work, e.g. the memcpy to/from the page cache, and without io-wq there is no asynchronicity for it yet.
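A minimal sketch of those two levers together, assuming a ring that is already set up (the helper name is illustrative): the file is registered once, and the SQE then refers to its registered index with IOSQE_FIXED_FILE instead of the raw fd:

```c
#include <liburing.h>

int write_via_registered_file(struct io_uring *ring, int fd)
{
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    char byte = '0';
    int ret;

    /* register once up front; the per-op fget/fput is then skipped */
    ret = io_uring_register_files(ring, &fd, 1);
    if (ret)
        return ret;

    sqe = io_uring_get_sqe(ring);
    /* the "fd" argument is now index 0 into the registered file table */
    io_uring_prep_write(sqe, 0, &byte, 1, 0);
    sqe->flags |= IOSQE_FIXED_FILE;

    ret = io_uring_submit_and_wait(ring, 1);
    if (ret < 0)
        return ret;
    ret = io_uring_wait_cqe(ring, &cqe);
    if (ret < 0)
        return ret;
    ret = cqe->res;
    io_uring_cqe_seen(ring, cqe);
    return ret;
}
```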
I can confirm that I have similar issues. The reason I stopped the work I did up until #862 was that the send syscall outperforms io_uring send for small/medium writes, especially if the CPU cache is small. My problem is that io_uring performs really well for the receiving part, but sending smaller messages is heavier, since with the send syscall the fast path is just:
With io_uring you have to call prep_send and io_uring_submit for the above fast path, and that for me is slower than the send syscall. And if you take the "proactive" approach it performs WAY worse, since you exhaust your CPU cache immediately. I've tried all kinds of variations, such as batching 16 send calls before io_uring_submit, but it does not outperform send. Optimizing small sends with io_uring so that the above CPU-cache-optimal way of sending beats the send syscall should be possible, especially since I am using all of the modern features like direct descriptors, and everything is cache aware and on huge pages, etc.
Oh yeah, I re-read my whole thread and remembered: the main issue for smaller sends can be summed up like so:
That, or somehow optimize prep_send + io_uring_submit to outperform the send syscall (it doesn't for small sends).
I'm surprised you see a difference between send(2) and IORING_OP_SEND for that case; have you done any profiling that could shed some light on this? Assuming your application is threaded, are you using registered rings to avoid the fget/fput on the ring fd itself?
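For reference, a minimal sketch of the registered-ring setup being suggested, assuming liburing 2.2 or newer:

```c
#include <liburing.h>

/* Set up a ring and register its fd so subsequent io_uring_enter()
 * calls from liburing use the registered index, skipping the
 * per-syscall fget/fput on the ring fd. */
int setup_registered_ring(struct io_uring *ring)
{
    int ret = io_uring_queue_init(256, ring, 0);
    if (ret)
        return ret;
    /* returns the number of ring fds registered (1) on success */
    ret = io_uring_register_ring_fd(ring);
    return ret == 1 ? 0 : ret;
}
```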
What's the progress on this issue? I really hope to see some significant progress on it.
@zhengshuxin can you re-open this issue if it's still broken?
Apparently Jens was cleaning up GitHub tasks and might have overdone it a bit. Is it still ext4 in your case?
Yes, the case is ext4 on Ubuntu with kernel 6.8. I've also added a read comparison between sys read and liburing read and found that they have similar performance. The newly added read code is in https://github.com/acl-dev/demo/blob/master/c/file/main.c. Below are the performance results of read and write for sys and liburing:
./file -n 100000
IO_URING_VERSION_MAJOR: 2, IO_URING_VERSION_MINOR: 8
uring_write: open file.txt ok, fd=3
uring_write: write char=0
uring_write: write char=1
uring_write: write char=2
uring_write: write char=3
uring_write: write char=4
uring_write: write char=5
uring_write: write char=6
uring_write: write char=7
uring_write: write char=8
uring_write: write char=9
close file.txt ok, fd=3
uring write, total write=100000, cost=1541.28 ms, speed=64881.18
-------------------------------------------------------
sys_write: open file.txt ok, fd=3
sys_write: write char=0
sys_write: write char=1
sys_write: write char=2
sys_write: write char=3
sys_write: write char=4
sys_write: write char=5
sys_write: write char=6
sys_write: write char=7
sys_write: write char=8
sys_write: write char=9
close file.txt ok, fd=3
sys write, total write=100000, cost=80.58 ms, speed=1240925.73
========================================================
uring_read: read open file.txt ok, fd=3
uring_read: char[0]=0
uring_read: char[1]=1
uring_read: char[2]=2
uring_read: char[3]=3
uring_read: char[4]=4
uring_read: char[5]=5
uring_read: char[6]=6
uring_read: char[7]=7
uring_read: char[8]=8
uring_read: char[9]=9
close fd=3
uring read, total read=100000, cost=84.52 ms, speed=1183179.91
-------------------------------------------------------
sys_read: open file.txt ok, fd=3
sys_read: char[0]=0
sys_read: char[1]=1
sys_read: char[2]=2
sys_read: char[3]=3
sys_read: char[4]=4
sys_read: char[5]=5
sys_read: char[6]=6
sys_read: char[7]=7
sys_read: char[8]=8
sys_read: char[9]=9
sys read, total read=100000, cost=67.22 ms, speed=1487586.09
I have no permission to re-open the issue. |
Hi, I wrote a sample to test the performance of writing small data with liburing versus the system write API, and found that the system write API is much faster than the liburing API. I don't know the reason. The test demo is at https://github.com/acl-dev/demo/blob/master/c/file/main.c, and some of the write code is shown below:
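(The snippet did not survive here; below is a hypothetical reconstruction of the per-write pattern, one SQE and one submit-and-wait syscall per 1-byte write, not the author's exact code; see the linked main.c for the real thing.)

```c
#include <sys/types.h>
#include <liburing.h>

/* One 1-byte write, fully synchronous: prep, submit, wait, reap. */
static int uring_write_one(struct io_uring *ring, int fd, char c, off_t off)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    struct io_uring_cqe *cqe;
    int ret;

    io_uring_prep_write(sqe, fd, &c, 1, off);
    ret = io_uring_submit_and_wait(ring, 1);  /* one syscall per write */
    if (ret < 0)
        return ret;

    ret = io_uring_wait_cqe(ring, &cqe);
    if (ret < 0)
        return ret;
    ret = cqe->res;
    io_uring_cqe_seen(ring, cqe);
    return ret;
}
```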
In the end, I got the following performance result:
That is to say, writing small data with the system API is 67x faster than with liburing. To find the reason, I used the perf tool to inspect liburing's write path and got the following:
Why is so much CPU time spent in finish_task_switch.isra.0 and __lock_text_start?
Can anybody tell me how to improve the performance of writing small data?
Thanks.
---zsx