
The performance difference between the system write API and the liburing API when writing small data. #912

Closed
zhengshuxin opened this issue Jul 28, 2023 · 20 comments


@zhengshuxin

Hi, I wrote a sample to test the performance of writing small data with liburing versus the system write API, and found that the system write API is much faster than liburing. I don't know the reason. The test demo is at https://github.com/acl-dev/demo/blob/master/c/file/main.c, and some of the write code is shown below:

static int __open_flags = O_WRONLY | O_APPEND | O_CREAT;

void uring_write(const char *path, size_t count)
{
        int    fd = open(path, __open_flags, 0600), ret;
        size_t i;
        char   buf[1] = { '0' };
        assert(fd >= 0);

        struct io_uring_params params;
        struct io_uring uring;

        memset(&params, 0, sizeof(params));
        ret = io_uring_queue_init_params(100, &uring, &params);
        assert(ret >= 0);

        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;

        for (i = 0; i < count; i++) {
                sqe = io_uring_get_sqe(&uring);
                io_uring_prep_write(sqe, fd, buf, 1, i);
                io_uring_submit(&uring);
                ret = io_uring_wait_cqes(&uring, &cqe, 1, NULL, NULL);
                if (ret < 0) {
                        printf("io_uring_wait_cqe error=%s\r\n", strerror(-ret));
                        exit(1);
                }
                io_uring_cqe_seen(&uring, cqe);
        }
       ...
}

static void sys_write(const char *path, size_t count)
{
        int    fd = open(path, __open_flags, 0600), ret;
        char   buf[1] = { '0' };
        size_t i;
        assert(fd >= 0);

        for (i = 0; i < count; i++) {
                if (write(fd, buf, 1) <= 0) {
                        printf("write to %s error %s\r\n", path, strerror(errno));
                        exit(1);
                }
        }
        ...
}

Finally, I got the following performance results:

uring write, total write=1000000, cost=107121.84 ms, speed=9335.16
sys write, total write=1000000, cost=1595.58 ms, speed=626731.35

That is to say, writing small data with the system API is about 67x faster than with liburing. To find the reason, I used the perf tool to profile the liburing write path, and got the following:

Samples: 8K of event 'cpu-clock:pppH', 4000 Hz, Event count (approx.): 593388678 lost: 0/0 drop: 0/0                                                                           
Overhead  Shared Object        Symbol                                                                                                                                          
  47.07%  [kernel]             [k] finish_task_switch.isra.0
  34.24%  [kernel]             [k] __lock_text_start
   1.70%  [kernel]             [k] __softirqentry_text_start
   1.45%  liburing-ffi.so.2.4  [.] io_uring_submit
   1.39%  liburing-ffi.so.2.4  [.] _io_uring_get_cqe
   1.17%  [kernel]             [k] vbg_req_perform
   0.91%  [kernel]             [k] syscall_enter_from_user_mode
   0.82%  [kernel]             [k] __do_sys_io_uring_enter
   0.74%  [kernel]             [k] __fget_files
   0.71%  file                 [.] main
   0.59%  [kernel]             [k] io_cqring_wait
   0.50%  [kernel]             [k] __io_file_supports_nowait
   0.38%  [kernel]             [k] io_submit_sqes
   0.37%  [kernel]             [k] do_syscall_64
   0.37%  [kernel]             [k] __get_user_8
   0.36%  liburing-ffi.so.2.4  [.] __io_uring_get_cqe
   0.36%  [kernel]             [k] fput_many
   0.35%  [kernel]             [k] io_wqe_enqueue
   0.34%  [kernel]             [k] exit_to_user_mode_loop
   0.33%  [kernel]             [k] io_prep_rw
   0.30%  [kernel]             [k] io_queue_async_work
   0.28%  [kernel]             [k] io_wq_enqueue
   0.28%  [kernel]             [k] __io_queue_sqe
   0.26%  [kernel]             [k] __fget_light
   0.21%  [kernel]             [k] io_prep_async_work
   0.21%  [kernel]             [k] io_init_req
   0.18%  [kernel]             [k] __raw_callee_save___pv_queued_spin_unlock
   0.17%  [kernel]             [k] __schedule
   0.17%  [kernel]             [k] io_wqe_activate_free_worker
   0.16%  [kernel]             [k] fput
   0.15%  [kernel]             [k] io_req_prep
   0.15%  [kernel]             [k] io_run_task_work_sig
   0.14%  [kernel]             [k] hrtimer_start_range_ns
   0.14%  [kernel]             [k] syscall_exit_to_user_mode
   0.14%  [kernel]             [k] try_to_wake_up
   0.13%  [kernel]             [k] schedule
   0.13%  [kernel]             [k] io_arm_poll_handler
   0.13%  [kernel]             [k] mem_cgroup_handle_over_high
   0.12%  liburing-ffi.so.2.4  [.] io_uring_wait_cqes

Why is the CPU usage so high in finish_task_switch.isra.0 and __lock_text_start?

Can anybody tell me how to improve the performance of writing small data?

Thanks.
---zsx

@isilence
Collaborator

What kernel version is that? Until recently, io_uring offloaded buffered writes to a worker thread, which is very slow; that's where io_wq_enqueue comes from in the profile. There is a fast path for that nowadays, so my first suggestion is to try out a newer kernel.

@BenBE

BenBE commented Jul 28, 2023

Another aspect here might be the repeated waiting for single operations with io_uring. Does this pattern persist if you batch multiple requests per submission?

@zhengshuxin
Author

What kernel version is that? Until recently, io_uring offloaded buffered writes to a worker thread, which is very slow; that's where io_wq_enqueue comes from in the profile. There is a fast path for that nowadays, so my first suggestion is to try out a newer kernel.

I used kernels 6.0.9-060009-generic and 5.15.0-53-generic, and got the same result. I'll try a newer kernel and test again.
Thanks.

@zhengshuxin
Author

Another aspect here might be the repeated waiting for single operations with io_uring. Does this pattern persist if you batch multiple requests per submission?

I can't batch multiple requests, because I want to use the io_uring API to hook the system API in order to hijack the IO path of DB engines such as WiredTiger, which is used by MongoDB. The WiredTiger engine performs a lot of small read/write operations and got much slower performance that way, so I wrote a simple demo to test it and got the result above.

@hema2601

What kernel version is that? Until recently, io_uring offloaded buffered writes to a worker thread, which is very slow; that's where io_wq_enqueue comes from in the profile. There is a fast path for that nowadays, so my first suggestion is to try out a newer kernel.

I am facing a similar performance difference between normal write system calls and the ones executed through io_uring. Seeing this answer, I updated my system to run the latest kernel 6.4.11 on Ubuntu 23.04, but the write operation is still handled in a worker thread, since ftrace hits the io_uring_queue_async_work tracepoint.

Does the mentioned fast path require any options such as O_DIRECT or IORING_SETUP_IOPOLL?

@isilence
Collaborator

First of all, batching requests is good for performance, but it accounts for tens of percent, not the 10x difference in the original question.

I am facing a similar performance difference between normal write system calls and the ones executed through io_uring. Seeing this answer, I updated my system to run the latest kernel 6.4.11 on Ubuntu 23.04, but the write operation is still handled in a worker thread, since ftrace hits the io_uring_queue_async_work tracepoint.

Let's start with basic questions:

  1. Did you run the same reproducer as above? If not, do you have a repro?
  2. What filesystem do you use?
  3. What is the write pattern? What is the write size? Is it random or sequential?

Does the mentioned fast path require any options such as O_DIRECT or IORING_SETUP_IOPOLL?

In short, no, optimisations I mentioned are for buffered (non O_DIRECT) writes. O_DIRECT is using a different path, which is already swift enough.

@hema2601

  1. Did you run the same reproducer as above? If not, do you have a repro?

Yes, I did run the same experiment as the one mentioned above.
The output on my machine, after upgrading to Ubuntu 23.04 with kernel 6.4.11, is as follows:

uring write, total write=1000000, cost=5445.03 ms, speed=183653.78
sys write, total write=1000000, cost=769.35 ms, speed=1299806.98

When enabling various io_uring tracepoints, this is the general pattern, showing that the write is added to the async work list:

file-3977811 [001] ..... 447127.636303: sys_io_uring_enter -> 0x0
file-3977811 [001] ..... 447127.636303: sys_io_uring_enter(fd: 4, to_submit: 1, min_complete: 0, flags: 0, argp: 0, argsz: 8)
file-3977811 [001] ..... 447127.636303: io_uring_submit_req: ring 00000000ed62390b, req 0000000044765cb2, user_data 0x0, opcode WRITE, flags 0x0, sq_thread 0
file-3977811 [001] ..... 447127.636303: io_uring_file_get: ring 00000000ed62390b, req 0000000044765cb2, user_data 0x0, fd 3
file-3977811 [001] ..... 447127.636304: io_uring_queue_async_work: ring 00000000ed62390b, request 0000000044765cb2, user_data 0x0, opcode WRITE, flags 0xe0040000, hashed queue, work 00000000fa55aaba
file-3977811 [001] ..... 447127.636304: sys_io_uring_enter -> 0x1
file-3977811 [001] ..... 447127.636304: sys_io_uring_enter(fd: 4, to_submit: 0, min_complete: 1, flags: 1, argp: 0, argsz: 8)
file-3977811 [001] ..... 447127.636304: io_uring_cqring_wait: ring 00000000ed62390b, min_events 1
file-3977811 [001] ..... 447127.636308: io_uring_task_work_run: tctx 00000000fafc97eb, count 1, loops 1

  2. What filesystem do you use?

My current file system is ext4. I have not tried performing file writes using io_uring on top of other file systems.

  3. What is the write pattern? What is the write size? Is it random or sequential?

The writes are sequential, 1 byte each.

I'd be happy to provide additional information if anything is still unclear. Thank you.

@isilence
Collaborator

I've been talking to Stefan; in short, that optimisation apparently doesn't support ext4.

TL;DR: it depends on ext4 being converted to iomap, which is apparently planned / in progress, but I don't have any prediction for when that will happen.

@zhengshuxin
Author

I think it's very important for io_uring to read/write small data as quickly as the system IO API, especially for DB engines such as Berkeley DB, WiredTiger, etc. With io_uring's high performance and async IO, I can hook the read/write IO API of a DB engine and use coroutines to build a high-concurrency, high-performance service.

@hema2601

Thank you for the info. Do you happen to know which file systems currently support the write fast path? I would be very curious whether io_uring can outperform the normal write system call when the writes are not handled by worker threads.

@redbaron

@hema2601, AFAIK iomap originated from XFS, so if the io_uring fast path depends on iomap, then XFS should give you better results.

@isilence
Collaborator

isilence commented Sep 5, 2023

Do you happen to know which file systems currently support the write fast path?

xfs and btrfs

I would be very curious whether io_uring can outperform the normal write system call when the writes are not handled by worker threads.

Speaking of buffered writes, the only upper hand io_uring has is syscall batching and maybe registered files, and unless the batching win is large enough it won't outperform the plain syscall. The task/process/thread does the same amount of work either way, e.g. the memcpy into the page cache, and without io-wq there is no asynchrony for it yet.

@uNetworkingAB

uNetworkingAB commented Dec 16, 2023

I can confirm that I have similar issues. The reason I stopped the work I did up until #862 was that the send syscall outperforms io_uring send for small/medium writes, especially if the CPU cache is small.

My problem is that io_uring performs really well for the receiving part, but sending smaller messages is heavier, since with the send syscall the fast path is:

  1. Generate some response in a CPU-cached vector ("block of memory")
  2. Call send syscall -> if ok then never mind, it succeeded (fast path almost always taken for shorter messages)
  3. In the worst case, move the response to some less speedy block of memory, most likely not in cache

With io_uring you have to call prep_send and io_uring_submit for the fast path above, and that for me is slower than the send syscall. And if you take the "proactive" approach it performs WAY worse, since you exhaust your CPU cache immediately. I've tried all kinds of variations, such as batching 16 send calls before io_uring_submit, but it does not outperform send.

Optimizing small sends with io_uring so that the CPU-cache-optimal way of sending above performs better than the send syscall should be possible, especially since I am using all of the modern features like direct descriptors, and everything is cache aware and on huge pages, etc.
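The three-step fast path described above can be sketched with the plain send syscall (the function name and buffer handling are illustrative): try a non-blocking send first, and only fall back to buffering whatever bytes the kernel did not take.

```c
/* Sketch of the "optimistic send" fast path: attempt a non-blocking
 * send() and stash the remainder for a later writability event.
 * Returns bytes accepted by the kernel; *pending_len gets the leftover. */
#include <errno.h>
#include <string.h>
#include <sys/socket.h>

ssize_t try_send(int sock, const char *data, size_t len,
                 char *pending, size_t *pending_len)
{
        ssize_t n = send(sock, data, len, MSG_DONTWAIT | MSG_NOSIGNAL);
        if (n < 0) {
                if (errno != EAGAIN && errno != EWOULDBLOCK)
                        return -1;      /* real error */
                n = 0;                  /* kernel took nothing */
        }
        /* Slow path: copy the remainder out of the hot buffer. For short
         * messages this memcpy is almost never reached. */
        *pending_len = len - (size_t)n;
        memcpy(pending, data + n, *pending_len);
        return n;
}
```

For short messages the fast path almost always completes in step 2, so the response never leaves the CPU-cached buffer it was generated in.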

@uNetworkingAB

Oh yeah, I read my whole thread and now remember: the main issue for smaller sends can be summed up like so:

A send syscall that can take direct descriptors would be killer, I think.

That, or somehow optimize prep_send + io_uring_submit to outperform the send syscall (it doesn't for small sends).

@axboe
Owner

axboe commented Dec 18, 2023

I'm surprised you see a difference between send(2) and IORING_OP_SEND for that case; have you done any profiling that could shed some light on this? Assuming your application is threaded, are you using registered rings to avoid the fget/fput on the ring fd itself?

@zhengshuxin
Author

How is the progress on this issue? I really hope to see some significant progress on it.
Thanks
--zsx

@boxerab

boxerab commented Oct 17, 2024

@zhengshuxin can you re-open this issue if it's still broken?

@isilence
Collaborator

Apparently, Jens was cleaning up GitHub tasks and may have overdone it a bit. Is it still ext4 in your case?

@zhengshuxin
Author

zhengshuxin commented Oct 18, 2024

Apparently, Jens was cleaning up GitHub tasks and may have overdone it a bit. Is it still ext4 in your case?

Yes, the case is on ext4 on Ubuntu with kernel 6.8. And I've added a read comparison between sys read and liburing read, and found that they have similar performance. The newly added read code is in https://github.com/acl-dev/demo/blob/master/c/file/main.c. Below are the performance results of read and write for sys and liburing.

./file -n 100000
IO_URING_VERSION_MAJOR: 2, IO_URING_VERSION_MINOR: 8

uring_write: open file.txt ok, fd=3
uring_write: write char=0
uring_write: write char=1
uring_write: write char=2
uring_write: write char=3
uring_write: write char=4
uring_write: write char=5
uring_write: write char=6
uring_write: write char=7
uring_write: write char=8
uring_write: write char=9
close file.txt ok, fd=3
uring write, total write=100000, cost=1541.28 ms, speed=64881.18
-------------------------------------------------------
sys_write: open file.txt ok, fd=3
sys_write: write char=0
sys_write: write char=1
sys_write: write char=2
sys_write: write char=3
sys_write: write char=4
sys_write: write char=5
sys_write: write char=6
sys_write: write char=7
sys_write: write char=8
sys_write: write char=9
close file.txt ok, fd=3
sys write, total write=100000, cost=80.58 ms, speed=1240925.73
========================================================
uring_read: read open file.txt ok, fd=3
uring_read: char[0]=0
uring_read: char[1]=1
uring_read: char[2]=2
uring_read: char[3]=3
uring_read: char[4]=4
uring_read: char[5]=5
uring_read: char[6]=6
uring_read: char[7]=7
uring_read: char[8]=8
uring_read: char[9]=9
close fd=3
uring read, total read=100000, cost=84.52 ms, speed=1183179.91
-------------------------------------------------------
sys_read: open file.txt ok, fd=3
sys_read: char[0]=0
sys_read: char[1]=1
sys_read: char[2]=2
sys_read: char[3]=3
sys_read: char[4]=4
sys_read: char[5]=5
sys_read: char[6]=6
sys_read: char[7]=7
sys_read: char[8]=8
sys_read: char[9]=9
sys read, total read=100000, cost=67.22 ms, speed=1487586.09

@zhengshuxin
Author

@zhengshuxin can you re-open this issue if it's still broken?

I don't have permission to re-open the issue.
