
The performance difference between the system write API and the liburing API when writing small data. #912

Closed
zhengshuxin opened this issue Jul 28, 2023 · 20 comments


@zhengshuxin

Hi, I wrote a sample to test the performance of writing small data with liburing versus the system write API, and found that the system write API is much faster than liburing. I don't know the reason. The test demo is at https://github.com/acl-dev/demo/blob/master/c/file/main.c, and some of the write code is shown below:

static int __open_flags = O_WRONLY | O_APPEND | O_CREAT;

void uring_write(const char *path, size_t count)
{
        int    fd = open(path, __open_flags, 0600), ret;
        size_t i;
        char   buf[1] = { '0' };
        assert(fd >= 0);

        struct io_uring_params params;
        struct io_uring uring;

        memset(&params, 0, sizeof(params));
        ret = io_uring_queue_init_params(100, &uring, &params);
        assert(ret >= 0);

        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;

        for (i = 0; i < count; i++) {
                sqe = io_uring_get_sqe(&uring);
                io_uring_prep_write(sqe, fd, buf, 1, i);
                io_uring_submit(&uring);
                ret = io_uring_wait_cqes(&uring, &cqe, 1, NULL, NULL);
                if (ret < 0) {
                        printf("io_uring_wait_cqe error=%s\r\n", strerror(-ret));
                        exit(1);
                }
                io_uring_cqe_seen(&uring, cqe);
        }
       ...
}

static void sys_write(const char *path, size_t count)
{
        int    fd = open(path, __open_flags, 0600), ret;
        char   buf[1] = { '0' };
        size_t i;
        assert(fd >= 0);

        for (i = 0; i < count; i++) {
                if (write(fd, buf, 1) <= 0) {
                        printf("write to %s error %s\r\n", path, strerror(errno));
                        exit(1);
                }
        }
        ...
}

Finally, I got the following performance results:

uring write, total write=1000000, cost=107121.84 ms, speed=9335.16
sys write, total write=1000000, cost=1595.58 ms, speed=626731.35

That is to say, writing small data with the system API is about 67x faster than with liburing. To find the reason, I used the perf tool to profile the liburing write path, and got the following:

Samples: 8K of event 'cpu-clock:pppH', 4000 Hz, Event count (approx.): 593388678 lost: 0/0 drop: 0/0                                                                           
Overhead  Shared Object        Symbol                                                                                                                                          
  47.07%  [kernel]             [k] finish_task_switch.isra.0
  34.24%  [kernel]             [k] __lock_text_start
   1.70%  [kernel]             [k] __softirqentry_text_start
   1.45%  liburing-ffi.so.2.4  [.] io_uring_submit
   1.39%  liburing-ffi.so.2.4  [.] _io_uring_get_cqe
   1.17%  [kernel]             [k] vbg_req_perform
   0.91%  [kernel]             [k] syscall_enter_from_user_mode
   0.82%  [kernel]             [k] __do_sys_io_uring_enter
   0.74%  [kernel]             [k] __fget_files
   0.71%  file                 [.] main
   0.59%  [kernel]             [k] io_cqring_wait
   0.50%  [kernel]             [k] __io_file_supports_nowait
   0.38%  [kernel]             [k] io_submit_sqes
   0.37%  [kernel]             [k] do_syscall_64
   0.37%  [kernel]             [k] __get_user_8
   0.36%  liburing-ffi.so.2.4  [.] __io_uring_get_cqe
   0.36%  [kernel]             [k] fput_many
   0.35%  [kernel]             [k] io_wqe_enqueue
   0.34%  [kernel]             [k] exit_to_user_mode_loop
   0.33%  [kernel]             [k] io_prep_rw
   0.30%  [kernel]             [k] io_queue_async_work
   0.28%  [kernel]             [k] io_wq_enqueue
   0.28%  [kernel]             [k] __io_queue_sqe
   0.26%  [kernel]             [k] __fget_light
   0.21%  [kernel]             [k] io_prep_async_work
   0.21%  [kernel]             [k] io_init_req
   0.18%  [kernel]             [k] __raw_callee_save___pv_queued_spin_unlock
   0.17%  [kernel]             [k] __schedule
   0.17%  [kernel]             [k] io_wqe_activate_free_worker
   0.16%  [kernel]             [k] fput
   0.15%  [kernel]             [k] io_req_prep
   0.15%  [kernel]             [k] io_run_task_work_sig
   0.14%  [kernel]             [k] hrtimer_start_range_ns
   0.14%  [kernel]             [k] syscall_exit_to_user_mode
   0.14%  [kernel]             [k] try_to_wake_up
   0.13%  [kernel]             [k] schedule
   0.13%  [kernel]             [k] io_arm_poll_handler
   0.13%  [kernel]             [k] mem_cgroup_handle_over_high
   0.12%  liburing-ffi.so.2.4  [.] io_uring_wait_cqes

Why is the CPU usage so high in finish_task_switch.isra.0 and __lock_text_start?

Can anybody tell me how to improve the performance of writing small data?

Thanks.
---zsx

@isilence
Collaborator

What kernel version is that? Until recently, io_uring offloaded buffered writes to a worker thread, which is very slow; that's where io_wq_enqueue comes from in the profile. There is a fast path for that nowadays, so my first suggestion is to try out a newer kernel.

@BenBE

BenBE commented Jul 28, 2023

Another aspect here might be the repeated waiting for single operations with io_uring. Does this pattern persist if you batch multiple requests per submission?

@zhengshuxin
Author

What kernel version is that? Until recently, io_uring offloaded buffered writes to a worker thread, which is very slow; that's where io_wq_enqueue comes from in the profile. There is a fast path for that nowadays, so my first suggestion is to try out a newer kernel.

I used kernels 6.0.9-060009-generic and 5.15.0-53-generic, and got the same result. I'll try a newer kernel and test again.
Thanks.

@zhengshuxin
Author

Another aspect here might be the repeated waiting for single operations with io_uring. Does this pattern persist if you batch multiple requests per submission?

I can't batch multiple requests, because I want to use the io_uring API to hook the system API in order to hijack the IO path of DB engines such as WiredTiger, which is used by MongoDB. The WiredTiger engine performs a lot of small read/write operations and got much slower performance that way, so I wrote a simple demo to test it and got the result above.

@hema2601

What kernel version is that? Until recently, io_uring offloaded buffered writes to a worker thread, which is very slow; that's where io_wq_enqueue comes from in the profile. There is a fast path for that nowadays, so my first suggestion is to try out a newer kernel.

I am facing a similar performance difference between normal write system calls and the ones executed through io_uring. Seeing this answer, I updated my system to run the latest kernel 6.4.11 on Ubuntu 23.04, but the write operation is still handled in a worker thread, since ftrace hits the io_uring_queue_async_work tracepoint.

Does the mentioned fast path require any options such as O_DIRECT or IORING_SETUP_IOPOLL?

@isilence
Collaborator

First of all, batching requests is good for performance, but it accounts for tens of percent, not the 10x difference in the original question.

I am facing a similar performance difference between normal write system calls and the ones executed through io_uring. Seeing this answer, I updated my system to run the latest kernel 6.4.11 on Ubuntu 23.04, but the write operation is still handled in a worker thread, since ftrace hits the io_uring_queue_async_work tracepoint.

Let's start with basic questions:

  1. Did you run the same reproducer as above? If not, do you have a repro?
  2. What filesystem do you use?
  3. What is the write pattern? What is the write size? Is it random or sequential?

Does the mentioned fast path require any options such as O_DIRECT or IORING_SETUP_IOPOLL?

In short, no, optimisations I mentioned are for buffered (non O_DIRECT) writes. O_DIRECT is using a different path, which is already swift enough.

@hema2601

  1. Did you run the same reproducer as above? If not, do you have a repro?

Yes, I did run the same experiment as the one mentioned above.
The output on my machine, after upgrading to Ubuntu 23.04 with kernel 6.4.11, is as follows:

uring write, total write=1000000, cost=5445.03 ms, speed=183653.78
sys write, total write=1000000, cost=769.35 ms, speed=1299806.98

When enabling various io_uring tracepoints, this is the general pattern, showing that the write is added to the async work list:

file-3977811 [001] ..... 447127.636303: sys_io_uring_enter -> 0x0
file-3977811 [001] ..... 447127.636303: sys_io_uring_enter(fd: 4, to_submit: 1, min_complete: 0, flags: 0, argp: 0, argsz: 8)
file-3977811 [001] ..... 447127.636303: io_uring_submit_req: ring 00000000ed62390b, req 0000000044765cb2, user_data 0x0, opcode WRITE, flags 0x0, sq_thread 0
file-3977811 [001] ..... 447127.636303: io_uring_file_get: ring 00000000ed62390b, req 0000000044765cb2, user_data 0x0, fd 3
file-3977811 [001] ..... 447127.636304: io_uring_queue_async_work: ring 00000000ed62390b, request 0000000044765cb2, user_data 0x0, opcode WRITE, flags 0xe0040000, hashed queue, work 00000000fa55aaba
file-3977811 [001] ..... 447127.636304: sys_io_uring_enter -> 0x1
file-3977811 [001] ..... 447127.636304: sys_io_uring_enter(fd: 4, to_submit: 0, min_complete: 1, flags: 1, argp: 0, argsz: 8)
file-3977811 [001] ..... 447127.636304: io_uring_cqring_wait: ring 00000000ed62390b, min_events 1
file-3977811 [001] ..... 447127.636308: io_uring_task_work_run: tctx 00000000fafc97eb, count 1, loops 1

  2. What filesystem do you use?

My current file system is ext4. I have not tried performing file writes using io_uring on top of other file systems.

  3. What is the write pattern? What is the write size? Is it random or sequential?

The writes are sequential, 1 byte each.

I'd be happy to provide additional information if anything is still unclear. Thank you.

@isilence
Collaborator

I've been talking to Stefan; in short, that optimisation apparently doesn't support ext4.

TL;DR: it depends on ext4 being converted to iomap, which is apparently planned / in progress, but I don't have any prediction for when that will happen.

@zhengshuxin
Author

I think it's very important for io_uring to read/write small data as quickly as the system IO API, especially for DB engines such as Berkeley DB, WiredTiger, etc. With io_uring's high performance and async IO, I can hook the read/write IO API of a DB engine and use coroutines to build a high-concurrency, high-performance service.

@hema2601

Thank you for the info. Do you happen to know which file systems currently support the write fast path? I would be very curious whether io_uring can outperform the normal write system call when the writes are not handled by worker threads.

@redbaron

@hema2601, AFAIK iomap originated from XFS, so if the io_uring fast path depends on iomap, then XFS should give you better results.

@isilence
Collaborator

isilence commented Sep 5, 2023

Do you happen to know which file systems currently support the write fast path?

xfs and btrfs

I would be very curious whether io_uring can outperform the normal write system call when the writes are not handled by worker threads.

Speaking of buffered writes, the only upper hand io_uring has is syscall batching and maybe registered files, and unless the batching win is large enough it won't outperform the plain syscall. The task/process/thread does the same amount of work either way, e.g. the memcpy into the page cache, and without io-wq there is no asynchrony for it yet.

@uNetworkingAB

uNetworkingAB commented Dec 16, 2023

I can confirm that I have similar issues. The reason I stopped the work I did up until #862 was that the send syscall outperforms io_uring send for small/medium writes, especially if the CPU cache is small.

My problem is that io_uring performs really well for the receiving part, but sending smaller messages is heavier, since with the send syscall the fast path is:

  1. Generate some response in a CPU-cached vector ("block of memory")
  2. Call send syscall -> if ok then never mind, it succeeded (fast path almost always taken for shorter messages)
  3. In the worst case, move the response to some less speedy block of memory, most likely not in cache

With io_uring you have to call prep_send and io_uring_submit for the fast path above, and that for me is slower than the send syscall. And if you take the "proactive" approach it performs WAY worse, since you exhaust your CPU cache immediately. I've tried all kinds of variations, such as batching 16 send calls before io_uring_submit, but it does not outperform send.

Optimizing small sends with io_uring so that the CPU-cache-optimal way of sending above performs better than the send syscall should be possible, especially since I am using all of the modern features like direct descriptors, and everything is cache aware and on huge pages, etc.
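The three-step fast path described above can be sketched with the plain send syscall (the function name and buffer handling are illustrative): try a non-blocking send first, and only fall back to buffering whatever bytes the kernel did not take.

```c
/* Sketch of the "optimistic send" fast path: attempt a non-blocking
 * send() and stash the remainder for a later writability event.
 * Returns bytes accepted by the kernel; *pending_len gets the leftover. */
#include <errno.h>
#include <string.h>
#include <sys/socket.h>

ssize_t try_send(int sock, const char *data, size_t len,
                 char *pending, size_t *pending_len)
{
        ssize_t n = send(sock, data, len, MSG_DONTWAIT | MSG_NOSIGNAL);
        if (n < 0) {
                if (errno != EAGAIN && errno != EWOULDBLOCK)
                        return -1;      /* real error */
                n = 0;                  /* kernel took nothing */
        }
        /* Slow path: copy the remainder out of the hot buffer. For short
         * messages this memcpy is almost never reached. */
        *pending_len = len - (size_t)n;
        memcpy(pending, data + n, *pending_len);
        return n;
}
```

For short messages the fast path almost always completes in step 2, so the response never leaves the CPU-cached buffer it was generated in.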

@uNetworkingAB

Oh yeah, I read my whole thread and now remember: the main issue for smaller sends can be summed up like so:

A send syscall that can take direct descriptors would be killer, I think.

That, or somehow optimize prep_send + io_uring_submit to outperform the send syscall (it doesn't for small sends).

@axboe
Owner

axboe commented Dec 18, 2023

I'm surprised you see a difference between send(2) and IORING_OP_SEND for that case; have you done any profiling that could shed some light on this? Assuming your application is threaded, are you using registered rings to avoid the fget/fput on the ring fd itself?

@zhengshuxin
Author

How is the progress on this issue? I really hope to see some significant progress on it.
Thanks
--zsx

@boxerab

boxerab commented Oct 17, 2024

@zhengshuxin can you re-open this issue if it's still broken?

@isilence
Collaborator

Apparently, Jens was cleaning up GitHub tasks and may have overdone it a bit. Is it still ext4 in your case?

@zhengshuxin
Author

zhengshuxin commented Oct 18, 2024

Apparently, Jens was cleaning up GitHub tasks and may have overdone it a bit. Is it still ext4 in your case?

Yes, the case is on ext4 on Ubuntu with kernel 6.8. And I've added a read comparison between sys read and liburing read, and found that they have similar performance. The newly added read code is in https://github.com/acl-dev/demo/blob/master/c/file/main.c. Below are the performance results of read and write for sys and liburing.

./file -n 100000
IO_URING_VERSION_MAJOR: 2, IO_URING_VERSION_MINOR: 8

uring_write: open file.txt ok, fd=3
uring_write: write char=0
uring_write: write char=1
uring_write: write char=2
uring_write: write char=3
uring_write: write char=4
uring_write: write char=5
uring_write: write char=6
uring_write: write char=7
uring_write: write char=8
uring_write: write char=9
close file.txt ok, fd=3
uring write, total write=100000, cost=1541.28 ms, speed=64881.18
-------------------------------------------------------
sys_write: open file.txt ok, fd=3
sys_write: write char=0
sys_write: write char=1
sys_write: write char=2
sys_write: write char=3
sys_write: write char=4
sys_write: write char=5
sys_write: write char=6
sys_write: write char=7
sys_write: write char=8
sys_write: write char=9
close file.txt ok, fd=3
sys write, total write=100000, cost=80.58 ms, speed=1240925.73
========================================================
uring_read: read open file.txt ok, fd=3
uring_read: char[0]=0
uring_read: char[1]=1
uring_read: char[2]=2
uring_read: char[3]=3
uring_read: char[4]=4
uring_read: char[5]=5
uring_read: char[6]=6
uring_read: char[7]=7
uring_read: char[8]=8
uring_read: char[9]=9
close fd=3
uring read, total read=100000, cost=84.52 ms, speed=1183179.91
-------------------------------------------------------
sys_read: open file.txt ok, fd=3
sys_read: char[0]=0
sys_read: char[1]=1
sys_read: char[2]=2
sys_read: char[3]=3
sys_read: char[4]=4
sys_read: char[5]=5
sys_read: char[6]=6
sys_read: char[7]=7
sys_read: char[8]=8
sys_read: char[9]=9
sys read, total read=100000, cost=67.22 ms, speed=1487586.09

@zhengshuxin
Author

@zhengshuxin can you re-open this issue if it's still broken?

I don't have permission to re-open the issue.
