ZFS doesn't respect Linux kernel CPU isolation mechanisms #8908
Comments
Minimal quick and dirty patch that appears to work for me here: sjuxax@7c2a896
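(The commit itself isn't quoted here, but the general shape of a taskq-side fix is easy to sketch: round-robin the bound threads over the housekeeping cpumask instead of over all online CPUs. This is a sketch only; `last_cpu` is an illustrative module-local cursor, and the pre-5.18 `HK_FLAG_DOMAIN` spelling is assumed.)

```c
#include <linux/cpumask.h>
#include <linux/kthread.h>
#include <linux/sched/isolation.h>

/*
 * Sketch: round-robin taskq threads over housekeeping (non-isolated)
 * CPUs only, instead of over every online CPU.
 */
static int last_cpu = -1;

static void
example_bind_taskq_thread(struct task_struct *tsk)
{
	const struct cpumask *hk = housekeeping_cpumask(HK_FLAG_DOMAIN);
	int cpu;

	cpu = cpumask_next(last_cpu, hk);
	if (cpu >= nr_cpu_ids)
		cpu = cpumask_first(hk);

	last_cpu = cpu;
	kthread_bind(tsk, cpu);
}
```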
@sjuxax your observations are correct. The other place you would have to do this is in __thread_create() in module/spl/spl-thread.c. You can see a very primitive example here:
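A rough illustration of what that could look like for dedicated threads follows. This is a sketch, not the actual spl-thread.c code; names and error handling are illustrative, the pre-5.18 `HK_FLAG_DOMAIN` spelling is assumed, and note that `housekeeping_cpumask()` and `kthread_bind_mask()` are GPL-exported symbols, which is presumably part of the license trouble mentioned further down the thread.

```c
#include <linux/kthread.h>
#include <linux/sched/isolation.h>

/*
 * Sketch: create a kernel thread and, before it first runs, restrict
 * it to the housekeeping (non-isolated) cpumask.
 */
static struct task_struct *
example_thread_create(int (*func)(void *), void *data, const char *name)
{
	struct task_struct *tsk;

	tsk = kthread_create(func, data, "%s", name);
	if (IS_ERR(tsk))
		return (NULL);

	/*
	 * Binding must happen before the first wakeup; on kernels where
	 * kthread_bind_mask() is not available to the module,
	 * set_cpus_allowed_ptr() is a possible fallback.
	 */
	kthread_bind_mask(tsk, housekeeping_cpumask(HK_FLAG_DOMAIN));
	wake_up_process(tsk);

	return (tsk);
}
```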
@sjuxax would you mind opening a PR with the proposed fix for taskqs and dedicated threads? Then we can get you some better feedback and shouldn't lose track of this again.
@behlendorf Would it additionally be worth having a cpulist as an SPL module parameter that would bind those threads to defined CPUs? Something along the lines of the sketch below.
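A hypothetical sketch of such a parameter; the name `spl_taskq_cpulist` is invented for illustration and is not an existing SPL option. It uses the kernel's `cpulist_parse()` to turn a string like "0-7,16-23" into a cpumask.

```c
#include <linux/cpumask.h>
#include <linux/module.h>

/* Hypothetical parameter, e.g. modprobe spl spl_taskq_cpulist=0-7,16-23 */
static char *spl_taskq_cpulist = NULL;
module_param(spl_taskq_cpulist, charp, 0444);
MODULE_PARM_DESC(spl_taskq_cpulist,
	"CPUs that taskq threads may be bound to (cpulist format)");

static cpumask_var_t spl_taskq_cpumask;

static int
spl_taskq_cpumask_init(void)
{
	if (!zalloc_cpumask_var(&spl_taskq_cpumask, GFP_KERNEL))
		return (-ENOMEM);

	/* Default to every online CPU when the parameter is unset. */
	if (spl_taskq_cpulist == NULL) {
		cpumask_copy(spl_taskq_cpumask, cpu_online_mask);
		return (0);
	}

	return (cpulist_parse(spl_taskq_cpulist, spl_taskq_cpumask));
}
```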
Has there been any progress on fixing this defect, please?
The CPU hotplugging code changes the relevant code:
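For context, hotplug support of this kind registers dynamic callbacks so per-CPU taskq threads can follow CPUs as they come and go. A rough sketch of that registration pattern, with illustrative names rather than the actual spl-taskq.c symbols:

```c
#include <linux/cpuhotplug.h>

/* Illustrative callbacks: spawn/park a per-CPU taskq thread. */
static int
example_taskq_cpu_online(unsigned int cpu, struct hlist_node *node)
{
	/* Create or rebind a thread now that 'cpu' is online. */
	return (0);
}

static int
example_taskq_cpu_prep_down(unsigned int cpu, struct hlist_node *node)
{
	/* Migrate or park the thread before 'cpu' goes offline. */
	return (0);
}

static enum cpuhp_state example_hp_state;

static int
example_hp_register(void)
{
	int ret;

	/* CPUHP_AP_ONLINE_DYN allocates a dynamic state slot. */
	ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
	    "example/taskq:online",
	    example_taskq_cpu_online, example_taskq_cpu_prep_down);
	if (ret < 0)
		return (ret);

	example_hp_state = ret;
	return (0);
}
```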
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
Has this been fixed?
Interesting, I saw your reply via email and tried it myself to confirm. I am on Archlinux here using:
My boot arguments were:
I opened htop on one screen and could already see that only cores 0,1,2 + 12,13,14 were given work by my host. At this point I used pv to read some data and could see the work landing only on those cores. I tried another pv from data in an encrypted dataset and, while the read speed was expectedly slower, it still only executed on the 6 CPU threads which were not isolated. I don't know why your situation is behaving differently.
Hello, thanks for the quick and detailed reply! I forgot to mention that I am running NixOS unstable. I tried to adapt my system as far as possible to your kernel parameters; now I have the following cmdline (hashes and PCI IDs removed for readability): However, I took a look in htop, and ZFS is not the only kernel process using that core, so likely something is entirely wrong on my part. So, this is clearly some sort of user error on my side. If you have any suggestions or ideas, I would of course be very thankful nonetheless.
Okay, so I think I figured it out, although the reasons why it is the way it is are beyond my understanding. To make a long story short: if I leave a CPU core between 8 and 15 for the kernel, it uses that core; otherwise it will just assign a random one at boot time and be stuck with it. Thank you!
I concur with @Jauchi that with isolcpus=0-7, z_trim_int still managed to get scheduled onto cpu 0. I'll dig deeper if I have time.
I just tested (on a fully updated Ubuntu 24.04) with the config shared by @IvanVolosyuk (with the exception of one setting). Had to give the box back, so I ran out of time to test any more; if I can get a box with a similar CPU I'll have another go. My objective was to allow ZFS to use only cores 8-15 and 24-31 (one of those AMD X3D CPUs with the chunky L3 cache on only those cores).
Yeah, I also have a 7950X3D CPU, and I've been using this config for a while without issues. My first CCD is running qemu/kvm with realtime priority (except for CPU/thread 0, which is left alone as Linux still schedules some things on it). I don't use isolcpus, but I do pin vCPU and I/O threads in QEMU, and steer interrupt lines in the Linux kernel away from that CCD. I can observe ZFS only using the second CCD in 'top' when doing heavy zstd compression in ZFS. If you plan on buying that CPU, I would advise against it if you plan to do kvm+vfio: https://www.reddit.com/r/VFIO/comments/194ndu7/anyone_experiencing_host_random_reboots_using/
How do we get this reopened and/or raise a separate bug?
@behlendorf can we reopen this, as the issue still exists? I understand that there are some technical / license issues to make it happen.
System information
Describe the problem you're observing
`module/spl/spl-taskq.c` contains this code:

Thus, kthreads spawn either with the default cpumask or, if `spl_taskq_thread_bind=1` is set on module import, are bound to CPUs without regard for their availability to the scheduler. This can be a substantial source of latency, which is not acceptable on many systems that use the `isolcpus` boot parameter to isolate designated "real-time" cores.

While `spl_taskq_thread_bind=1` prevents latency from thread migration on/off RT CPUs, it can make things substantially worse by locking the threads to arbitrary cores in a way that can't be changed with `taskset`, leaving the RT CPUs permanently saddled with the kthread for its full lifecycle.

Ideally, the modular CPU selection would be replaced with something that uses the kernel's housekeeping API in `include/linux/sched/isolation.h` to get the cpumask of non-isolated CPUs and use `kthread_create_on_cpu` in `spl_kthread_create` and/or `kthread_bind_mask` to schedule and bind threads across non-RT cores only. Note, however, this is an incomplete solution because the kernel's interface to get an `isolcpus` cpumask has changed several times across the versions supported by ZFS.
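To illustrate why that interface churn bites: the `HK_*` constants are enum values, so a portable module can't probe them with the preprocessor directly and would lean on configure-time checks instead. A sketch of the resulting compatibility shim, where the `HAVE_*` macros are assumed to come from such checks (they are not real ZFS symbols), and version boundaries are approximate:

```c
#include <linux/cpumask.h>
#include <linux/sched/isolation.h>

/*
 * Sketch of a compatibility wrapper for "give me the non-isolated
 * CPUs". The HAVE_* macros are hypothetical configure-time results.
 */
static const struct cpumask *
spl_nonisolated_cpumask(void)
{
#if defined(HAVE_HOUSEKEEPING_HK_TYPE)		/* ~5.18 and later */
	return (housekeeping_cpumask(HK_TYPE_DOMAIN));
#elif defined(HAVE_HOUSEKEEPING_HK_FLAG)	/* ~4.17 through 5.17 */
	return (housekeeping_cpumask(HK_FLAG_DOMAIN));
#else						/* no housekeeping API */
	return (cpu_possible_mask);
#endif
}
```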
Various hacks can be done to try to prevent unbound kthreads from using isolated cores, and threads not bound with `spl_taskq_thread_bind` can be moved, but these solutions are iffy and incomplete at best. It would be great if ZFS respected `isolcpus` from the start.

Describe how to reproduce the problem
Boot with `isolcpus`, capture a trace of the RT CPUs with `perf sched record` or other tracing mechanisms, and observe ZFS-spawned kthreads coming on and off isolated cores. This is the primary remaining source of latency on my local system.

Include any warning/errors/backtraces from the system logs