Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lib/mpi: Use "static inline" instead of "inline" #2

Merged
merged 1 commit into from
Apr 7, 2017

Conversation

bthebaudeau
Copy link
Contributor

The commit 2e7dc31 had replaced
"extern inline" with "inline" in order to fix the build with GCC 5,
because of new inline semantics. However, this change breaks the build
with GCC 4.9, causing many errors such as:

lib/mpi/generic_mpih-mul1.o: In function mpihelp_add_1': generic_mpih-mul1.c:(.text+0x0): multiple definition of mpihelp_add_1'
lib/mpi/generic_mpih-lshift.o:generic_mpih-lshift.c:(.text+0x0): first defined here

Fix this like in the mainline commit
9c6bd0c, by using "static inline".

Signed-off-by: Benoît Thébaudeau [email protected]

The commit 2e7dc31 had replaced
"extern inline" with "inline" in order to fix the build with GCC 5,
because of new inline semantics. However, this change breaks the build
with GCC 4.9, causing many errors such as:

  lib/mpi/generic_mpih-mul1.o: In function `mpihelp_add_1':
  generic_mpih-mul1.c:(.text+0x0): multiple definition of `mpihelp_add_1'
  lib/mpi/generic_mpih-lshift.o:generic_mpih-lshift.c:(.text+0x0): first defined here

Fix this like in the mainline commit
9c6bd0c, by using "static inline".

Signed-off-by: Benoît Thébaudeau <[email protected]>
@gouwa gouwa merged commit 2d81a3a into khadas:ubuntu Apr 7, 2017
@bthebaudeau bthebaudeau deleted the patch-1 branch April 7, 2017 11:56
numbqq added a commit that referenced this pull request Jun 1, 2018
watchdog: kvims: update watchdog driver
numbqq pushed a commit that referenced this pull request Oct 10, 2018
This patch fixes the following warning:
[    2.226264] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620
[    2.226341] in_atomic(): 0, irqs_disabled(): 0, pid: 1, name: swapper/0
[    2.226366] 4 locks held by swapper/0/1:
[    2.226385]  #0:  (&dev->mutex){......}, at: [<ffffff800851c664>] __device_attach+0x3c/0x134
[    2.226515]  #1:  (cpu_hotplug.lock){......}, at: [<ffffff800809feb0>] get_online_cpus+0x38/0x9c
[    2.226594]  #2:  (subsys mutex#7){......}, at: [<ffffff800851b0e4>] subsys_interface_register+0x54/0xfc
[    2.226684]  #3:  (rcu_read_lock){......}, at: [<ffffff800845c764>] rockchip_adjust_power_scale+0x88/0x450
[    2.226771] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 4.4.132 #600
[    2.226790] Hardware name: Rockchip PX30 evb ddr3 board (DT)
[    2.226809] Call trace:
[    2.226840] [<ffffff800808a0d4>] dump_backtrace+0x0/0x1ec
[    2.226889] [<ffffff800808a2d4>] show_stack+0x14/0x1c
[    2.226918] [<ffffff80083c41b4>] dump_stack+0x94/0xbc
[    2.226946] [<ffffff80080cb708>] ___might_sleep+0x108/0x118
[    2.226972] [<ffffff80080cb788>] __might_sleep+0x70/0x80
[    2.227001] [<ffffff8008b6d1d0>] mutex_lock_nested+0x54/0x38c
[    2.227031] [<ffffff8008858714>] __of_clk_get_from_provider+0x44/0xec
[    2.227061] [<ffffff8008852814>] __of_clk_get_by_name+0xd4/0x14c
[    2.227112] [<ffffff80088528a4>] of_clk_get_by_name+0x18/0x28
[    2.227140] [<ffffff800845c980>] rockchip_adjust_power_scale+0x2a4/0x450
[    2.227171] [<ffffff800877f630>] cpufreq_init+0x154/0x394
[    2.227199] [<ffffff8008776ac4>] cpufreq_online+0x1b0/0x68c
[    2.227225] [<ffffff8008777040>] cpufreq_add_dev+0x3c/0x94
[    2.227253] [<ffffff800851b168>] subsys_interface_register+0xd8/0xfc
[    2.227281] [<ffffff80087772d8>] cpufreq_register_driver+0x10c/0x1a8
[    2.227332] [<ffffff800877f93c>] dt_cpufreq_probe+0xcc/0xe8
[    2.227360] [<ffffff800851e7b8>] platform_drv_probe+0x54/0xa8
[    2.227385] [<ffffff800851c924>] driver_probe_device+0x194/0x278
[    2.227411] [<ffffff800851cb40>] __device_attach_driver+0x60/0x9c
[    2.227439] [<ffffff800851ae14>] bus_for_each_drv+0x9c/0xbc
[    2.227464] [<ffffff800851c6f0>] __device_attach+0xc8/0x134
[    2.227488] [<ffffff800851ccb8>] device_initial_probe+0x10/0x18
[    2.227514] [<ffffff800851bcec>] bus_probe_device+0x2c/0x90
[    2.227541] [<ffffff8008519e90>] device_add+0x44c/0x510
[    2.227570] [<ffffff800851e4d4>] platform_device_add+0xa0/0x1e4
[    2.227597] [<ffffff800851ef28>] platform_device_register_full+0xa4/0xe4
[    2.227627] [<ffffff80090f8680>] rockchip_cpufreq_driver_init+0xd4/0x31c
[    2.227654] [<ffffff8008083468>] do_one_initcall+0x84/0x1a4
[    2.227682] [<ffffff80090c0e98>] kernel_init_freeable+0x260/0x264
[    2.227708] [<ffffff8008b6ab68>] kernel_init+0x10/0xf8
[    2.227734] [<ffffff80080832a0>] ret_from_fork+0x10/0x30

Fixes: 197a2d3 ("soc: rockchip: opp_select: add missing rcu lock")
Change-Id: Ib52b6f7bc77bb207a1c8c6880f7a5e916fa3d2ee
Signed-off-by: Finley Xiao <[email protected]>
numbqq pushed a commit that referenced this pull request Oct 10, 2018
While testing the deadline scheduler + cgroup setup I hit this
warning.

[  132.612935] ------------[ cut here ]------------
[  132.612951] WARNING: CPU: 5 PID: 0 at kernel/softirq.c:150 __local_bh_enable_ip+0x6b/0x80
[  132.612952] Modules linked in: (a ton of modules...)
[  132.612981] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.7.0-rc2 #2
[  132.612981] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
[  132.612982]  0000000000000086 45c8bb5effdd088b ffff88013fd43da0 ffffffff813d229e
[  132.612984]  0000000000000000 0000000000000000 ffff88013fd43de0 ffffffff810a652b
[  132.612985]  00000096811387b5 0000000000000200 ffff8800bab29d80 ffff880034c54c00
[  132.612986] Call Trace:
[  132.612987]  <IRQ>  [<ffffffff813d229e>] dump_stack+0x63/0x85
[  132.612994]  [<ffffffff810a652b>] __warn+0xcb/0xf0
[  132.612997]  [<ffffffff810e76a0>] ? push_dl_task.part.32+0x170/0x170
[  132.612999]  [<ffffffff810a665d>] warn_slowpath_null+0x1d/0x20
[  132.613000]  [<ffffffff810aba5b>] __local_bh_enable_ip+0x6b/0x80
[  132.613008]  [<ffffffff817d6c8a>] _raw_write_unlock_bh+0x1a/0x20
[  132.613010]  [<ffffffff817d6c9e>] _raw_spin_unlock_bh+0xe/0x10
[  132.613015]  [<ffffffff811388ac>] put_css_set+0x5c/0x60
[  132.613016]  [<ffffffff8113dc7f>] cgroup_free+0x7f/0xa0
[  132.613017]  [<ffffffff810a3912>] __put_task_struct+0x42/0x140
[  132.613018]  [<ffffffff810e776a>] dl_task_timer+0xca/0x250
[  132.613027]  [<ffffffff810e76a0>] ? push_dl_task.part.32+0x170/0x170
[  132.613030]  [<ffffffff8111371e>] __hrtimer_run_queues+0xee/0x270
[  132.613031]  [<ffffffff81113ec8>] hrtimer_interrupt+0xa8/0x190
[  132.613034]  [<ffffffff81051a58>] local_apic_timer_interrupt+0x38/0x60
[  132.613035]  [<ffffffff817d9b0d>] smp_apic_timer_interrupt+0x3d/0x50
[  132.613037]  [<ffffffff817d7c5c>] apic_timer_interrupt+0x8c/0xa0
[  132.613038]  <EOI>  [<ffffffff81063466>] ? native_safe_halt+0x6/0x10
[  132.613043]  [<ffffffff81037a4e>] default_idle+0x1e/0xd0
[  132.613044]  [<ffffffff810381cf>] arch_cpu_idle+0xf/0x20
[  132.613046]  [<ffffffff810e8fda>] default_idle_call+0x2a/0x40
[  132.613047]  [<ffffffff810e92d7>] cpu_startup_entry+0x2e7/0x340
[  132.613048]  [<ffffffff81050235>] start_secondary+0x155/0x190
[  132.613049] ---[ end trace f91934d162ce9977 ]---

The warn is the spin_(lock|unlock)_bh(&css_set_lock) in the interrupt
context. Converting the spin_lock_bh to spin_lock_irq(save) to avoid
this problem - and other problems of sharing a spinlock with an
interrupt.

Change-Id: I5b5d5c79c3f380ac35f58596fc2cebaf6348eb67
Cc: Tejun Heo <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: [email protected]
Cc: [email protected] # 4.5+
Cc: [email protected]
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: "Luis Claudio R. Goncalves" <[email protected]>
Signed-off-by: Daniel Bristot de Oliveira <[email protected]>
Acked-by: Zefan Li <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
numbqq pushed a commit that referenced this pull request Oct 10, 2018
[ Upstream commit fba9eb7 ]

Add a header with macros usable in assembler files to emit alternative
code sequences. It works analog to the alternatives for inline assmeblies
in C files, with the same restrictions and capabilities.
The syntax is

     ALTERNATIVE "<default instructions sequence>", \
		 "<alternative instructions sequence>", \
		 "<features-bit>"
and

     ALTERNATIVE_2 "<default instructions sequence>", \
		   "<alternative instructions sqeuence #1>", \
		   "<feature-bit #1>",
		   "<alternative instructions sqeuence #2>", \
		   "<feature-bit #2>"

Reviewed-by: Vasily Gorbik <[email protected]>
Signed-off-by: Martin Schwidefsky <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
numbqq pushed a commit that referenced this pull request Oct 10, 2018
[ Upstream commit 2c0aa08 ]

Scenario:
1. Port down and do fail over
2. Ap do rds_bind syscall

PID: 47039  TASK: ffff89887e2fe640  CPU: 47  COMMAND: "kworker/u:6"
 #0 [ffff898e35f159f0] machine_kexec at ffffffff8103abf9
 #1 [ffff898e35f15a60] crash_kexec at ffffffff810b96e3
 #2 [ffff898e35f15b30] oops_end at ffffffff8150f518
 #3 [ffff898e35f15b60] no_context at ffffffff8104854c
 #4 [ffff898e35f15ba0] __bad_area_nosemaphore at ffffffff81048675
 #5 [ffff898e35f15bf0] bad_area_nosemaphore at ffffffff810487d3
 #6 [ffff898e35f15c00] do_page_fault at ffffffff815120b8
 #7 [ffff898e35f15d10] page_fault at ffffffff8150ea95
    [exception RIP: unknown or invalid address]
    RIP: 0000000000000000  RSP: ffff898e35f15dc8  RFLAGS: 00010282
    RAX: 00000000fffffffe  RBX: ffff889b77f6fc00  RCX:ffffffff81c99d88
    RDX: 0000000000000000  RSI: ffff896019ee08e8  RDI:ffff889b77f6fc00
    RBP: ffff898e35f15df0   R8: ffff896019ee08c8  R9:0000000000000000
    R10: 0000000000000400  R11: 0000000000000000  R12:ffff896019ee08c0
    R13: ffff889b77f6fe68  R14: ffffffff81c99d80  R15: ffffffffa022a1e0
    ORIG_RAX: ffffffffffffffff  CS: 0010 SS: 0018
 #8 [ffff898e35f15dc8] cma_ndev_work_handler at ffffffffa022a228 [rdma_cm]
 #9 [ffff898e35f15df8] process_one_work at ffffffff8108a7c6
 #10 [ffff898e35f15e58] worker_thread at ffffffff8108bda0
 #11 [ffff898e35f15ee8] kthread at ffffffff81090fe6

PID: 45659  TASK: ffff880d313d2500  CPU: 31  COMMAND: "oracle_45659_ap"
 #0 [ffff881024ccfc98] __schedule at ffffffff8150bac4
 #1 [ffff881024ccfd40] schedule at ffffffff8150c2cf
 #2 [ffff881024ccfd50] __mutex_lock_slowpath at ffffffff8150cee7
 #3 [ffff881024ccfdc0] mutex_lock at ffffffff8150cdeb
 #4 [ffff881024ccfde0] rdma_destroy_id at ffffffffa022a027 [rdma_cm]
 #5 [ffff881024ccfe10] rds_ib_laddr_check at ffffffffa0357857 [rds_rdma]
 #6 [ffff881024ccfe50] rds_trans_get_preferred at ffffffffa0324c2a [rds]
 #7 [ffff881024ccfe80] rds_bind at ffffffffa031d690 [rds]
 #8 [ffff881024ccfeb0] sys_bind at ffffffff8142a670

PID: 45659                          PID: 47039
rds_ib_laddr_check
  /* create id_priv with a null event_handler */
  rdma_create_id
  rdma_bind_addr
    cma_acquire_dev
      /* add id_priv to cma_dev->id_list */
      cma_attach_to_dev
                                    cma_ndev_work_handler
                                      /* event_hanlder is null */
                                      id_priv->id.event_handler

Signed-off-by: Guanglei Li <[email protected]>
Signed-off-by: Honglei Wang <[email protected]>
Reviewed-by: Junxiao Bi <[email protected]>
Reviewed-by: Yanjun Zhu <[email protected]>
Reviewed-by: Leon Romanovsky <[email protected]>
Acked-by: Santosh Shilimkar <[email protected]>
Acked-by: Doug Ledford <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
numbqq pushed a commit that referenced this pull request Oct 10, 2018
[ Upstream commit b6dd4d8 ]

The pr_debug() in gic-v3 gic_send_sgi() can trigger a circular locking
warning:

 GICv3: CPU10: ICC_SGI1R_EL1 5000400
 ======================================================
 WARNING: possible circular locking dependency detected
 4.15.0+ #1 Tainted: G        W
 ------------------------------------------------------
 dynamic_debug01/1873 is trying to acquire lock:
  ((console_sem).lock){-...}, at: [<0000000099c891ec>] down_trylock+0x20/0x4c

 but task is already holding lock:
  (&rq->lock){-.-.}, at: [<00000000842e1587>] __task_rq_lock+0x54/0xdc

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #2 (&rq->lock){-.-.}:
        __lock_acquire+0x3b4/0x6e0
        lock_acquire+0xf4/0x2a8
        _raw_spin_lock+0x4c/0x60
        task_fork_fair+0x3c/0x148
        sched_fork+0x10c/0x214
        copy_process.isra.32.part.33+0x4e8/0x14f0
        _do_fork+0xe8/0x78c
        kernel_thread+0x48/0x54
        rest_init+0x34/0x2a4
        start_kernel+0x45c/0x488

 -> #1 (&p->pi_lock){-.-.}:
        __lock_acquire+0x3b4/0x6e0
        lock_acquire+0xf4/0x2a8
        _raw_spin_lock_irqsave+0x58/0x70
        try_to_wake_up+0x48/0x600
        wake_up_process+0x28/0x34
        __up.isra.0+0x60/0x6c
        up+0x60/0x68
        __up_console_sem+0x4c/0x7c
        console_unlock+0x328/0x634
        vprintk_emit+0x25c/0x390
        dev_vprintk_emit+0xc4/0x1fc
        dev_printk_emit+0x88/0xa8
        __dev_printk+0x58/0x9c
        _dev_info+0x84/0xa8
        usb_new_device+0x100/0x474
        hub_port_connect+0x280/0x92c
        hub_event+0x740/0xa84
        process_one_work+0x240/0x70c
        worker_thread+0x60/0x400
        kthread+0x110/0x13c
        ret_from_fork+0x10/0x18

 -> #0 ((console_sem).lock){-...}:
        validate_chain.isra.34+0x6e4/0xa20
        __lock_acquire+0x3b4/0x6e0
        lock_acquire+0xf4/0x2a8
        _raw_spin_lock_irqsave+0x58/0x70
        down_trylock+0x20/0x4c
        __down_trylock_console_sem+0x3c/0x9c
        console_trylock+0x20/0xb0
        vprintk_emit+0x254/0x390
        vprintk_default+0x58/0x90
        vprintk_func+0xbc/0x164
        printk+0x80/0xa0
        __dynamic_pr_debug+0x84/0xac
        gic_raise_softirq+0x184/0x18c
        smp_cross_call+0xac/0x218
        smp_send_reschedule+0x3c/0x48
        resched_curr+0x60/0x9c
        check_preempt_curr+0x70/0xdc
        wake_up_new_task+0x310/0x470
        _do_fork+0x188/0x78c
        SyS_clone+0x44/0x50
        __sys_trace_return+0x0/0x4

 other info that might help us debug this:

 Chain exists of:
   (console_sem).lock --> &p->pi_lock --> &rq->lock

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&rq->lock);
                                lock(&p->pi_lock);
                                lock(&rq->lock);
   lock((console_sem).lock);

  *** DEADLOCK ***

 2 locks held by dynamic_debug01/1873:
  #0:  (&p->pi_lock){-.-.}, at: [<000000001366df53>] wake_up_new_task+0x40/0x470
  #1:  (&rq->lock){-.-.}, at: [<00000000842e1587>] __task_rq_lock+0x54/0xdc

 stack backtrace:
 CPU: 10 PID: 1873 Comm: dynamic_debug01 Tainted: G        W        4.15.0+ #1
 Hardware name: GIGABYTE R120-T34-00/MT30-GS2-00, BIOS T48 10/02/2017
 Call trace:
  dump_backtrace+0x0/0x188
  show_stack+0x24/0x2c
  dump_stack+0xa4/0xe0
  print_circular_bug.isra.31+0x29c/0x2b8
  check_prev_add.constprop.39+0x6c8/0x6dc
  validate_chain.isra.34+0x6e4/0xa20
  __lock_acquire+0x3b4/0x6e0
  lock_acquire+0xf4/0x2a8
  _raw_spin_lock_irqsave+0x58/0x70
  down_trylock+0x20/0x4c
  __down_trylock_console_sem+0x3c/0x9c
  console_trylock+0x20/0xb0
  vprintk_emit+0x254/0x390
  vprintk_default+0x58/0x90
  vprintk_func+0xbc/0x164
  printk+0x80/0xa0
  __dynamic_pr_debug+0x84/0xac
  gic_raise_softirq+0x184/0x18c
  smp_cross_call+0xac/0x218
  smp_send_reschedule+0x3c/0x48
  resched_curr+0x60/0x9c
  check_preempt_curr+0x70/0xdc
  wake_up_new_task+0x310/0x470
  _do_fork+0x188/0x78c
  SyS_clone+0x44/0x50
  __sys_trace_return+0x0/0x4
 GICv3: CPU0: ICC_SGI1R_EL1 12000

This could be fixed with printk_deferred() but that might lessen its
usefulness for debugging. So change it to pr_devel to keep it out of
production kernels. Developers working on gic-v3 can enable it as
needed in their kernels.

Signed-off-by: Mark Salter <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
numbqq pushed a commit that referenced this pull request Oct 10, 2018
[ Upstream commit 2bbea6e ]

when mounting an ISO filesystem sometimes (very rarely)
the system hangs because of a race condition between two tasks.

PID: 6766   TASK: ffff88007b2a6dd0  CPU: 0   COMMAND: "mount"
 #0 [ffff880078447ae0] __schedule at ffffffff8168d605
 #1 [ffff880078447b48] schedule_preempt_disabled at ffffffff8168ed49
 #2 [ffff880078447b58] __mutex_lock_slowpath at ffffffff8168c995
 #3 [ffff880078447bb8] mutex_lock at ffffffff8168bdef
 #4 [ffff880078447bd0] sr_block_ioctl at ffffffffa00b6818 [sr_mod]
 #5 [ffff880078447c10] blkdev_ioctl at ffffffff812fea50
 #6 [ffff880078447c70] ioctl_by_bdev at ffffffff8123a8b3
 #7 [ffff880078447c90] isofs_fill_super at ffffffffa04fb1e1 [isofs]
 #8 [ffff880078447da8] mount_bdev at ffffffff81202570
 #9 [ffff880078447e18] isofs_mount at ffffffffa04f9828 [isofs]
#10 [ffff880078447e28] mount_fs at ffffffff81202d09
#11 [ffff880078447e70] vfs_kern_mount at ffffffff8121ea8f
#12 [ffff880078447ea8] do_mount at ffffffff81220fee
#13 [ffff880078447f28] sys_mount at ffffffff812218d6
#14 [ffff880078447f80] system_call_fastpath at ffffffff81698c49
    RIP: 00007fd9ea914e9a  RSP: 00007ffd5d9bf648  RFLAGS: 00010246
    RAX: 00000000000000a5  RBX: ffffffff81698c49  RCX: 0000000000000010
    RDX: 00007fd9ec2bc210  RSI: 00007fd9ec2bc290  RDI: 00007fd9ec2bcf30
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000010
    R10: 00000000c0ed0001  R11: 0000000000000206  R12: 00007fd9ec2bc040
    R13: 00007fd9eb6b2380  R14: 00007fd9ec2bc210  R15: 00007fd9ec2bcf30
    ORIG_RAX: 00000000000000a5  CS: 0033  SS: 002b

This task was trying to mount the cdrom.  It allocated and configured a
super_block struct and owned the write-lock for the super_block->s_umount
rwsem. While exclusively owning the s_umount lock, it called
sr_block_ioctl and waited to acquire the global sr_mutex lock.

PID: 6785   TASK: ffff880078720fb0  CPU: 0   COMMAND: "systemd-udevd"
 #0 [ffff880078417898] __schedule at ffffffff8168d605
 #1 [ffff880078417900] schedule at ffffffff8168dc59
 #2 [ffff880078417910] rwsem_down_read_failed at ffffffff8168f605
 #3 [ffff880078417980] call_rwsem_down_read_failed at ffffffff81328838
 #4 [ffff8800784179d0] down_read at ffffffff8168cde0
 #5 [ffff8800784179e8] get_super at ffffffff81201cc7
 #6 [ffff880078417a10] __invalidate_device at ffffffff8123a8de
 #7 [ffff880078417a40] flush_disk at ffffffff8123a94b
 #8 [ffff880078417a88] check_disk_change at ffffffff8123ab50
 #9 [ffff880078417ab0] cdrom_open at ffffffffa00a29e1 [cdrom]
#10 [ffff880078417b68] sr_block_open at ffffffffa00b6f9b [sr_mod]
#11 [ffff880078417b98] __blkdev_get at ffffffff8123ba86
#12 [ffff880078417bf0] blkdev_get at ffffffff8123bd65
#13 [ffff880078417c78] blkdev_open at ffffffff8123bf9b
#14 [ffff880078417c90] do_dentry_open at ffffffff811fc7f7
#15 [ffff880078417cd8] vfs_open at ffffffff811fc9cf
#16 [ffff880078417d00] do_last at ffffffff8120d53d
#17 [ffff880078417db0] path_openat at ffffffff8120e6b2
#18 [ffff880078417e48] do_filp_open at ffffffff8121082b
#19 [ffff880078417f18] do_sys_open at ffffffff811fdd33
#20 [ffff880078417f70] sys_open at ffffffff811fde4e
#21 [ffff880078417f80] system_call_fastpath at ffffffff81698c49
    RIP: 00007f29438b0c20  RSP: 00007ffc76624b78  RFLAGS: 00010246
    RAX: 0000000000000002  RBX: ffffffff81698c49  RCX: 0000000000000000
    RDX: 00007f2944a5fa70  RSI: 00000000000a0800  RDI: 00007f2944a5fa70
    RBP: 00007f2944a5f540   R8: 0000000000000000   R9: 0000000000000020
    R10: 00007f2943614c40  R11: 0000000000000246  R12: ffffffff811fde4e
    R13: ffff880078417f78  R14: 000000000000000c  R15: 00007f2944a4b010
    ORIG_RAX: 0000000000000002  CS: 0033  SS: 002b

This task tried to open the cdrom device, the sr_block_open function
acquired the global sr_mutex lock. The call to check_disk_change()
then saw an event flag indicating a possible media change and tried
to flush any cached data for the device.
As part of the flush, it tried to acquire the super_block->s_umount
lock associated with the cdrom device.
This was the same super_block as created and locked by the previous task.

The first task acquires the s_umount lock and then the sr_mutex_lock;
the second task acquires the sr_mutex_lock and then the s_umount lock.

This patch fixes the issue by moving check_disk_change() out of
cdrom_open() and let the caller take care of it.

Signed-off-by: Maurizio Lombardi <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
numbqq pushed a commit that referenced this pull request Oct 10, 2018
[ Upstream commit a3ca831 ]

When booting up with "threadirqs" in command line, all irq handlers of the DMA
controller pl330 will be threaded forcedly. These threads will race for the same
list, pl330->req_done.

Before the callback, the spinlock was released. And after it, the spinlock was
taken. This opened an race window where another threaded irq handler could steal
the spinlock and be permitted to delete entries of the list, pl330->req_done.

If the later deleted an entry that was still referred to by the former, there would
be a kernel panic when the former was scheduled and tried to get the next sibling
of the deleted entry.

The scenario could be depicted as below:

  Thread: T1  pl330->req_done  Thread: T2
      |             |              |
      |          -A-B-C-D-         |
    Locked          |              |
      |             |           Waiting
    Del A           |              |
      |          -B-C-D-           |
    Unlocked        |              |
      |             |           Locked
    Waiting         |              |
      |             |            Del B
      |             |              |
      |           -C-D-         Unlocked
    Waiting         |              |
      |
    Locked
      |
   get C via B
      \
       - Kernel panic

The kernel panic looked like as below:

Unable to handle kernel paging request at virtual address dead000000000108
pgd = ffffff8008c9e000
[dead000000000108] *pgd=000000027fffe003, *pud=000000027fffe003, *pmd=0000000000000000
Internal error: Oops: 96000044 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 85 Comm: irq/59-66330000 Not tainted 4.8.24-WR9.0.0.12_standard #2
Hardware name: Broadcom NS2 SVK (DT)
task: ffffffc1f5cc3c00 task.stack: ffffffc1f5ce0000
PC is at pl330_irq_handler+0x27c/0x390
LR is at pl330_irq_handler+0x2a8/0x390
pc : [<ffffff80084cb694>] lr : [<ffffff80084cb6c0>] pstate: 800001c5
sp : ffffffc1f5ce3d00
x29: ffffffc1f5ce3d00 x28: 0000000000000140
x27: ffffffc1f5c530b0 x26: dead000000000100
x25: dead000000000200 x24: 0000000000418958
x23: 0000000000000001 x22: ffffffc1f5ccd668
x21: ffffffc1f5ccd590 x20: ffffffc1f5ccd418
x19: dead000000000060 x18: 0000000000000001
x17: 0000000000000007 x16: 0000000000000001
x15: ffffffffffffffff x14: ffffffffffffffff
x13: ffffffffffffffff x12: 0000000000000000
x11: 0000000000000001 x10: 0000000000000840
x9 : ffffffc1f5ce0000 x8 : ffffffc1f5cc3338
x7 : ffffff8008ce2020 x6 : 0000000000000000
x5 : 0000000000000000 x4 : 0000000000000001
x3 : dead000000000200 x2 : dead000000000100
x1 : 0000000000000140 x0 : ffffffc1f5ccd590

Process irq/59-66330000 (pid: 85, stack limit = 0xffffffc1f5ce0020)
Stack: (0xffffffc1f5ce3d00 to 0xffffffc1f5ce4000)
3d00: ffffffc1f5ce3d80 ffffff80080f09d0 ffffffc1f5ca0c00 ffffffc1f6f7c600
3d20: ffffffc1f5ce0000 ffffffc1f6f7c600 ffffffc1f5ca0c00 ffffff80080f0998
3d40: ffffffc1f5ce0000 ffffff80080f0000 0000000000000000 0000000000000000
3d60: ffffff8008ce202c ffffff8008ce2020 ffffffc1f5ccd668 ffffffc1f5c530b0
3d80: ffffffc1f5ce3db0 ffffff80080f0d70 ffffffc1f5ca0c40 0000000000000001
3da0: ffffffc1f5ce0000 ffffff80080f0cfc ffffffc1f5ce3e20 ffffff80080bf4f8
3dc0: ffffffc1f5ca0c80 ffffff8008bf3798 ffffff8008955528 ffffffc1f5ca0c00
3de0: ffffff80080f0c30 0000000000000000 0000000000000000 0000000000000000
3e00: 0000000000000000 0000000000000000 0000000000000000 ffffff80080f0b68
3e20: 0000000000000000 ffffff8008083690 ffffff80080bf420 ffffffc1f5ca0c80
3e40: 0000000000000000 0000000000000000 0000000000000000 ffffff80080cb648
3e60: ffffff8008b1c780 0000000000000000 0000000000000000 ffffffc1f5ca0c00
3e80: ffffffc100000000 ffffff8000000000 ffffffc1f5ce3e90 ffffffc1f5ce3e90
3ea0: 0000000000000000 ffffff8000000000 ffffffc1f5ce3eb0 ffffffc1f5ce3eb0
3ec0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
3ee0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
3f00: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
3f20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
3f40: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
3f60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
3f80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
3fa0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
3fc0: 0000000000000000 0000000000000005 0000000000000000 0000000000000000
3fe0: 0000000000000000 0000000000000000 0000000275ce3ff0 0000000275ce3ff8
Call trace:
Exception stack(0xffffffc1f5ce3b30 to 0xffffffc1f5ce3c60)
3b20:                                   dead000000000060 0000008000000000
3b40: ffffffc1f5ce3d00 ffffff80084cb694 0000000000000008 0000000000000e88
3b60: ffffffc1f5ce3bb0 ffffff80080dac68 ffffffc1f5ce3b90 ffffff8008826fe4
3b80: 00000000000001c0 00000000000001c0 ffffffc1f5ce3bb0 ffffff800848dfcc
3ba0: 0000000000020000 ffffff8008b15ae4 ffffffc1f5ce3c00 ffffff800808f000
3bc0: 0000000000000010 ffffff80088377f0 ffffffc1f5ccd590 0000000000000140
3be0: dead000000000100 dead000000000200 0000000000000001 0000000000000000
3c00: 0000000000000000 ffffff8008ce2020 ffffffc1f5cc3338 ffffffc1f5ce0000
3c20: 0000000000000840 0000000000000001 0000000000000000 ffffffffffffffff
3c40: ffffffffffffffff ffffffffffffffff 0000000000000001 0000000000000007
[<ffffff80084cb694>] pl330_irq_handler+0x27c/0x390
[<ffffff80080f09d0>] irq_forced_thread_fn+0x38/0x88
[<ffffff80080f0d70>] irq_thread+0x140/0x200
[<ffffff80080bf4f8>] kthread+0xd8/0xf0
[<ffffff8008083690>] ret_from_fork+0x10/0x40
Code: f2a00838 f9405763 aa1c03e1 aa1503e0 (f9000443)
---[ end trace f50005726d31199c ]---
Kernel panic - not syncing: Fatal exception in interrupt
SMP: stopping secondary CPUs
SMP: failed to stop secondary CPUs 0-1
Kernel Offset: disabled
Memory Limit: none
---[ end Kernel panic - not syncing: Fatal exception in interrupt

To fix this, re-start with the list-head after dropping the lock then
re-takeing it.

Reviewed-by: Frank Mori Hess <[email protected]>
Tested-by: Frank Mori Hess <[email protected]>
Signed-off-by: Qi Hou <[email protected]>
Signed-off-by: Vinod Koul <[email protected]>

Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
numbqq pushed a commit that referenced this pull request Oct 10, 2018
commit df30781 upstream.

For problem determination we need to see whether and why we were successful
or not. This allows deduction of scsi_eh escalation.

Example trace record formatted with zfcpdbf from s390-tools:

Timestamp      : ...
Area           : SCSI
Subarea        : 00
Level          : 1
Exception      : -
CPU ID         : ..
Caller         : 0x...
Record ID      : 1
Tag            : schrh_r        SCSI host reset handler result
Request ID     : 0x0000000000000000                     none (invalid)
SCSI ID        : 0xffffffff                             none (invalid)
SCSI LUN       : 0xffffffff                             none (invalid)
SCSI LUN high  : 0xffffffff                             none (invalid)
SCSI result    : 0x00002002     field re-used for midlayer value: SUCCESS
                                or in other cases: 0x2009 == FAST_IO_FAIL
SCSI retries   : 0xff                                   none (invalid)
SCSI allowed   : 0xff                                   none (invalid)
SCSI scribble  : 0xffffffffffffffff                     none (invalid)
SCSI opcode    : ffffffff ffffffff ffffffff ffffffff    none (invalid)
FCP rsp inf cod: 0xff                                   none (invalid)
FCP rsp IU     : 00000000 00000000 00000000 00000000    none (invalid)
                 00000000 00000000

v2.6.35 commit a1dbfdd ("[SCSI] zfcp: Pass return code from
fc_block_scsi_eh to scsi eh") introduced the first return with something
other than the previously hardcoded single SUCCESS return path.

Signed-off-by: Steffen Maier <[email protected]>
Fixes: a1dbfdd ("[SCSI] zfcp: Pass return code from fc_block_scsi_eh to scsi eh")
Cc: <[email protected]> #2.6.38+
Reviewed-by: Jens Remus <[email protected]>
Reviewed-by: Benjamin Block <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
numbqq pushed a commit that referenced this pull request Oct 10, 2018
commit 81979ae upstream.

We already have a SCSI trace for the end of abort and scsi_eh TMF. Due to
zfcp_erp_wait() and fc_block_scsi_eh() time can pass between the start of
our eh callback and an actual send/recv of an abort / TMF request.  In order
to see the temporal sequence including any abort / TMF send retries, add a
trace before the above two blocking functions.  This supports problem
determination with scsi_eh and parallel zfcp ERP.

No need to explicitly trace the beginning of our eh callback, since we
typically can send an abort / TMF and see its HBA response (in the worst
case, it's a pseudo response on dismiss all of adapter recovery, e.g. due to
an FSF request timeout [fsrth_1] of the abort / TMF). If we cannot send, we
now get a trace record for the first "abrt_wt" or "[lt]r_wait" which denotes
almost the beginning of the callback.

No need to explicitly trace the wakeup after the above two blocking
functions because the next retry loop causes another trace in any case and
that is sufficient.

Example trace records formatted with zfcpdbf from s390-tools:

Timestamp      : ...
Area           : SCSI
Subarea        : 00
Level          : 1
Exception      : -
CPU ID         : ..
Caller         : 0x...
Record ID      : 1
Tag            : abrt_wt        abort, before zfcp_erp_wait()
Request ID     : 0x0000000000000000                     none (invalid)
SCSI ID        : 0x<scsi_id>
SCSI LUN       : 0x<scsi_lun>
SCSI LUN high  : 0x<scsi_lun_high>
SCSI result    : 0x<scsi_result_of_cmd_to_be_aborted>
SCSI retries   : 0x<retries_of_cmd_to_be_aborted>
SCSI allowed   : 0x<allowed_retries_of_cmd_to_be_aborted>
SCSI scribble  : 0x<req_id_of_cmd_to_be_aborted>
SCSI opcode    : <CDB_of_cmd_to_be_aborted>
FCP rsp inf cod: 0x..                                   none (invalid)
FCP rsp IU     : ...                                    none (invalid)

Timestamp      : ...
Area           : SCSI
Subarea        : 00
Level          : 1
Exception      : -
CPU ID         : ..
Caller         : 0x...
Record ID      : 1
Tag            : lr_wait        LUN reset, before zfcp_erp_wait()
Request ID     : 0x0000000000000000                     none (invalid)
SCSI ID        : 0x<scsi_id>
SCSI LUN       : 0x<scsi_lun>
SCSI LUN high  : 0x<scsi_lun_high>
SCSI result    : 0x...                                  unrelated
SCSI retries   : 0x..                                   unrelated
SCSI allowed   : 0x..                                   unrelated
SCSI scribble  : 0x...                                  unrelated
SCSI opcode    : ...                                    unrelated
FCP rsp inf cod: 0x..                                   none (invalid)
FCP rsp IU     : ...                                    none (invalid)

Signed-off-by: Steffen Maier <[email protected]>
Fixes: 63caf36 ("[SCSI] zfcp: Improve reliability of SCSI eh handlers in zfcp")
Fixes: af4de36 ("[SCSI] zfcp: Block scsi_eh thread for rport state BLOCKED")
Cc: <[email protected]> #2.6.38+
Reviewed-by: Benjamin Block <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
numbqq pushed a commit that referenced this pull request Oct 10, 2018
…ailed

commit 512857a upstream.

If a SCSI device is deleted during scsi_eh host reset, we cannot get a
reference to the SCSI device anymore since scsi_device_get returns !=0 by
design. Assuming the recovery of adapter and port(s) was successful,
zfcp_erp_strategy_followup_success() attempts to trigger a LUN reset for the
half-gone SCSI device. Unfortunately, it causes the following confusing
trace record which states that zfcp will do a LUN recovery as "ERP need" is
ZFCP_ERP_ACTION_REOPEN_LUN == 1 and equals "ERP want".

Old example trace record formatted with zfcpdbf from s390-tools:

Tag:           : ersfs_3 ERP, trigger, unit reopen, port reopen succeeded
LUN            : 0x<FCP_LUN>
WWPN           : 0x<WWPN>
D_ID           : 0x<N_Port-ID>
Adapter status : 0x5400050b
Port status    : 0x54000001
LUN status     : 0x40000000     ZFCP_STATUS_COMMON_RUNNING
                                but not ZFCP_STATUS_COMMON_UNBLOCKED as it
                                was closed on close part of adapter reopen
ERP want       : 0x01
ERP need       : 0x01           misleading

However, zfcp_erp_setup_act() returns NULL as it cannot get the reference.
Hence, zfcp_erp_action_enqueue() takes an early goto out and _NO_ recovery
actually happens.

We always do want the recovery trigger trace record even if no erp_action
could be enqueued as in this case. For other cases where we did not enqueue
an erp_action, 'need' has always been zero to indicate this. In order to
indicate above goto out, introduce an eyecatcher "flag" to mark the "ERP
need" as 'not needed' but still keep the information which erp_action type,
that zfcp_erp_required_act() had decided upon, is needed.  0xc_ is chosen to
be visibly different from 0x0_ in "ERP want".

New example trace record formatted with zfcpdbf from s390-tools:

Tag:           : ersfs_3 ERP, trigger, unit reopen, port reopen succeeded
LUN            : 0x<FCP_LUN>
WWPN           : 0x<WWPN>
D_ID           : 0x<N_Port-ID>
Adapter status : 0x5400050b
Port status    : 0x54000001
LUN status     : 0x40000000
ERP want       : 0x01
ERP need       : 0xc1           would need LUN ERP, but no action set up
                   ^

Before v2.6.38 commit ae0904f ("[SCSI] zfcp: Redesign of the debug
tracing for recovery actions.") we could detect this case because the
"erp_action" field in the trace was NULL. The rework removed erp_action as
argument and field from the trace.

This patch here is for tracing. A fix to allow LUN recovery in the case at
hand is a topic for a separate patch.

See also commit fdbd1c5 ("[SCSI] zfcp: Allow running unit/LUN shutdown
without acquiring reference") for a similar case and background info.

Signed-off-by: Steffen Maier <[email protected]>
Fixes: ae0904f ("[SCSI] zfcp: Redesign of the debug tracing for recovery actions.")
Cc: <[email protected]> #2.6.38+
Reviewed-by: Benjamin Block <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
numbqq pushed a commit that referenced this pull request Oct 10, 2018
… return

commit 96d9270 upstream.

get_device() and its internally used kobject_get() only return NULL if they
get passed NULL as argument. zfcp_get_port_by_wwpn() loops over
adapter->port_list so the iteration variable port is always non-NULL.
Struct device is embedded in struct zfcp_port so &port->dev is always
non-NULL. This is the argument to get_device().  However, if we get an
fc_rport in terminate_rport_io() for which we cannot find a match within
zfcp_get_port_by_wwpn(), the latter can return NULL.  v2.6.30 commit
7093293 ("[SCSI] zfcp: Fix oops when port disappears") introduced an
early return without adding a trace record for this case.  Even if we don't
need recovery in this case, for debugging we should still see that our
callback was invoked originally by scsi_transport_fc.

Example trace record formatted with zfcpdbf from s390-tools:

Timestamp      : ...
Area           : REC
Subarea        : 00
Level          : 1
Exception      : -
CPU ID         : ..
Caller         : 0x...
Record ID      : 1
Tag            : sctrpin        SCSI terminate rport I/O, no zfcp port
LUN            : 0xffffffffffffffff                     none (invalid)
WWPN           : 0x<wwpn>               WWPN
D_ID           : 0x<n_port_id>          N_Port-ID
Adapter status : 0x...
Port status    : 0xffffffff             unknown (-1)
LUN status     : 0x00000000                             none (invalid)
Ready count    : 0x...
Running count  : 0x...
ERP want       : 0x03                   ZFCP_ERP_ACTION_REOPEN_PORT_FORCED
ERP need       : 0xc0                   ZFCP_ERP_ACTION_NONE

Signed-off-by: Steffen Maier <[email protected]>
Fixes: 7093293 ("[SCSI] zfcp: Fix oops when port disappears")
Cc: <[email protected]> #2.6.38+
Reviewed-by: Benjamin Block <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
numbqq pushed a commit that referenced this pull request Oct 10, 2018
…RP_FAILED

commit d70aab5 upstream.

For problem determination we always want to see when we were invoked on the
terminate_rport_io callback whether we perform something or not.

Temporal event sequence of interest with a long fast_io_fail_tmo of 27 sec:

loose remote port

t   workqueue
[s] zfcp_q_<dev>       IRQ                 zfcperp<dev>

=== ================== =================== ============================

  0                    recv RSCN
                       q p.test_link_work
    block rport
     start fast_io_fail_tmo
    send ADISC ELS
  4                    recv ADISC fail
                       block zfcp_port
                                           port forced reopen
                                           send open port
 12                    recv open port fail
                                           q p.gid_pn_work
                                           zfcp_erp_wakeup
                                           (zfcp_erp_wait would return)
    GID_PN fail

Before this point, we got a SCSI trace with tag "sctrpi1" on fast_io_fail,
e.g. with the typical 5 sec setting.

    port.status |= ERP_FAILED

If fast_io_fail_tmo triggers after this point, we missed a SCSI trace.

    workqueue
    fc_dl_<host>
    ==================
 27 fc_timeout_fail_rport_io
    fc_terminate_rport_io
    zfcp_scsi_terminate_rport_io
    zfcp_erp_port_forced_reopen
    _zfcp_erp_port_forced_reopen
     if (port.status & ERP_FAILED)
      return;

Therefore, write a trace before above early return.

Example trace record formatted with zfcpdbf from s390-tools:

Timestamp      : ...
Area           : REC
Subarea        : 00
Level          : 1
Exception      : -
CPU ID         : ..
Caller         : 0x...
Record ID      : 1                      ZFCP_DBF_REC_TRIG
Tag            : sctrpi1                SCSI terminate rport I/O
LUN            : 0xffffffffffffffff                     none (invalid)
WWPN           : 0x<wwpn>
D_ID           : 0x<n_port_id>
Adapter status : 0x...
Port status    : 0x...
LUN status     : 0x00000000                             none (invalid)
Ready count    : 0x...
Running count  : 0x...
ERP want       : 0x03                   ZFCP_ERP_ACTION_REOPEN_PORT_FORCED
ERP need       : 0xe0                   ZFCP_ERP_ACTION_FAILED

Signed-off-by: Steffen Maier <[email protected]>
Cc: <[email protected]> #2.6.38+
Reviewed-by: Benjamin Block <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
numbqq pushed a commit that referenced this pull request Oct 10, 2018
commit 8c3d20a upstream.

That other commit introduced an inconsistency because it would trace on
ERP_FAILED for all callers of port forced reopen triggers (not just
terminate_rport_io), but it would not trace on ERP_FAILED for all callers of
other ERP triggers such as adapter, port regular, LUN.

Therefore, generalize that other commit. zfcp_erp_action_enqueue() already
had two early outs which re-used the one zfcp_dbf_rec_trig() call.  All ERP
trigger functions finally run through zfcp_erp_action_enqueue().  So move
the special handling for ZFCP_STATUS_COMMON_ERP_FAILED into
zfcp_erp_action_enqueue() and add another early out with new trace marker
for pseudo ERP need in this case. This removes all early returns from all
ERP trigger functions so we always end up at zfcp_dbf_rec_trig().

Example trace record formatted with zfcpdbf from s390-tools:

Timestamp      : ...
Area           : REC
Subarea        : 00
Level          : 1
Exception      : -
CPU ID         : ..
Caller         : 0x...
Record ID      : 1                      ZFCP_DBF_REC_TRIG
Tag            : .......
LUN            : 0x...
WWPN           : 0x...
D_ID           : 0x...
Adapter status : 0x...
Port status    : 0x...
LUN status     : 0x...
Ready count    : 0x...
Running count  : 0x...
ERP want       : 0x0.                   ZFCP_ERP_ACTION_REOPEN_...
ERP need       : 0xe0                   ZFCP_ERP_ACTION_FAILED

Signed-off-by: Steffen Maier <[email protected]>
Cc: <[email protected]> #2.6.38+
Reviewed-by: Benjamin Block <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
numbqq pushed a commit that referenced this pull request Oct 10, 2018
commit 6a76550 upstream.

Example trace record formatted with zfcpdbf from s390-tools:

Timestamp      : ...
Area           : REC
Subarea        : 00
Level          : 1
Exception      : -
CPU ID         : ..
Caller         : 0x...
Record ID      : 1                      ZFCP_DBF_REC_TRIG
Tag            : .......
LUN            : 0x...
WWPN           : 0x...
D_ID           : 0x...
Adapter status : 0x...
Port status    : 0x...
LUN status     : 0x...
Ready count    : 0x...
Running count  : 0x...
ERP want       : 0x0.                   ZFCP_ERP_ACTION_REOPEN_...
ERP need       : 0xc0                   ZFCP_ERP_ACTION_NONE

Signed-off-by: Steffen Maier <[email protected]>
Cc: <[email protected]> #2.6.38+
Reviewed-by: Benjamin Block <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
numbqq pushed a commit that referenced this pull request Oct 10, 2018
commit b63e132 upstream.

The current MIPS implementation of arch_trigger_cpumask_backtrace() is
broken because it attempts to use synchronous IPIs despite the fact that
it may be run with interrupts disabled.

This means that when arch_trigger_cpumask_backtrace() is invoked, for
example by the RCU CPU stall watchdog, we may:

  - Deadlock due to use of synchronous IPIs with interrupts disabled,
    causing the CPU that's attempting to generate the backtrace output
    to hang itself.

  - Not succeed in generating the desired output from remote CPUs.

  - Produce warnings about this from smp_call_function_many(), for
    example:

    [42760.526910] INFO: rcu_sched detected stalls on CPUs/tasks:
    [42760.535755]  0-...!: (1 GPs behind) idle=ade/140000000000000/0 softirq=526944/526945 fqs=0
    [42760.547874]  1-...!: (0 ticks this GP) idle=e4a/140000000000000/0 softirq=547885/547885 fqs=0
    [42760.559869]  (detected by 2, t=2162 jiffies, g=266689, c=266688, q=33)
    [42760.568927] ------------[ cut here ]------------
    [42760.576146] WARNING: CPU: 2 PID: 1216 at kernel/smp.c:416 smp_call_function_many+0x88/0x20c
    [42760.587839] Modules linked in:
    [42760.593152] CPU: 2 PID: 1216 Comm: sh Not tainted 4.15.4-00373-gee058bb4d0c2 #2
    [42760.603767] Stack : 8e09bd20 8e09bd20 8e09bd20 fffffff0 00000007 00000006 00000000 8e09bca8
    [42760.616937]         95b2b379 95b2b379 807a0080 00000007 81944518 0000018a 00000032 00000000
    [42760.630095]         00000000 00000030 80000000 00000000 806eca74 00000009 8017e2b8 000001a0
    [42760.643169]         00000000 00000002 00000000 8e09baa4 00000008 808b8008 86d69080 8e09bca0
    [42760.656282]         8e09ad50 805e20aa 00000000 00000000 00000000 8017e2b8 00000009 801070ca
    [42760.669424]         ...
    [42760.673919] Call Trace:
    [42760.678672] [<27fde568>] show_stack+0x70/0xf0
    [42760.685417] [<84751641>] dump_stack+0xaa/0xd0
    [42760.692188] [<699d671c>] __warn+0x80/0x92
    [42760.698549] [<68915d41>] warn_slowpath_null+0x28/0x36
    [42760.705912] [<f7c76c1c>] smp_call_function_many+0x88/0x20c
    [42760.713696] [<6bbdfc2a>] arch_trigger_cpumask_backtrace+0x30/0x4a
    [42760.722216] [<f845bd33>] rcu_dump_cpu_stacks+0x6a/0x98
    [42760.729580] [<796e7629>] rcu_check_callbacks+0x672/0x6ac
    [42760.737476] [<059b3b43>] update_process_times+0x18/0x34
    [42760.744981] [<6eb94941>] tick_sched_handle.isra.5+0x26/0x38
    [42760.752793] [<478d3d70>] tick_sched_timer+0x1c/0x50
    [42760.759882] [<e56ea39f>] __hrtimer_run_queues+0xc6/0x226
    [42760.767418] [<e88bbcae>] hrtimer_interrupt+0x88/0x19a
    [42760.775031] [<6765a19e>] gic_compare_interrupt+0x2e/0x3a
    [42760.782761] [<0558bf5f>] handle_percpu_devid_irq+0x78/0x168
    [42760.790795] [<90c11ba2>] generic_handle_irq+0x1e/0x2c
    [42760.798117] [<1b6d462c>] gic_handle_local_int+0x38/0x86
    [42760.805545] [<b2ada1c7>] gic_irq_dispatch+0xa/0x14
    [42760.812534] [<90c11ba2>] generic_handle_irq+0x1e/0x2c
    [42760.820086] [<c7521934>] do_IRQ+0x16/0x20
    [42760.826274] [<9aef3ce6>] plat_irq_dispatch+0x62/0x94
    [42760.833458] [<6a94b53c>] except_vec_vi_end+0x70/0x78
    [42760.840655] [<22284043>] smp_call_function_many+0x1ba/0x20c
    [42760.848501] [<54022b58>] smp_call_function+0x1e/0x2c
    [42760.855693] [<ab9fc705>] flush_tlb_mm+0x2a/0x98
    [42760.862730] [<0844cdd0>] tlb_flush_mmu+0x1c/0x44
    [42760.869628] [<cb259b74>] arch_tlb_finish_mmu+0x26/0x3e
    [42760.877021] [<1aeaaf74>] tlb_finish_mmu+0x18/0x66
    [42760.883907] [<b3fce717>] exit_mmap+0x76/0xea
    [42760.890428] [<c4c8a2f6>] mmput+0x80/0x11a
    [42760.896632] [<a41a08f4>] do_exit+0x1f4/0x80c
    [42760.903158] [<ee01cef6>] do_group_exit+0x20/0x7e
    [42760.909990] [<13fa8d54>] __wake_up_parent+0x0/0x1e
    [42760.917045] [<46cf89d0>] smp_call_function_many+0x1a2/0x20c
    [42760.924893] [<8c21a93b>] syscall_common+0x14/0x1c
    [42760.931765] ---[ end trace 02aa09da9dc52a60 ]---
    [42760.938342] ------------[ cut here ]------------
    [42760.945311] WARNING: CPU: 2 PID: 1216 at kernel/smp.c:291 smp_call_function_single+0xee/0xf8
    ...

This patch switches MIPS' arch_trigger_cpumask_backtrace() to use async
IPIs & smp_call_function_single_async() in order to resolve this
problem. We ensure use of the pre-allocated call_single_data_t
structures is serialized by maintaining a cpumask indicating that
they're busy, and refusing to attempt to send an IPI when a CPU's bit is
set in this mask. This should only happen if a CPU hasn't responded to a
previous backtrace IPI - ie. if it's hung - and we print a warning to
the console in this case.

I've marked this for stable branches as far back as v4.9, to which it
applies cleanly. Strictly speaking the faulty MIPS implementation can be
traced further back to commit 856839b ("MIPS: Add
arch_trigger_all_cpu_backtrace() function") in v3.19, but kernel
versions v3.19 through v4.8 will require further work to backport due to
the rework performed in commit 9a01c3e ("nmi_backtrace: add more
trigger_*_cpu_backtrace() methods").

Signed-off-by: Paul Burton <[email protected]>
Patchwork: https://patchwork.linux-mips.org/patch/19597/
Cc: James Hogan <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: [email protected]
Cc: [email protected] # v4.9+
Fixes: 856839b ("MIPS: Add arch_trigger_all_cpu_backtrace() function")
Fixes: 9a01c3e ("nmi_backtrace: add more trigger_*_cpu_backtrace() methods")
[ Huacai: backported to 4.4: Restruction since generic NMI solution is unavailable ]
Signed-off-by: Huacai Chen <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
numbqq pushed a commit that referenced this pull request Oct 10, 2018
commit c604cb7 upstream.

My recent fix for dns_resolver_preparse() printing very long strings was
incomplete, as shown by syzbot which still managed to hit the
WARN_ONCE() in set_precision() by adding a crafted "dns_resolver" key:

    precision 50001 too large
    WARNING: CPU: 7 PID: 864 at lib/vsprintf.c:2164 vsnprintf+0x48a/0x5a0

The bug this time isn't just a printing bug, but also a logical error
when multiple options ("#"-separated strings) are given in the key
payload.  Specifically, when separating an option string into name and
value, if there is no value then the name is incorrectly considered to
end at the end of the key payload, rather than the end of the current
option.  This bypasses validation of the option length, and also means
that specifying multiple options is broken -- which presumably has gone
unnoticed as there is currently only one valid option anyway.

A similar problem also applied to option values, as the kstrtoul() when
parsing the "dnserror" option will read past the end of the current
option and into the next option.

Fix these bugs by correctly computing the length of the option name and
by copying the option value, null-terminated, into a temporary buffer.

Reproducer for the WARN_ONCE() that syzbot hit:

    perl -e 'print "#A#", "\0" x 50000' | keyctl padd dns_resolver desc @s

Reproducer for "dnserror" option being parsed incorrectly (expected
behavior is to fail when seeing the unknown option "foo", actual
behavior was to read the dnserror value as "1#foo" and fail there):

    perl -e 'print "#dnserror=1#foo\0"' | keyctl padd dns_resolver desc @s

Reported-by: syzbot <[email protected]>
Fixes: 4a2d789 ("DNS: If the DNS server returns an error, allow that to be cached [ver #2]")
Signed-off-by: Eric Biggers <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
numbqq pushed a commit that referenced this pull request Oct 10, 2018
commit 89da619 upstream.

Kernel panic when with high memory pressure, calltrace looks like,

PID: 21439 TASK: ffff881be3afedd0 CPU: 16 COMMAND: "java"
 #0 [ffff881ec7ed7630] machine_kexec at ffffffff81059beb
 #1 [ffff881ec7ed7690] __crash_kexec at ffffffff81105942
 #2 [ffff881ec7ed7760] crash_kexec at ffffffff81105a30
 #3 [ffff881ec7ed7778] oops_end at ffffffff816902c8
 #4 [ffff881ec7ed77a0] no_context at ffffffff8167ff46
 #5 [ffff881ec7ed77f0] __bad_area_nosemaphore at ffffffff8167ffdc
 #6 [ffff881ec7ed7838] __node_set at ffffffff81680300
 #7 [ffff881ec7ed7860] __do_page_fault at ffffffff8169320f
 #8 [ffff881ec7ed78c0] do_page_fault at ffffffff816932b5
 #9 [ffff881ec7ed78f0] page_fault at ffffffff8168f4c8
    [exception RIP: _raw_spin_lock_irqsave+47]
    RIP: ffffffff8168edef RSP: ffff881ec7ed79a8 RFLAGS: 00010046
    RAX: 0000000000000246 RBX: ffffea0019740d00 RCX: ffff881ec7ed7fd8
    RDX: 0000000000020000 RSI: 0000000000000016 RDI: 0000000000000008
    RBP: ffff881ec7ed79a8 R8: 0000000000000246 R9: 000000000001a098
    R10: ffff88107ffda000 R11: 0000000000000000 R12: 0000000000000000
    R13: 0000000000000008 R14: ffff881ec7ed7a80 R15: ffff881be3afedd0
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018

It happens in the pagefault and results in double pagefault
during compacting pages when memory allocation fails.

Analysed the vmcore, the page leads to second pagefault is corrupted
with _mapcount=-256, but private=0.

It's caused by the race between migration and ballooning, and lock
missing in virtballoon_migratepage() of virtio_balloon driver.
This patch fix the bug.

Fixes: e225042 ("virtio_balloon: introduce migration primitives to balloon pages")
Cc: [email protected]
Signed-off-by: Jiang Biao <[email protected]>
Signed-off-by: Huang Chong <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
numbqq pushed a commit that referenced this pull request Oct 10, 2018
commit 4e1a720 upstream.

slub debug reported:

[  440.648642] =============================================================================
[  440.648649] BUG kmalloc-1024 (Tainted: G    BU     O   ): Poison overwritten
[  440.648651] -----------------------------------------------------------------------------

[  440.648655] INFO: 0xe70f4bec-0xe70f4bec. First byte 0x6a instead of 0x6b
[  440.648665] INFO: Allocated in sk_prot_alloc+0x6b/0xc6 age=33155 cpu=1 pid=1047
[  440.648671] 	___slab_alloc.constprop.24+0x1fc/0x292
[  440.648675] 	__slab_alloc.isra.18.constprop.23+0x1c/0x25
[  440.648677] 	__kmalloc+0xb6/0x17f
[  440.648680] 	sk_prot_alloc+0x6b/0xc6
[  440.648683] 	sk_alloc+0x1e/0xa1
[  440.648700] 	sco_sock_alloc.constprop.6+0x26/0xaf [bluetooth]
[  440.648716] 	sco_connect_cfm+0x166/0x281 [bluetooth]
[  440.648731] 	hci_conn_request_evt.isra.53+0x258/0x281 [bluetooth]
[  440.648746] 	hci_event_packet+0x28b/0x2326 [bluetooth]
[  440.648759] 	hci_rx_work+0x161/0x291 [bluetooth]
[  440.648764] 	process_one_work+0x163/0x2b2
[  440.648767] 	worker_thread+0x1a9/0x25c
[  440.648770] 	kthread+0xf8/0xfd
[  440.648774] 	ret_from_fork+0x2e/0x38
[  440.648779] INFO: Freed in __sk_destruct+0xd3/0xdf age=3815 cpu=1 pid=1047
[  440.648782] 	__slab_free+0x4b/0x27a
[  440.648784] 	kfree+0x12e/0x155
[  440.648787] 	__sk_destruct+0xd3/0xdf
[  440.648790] 	sk_destruct+0x27/0x29
[  440.648793] 	__sk_free+0x75/0x91
[  440.648795] 	sk_free+0x1c/0x1e
[  440.648810] 	sco_sock_kill+0x5a/0x5f [bluetooth]
[  440.648825] 	sco_conn_del+0x8e/0xba [bluetooth]
[  440.648840] 	sco_disconn_cfm+0x3a/0x41 [bluetooth]
[  440.648855] 	hci_event_packet+0x45e/0x2326 [bluetooth]
[  440.648868] 	hci_rx_work+0x161/0x291 [bluetooth]
[  440.648872] 	process_one_work+0x163/0x2b2
[  440.648875] 	worker_thread+0x1a9/0x25c
[  440.648877] 	kthread+0xf8/0xfd
[  440.648880] 	ret_from_fork+0x2e/0x38
[  440.648884] INFO: Slab 0xf4718580 objects=27 used=27 fp=0x  (null) flags=0x40008100
[  440.648886] INFO: Object 0xe70f4b88 @offset=19336 fp=0xe70f54f8

When KASAN was enabled, it reported:

[  210.096613] ==================================================================
[  210.096634] BUG: KASAN: use-after-free in ex_handler_refcount+0x5b/0x127
[  210.096641] Write of size 4 at addr ffff880107e17160 by task kworker/u9:1/2040

[  210.096651] CPU: 1 PID: 2040 Comm: kworker/u9:1 Tainted: G     U     O    4.14.47-20180606+ #2
[  210.096654] Hardware name: , BIOS 2017.01-00087-g43e04de 08/30/2017
[  210.096693] Workqueue: hci0 hci_rx_work [bluetooth]
[  210.096698] Call Trace:
[  210.096711]  dump_stack+0x46/0x59
[  210.096722]  print_address_description+0x6b/0x23b
[  210.096729]  ? ex_handler_refcount+0x5b/0x127
[  210.096736]  kasan_report+0x220/0x246
[  210.096744]  ex_handler_refcount+0x5b/0x127
[  210.096751]  ? ex_handler_clear_fs+0x85/0x85
[  210.096757]  fixup_exception+0x8c/0x96
[  210.096766]  do_trap+0x66/0x2c1
[  210.096773]  do_error_trap+0x152/0x180
[  210.096781]  ? fixup_bug+0x78/0x78
[  210.096817]  ? hci_debugfs_create_conn+0x244/0x26a [bluetooth]
[  210.096824]  ? __schedule+0x113b/0x1453
[  210.096830]  ? sysctl_net_exit+0xe/0xe
[  210.096837]  ? __wake_up_common+0x343/0x343
[  210.096843]  ? insert_work+0x107/0x163
[  210.096850]  invalid_op+0x1b/0x40
[  210.096888] RIP: 0010:hci_debugfs_create_conn+0x244/0x26a [bluetooth]
[  210.096892] RSP: 0018:ffff880094a0f970 EFLAGS: 00010296
[  210.096898] RAX: 0000000000000000 RBX: ffff880107e170e8 RCX: ffff880107e17160
[  210.096902] RDX: 000000000000002f RSI: ffff88013b80ed40 RDI: ffffffffa058b940
[  210.096906] RBP: ffff88011b2b0578 R08: 00000000852f0ec9 R09: ffffffff81cfcf9b
[  210.096909] R10: 00000000d21bdad7 R11: 0000000000000001 R12: ffff8800967b0488
[  210.096913] R13: ffff880107e17168 R14: 0000000000000068 R15: ffff8800949c0008
[  210.096920]  ? __sk_destruct+0x2c6/0x2d4
[  210.096959]  hci_event_packet+0xff5/0x7de2 [bluetooth]
[  210.096969]  ? __local_bh_enable_ip+0x43/0x5b
[  210.097004]  ? l2cap_sock_recv_cb+0x158/0x166 [bluetooth]
[  210.097039]  ? hci_le_meta_evt+0x2bb3/0x2bb3 [bluetooth]
[  210.097075]  ? l2cap_ertm_init+0x94e/0x94e [bluetooth]
[  210.097093]  ? xhci_urb_enqueue+0xbd8/0xcf5 [xhci_hcd]
[  210.097102]  ? __accumulate_pelt_segments+0x24/0x33
[  210.097109]  ? __accumulate_pelt_segments+0x24/0x33
[  210.097115]  ? __update_load_avg_se.isra.2+0x217/0x3a4
[  210.097122]  ? set_next_entity+0x7c3/0x12cd
[  210.097128]  ? pick_next_entity+0x25e/0x26c
[  210.097135]  ? pick_next_task_fair+0x2ca/0xc1a
[  210.097141]  ? switch_mm_irqs_off+0x346/0xb4f
[  210.097147]  ? __switch_to+0x769/0xbc4
[  210.097153]  ? compat_start_thread+0x66/0x66
[  210.097188]  ? hci_conn_check_link_mode+0x1cd/0x1cd [bluetooth]
[  210.097195]  ? finish_task_switch+0x392/0x431
[  210.097228]  ? hci_rx_work+0x154/0x487 [bluetooth]
[  210.097260]  hci_rx_work+0x154/0x487 [bluetooth]
[  210.097269]  process_one_work+0x579/0x9e9
[  210.097277]  worker_thread+0x68f/0x804
[  210.097285]  kthread+0x31c/0x32b
[  210.097292]  ? rescuer_thread+0x70c/0x70c
[  210.097299]  ? kthread_create_on_node+0xa3/0xa3
[  210.097306]  ret_from_fork+0x35/0x40

[  210.097314] Allocated by task 2040:
[  210.097323]  kasan_kmalloc.part.1+0x51/0xc7
[  210.097328]  __kmalloc+0x17f/0x1b6
[  210.097335]  sk_prot_alloc+0xf2/0x1a3
[  210.097340]  sk_alloc+0x22/0x297
[  210.097375]  sco_sock_alloc.constprop.7+0x23/0x202 [bluetooth]
[  210.097410]  sco_connect_cfm+0x2d0/0x566 [bluetooth]
[  210.097443]  hci_conn_request_evt.isra.53+0x6d3/0x762 [bluetooth]
[  210.097476]  hci_event_packet+0x85e/0x7de2 [bluetooth]
[  210.097507]  hci_rx_work+0x154/0x487 [bluetooth]
[  210.097512]  process_one_work+0x579/0x9e9
[  210.097517]  worker_thread+0x68f/0x804
[  210.097523]  kthread+0x31c/0x32b
[  210.097529]  ret_from_fork+0x35/0x40

[  210.097533] Freed by task 2040:
[  210.097539]  kasan_slab_free+0xb3/0x15e
[  210.097544]  kfree+0x103/0x1a9
[  210.097549]  __sk_destruct+0x2c6/0x2d4
[  210.097584]  sco_conn_del.isra.1+0xba/0x10e [bluetooth]
[  210.097617]  hci_event_packet+0xff5/0x7de2 [bluetooth]
[  210.097648]  hci_rx_work+0x154/0x487 [bluetooth]
[  210.097653]  process_one_work+0x579/0x9e9
[  210.097658]  worker_thread+0x68f/0x804
[  210.097663]  kthread+0x31c/0x32b
[  210.097670]  ret_from_fork+0x35/0x40

[  210.097676] The buggy address belongs to the object at ffff880107e170e8
 which belongs to the cache kmalloc-1024 of size 1024
[  210.097681] The buggy address is located 120 bytes inside of
 1024-byte region [ffff880107e170e8, ffff880107e174e8)
[  210.097683] The buggy address belongs to the page:
[  210.097689] page:ffffea00041f8400 count:1 mapcount:0 mapping:          (null) index:0xffff880107e15b68 compound_mapcount: 0
[  210.110194] flags: 0x8000000000008100(slab|head)
[  210.115441] raw: 8000000000008100 0000000000000000 ffff880107e15b68 0000000100170016
[  210.115448] raw: ffffea0004a47620 ffffea0004b48e20 ffff88013b80ed40 0000000000000000
[  210.115451] page dumped because: kasan: bad access detected

[  210.115454] Memory state around the buggy address:
[  210.115460]  ffff880107e17000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  210.115465]  ffff880107e17080: fc fc fc fc fc fc fc fc fc fc fc fc fc fb fb fb
[  210.115469] >ffff880107e17100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  210.115472]                                                        ^
[  210.115477]  ffff880107e17180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  210.115481]  ffff880107e17200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  210.115483] ==================================================================

And finally when BT_DBG() and ftrace was enabled it showed:

       <...>-14979 [001] ....   186.104191: sco_sock_kill <-sco_sock_close
       <...>-14979 [001] ....   186.104191: sco_sock_kill <-sco_sock_release
       <...>-14979 [001] ....   186.104192: sco_sock_kill: sk ef0497a0 state 9
       <...>-14979 [001] ....   186.104193: bt_sock_unlink <-sco_sock_kill
kworker/u9:2-792   [001] ....   186.104246: sco_sock_kill <-sco_conn_del
kworker/u9:2-792   [001] ....   186.104248: sco_sock_kill: sk ef0497a0 state 9
kworker/u9:2-792   [001] ....   186.104249: bt_sock_unlink <-sco_sock_kill
kworker/u9:2-792   [001] ....   186.104250: sco_sock_destruct <-__sk_destruct
kworker/u9:2-792   [001] ....   186.104250: sco_sock_destruct: sk ef0497a0
kworker/u9:2-792   [001] ....   186.104860: hci_conn_del <-hci_event_packet
kworker/u9:2-792   [001] ....   186.104864: hci_conn_del: hci0 hcon ef0484c0 handle 266

Only in the failed case, sco_sock_kill() gets called with the same sock
pointer two times. Add a check for SOCK_DEAD to avoid continue killing
a socket which has already been killed.

Signed-off-by: Sudip Mukherjee <[email protected]>
Signed-off-by: Marcel Holtmann <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
numbqq pushed a commit that referenced this pull request Oct 10, 2018
commit ccd5b32 upstream.

The following commit:

  39a0526 ("x86/mm: Factor out LDT init from context init")

renamed init_new_context() to init_new_context_ldt() and added a new
init_new_context() which calls init_new_context_ldt().  However, the
error code of init_new_context_ldt() was ignored.  Consequently, if a
memory allocation in alloc_ldt_struct() failed during a fork(), the
->context.ldt of the new task remained the same as that of the old task
(due to the memcpy() in dup_mm()).  ldt_struct's are not intended to be
shared, so a use-after-free occurred after one task exited.

Fix the bug by making init_new_context() pass through the error code of
init_new_context_ldt().

This bug was found by syzkaller, which encountered the following splat:

    BUG: KASAN: use-after-free in free_ldt_struct.part.2+0x10a/0x150 arch/x86/kernel/ldt.c:116
    Read of size 4 at addr ffff88006d2cb7c8 by task kworker/u9:0/3710

    CPU: 1 PID: 3710 Comm: kworker/u9:0 Not tainted 4.13.0-rc4-next-20170811 #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
     __dump_stack lib/dump_stack.c:16 [inline]
     dump_stack+0x194/0x257 lib/dump_stack.c:52
     print_address_description+0x73/0x250 mm/kasan/report.c:252
     kasan_report_error mm/kasan/report.c:351 [inline]
     kasan_report+0x24e/0x340 mm/kasan/report.c:409
     __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:429
     free_ldt_struct.part.2+0x10a/0x150 arch/x86/kernel/ldt.c:116
     free_ldt_struct arch/x86/kernel/ldt.c:173 [inline]
     destroy_context_ldt+0x60/0x80 arch/x86/kernel/ldt.c:171
     destroy_context arch/x86/include/asm/mmu_context.h:157 [inline]
     __mmdrop+0xe9/0x530 kernel/fork.c:889
     mmdrop include/linux/sched/mm.h:42 [inline]
     exec_mmap fs/exec.c:1061 [inline]
     flush_old_exec+0x173c/0x1ff0 fs/exec.c:1291
     load_elf_binary+0x81f/0x4ba0 fs/binfmt_elf.c:855
     search_binary_handler+0x142/0x6b0 fs/exec.c:1652
     exec_binprm fs/exec.c:1694 [inline]
     do_execveat_common.isra.33+0x1746/0x22e0 fs/exec.c:1816
     do_execve+0x31/0x40 fs/exec.c:1860
     call_usermodehelper_exec_async+0x457/0x8f0 kernel/umh.c:100
     ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431

    Allocated by task 3700:
     save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
     save_stack+0x43/0xd0 mm/kasan/kasan.c:447
     set_track mm/kasan/kasan.c:459 [inline]
     kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
     kmem_cache_alloc_trace+0x136/0x750 mm/slab.c:3627
     kmalloc include/linux/slab.h:493 [inline]
     alloc_ldt_struct+0x52/0x140 arch/x86/kernel/ldt.c:67
     write_ldt+0x7b7/0xab0 arch/x86/kernel/ldt.c:277
     sys_modify_ldt+0x1ef/0x240 arch/x86/kernel/ldt.c:307
     entry_SYSCALL_64_fastpath+0x1f/0xbe

    Freed by task 3700:
     save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
     save_stack+0x43/0xd0 mm/kasan/kasan.c:447
     set_track mm/kasan/kasan.c:459 [inline]
     kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
     __cache_free mm/slab.c:3503 [inline]
     kfree+0xca/0x250 mm/slab.c:3820
     free_ldt_struct.part.2+0xdd/0x150 arch/x86/kernel/ldt.c:121
     free_ldt_struct arch/x86/kernel/ldt.c:173 [inline]
     destroy_context_ldt+0x60/0x80 arch/x86/kernel/ldt.c:171
     destroy_context arch/x86/include/asm/mmu_context.h:157 [inline]
     __mmdrop+0xe9/0x530 kernel/fork.c:889
     mmdrop include/linux/sched/mm.h:42 [inline]
     __mmput kernel/fork.c:916 [inline]
     mmput+0x541/0x6e0 kernel/fork.c:927
     copy_process.part.36+0x22e1/0x4af0 kernel/fork.c:1931
     copy_process kernel/fork.c:1546 [inline]
     _do_fork+0x1ef/0xfb0 kernel/fork.c:2025
     SYSC_clone kernel/fork.c:2135 [inline]
     SyS_clone+0x37/0x50 kernel/fork.c:2129
     do_syscall_64+0x26c/0x8c0 arch/x86/entry/common.c:287
     return_from_SYSCALL_64+0x0/0x7a

Here is a C reproducer:

    #include <asm/ldt.h>
    #include <pthread.h>
    #include <signal.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void *fork_thread(void *_arg)
    {
        fork();
    }

    int main(void)
    {
        struct user_desc desc = { .entry_number = 8191 };

        syscall(__NR_modify_ldt, 1, &desc, sizeof(desc));

        for (;;) {
            if (fork() == 0) {
                pthread_t t;

                srand(getpid());
                pthread_create(&t, NULL, fork_thread, NULL);
                usleep(rand() % 10000);
                syscall(__NR_exit_group, 0);
            }
            wait(NULL);
        }
    }

Note: the reproducer takes advantage of the fact that alloc_ldt_struct()
may use vmalloc() to allocate a large ->entries array, and after
commit:

  5d17a73 ("vmalloc: back off when the current task is killed")

it is possible for userspace to fail a task's vmalloc() by
sending a fatal signal, e.g. via exit_group().  It would be more
difficult to reproduce this bug on kernels without that commit.

This bug only affected kernels with CONFIG_MODIFY_LDT_SYSCALL=y.

Signed-off-by: Eric Biggers <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Cc: <[email protected]> [v4.6+]
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Fixes: 39a0526 ("x86/mm: Factor out LDT init from context init")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Ben Hutchings <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
gouwa pushed a commit that referenced this pull request Dec 24, 2018
This patch fixes the following warning:
[    2.226264] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620
[    2.226341] in_atomic(): 0, irqs_disabled(): 0, pid: 1, name: swapper/0
[    2.226366] 4 locks held by swapper/0/1:
[    2.226385]  #0:  (&dev->mutex){......}, at: [<ffffff800851c664>] __device_attach+0x3c/0x134
[    2.226515]  #1:  (cpu_hotplug.lock){......}, at: [<ffffff800809feb0>] get_online_cpus+0x38/0x9c
[    2.226594]  #2:  (subsys mutex#7){......}, at: [<ffffff800851b0e4>] subsys_interface_register+0x54/0xfc
[    2.226684]  #3:  (rcu_read_lock){......}, at: [<ffffff800845c764>] rockchip_adjust_power_scale+0x88/0x450
[    2.226771] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 4.4.132 #600
[    2.226790] Hardware name: Rockchip PX30 evb ddr3 board (DT)
[    2.226809] Call trace:
[    2.226840] [<ffffff800808a0d4>] dump_backtrace+0x0/0x1ec
[    2.226889] [<ffffff800808a2d4>] show_stack+0x14/0x1c
[    2.226918] [<ffffff80083c41b4>] dump_stack+0x94/0xbc
[    2.226946] [<ffffff80080cb708>] ___might_sleep+0x108/0x118
[    2.226972] [<ffffff80080cb788>] __might_sleep+0x70/0x80
[    2.227001] [<ffffff8008b6d1d0>] mutex_lock_nested+0x54/0x38c
[    2.227031] [<ffffff8008858714>] __of_clk_get_from_provider+0x44/0xec
[    2.227061] [<ffffff8008852814>] __of_clk_get_by_name+0xd4/0x14c
[    2.227112] [<ffffff80088528a4>] of_clk_get_by_name+0x18/0x28
[    2.227140] [<ffffff800845c980>] rockchip_adjust_power_scale+0x2a4/0x450
[    2.227171] [<ffffff800877f630>] cpufreq_init+0x154/0x394
[    2.227199] [<ffffff8008776ac4>] cpufreq_online+0x1b0/0x68c
[    2.227225] [<ffffff8008777040>] cpufreq_add_dev+0x3c/0x94
[    2.227253] [<ffffff800851b168>] subsys_interface_register+0xd8/0xfc
[    2.227281] [<ffffff80087772d8>] cpufreq_register_driver+0x10c/0x1a8
[    2.227332] [<ffffff800877f93c>] dt_cpufreq_probe+0xcc/0xe8
[    2.227360] [<ffffff800851e7b8>] platform_drv_probe+0x54/0xa8
[    2.227385] [<ffffff800851c924>] driver_probe_device+0x194/0x278
[    2.227411] [<ffffff800851cb40>] __device_attach_driver+0x60/0x9c
[    2.227439] [<ffffff800851ae14>] bus_for_each_drv+0x9c/0xbc
[    2.227464] [<ffffff800851c6f0>] __device_attach+0xc8/0x134
[    2.227488] [<ffffff800851ccb8>] device_initial_probe+0x10/0x18
[    2.227514] [<ffffff800851bcec>] bus_probe_device+0x2c/0x90
[    2.227541] [<ffffff8008519e90>] device_add+0x44c/0x510
[    2.227570] [<ffffff800851e4d4>] platform_device_add+0xa0/0x1e4
[    2.227597] [<ffffff800851ef28>] platform_device_register_full+0xa4/0xe4
[    2.227627] [<ffffff80090f8680>] rockchip_cpufreq_driver_init+0xd4/0x31c
[    2.227654] [<ffffff8008083468>] do_one_initcall+0x84/0x1a4
[    2.227682] [<ffffff80090c0e98>] kernel_init_freeable+0x260/0x264
[    2.227708] [<ffffff8008b6ab68>] kernel_init+0x10/0xf8
[    2.227734] [<ffffff80080832a0>] ret_from_fork+0x10/0x30

Fixes: 197a2d3 ("soc: rockchip: opp_select: add missing rcu lock")
Change-Id: Ib52b6f7bc77bb207a1c8c6880f7a5e916fa3d2ee
Signed-off-by: Finley Xiao <[email protected]>
gouwa pushed a commit that referenced this pull request May 28, 2019
[ Upstream commit a3ca831 ]

When booting up with "threadirqs" in command line, all irq handlers of the DMA
controller pl330 will be threaded forcedly. These threads will race for the same
list, pl330->req_done.

Before the callback, the spinlock was released. And after it, the spinlock was
taken. This opened an race window where another threaded irq handler could steal
the spinlock and be permitted to delete entries of the list, pl330->req_done.

If the later deleted an entry that was still referred to by the former, there would
be a kernel panic when the former was scheduled and tried to get the next sibling
of the deleted entry.

The scenario could be depicted as below:

  Thread: T1  pl330->req_done  Thread: T2
      |             |              |
      |          -A-B-C-D-         |
    Locked          |              |
      |             |           Waiting
    Del A           |              |
      |          -B-C-D-           |
    Unlocked        |              |
      |             |           Locked
    Waiting         |              |
      |             |            Del B
      |             |              |
      |           -C-D-         Unlocked
    Waiting         |              |
      |
    Locked
      |
   get C via B
      \
       - Kernel panic

The kernel panic looked like as below:

Unable to handle kernel paging request at virtual address dead000000000108
pgd = ffffff8008c9e000
[dead000000000108] *pgd=000000027fffe003, *pud=000000027fffe003, *pmd=0000000000000000
Internal error: Oops: 96000044 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 85 Comm: irq/59-66330000 Not tainted 4.8.24-WR9.0.0.12_standard #2
Hardware name: Broadcom NS2 SVK (DT)
task: ffffffc1f5cc3c00 task.stack: ffffffc1f5ce0000
PC is at pl330_irq_handler+0x27c/0x390
LR is at pl330_irq_handler+0x2a8/0x390
pc : [<ffffff80084cb694>] lr : [<ffffff80084cb6c0>] pstate: 800001c5
sp : ffffffc1f5ce3d00
x29: ffffffc1f5ce3d00 x28: 0000000000000140
x27: ffffffc1f5c530b0 x26: dead000000000100
x25: dead000000000200 x24: 0000000000418958
x23: 0000000000000001 x22: ffffffc1f5ccd668
x21: ffffffc1f5ccd590 x20: ffffffc1f5ccd418
x19: dead000000000060 x18: 0000000000000001
x17: 0000000000000007 x16: 0000000000000001
x15: ffffffffffffffff x14: ffffffffffffffff
x13: ffffffffffffffff x12: 0000000000000000
x11: 0000000000000001 x10: 0000000000000840
x9 : ffffffc1f5ce0000 x8 : ffffffc1f5cc3338
x7 : ffffff8008ce2020 x6 : 0000000000000000
x5 : 0000000000000000 x4 : 0000000000000001
x3 : dead000000000200 x2 : dead000000000100
x1 : 0000000000000140 x0 : ffffffc1f5ccd590

Process irq/59-66330000 (pid: 85, stack limit = 0xffffffc1f5ce0020)
Stack: (0xffffffc1f5ce3d00 to 0xffffffc1f5ce4000)
3d00: ffffffc1f5ce3d80 ffffff80080f09d0 ffffffc1f5ca0c00 ffffffc1f6f7c600
3d20: ffffffc1f5ce0000 ffffffc1f6f7c600 ffffffc1f5ca0c00 ffffff80080f0998
3d40: ffffffc1f5ce0000 ffffff80080f0000 0000000000000000 0000000000000000
3d60: ffffff8008ce202c ffffff8008ce2020 ffffffc1f5ccd668 ffffffc1f5c530b0
3d80: ffffffc1f5ce3db0 ffffff80080f0d70 ffffffc1f5ca0c40 0000000000000001
3da0: ffffffc1f5ce0000 ffffff80080f0cfc ffffffc1f5ce3e20 ffffff80080bf4f8
3dc0: ffffffc1f5ca0c80 ffffff8008bf3798 ffffff8008955528 ffffffc1f5ca0c00
3de0: ffffff80080f0c30 0000000000000000 0000000000000000 0000000000000000
3e00: 0000000000000000 0000000000000000 0000000000000000 ffffff80080f0b68
3e20: 0000000000000000 ffffff8008083690 ffffff80080bf420 ffffffc1f5ca0c80
3e40: 0000000000000000 0000000000000000 0000000000000000 ffffff80080cb648
3e60: ffffff8008b1c780 0000000000000000 0000000000000000 ffffffc1f5ca0c00
3e80: ffffffc100000000 ffffff8000000000 ffffffc1f5ce3e90 ffffffc1f5ce3e90
3ea0: 0000000000000000 ffffff8000000000 ffffffc1f5ce3eb0 ffffffc1f5ce3eb0
3ec0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
3ee0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
3f00: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
3f20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
3f40: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
3f60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
3f80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
3fa0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
3fc0: 0000000000000000 0000000000000005 0000000000000000 0000000000000000
3fe0: 0000000000000000 0000000000000000 0000000275ce3ff0 0000000275ce3ff8
Call trace:
Exception stack(0xffffffc1f5ce3b30 to 0xffffffc1f5ce3c60)
3b20:                                   dead000000000060 0000008000000000
3b40: ffffffc1f5ce3d00 ffffff80084cb694 0000000000000008 0000000000000e88
3b60: ffffffc1f5ce3bb0 ffffff80080dac68 ffffffc1f5ce3b90 ffffff8008826fe4
3b80: 00000000000001c0 00000000000001c0 ffffffc1f5ce3bb0 ffffff800848dfcc
3ba0: 0000000000020000 ffffff8008b15ae4 ffffffc1f5ce3c00 ffffff800808f000
3bc0: 0000000000000010 ffffff80088377f0 ffffffc1f5ccd590 0000000000000140
3be0: dead000000000100 dead000000000200 0000000000000001 0000000000000000
3c00: 0000000000000000 ffffff8008ce2020 ffffffc1f5cc3338 ffffffc1f5ce0000
3c20: 0000000000000840 0000000000000001 0000000000000000 ffffffffffffffff
3c40: ffffffffffffffff ffffffffffffffff 0000000000000001 0000000000000007
[<ffffff80084cb694>] pl330_irq_handler+0x27c/0x390
[<ffffff80080f09d0>] irq_forced_thread_fn+0x38/0x88
[<ffffff80080f0d70>] irq_thread+0x140/0x200
[<ffffff80080bf4f8>] kthread+0xd8/0xf0
[<ffffff8008083690>] ret_from_fork+0x10/0x40
Code: f2a00838 f9405763 aa1c03e1 aa1503e0 (f9000443)
---[ end trace f50005726d31199c ]---
Kernel panic - not syncing: Fatal exception in interrupt
SMP: stopping secondary CPUs
SMP: failed to stop secondary CPUs 0-1
Kernel Offset: disabled
Memory Limit: none
---[ end Kernel panic - not syncing: Fatal exception in interrupt

To fix this, re-start with the list-head after dropping the lock then
re-takeing it.

Reviewed-by: Frank Mori Hess <[email protected]>
Tested-by: Frank Mori Hess <[email protected]>
Signed-off-by: Qi Hou <[email protected]>
Signed-off-by: Vinod Koul <[email protected]>

Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
numbqq pushed a commit that referenced this pull request Jul 27, 2019
[ Upstream commit a9fd095 ]

Leaving dev_init_lock mutex locked in probe causes BUG and a WARNING when
kernel is compiled with CONFIG_PROVE_LOCKING. Convert mutex to completion
which silences those warnings and improves code readability.

Fix below errors when connecting the USB WiFi dongle:

brcmfmac: brcmf_fw_alloc_request: using brcm/brcmfmac43143 for chip BCM43143/2
BUG: workqueue leaked lock or atomic: kworker/0:2/0x00000000/434
     last function: hub_event
1 lock held by kworker/0:2/434:
 #0: 18d5dcdf (&devinfo->dev_init_lock){+.+.}, at: brcmf_usb_probe+0x78/0x550 [brcmfmac]
CPU: 0 PID: 434 Comm: kworker/0:2 Not tainted 4.19.23-00084-g454a789-dirty #123
Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
Workqueue: usb_hub_wq hub_event
[<8011237c>] (unwind_backtrace) from [<8010d74c>] (show_stack+0x10/0x14)
[<8010d74c>] (show_stack) from [<809c4324>] (dump_stack+0xa8/0xd4)
[<809c4324>] (dump_stack) from [<8014195c>] (process_one_work+0x710/0x808)
[<8014195c>] (process_one_work) from [<80141a80>] (worker_thread+0x2c/0x564)
[<80141a80>] (worker_thread) from [<80147bcc>] (kthread+0x13c/0x16c)
[<80147bcc>] (kthread) from [<801010b4>] (ret_from_fork+0x14/0x20)
Exception stack(0xed1d9fb0 to 0xed1d9ff8)
9fa0:                                     00000000 00000000 00000000 00000000
9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
9fe0: 00000000 00000000 00000000 00000000 00000013 00000000

======================================================
WARNING: possible circular locking dependency detected
4.19.23-00084-g454a789-dirty #123 Not tainted
------------------------------------------------------
kworker/0:2/434 is trying to acquire lock:
e29cf799 ((wq_completion)"events"){+.+.}, at: process_one_work+0x174/0x808

but task is already holding lock:
18d5dcdf (&devinfo->dev_init_lock){+.+.}, at: brcmf_usb_probe+0x78/0x550 [brcmfmac]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&devinfo->dev_init_lock){+.+.}:
       mutex_lock_nested+0x1c/0x24
       brcmf_usb_probe+0x78/0x550 [brcmfmac]
       usb_probe_interface+0xc0/0x1bc
       really_probe+0x228/0x2c0
       __driver_attach+0xe4/0xe8
       bus_for_each_dev+0x68/0xb4
       bus_add_driver+0x19c/0x214
       driver_register+0x78/0x110
       usb_register_driver+0x84/0x148
       process_one_work+0x228/0x808
       worker_thread+0x2c/0x564
       kthread+0x13c/0x16c
       ret_from_fork+0x14/0x20
         (null)

-> #1 (brcmf_driver_work){+.+.}:
       worker_thread+0x2c/0x564
       kthread+0x13c/0x16c
       ret_from_fork+0x14/0x20
         (null)

-> #0 ((wq_completion)"events"){+.+.}:
       process_one_work+0x1b8/0x808
       worker_thread+0x2c/0x564
       kthread+0x13c/0x16c
       ret_from_fork+0x14/0x20
         (null)

other info that might help us debug this:

Chain exists of:
  (wq_completion)"events" --> brcmf_driver_work --> &devinfo->dev_init_lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&devinfo->dev_init_lock);
                               lock(brcmf_driver_work);
                               lock(&devinfo->dev_init_lock);
  lock((wq_completion)"events");

 *** DEADLOCK ***

1 lock held by kworker/0:2/434:
 #0: 18d5dcdf (&devinfo->dev_init_lock){+.+.}, at: brcmf_usb_probe+0x78/0x550 [brcmfmac]

stack backtrace:
CPU: 0 PID: 434 Comm: kworker/0:2 Not tainted 4.19.23-00084-g454a789-dirty #123
Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
Workqueue: events request_firmware_work_func
[<8011237c>] (unwind_backtrace) from [<8010d74c>] (show_stack+0x10/0x14)
[<8010d74c>] (show_stack) from [<809c4324>] (dump_stack+0xa8/0xd4)
[<809c4324>] (dump_stack) from [<80172838>] (print_circular_bug+0x210/0x330)
[<80172838>] (print_circular_bug) from [<80175940>] (__lock_acquire+0x160c/0x1a30)
[<80175940>] (__lock_acquire) from [<8017671c>] (lock_acquire+0xe0/0x268)
[<8017671c>] (lock_acquire) from [<80141404>] (process_one_work+0x1b8/0x808)
[<80141404>] (process_one_work) from [<80141a80>] (worker_thread+0x2c/0x564)
[<80141a80>] (worker_thread) from [<80147bcc>] (kthread+0x13c/0x16c)
[<80147bcc>] (kthread) from [<801010b4>] (ret_from_fork+0x14/0x20)
Exception stack(0xed1d9fb0 to 0xed1d9ff8)
9fa0:                                     00000000 00000000 00000000 00000000
9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
9fe0: 00000000 00000000 00000000 00000000 00000013 00000000

Signed-off-by: Piotr Figiel <[email protected]>
Signed-off-by: Kalle Valo <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
numbqq pushed a commit that referenced this pull request Jul 27, 2019
…text

commit 0c9e8b3 upstream.

stub_probe() and stub_disconnect() call functions which could call
sleeping function in invalid context whil holding busid_lock.

Fix the problem by refining the lock holds to short critical sections
to change the busid_priv fields. This fix restructures the code to
limit the lock holds in stub_probe() and stub_disconnect().

stub_probe():

[15217.927028] BUG: sleeping function called from invalid context at mm/slab.h:418
[15217.927038] in_atomic(): 1, irqs_disabled(): 0, pid: 29087, name: usbip
[15217.927044] 5 locks held by usbip/29087:
[15217.927047]  #0: 0000000091647f28 (sb_writers#6){....}, at: vfs_write+0x191/0x1c0
[15217.927062]  #1: 000000008f9ba75b (&of->mutex){....}, at: kernfs_fop_write+0xf7/0x1b0
[15217.927072]  #2: 00000000872e5b4b (&dev->mutex){....}, at: __device_driver_lock+0x3b/0x50
[15217.927082]  #3: 00000000e74ececc (&dev->mutex){....}, at: __device_driver_lock+0x46/0x50
[15217.927090]  #4: 00000000b20abbe0 (&(&busid_table[i].busid_lock)->rlock){....}, at: get_busid_priv+0x48/0x60 [usbip_host]
[15217.927103] CPU: 3 PID: 29087 Comm: usbip Tainted: G        W         5.1.0-rc6+ #40
[15217.927106] Hardware name: Dell Inc. OptiPlex 790/0HY9JP, BIOS A18 09/24/2013
[15217.927109] Call Trace:
[15217.927118]  dump_stack+0x63/0x85
[15217.927127]  ___might_sleep+0xff/0x120
[15217.927133]  __might_sleep+0x4a/0x80
[15217.927143]  kmem_cache_alloc_trace+0x1aa/0x210
[15217.927156]  stub_probe+0xe8/0x440 [usbip_host]
[15217.927171]  usb_probe_device+0x34/0x70

stub_disconnect():

[15279.182478] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:908
[15279.182487] in_atomic(): 1, irqs_disabled(): 0, pid: 29114, name: usbip
[15279.182492] 5 locks held by usbip/29114:
[15279.182494]  #0: 0000000091647f28 (sb_writers#6){....}, at: vfs_write+0x191/0x1c0
[15279.182506]  #1: 00000000702cf0f3 (&of->mutex){....}, at: kernfs_fop_write+0xf7/0x1b0
[15279.182514]  #2: 00000000872e5b4b (&dev->mutex){....}, at: __device_driver_lock+0x3b/0x50
[15279.182522]  #3: 00000000e74ececc (&dev->mutex){....}, at: __device_driver_lock+0x46/0x50
[15279.182529]  #4: 00000000b20abbe0 (&(&busid_table[i].busid_lock)->rlock){....}, at: get_busid_priv+0x48/0x60 [usbip_host]
[15279.182541] CPU: 0 PID: 29114 Comm: usbip Tainted: G        W         5.1.0-rc6+ #40
[15279.182543] Hardware name: Dell Inc. OptiPlex 790/0HY9JP, BIOS A18 09/24/2013
[15279.182546] Call Trace:
[15279.182554]  dump_stack+0x63/0x85
[15279.182561]  ___might_sleep+0xff/0x120
[15279.182566]  __might_sleep+0x4a/0x80
[15279.182574]  __mutex_lock+0x55/0x950
[15279.182582]  ? get_busid_priv+0x48/0x60 [usbip_host]
[15279.182587]  ? reacquire_held_locks+0xec/0x1a0
[15279.182591]  ? get_busid_priv+0x48/0x60 [usbip_host]
[15279.182597]  ? find_held_lock+0x94/0xa0
[15279.182609]  mutex_lock_nested+0x1b/0x20
[15279.182614]  ? mutex_lock_nested+0x1b/0x20
[15279.182618]  kernfs_remove_by_name_ns+0x2a/0x90
[15279.182625]  sysfs_remove_file_ns+0x15/0x20
[15279.182629]  device_remove_file+0x19/0x20
[15279.182634]  stub_disconnect+0x6d/0x180 [usbip_host]
[15279.182643]  usb_unbind_device+0x27/0x60

Signed-off-by: Shuah Khan <[email protected]>
Cc: stable <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
numbqq pushed a commit that referenced this pull request Jul 27, 2019
… sdevs)

commit ef4021f upstream.

When the user tries to remove a zfcp port via sysfs, we only rejected it if
there are zfcp unit children under the port. With purely automatically
scanned LUNs there are no zfcp units but only SCSI devices. In such cases,
the port_remove erroneously continued. We close the port and this
implicitly closes all LUNs under the port. The SCSI devices survive with
their private zfcp_scsi_dev still holding a reference to the "removed"
zfcp_port (still allocated but invisible in sysfs) [zfcp_get_port_by_wwpn
in zfcp_scsi_slave_alloc]. This is not a problem as long as the fc_rport
stays blocked. Once (auto) port scan brings back the removed port, we
unblock its fc_rport again by design.  However, there is no mechanism that
would recover (open) the LUNs under the port (no "ersfs_3" without
zfcp_unit [zfcp_erp_strategy_followup_success]).  Any pending or new I/O to
such LUN leads to repeated:

  Done: NEEDS_RETRY Result: hostbyte=DID_IMM_RETRY driverbyte=DRIVER_OK

See also v4.10 commit 6f2ce1c ("scsi: zfcp: fix rport unblock race
with LUN recovery"). Even a manual LUN recovery
(echo 0 > /sys/bus/scsi/devices/H:C:T:L/zfcp_failed)
does not help, as the LUN links to the old "removed" port which remains
to lack ZFCP_STATUS_COMMON_RUNNING [zfcp_erp_required_act].
The only workaround is to first ensure that the fc_rport is blocked
(e.g. port_remove again in case it was re-discovered by (auto) port scan),
then delete the SCSI devices, and finally re-discover by (auto) port scan.
The port scan includes an fc_rport unblock, which in turn triggers
a new scan on the scsi target to freshly get new pure auto scan LUNs.

Fix this by rejecting port_remove also if there are SCSI devices
(even without any zfcp_unit) under this port. Re-use mechanics from v3.7
commit d99b601 ("[SCSI] zfcp: restore refcount check on port_remove").
However, we have to give up zfcp_sysfs_port_units_mutex earlier in unit_add
to prevent a deadlock with scsi_host scan taking shost->scan_mutex first
and then zfcp_sysfs_port_units_mutex now in our zfcp_scsi_slave_alloc().

Signed-off-by: Steffen Maier <[email protected]>
Fixes: b62a8d9 ("[SCSI] zfcp: Use SCSI device data zfcp scsi dev instead of zfcp unit")
Fixes: f8210e3 ("[SCSI] zfcp: Allow midlayer to scan for LUNs when running in NPIV mode")
Cc: <[email protected]> #2.6.37+
Reviewed-by: Benjamin Block <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
numbqq pushed a commit that referenced this pull request Jul 27, 2019
[ Upstream commit 347ab94 ]

This patch fixes deadlock warning if removing PWM device
when CONFIG_PROVE_LOCKING is enabled.

This issue can be reproceduced by the following steps on
the R-Car H3 Salvator-X board if the backlight is disabled:

 # cd /sys/class/pwm/pwmchip0
 # echo 0 > export
 # ls
 device  export  npwm  power  pwm0  subsystem  uevent  unexport
 # cd device/driver
 # ls
 bind  e6e31000.pwm  uevent  unbind
 # echo e6e31000.pwm > unbind

[   87.659974] ======================================================
[   87.666149] WARNING: possible circular locking dependency detected
[   87.672327] 5.0.0 #7 Not tainted
[   87.675549] ------------------------------------------------------
[   87.681723] bash/2986 is trying to acquire lock:
[   87.686337] 000000005ea0e178 (kn->count#58){++++}, at: kernfs_remove_by_name_ns+0x50/0xa0
[   87.694528]
[   87.694528] but task is already holding lock:
[   87.700353] 000000006313b17c (pwm_lock){+.+.}, at: pwmchip_remove+0x28/0x13c
[   87.707405]
[   87.707405] which lock already depends on the new lock.
[   87.707405]
[   87.715574]
[   87.715574] the existing dependency chain (in reverse order) is:
[   87.723048]
[   87.723048] -> #1 (pwm_lock){+.+.}:
[   87.728017]        __mutex_lock+0x70/0x7e4
[   87.732108]        mutex_lock_nested+0x1c/0x24
[   87.736547]        pwm_request_from_chip.part.6+0x34/0x74
[   87.741940]        pwm_request_from_chip+0x20/0x40
[   87.746725]        export_store+0x6c/0x1f4
[   87.750820]        dev_attr_store+0x18/0x28
[   87.754998]        sysfs_kf_write+0x54/0x64
[   87.759175]        kernfs_fop_write+0xe4/0x1e8
[   87.763615]        __vfs_write+0x40/0x184
[   87.767619]        vfs_write+0xa8/0x19c
[   87.771448]        ksys_write+0x58/0xbc
[   87.775278]        __arm64_sys_write+0x18/0x20
[   87.779721]        el0_svc_common+0xd0/0x124
[   87.783986]        el0_svc_compat_handler+0x1c/0x24
[   87.788858]        el0_svc_compat+0x8/0x18
[   87.792947]
[   87.792947] -> #0 (kn->count#58){++++}:
[   87.798260]        lock_acquire+0xc4/0x22c
[   87.802353]        __kernfs_remove+0x258/0x2c4
[   87.806790]        kernfs_remove_by_name_ns+0x50/0xa0
[   87.811836]        remove_files.isra.1+0x38/0x78
[   87.816447]        sysfs_remove_group+0x48/0x98
[   87.820971]        sysfs_remove_groups+0x34/0x4c
[   87.825583]        device_remove_attrs+0x6c/0x7c
[   87.830197]        device_del+0x11c/0x33c
[   87.834201]        device_unregister+0x14/0x2c
[   87.838638]        pwmchip_sysfs_unexport+0x40/0x4c
[   87.843509]        pwmchip_remove+0xf4/0x13c
[   87.847773]        rcar_pwm_remove+0x28/0x34
[   87.852039]        platform_drv_remove+0x24/0x64
[   87.856651]        device_release_driver_internal+0x18c/0x21c
[   87.862391]        device_release_driver+0x14/0x1c
[   87.867175]        unbind_store+0xe0/0x124
[   87.871265]        drv_attr_store+0x20/0x30
[   87.875442]        sysfs_kf_write+0x54/0x64
[   87.879618]        kernfs_fop_write+0xe4/0x1e8
[   87.884055]        __vfs_write+0x40/0x184
[   87.888057]        vfs_write+0xa8/0x19c
[   87.891887]        ksys_write+0x58/0xbc
[   87.895716]        __arm64_sys_write+0x18/0x20
[   87.900154]        el0_svc_common+0xd0/0x124
[   87.904417]        el0_svc_compat_handler+0x1c/0x24
[   87.909289]        el0_svc_compat+0x8/0x18
[   87.913378]
[   87.913378] other info that might help us debug this:
[   87.913378]
[   87.921374]  Possible unsafe locking scenario:
[   87.921374]
[   87.927286]        CPU0                    CPU1
[   87.931808]        ----                    ----
[   87.936331]   lock(pwm_lock);
[   87.939293]                                lock(kn->count#58);
[   87.945120]                                lock(pwm_lock);
[   87.950599]   lock(kn->count#58);
[   87.953908]
[   87.953908]  *** DEADLOCK ***
[   87.953908]
[   87.959821] 4 locks held by bash/2986:
[   87.963563]  #0: 00000000ace7bc30 (sb_writers#6){.+.+}, at: vfs_write+0x188/0x19c
[   87.971044]  #1: 00000000287991b2 (&of->mutex){+.+.}, at: kernfs_fop_write+0xb4/0x1e8
[   87.978872]  #2: 00000000f739d016 (&dev->mutex){....}, at: device_release_driver_internal+0x40/0x21c
[   87.988001]  #3: 000000006313b17c (pwm_lock){+.+.}, at: pwmchip_remove+0x28/0x13c
[   87.995481]
[   87.995481] stack backtrace:
[   87.999836] CPU: 0 PID: 2986 Comm: bash Not tainted 5.0.0 #7
[   88.005489] Hardware name: Renesas Salvator-X board based on r8a7795 ES1.x (DT)
[   88.012791] Call trace:
[   88.015235]  dump_backtrace+0x0/0x190
[   88.018891]  show_stack+0x14/0x1c
[   88.022204]  dump_stack+0xb0/0xec
[   88.025514]  print_circular_bug.isra.32+0x1d0/0x2e0
[   88.030385]  __lock_acquire+0x1318/0x1864
[   88.034388]  lock_acquire+0xc4/0x22c
[   88.037958]  __kernfs_remove+0x258/0x2c4
[   88.041874]  kernfs_remove_by_name_ns+0x50/0xa0
[   88.046398]  remove_files.isra.1+0x38/0x78
[   88.050487]  sysfs_remove_group+0x48/0x98
[   88.054490]  sysfs_remove_groups+0x34/0x4c
[   88.058580]  device_remove_attrs+0x6c/0x7c
[   88.062671]  device_del+0x11c/0x33c
[   88.066154]  device_unregister+0x14/0x2c
[   88.070070]  pwmchip_sysfs_unexport+0x40/0x4c
[   88.074421]  pwmchip_remove+0xf4/0x13c
[   88.078163]  rcar_pwm_remove+0x28/0x34
[   88.081906]  platform_drv_remove+0x24/0x64
[   88.085996]  device_release_driver_internal+0x18c/0x21c
[   88.091215]  device_release_driver+0x14/0x1c
[   88.095478]  unbind_store+0xe0/0x124
[   88.099048]  drv_attr_store+0x20/0x30
[   88.102704]  sysfs_kf_write+0x54/0x64
[   88.106359]  kernfs_fop_write+0xe4/0x1e8
[   88.110275]  __vfs_write+0x40/0x184
[   88.113757]  vfs_write+0xa8/0x19c
[   88.117065]  ksys_write+0x58/0xbc
[   88.120374]  __arm64_sys_write+0x18/0x20
[   88.124291]  el0_svc_common+0xd0/0x124
[   88.128034]  el0_svc_compat_handler+0x1c/0x24
[   88.132384]  el0_svc_compat+0x8/0x18

The sysfs unexport in pwmchip_remove() is completely asymmetric
to what we do in pwmchip_add_with_polarity() and commit 0733424
("pwm: Unexport children before chip removal") is a strong indication
that this was wrong to begin with. We should just move
pwmchip_sysfs_unexport() where it belongs, which is right after
pwmchip_sysfs_unexport_children(). In that case, we do not need
separate functions anymore either.

We also really want to remove sysfs irrespective of whether or not
the chip will be removed as a result of pwmchip_remove(). We can only
assume that the driver will be gone after that, so we shouldn't leave
any dangling sysfs files around.

This warning disappears if we move pwmchip_sysfs_unexport() to
the top of pwmchip_remove(), pwmchip_sysfs_unexport_children().
That way it is also outside of the pwm_lock section, which indeed
doesn't seem to be needed.

Moving the pwmchip_sysfs_export() call outside of that section also
seems fine and it'd be perfectly symmetric with pwmchip_remove() again.

So, this patch fixes them.

Signed-off-by: Phong Hoang <[email protected]>
[shimoda: revise the commit log and code]
Fixes: 76abbdd ("pwm: Add sysfs interface")
Fixes: 0733424 ("pwm: Unexport children before chip removal")
Signed-off-by: Yoshihiro Shimoda <[email protected]>
Tested-by: Hoan Nguyen An <[email protected]>
Reviewed-by: Geert Uytterhoeven <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Reviewed-by: Uwe Kleine-König <[email protected]>
Signed-off-by: Thierry Reding <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
numbqq pushed a commit that referenced this pull request Jul 27, 2019
[ Upstream commit 5518424 ]

ifmsh->csa is an RCU-protected pointer. The writer context
in ieee80211_mesh_finish_csa() is already mutually
exclusive with wdev->sdata.mtx, but the RCU checker did
not know this. Use rcu_dereference_protected() to avoid a
warning.

fixes the following warning:

[   12.519089] =============================
[   12.520042] WARNING: suspicious RCU usage
[   12.520652] 5.1.0-rc7-wt+ #16 Tainted: G        W
[   12.521409] -----------------------------
[   12.521972] net/mac80211/mesh.c:1223 suspicious rcu_dereference_check() usage!
[   12.522928] other info that might help us debug this:
[   12.523984] rcu_scheduler_active = 2, debug_locks = 1
[   12.524855] 5 locks held by kworker/u8:2/152:
[   12.525438]  #0: 00000000057be08c ((wq_completion)phy0){+.+.}, at: process_one_work+0x1a2/0x620
[   12.526607]  #1: 0000000059c6b07a ((work_completion)(&sdata->csa_finalize_work)){+.+.}, at: process_one_work+0x1a2/0x620
[   12.528001]  #2: 00000000f184ba7d (&wdev->mtx){+.+.}, at: ieee80211_csa_finalize_work+0x2f/0x90
[   12.529116]  #3: 00000000831a1f54 (&local->mtx){+.+.}, at: ieee80211_csa_finalize_work+0x47/0x90
[   12.530233]  #4: 00000000fd06f988 (&local->chanctx_mtx){+.+.}, at: ieee80211_csa_finalize_work+0x51/0x90

Signed-off-by: Thomas Pedersen <[email protected]>
Signed-off-by: Johannes Berg <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
More lockdep gifts, a 5-way lockup race:

	perf_event_create_kernel_counter()
	  perf_event_alloc()
	    perf_try_init_event()
	      x86_pmu_event_init()
		__x86_pmu_event_init()
		  x86_reserve_hardware()
 #0		    mutex_lock(&pmc_reserve_mutex);
		    reserve_ds_buffer()
 #1		      get_online_cpus()

	perf_event_release_kernel()
	  _free_event()
	    hw_perf_event_destroy()
	      x86_release_hardware()
 #0		mutex_lock(&pmc_reserve_mutex)
		release_ds_buffer()
 #1		  get_online_cpus()

 #1	do_cpu_up()
	  perf_event_init_cpu()
 #2	    mutex_lock(&pmus_lock)
 #3	    mutex_lock(&ctx->mutex)

	sys_perf_event_open()
	  mutex_lock_double()
 #3	    mutex_lock(ctx->mutex)
 #4	    mutex_lock_nested(ctx->mutex, 1);

	perf_try_init_event()
 #4	  mutex_lock_nested(ctx->mutex, 1)
	  x86_pmu_event_init()
	    intel_pmu_hw_config()
	      x86_add_exclusive()
 #0		mutex_lock(&pmc_reserve_mutex)

Fix it by using ordering constructs instead of locking.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
Yonghong Song says:

====================
A kernel page fault which happens in lpm map trie_get_next_key is reported
by syzbot and Eric. The issue was introduced by commit b471f2f
("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map").
Patch #1 fixed the issue in the kernel and patch #2 adds a multithreaded
test case in tools/testing/selftests/bpf/test_lpm_map.
====================

Signed-off-by: Alexei Starovoitov <[email protected]>
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
Heiner reported a lockdep splat [1]

This is caused by attempting GFP_KERNEL allocation while RCU lock is
held and BH blocked.

We believe that addrconf_verify_rtnl() could run for a long period,
so instead of using GFP_ATOMIC here as Ido suggested, we should break
the critical section and restart it after the allocation.

[1]
[86220.125562] =============================
[86220.125586] WARNING: suspicious RCU usage
[86220.125612] 4.15.0-rc7-next-20180110+ #7 Not tainted
[86220.125641] -----------------------------
[86220.125666] kernel/sched/core.c:6026 Illegal context switch in RCU-bh read-side critical section!
[86220.125711]
               other info that might help us debug this:

[86220.125755]
               rcu_scheduler_active = 2, debug_locks = 1
[86220.125792] 4 locks held by kworker/0:2/1003:
[86220.125817]  #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<00000000da8e9b73>] process_one_work+0x1de/0x680
[86220.125895]  #1:  ((addr_chk_work).work){+.+.}, at: [<00000000da8e9b73>] process_one_work+0x1de/0x680
[86220.125959]  #2:  (rtnl_mutex){+.+.}, at: [<00000000b06d9510>] rtnl_lock+0x12/0x20
[86220.126017]  #3:  (rcu_read_lock_bh){....}, at: [<00000000aef52299>] addrconf_verify_rtnl+0x1e/0x510 [ipv6]
[86220.126111]
               stack backtrace:
[86220.126142] CPU: 0 PID: 1003 Comm: kworker/0:2 Not tainted 4.15.0-rc7-next-20180110+ #7
[86220.126185] Hardware name: ZOTAC ZBOX-CI321NANO/ZBOX-CI321NANO, BIOS B246P105 06/01/2015
[86220.126250] Workqueue: ipv6_addrconf addrconf_verify_work [ipv6]
[86220.126288] Call Trace:
[86220.126312]  dump_stack+0x70/0x9e
[86220.126337]  lockdep_rcu_suspicious+0xce/0xf0
[86220.126365]  ___might_sleep+0x1d3/0x240
[86220.126390]  __might_sleep+0x45/0x80
[86220.126416]  kmem_cache_alloc_trace+0x53/0x250
[86220.126458]  ? ipv6_add_addr+0xfe/0x6e0 [ipv6]
[86220.126498]  ipv6_add_addr+0xfe/0x6e0 [ipv6]
[86220.126538]  ipv6_create_tempaddr+0x24d/0x430 [ipv6]
[86220.126580]  ? ipv6_create_tempaddr+0x24d/0x430 [ipv6]
[86220.126623]  addrconf_verify_rtnl+0x339/0x510 [ipv6]
[86220.126664]  ? addrconf_verify_rtnl+0x339/0x510 [ipv6]
[86220.126708]  addrconf_verify_work+0xe/0x20 [ipv6]
[86220.126738]  process_one_work+0x258/0x680
[86220.126765]  worker_thread+0x35/0x3f0
[86220.126790]  kthread+0x124/0x140
[86220.126813]  ? process_one_work+0x680/0x680
[86220.126839]  ? kthread_create_worker_on_cpu+0x40/0x40
[86220.126869]  ? umh_complete+0x40/0x40
[86220.126893]  ? call_usermodehelper_exec_async+0x12a/0x160
[86220.126926]  ret_from_fork+0x4b/0x60
[86220.126999] BUG: sleeping function called from invalid context at mm/slab.h:420
[86220.127041] in_atomic(): 1, irqs_disabled(): 0, pid: 1003, name: kworker/0:2
[86220.127082] 4 locks held by kworker/0:2/1003:
[86220.127107]  #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<00000000da8e9b73>] process_one_work+0x1de/0x680
[86220.127179]  #1:  ((addr_chk_work).work){+.+.}, at: [<00000000da8e9b73>] process_one_work+0x1de/0x680
[86220.127242]  #2:  (rtnl_mutex){+.+.}, at: [<00000000b06d9510>] rtnl_lock+0x12/0x20
[86220.127300]  #3:  (rcu_read_lock_bh){....}, at: [<00000000aef52299>] addrconf_verify_rtnl+0x1e/0x510 [ipv6]
[86220.127414] CPU: 0 PID: 1003 Comm: kworker/0:2 Not tainted 4.15.0-rc7-next-20180110+ #7
[86220.127463] Hardware name: ZOTAC ZBOX-CI321NANO/ZBOX-CI321NANO, BIOS B246P105 06/01/2015
[86220.127528] Workqueue: ipv6_addrconf addrconf_verify_work [ipv6]
[86220.127568] Call Trace:
[86220.127591]  dump_stack+0x70/0x9e
[86220.127616]  ___might_sleep+0x14d/0x240
[86220.127644]  __might_sleep+0x45/0x80
[86220.127672]  kmem_cache_alloc_trace+0x53/0x250
[86220.127717]  ? ipv6_add_addr+0xfe/0x6e0 [ipv6]
[86220.127762]  ipv6_add_addr+0xfe/0x6e0 [ipv6]
[86220.127807]  ipv6_create_tempaddr+0x24d/0x430 [ipv6]
[86220.127854]  ? ipv6_create_tempaddr+0x24d/0x430 [ipv6]
[86220.127903]  addrconf_verify_rtnl+0x339/0x510 [ipv6]
[86220.127950]  ? addrconf_verify_rtnl+0x339/0x510 [ipv6]
[86220.127998]  addrconf_verify_work+0xe/0x20 [ipv6]
[86220.128032]  process_one_work+0x258/0x680
[86220.128063]  worker_thread+0x35/0x3f0
[86220.128091]  kthread+0x124/0x140
[86220.128117]  ? process_one_work+0x680/0x680
[86220.128146]  ? kthread_create_worker_on_cpu+0x40/0x40
[86220.128180]  ? umh_complete+0x40/0x40
[86220.128207]  ? call_usermodehelper_exec_async+0x12a/0x160
[86220.128243]  ret_from_fork+0x4b/0x60

Fixes: f3d9832 ("ipv6: addrconf: cleanup locking in ipv6_add_addr")
Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: Heiner Kallweit <[email protected]>
Reviewed-by: Ido Schimmel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
The following soft lockup was caught. This is a deadlock caused by
recusive locking.

Process kworker/u40:1:28016 was holding spin lock "mbx->queue_lock" in
qlcnic_83xx_mailbox_worker(), while a softirq came in and ask the same spin
lock in qlcnic_83xx_enqueue_mbx_cmd(). This lock should be hold by disable
bh..

[161846.962125] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/u40:1:28016]
[161846.962367] Modules linked in: tun ocfs2 xen_netback xen_blkback xen_gntalloc xen_gntdev xen_evtchn xenfs xen_privcmd autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs bnx2fc fcoe libfcoe libfc sunrpc 8021q mrp garp bridge stp llc bonding dm_round_robin dm_multipath iTCO_wdt iTCO_vendor_support pcspkr sb_edac edac_core i2c_i801 shpchp lpc_ich mfd_core ioatdma ipmi_devintf ipmi_si ipmi_msghandler sg ext4 jbd2 mbcache2 sr_mod cdrom sd_mod igb i2c_algo_bit i2c_core ahci libahci megaraid_sas ixgbe dca ptp pps_core vxlan udp_tunnel ip6_udp_tunnel qla2xxx scsi_transport_fc qlcnic crc32c_intel be2iscsi bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi ipv6 cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi dm_mirror dm_region_hash dm_log dm_mod
[161846.962454]
[161846.962460] CPU: 1 PID: 28016 Comm: kworker/u40:1 Not tainted 4.1.12-94.5.9.el6uek.x86_64 #2
[161846.962463] Hardware name: Oracle Corporation SUN SERVER X4-2L      /ASSY,MB,X4-2L         , BIOS 26050100 09/19/2017
[161846.962489] Workqueue: qlcnic_mailbox qlcnic_83xx_mailbox_worker [qlcnic]
[161846.962493] task: ffff8801f2e34600 ti: ffff88004ca5c000 task.ti: ffff88004ca5c000
[161846.962496] RIP: e030:[<ffffffff810013aa>]  [<ffffffff810013aa>] xen_hypercall_sched_op+0xa/0x20
[161846.962506] RSP: e02b:ffff880202e43388  EFLAGS: 00000206
[161846.962509] RAX: 0000000000000000 RBX: ffff8801f6996b70 RCX: ffffffff810013aa
[161846.962511] RDX: ffff880202e433cc RSI: ffff880202e433b0 RDI: 0000000000000003
[161846.962513] RBP: ffff880202e433d0 R08: 0000000000000000 R09: ffff8801fe893200
[161846.962516] R10: ffff8801fe400538 R11: 0000000000000206 R12: ffff880202e4b000
[161846.962518] R13: 0000000000000050 R14: 0000000000000001 R15: 000000000000020d
[161846.962528] FS:  0000000000000000(0000) GS:ffff880202e40000(0000) knlGS:ffff880202e40000
[161846.962531] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[161846.962533] CR2: 0000000002612640 CR3: 00000001bb796000 CR4: 0000000000042660
[161846.962536] Stack:
[161846.962538]  ffff880202e43608 0000000000000000 ffffffff813f0442 ffff880202e433b0
[161846.962543]  0000000000000000 ffff880202e433cc ffffffff00000001 0000000000000000
[161846.962547]  00000009813f03d6 ffff880202e433e0 ffffffff813f0460 ffff880202e43440
[161846.962552] Call Trace:
[161846.962555]  <IRQ>
[161846.962565]  [<ffffffff813f0442>] ? xen_poll_irq_timeout+0x42/0x50
[161846.962570]  [<ffffffff813f0460>] xen_poll_irq+0x10/0x20
[161846.962578]  [<ffffffff81014222>] xen_lock_spinning+0xe2/0x110
[161846.962583]  [<ffffffff81013f01>] __raw_callee_save_xen_lock_spinning+0x11/0x20
[161846.962592]  [<ffffffff816e5c57>] ? _raw_spin_lock+0x57/0x80
[161846.962609]  [<ffffffffa028acfc>] qlcnic_83xx_enqueue_mbx_cmd+0x7c/0xe0 [qlcnic]
[161846.962623]  [<ffffffffa028e008>] qlcnic_83xx_issue_cmd+0x58/0x210 [qlcnic]
[161846.962636]  [<ffffffffa028caf2>] qlcnic_83xx_sre_macaddr_change+0x162/0x1d0 [qlcnic]
[161846.962649]  [<ffffffffa028cb8b>] qlcnic_83xx_change_l2_filter+0x2b/0x30 [qlcnic]
[161846.962657]  [<ffffffff8160248b>] ? __skb_flow_dissect+0x18b/0x650
[161846.962670]  [<ffffffffa02856e5>] qlcnic_send_filter+0x205/0x250 [qlcnic]
[161846.962682]  [<ffffffffa0285c77>] qlcnic_xmit_frame+0x547/0x7b0 [qlcnic]
[161846.962691]  [<ffffffff8160ac22>] xmit_one+0x82/0x1a0
[161846.962696]  [<ffffffff8160ad90>] dev_hard_start_xmit+0x50/0xa0
[161846.962701]  [<ffffffff81630112>] sch_direct_xmit+0x112/0x220
[161846.962706]  [<ffffffff8160b80f>] __dev_queue_xmit+0x1df/0x5e0
[161846.962710]  [<ffffffff8160bc33>] dev_queue_xmit_sk+0x13/0x20
[161846.962721]  [<ffffffffa0575bd5>] bond_dev_queue_xmit+0x35/0x80 [bonding]
[161846.962729]  [<ffffffffa05769fb>] __bond_start_xmit+0x1cb/0x210 [bonding]
[161846.962736]  [<ffffffffa0576a71>] bond_start_xmit+0x31/0x60 [bonding]
[161846.962740]  [<ffffffff8160ac22>] xmit_one+0x82/0x1a0
[161846.962745]  [<ffffffff8160ad90>] dev_hard_start_xmit+0x50/0xa0
[161846.962749]  [<ffffffff8160bb1e>] __dev_queue_xmit+0x4ee/0x5e0
[161846.962754]  [<ffffffff8160bc33>] dev_queue_xmit_sk+0x13/0x20
[161846.962760]  [<ffffffffa05cfa72>] vlan_dev_hard_start_xmit+0xb2/0x150 [8021q]
[161846.962764]  [<ffffffff8160ac22>] xmit_one+0x82/0x1a0
[161846.962769]  [<ffffffff8160ad90>] dev_hard_start_xmit+0x50/0xa0
[161846.962773]  [<ffffffff8160bb1e>] __dev_queue_xmit+0x4ee/0x5e0
[161846.962777]  [<ffffffff8160bc33>] dev_queue_xmit_sk+0x13/0x20
[161846.962789]  [<ffffffffa05adf74>] br_dev_queue_push_xmit+0x54/0xa0 [bridge]
[161846.962797]  [<ffffffffa05ae4ff>] br_forward_finish+0x2f/0x90 [bridge]
[161846.962807]  [<ffffffff810b0dad>] ? ttwu_do_wakeup+0x1d/0x100
[161846.962811]  [<ffffffff815f929b>] ? __alloc_skb+0x8b/0x1f0
[161846.962818]  [<ffffffffa05ae04d>] __br_forward+0x8d/0x120 [bridge]
[161846.962822]  [<ffffffff815f613b>] ? __kmalloc_reserve+0x3b/0xa0
[161846.962829]  [<ffffffff810be55e>] ? update_rq_runnable_avg+0xee/0x230
[161846.962836]  [<ffffffffa05ae176>] br_forward+0x96/0xb0 [bridge]
[161846.962845]  [<ffffffffa05af85e>] br_handle_frame_finish+0x1ae/0x420 [bridge]
[161846.962853]  [<ffffffffa05afc4f>] br_handle_frame+0x17f/0x260 [bridge]
[161846.962862]  [<ffffffffa05afad0>] ? br_handle_frame_finish+0x420/0x420 [bridge]
[161846.962867]  [<ffffffff8160d057>] __netif_receive_skb_core+0x1f7/0x870
[161846.962872]  [<ffffffff8160d6f2>] __netif_receive_skb+0x22/0x70
[161846.962877]  [<ffffffff8160d913>] netif_receive_skb_internal+0x23/0x90
[161846.962884]  [<ffffffffa07512ea>] ? xenvif_idx_release+0xea/0x100 [xen_netback]
[161846.962889]  [<ffffffff816e5a10>] ? _raw_spin_unlock_irqrestore+0x20/0x50
[161846.962893]  [<ffffffff8160e624>] netif_receive_skb_sk+0x24/0x90
[161846.962899]  [<ffffffffa075269a>] xenvif_tx_submit+0x2ca/0x3f0 [xen_netback]
[161846.962906]  [<ffffffffa0753f0c>] xenvif_tx_action+0x9c/0xd0 [xen_netback]
[161846.962915]  [<ffffffffa07567f5>] xenvif_poll+0x35/0x70 [xen_netback]
[161846.962920]  [<ffffffff8160e01b>] napi_poll+0xcb/0x1e0
[161846.962925]  [<ffffffff8160e1c0>] net_rx_action+0x90/0x1c0
[161846.962931]  [<ffffffff8108aaba>] __do_softirq+0x10a/0x350
[161846.962938]  [<ffffffff8108ae75>] irq_exit+0x125/0x130
[161846.962943]  [<ffffffff813f03a9>] xen_evtchn_do_upcall+0x39/0x50
[161846.962950]  [<ffffffff816e7ffe>] xen_do_hypervisor_callback+0x1e/0x40
[161846.962952]  <EOI>
[161846.962959]  [<ffffffff816e5c4a>] ? _raw_spin_lock+0x4a/0x80
[161846.962964]  [<ffffffff816e5b1e>] ? _raw_spin_lock_irqsave+0x1e/0xa0
[161846.962978]  [<ffffffffa028e279>] ? qlcnic_83xx_mailbox_worker+0xb9/0x2a0 [qlcnic]
[161846.962991]  [<ffffffff810a14e1>] ? process_one_work+0x151/0x4b0
[161846.962995]  [<ffffffff8100c3f2>] ? check_events+0x12/0x20
[161846.963001]  [<ffffffff810a1960>] ? worker_thread+0x120/0x480
[161846.963005]  [<ffffffff816e187b>] ? __schedule+0x30b/0x890
[161846.963010]  [<ffffffff810a1840>] ? process_one_work+0x4b0/0x4b0
[161846.963015]  [<ffffffff810a1840>] ? process_one_work+0x4b0/0x4b0
[161846.963021]  [<ffffffff810a6b3e>] ? kthread+0xce/0xf0
[161846.963025]  [<ffffffff810a6a70>] ? kthread_freezable_should_stop+0x70/0x70
[161846.963031]  [<ffffffff816e6522>] ? ret_from_fork+0x42/0x70
[161846.963035]  [<ffffffff810a6a70>] ? kthread_freezable_should_stop+0x70/0x70
[161846.963037] Code: cc 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc

Signed-off-by: Junxiao Bi <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
…sfers

This bug happens only when the UDC needs to sleep during usb_ep_dequeue,
as is the case for (at least) dwc3.

[  382.200896] BUG: scheduling while atomic: screen/1808/0x00000100
[  382.207124] 4 locks held by screen/1808:
[  382.211266]  #0:  (rcu_callback){....}, at: [<c10b4ff0>] rcu_process_callbacks+0x260/0x440
[  382.219949]  #1:  (rcu_read_lock_sched){....}, at: [<c1358ba0>] percpu_ref_switch_to_atomic_rcu+0xb0/0x130
[  382.230034]  #2:  (&(&ctx->ctx_lock)->rlock){....}, at: [<c11f0c73>] free_ioctx_users+0x23/0xd0
[  382.230096]  #3:  (&(&ffs->eps_lock)->rlock){....}, at: [<f81e7710>] ffs_aio_cancel+0x20/0x60 [usb_f_fs]
[  382.230160] Modules linked in: usb_f_fs libcomposite configfs bnep btsdio bluetooth ecdh_generic brcmfmac brcmutil intel_powerclamp coretemp dwc3 kvm_intel ulpi udc_core kvm irqbypass crc32_pclmul crc32c_intel pcbc dwc3_pci aesni_intel aes_i586 crypto_simd cryptd ehci_pci ehci_hcd gpio_keys usbcore basincove_gpadc industrialio usb_common
[  382.230407] CPU: 1 PID: 1808 Comm: screen Not tainted 4.14.0-edison+ #117
[  382.230416] Hardware name: Intel Corporation Merrifield/BODEGA BAY, BIOS 542 2015.01.21:18.19.48
[  382.230425] Call Trace:
[  382.230438]  <SOFTIRQ>
[  382.230466]  dump_stack+0x47/0x62
[  382.230498]  __schedule_bug+0x61/0x80
[  382.230522]  __schedule+0x43/0x7a0
[  382.230587]  schedule+0x5f/0x70
[  382.230625]  dwc3_gadget_ep_dequeue+0x14c/0x270 [dwc3]
[  382.230669]  ? do_wait_intr_irq+0x70/0x70
[  382.230724]  usb_ep_dequeue+0x19/0x90 [udc_core]
[  382.230770]  ffs_aio_cancel+0x37/0x60 [usb_f_fs]
[  382.230798]  kiocb_cancel+0x31/0x40
[  382.230822]  free_ioctx_users+0x4d/0xd0
[  382.230858]  percpu_ref_switch_to_atomic_rcu+0x10a/0x130
[  382.230881]  ? percpu_ref_exit+0x40/0x40
[  382.230904]  rcu_process_callbacks+0x2b3/0x440
[  382.230965]  __do_softirq+0xf8/0x26b
[  382.231011]  ? __softirqentry_text_start+0x8/0x8
[  382.231033]  do_softirq_own_stack+0x22/0x30
[  382.231042]  </SOFTIRQ>
[  382.231071]  irq_exit+0x45/0xc0
[  382.231089]  smp_apic_timer_interrupt+0x13c/0x150
[  382.231118]  apic_timer_interrupt+0x35/0x3c
[  382.231132] EIP: __copy_user_ll+0xe2/0xf0
[  382.231142] EFLAGS: 00210293 CPU: 1
[  382.231154] EAX: bfd4508c EBX: 00000004 ECX: 00000003 EDX: f3d8fe50
[  382.231165] ESI: f3d8fe51 EDI: bfd4508d EBP: f3d8fe14 ESP: f3d8fe08
[  382.231176]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[  382.231265]  core_sys_select+0x25f/0x320
[  382.231346]  ? __wake_up_common_lock+0x62/0x80
[  382.231399]  ? tty_ldisc_deref+0x13/0x20
[  382.231438]  ? ldsem_up_read+0x1b/0x40
[  382.231459]  ? tty_ldisc_deref+0x13/0x20
[  382.231479]  ? tty_write+0x29f/0x2e0
[  382.231514]  ? n_tty_ioctl+0xe0/0xe0
[  382.231541]  ? tty_write_unlock+0x30/0x30
[  382.231566]  ? __vfs_write+0x22/0x110
[  382.231604]  ? security_file_permission+0x2f/0xd0
[  382.231635]  ? rw_verify_area+0xac/0x120
[  382.231677]  ? vfs_write+0x103/0x180
[  382.231711]  SyS_select+0x87/0xc0
[  382.231739]  ? SyS_write+0x42/0x90
[  382.231781]  do_fast_syscall_32+0xd6/0x1a0
[  382.231836]  entry_SYSENTER_32+0x47/0x71
[  382.231848] EIP: 0xb7f75b05
[  382.231857] EFLAGS: 00000246 CPU: 1
[  382.231868] EAX: ffffffda EBX: 00000400 ECX: bfd4508c EDX: bfd4510c
[  382.231878] ESI: 00000000 EDI: 00000000 EBP: 00000000 ESP: bfd45020
[  382.231889]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
[  382.232281] softirq: huh, entered softirq 9 RCU c10b4d90 with preempt_count 00000100, exited with 00000000?

Tested-by: Sam Protsenko <[email protected]>
Signed-off-by: Vincent Pelletier <[email protected]>
Signed-off-by: Felipe Balbi <[email protected]>
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
This patch avoids that lockdep reports the following:

======================================================
WARNING: possible circular locking dependency detected
4.18.0-rc1 #62 Not tainted
------------------------------------------------------
kswapd0/84 is trying to acquire lock:
00000000c313516d (&xfs_nondir_ilock_class){++++}, at: xfs_free_eofblocks+0xa2/0x1e0

but task is already holding lock:
00000000591c83ae (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (fs_reclaim){+.+.}:
  kmem_cache_alloc+0x2c/0x2b0
  radix_tree_node_alloc.constprop.19+0x3d/0xc0
  __radix_tree_create+0x161/0x1c0
  __radix_tree_insert+0x45/0x210
  dmz_map+0x245/0x2d0 [dm_zoned]
  __map_bio+0x40/0x260
  __split_and_process_non_flush+0x116/0x220
  __split_and_process_bio+0x81/0x180
  __dm_make_request.isra.32+0x5a/0x100
  generic_make_request+0x36e/0x690
  submit_bio+0x6c/0x140
  mpage_readpages+0x19e/0x1f0
  read_pages+0x6d/0x1b0
  __do_page_cache_readahead+0x21b/0x2d0
  force_page_cache_readahead+0xc4/0x100
  generic_file_read_iter+0x7c6/0xd20
  __vfs_read+0x102/0x180
  vfs_read+0x9b/0x140
  ksys_read+0x55/0xc0
  do_syscall_64+0x5a/0x1f0
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

-> #1 (&dmz->chunk_lock){+.+.}:
  dmz_map+0x133/0x2d0 [dm_zoned]
  __map_bio+0x40/0x260
  __split_and_process_non_flush+0x116/0x220
  __split_and_process_bio+0x81/0x180
  __dm_make_request.isra.32+0x5a/0x100
  generic_make_request+0x36e/0x690
  submit_bio+0x6c/0x140
  _xfs_buf_ioapply+0x31c/0x590
  xfs_buf_submit_wait+0x73/0x520
  xfs_buf_read_map+0x134/0x2f0
  xfs_trans_read_buf_map+0xc3/0x580
  xfs_read_agf+0xa5/0x1e0
  xfs_alloc_read_agf+0x59/0x2b0
  xfs_alloc_pagf_init+0x27/0x60
  xfs_bmap_longest_free_extent+0x43/0xb0
  xfs_bmap_btalloc_nullfb+0x7f/0xf0
  xfs_bmap_btalloc+0x428/0x7c0
  xfs_bmapi_write+0x598/0xcc0
  xfs_iomap_write_allocate+0x15a/0x330
  xfs_map_blocks+0x1cf/0x3f0
  xfs_do_writepage+0x15f/0x7b0
  write_cache_pages+0x1ca/0x540
  xfs_vm_writepages+0x65/0xa0
  do_writepages+0x48/0xf0
  __writeback_single_inode+0x58/0x730
  writeback_sb_inodes+0x249/0x5c0
  wb_writeback+0x11e/0x550
  wb_workfn+0xa3/0x670
  process_one_work+0x228/0x670
  worker_thread+0x3c/0x390
  kthread+0x11c/0x140
  ret_from_fork+0x3a/0x50

-> #0 (&xfs_nondir_ilock_class){++++}:
  down_read_nested+0x43/0x70
  xfs_free_eofblocks+0xa2/0x1e0
  xfs_fs_destroy_inode+0xac/0x270
  dispose_list+0x51/0x80
  prune_icache_sb+0x52/0x70
  super_cache_scan+0x127/0x1a0
  shrink_slab.part.47+0x1bd/0x590
  shrink_node+0x3b5/0x470
  balance_pgdat+0x158/0x3b0
  kswapd+0x1ba/0x600
  kthread+0x11c/0x140
  ret_from_fork+0x3a/0x50

other info that might help us debug this:

Chain exists of:
  &xfs_nondir_ilock_class --> &dmz->chunk_lock --> fs_reclaim

Possible unsafe locking scenario:

     CPU0                    CPU1
     ----                    ----
lock(fs_reclaim);
                             lock(&dmz->chunk_lock);
                             lock(fs_reclaim);
lock(&xfs_nondir_ilock_class);

*** DEADLOCK ***

3 locks held by kswapd0/84:
 #0: 00000000591c83ae (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
 #1: 000000000f8208f5 (shrinker_rwsem){++++}, at: shrink_slab.part.47+0x3f/0x590
 #2: 00000000cacefa54 (&type->s_umount_key#43){.+.+}, at: trylock_super+0x16/0x50

stack backtrace:
CPU: 7 PID: 84 Comm: kswapd0 Not tainted 4.18.0-rc1 #62
Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0 12/17/2015
Call Trace:
 dump_stack+0x85/0xcb
 print_circular_bug.isra.36+0x1ce/0x1db
 __lock_acquire+0x124e/0x1310
 lock_acquire+0x9f/0x1f0
 down_read_nested+0x43/0x70
 xfs_free_eofblocks+0xa2/0x1e0
 xfs_fs_destroy_inode+0xac/0x270
 dispose_list+0x51/0x80
 prune_icache_sb+0x52/0x70
 super_cache_scan+0x127/0x1a0
 shrink_slab.part.47+0x1bd/0x590
 shrink_node+0x3b5/0x470
 balance_pgdat+0x158/0x3b0
 kswapd+0x1ba/0x600
 kthread+0x11c/0x140
 ret_from_fork+0x3a/0x50

Reported-by: Masato Suzuki <[email protected]>
Fixes: 4218a95 ("dm zoned: use GFP_NOIO in I/O path")
Cc: <[email protected]>
Signed-off-by: Bart Van Assche <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
This module exposes two USB configurations: a QMI+AT capable setup on
USB config #1 and a MBIM capable setup on USB config #2.

By default the kernel will choose the MBIM capable configuration as
long as the cdc_mbim driver is available. This patch adds support for
the QMI port in the secondary configuration.

Signed-off-by: Aleksander Morgado <[email protected]>
Acked-by: Bjørn Mork <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
This patch avoids that removing a path controlled by the dm-mpath driver
while mkfs is running triggers the following kernel bug:

    kernel BUG at block/blk-core.c:3347!
    invalid opcode: 0000 [#1] PREEMPT SMP KASAN
    CPU: 20 PID: 24369 Comm: mkfs.ext4 Not tainted 4.18.0-rc1-dbg+ #2
    RIP: 0010:blk_end_request_all+0x68/0x70
    Call Trace:
     <IRQ>
     dm_softirq_done+0x326/0x3d0 [dm_mod]
     blk_done_softirq+0x19b/0x1e0
     __do_softirq+0x128/0x60d
     irq_exit+0x100/0x110
     smp_call_function_single_interrupt+0x90/0x330
     call_function_single_interrupt+0xf/0x20
     </IRQ>

Fixes: f9d03f9 ("block: improve handling of the magic discard payload")
Reviewed-by: Ming Lei <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Mike Snitzer <[email protected]>
Signed-off-by: Bart Van Assche <[email protected]>
Cc: Hannes Reinecke <[email protected]>
Cc: Johannes Thumshirn <[email protected]>
Cc: <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
I'm seeing a bunch of debug prints from a user of print_hex_dump_bytes()
in my kernel logs, but I don't have CONFIG_DYNAMIC_DEBUG enabled nor do I
have DEBUG defined in my build.  The problem is that
print_hex_dump_bytes() calls a wrapper function in lib/hexdump.c that
calls print_hex_dump() with KERN_DEBUG level.  There are three cases to
consider here

  1. CONFIG_DYNAMIC_DEBUG=y  --> call dynamic_hex_dum()
  2. CONFIG_DYNAMIC_DEBUG=n && DEBUG --> call print_hex_dump()
  3. CONFIG_DYNAMIC_DEBUG=n && !DEBUG --> stub it out

Right now, that last case isn't detected and we still call
print_hex_dump() from the stub wrapper.

Let's make print_hex_dump_bytes() only call print_hex_dump_debug() so that
it works properly in all cases.

Case #1, print_hex_dump_debug() calls dynamic_hex_dump() and we get same
behavior.  Case #2, print_hex_dump_debug() calls print_hex_dump() with
KERN_DEBUG and we get the same behavior.  Case #3, print_hex_dump_debug()
is a nop, changing behavior to what we want, i.e.  print nothing.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Stephen Boyd <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
After commit a2c11b0 ("kcm: use BPF_PROG_RUN")
syzbot easily triggers the warning in cant_sleep().

As explained in commit 6cab5e9 ("bpf: run bpf programs
with preemption disabled") we need to disable preemption before
running bpf programs.

BUG: assuming atomic context at net/kcm/kcmsock.c:382
in_atomic(): 0, irqs_disabled(): 0, pid: 7, name: kworker/u4:0
3 locks held by kworker/u4:0/7:
 #0: ffff888216726128 ((wq_completion)kstrp){+.+.}, at: __write_once_size include/linux/compiler.h:226 [inline]
 #0: ffff888216726128 ((wq_completion)kstrp){+.+.}, at: arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
 #0: ffff888216726128 ((wq_completion)kstrp){+.+.}, at: atomic64_set include/asm-generic/atomic-instrumented.h:855 [inline]
 #0: ffff888216726128 ((wq_completion)kstrp){+.+.}, at: atomic_long_set include/asm-generic/atomic-long.h:40 [inline]
 #0: ffff888216726128 ((wq_completion)kstrp){+.+.}, at: set_work_data kernel/workqueue.c:620 [inline]
 #0: ffff888216726128 ((wq_completion)kstrp){+.+.}, at: set_work_pool_and_clear_pending kernel/workqueue.c:647 [inline]
 #0: ffff888216726128 ((wq_completion)kstrp){+.+.}, at: process_one_work+0x88b/0x1740 kernel/workqueue.c:2240
 #1: ffff8880a989fdc0 ((work_completion)(&strp->work)){+.+.}, at: process_one_work+0x8c1/0x1740 kernel/workqueue.c:2244
 #2: ffff888098998d10 (sk_lock-AF_INET){+.+.}, at: lock_sock include/net/sock.h:1522 [inline]
 #2: ffff888098998d10 (sk_lock-AF_INET){+.+.}, at: strp_sock_lock+0x2e/0x40 net/strparser/strparser.c:440
CPU: 0 PID: 7 Comm: kworker/u4:0 Not tainted 5.3.0+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: kstrp strp_work
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 __cant_sleep kernel/sched/core.c:6826 [inline]
 __cant_sleep.cold+0xa4/0xbc kernel/sched/core.c:6803
 kcm_parse_func_strparser+0x54/0x200 net/kcm/kcmsock.c:382
 __strp_recv+0x5dc/0x1b20 net/strparser/strparser.c:221
 strp_recv+0xcf/0x10b net/strparser/strparser.c:343
 tcp_read_sock+0x285/0xa00 net/ipv4/tcp.c:1639
 strp_read_sock+0x14d/0x200 net/strparser/strparser.c:366
 do_strp_work net/strparser/strparser.c:414 [inline]
 strp_work+0xe3/0x130 net/strparser/strparser.c:423
 process_one_work+0x9af/0x1740 kernel/workqueue.c:2269

Fixes: a2c11b0 ("kcm: use BPF_PROG_RUN")
Fixes: 6cab5e9 ("bpf: run bpf programs with preemption disabled")
Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: syzbot <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
qdisc_root() use from netem_enqueue() triggers a lockdep warning.

__dev_queue_xmit() uses rcu_read_lock_bh() which is
not equivalent to rcu_read_lock() + local_bh_disable_bh as far
as lockdep is concerned.

WARNING: suspicious RCU usage
5.3.0-rc7+ #0 Not tainted
-----------------------------
include/net/sch_generic.h:492 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
3 locks held by syz-executor427/8855:
 #0: 00000000b5525c01 (rcu_read_lock_bh){....}, at: lwtunnel_xmit_redirect include/net/lwtunnel.h:92 [inline]
 #0: 00000000b5525c01 (rcu_read_lock_bh){....}, at: ip_finish_output2+0x2dc/0x2570 net/ipv4/ip_output.c:214
 #1: 00000000b5525c01 (rcu_read_lock_bh){....}, at: __dev_queue_xmit+0x20a/0x3650 net/core/dev.c:3804
 #2: 00000000364bae92 (&(&sch->q.lock)->rlock){+.-.}, at: spin_lock include/linux/spinlock.h:338 [inline]
 #2: 00000000364bae92 (&(&sch->q.lock)->rlock){+.-.}, at: __dev_xmit_skb net/core/dev.c:3502 [inline]
 #2: 00000000364bae92 (&(&sch->q.lock)->rlock){+.-.}, at: __dev_queue_xmit+0x14b8/0x3650 net/core/dev.c:3838

stack backtrace:
CPU: 0 PID: 8855 Comm: syz-executor427 Not tainted 5.3.0-rc7+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 lockdep_rcu_suspicious+0x153/0x15d kernel/locking/lockdep.c:5357
 qdisc_root include/net/sch_generic.h:492 [inline]
 netem_enqueue+0x1cfb/0x2d80 net/sched/sch_netem.c:479
 __dev_xmit_skb net/core/dev.c:3527 [inline]
 __dev_queue_xmit+0x15d2/0x3650 net/core/dev.c:3838
 dev_queue_xmit+0x18/0x20 net/core/dev.c:3902
 neigh_hh_output include/net/neighbour.h:500 [inline]
 neigh_output include/net/neighbour.h:509 [inline]
 ip_finish_output2+0x1726/0x2570 net/ipv4/ip_output.c:228
 __ip_finish_output net/ipv4/ip_output.c:308 [inline]
 __ip_finish_output+0x5fc/0xb90 net/ipv4/ip_output.c:290
 ip_finish_output+0x38/0x1f0 net/ipv4/ip_output.c:318
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip_mc_output+0x292/0xf40 net/ipv4/ip_output.c:417
 dst_output include/net/dst.h:436 [inline]
 ip_local_out+0xbb/0x190 net/ipv4/ip_output.c:125
 ip_send_skb+0x42/0xf0 net/ipv4/ip_output.c:1555
 udp_send_skb.isra.0+0x6b2/0x1160 net/ipv4/udp.c:887
 udp_sendmsg+0x1e96/0x2820 net/ipv4/udp.c:1174
 inet_sendmsg+0x9e/0xe0 net/ipv4/af_inet.c:807
 sock_sendmsg_nosec net/socket.c:637 [inline]
 sock_sendmsg+0xd7/0x130 net/socket.c:657
 ___sys_sendmsg+0x3e2/0x920 net/socket.c:2311
 __sys_sendmmsg+0x1bf/0x4d0 net/socket.c:2413
 __do_sys_sendmmsg net/socket.c:2442 [inline]
 __se_sys_sendmmsg net/socket.c:2439 [inline]
 __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2439
 do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: syzbot <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
When a cpu requests broadcasting, before starting the tick broadcast
hrtimer, bc_set_next() checks if the timer callback (bc_handler) is active
using hrtimer_try_to_cancel(). But hrtimer_try_to_cancel() does not provide
the required synchronization when the callback is active on other core.

The callback could have already executed tick_handle_oneshot_broadcast()
and could have also returned. But still there is a small time window where
the hrtimer_try_to_cancel() returns -1. In that case bc_set_next() returns
without doing anything, but the next_event of the tick broadcast clock
device is already set to a timeout value.

In the race condition diagram below, CPU #1 is running the timer callback
and CPU #2 is entering idle state and so calls bc_set_next().

In the worst case, the next_event will contain an expiry time, but the
hrtimer will not be started which happens when the racing callback returns
HRTIMER_NORESTART. The hrtimer might never recover if all further requests
from the CPUs to subscribe to tick broadcast have timeout greater than the
next_event of tick broadcast clock device. This leads to cascading of
failures and finally noticed as rcu stall warnings

Here is a depiction of the race condition

CPU #1 (Running timer callback)                   CPU #2 (Enter idle
                                                  and subscribe to
                                                  tick broadcast)
---------------------                             ---------------------

__run_hrtimer()                                   tick_broadcast_enter()

  bc_handler()                                      __tick_broadcast_oneshot_control()

    tick_handle_oneshot_broadcast()

      raw_spin_lock(&tick_broadcast_lock);

      dev->next_event = KTIME_MAX;                  //wait for tick_broadcast_lock
      //next_event for tick broadcast clock
      set to KTIME_MAX since no other cores
      subscribed to tick broadcasting

      raw_spin_unlock(&tick_broadcast_lock);

    if (dev->next_event == KTIME_MAX)
      return HRTIMER_NORESTART
    // callback function exits without
       restarting the hrtimer                      //tick_broadcast_lock acquired
                                                   raw_spin_lock(&tick_broadcast_lock);

                                                   tick_broadcast_set_event()

                                                     clockevents_program_event()

                                                       dev->next_event = expires;

                                                       bc_set_next()

                                                         hrtimer_try_to_cancel()
                                                         //returns -1 since the timer
                                                         callback is active. Exits without
                                                         restarting the timer
  cpu_base->running = NULL;

The comment that hrtimer cannot be armed from within the callback is
wrong. It is fine to start the hrtimer from within the callback. Also it is
safe to start the hrtimer from the enter/exit idle code while the broadcast
handler is active. The enter/exit idle code and the broadcast handler are
synchronized using tick_broadcast_lock. So there is no need for the
existing try to cancel logic. All this can be removed which will eliminate
the race condition as well.

Fixes: 5d1638a ("tick: Introduce hrtimer based broadcast")
Originally-by: Thomas Gleixner <[email protected]>
Signed-off-by: Balasubramani Vivekanandan <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
Ido Schimmel says:

====================
mlxsw: Various fixes

This patchset includes two small fixes for the mlxsw driver and one
patch which clarifies recently introduced devlink-trap documentation.

Patch #1 clears the port's VLAN filters during port initialization. This
ensures that the drop reason reported to the user is consistent. The
problem is explained in detail in the commit message.

Patch #2 clarifies the description of one of the traps exposed via
devlink-trap.

Patch #3 from Danielle forbids the installation of a tc filter with
multiple mirror actions since this is not supported by the device. The
failure is communicated to the user via extack.
====================

Signed-off-by: David S. Miller <[email protected]>
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
This patch fixes the lock inversion complaint:

============================================
WARNING: possible recursive locking detected
5.3.0-rc7-dbg+ #1 Not tainted
--------------------------------------------
kworker/u16:6/171 is trying to acquire lock:
00000000035c6e6c (&id_priv->handler_mutex){+.+.}, at: rdma_destroy_id+0x78/0x4a0 [rdma_cm]

but task is already holding lock:
00000000bc7c307d (&id_priv->handler_mutex){+.+.}, at: iw_conn_req_handler+0x151/0x680 [rdma_cm]

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&id_priv->handler_mutex);
  lock(&id_priv->handler_mutex);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

3 locks held by kworker/u16:6/171:
 #0: 00000000e2eaa773 ((wq_completion)iw_cm_wq){+.+.}, at: process_one_work+0x472/0xac0
 #1: 000000001efd357b ((work_completion)(&work->work)#3){+.+.}, at: process_one_work+0x476/0xac0
 #2: 00000000bc7c307d (&id_priv->handler_mutex){+.+.}, at: iw_conn_req_handler+0x151/0x680 [rdma_cm]

stack backtrace:
CPU: 3 PID: 171 Comm: kworker/u16:6 Not tainted 5.3.0-rc7-dbg+ #1
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Workqueue: iw_cm_wq cm_work_handler [iw_cm]
Call Trace:
 dump_stack+0x8a/0xd6
 __lock_acquire.cold+0xe1/0x24d
 lock_acquire+0x106/0x240
 __mutex_lock+0x12e/0xcb0
 mutex_lock_nested+0x1f/0x30
 rdma_destroy_id+0x78/0x4a0 [rdma_cm]
 iw_conn_req_handler+0x5c9/0x680 [rdma_cm]
 cm_work_handler+0xe62/0x1100 [iw_cm]
 process_one_work+0x56d/0xac0
 worker_thread+0x7a/0x5d0
 kthread+0x1bc/0x210
 ret_from_fork+0x24/0x30

This is not a bug as there are actually two lock classes here.

Link: https://lore.kernel.org/r/[email protected]
Fixes: de910bd ("RDMA/cma: Simplify locking needed for serialization of callbacks")
Signed-off-by: Bart Van Assche <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
A user reported a lockdep splat

 ======================================================
 WARNING: possible circular locking dependency detected
 5.2.11-gentoo #2 Not tainted
 ------------------------------------------------------
 kswapd0/711 is trying to acquire lock:
 000000007777a663 (sb_internal){.+.+}, at: start_transaction+0x3a8/0x500

but task is already holding lock:
 000000000ba86300 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x30

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (fs_reclaim){+.+.}:
 kmem_cache_alloc+0x1f/0x1c0
 btrfs_alloc_inode+0x1f/0x260
 alloc_inode+0x16/0xa0
 new_inode+0xe/0xb0
 btrfs_new_inode+0x70/0x610
 btrfs_symlink+0xd0/0x420
 vfs_symlink+0x9c/0x100
 do_symlinkat+0x66/0xe0
 do_syscall_64+0x55/0x1c0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

-> #0 (sb_internal){.+.+}:
 __sb_start_write+0xf6/0x150
 start_transaction+0x3a8/0x500
 btrfs_commit_inode_delayed_inode+0x59/0x110
 btrfs_evict_inode+0x19e/0x4c0
 evict+0xbc/0x1f0
 inode_lru_isolate+0x113/0x190
 __list_lru_walk_one.isra.4+0x5c/0x100
 list_lru_walk_one+0x32/0x50
 prune_icache_sb+0x36/0x80
 super_cache_scan+0x14a/0x1d0
 do_shrink_slab+0x131/0x320
 shrink_node+0xf7/0x380
 balance_pgdat+0x2d5/0x640
 kswapd+0x2ba/0x5e0
 kthread+0x147/0x160
 ret_from_fork+0x24/0x30

other info that might help us debug this:

 Possible unsafe locking scenario:

 CPU0 CPU1
 ---- ----
 lock(fs_reclaim);
 lock(sb_internal);
 lock(fs_reclaim);
 lock(sb_internal);
*** DEADLOCK ***

 3 locks held by kswapd0/711:
 #0: 000000000ba86300 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x30
 #1: 000000004a5100f8 (shrinker_rwsem){++++}, at: shrink_node+0x9a/0x380
 #2: 00000000f956fa46 (&type->s_umount_key#30){++++}, at: super_cache_scan+0x35/0x1d0

stack backtrace:
 CPU: 7 PID: 711 Comm: kswapd0 Not tainted 5.2.11-gentoo #2
 Hardware name: Dell Inc. Precision Tower 3620/0MWYPT, BIOS 2.4.2 09/29/2017
 Call Trace:
 dump_stack+0x85/0xc7
 print_circular_bug.cold.40+0x1d9/0x235
 __lock_acquire+0x18b1/0x1f00
 lock_acquire+0xa6/0x170
 ? start_transaction+0x3a8/0x500
 __sb_start_write+0xf6/0x150
 ? start_transaction+0x3a8/0x500
 start_transaction+0x3a8/0x500
 btrfs_commit_inode_delayed_inode+0x59/0x110
 btrfs_evict_inode+0x19e/0x4c0
 ? var_wake_function+0x20/0x20
 evict+0xbc/0x1f0
 inode_lru_isolate+0x113/0x190
 ? discard_new_inode+0xc0/0xc0
 __list_lru_walk_one.isra.4+0x5c/0x100
 ? discard_new_inode+0xc0/0xc0
 list_lru_walk_one+0x32/0x50
 prune_icache_sb+0x36/0x80
 super_cache_scan+0x14a/0x1d0
 do_shrink_slab+0x131/0x320
 shrink_node+0xf7/0x380
 balance_pgdat+0x2d5/0x640
 kswapd+0x2ba/0x5e0
 ? __wake_up_common_lock+0x90/0x90
 kthread+0x147/0x160
 ? balance_pgdat+0x640/0x640
 ? __kthread_create_on_node+0x160/0x160
 ret_from_fork+0x24/0x30

This is because btrfs_new_inode() calls new_inode() under the
transaction.  We could probably move the new_inode() outside of this but
for now just wrap it in memalloc_nofs_save().

Reported-by: Zdenek Sojka <[email protected]>
Fixes: 712e36c ("btrfs: use GFP_KERNEL in btrfs_alloc_inode")
CC: [email protected] # 4.16+
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
otherwise will cause init boot failure reported like this:

[    1.959344] Run /init as init process
[    1.966119] tmpfs: Unknown parameter 'mode'
[    1.971092] tmpfs: Unknown parameter 'mode'
[    1.975426] tmpfs: Unknown parameter 'mode'
[    1.979709] tmpfs: Unknown parameter 'mode'
[    1.984298] init: mount("tmpfs", "/dev", "tmpfs", MS_NOSUID, "mode=0755") failed Invalid argument
[    1.993214] init: mount("tmpfs", "/mnt", "tmpfs", MS_NOEXEC | MS_NOSUID | MS_NODEV, "mode=0755,uid=0,gid=1000") failed Invalid argument
[    2.005417] init: mount("tmpfs", "/apex", "tmpfs", MS_NOEXEC | MS_NOSUID | MS_NODEV, "mode=0755,uid=0,gid=0") failed Invalid argument
[    2.017435] init: mount("tmpfs", "/debug_ramdisk", "tmpfs", MS_NOEXEC | MS_NOSUID | MS_NODEV, "mode=0755,uid=0,gid=0") failed Invalid argument
[    2.030276] init: Init encountered errors starting first stage, aborting
[    2.057074] init: #00 pc 00000000002eeac8  /init (UnwindStackCurrent::UnwindFromContext(unsigned long, void*)+96)
[    2.067403] init: #1 pc 00000000002730c4  /init (android::init::InitFatalReboot()+92)
[    2.075341] init: #2 pc 00000000002754f4  /init (android::init::InitAborter(char const*)+44)
[    2.083887] init: #3 pc 00000000002ae4f8  /init (android::base::LogMessage::~LogMessage()+608)
[    2.092605] init: #4 pc 000000000026aba8  /init (android::init::FirstStageMain(int, char**)+4400)
[    2.101588] init: #5 pc 00000000003599e4  /init (__real_libc_init(void*, void (*)(), int (*)(int, char**, char**), structors_array_t const*, bionic_tcb*)+572)
[    2.115863] init: Reboot ending, jumping to kernel
[    2.121709] reboot: Restarting system with command 'bootloader'

Tested: with hikey

Change-Id: I607856cbcc5bfdaf304b13089c07ca0cf10c2f76
Signed-off-by: Yongqin Liu <[email protected]>
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
… set

We received an error report that perf-record caused 'Segmentation fault'
on a newly system (e.g. on the new installed ubuntu).

  (gdb) backtrace
  #0  __read_once_size (size=4, res=<synthetic pointer>, p=0x14) at /root/0-jinyao/acme/tools/include/linux/compiler.h:139
  #1  atomic_read (v=0x14) at /root/0-jinyao/acme/tools/include/asm/../../arch/x86/include/asm/atomic.h:28
  #2  refcount_read (r=0x14) at /root/0-jinyao/acme/tools/include/linux/refcount.h:65
  #3  perf_mmap__read_init (map=map@entry=0x0) at mmap.c:177
  #4  0x0000561ce5c0de39 in perf_evlist__poll_thread (arg=0x561ce68584d0) at util/sideband_evlist.c:62
  #5  0x00007fad78491609 in start_thread (arg=<optimized out>) at pthread_create.c:477
  #6  0x00007fad7823c103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The root cause is, evlist__add_bpf_sb_event() just returns 0 if
HAVE_LIBBPF_SUPPORT is not defined (inline function path). So it will
not create a valid evsel for side-band event.

But perf-record still creates BPF side band thread to process the
side-band event, then the error happpens.

We can reproduce this issue by removing the libelf-dev. e.g.
1. apt-get remove libelf-dev
2. perf record -a -- sleep 1

  root@test:~# ./perf record -a -- sleep 1
  perf: Segmentation fault
  Obtained 6 stack frames.
  ./perf(+0x28eee8) [0x5562d6ef6ee8]
  /lib/x86_64-linux-gnu/libc.so.6(+0x46210) [0x7fbfdc65f210]
  ./perf(+0x342e74) [0x5562d6faae74]
  ./perf(+0x257e39) [0x5562d6ebfe39]
  /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7fbfdc990609]
  /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fbfdc73b103]
  Segmentation fault (core dumped)

To fix this issue,

1. We either install the missing libraries to let HAVE_LIBBPF_SUPPORT
   be defined.
   e.g. apt-get install libelf-dev and install other related libraries.

2. Use this patch to skip the side-band event setup if HAVE_LIBBPF_SUPPORT
   is not set.

Committer notes:

The side band thread is not used just with BPF, it is also used with
--switch-output-event, so narrow the ifdef to the BPF specific part.

Fixes: 23cbb41 ("perf record: Move side band evlist setup to separate routine")
Signed-off-by: Jin Yao <[email protected]>
Acked-by: Jiri Olsa <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Jin Yao <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
KVM/arm64 fixes for 5.11, take #2

- Don't allow tagged pointers to point to memslots
- Filter out ARMv8.1+ PMU events on v8.0 hardware
- Hide PMU registers from userspace when no PMU is configured
- More PMU cleanups
- Don't try to handle broken PSCI firmware
- More sys_reg() to reg_to_encoding() conversions

Signed-off-by: Marc Zyngier <[email protected]>
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
This driver, like several others, uses a chained IRQ for each GPIO bank,
and forwards .irq_set_wake to the GPIO bank's upstream IRQ. As a result,
a call to irq_set_irq_wake() needs to lock both the upstream and
downstream irq_desc's. Lockdep considers this to be a possible deadlock
when the irq_desc's share lockdep classes, which they do by default:

 ============================================
 WARNING: possible recursive locking detected
 5.17.0-rc3-00394-gc849047c2473 #1 Not tainted
 --------------------------------------------
 init/307 is trying to acquire lock:
 c2dfe27c (&irq_desc_lock_class){-.-.}-{2:2}, at: __irq_get_desc_lock+0x58/0xa0

 but task is already holding lock:
 c3c0ac7c (&irq_desc_lock_class){-.-.}-{2:2}, at: __irq_get_desc_lock+0x58/0xa0

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&irq_desc_lock_class);
   lock(&irq_desc_lock_class);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 4 locks held by init/307:
  #0: c1f29f18 (system_transition_mutex){+.+.}-{3:3}, at: __do_sys_reboot+0x90/0x23c
  #1: c20f7760 (&dev->mutex){....}-{3:3}, at: device_shutdown+0xf4/0x224
  #2: c2e804d8 (&dev->mutex){....}-{3:3}, at: device_shutdown+0x104/0x224
  #3: c3c0ac7c (&irq_desc_lock_class){-.-.}-{2:2}, at: __irq_get_desc_lock+0x58/0xa0

 stack backtrace:
 CPU: 0 PID: 307 Comm: init Not tainted 5.17.0-rc3-00394-gc849047c2473 #1
 Hardware name: Allwinner sun8i Family
  unwind_backtrace from show_stack+0x10/0x14
  show_stack from dump_stack_lvl+0x68/0x90
  dump_stack_lvl from __lock_acquire+0x1680/0x31a0
  __lock_acquire from lock_acquire+0x148/0x3dc
  lock_acquire from _raw_spin_lock_irqsave+0x50/0x6c
  _raw_spin_lock_irqsave from __irq_get_desc_lock+0x58/0xa0
  __irq_get_desc_lock from irq_set_irq_wake+0x2c/0x19c
  irq_set_irq_wake from irq_set_irq_wake+0x13c/0x19c
    [tail call from sunxi_pinctrl_irq_set_wake]
  irq_set_irq_wake from gpio_keys_suspend+0x80/0x1a4
  gpio_keys_suspend from gpio_keys_shutdown+0x10/0x2c
  gpio_keys_shutdown from device_shutdown+0x180/0x224
  device_shutdown from __do_sys_reboot+0x134/0x23c
  __do_sys_reboot from ret_fast_syscall+0x0/0x1c

However, this can never deadlock because the upstream and downstream
IRQs are never the same (nor do they even involve the same irqchip).

Silence this erroneous lockdep splat by applying what appears to be the
usual fix of moving the GPIO IRQs to separate lockdep classes.

Fixes: a59c99d ("pinctrl: sunxi: Forward calls to irq_set_irq_wake")
Reported-by: Guenter Roeck <[email protected]>
Signed-off-by: Samuel Holland <[email protected]>
Reviewed-by: Jernej Skrabec <[email protected]>
Tested-by: Guenter Roeck <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Linus Walleij <[email protected]>
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
in tunnel mode, if outer interface(ipv4) is less, it is easily to let
inner IPV6 mtu be less than 1280. If so, a Packet Too Big ICMPV6 message
is received. When send again, packets are fragmentized with 1280, they
are still rejected with ICMPV6(Packet Too Big) by xfrmi_xmit2().

According to RFC4213 Section3.2.2:
if (IPv4 path MTU - 20) is less than 1280
	if packet is larger than 1280 bytes
		Send ICMPv6 "packet too big" with MTU=1280
                Drop packet
        else
		Encapsulate but do not set the Don't Fragment
                flag in the IPv4 header.  The resulting IPv4
                packet might be fragmented by the IPv4 layer
                on the encapsulator or by some router along
                the IPv4 path.
	endif
else
	if packet is larger than (IPv4 path MTU - 20)
        	Send ICMPv6 "packet too big" with
                MTU = (IPv4 path MTU - 20).
                Drop packet.
        else
                Encapsulate and set the Don't Fragment flag
                in the IPv4 header.
        endif
endif
Packets should be fragmentized with ipv4 outer interface, so change it.

After it is fragemtized with ipv4, there will be double fragmenation.
No.48 & No.51 are ipv6 fragment packets, No.48 is double fragmentized,
then tunneled with IPv4(No.49& No.50), which obey spec. And received peer
cannot decrypt it rightly.

48              2002::10        2002::11 1296(length) IPv6 fragment (off=0 more=y ident=0xa20da5bc nxt=50)
49   0x0000 (0) 2002::10        2002::11 1304         IPv6 fragment (off=0 more=y ident=0x7448042c nxt=44)
50   0x0000 (0) 2002::10        2002::11 200          ESP (SPI=0x00035000)
51              2002::10        2002::11 180          Echo (ping) request
52   0x56dc     2002::10        2002::11 248          IPv6 fragment (off=1232 more=n ident=0xa20da5bc nxt=50)

xfrm6_noneed_fragment has fixed above issues. Finally, it acted like below:
1   0x6206 192.168.1.138   192.168.1.1 1316 Fragmented IP protocol (proto=Encap Security Payload 50, off=0, ID=6206) [Reassembled in #2]
2   0x6206 2002::10        2002::11    88   IPv6 fragment (off=0 more=y ident=0x1f440778 nxt=50)
3   0x0000 2002::10        2002::11    248  ICMPv6    Echo (ping) request

Signed-off-by: Lina Wang <[email protected]>
Signed-off-by: Steffen Klassert <[email protected]>
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
Ido Schimmel says:

====================
selftests: mlxsw: A couple of fixes

Patch #1 fixes a breakage due to a change in iproute2 output. The real
problem is not iproute2, but the fact that the check was not strict
enough. Fixed by using JSON output instead. Targeting at net so that the
test will pass as part of old and new kernels regardless of iproute2
version.

Patch #2 fixes an issue uncovered by the first one.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
In many cases, keyctl_pkey_params_get_2() is validating the user buffer
lengths against the wrong algorithm properties.  Fix it to check against
the correct properties.

Probably this wasn't noticed before because for all asymmetric keys of
the "public_key" subtype, max_data_size == max_sig_size == max_enc_size
== max_dec_size.  However, this isn't necessarily true for the
"asym_tpm" subtype (it should be, but it's not strictly validated).  Of
course, future key types could have different values as well.

Fixes: 00d60fd ("KEYS: Provide keyctls to drive the new key type ops for asymmetric keys [ver #2]")
Cc: <[email protected]> # v4.20+
Signed-off-by: Eric Biggers <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
Signed-off-by: Jarkko Sakkinen <[email protected]>
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
The following sequence of operations results in a refcount warning:

1. Open device /dev/tpmrm.
2. Remove module tpm_tis_spi.
3. Write a TPM command to the file descriptor opened at step 1.

------------[ cut here ]------------
WARNING: CPU: 3 PID: 1161 at lib/refcount.c:25 kobject_get+0xa0/0xa4
refcount_t: addition on 0; use-after-free.
Modules linked in: tpm_tis_spi tpm_tis_core tpm mdio_bcm_unimac brcmfmac
sha256_generic libsha256 sha256_arm hci_uart btbcm bluetooth cfg80211 vc4
brcmutil ecdh_generic ecc snd_soc_core crc32_arm_ce libaes
raspberrypi_hwmon ac97_bus snd_pcm_dmaengine bcm2711_thermal snd_pcm
snd_timer genet snd phy_generic soundcore [last unloaded: spi_bcm2835]
CPU: 3 PID: 1161 Comm: hold_open Not tainted 5.10.0ls-main-dirty #2
Hardware name: BCM2711
[<c0410c3c>] (unwind_backtrace) from [<c040b580>] (show_stack+0x10/0x14)
[<c040b580>] (show_stack) from [<c1092174>] (dump_stack+0xc4/0xd8)
[<c1092174>] (dump_stack) from [<c0445a30>] (__warn+0x104/0x108)
[<c0445a30>] (__warn) from [<c0445aa8>] (warn_slowpath_fmt+0x74/0xb8)
[<c0445aa8>] (warn_slowpath_fmt) from [<c08435d0>] (kobject_get+0xa0/0xa4)
[<c08435d0>] (kobject_get) from [<bf0a715c>] (tpm_try_get_ops+0x14/0x54 [tpm])
[<bf0a715c>] (tpm_try_get_ops [tpm]) from [<bf0a7d6c>] (tpm_common_write+0x38/0x60 [tpm])
[<bf0a7d6c>] (tpm_common_write [tpm]) from [<c05a7ac0>] (vfs_write+0xc4/0x3c0)
[<c05a7ac0>] (vfs_write) from [<c05a7ee4>] (ksys_write+0x58/0xcc)
[<c05a7ee4>] (ksys_write) from [<c04001a0>] (ret_fast_syscall+0x0/0x4c)
Exception stack(0xc226bfa8 to 0xc226bff0)
bfa0:                   00000000 000105b4 00000003 beafe664 00000014 00000000
bfc0: 00000000 000105b4 000103f8 00000004 00000000 00000000 b6f9c000 beafe684
bfe0: 0000006c beafe648 0001056c b6eb6944
---[ end trace d4b8409def9b8b1f ]---

The reason for this warning is the attempt to get the chip->dev reference
in tpm_common_write() although the reference counter is already zero.

Since commit 8979b02 ("tpm: Fix reference count to main device") the
extra reference used to prevent a premature zero counter is never taken,
because the required TPM_CHIP_FLAG_TPM2 flag is never set.

Fix this by moving the TPM 2 character device handling from
tpm_chip_alloc() to tpm_add_char_device() which is called at a later point
in time when the flag has been set in case of TPM2.

Commit fdc915f ("tpm: expose spaces via a device link /dev/tpmrm<n>")
already introduced function tpm_devs_release() to release the extra
reference but did not implement the required put on chip->devs that results
in the call of this function.

Fix this by putting chip->devs in tpm_chip_unregister().

Finally move the new implementation for the TPM 2 handling into a new
function to avoid multiple checks for the TPM_CHIP_FLAG_TPM2 flag in the
good case and error cases.

Cc: [email protected]
Fixes: fdc915f ("tpm: expose spaces via a device link /dev/tpmrm<n>")
Fixes: 8979b02 ("tpm: Fix reference count to main device")
Co-developed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Lino Sanfilippo <[email protected]>
Tested-by: Stefan Berger <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
Signed-off-by: Jarkko Sakkinen <[email protected]>
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
When logging an inode in full mode, or when logging xattrs or when logging
the dir index items of a directory, we are modifying the log tree while
holding a read lock on a leaf from the fs/subvolume tree. This can lead to
a deadlock in rare circumstances, but it is a real possibility, and it was
recently reported by syzbot with the following trace from lockdep:

   WARNING: possible circular locking dependency detected
   6.1.0-rc5-next-20221116-syzkaller #0 Not tainted
   ------------------------------------------------------
   syz-executor.1/16154 is trying to acquire lock:
   ffff88807e3084a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0xa1/0xf30 fs/btrfs/delayed-inode.c:256

   but task is already holding lock:
   ffff88807df33078 (btrfs-log-00){++++}-{3:3}, at: __btrfs_tree_lock+0x32/0x3d0 fs/btrfs/locking.c:197

   which lock already depends on the new lock.

   the existing dependency chain (in reverse order) is:

   -> #2 (btrfs-log-00){++++}-{3:3}:
          down_read_nested+0x9e/0x450 kernel/locking/rwsem.c:1634
          __btrfs_tree_read_lock+0x32/0x350 fs/btrfs/locking.c:135
          btrfs_tree_read_lock fs/btrfs/locking.c:141 [inline]
          btrfs_read_lock_root_node+0x82/0x3a0 fs/btrfs/locking.c:280
          btrfs_search_slot_get_root fs/btrfs/ctree.c:1678 [inline]
          btrfs_search_slot+0x3ca/0x2c70 fs/btrfs/ctree.c:1998
          btrfs_lookup_csum+0x116/0x3f0 fs/btrfs/file-item.c:209
          btrfs_csum_file_blocks+0x40e/0x1370 fs/btrfs/file-item.c:1021
          log_csums.isra.0+0x244/0x2d0 fs/btrfs/tree-log.c:4258
          copy_items.isra.0+0xbfb/0xed0 fs/btrfs/tree-log.c:4403
          copy_inode_items_to_log+0x13d6/0x1d90 fs/btrfs/tree-log.c:5873
          btrfs_log_inode+0xb19/0x4680 fs/btrfs/tree-log.c:6495
          btrfs_log_inode_parent+0x890/0x2a20 fs/btrfs/tree-log.c:6982
          btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7083
          btrfs_sync_file+0xa41/0x13c0 fs/btrfs/file.c:1921
          vfs_fsync_range+0x13e/0x230 fs/sync.c:188
          generic_write_sync include/linux/fs.h:2856 [inline]
          iomap_dio_complete+0x73a/0x920 fs/iomap/direct-io.c:128
          btrfs_direct_write fs/btrfs/file.c:1536 [inline]
          btrfs_do_write_iter+0xba2/0x1470 fs/btrfs/file.c:1668
          call_write_iter include/linux/fs.h:2160 [inline]
          do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735
          do_iter_write+0x182/0x700 fs/read_write.c:861
          vfs_iter_write+0x74/0xa0 fs/read_write.c:902
          iter_file_splice_write+0x745/0xc90 fs/splice.c:686
          do_splice_from fs/splice.c:764 [inline]
          direct_splice_actor+0x114/0x180 fs/splice.c:931
          splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886
          do_splice_direct+0x1ab/0x280 fs/splice.c:974
          do_sendfile+0xb19/0x1270 fs/read_write.c:1255
          __do_sys_sendfile64 fs/read_write.c:1323 [inline]
          __se_sys_sendfile64 fs/read_write.c:1309 [inline]
          __x64_sys_sendfile64+0x259/0x2c0 fs/read_write.c:1309
          do_syscall_x64 arch/x86/entry/common.c:50 [inline]
          do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
          entry_SYSCALL_64_after_hwframe+0x63/0xcd

   -> #1 (btrfs-tree-00){++++}-{3:3}:
          __lock_release kernel/locking/lockdep.c:5382 [inline]
          lock_release+0x371/0x810 kernel/locking/lockdep.c:5688
          up_write+0x2a/0x520 kernel/locking/rwsem.c:1614
          btrfs_tree_unlock_rw fs/btrfs/locking.h:189 [inline]
          btrfs_unlock_up_safe+0x1e3/0x290 fs/btrfs/locking.c:238
          search_leaf fs/btrfs/ctree.c:1832 [inline]
          btrfs_search_slot+0x265e/0x2c70 fs/btrfs/ctree.c:2074
          btrfs_insert_empty_items+0xbd/0x1c0 fs/btrfs/ctree.c:4133
          btrfs_insert_delayed_item+0x826/0xfa0 fs/btrfs/delayed-inode.c:746
          btrfs_insert_delayed_items fs/btrfs/delayed-inode.c:824 [inline]
          __btrfs_commit_inode_delayed_items fs/btrfs/delayed-inode.c:1111 [inline]
          __btrfs_run_delayed_items+0x280/0x590 fs/btrfs/delayed-inode.c:1153
          flush_space+0x147/0xe90 fs/btrfs/space-info.c:728
          btrfs_async_reclaim_metadata_space+0x541/0xc10 fs/btrfs/space-info.c:1086
          process_one_work+0x9bf/0x1710 kernel/workqueue.c:2289
          worker_thread+0x669/0x1090 kernel/workqueue.c:2436
          kthread+0x2e8/0x3a0 kernel/kthread.c:376
          ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308

   -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
          check_prev_add kernel/locking/lockdep.c:3097 [inline]
          check_prevs_add kernel/locking/lockdep.c:3216 [inline]
          validate_chain kernel/locking/lockdep.c:3831 [inline]
          __lock_acquire+0x2a43/0x56d0 kernel/locking/lockdep.c:5055
          lock_acquire kernel/locking/lockdep.c:5668 [inline]
          lock_acquire+0x1e3/0x630 kernel/locking/lockdep.c:5633
          __mutex_lock_common kernel/locking/mutex.c:603 [inline]
          __mutex_lock+0x12f/0x1360 kernel/locking/mutex.c:747
          __btrfs_release_delayed_node.part.0+0xa1/0xf30 fs/btrfs/delayed-inode.c:256
          __btrfs_release_delayed_node fs/btrfs/delayed-inode.c:251 [inline]
          btrfs_release_delayed_node fs/btrfs/delayed-inode.c:281 [inline]
          btrfs_remove_delayed_node+0x52/0x60 fs/btrfs/delayed-inode.c:1285
          btrfs_evict_inode+0x511/0xf30 fs/btrfs/inode.c:5554
          evict+0x2ed/0x6b0 fs/inode.c:664
          dispose_list+0x117/0x1e0 fs/inode.c:697
          prune_icache_sb+0xeb/0x150 fs/inode.c:896
          super_cache_scan+0x391/0x590 fs/super.c:106
          do_shrink_slab+0x464/0xce0 mm/vmscan.c:843
          shrink_slab_memcg mm/vmscan.c:912 [inline]
          shrink_slab+0x388/0x660 mm/vmscan.c:991
          shrink_node_memcgs mm/vmscan.c:6088 [inline]
          shrink_node+0x93d/0x1f30 mm/vmscan.c:6117
          shrink_zones mm/vmscan.c:6355 [inline]
          do_try_to_free_pages+0x3b4/0x17a0 mm/vmscan.c:6417
          try_to_free_mem_cgroup_pages+0x3a4/0xa70 mm/vmscan.c:6732
          reclaim_high.constprop.0+0x182/0x230 mm/memcontrol.c:2393
          mem_cgroup_handle_over_high+0x190/0x520 mm/memcontrol.c:2578
          try_charge_memcg+0xe0c/0x12f0 mm/memcontrol.c:2816
          try_charge mm/memcontrol.c:2827 [inline]
          charge_memcg+0x90/0x3b0 mm/memcontrol.c:6889
          __mem_cgroup_charge+0x2b/0x90 mm/memcontrol.c:6910
          mem_cgroup_charge include/linux/memcontrol.h:667 [inline]
          __filemap_add_folio+0x615/0xf80 mm/filemap.c:852
          filemap_add_folio+0xaf/0x1e0 mm/filemap.c:934
          __filemap_get_folio+0x389/0xd80 mm/filemap.c:1976
          pagecache_get_page+0x2e/0x280 mm/folio-compat.c:104
          find_or_create_page include/linux/pagemap.h:612 [inline]
          alloc_extent_buffer+0x2b9/0x1580 fs/btrfs/extent_io.c:4588
          btrfs_init_new_buffer fs/btrfs/extent-tree.c:4869 [inline]
          btrfs_alloc_tree_block+0x2e1/0x1320 fs/btrfs/extent-tree.c:4988
          __btrfs_cow_block+0x3b2/0x1420 fs/btrfs/ctree.c:440
          btrfs_cow_block+0x2fa/0x950 fs/btrfs/ctree.c:595
          btrfs_search_slot+0x11b0/0x2c70 fs/btrfs/ctree.c:2038
          btrfs_update_root+0xdb/0x630 fs/btrfs/root-tree.c:137
          update_log_root fs/btrfs/tree-log.c:2841 [inline]
          btrfs_sync_log+0xbfb/0x2870 fs/btrfs/tree-log.c:3064
          btrfs_sync_file+0xdb9/0x13c0 fs/btrfs/file.c:1947
          vfs_fsync_range+0x13e/0x230 fs/sync.c:188
          generic_write_sync include/linux/fs.h:2856 [inline]
          iomap_dio_complete+0x73a/0x920 fs/iomap/direct-io.c:128
          btrfs_direct_write fs/btrfs/file.c:1536 [inline]
          btrfs_do_write_iter+0xba2/0x1470 fs/btrfs/file.c:1668
          call_write_iter include/linux/fs.h:2160 [inline]
          do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735
          do_iter_write+0x182/0x700 fs/read_write.c:861
          vfs_iter_write+0x74/0xa0 fs/read_write.c:902
          iter_file_splice_write+0x745/0xc90 fs/splice.c:686
          do_splice_from fs/splice.c:764 [inline]
          direct_splice_actor+0x114/0x180 fs/splice.c:931
          splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886
          do_splice_direct+0x1ab/0x280 fs/splice.c:974
          do_sendfile+0xb19/0x1270 fs/read_write.c:1255
          __do_sys_sendfile64 fs/read_write.c:1323 [inline]
          __se_sys_sendfile64 fs/read_write.c:1309 [inline]
          __x64_sys_sendfile64+0x259/0x2c0 fs/read_write.c:1309
          do_syscall_x64 arch/x86/entry/common.c:50 [inline]
          do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
          entry_SYSCALL_64_after_hwframe+0x63/0xcd

   other info that might help us debug this:

   Chain exists of:
     &delayed_node->mutex --> btrfs-tree-00 --> btrfs-log-00

   Possible unsafe locking scenario:

          CPU0                    CPU1
          ----                    ----
     lock(btrfs-log-00);
                                  lock(btrfs-tree-00);
                                  lock(btrfs-log-00);
     lock(&delayed_node->mutex);

Holding a read lock on a leaf from a fs/subvolume tree creates a nasty
lock dependency when we are COWing extent buffers for the log tree and we
have two tasks modifying the log tree, with each one in one of the
following 2 scenarios:

1) Modifying the log tree triggers an extent buffer allocation while
   holding a write lock on a parent extent buffer from the log tree.
   Allocating the pages for an extent buffer, or the extent buffer
   struct, can trigger inode eviction and finally the inode eviction
   will trigger a release/remove of a delayed node, which requires
   taking the delayed node's mutex;

2) Allocating a metadata extent for a log tree can trigger the async
   reclaim thread and make us wait for it to release enough space and
   unblock our reservation ticket. The reclaim thread can start flushing
   delayed items, and that in turn results in the need to lock delayed
   node mutexes and in the need to write lock extent buffers of a
   subvolume tree - all this while holding a write lock on the parent
   extent buffer in the log tree.

So one task in scenario 1) running in parallel with another task in
scenario 2) could lead to a deadlock, one wanting to lock a delayed node
mutex while having a read lock on a leaf from the subvolume, while the
other is holding the delayed node's mutex and wants to write lock the same
subvolume leaf for flushing delayed items.

Fix this by cloning the leaf of the fs/subvolume tree, release/unlock the
fs/subvolume leaf and use the clone leaf instead.

Reported-by: [email protected]
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
CC: [email protected] # 6.0+
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
test_bpf tail call tests end up as:

  test_bpf: #0 Tail call leaf jited:1 85 PASS
  test_bpf: #1 Tail call 2 jited:1 111 PASS
  test_bpf: #2 Tail call 3 jited:1 145 PASS
  test_bpf: #3 Tail call 4 jited:1 170 PASS
  test_bpf: #4 Tail call load/store leaf jited:1 190 PASS
  test_bpf: #5 Tail call load/store jited:1
  BUG: Unable to handle kernel data access on write at 0xf1b4e000
  Faulting instruction address: 0xbe86b710
  Oops: Kernel access of bad area, sig: 11 [#1]
  BE PAGE_SIZE=4K MMU=Hash PowerMac
  Modules linked in: test_bpf(+)
  CPU: 0 PID: 97 Comm: insmod Not tainted 6.1.0-rc4+ #195
  Hardware name: PowerMac3,1 750CL 0x87210 PowerMac
  NIP:  be86b710 LR: be857e88 CTR: be86b704
  REGS: f1b4df20 TRAP: 0300   Not tainted  (6.1.0-rc4+)
  MSR:  00009032 <EE,ME,IR,DR,RI>  CR: 28008242  XER: 00000000
  DAR: f1b4e000 DSISR: 42000000
  GPR00: 00000001 f1b4dfe0 c11d2280 00000000 00000000 00000000 00000002 00000000
  GPR08: f1b4e000 be86b704 f1b4e000 00000000 00000000 100d816a f2440000 fe73baa8
  GPR16: f2458000 00000000 c1941ae4 f1fe2248 00000045 c0de0000 f2458030 00000000
  GPR24: 000003e8 0000000f f2458000 f1b4dc90 3e584b46 00000000 f24466a0 c1941a00
  NIP [be86b710] 0xbe86b710
  LR [be857e88] __run_one+0xec/0x264 [test_bpf]
  Call Trace:
  [f1b4dfe0] [00000002] 0x2 (unreliable)
  Instruction dump:
  XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
  XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
  ---[ end trace 0000000000000000 ]---

This is a tentative to write above the stack. The problem is encoutered
with tests added by commit 38608ee ("bpf, tests: Add load store
test case for tail call")

This happens because tail call is done to a BPF prog with a different
stack_depth. At the time being, the stack is kept as is when the caller
tail calls its callee. But at exit, the callee restores the stack based
on its own properties. Therefore here, at each run, r1 is erroneously
increased by 32 - 16 = 16 bytes.

This was done that way in order to pass the tail call count from caller
to callee through the stack. As powerpc32 doesn't have a red zone in
the stack, it was necessary the maintain the stack as is for the tail
call. But it was not anticipated that the BPF frame size could be
different.

Let's take a new approach. Use register r4 to carry the tail call count
during the tail call, and save it into the stack at function entry if
required. This means the input parameter must be in r3, which is more
correct as it is a 32 bits parameter, then tail call better match with
normal BPF function entry, the down side being that we move that input
parameter back and forth between r3 and r4. That can be optimised later.

Doing that also has the advantage of maximising the common parts between
tail calls and a normal function exit.

With the fix, tail call tests are now successfull:

  test_bpf: #0 Tail call leaf jited:1 53 PASS
  test_bpf: #1 Tail call 2 jited:1 115 PASS
  test_bpf: #2 Tail call 3 jited:1 154 PASS
  test_bpf: #3 Tail call 4 jited:1 165 PASS
  test_bpf: #4 Tail call load/store leaf jited:1 101 PASS
  test_bpf: #5 Tail call load/store jited:1 141 PASS
  test_bpf: #6 Tail call error path, max count reached jited:1 994 PASS
  test_bpf: #7 Tail call count preserved across function calls jited:1 140975 PASS
  test_bpf: #8 Tail call error path, NULL target jited:1 110 PASS
  test_bpf: #9 Tail call error path, index out of range jited:1 69 PASS
  test_bpf: test_tail_calls: Summary: 10 PASSED, 0 FAILED, [10/10 JIT'ed]

Suggested-by: Naveen N. Rao <[email protected]>
Fixes: 51c66ad ("powerpc/bpf: Implement extended BPF on PPC32")
Cc: [email protected]
Signed-off-by: Christophe Leroy <[email protected]>
Tested-by: Naveen N. Rao <[email protected]
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/757acccb7fbfc78efa42dcf3c974b46678198905.1669278887.git.christophe.leroy@csgroup.eu
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
Matt reported a splat at msk close time:

    BUG: sleeping function called from invalid context at net/mptcp/protocol.c:2877
    in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 155, name: packetdrill
    preempt_count: 201, expected: 0
    RCU nest depth: 0, expected: 0
    4 locks held by packetdrill/155:
    #0: ffff888001536990 (&sb->s_type->i_mutex_key#6){+.+.}-{3:3}, at: __sock_release (net/socket.c:650)
    #1: ffff88800b498130 (sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close (net/mptcp/protocol.c:2973)
    #2: ffff88800b49a130 (sk_lock-AF_INET/1){+.+.}-{0:0}, at: __mptcp_close_ssk (net/mptcp/protocol.c:2363)
    #3: ffff88800b49a0b0 (slock-AF_INET){+...}-{2:2}, at: __lock_sock_fast (include/net/sock.h:1820)
    Preemption disabled at:
    0x0
    CPU: 1 PID: 155 Comm: packetdrill Not tainted 6.1.0-rc5 #365
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
    Call Trace:
    <TASK>
    dump_stack_lvl (lib/dump_stack.c:107 (discriminator 4))
    __might_resched.cold (kernel/sched/core.c:9891)
    __mptcp_destroy_sock (include/linux/kernel.h:110)
    __mptcp_close (net/mptcp/protocol.c:2959)
    mptcp_subflow_queue_clean (include/net/sock.h:1777)
    __mptcp_close_ssk (net/mptcp/protocol.c:2363)
    mptcp_destroy_common (net/mptcp/protocol.c:3170)
    mptcp_destroy (include/net/sock.h:1495)
    __mptcp_destroy_sock (net/mptcp/protocol.c:2886)
    __mptcp_close (net/mptcp/protocol.c:2959)
    mptcp_close (net/mptcp/protocol.c:2974)
    inet_release (net/ipv4/af_inet.c:432)
    __sock_release (net/socket.c:651)
    sock_close (net/socket.c:1367)
    __fput (fs/file_table.c:320)
    task_work_run (kernel/task_work.c:181 (discriminator 1))
    exit_to_user_mode_prepare (include/linux/resume_user_mode.h:49)
    syscall_exit_to_user_mode (kernel/entry/common.c:130)
    do_syscall_64 (arch/x86/entry/common.c:87)
    entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)

We can't call mptcp_close under the 'fast' socket lock variant, replace
it with a sock_lock_nested() as the relevant code is already under the
listening msk socket lock protection.

Reported-by: Matthieu Baerts <[email protected]>
Closes: multipath-tcp/mptcp_net-next#316
Fixes: 30e51b9 ("mptcp: fix unreleased socket in accept queue")
Signed-off-by: Paolo Abeni <[email protected]>
Reviewed-by: Matthieu Baerts <[email protected]>
Signed-off-by: Matthieu Baerts <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
Wei Chen reported a NULL deref in sk_user_ns() [0][1], and Paolo diagnosed
the root cause: in unix_diag_get_exact(), the newly allocated skb does not
have sk. [2]

We must get the user_ns from the NETLINK_CB(in_skb).sk and pass it to
sk_diag_fill().

[0]:
BUG: kernel NULL pointer dereference, address: 0000000000000270
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 12bbce067 P4D 12bbce067 PUD 12bc40067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
CPU: 0 PID: 27942 Comm: syz-executor.0 Not tainted 6.1.0-rc5-next-20221118 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014
RIP: 0010:sk_user_ns include/net/sock.h:920 [inline]
RIP: 0010:sk_diag_dump_uid net/unix/diag.c:119 [inline]
RIP: 0010:sk_diag_fill+0x77d/0x890 net/unix/diag.c:170
Code: 89 ef e8 66 d4 2d fd c7 44 24 40 00 00 00 00 49 8d 7c 24 18 e8
54 d7 2d fd 49 8b 5c 24 18 48 8d bb 70 02 00 00 e8 43 d7 2d fd <48> 8b
9b 70 02 00 00 48 8d 7b 10 e8 33 d7 2d fd 48 8b 5b 10 48 8d
RSP: 0018:ffffc90000d67968 EFLAGS: 00010246
RAX: ffff88812badaa48 RBX: 0000000000000000 RCX: ffffffff840d481d
RDX: 0000000000000465 RSI: 0000000000000000 RDI: 0000000000000270
RBP: ffffc90000d679a8 R08: 0000000000000277 R09: 0000000000000000
R10: 0001ffffffffffff R11: 0001c90000d679a8 R12: ffff88812ac03800
R13: ffff88812c87c400 R14: ffff88812ae42210 R15: ffff888103026940
FS:  00007f08b4e6f700(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000270 CR3: 000000012c58b000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 unix_diag_get_exact net/unix/diag.c:285 [inline]
 unix_diag_handler_dump+0x3f9/0x500 net/unix/diag.c:317
 __sock_diag_cmd net/core/sock_diag.c:235 [inline]
 sock_diag_rcv_msg+0x237/0x250 net/core/sock_diag.c:266
 netlink_rcv_skb+0x13e/0x250 net/netlink/af_netlink.c:2564
 sock_diag_rcv+0x24/0x40 net/core/sock_diag.c:277
 netlink_unicast_kernel net/netlink/af_netlink.c:1330 [inline]
 netlink_unicast+0x5e9/0x6b0 net/netlink/af_netlink.c:1356
 netlink_sendmsg+0x739/0x860 net/netlink/af_netlink.c:1932
 sock_sendmsg_nosec net/socket.c:714 [inline]
 sock_sendmsg net/socket.c:734 [inline]
 ____sys_sendmsg+0x38f/0x500 net/socket.c:2476
 ___sys_sendmsg net/socket.c:2530 [inline]
 __sys_sendmsg+0x197/0x230 net/socket.c:2559
 __do_sys_sendmsg net/socket.c:2568 [inline]
 __se_sys_sendmsg net/socket.c:2566 [inline]
 __x64_sys_sendmsg+0x42/0x50 net/socket.c:2566
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x4697f9
Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48
89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f08b4e6ec48 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 000000000077bf80 RCX: 00000000004697f9
RDX: 0000000000000000 RSI: 00000000200001c0 RDI: 0000000000000003
RBP: 00000000004d29e9 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000077bf80
R13: 0000000000000000 R14: 000000000077bf80 R15: 00007ffdb36bc6c0
 </TASK>
Modules linked in:
CR2: 0000000000000270

[1]: https://lore.kernel.org/netdev/CAO4mrfdvyjFpokhNsiwZiP-wpdSD0AStcJwfKcKQdAALQ9_2Qw@mail.gmail.com/
[2]: https://lore.kernel.org/netdev/[email protected]/

Fixes: cae9910 ("net: Add UNIX_DIAG_UID to Netlink UNIX socket diagnostics.")
Reported-by: syzbot <[email protected]>
Reported-by: Wei Chen <[email protected]>
Diagnosed-by: Paolo Abeni <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
QAT devices on Intel Sapphire Rapids and Emerald Rapids have a defect in
address translation service (ATS). These devices may inadvertently issue
ATS invalidation completion before posted writes initiated with
translated address that utilized translations matching the invalidation
address range, violating the invalidation completion ordering.

This patch adds an extra device TLB invalidation for the affected devices,
it is needed to ensure no more posted writes with translated address
following the invalidation completion. Therefore, the ordering is
preserved and data-corruption is prevented.

Device TLBs are invalidated under the following six conditions:
1. Device driver does DMA API unmap IOVA
2. Device driver unbind a PASID from a process, sva_unbind_device()
3. PASID is torn down, after PASID cache is flushed. e.g. process
exit_mmap() due to crash
4. Under SVA usage, called by mmu_notifier.invalidate_range() where
VM has to free pages that were unmapped
5. userspace driver unmaps a DMA buffer
6. Cache invalidation in vSVA usage (upcoming)

For #1 and #2, device drivers are responsible for stopping DMA traffic
before unmap/unbind. For #3, iommu driver gets mmu_notifier to
invalidate TLB the same way as normal user unmap which will do an extra
invalidation. The dTLB invalidation after PASID cache flush does not
need an extra invalidation.

Therefore, we only need to deal with #4 and #5 in this patch. #1 is also
covered by this patch due to common code path with #5.

Tested-by: Yuzhang Luo <[email protected]>
Reviewed-by: Ashok Raj <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Lu Baolu <[email protected]>
Signed-off-by: Joerg Roedel <[email protected]>
123Xiong-Zhang pushed a commit that referenced this pull request Nov 7, 2024
When sending packets between nodes in netns, it calls tipc_lxc_xmit() for
peer node to receive the packets where tipc_sk_mcast_rcv()/tipc_sk_rcv()
might be called, and it's pretty much like in tipc_rcv().

Currently the local 'node rw lock' is held during calling tipc_lxc_xmit()
to protect the peer_net not being freed by another thread. However, when
receiving these packets, tipc_node_add_conn() might be called where the
peer 'node rw lock' is acquired. Then a dead lock warning is triggered by
lockdep detector, although it is not a real dead lock:

    WARNING: possible recursive locking detected
    --------------------------------------------
    conn_server/1086 is trying to acquire lock:
    ffff8880065cb020 (&n->lock#2){++--}-{2:2}, \
                     at: tipc_node_add_conn.cold.76+0xaa/0x211 [tipc]

    but task is already holding lock:
    ffff8880065cd020 (&n->lock#2){++--}-{2:2}, \
                     at: tipc_node_xmit+0x285/0xb30 [tipc]

    other info that might help us debug this:
     Possible unsafe locking scenario:

           CPU0
           ----
      lock(&n->lock#2);
      lock(&n->lock#2);

     *** DEADLOCK ***

     May be due to missing lock nesting notation

    4 locks held by conn_server/1086:
     #0: ffff8880036d1e40 (sk_lock-AF_TIPC){+.+.}-{0:0}, \
                          at: tipc_accept+0x9c0/0x10b0 [tipc]
     #1: ffff8880036d5f80 (sk_lock-AF_TIPC/1){+.+.}-{0:0}, \
                          at: tipc_accept+0x363/0x10b0 [tipc]
     #2: ffff8880065cd020 (&n->lock#2){++--}-{2:2}, \
                          at: tipc_node_xmit+0x285/0xb30 [tipc]
     #3: ffff888012e13370 (slock-AF_TIPC){+...}-{2:2}, \
                          at: tipc_sk_rcv+0x2da/0x1b40 [tipc]

    Call Trace:
     <TASK>
     dump_stack_lvl+0x44/0x5b
     __lock_acquire.cold.77+0x1f2/0x3d7
     lock_acquire+0x1d2/0x610
     _raw_write_lock_bh+0x38/0x80
     tipc_node_add_conn.cold.76+0xaa/0x211 [tipc]
     tipc_sk_finish_conn+0x21e/0x640 [tipc]
     tipc_sk_filter_rcv+0x147b/0x3030 [tipc]
     tipc_sk_rcv+0xbb4/0x1b40 [tipc]
     tipc_lxc_xmit+0x225/0x26b [tipc]
     tipc_node_xmit.cold.82+0x4a/0x102 [tipc]
     __tipc_sendstream+0x879/0xff0 [tipc]
     tipc_accept+0x966/0x10b0 [tipc]
     do_accept+0x37d/0x590

This patch avoids this warning by not holding the 'node rw lock' before
calling tipc_lxc_xmit(). As to protect the 'peer_net', rcu_read_lock()
should be enough, as in cleanup_net() when freeing the netns, it calls
synchronize_rcu() before the free is continued.

Also since tipc_lxc_xmit() is like the RX path in tipc_rcv(), it makes
sense to call it under rcu_read_lock(). Note that the right lock order
must be:

   rcu_read_lock();
   tipc_node_read_lock(n);
   tipc_node_read_unlock(n);
   tipc_lxc_xmit();
   rcu_read_unlock();

instead of:

   tipc_node_read_lock(n);
   rcu_read_lock();
   tipc_node_read_unlock(n);
   tipc_lxc_xmit();
   rcu_read_unlock();

and we have to call tipc_node_read_lock/unlock() twice in
tipc_node_xmit().

Fixes: f73b128 ("tipc: improve throughput between nodes in netns")
Reported-by: Shuang Li <[email protected]>
Signed-off-by: Xin Long <[email protected]>
Link: https://lore.kernel.org/r/5bdd1f8fee9db695cfff4528a48c9b9d0523fb00.1670110641.git.lucien.xin@gmail.com
Signed-off-by: Paolo Abeni <[email protected]>
NegXuan pushed a commit that referenced this pull request Dec 27, 2024
Arm has introduced a new v10 GPU architecture that replaces the Job Manager
interface with a new Command Stream Frontend. It adds firmware driven
command stream queues that can be used by kernel and user space to submit
jobs to the GPU.

Add the initial schema for the device tree that is based on support for
RK3588 SoC. The minimum number of clocks is one for the IP, but on Rockchip
platforms they will tend to expose the semi-independent clocks for better
power management.

v5:
- Move the opp-table node under the gpu node

v4:
- Fix formatting issue

v3:
- Cleanup commit message to remove redundant text
- Added opp-table property and re-ordered entries
- Clarified power-domains and power-domain-names requirements for RK3588.
- Cleaned up example

Note: power-domains and power-domain-names requirements for other platforms
are still work in progress, hence the bindings are left incomplete here.

v2:
- New commit

Signed-off-by: Liviu Dudau <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Conor Dooley <[email protected]>
Cc: [email protected]
Signed-off-by: Boris Brezillon <[email protected]>
Reviewed-by: Rob Herring <[email protected]>

drm: execution context for GEM buffers v7

This adds the infrastructure for an execution context for GEM buffers
which is similar to the existing TTMs execbuf util and intended to replace
it in the long term.

The basic functionality is that we abstracts the necessary loop to lock
many different GEM buffers with automated deadlock and duplicate handling.

v2: drop xarray and use dynamic resized array instead, the locking
    overhead is unnecessary and measurable.
v3: drop duplicate tracking, radeon is really the only one needing that.
v4: fixes issues pointed out by Danilo, some typos in comments and a
    helper for lock arrays of GEM objects.
v5: some suggestions by Boris Brezillon, especially just use one retry
    macro, drop loop in prepare_array, use flags instead of bool
v6: minor changes suggested by Thomas, Boris and Danilo
v7: minor typos pointed out by checkpatch.pl fixed

Signed-off-by: Christian König <[email protected]>
Reviewed-by: Boris Brezillon <[email protected]>
Reviewed-by: Danilo Krummrich <[email protected]>
Tested-by: Danilo Krummrich <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm: manager to keep track of GPUs VA mappings

Add infrastructure to keep track of GPU virtual address (VA) mappings
with a decicated VA space manager implementation.

New UAPIs, motivated by Vulkan sparse memory bindings graphics drivers
start implementing, allow userspace applications to request multiple and
arbitrary GPU VA mappings of buffer objects. The DRM GPU VA manager is
intended to serve the following purposes in this context.

1) Provide infrastructure to track GPU VA allocations and mappings,
   using an interval tree (RB-tree).

2) Generically connect GPU VA mappings to their backing buffers, in
   particular DRM GEM objects.

3) Provide a common implementation to perform more complex mapping
   operations on the GPU VA space. In particular splitting and merging
   of GPU VA mappings, e.g. for intersecting mapping requests or partial
   unmap requests.

Acked-by: Thomas Hellström <[email protected]>
Acked-by: Matthew Brost <[email protected]>
Reviewed-by: Boris Brezillon <[email protected]>
Tested-by: Matthew Brost <[email protected]>
Tested-by: Donald Robson <[email protected]>
Suggested-by: Dave Airlie <[email protected]>
Signed-off-by: Danilo Krummrich <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm: manager: Fix printk format for size_t

sizeof() returns a size_t which may be different to an unsigned long.
Use the correct format specifier of '%zu' to prevent compiler warnings.

Fixes: e6303f323b1a ("drm: manager to keep track of GPUs VA mappings")
Reviewed-by: Danilo Krummrich <[email protected]>
Signed-off-by: Steven Price <[email protected]>
Signed-off-by: Danilo Krummrich <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/gpuva_mgr: remove unused prev pointer in __drm_gpuva_sm_map()

The prev pointer in __drm_gpuva_sm_map() was used to implement automatic
merging of mappings. Since automatic merging did not make its way
upstream, remove this leftover.

Fixes: e6303f323b1a ("drm: manager to keep track of GPUs VA mappings")
Signed-off-by: Danilo Krummrich <[email protected]>
Reviewed-by: Dave Airlie <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/gpuvm: rename struct drm_gpuva_manager to struct drm_gpuvm

Rename struct drm_gpuva_manager to struct drm_gpuvm including
corresponding functions. This way the GPUVA manager's structures align
very well with the documentation of VM_BIND [1] and VM_BIND locking [2].

It also provides a better foundation for the naming of data structures
and functions introduced for implementing a common dma-resv per GPU-VM
including tracking of external and evicted objects in subsequent
patches.

[1] Documentation/gpu/drm-vm-bind-async.rst
[2] Documentation/gpu/drm-vm-bind-locking.rst

Cc: Thomas Hellström <[email protected]>
Cc: Matthew Brost <[email protected]>
Acked-by: Dave Airlie <[email protected]>
Acked-by: Christian König <[email protected]>
Signed-off-by: Danilo Krummrich <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/gpuvm: allow building as module

HB:
drivers/gpu/drm/nouveau/Kconfig
skipped because there is no gpuvm support of nouveau in 6.1

Currently, the DRM GPUVM does not have any core dependencies preventing
a module build.

Also, new features from subsequent patches require helpers (namely
drm_exec) which can be built as module.

Reviewed-by: Christian König <[email protected]>
Reviewed-by: Dave Airlie <[email protected]>
Signed-off-by: Danilo Krummrich <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/gpuvm: convert WARN() to drm_WARN() variants

HB:
drivers/gpu/drm/nouveau/nouveau_uvmm.c
skipped since 6.1 does not support gpuvm on nv

Use drm_WARN() and drm_WARN_ON() variants to indicate drivers the
context the failing VM resides in.

Acked-by: Christian König <[email protected]>
Reviewed-by: Boris Brezillon <[email protected]>
Reviewed-by: Thomas Hellström <[email protected]>
Signed-off-by: Danilo Krummrich <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/gpuvm: don't always WARN in drm_gpuvm_check_overflow()

Don't always WARN in drm_gpuvm_check_overflow() and separate it into a
drm_gpuvm_check_overflow() and a dedicated
drm_gpuvm_warn_check_overflow() variant.

This avoids printing warnings due to invalid userspace requests.

Acked-by: Christian König <[email protected]>
Reviewed-by: Thomas Hellström <[email protected]>
Reviewed-by: Boris Brezillon <[email protected]>
Signed-off-by: Danilo Krummrich <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/gpuvm: export drm_gpuvm_range_valid()

Drivers may use this function to validate userspace requests in advance,
hence export it.

Acked-by: Christian König <[email protected]>
Reviewed-by: Thomas Hellström <[email protected]>
Reviewed-by: Boris Brezillon <[email protected]>
Signed-off-by: Danilo Krummrich <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/gpuvm: add common dma-resv per struct drm_gpuvm

hb:
drivers/gpu/drm/nouveau/nouveau_uvmm.c
skipped

Provide a common dma-resv for GEM objects not being used outside of this
GPU-VM. This is used in a subsequent patch to generalize dma-resv,
external and evicted object handling and GEM validation.

Acked-by: Christian König <[email protected]>
Reviewed-by: Boris Brezillon <[email protected]>
Reviewed-by: Thomas Hellström <[email protected]>
Signed-off-by: Danilo Krummrich <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/gpuvm: add drm_gpuvm_flags to drm_gpuvm

HB:
drivers/gpu/drm/nouveau/nouveau_uvmm.c
skipped

Introduce flags for struct drm_gpuvm, this required by subsequent
commits.

Acked-by: Christian König <[email protected]>
Reviewed-by: Boris Brezillon <[email protected]>
Reviewed-by: Thomas Hellström <[email protected]>
Signed-off-by: Danilo Krummrich <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/gpuvm: reference count drm_gpuvm structures

HB:
drivers/gpu/drm/nouveau/nouveau_uvmm.c
skipped

Implement reference counting for struct drm_gpuvm.

Acked-by: Christian König <[email protected]>
Reviewed-by: Thomas Hellström <[email protected]>
Reviewed-by: Boris Brezillon <[email protected]>
Signed-off-by: Danilo Krummrich <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/gpuvm: add an abstraction for a VM / BO combination

HB:
drivers/gpu/drm/nouveau/nouveau_uvmm.c
skipped

Add an abstraction layer between the drm_gpuva mappings of a particular
drm_gem_object and this GEM object itself. The abstraction represents a
combination of a drm_gem_object and drm_gpuvm. The drm_gem_object holds
a list of drm_gpuvm_bo structures (the structure representing this
abstraction), while each drm_gpuvm_bo contains list of mappings of this
GEM object.

This has multiple advantages:

1) We can use the drm_gpuvm_bo structure to attach it to various lists
   of the drm_gpuvm. This is useful for tracking external and evicted
   objects per VM, which is introduced in subsequent patches.

2) Finding mappings of a certain drm_gem_object mapped in a certain
   drm_gpuvm becomes much cheaper.

3) Drivers can derive and extend the structure to easily represent
   driver specific states of a BO for a certain GPUVM.

The idea of this abstraction was taken from amdgpu, hence the credit for
this idea goes to the developers of amdgpu.

Cc: Christian König <[email protected]>
Acked-by: Christian König <[email protected]>
Reviewed-by: Thomas Hellström <[email protected]>
Reviewed-by: Boris Brezillon <[email protected]>
Signed-off-by: Danilo Krummrich <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/gpuvm: track/lock/validate external/evicted objects

Currently the DRM GPUVM offers common infrastructure to track GPU VA
allocations and mappings, generically connect GPU VA mappings to their
backing buffers and perform more complex mapping operations on the GPU VA
space.

However, there are more design patterns commonly used by drivers, which
can potentially be generalized in order to make the DRM GPUVM represent
a basis for GPU-VM implementations. In this context, this patch aims
at generalizing the following elements.

1) Provide a common dma-resv for GEM objects not being used outside of
   this GPU-VM.

2) Provide tracking of external GEM objects (GEM objects which are
   shared with other GPU-VMs).

3) Provide functions to efficiently lock all GEM objects dma-resv the
   GPU-VM contains mappings of.

4) Provide tracking of evicted GEM objects the GPU-VM contains mappings
   of, such that validation of evicted GEM objects is accelerated.

5) Provide some convinience functions for common patterns.

Big thanks to Boris Brezillon for his help to figure out locking for
drivers updating the GPU VA space within the fence signalling path.

Acked-by: Christian König <[email protected]>
Reviewed-by: Boris Brezillon <[email protected]>
Reviewed-by: Thomas Hellström <[email protected]>
Suggested-by: Matthew Brost <[email protected]>
Signed-off-by: Danilo Krummrich <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/gpuvm: fall back to drm_exec_lock_obj()

Fall back to drm_exec_lock_obj() if num_fences is zero for the
drm_gpuvm_prepare_* function family.

Otherwise dma_resv_reserve_fences() would actually allocate slots even
though num_fences is zero.

Cc: Christian König <[email protected]>
Acked-by: Donald Robson <[email protected]>
Signed-off-by: Danilo Krummrich <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/gpuvm: Let drm_gpuvm_bo_put() report when the vm_bo object is destroyed

Some users need to release resources attached to the vm_bo object when
it's destroyed. In Panthor's case, we need to release the pin ref so
BO pages can be returned to the system when all GPU mappings are gone.

This could be done through a custom drm_gpuvm::vm_bo_free() hook, but
this has all sort of locking implications that would force us to expose
a drm_gem_shmem_unpin_locked() helper, not to mention the fact that
having a ::vm_bo_free() implementation without a ::vm_bo_alloc() one
seems odd. So let's keep things simple, and extend drm_gpuvm_bo_put()
to report when the object is destroyed.

Signed-off-by: Boris Brezillon <[email protected]>
Reviewed-by: Danilo Krummrich <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/exec: Pass in initial # of objects

HB: skipped
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
drivers/gpu/drm/amd/amdgpu/amdgpu_csa.c
drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
drivers/gpu/drm/amd/amdgpu/amdgpu_umsch_mm.c
drivers/gpu/drm/amd/amdkfd/kfd_svm.c
drivers/gpu/drm/imagination/pvr_job.c
drivers/gpu/drm/nouveau/nouveau_uvmm.c

In cases where the # is known ahead of time, it is silly to do the table
resize dance.

Signed-off-by: Rob Clark <[email protected]>
Reviewed-by: Christian König <[email protected]>
Patchwork: https://patchwork.freedesktop.org/patch/568338/

drm/gem-shmem: When drm_gem_object_init failed, should release object

when goto err_free, the object had init, so it should be release when fail.

Signed-off-by: ChunyouTang <[email protected]>
Signed-off-by: Thomas Zimmermann <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm: Remove usage of deprecated DRM_DEBUG_PRIME

drm_print.h says DRM_DEBUG_PRIME is deprecated in favor of
drm_dbg_prime().

Signed-off-by: Siddh Raman Pant <[email protected]>
Reviewed-by: Simon Ser <[email protected]>
Signed-off-by: Simon Ser <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/cd663b1bc42189e55898cddecdb3b73c591b341a.1673269059.git.code@siddh.me

drm/shmem: Cleanup drm_gem_shmem_create_with_handle()

Once we create the handle, the handle owns the reference.  Currently
nothing was doing anything with the shmem ptr after the handle was
created, but let's change drm_gem_shmem_create_with_handle() to not
return the pointer, so-as to not encourage problematic use of this
function in the future.  As a bonus, it makes the code a bit cleaner.

Signed-off-by: Rob Clark <[email protected]>
Reviewed-by: Steven Price <[email protected]>
Signed-off-by: Steven Price <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/shmem-helper: Fix locking for drm_gem_shmem_get_pages_sgt()

Other functions touching shmem->sgt take the pages lock, so do that here
too. drm_gem_shmem_get_pages() & co take the same lock, so move to the
_locked() variants to avoid recursive locking.

Discovered while auditing locking to write the Rust abstractions.

Fixes: 2194a63a818d ("drm: Add library for shmem backed GEM objects")
Fixes: 4fa3d66f132b ("drm/shmem: Do dma_unmap_sg before purging pages")
Signed-off-by: Asahi Lina <[email protected]>
Reviewed-by: Javier Martinez Canillas <[email protected]>
Signed-off-by: Javier Martinez Canillas <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/shmem-helper: Switch to use drm_* debug helpers

Ease debugging of a multi-GPU system by using drm_WARN_*() and
drm_dbg_kms() helpers that print out DRM device name corresponding
to shmem GEM.

Reviewed-by: Thomas Zimmermann <[email protected]>
Suggested-by: Thomas Zimmermann <[email protected]>
Signed-off-by: Dmitry Osipenko <[email protected]>
Link: https://lore.kernel.org/all/[email protected]/

drm/shmem-helper: Don't use vmap_use_count for dma-bufs

DMA-buf core has its own refcounting of vmaps, use it instead of drm-shmem
counting. This change prepares drm-shmem for addition of memory shrinker
support where drm-shmem will use a single dma-buf reservation lock for
all operations performed over dma-bufs.

Reviewed-by: Thomas Zimmermann <[email protected]>
Signed-off-by: Dmitry Osipenko <[email protected]>
Link: https://lore.kernel.org/all/[email protected]/

drm/shmem-helper: Switch to reservation lock

Replace all drm-shmem locks with a GEM reservation lock. This makes locks
consistent with dma-buf locking convention where importers are responsible
for holding reservation lock for all operations performed over dma-bufs,
preventing deadlock between dma-buf importers and exporters.

Suggested-by: Daniel Vetter <[email protected]>
Acked-by: Thomas Zimmermann <[email protected]>
Signed-off-by: Dmitry Osipenko <[email protected]>
Link: https://lore.kernel.org/all/[email protected]/

drm/shmem-helper: Revert accidental non-GPL export

The referenced commit added a wrapper for drm_gem_shmem_get_pages_sgt(),
but in the process it accidentally changed the export type from GPL to
non-GPL. Switch it back to GPL.

Reported-by: Dmitry Osipenko <[email protected]>
Fixes: ddddedaa0db9 ("drm/shmem-helper: Fix locking for drm_gem_shmem_get_pages_sgt()")
Signed-off-by: Asahi Lina <[email protected]>
Signed-off-by: Thomas Zimmermann <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

Revert "drm/shmem-helper: Switch to reservation lock"

This reverts commit 67b7836d4458790f1261e31fe0ce3250989784f0.

The locking appears incomplete. A caller of SHMEM helper's pin
function never acquires the dma-buf reservation lock. So we get

  WARNING: CPU: 3 PID: 967 at drivers/gpu/drm/drm_gem_shmem_helper.c:243 drm_gem_shmem_pin+0x42/0x90 [drm_shmem_helper]

Signed-off-by: Thomas Zimmermann <[email protected]>
Acked-by: Dmitry Osipenko <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/shmem-helper: Switch to reservation lock

Replace all drm-shmem locks with a GEM reservation lock. This makes locks
consistent with dma-buf locking convention where importers are responsible
for holding reservation lock for all operations performed over dma-bufs,
preventing deadlock between dma-buf importers and exporters.

Suggested-by: Daniel Vetter <[email protected]>
Acked-by: Thomas Zimmermann <[email protected]>
Reviewed-by: Emil Velikov <[email protected]>
Signed-off-by: Dmitry Osipenko <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/shmem-helper: Reset vma->vm_ops before calling dma_buf_mmap()

The dma-buf backend is supposed to provide its own vm_ops, but some
implementation just have nothing special to do and leave vm_ops
untouched, probably expecting this field to be zero initialized (this
is the case with the system_heap implementation for instance).
Let's reset vma->vm_ops to NULL to keep things working with these
implementations.

Fixes: 26d3ac3cb04d ("drm/shmem-helpers: Redirect mmap for imported dma-buf")
Cc: <[email protected]>
Cc: Daniel Vetter <[email protected]>
Reported-by: Roman Stratiienko <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>
Tested-by: Roman Stratiienko <[email protected]>
Reviewed-by: Thomas Zimmermann <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

iommu: Allow passing custom allocators to pgtable drivers

This will be useful for GPU drivers who want to keep page tables in a
pool so they can:

- keep freed page tables in a free pool and speed-up upcoming page
  table allocations
- batch page table allocation instead of allocating one page at a time
- pre-reserve pages for page tables needed for map/unmap operations,
  to ensure map/unmap operations don't try to allocate memory in paths
  they're allowed to block or fail

It might also be valuable for other aspects of GPU and similar
use-cases, like fine-grained memory accounting and resource limiting.

We will extend the Arm LPAE format to support custom allocators in a
separate commit.

Signed-off-by: Boris Brezillon <[email protected]>
Reviewed-by: Steven Price <[email protected]>
Reviewed-by: Robin Murphy <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Joerg Roedel <[email protected]>

iommu: Extend LPAE page table format to support custom allocators

We need that in order to implement the VM_BIND ioctl in the GPU driver
targeting new Mali GPUs.

VM_BIND is about executing MMU map/unmap requests asynchronously,
possibly after waiting for external dependencies encoded as dma_fences.
We intend to use the drm_sched framework to automate the dependency
tracking and VM job dequeuing logic, but this comes with its own set
of constraints, one of them being the fact we are not allowed to
allocate memory in the drm_gpu_scheduler_ops::run_job() to avoid this
sort of deadlocks:

- VM_BIND map job needs to allocate a page table to map some memory
  to the VM. No memory available, so kswapd is kicked
- GPU driver shrinker backend ends up waiting on the fence attached to
  the VM map job or any other job fence depending on this VM operation.

With custom allocators, we will be able to pre-reserve enough pages to
guarantee the map/unmap operations we queued will take place without
going through the system allocator. But we can also optimize
allocation/reservation by not free-ing pages immediately, so any
upcoming page table allocation requests can be serviced by some free
page table pool kept at the driver level.

I might also be valuable for other aspects of GPU and similar
use-cases, like fine-grained memory accounting and resource limiting.

Signed-off-by: Boris Brezillon <[email protected]>
Reviewed-by: Steven Price <[email protected]>
Reviewed-by: Robin Murphy <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Joerg Roedel <[email protected]>

drm/sched: Add FIFO sched policy to run queue

When many entities are competing for the same run queue
on the same scheduler, we observe an unusually long wait
times and some jobs get starved. This has been observed on GPUVis.

The issue is due to the Round Robin policy used by schedulers
to pick up the next entity's job queue for execution. Under stress
of many entities and long job queues within entity some
jobs could be stuck for very long time in it's entity's
queue before being popped from the queue and executed
while for other entities with smaller job queues a job
might execute earlier even though that job arrived later
then the job in the long queue.

Fix:
Add FIFO selection policy to entities in run queue, chose next entity
on run queue in such order that if job on one entity arrived
earlier then job on another entity the first job will start
executing earlier regardless of the length of the entity's job
queue.

v2:
Switch to rb tree structure for entities based on TS of
oldest job waiting in the job queue of an entity. Improves next
entity extraction to O(1). Entity TS update
O(log N) where N is the number of entities in the run-queue

Drop default option in module control parameter.

v3:
Various cosmetical fixes and minor refactoring of fifo update function. (Luben)

v4:
Switch drm_sched_rq_select_entity_fifo to in order search (Luben)

v5: Fix up drm_sched_rq_select_entity_fifo loop (Luben)

v6: Add missing drm_sched_rq_remove_fifo_locked

v7: Fix ts sampling bug and more cosmetic stuff (Luben)

v8: Fix module parameter string (Luben)

Cc: Luben Tuikov <[email protected]>
Cc: Christian König <[email protected]>
Cc: Direct Rendering Infrastructure - Development <[email protected]>
Cc: AMD Graphics <[email protected]>
Signed-off-by: Andrey Grodzovsky <[email protected]>
Tested-by: Yunxiang Li (Teddy) <[email protected]>
Signed-off-by: Luben Tuikov <[email protected]>
Reviewed-by: Luben Tuikov <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/scheduler: Set the FIFO scheduling policy as the default

The currently default Round-Robin GPU scheduling can result in starvation
of entities which have a large number of jobs, over entities which have
a very small number of jobs (single digit).

This can be illustrated in the following diagram, where jobs are
alphabetized to show their chronological order of arrival, where job A is
the oldest, B is the second oldest, and so on, to J, the most recent job to
arrive.

    ---> entities
j | H-F-----A--E--I--
o | --G-----B-----J--
b | --------C--------
s\/ --------D--------

WLOG, assuming all jobs are "ready", then a R-R scheduling will execute them
in the following order (a slice off of the top of the entities' list),

H, F, A, E, I, G, B, J, C, D.

However, to mitigate job starvation, we'd rather execute C and D before E,
and so on, given, of course, that they're all ready to be executed.

So, if all jobs are ready at this instant, the order of execution for this
and the next 9 instances of picking the next job to execute, should really
be,

A, B, C, D, E, F, G, H, I, J,

which is their chronological order. The only reason for this order to be
broken, is if an older job is not yet ready, but a younger job is ready, at
an instant of picking a new job to execute. For instance if job C wasn't
ready at time 2, but job D was ready, then we'd pick job D, like this:

0 +1 +2  ...
A, B, D, ...

And from then on, C would be preferred before all other jobs, if it is ready
at the time when a new job for execution is picked. So, if C became ready
two steps later, the execution order would look like this:

......0 +1 +2  ...
A, B, D, E, C, F, G, H, I, J

This is what the FIFO GPU scheduling algorithm achieves. It uses a
Red-Black tree to keep jobs sorted in chronological order, where picking
the oldest job is O(1) (we use the "cached" structure), and balancing the
tree is O(log n). IOW, it picks the *oldest ready* job to execute now.

The implementation is already in the kernel, and this commit only changes
the default GPU scheduling algorithm to use.

This was tested and achieves about 1% faster performance over the Round
Robin algorithm.

Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Cc: Direct Rendering Infrastructure - Development <[email protected]>
Signed-off-by: Luben Tuikov <[email protected]>
Reviewed-by: Christian König <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Signed-off-by: Christian König <[email protected]>

drm/scheduler: add drm_sched_job_add_resv_dependencies

Add a new function to update job dependencies from a resv obj.

Signed-off-by: Christian König <[email protected]>
Reviewed-by: Luben Tuikov <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/scheduler: remove drm_sched_dependency_optimized

Not used any more.

Signed-off-by: Christian König <[email protected]>
Reviewed-by: Luben Tuikov <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/scheduler: rework entity flush, kill and fini

This was buggy because when we had to wait for entities which were
killed as well we would just deadlock.

Instead move all the dependency handling into the callbacks so that
will all happen asynchronously.

Signed-off-by: Christian König <[email protected]>
Reviewed-by: Luben Tuikov <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/scheduler: rename dependency callback into prepare_job

This now matches much better what this is doing.

Signed-off-by: Christian König <[email protected]>
Reviewed-by: Luben Tuikov <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/amdgpu: revert "implement tdr advanced mode"

This reverts commit e6c6338f393b74ac0b303d567bb918b44ae7ad75.

This feature basically re-submits one job after another to
figure out which one was the one causing a hang.

This is obviously incompatible with gang-submit which requires
that multiple jobs run at the same time. It's also absolutely
not helpful to crash the hardware multiple times if a clean
recovery is desired.

For testing and debugging environments we should rather disable
recovery alltogether to be able to inspect the state with a hw
debugger.

Additional to that the sw implementation is clearly buggy and causes
reference count issues for the hardware fence.

Signed-off-by: Christian König <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/scheduler: Fix lockup in drm_sched_entity_kill()

The drm_sched_entity_kill() is invoked twice by drm_sched_entity_destroy()
while userspace process is exiting or being killed. First time it's invoked
when sched entity is flushed and second time when entity is released. This
causes a lockup within wait_for_completion(entity_idle) due to how completion
API works.

Calling wait_for_completion() more times than complete() was invoked is a
error condition that causes lockup because completion internally uses
counter for complete/wait calls. The complete_all() must be used instead
in such cases.

This patch fixes lockup of Panfrost driver that is reproducible by killing
any application in a middle of 3d drawing operation.

Fixes: 2fdb8a8f07c2 ("drm/scheduler: rework entity flush, kill and fini")
Signed-off-by: Dmitry Osipenko <[email protected]>
Reviewed-by: Christian König <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/scheduler: deprecate drm_sched_resubmit_jobs

This interface is not working as it should.

Signed-off-by: Christian König <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/scheduler: track GPU active time per entity

Track the accumulated time that jobs from this entity were active
on the GPU. This allows drivers using the scheduler to trivially
implement the DRM fdinfo when the hardware doesn't provide more
specific information than signalling job completion anyways.

[Bagas: Append missing colon to @elapsed_ns]
Signed-off-by: Bagas Sanjaya <[email protected]>
Signed-off-by: Lucas Stach <[email protected]>
Reviewed-by: Andrey Grodzovsky <[email protected]>

drm/sched: Create wrapper to add a syncobj dependency to job

In order to add a syncobj's fence as a dependency to a job, it is
necessary to call drm_syncobj_find_fence() to find the fence and then
add the dependency with drm_sched_job_add_dependency(). So, wrap these
steps in one single function, drm_sched_job_add_syncobj_dependency().

Reviewed-by: Christian König <[email protected]>
Reviewed-by: Luben Tuikov <[email protected]>
Signed-off-by: Maíra Canal <[email protected]>
Signed-off-by: Maíra Canal <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/scheduler: Fix variable name in function description

Compiling AMD GPU drivers displays two warnings:

drivers/gpu/drm/scheduler/sched_main.c:738: warning: Function parameter or member 'file' not described in 'drm_sched_job_add_syncobj_dependency'
drivers/gpu/drm/scheduler/sched_main.c:738: warning: Excess function
parameter 'file_private' description in
'drm_sched_job_add_syncobj_dependency'

Get rid of them by renaming the variable name on the function description

Signed-off-by: Caio Novais <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Luben Tuikov <[email protected]>
Signed-off-by: Luben Tuikov <[email protected]>

drm/scheduler: Add fence deadline support

As the finished fence is the one that is exposed to userspace, and
therefore the one that other operations, like atomic update, would
block on, we need to propagate the deadline from from the finished
fence to the actual hw fence.

v2: Split into drm_sched_fence_set_parent() (ckoenig)
v3: Ensure a thread calling drm_sched_fence_set_deadline_finished() sees
    fence->parent set before drm_sched_fence_set_parent() does this
    test_bit(DMA_FENCE_FLAG_HAS_DEADLINE_BIT).

Signed-off-by: Rob Clark <[email protected]>
Acked-by: Luben Tuikov <[email protected]>

Revert "drm/scheduler: track GPU active time per entity"

This reverts commit df622729ddbf as it introduces a use-after-free,
which isn't easy to fix without going back to the design drawing board.

Reported-by: Danilo Krummrich <[email protected]>
Signed-off-by: Lucas Stach <[email protected]>

drm/scheduler: Fix UAF race in drm_sched_entity_push_job()

After a job is pushed into the queue, it is owned by the scheduler core
and may be freed at any time, so we can't write nor read the submit
timestamp after that point.

Fixes oopses observed with the drm/asahi driver, found with kASAN.

Signed-off-by: Asahi Lina <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Luben Tuikov <[email protected]>
Signed-off-by: Luben Tuikov <[email protected]>

drm/sched: Check scheduler ready before calling timeout handling

During an IGT GPU reset test we see the following oops,

[  +0.000003] ------------[ cut here ]------------
[  +0.000000] WARNING: CPU: 9 PID: 0 at kernel/workqueue.c:1656 __queue_delayed_work+0x6d/0xa0
[  +0.000004] Modules linked in: iptable_filter bpfilter amdgpu(OE) nls_iso8859_1 snd_hda_codec_realtek snd_hda_codec_generic intel_rapl_msr ledtrig_audio snd_hda_codec_hdmi intel_rapl_common snd_hda_intel edac_mce_amd snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core iommu_v2 gpu_sched(OE) kvm_amd drm_buddy snd_hwdep kvm video drm_ttm_helper snd_pcm ttm snd_seq_midi drm_display_helper snd_seq_midi_event snd_rawmidi cec crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 snd_seq aesni_intel rc_core crypto_simd cryptd binfmt_misc drm_kms_helper rapl snd_seq_device input_leds joydev snd_timer i2c_algo_bit syscopyarea snd ccp sysfillrect sysimgblt wmi_bmof k10temp soundcore mac_hid sch_fq_codel msr parport_pc ppdev drm lp parport ramoops reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 hid_generic usbhid hid r8169 ahci xhci_pci gpio_amdpt realtek i2c_piix4 wmi crc32_pclmul xhci_pci_renesas libahci gpio_generic
[  +0.000070] CPU: 9 PID: 0 Comm: swapper/9 Tainted: G        W OE      6.1.11+ #2
[  +0.000003] Hardware name: Gigabyte Technology Co., Ltd. AB350-Gaming 3/AB350-Gaming 3-CF, BIOS F7 06/16/2017
[  +0.000001] RIP: 0010:__queue_delayed_work+0x6d/0xa0
[  +0.000003] Code: 7a 50 48 01 c1 48 89 4a 30 81 ff 00 20 00 00 75 38 4c 89 cf e8 64 3e 0a 00 5d e9 1e c5 11 01 e8 99 f7 ff ff 5d e9 13 c5 11 01 <0f> 0b eb c1 0f 0b 48 81 7a 38 70 5c 0e 81 74 9f 0f 0b 48 8b 42 28
[  +0.000002] RSP: 0018:ffffc90000398d60 EFLAGS: 00010007
[  +0.000002] RAX: ffff88810d589c60 RBX: 0000000000000000 RCX: 0000000000000000
[  +0.000002] RDX: ffff88810d589c58 RSI: 0000000000000000 RDI: 0000000000002000
[  +0.000001] RBP: ffffc90000398d60 R08: 0000000000000000 R09: ffff88810d589c78
[  +0.000002] R10: 72705f305f39765f R11: 7866673a6d72645b R12: ffff88810d589c58
[  +0.000001] R13: 0000000000002000 R14: 0000000000000000 R15: 0000000000000000
[  +0.000002] FS:  0000000000000000(0000) GS:ffff8887fee40000(0000) knlGS:0000000000000000
[  +0.000001] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000002] CR2: 00005562c4797fa0 CR3: 0000000110da0000 CR4: 00000000003506e0
[  +0.000002] Call Trace:
[  +0.000001]  <IRQ>
[  +0.000001]  mod_delayed_work_on+0x5e/0xa0
[  +0.000004]  drm_sched_fault+0x23/0x30 [gpu_sched]
[  +0.000007]  gfx_v9_0_fault.isra.0+0xa6/0xd0 [amdgpu]
[  +0.000258]  gfx_v9_0_priv_reg_irq+0x29/0x40 [amdgpu]
[  +0.000254]  amdgpu_irq_dispatch+0x1ac/0x2b0 [amdgpu]
[  +0.000243]  amdgpu_ih_process+0x89/0x130 [amdgpu]
[  +0.000245]  amdgpu_irq_handler+0x24/0x60 [amdgpu]
[  +0.000165]  __handle_irq_event_percpu+0x4f/0x1a0
[  +0.000003]  handle_irq_event_percpu+0x15/0x50
[  +0.000001]  handle_irq_event+0x39/0x60
[  +0.000002]  handle_edge_irq+0xa8/0x250
[  +0.000003]  __common_interrupt+0x7b/0x150
[  +0.000002]  common_interrupt+0xc1/0xe0
[  +0.000003]  </IRQ>
[  +0.000000]  <TASK>
[  +0.000001]  asm_common_interrupt+0x27/0x40
[  +0.000002] RIP: 0010:native_safe_halt+0xb/0x10
[  +0.000003] Code: 46 ff ff ff cc cc cc cc cc cc cc cc cc cc cc eb 07 0f 00 2d 69 f2 5e 00 f4 e9 f1 3b 3e 00 90 eb 07 0f 00 2d 59 f2 5e 00 fb f4 <e9> e0 3b 3e 00 0f 1f 44 00 00 55 48 89 e5 53 e8 b1 d4 fe ff 66 90
[  +0.000002] RSP: 0018:ffffc9000018fdc8 EFLAGS: 00000246
[  +0.000002] RAX: 0000000000004000 RBX: 000000000002e5a8 RCX: 000000000000001f
[  +0.000001] RDX: 0000000000000001 RSI: ffff888101298800 RDI: ffff888101298864
[  +0.000001] RBP: ffffc9000018fdd0 R08: 000000527f64bd8b R09: 000000000001dc90
[  +0.000001] R10: 000000000001dc90 R11: 0000000000000003 R12: 0000000000000001
[  +0.000001] R13: ffff888101298864 R14: ffffffff832d9e20 R15: ffff888193aa8c00
[  +0.000003]  ? acpi_idle_do_entry+0x5e/0x70
[  +0.000002]  acpi_idle_enter+0xd1/0x160
[  +0.000003]  cpuidle_enter_state+0x9a/0x6e0
[  +0.000003]  cpuidle_enter+0x2e/0x50
[  +0.000003]  call_cpuidle+0x23/0x50
[  +0.000002]  do_idle+0x1de/0x260
[  +0.000002]  cpu_startup_entry+0x20/0x30
[  +0.000002]  start_secondary+0x120/0x150
[  +0.000003]  secondary_startup_64_no_verify+0xe5/0xeb
[  +0.000004]  </TASK>
[  +0.000000] ---[ end trace 0000000000000000 ]---
[  +0.000003] BUG: kernel NULL pointer dereference, address: 0000000000000102
[  +0.006233] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, signaled seq=3, emitted seq=4
[  +0.000734] #PF: supervisor read access in kernel mode
[  +0.009670] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process amd_deadlock pid 2002 thread amd_deadlock pid 2002
[  +0.005135] #PF: error_code(0x0000) - not-present page
[  +0.000002] PGD 0 P4D 0
[  +0.000002] Oops: 0000 [#1] PREEMPT SMP NOPTI
[  +0.000002] CPU: 9 PID: 0 Comm: swapper/9 Tainted: G        W OE      6.1.11+ #2
[  +0.000002] Hardware name: Gigabyte Technology Co., Ltd. AB350-Gaming 3/AB350-Gaming 3-CF, BIOS F7 06/16/2017
[  +0.012101] amdgpu 0000:0c:00.0: amdgpu: GPU reset begin!
[  +0.005136] RIP: 0010:__queue_work+0x1f/0x4e0
[  +0.000004] Code: 87 cd 11 01 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 49 89 d5 41 54 49 89 f4 53 48 83 ec 10 89 7d d4 <f6> 86 02 01 00 00 01 0f 85 6c 03 00 00 e8 7f 36 08 00 8b 45 d4 48

For gfx_rings the schedulers may not be initialized by
amdgpu_device_init_schedulers() due to ring->no_scheduler flag being set to
true and thus the timeout_wq is NULL. As a result, since all ASICs call
drm_sched_fault() unconditionally even for schedulers which have not been
initialized, it is simpler to use the ready condition which indicates whether
the given scheduler worker thread runs and whether the timeout_wq of the reset
domain has been initialized.

Signed-off-by: Vitaly Prosyak <[email protected]>
Cc: Christian König <[email protected]>
Reviewed-by: Luben Tuikov <[email protected]>
Signed-off-by: Luben Tuikov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

drm/scheduler: set entity to NULL in drm_sched_entity_pop_job()

It already happend a few times that patches slipped through which
implemented access to an entity through a job that was already removed
from the entities queue. Since jobs and entities might have different
lifecycles, this can potentially cause UAF bugs.

In order to make it obvious that a jobs entity pointer shouldn't be
accessed after drm_sched_entity_pop_job() was called successfully, set
the jobs entity pointer to NULL once the job is removed from the entity
queue.

Moreover, debugging a potential NULL pointer dereference is way easier
than potentially corrupted memory through a UAF.

Signed-off-by: Danilo Krummrich <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Luben Tuikov <[email protected]>
Signed-off-by: Luben Tuikov <[email protected]>

drm/scheduler: properly forward fence errors

When a hw fence is signaled with an error properly forward that to the
finished fence.

Signed-off-by: Christian König <[email protected]>
Reviewed-by: Luben Tuikov <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/scheduler: add drm_sched_entity_error and use rcu for last_scheduled

Switch to using RCU handling for the last scheduled job and add a
function to return the error code of it.

Signed-off-by: Christian König <[email protected]>
Reviewed-by: Luben Tuikov <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/scheduler: mark jobs without fence as canceled

When no hw fence is provided for a job that means that the job didn't executed.

Signed-off-by: Christian König <[email protected]>
Reviewed-by: Luben Tuikov <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/sched: Check scheduler work queue before calling timeout handling

During an IGT GPU reset test we see again oops despite of
commit 0c8c901aaaebc9 (drm/sched: Check scheduler ready before calling
timeout handling).

It uses ready condition whether to call drm_sched_fault which unwind
the TDR leads to GPU reset.
However it looks the ready condition is overloaded with other meanings,
for example, for the following stack is related GPU reset :

0  gfx_v9_0_cp_gfx_start
1  gfx_v9_0_cp_gfx_resume
2  gfx_v9_0_cp_resume
3  gfx_v9_0_hw_init
4  gfx_v9_0_resume
5  amdgpu_device_ip_resume_phase2

does the following:
	/* start the ring */
	gfx_v9_0_cp_gfx_start(adev);
	ring->sched.ready = true;

The same approach is for other ASICs as well :
gfx_v8_0_cp_gfx_resume
gfx_v10_0_kiq_resume, etc...

As a result, our GPU reset test causes GPU fault which calls unconditionally gfx_v9_0_fault
and then drm_sched_fault. However now it depends on whether the interrupt service routine
drm_sched_fault is executed after gfx_v9_0_cp_gfx_start is completed which sets the ready
field of the scheduler to true even  for uninitialized schedulers and causes oops vs
no fault or when ISR  drm_sched_fault is completed prior  gfx_v9_0_cp_gfx_start and
NULL pointer dereference does not occur.

Use the field timeout_wq  to prevent oops for uninitialized schedulers.
The field could be initialized by the work queue of resetting the domain.

v1: Corrections to commit message (Luben)

Fixes: 11b3b9f461c5c4 ("drm/sched: Check scheduler ready before calling timeout handling")
Signed-off-by: Vitaly Prosyak <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Luben Tuikov <[email protected]>
Signed-off-by: Luben Tuikov <[email protected]>

drm/sched: Remove redundant check

The rq pointer points inside the drm_gpu_scheduler structure. Thus
it can't be NULL.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Fixes: c61cdbdbffc1 ("drm/scheduler: Fix hang when sched_entity released")
Signed-off-by: Vladislav Efanov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Luben Tuikov <[email protected]>
Signed-off-by: Luben Tuikov <[email protected]>

drm/sched: Rename to drm_sched_can_queue()

Rename drm_sched_ready() to drm_sched_can_queue(). "ready" can mean many
things and is thus meaningless in this context. Instead, rename to a name
which precisely conveys what is being checked.

Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Signed-off-by: Luben Tuikov <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

drm/sched: Rename to drm_sched_wakeup_if_can_queue()

Rename drm_sched_wakeup() to drm_sched_wakeup_if_canqueue() since the former
is misleading, as it wakes up the GPU scheduler _only if_ more jobs can be
queued to the underlying hardware.

This distinction is important to make, since the wake conditional in the GPU
scheduler thread wakes up when other conditions are also true, e.g. when there
are jobs to be cleaned. For instance, a user might want to wake up the
scheduler only because there are more jobs to clean, but whether we can queue
more jobs is irrelevant.

v2: Separate "canqueue" to "can_queue". (Alex D.)

Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Signed-off-by: Luben Tuikov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Alex Deucher <[email protected]>

drm/scheduler: avoid infinite loop if entity's dependency is a scheduled error fence

[Why]
drm_sched_entity_add_dependency_cb ignores the scheduled fence and return false.
If entity's dependency is a scheduler error fence and drm_sched_stop is called
due to TDR, drm_sched_entity_pop_job will wait for the dependency infinitely.

[How]
Do not wait or ignore the scheduled error fence, add drm_sched_entity_wakeup
callback for the dependency with scheduled error fence.

Signed-off-by: ZhenGuo Yin <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/sched: Make sure we wait for all dependencies in kill_jobs_cb()

drm_sched_entity_kill_jobs_cb() logic is omitting the last fence popped
from the dependency array that was waited upon before
drm_sched_entity_kill() was called (drm_sched_entity::dependency field),
so we're basically waiting for all dependencies except one.

In theory, this wait shouldn't be needed because resources should have
their users registered to the dma_resv object, thus guaranteeing that
future jobs wanting to access these resources wait on all the previous
users (depending on the access type, of course). But we want to keep
these explicit waits in the kill entity path just in case.

Let's make sure we keep all dependencies in the array in
drm_sched_job_dependency(), so we can iterate over the array and wait
in drm_sched_entity_kill_jobs_cb().

We also make sure we wait on drm_sched_fence::finished if we were
originally asked to wait on drm_sched_fence::scheduled. In that case,
we assume the intent was to delegate the wait to the firmware/GPU or
rely on the pipelining done at the entity/scheduler level, but when
killing jobs, we really want to wait for completion not just scheduling.

v2:
- Don't evict deps in drm_sched_job_dependency()

v3:
- Always wait for drm_sched_fence::finished fences in
  drm_sched_entity_kill_jobs_cb() when we see a sched_fence

v4:
- Fix commit message
- Fix a use-after-free bug

v5:
- Flag deps on which we should only wait for the scheduled event
  at insertion time

v6:
- Back to v4 implementation
- Add Christian's R-b

Cc: Frank Binns <[email protected]>
Cc: Sarah Walker <[email protected]>
Cc: Donald Robson <[email protected]>
Cc: Luben Tuikov <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Sumit Semwal <[email protected]>
Cc: "Christian König" <[email protected]>
Signed-off-by: Boris Brezillon <[email protected]>
Suggested-by: "Christian König" <[email protected]>
Reviewed-by: "Christian König" <[email protected]>
Acked-by: Luben Tuikov <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/sched: Call drm_sched_fence_set_parent() from drm_sched_fence_scheduled()

Drivers that can delegate waits to the firmware/GPU pass the scheduled
fence to drm_sched_job_add_dependency(), and issue wait commands to
the firmware/GPU at job submission time. For this to be possible, they
need all their 'native' dependencies to have a valid parent since this
is where the actual HW fence information are encoded.

In drm_sched_main(), we currently call drm_sched_fence_set_parent()
after drm_sched_fence_scheduled(), leaving a short period of time
during which the job depending on this fence can be submitted.

Since setting parent and signaling the fence are two things that are
kinda related (you can't have a parent if the job hasn't been
scheduled),
it probably makes sense to pass the parent fence to
drm_sched_fence_scheduled() and let it call drm_sched_fence_set_parent()
before it signals the scheduled fence.

Here is a detailed description of the race we are fixing here:

Thread A				Thread B

- calls drm_sched_fence_scheduled()
- signals s_fence->scheduled which
  wakes up thread B

					- entity dep signaled, checking
					  the next dep
					- no more deps waiting
					- entity is picked for job
					  submission by drm_gpu_scheduler
					- run_job() is called
					- run_job() tries to
					  collect native fence info from
					  s_fence->parent, but it's
					  NULL =>
					  BOOM, we can't do our native
					  wait

- calls drm_sched_fence_set_parent()

v2:
* Fix commit message

v3:
* Add a detailed description of the race to the commit message
* Add Luben's R-b

Signed-off-by: Boris Brezillon <[email protected]>
Cc: Frank Binns <[email protected]>
Cc: Sarah Walker <[email protected]>
Cc: Donald Robson <[email protected]>
Cc: Luben Tuikov <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Sumit Semwal <[email protected]>
Cc: "Christian König" <[email protected]>
Reviewed-by: Luben Tuikov <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

dma-buf: add dma_fence_timestamp helper

When a fence signals there is a very small race window where the timestamp
isn't updated yet. sync_file solves this by busy waiting for the
timestamp to appear, but on other ocassions didn't handled this
correctly.

Provide a dma_fence_timestamp() helper function for this and use it in
all appropriate cases.

Another alternative would be to grab the spinlock when that happens.

v2 by teddy: add a wait parameter to wait for the timestamp to show up, in case
   the accurate timestamp is needed and/or the timestamp is not based on
   ktime (e.g. hw timestamp)
v3 chk: drop the parameter again for unified handling

Signed-off-by: Yunxiang Li <[email protected]>
Signed-off-by: Christian König <[email protected]>
Fixes: 1774baa64f93 ("drm/scheduler: Change scheduled fence track v2")
Reviewed-by: Alex Deucher <[email protected]>
CC: [email protected]
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/sched: Convert the GPU scheduler to variable number of run-queues

The GPU scheduler has now a variable number of run-queues, which are set
up at
drm_sched_init() time. This way, each driver announces how many
run-queues it
requires (supports) per each GPU scheduler it creates. Note, that
run-queues
correspond to scheduler "priorities", thus if the number of run-queues
is set
to 1 at drm_sched_init(), then that scheduler supports a single
run-queue,
i.e. single "priority". If a driver further sets a single entity per
run-queue, then this creates a 1-to-1 correspondence between a scheduler
and
a scheduled entity.

Cc: Lucas Stach <[email protected]>
Cc: Russell King <[email protected]>
Cc: Qiang Yu <[email protected]>
Cc: Rob Clark <[email protected]>
Cc: Abhinav Kumar <[email protected]>
Cc: Dmitry Baryshkov <[email protected]>
Cc: Danilo Krummrich <[email protected]>
Cc: Matthew Brost <[email protected]>
Cc: Boris Brezillon <[email protected]>
Cc: Alex Deucher <[email protected]>
Cc: Christian König <[email protected]>
Cc: Emma Anholt <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Luben Tuikov <[email protected]>
Acked-by: Christian König <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

drm/sched: Add drm_sched_wqueue_* helpers

Add scheduler wqueue ready, stop, and start helpers to hide the
implementation details of the scheduler from the drivers.

v2:
  - s/sched_wqueue/sched_wqueue (Luben)
  - Remove the extra white line after the return-statement (Luben)
  - update drm_sched_wqueue_ready comment (Luben)

Cc: Luben Tuikov <[email protected]>
Signed-off-by: Matthew Brost <[email protected]>
Reviewed-by: Luben Tuikov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Luben Tuikov <[email protected]>

drm/sched: Convert drm scheduler to use a work queue rather than kthread

In Xe, the new Intel GPU driver, a choice has made to have a 1 to 1
mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
seems a bit odd but let us explain the reasoning below.

1. In Xe the submission order from multiple drm_sched_entity is not
guaranteed to be the same completion even if targeting the same hardware
engine. This is because in Xe we have a firmware scheduler, the GuC,
which allowed to reorder, timeslice, and preempt submissions. If a using
shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls
apart as the TDR expects submission order == completion order. Using a
dedicated drm_gpu_scheduler per drm_sched_entity solve this problem.

2. In Xe submissions are done via programming a ring buffer (circular
buffer), a drm_gpu_scheduler provides a limit on number of jobs, if the
limit of number jobs is set to RING_SIZE / MAX_SIZE_PER_JOB we get flow
control on the ring for free.

A problem with this design is currently a drm_gpu_scheduler uses a
kthread for submission / job cleanup. This doesn't scale if a large
number of drm_gpu_scheduler are used. To work around the scaling issue,
use a worker rather than kthread for submission / job cleanup.

v2:
  - (Rob Clark) Fix msm build
  - Pass in run work queue
v3:
  - (Boris) don't have loop in worker
v4:
  - (Tvrtko) break out submit ready, stop, start helpers into own patch
v5:
  - (Boris) default to ordered work queue
v6:
  - (Luben / checkpatch) fix alignment in msm_ringbuffer.c
  - (Luben) s/drm_sched_submit_queue/drm_sched_wqueue_enqueue
  - (Luben) Update comment for drm_sched_wqueue_enqueue
  - (Luben) Positive check for submit_wq in drm_sched_init
  - (Luben) s/alloc_submit_wq/own_submit_wq
v7:
  - (Luben) s/drm_sched_wqueue_enqueue/drm_sched_run_job_queue
v8:
  - (Luben) Adjust var names / comments

Signed-off-by: Matthew Brost <[email protected]>
Reviewed-by: Luben Tuikov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Luben Tuikov <[email protected]>

drm/sched: Split free_job into own work item

Rather than call free_job and run_job in same work item have a dedicated
work item for each. This aligns with the design and intended use of work
queues.

v2:
   - Test for DMA_FENCE_FLAG_TIMESTAMP_BIT before setting
     timestamp in free_job() work item (Danilo)
v3:
  - Drop forward dec of drm_sched_select_entity (Boris)
  - Return in drm_sched_run_job_work if entity NULL (Boris)
v4:
  - Replace dequeue with peek and invert logic (Luben)
  - Wrap to 100 lines (Luben)
  - Update comments for *_queue / *_queue_if_ready functions (Luben)
v5:
  - Drop peek argument, blindly reinit idle (Luben)
  - s/drm_sched_free_job_queue_if_ready/drm_sched_free_job_queue_if_done (Luben)
  - Update work_run_job & work_free_job kernel doc (Luben)
v6:
  - Do not move drm_sched_select_entity in file (Luben)

Signed-off-by: Matthew Brost <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Luben Tuikov <[email protected]>
Signed-off-by: Luben Tuikov <[email protected]>

drm/sched: Add drm_sched_start_timeout_unlocked helper

Also add a lockdep assert to drm_sched_start_timeout.

Signed-off-by: Matthew Brost <[email protected]>
Reviewed-by: Luben Tuikov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Luben Tuikov <[email protected]>

drm/sched: Add a helper to queue TDR immediately

Add a helper whereby a driver can invoke TDR immediately.

v2:
 - Drop timeout args, rename function, use mod delayed work (Luben)
v3:
 - s/XE/Xe (Luben)
 - present tense in commit message (Luben)
 - Adjust comment for drm_sched_tdr_queue_imm (Luben)
v4:
 - Adjust commit message (Luben)

Cc: Luben Tuikov <[email protected]>
Signed-off-by: Matthew Brost <[email protected]>
Reviewed-by: Luben Tuikov <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Luben Tuikov <[email protected]>

drm/sched: Rename drm_sched_get_cleanup_job to be more descriptive

"Get cleanup job" makes it sound like helper is returning a job which will
execute some cleanup, or something, while the kerneldoc itself accurately
says "fetch the next _finished_ job". So lets rename the helper to be self
documenting.

Signed-off-by: Tvrtko Ursulin <[email protected]>
Cc: Luben Tuikov <[email protected]>
Cc: Matthew Brost <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Reviewed-by: Luben Tuikov <[email protected]>
Signed-off-by: Luben Tuikov <[email protected]>

drm/sched: Move free worker re-queuing out of the if block

Whether or not there are more jobs to clean up does not depend on the
existance of the current job, given both drm_sched_get_finished_job and
drm_sched_free_job_queue_if_done take and drop the job list lock.
Therefore it is confusing to make it read like there is a dependency.

Signed-off-by: Tvrtko Ursulin <[email protected]>
Cc: Luben Tuikov <[email protected]>
Cc: Matthew Brost <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Reviewed-by: Luben Tuikov <[email protected]>
Signed-off-by: Luben Tuikov <[email protected]>

drm/sched: Rename drm_sched_free_job_queue to be more descriptive

The current name makes it sound like helper will free a queue, while what
it does is it enqueues the free job worker.

Rename it to drm_sched_run_free_queue to align with existing
drm_sched_run_job_queue.

Despite that creating an illusion there are two queues, while in reality
there is only one, at least it creates a consistent naming for the two
enqueuing helpers.

At the same time simplify the "if done" helper by dropping the suffix and
adding a double underscore prefix to the one which just enqueues.

Signed-off-by: Tvrtko Ursulin <[email protected]>
Cc: Luben Tuikov <[email protected]>
Cc: Matthew Brost <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Reviewed-by: Luben Tuikov <[email protected]>
Signed-off-by: Luben Tuikov <[email protected]>

drm/sched: Rename drm_sched_run_job_queue_if_ready and clarify kerneldoc

"If ready" is not immediately clear what it means - is the scheduler
ready or something else? Drop the suffix, clarify kerneldoc, and employ
the same naming scheme as in drm_sched_run_free_queue:

 - drm_sched_run_job_queue   - enqueues if there is something to enqueue
                               *and* scheduler is ready (can queue)
 - __drm_sched_run_job_queue - low-level helper to simply queue the job

Signed-off-by: Tvrtko Ursulin <[email protected]>
Cc: Luben Tuikov <[email protected]>
Cc: Matthew Brost <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Reviewed-by: Luben Tuikov <[email protected]>
Signed-off-by: Luben Tuikov <[email protected]>

drm/sched: Drop suffix from drm_sched_wakeup_if_can_queue

Because a) helper is exported to other parts of the scheduler and
b) there isn't a plain drm_sched_wakeup to begin with, I think we can
drop the suffix and by doing so separate the intimiate knowledge
between the scheduler components a bit better.

Signed-off-by: Tvrtko Ursulin <[email protected]>
Cc: Luben Tuikov <[email protected]>
Cc: Matthew Brost <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Reviewed-by: Luben Tuikov <[email protected]>
Signed-off-by: Luben Tuikov <[email protected]>

drm/sched: Don't disturb the entity when in RR-mode scheduling

Don't call drm_sched_select_entity() in drm_sched_run_job_queue().  In fact,
rename __drm_sched_run_job_queue() to just drm_sched_run_job_queue(), and let
it do just that, schedule the work item for execution.

The problem is that drm_sched_run_job_queue() calls drm_sched_select_entity()
to determine if the scheduler has an entity ready in one of its run-queues,
and in the case of the Round-Robin (RR) scheduling, the function
drm_sched_rq_select_entity_rr() does just that, selects the _next_ entity
which is ready, sets up the run-queue and completion and returns that
entity. The FIFO scheduling algorithm is unaffected.

Now, since drm_sched_run_job_work() also calls drm_sched_select_entity(), then
in the case of RR scheduling, that would result in drm_sched_select_entity()
having been called twice, which may result in skipping a ready entity if more
than one entity is ready. This commit fixes this by eliminating the call to
drm_sched_select_entity() from drm_sched_run_job_queue(), and leaves it only
in drm_sched_run_job_work().

v2: Rebased on top of Tvrtko's renames series of patches. (Luben)
    Add fixes-tag. (Tvrtko)

Signed-off-by: Luben Tuikov <[email protected]>
Fixes: f7fe64ad0f22ff ("drm/sched: Split free_job into own work item")
Reviewed-by: Matthew Brost <[email protected]>
Reviewed-by: Danilo Krummrich <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/sched: Qualify drm_sched_wakeup() by drm_sched_entity_is_ready()

Don't "wake up" the GPU scheduler unless the entity is ready, as well as we
can queue to the scheduler, i.e. there is no point in waking up the scheduler
for the entity unless the entity is ready.

Signed-off-by: Luben Tuikov <[email protected]>
Fixes: bc8d6a9df99038 ("drm/sched: Don't disturb the entity when in RR-mode scheduling")
Reviewed-by: Danilo Krummrich <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/sched: implement dynamic job-flow control

Currently, job flow control is implemented simply by limiting the number
of jobs in flight. Therefore, a scheduler is initialized with a credit
limit that corresponds to the number of jobs w…
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants