tests/boot-mirror: sleep 20s after detaching primary block device
For some reason, if we try to reboot too quickly after detaching the
primary block device, the OS will sometimes crash with:

```
[  100.662358] watchdog: watchdog0: watchdog did not stop!
[  100.969436] watchdog: watchdog0: watchdog did not stop!
[  100.998017] BUG: Unable to handle kernel data access at 0x5deadbeef0000100
[  100.998158] Faulting instruction address: 0xc000000000f219d4
[  100.998264] Oops: Kernel access of bad area, sig: 11 [#1]
[  100.998348] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[  100.998454] Modules linked in: rfkill crct10dif_vpmsum binfmt_misc raid1 xfs zram virtio_net net_failover vmx_crypto
 crc32c_vpmsum pseries_wdt virtio_console failover virtio_blk scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables dm_multipath fuse
[  100.998822] CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 6.4.15-200.fc38.ppc64le #1
[  100.998947] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1203 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries
[  100.999107] NIP:  c000000000f219d4 LR: c000000000f219c8 CTR: c0000000001c60e0
[  100.999229] REGS: c0000000085a3860 TRAP: 0380   Not tainted  (6.4.15-200.fc38.ppc64le)
[  100.999352] MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 44442404  XER: 00000092
[  100.999511] CFAR: c000000000f2191c IRQMASK: 0
[  100.999511] GPR00: c000000000f219c8 c0000000085a3b00 c000000001eea800 c000000002c88ac8
[  100.999511] GPR04: c000000002c88ac8 0000000000000001 0000000000000001 fffffffffffe0000
[  100.999511] GPR08: 0000000000000001 0000000000000001 5deadbeef0000100 0000000000002000
[  100.999511] GPR12: 0000000000000000 c000000002ca0000 0000000000000000 0000000000000000
[  100.999511] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  100.999511] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  100.999511] GPR24: 0000000000000002 0000000000000000 c000000002c888e0 c00000000294e6e8
[  100.999511] GPR28: c000000002c88ac8 5deadbeeeffffd28 c00000001f2f1218 c0000000b8d9d000
[  101.000532] NIP [c000000000f219d4] md_notify_reboot+0x154/0x250
[  101.000643] LR [c000000000f219c8] md_notify_reboot+0x148/0x250
[  101.000748] Call Trace:
[  101.000793] [c0000000085a3b00] [c000000000f219a0] md_notify_reboot+0x120/0x250 (unreliable)
[  101.000922] [c0000000085a3b60] [c000000000199e30] notifier_call_chain+0xc0/0x1b0
[  101.001049] [c0000000085a3bc0] [c00000000019a114] blocking_notifier_call_chain+0x64/0xa0
[  101.001176] [c0000000085a3c00] [c00000000019dbb8] kernel_restart+0x38/0xe0
[  101.001282] [c0000000085a3c70] [c00000000019dfbc] __do_sys_reboot+0x12c/0x2c0
[  101.001409] [c0000000085a3dd0] [c000000000030f34] system_call_exception+0x174/0x320
[  101.001537] [c0000000085a3e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
[  101.001681] --- interrupt: 3000 at 0x7fffb995aa88
[  101.001767] NIP:  00007fffb995aa88 LR: 0000000000000000 CTR: 0000000000000000
[  101.001888] REGS: c0000000085a3e80 TRAP: 3000   Not tainted  (6.4.15-200.fc38.ppc64le)
[  101.002010] MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 48442403  XER: 00000000
[  101.002167] IRQMASK: 0
[  101.002167] GPR00: 0000000000000058 00007fffd8caa5c0 00007fffb9a76f00 fffffffffee1dead
[  101.002167] GPR04: 0000000028121969 0000000001234567 0000000000003a5d 0000000000000020
[  101.002167] GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  101.002167] GPR12: 0000000000000000 00007fffba163680 0000000000000000 0000000000000040
[  101.002167] GPR16: 0000000000000138 0000000000000001 000000010b9213c8 000000010b921a90
[  101.002167] GPR20: 0000000000000000 000000010b9212e0 0000000000000010 000000010b921318
[  101.002167] GPR24: 000000010b9212d0 00007fffd8caa6e0 00007fffd8caa6f8 00007fffd8caa6d8
[  101.002167] GPR28: 00007fffd8caada8 00007fffd8caa6e8 00007fffd8caa6c8 0000000000000000
[  101.003150] NIP [00007fffb995aa88] 0x7fffb995aa88
[  101.003234] LR [0000000000000000] 0x0
[  101.003299] --- interrupt: 3000
[  101.003363] Code: 7f84e378 48423c31 60000000 2c030000 4182000c 7fe3fb78 4bff859d 7f83e378 48465645 60000000 39000001 395d03d8 <e93d03d8> 7fbfeb78 7c2ad800 3929fc28
[  101.003604] ---[ end trace 0000000000000000 ]---
[  101.014145] pstore: backend (nvram) writing error (-1)
[  101.014255]
[  102.014301] note: systemd-shutdow[1] exited with irqs disabled
[  102.014503] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
```

The `md_notify_reboot` frame at the faulting instruction implicates the MD
(software RAID) code in the kernel.
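
As a minimal sketch of how that symbol can be picked out of a console log
automatically, the Go snippet below matches the symbolized `NIP [...]` line of
a ppc64le oops. The `faultingSymbol` helper and its regular expression are
illustrative assumptions, not part of kola or of this commit.

```go
package main

import (
	"fmt"
	"regexp"
)

// nipRe matches the symbolized NIP line of a ppc64le oops, e.g.
// "NIP [c000000000f219d4] md_notify_reboot+0x154/0x250".
// Hypothetical pattern written for this example only.
var nipRe = regexp.MustCompile(`NIP \[[0-9a-f]+\] ([A-Za-z_][A-Za-z0-9_.]*)\+0x[0-9a-f]+/0x[0-9a-f]+`)

// faultingSymbol returns the kernel symbol named on the NIP line,
// or "" if the line carries no symbolized NIP (for example the
// userspace "NIP [...] 0x7fff..." line further down the trace).
func faultingSymbol(line string) string {
	m := nipRe.FindStringSubmatch(line)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	line := "[  101.000532] NIP [c000000000f219d4] md_notify_reboot+0x154/0x250"
	fmt.Println(faultingSymbol(line))
}
```

A symbol with the `md_` prefix points at `drivers/md`, which is consistent with
`raid1` appearing in the list of loaded modules.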

This is likely a kernel bug that needs fixing.
jlebon committed Sep 12, 2023
1 parent c506d88 commit caa66d4
Showing 1 changed file with 5 additions and 0 deletions.
5 changes: 5 additions & 0 deletions mantle/kola/tests/misc/boot-mirror.go
@@ -222,6 +222,7 @@ func detachPrimaryBlockDevice(c cluster.TestCluster, m platform.Machine) {
 	if err := m.(platform.QEMUMachine).RemovePrimaryBlockDevice(); err != nil {
 		c.Fatalf("failed to delete the first boot disk: %v", err)
 	}
+
 	// Check if we can still SSH into the machine. We've noticed sometimes
 	// that after removing the primary device, we lose connectivity.
 	if err := ut.Retry(5, 3*time.Second, func() error {
@@ -230,6 +231,10 @@ func detachPrimaryBlockDevice(c cluster.TestCluster, m platform.Machine) {
 	}); err != nil {
 		c.Fatalf("Failed to retrieve boot ID: %v", err)
 	}
+
+	// Give some time to the host before doing the reboot.
+	time.Sleep(20 * time.Second)
+
 	err := m.Reboot()
 	if err != nil {
 		c.Fatalf("Failed to reboot the machine: %v", err)
