Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BSOD during physical disk removal from zpool #206

Closed
arun-kv opened this issue Feb 14, 2023 · 13 comments
Closed

BSOD during physical disk removal from zpool #206

arun-kv opened this issue Feb 14, 2023 · 13 comments

Comments

@arun-kv
Copy link

arun-kv commented Feb 14, 2023

Describe the problem you're observing

Verify VERIFY(list_is_empty(&lwb->lwb_itxs)) failure in zil_free_lwb while running zpool.exe remove zpool-name "physical disk name"

Include any warning/errors/backtraces from the system logs

ZFSin!panic+0x20 [C:\BuildAgent\work\9f5ca3432854b81\module\os\windows\spl\spl-debug.c @ 32] 
ZFSin!zil_free_lwb+0x87 [C:\BuildAgent\work\9f5ca3432854b81\module\zfs\zil.c @ 603] 
ZFSin!zil_destroy+0x15b [C:\BuildAgent\work\9f5ca3432854b81\module\zfs\zil.c @ 777] 
ZFSin!zil_suspend+0x2b8 [C:\BuildAgent\work\9f5ca3432854b81\module\zfs\zil.c @ 3453] 
ZFSin!zil_reset+0x1e [C:\BuildAgent\work\9f5ca3432854b81\module\zfs\zil.c @ 3660] 
ZFSin!dmu_objset_find_impl+0x6f8 [C:\BuildAgent\work\9f5ca3432854b81\module\zfs\dmu_objset.c @ 2934] 
ZFSin!dmu_objset_find_impl+0x3b3 [C:\BuildAgent\work\9f5ca3432854b81\module\zfs\dmu_objset.c @ 2874] 
ZFSin!dmu_objset_find+0xed [C:\BuildAgent\work\9f5ca3432854b81\module\zfs\dmu_objset.c @ 2952] 
ZFSin!spa_reset_logs+0x26 [C:\BuildAgent\work\9f5ca3432854b81\module\zfs\spa.c @ 2244] 
ZFSin!spa_vdev_remove_top+0x1a3 [C:\BuildAgent\work\9f5ca3432854b81\module\zfs\vdev_removal.c @ 2172] 
ZFSin!spa_vdev_remove+0x48f [C:\BuildAgent\work\9f5ca3432854b81\module\zfs\vdev_removal.c @ 2334] 
ZFSin!zfs_ioc_vdev_remove+0x262 [C:\BuildAgent\work\9f5ca3432854b81\module\zfs\zfs_ioctl.c @ 1931] 
ZFSin!zfsdev_ioctl_common+0x2ef [C:\BuildAgent\work\9f5ca3432854b81\module\zfs\zfs_ioctl.c @ 7853] 
ZFSin!zfsdev_ioctl+0xc4 [C:\BuildAgent\work\9f5ca3432854b81\module\os\windows\zfs\zfs_ioctl_os.c @ 911] 
ZFSin!ioctlDispatcher+0x54f [C:\BuildAgent\work\9f5ca3432854b81\module\os\windows\zfs\zfs_vnops_windows.c @ 4606] 
ZFSin!dispatcher+0x1fd [C:\BuildAgent\work\9f5ca3432854b81\module\os\windows\zfs\zfs_vnops_windows.c @ 5529] 
nt!IofCallDriver+0x59
nt!IopSynchronousServiceTail+0x1b1
nt!IopXxxControlFile+0xe0c
nt!NtDeviceIoControlFile+0x56
nt!KiSystemServiceCopyEnd+0x25
0x00007ffd`0acbfad4

Describe how to reproduce the problem

I was able to reproduce the issue by introducing the some sleep in the code and also removing the if (zilog->zl_suspend > 0) from zil_commit.

I see zilog->zl_suspend is not protected by any lock, and can this cause a raise condition and zil_commit miss the update done by zil_suspend on zilog->zl_suspend variable?.

I removed the if (zilog->zl_suspend > 0) to see how the system will behave if the zl_suspend is not properly protected.

Below is the patch i used to reproduce the issue,

Steps to reproduce the issue,

  1. Create zvol with sync=always
  2. Run some IO on zvol (eg: ddrelease64.exe if=/dev/random of=\.\PHYSICALDRIVE# bs=512 count=1)
  3. Run zpool.exe remove zpool , while the code is sleeping in zil_suspend
  4. Trigger a small IO on the zvol (eg: ddrelease64.exe if=/dev/random of=\.\PHYSICALDRIVE# bs=512 count=1)
diff --git a/module/zfs/zil.c b/module/zfs/zil.c
index 0f9d196a3..44421847b 100644
--- a/module/zfs/zil.c
+++ b/module/zfs/zil.c
@@ -2237,6 +2237,10 @@ zil_commit_writer_stall(zilog_t *zilog)
  * created lwbs. Additionally, as a new lwb is created, the previous
  * lwb will be issued to the zio layer to be written to disk.
  */
+
+int arun_debug1 = 0;
+int arun_sleep1 = 500;
+
 static void
 zil_process_commit_list(zilog_t *zilog)
 {
@@ -2335,10 +2339,17 @@ zil_process_commit_list(zilog_t *zilog)
                        if (lwb != NULL) {
                                lwb = zil_lwb_commit(zilog, itx, lwb);

-                               if (lwb == NULL)
+                               if (lwb == NULL) {
                                        list_insert_tail(&nolwb_itxs, itx);
-                               else
+                               } else {
                                        list_insert_tail(&lwb->lwb_itxs, itx);
+
+#ifdef _KERNEL
+                                       if (arun_debug1) {
+                                               IOSleep(arun_sleep1);
+                                       }
+#endif
+                               }
                        } else {
                                if (lrc->lrc_txtype == TX_COMMIT) {
                                        zil_commit_waiter_link_nolwb(
@@ -2454,6 +2465,7 @@ zil_process_commit_list(zilog_t *zilog)
  * not issued, we rely on future calls to zil_commit_writer() to issue
  * the lwb, or the timeout mechanism found in zil_commit_waiter().
  */
+
 static void
 zil_commit_writer(zilog_t *zilog, zil_commit_waiter_t *zcw)
 {
@@ -2948,10 +2960,10 @@ zil_commit(zilog_t *zilog, uint64_t foid)
         * simply rely on txg_wait_synced() to maintain the necessary
         * semantics, and avoid calling those functions altogether.
         */
-       if (zilog->zl_suspend > 0) {
+       /*if (zilog->zl_suspend > 0) {
                txg_wait_synced(zilog->zl_dmu_pool, 0);
                return;
-       }
+       }*/

        zil_commit_impl(zilog, foid);
 }
@@ -3338,6 +3350,9 @@ static char *suspend_tag = "zil suspending";
  * Otherwise, it returns with the dataset "long held", and the cookie
  * should be passed into zil_resume().
  */
+int arun_debug2 = 0;
+int arun_sleep2 = 500;
+
 int
 zil_suspend(const char *osname, void **cookiep)
 {
@@ -3449,6 +3464,12 @@ zil_suspend(const char *osname, void **cookiep)
         */
        txg_wait_synced(zilog->zl_dmu_pool, 0);

+
+#ifdef _KERNEL
+        if (arun_debug2) {
+                IOSleep(arun_sleep2);
+        }
+#endif
        zil_destroy(zilog, B_FALSE);

        mutex_enter(&zilog->zl_lock);
@lundman
Copy link

lundman commented Feb 15, 2023

Looks like you are onto something here, I wonder if I can some people more familiar with the ZIL to take a peek.

@arun-kv
Copy link
Author

arun-kv commented Feb 15, 2023

@lundman sure, thank you.

@arun-kv
Copy link
Author

arun-kv commented Feb 21, 2023

Hi @lundman, did you get any response from the community?

ryao added a commit to ryao/zfs that referenced this issue Feb 21, 2023
openzfsonwindows#206 found that it is possible to trip
`VERIFY(list_is_empty(&lwb->lwb_itxs))` when a zil_commit() is delayed
by the scheduler long enough for a parallel zil_suspend() operation to
exit `zil_commit_impl()`. This is a data race. To prevent this, we
introduce a `zilog->zl_commit_lock` rwlock to ensure that all
outstanding `zil_commit()` operations finish before `zil_suspend()`
begins and that subsequent operations fallback to txg_wait_synced()
after `zil_suspend()` has begun.

On PREEMPT_RT Linux kernels, the rw_enter() implementation suffers from
writer starvation. This means that a ZIL intensive system can delay
zil_suspend indefinitely. This is a pre-existing problem that affects
everything that uses rw locks, so it needs to be addressed in the SPL.
However, builds against PREEMPT_RT Linux kernels are currently broken
due to a GPL symbol issue (openzfs#11097), so we can safely disregard that
issue for now.

Signed-off-by: Richard Yao <[email protected]>
ryao added a commit to ryao/zfs that referenced this issue Feb 21, 2023
openzfsonwindows#206 found that it is possible to trip
`VERIFY(list_is_empty(&lwb->lwb_itxs))` when a zil_commit() is delayed
by the scheduler long enough for a parallel zil_suspend() operation to
exit `zil_commit_impl()`. This is a data race. To prevent this, we
introduce a `zilog->zl_commit_lock` rwlock to ensure that all
outstanding `zil_commit()` operations finish before `zil_suspend()`
begins and that subsequent operations fallback to txg_wait_synced()
after `zil_suspend()` has begun.

On `PREEMPT_RT` Linux kernels, the `rw_enter()` implementation suffers
from writer starvation. This means that a ZIL intensive system can delay
`zil_suspend()` indefinitely. This is a pre-existing problem that
affects everything that uses rw locks, so it needs to be addressed in
the SPL.  However, builds against `PREEMPT_RT` Linux kernels are
currently broken due to a GPL symbol issue (openzfs#11097), so we can safely
disregard that issue for now.

Signed-off-by: Richard Yao <[email protected]>
@ryao
Copy link

ryao commented Feb 21, 2023

@arun-kv I just heard about this today.

That said, you found a data race inside ZIL. I imagine that we did not detect this on other platforms because the schedulers on those platforms made losing this race unlikely. I opened openzfs#14514 with a patch that should prevent this from happening.

ryao added a commit to ryao/zfs that referenced this issue Feb 21, 2023
openzfsonwindows#206 found that it is possible to trip
`VERIFY(list_is_empty(&lwb->lwb_itxs))` when a zil_commit() is delayed
by the scheduler long enough for a parallel zil_suspend() operation to
exit `zil_commit_impl()`. This is a data race. To prevent this, we
introduce a `zilog->zl_commit_lock` rwlock to ensure that all
outstanding `zil_commit()` operations finish before `zil_suspend()`
begins and that subsequent operations fallback to txg_wait_synced()
after `zil_suspend()` has begun.

On `PREEMPT_RT` Linux kernels, the `rw_enter()` implementation suffers
from writer starvation. This means that a ZIL intensive system can delay
`zil_suspend()` indefinitely. This is a pre-existing problem that
affects everything that uses rw locks, so it needs to be addressed in
the SPL.  However, builds against `PREEMPT_RT` Linux kernels are
currently broken due to a GPL symbol issue (openzfs#11097), so we can safely
disregard that issue for now.

Signed-off-by: Richard Yao <[email protected]>
ryao added a commit to ryao/zfs that referenced this issue Feb 21, 2023
openzfsonwindows#206 found that it is possible to trip
`VERIFY(list_is_empty(&lwb->lwb_itxs))` when a `zil_commit()` is delayed
by the scheduler long enough for a parallel `zil_suspend()` operation to
exit `zil_commit_impl()`. This is a data race. To prevent this, we
introduce a `zilog->zl_commit_lock` rwlock to ensure that all
outstanding `zil_commit()` operations finish before `zil_suspend()`
begins and that subsequent operations fallback to `txg_wait_synced()`
after `zil_suspend()` has begun.

On `PREEMPT_RT` Linux kernels, the `rw_enter()` implementation suffers
from writer starvation. This means that a ZIL intensive system can delay
`zil_suspend()` indefinitely. This is a pre-existing problem that
affects everything that uses rw locks, so it needs to be addressed in
the SPL.  However, builds against `PREEMPT_RT` Linux kernels are
currently broken due to a GPL symbol issue (openzfs#11097), so we can safely
disregard that issue for now.

Signed-off-by: Richard Yao <[email protected]>
ryao added a commit to ryao/zfs that referenced this issue Feb 21, 2023
openzfsonwindows#206 found that it is possible to trip
`VERIFY(list_is_empty(&lwb->lwb_itxs))` when a `zil_commit()` is delayed
by the scheduler long enough for a parallel `zil_suspend()` operation to
exit `zil_commit_impl()`. This is a data race. To prevent this, we
introduce a `zilog->zl_commit_lock` rwlock to ensure that all
outstanding `zil_commit()` operations finish before `zil_suspend()`
begins and that subsequent operations fallback to `txg_wait_synced()`
after `zil_suspend()` has begun.

On `PREEMPT_RT` Linux kernels, the `rw_enter()` implementation suffers
from writer starvation. This means that a ZIL intensive system can delay
`zil_suspend()` indefinitely. This is a pre-existing problem that
affects everything that uses rw locks, so it needs to be addressed in
the SPL.  However, builds against `PREEMPT_RT` Linux kernels are
currently broken due to a GPL symbol issue (openzfs#11097), so we can safely
disregard that issue for now.

Reported-by: Arun KV <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
@arun-kv
Copy link
Author

arun-kv commented Feb 22, 2023

@ryao @lundman, Thank you for looking into the issue.

ryao added a commit to ryao/zfs that referenced this issue Feb 28, 2023
openzfsonwindows#206 found that it is possible to trip
`VERIFY(list_is_empty(&lwb->lwb_itxs))` when a `zil_commit()` is delayed
by the scheduler long enough for a parallel `zil_suspend()` operation to
exit `zil_commit_impl()`. This is a data race. To prevent this, we
introduce a `zilog->zl_commit_lock` rwlock to ensure that all
outstanding `zil_commit()` operations finish before `zil_suspend()`
begins and that subsequent operations fallback to `txg_wait_synced()`
after `zil_suspend()` has begun.

On `PREEMPT_RT` Linux kernels, the `rw_enter()` implementation suffers
from writer starvation. This means that a ZIL intensive system can delay
`zil_suspend()` indefinitely. This is a pre-existing problem that
affects everything that uses rw locks, so it needs to be addressed in
the SPL.  However, builds against `PREEMPT_RT` Linux kernels are
currently broken due to a GPL symbol issue (openzfs#11097), so we can safely
disregard that issue for now.

Reported-by: Arun KV <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
ryao added a commit to ryao/zfs that referenced this issue Mar 1, 2023
openzfsonwindows#206 found that it is possible to trip
`VERIFY(list_is_empty(&lwb->lwb_itxs))` when a `zil_commit()` is delayed
by the scheduler long enough for a parallel `zil_suspend()` operation to
exit `zil_commit_impl()`. This is a data race. To prevent this, we
introduce a `zilog->zl_suspend_lock` rwlock to ensure that all
outstanding `zil_commit()` operations finish before `zil_suspend()`
begins and that subsequent operations fallback to `txg_wait_synced()`
after `zil_suspend()` has begun.

On `PREEMPT_RT` Linux kernels, the `rw_enter()` implementation suffers
from writer starvation. This means that a ZIL intensive system can delay
`zil_suspend()` indefinitely. This is a pre-existing problem that
affects everything that uses rw locks, so it needs to be addressed in
the SPL.  However, builds against `PREEMPT_RT` Linux kernels are
currently broken due to a GPL symbol issue (openzfs#11097), so we can safely
disregard that issue for now.

Reported-by: Arun KV <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
behlendorf pushed a commit to openzfs/zfs that referenced this issue Mar 1, 2023
openzfsonwindows#206 found that it is possible to trip
`VERIFY(list_is_empty(&lwb->lwb_itxs))` when a `zil_commit()` is delayed
by the scheduler long enough for a parallel `zil_suspend()` operation to
exit `zil_commit_impl()`. This is a data race. To prevent this, we
introduce a `zilog->zl_suspend_lock` rwlock to ensure that all
outstanding `zil_commit()` operations finish before `zil_suspend()`
begins and that subsequent operations fallback to `txg_wait_synced()`
after `zil_suspend()` has begun.

On `PREEMPT_RT` Linux kernels, the `rw_enter()` implementation suffers
from writer starvation. This means that a ZIL intensive system can delay
`zil_suspend()` indefinitely. This is a pre-existing problem that
affects everything that uses rw locks, so it needs to be addressed in
the SPL.  However, builds against `PREEMPT_RT` Linux kernels are
currently broken due to a GPL symbol issue (#11097), so we can safely
disregard that issue for now.

Reported-by: Arun KV <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes #14514
@ryao
Copy link

ryao commented Mar 2, 2023

The solution for this was revised after feedback and then merged to master as openzfs/zfs@4c856fb.

@arun-kv
Copy link
Author

arun-kv commented Mar 3, 2023

Thanks @ryao

@arun-kv
Copy link
Author

arun-kv commented Mar 3, 2023

@lundman when can we merge this into windows repo?

@lundman
Copy link

lundman commented Mar 3, 2023

I'm doing a rebase right now - just fighting git at the moment

lundman pushed a commit that referenced this issue Mar 3, 2023
#206 found that it is possible to trip
`VERIFY(list_is_empty(&lwb->lwb_itxs))` when a `zil_commit()` is delayed
by the scheduler long enough for a parallel `zil_suspend()` operation to
exit `zil_commit_impl()`. This is a data race. To prevent this, we
introduce a `zilog->zl_suspend_lock` rwlock to ensure that all
outstanding `zil_commit()` operations finish before `zil_suspend()`
begins and that subsequent operations fallback to `txg_wait_synced()`
after `zil_suspend()` has begun.

On `PREEMPT_RT` Linux kernels, the `rw_enter()` implementation suffers
from writer starvation. This means that a ZIL intensive system can delay
`zil_suspend()` indefinitely. This is a pre-existing problem that
affects everything that uses rw locks, so it needs to be addressed in
the SPL.  However, builds against `PREEMPT_RT` Linux kernels are
currently broken due to a GPL symbol issue (openzfs#11097), so we can safely
disregard that issue for now.

Reported-by: Arun KV <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#14514
datacore-rm pushed a commit to DataCoreSoftware/openzfs that referenced this issue Mar 7, 2023
openzfsonwindows#206 found that it is possible to trip
`VERIFY(list_is_empty(&lwb->lwb_itxs))` when a `zil_commit()` is delayed
by the scheduler long enough for a parallel `zil_suspend()` operation to
exit `zil_commit_impl()`. This is a data race. To prevent this, we
introduce a `zilog->zl_suspend_lock` rwlock to ensure that all
outstanding `zil_commit()` operations finish before `zil_suspend()`
begins and that subsequent operations fallback to `txg_wait_synced()`
after `zil_suspend()` has begun.

On `PREEMPT_RT` Linux kernels, the `rw_enter()` implementation suffers
from writer starvation. This means that a ZIL intensive system can delay
`zil_suspend()` indefinitely. This is a pre-existing problem that
affects everything that uses rw locks, so it needs to be addressed in
the SPL.  However, builds against `PREEMPT_RT` Linux kernels are
currently broken due to a GPL symbol issue (openzfs#11097), so we can safely
disregard that issue for now.

Reported-by: Arun KV <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#14514
@ryao
Copy link

ryao commented Mar 19, 2023

@lundman I suspect that we can close this. I would close it for you, but I do not have privileges to close it.

@lundman lundman closed this as completed Mar 19, 2023
datacore-rm pushed a commit to DataCoreSoftware/openzfs that referenced this issue Mar 21, 2023
openzfsonwindows#206 found that it is possible to trip
`VERIFY(list_is_empty(&lwb->lwb_itxs))` when a `zil_commit()` is delayed
by the scheduler long enough for a parallel `zil_suspend()` operation to
exit `zil_commit_impl()`. This is a data race. To prevent this, we
introduce a `zilog->zl_suspend_lock` rwlock to ensure that all
outstanding `zil_commit()` operations finish before `zil_suspend()`
begins and that subsequent operations fallback to `txg_wait_synced()`
after `zil_suspend()` has begun.

On `PREEMPT_RT` Linux kernels, the `rw_enter()` implementation suffers
from writer starvation. This means that a ZIL intensive system can delay
`zil_suspend()` indefinitely. This is a pre-existing problem that
affects everything that uses rw locks, so it needs to be addressed in
the SPL.  However, builds against `PREEMPT_RT` Linux kernels are
currently broken due to a GPL symbol issue (openzfs#11097), so we can safely
disregard that issue for now.

Reported-by: Arun KV <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes openzfs#14514
@amotin
Copy link

amotin commented Jun 15, 2023

Since the original fix was reverted, I've created alternative one: openzfs#14979 . Reviewers and testers are welcome.

@sskras
Copy link

sskras commented Jun 15, 2023

@amotin: could you give a short summary of why the fix was reverted in upstream? I don't see commit of the reversal.

@amotin
Copy link

amotin commented Jun 15, 2023

@sskras There were some deadlock reports. I haven't investigated what caused them, but implemented it in different way without introducing new locks. See: openzfs#14790 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants