Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thread watching debounced events hanging #205

Open
itkovian opened this issue Jun 14, 2019 · 19 comments
Open

Thread watching debounced events hanging #205

itkovian opened this issue Jun 14, 2019 · 19 comments
Labels
B-debouncer debouncer related V-4 Only on major version 4

Comments

@itkovian
Copy link
Contributor

System details

  • Linux, CentOS 7.6, kernel 3.10.0-957.12.2.el7.ug.x86_64 (custom rebuild, with some support for GPFS)
  • rustc 1.34.2 (6c2484dc3 2019-05-13)
  • notify main brach, commit 56aac12

I use notify in https://github.com/itkovian/sarchive/. The issue I opened for this is at itkovian/sarchive#20.

  • FS: /dev/mapper/vg0-scratch on /tmp type ext4 (rw,relatime,data=ordered)
  • Hypervisor is kvm, in an OpenNebula env.
  • Underlying storage is Ceph (but I do not think that is relevant here)

What you did (as detailed as you can)

  • I created 50K files, in a directory that was watched. The watcher threads always stops, but not always after the same number of files.
  • The notify queue size was raised to 64K
  • These are the events seen:
[2019-06-14T11:36:08.930803788+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/1.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.931012465+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/2.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.931178128+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/3.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.931341658+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/4.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.931504147+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/5.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.931660784+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/6.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.931798929+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/7.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.931937007+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/8.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.932074810+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/9.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.932212217+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/10.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.932352470+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/11.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.932492183+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/12.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.932631604+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/13.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.932770371+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/14.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.932908955+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/15.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.933049157+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/16.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.933188638+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/17.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.933327999+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/18.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.933467705+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/19.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.933608266+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/20.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.933747966+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/21.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.933886300+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/22.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.934026297+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/23.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.934166356+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/24.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.934306319+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/25.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.934472078+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/26.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.940717437+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/27.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.940865118+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/28.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.941006427+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/29.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.941147677+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/30.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }
[2019-06-14T11:36:08.941288909+02:00][sarchive::lib][DEBUG] Event received: Event { kind: Create(Any), paths: ["/tmp/torque/31.SC"], attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }

What you expected

It should just be able to keep track of all events, especially since the queue is big enough.

What happened

The watcher thread and the debouncer (Threads 3 and 2 below) event loop are waiting and seem to be deadlocked. There is no further processing of events.

Thread 6 (Thread 0x7f545580a700 (LWP 4027)):
#0  0x00007f5455df9965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000055713a5ccd0d in crossbeam_utils::sync::parker::Inner::park::h6f7e9f05947e58c5 ()
#2  0x000055713a5cc9cb in crossbeam_utils::sync::parker::Parker::park::h1dacaa9c32ead107 ()
#3  0x000055713a553f5b in sarchive::lib::signal_handler_atomic::h2d8e9d3d751c5d1f ()
#4  0x000055713a5546f6 in crossbeam_utils::thread::ScopedThreadBuilder::spawn::_$u7b$$u7b$closure$u7d$$u7d$::h86eeb285f945e829 ()
#5  0x000055713a5462eb in std::sys_common::backtrace::__rust_begin_short_backtrace::hdc70f0854b567d1a ()
#6  0x000055713a63ea0a in __rust_maybe_catch_panic () at src/libpanic_unwind/lib.rs:87
#7  0x000055713a54fe6d in _$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$::call_box::h9a5ade7f14d85759 ()
#8  0x000055713a63e09e in call_once<(),()> () at /rustc/6c2484dc3c532c052f159264e970278d8b77cdc9/src/liballoc/boxed.rs:759
#9  start_thread () at src/libstd/sys_common/thread.rs:14
#10 thread_start () at src/libstd/sys/unix/thread.rs:81
#11 0x00007f5455df5dd5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f5455908ead in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f5455609700 (LWP 4028)):
#0  0x00007f5455df9965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000055713a62f353 in wait () at src/libstd/sys/unix/condvar.rs:70
#2  wait () at src/libstd/sys_common/condvar.rs:41
#3  wait<()> () at src/libstd/sync/condvar.rs:204
#4  park () at src/libstd/thread/mod.rs:908
#5  0x000055713a5cc4f3 in crossbeam_channel::context::Context::wait_until::h0bf53e5a4352da39 ()
#6  0x000055713a5cc1cc in crossbeam_channel::context::Context::with::_$u7b$$u7b$closure$u7d$$u7d$::hb09c5663e8cedc6d ()
#7  0x000055713a5cbcaf in _$LT$std..thread..local..LocalKey$LT$T$GT$$GT$::try_with::h2081487015026962 ()
#8  0x000055713a5caf21 in crossbeam_channel::select::run_select::he3afd68d128e33fc ()
#9  0x000055713a5cb706 in crossbeam_channel::select::Select::select::hced2ab4085c02c35 ()
#10 0x000055713a551776 in sarchive::lib::monitor::h8503efb39a9206f2 ()
#11 0x000055713a554365 in crossbeam_utils::thread::ScopedThreadBuilder::spawn::_$u7b$$u7b$closure$u7d$$u7d$::h5c3cfc13ba01db7e ()
#12 0x000055713a5462eb in std::sys_common::backtrace::__rust_begin_short_backtrace::hdc70f0854b567d1a ()
#13 0x000055713a63ea0a in __rust_maybe_catch_panic () at src/libpanic_unwind/lib.rs:87
#14 0x000055713a54fb2d in _$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$::call_box::h190f113dc2fb2a1b ()
#15 0x000055713a63e09e in call_once<(),()> () at /rustc/6c2484dc3c532c052f159264e970278d8b77cdc9/src/liballoc/boxed.rs:759
#16 start_thread () at src/libstd/sys_common/thread.rs:14
#17 thread_start () at src/libstd/sys/unix/thread.rs:81
#18 0x00007f5455df5dd5 in start_thread () from /lib64/libpthread.so.0
#19 0x00007f5455908ead in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f5455408700 (LWP 4029)):
#0  0x00007f5455df9965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000055713a62f353 in wait () at src/libstd/sys/unix/condvar.rs:70
#2  wait () at src/libstd/sys_common/condvar.rs:41
#3  wait<()> () at src/libstd/sync/condvar.rs:204
#4  park () at src/libstd/thread/mod.rs:908
#5  0x000055713a5cc4f3 in crossbeam_channel::context::Context::wait_until::h0bf53e5a4352da39 ()
#6  0x000055713a5cc1cc in crossbeam_channel::context::Context::with::_$u7b$$u7b$closure$u7d$$u7d$::hb09c5663e8cedc6d ()
#7  0x000055713a5cbcaf in _$LT$std..thread..local..LocalKey$LT$T$GT$$GT$::try_with::h2081487015026962 ()
#8  0x000055713a5caf21 in crossbeam_channel::select::run_select::he3afd68d128e33fc ()
#9  0x000055713a5cb706 in crossbeam_channel::select::Select::select::hced2ab4085c02c35 ()
#10 0x000055713a5529c8 in sarchive::lib::process::h005bbd03ae18b9b3 ()
#11 0x000055713a55415d in crossbeam_utils::thread::ScopedThreadBuilder::spawn::_$u7b$$u7b$closure$u7d$$u7d$::h1c9ff6e400f4e454 ()
#12 0x000055713a5462eb in std::sys_common::backtrace::__rust_begin_short_backtrace::hdc70f0854b567d1a ()
#13 0x000055713a63ea0a in __rust_maybe_catch_panic () at src/libpanic_unwind/lib.rs:87
#14 0x000055713a55000d in _$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$::call_box::hae33da4cd92a3d73 ()
#15 0x000055713a63e09e in call_once<(),()> () at /rustc/6c2484dc3c532c052f159264e970278d8b77cdc9/src/liballoc/boxed.rs:759
#16 start_thread () at src/libstd/sys_common/thread.rs:14
#17 thread_start () at src/libstd/sys/unix/thread.rs:81
#18 0x00007f5455df5dd5 in start_thread () from /lib64/libpthread.so.0
#19 0x00007f5455908ead in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f5455207700 (LWP 4030)):
#0  0x00007f5455dfc4ed in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f5455df7dcb in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007f5455df7c98 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000055713a5811f3 in notify::debounce::timer::ScheduleWorker::run::h05a242d775f47422 ()
#4  0x000055713a5946ff in std::sys_common::backtrace::__rust_begin_short_backtrace::he553ad6571cfadd7 ()
#5  0x000055713a597c1e in std::panicking::try::do_call::h816d9a307bc6821f ()
#6  0x000055713a63ea0a in __rust_maybe_catch_panic () at src/libpanic_unwind/lib.rs:87
#7  0x000055713a59fc2f in _$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$::call_box::h2b7c3a08953572d0 ()
#8  0x000055713a63e09e in call_once<(),()> () at /rustc/6c2484dc3c532c052f159264e970278d8b77cdc9/src/liballoc/boxed.rs:759
#9  start_thread () at src/libstd/sys_common/thread.rs:14
#10 thread_start () at src/libstd/sys/unix/thread.rs:81
---Type <return> to continue, or q <return> to quit---
#11 0x00007f5455df5dd5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f5455908ead in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f5455006700 (LWP 4031)):
#0  0x00007f5455dfc4ed in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f5455df7dcb in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007f5455df7c98 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000055713a5829df in notify::debounce::timer::WatchTimer::ignore::h41e8c48798833732 ()
#4  0x000055713a589b0d in notify::debounce::Debounce::event::h457eb1b9afa24ac7 ()
#5  0x000055713a586f72 in notify::debounce::EventTx::send::h710e47f0ce4ad6c5 ()
#6  0x000055713a577568 in notify::inotify::EventLoop::event_loop_thread::h77a1f3739e8f1912 ()
#7  0x000055713a5946c6 in std::sys_common::backtrace::__rust_begin_short_backtrace::h5d4573f5b62c565e ()
#8  0x000055713a597c56 in std::panicking::try::do_call::h9da2af6e0d2c9e8c ()
#9  0x000055713a63ea0a in __rust_maybe_catch_panic () at src/libpanic_unwind/lib.rs:87
#10 0x000055713a59ffa8 in _$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$::call_box::heaa12d6560ac3ec4 ()
#11 0x000055713a63e09e in call_once<(),()> () at /rustc/6c2484dc3c532c052f159264e970278d8b77cdc9/src/liballoc/boxed.rs:759
#12 start_thread () at src/libstd/sys_common/thread.rs:14
#13 thread_start () at src/libstd/sys/unix/thread.rs:81
#14 0x00007f5455df5dd5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007f5455908ead in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f5456837940 (LWP 4026)):
#0  0x00007f5455df9965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000055713a5cd670 in crossbeam_utils::sync::wait_group::WaitGroup::wait::h99903d437b2ae8f1 ()
#2  0x000055713a55630b in crossbeam_utils::thread::scope::h4e5e6b620eff23ee ()
#3  0x000055713a54d761 in sarchive::main::hcf3a3ed84010428d ()
#4  0x000055713a54dbe3 in std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::h6aee53ed84e23fb3 ()
#5  0x000055713a63a263 in {{closure}} () at src/libstd/rt.rs:49
#6  do_call<closure,i32> () at src/libstd/panicking.rs:297
#7  0x000055713a63ea0a in __rust_maybe_catch_panic () at src/libpanic_unwind/lib.rs:87
#8  0x000055713a63ae2d in try<i32,closure> () at src/libstd/panicking.rs:276
#9  catch_unwind<closure,i32> () at src/libstd/panic.rs:388
#10 lang_start_internal () at src/libstd/rt.rs:48
#11 0x000055713a54dbb2 in main ()

Relevant code:

pub fn monitor(path: &Path, s: &Sender<TorqueJobEntry>, sigchannel: &Receiver<bool>) -> notify::Result<()> {
    let (tx, rx) = unbounded();

    // create a platform-specific watcher
    let mut watcher: RecommendedWatcher = Watcher::new(tx, Duration::from_secs(2))?;

    info!("Watching path {:?}", &path);

    if let Err(e) = watcher.watch(&path, RecursiveMode::NonRecursive) {
        return Err(e);
    }
    loop {
        select! {
            recv(sigchannel) -> b => if let Ok(true) = b  {
                return Ok(());
            },
            recv(rx) -> event => { match event {
                Ok(e) => check_and_queue(s, e.unwrap())?,
                Err(e) => {
                    error!("Error on received event: {:?}", e);
                    break;
                }
            };}
        }
    }

    Ok(())
}
@passcod
Copy link
Member

passcod commented Jun 14, 2019

Can you use or test the latest 4.x release instead of what's in main? At least to figure out if this is new or existing

@itkovian
Copy link
Contributor Author

I can confirm that the issue also exists in 4.0.12. I tried forcing to 4.0.10, but I do not know why it keeps picking 4.0.12.

@itkovian
Copy link
Contributor Author

I tried changing the delay, but that does not fix anything. Raising the delay to 10s, gives me the highest ever count of processed events (5055), but then it also hangs.

@passcod
Copy link
Member

passcod commented Jun 15, 2019 via email

@passcod
Copy link
Member

passcod commented Jun 15, 2019

You may need to specify "=4.0.10" to pin to a specific version (and apologies if you've tried that already)

@itkovian
Copy link
Contributor Author

I'd rather not go with RawEvents for the following reason. In one version of the tool, I'm looking to archive files that reside in a directory. Upon receipt of the CREATE (for the dir), I have no idea in what state the files are in. I tried to recursively watch, but that seemed to miss CREATE events on the files, though I have no immediate idea why. The debounced events will only appear after the delay, if I understood correctly, and then I can be reasonably sure that if the files exist, their contents is also complete. I'll try to pinpoint an earlier version where it still fails later today.

@passcod
Copy link
Member

passcod commented Jun 16, 2019 via email

@itkovian
Copy link
Contributor Author

Would it issue e.g., CLOSE_WRITE? That would also by ok ;) Or is the problem the underlying inotify implementation in the kernel itself?

@passcod
Copy link
Member

passcod commented Jun 16, 2019

If files are created before the watch is placed, they won't emit events. What notify could do is readdir a directory it notices is new / gets a create event for and issue creates for everything it finds inside, to simulate what people expect would be the correct behaviour.

@passcod
Copy link
Member

passcod commented Jun 16, 2019

(That is, the inotify watch depth is one. Given a directory, it will only watch itself and its immediate contents. Recursive watching on Linux is a Notify abstraction on top.)

@0xpr03
Copy link
Member

0xpr03 commented Sep 4, 2019

I'm not sure but I think I've got the same problem, using 5.0.0-pre.1.
I've got foo/ which I watch.
I add foo/A/AA/AA.txt and can delete A/ and add it again, works.
But when I insert A/AA/AA.txt inside foo/A, creating foo/A/A/AA/AA.txt it gets to hang. So it seems like the update of the recursive inotify? creates a total hangup.
After this no event is received ever again.

@0xpr03
Copy link
Member

0xpr03 commented Sep 4, 2019

I switched to 4.0.13 and get the same behavior.

0xpr03 added a commit to 0xpr03/group_watchdog that referenced this issue Sep 4, 2019
We currently can hang up due to notify-rs/notify#205

Signed-off-by: Aron Heinecke <[email protected]>
@braincore
Copy link

I've also observed this behavior for many months on 4.x.x and finally got to verifying that it only presents itself with the debouncer. I've switched to raw_watcher to get around this.

@mtak-
Copy link

mtak- commented Sep 11, 2019

Version: 4.0.13

I have discovered this same issue on macos (fsevents). It manifests semi-frequently when using the debouncer with a low delay (e.g. 0).

  thread #4
    frame #0: 0x00007fff6df2af4a libsystem_kernel.dylib`__psynch_mutexwait + 10
    frame #1: 0x00007fff6dfe9a4a libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 96
    frame #2: 0x00007fff6dfe7769 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 222
    frame #3: 0x000000010a7c23e5 test-desktop`std::sync::mutex::Mutex$LT$T$GT$::lock::h85454f619cabf454(self=0x00007faea6728190) at mutex.rs:220:12 [opt]
    frame #4: 0x000000010a7c9772 test-desktop`notify::debounce::timer::LoggingMutex$LT$T$GT$::lock::hc385c4f438d70e45(self=0x00007faea6728190) at timer.rs:24:8 [opt]
    frame #5: 0x000000010a79f831 test-desktop`notify::debounce::timer::ScheduleWorker::fire_event::h7e0fa0746004f739(self=0x0000700006876bc0, ev=<unavailable>) at timer.rs:98:36 [opt]
    frame #6: 0x000000010a79f70e test-desktop`notify::debounce::timer::ScheduleWorker::fire_due_events::h68f460ada7f4030b(self=0x0000700006876bc0, now=<unavailable>) at timer.rs:85:16 [opt]
    frame #7: 0x000000010a79fc75 test-desktop`notify::debounce::timer::ScheduleWorker::run::h90d88ef0b0efc6b1(self=0x0000700006876bc0) at timer.rs:138:28 [opt]
    frame #8: 0x000000010a7c8e90 test-desktop`notify::debounce::timer::WatchTimer::new::_$u7b$$u7b$closure$u7d$$u7d$::hcb8abd07577d45f9 at timer.rs:186:12 [opt]
    frame #9: 0x000000010a7c8237 test-desktop`std::sys_common::backtrace::__rust_begin_short_backtrace::hbe01058a8f40b652(f=<unavailable>) at backtrace.rs:77:4 [opt]
    frame #10: 0x000000010a79cfe7 test-desktop`std::thread::Builder::spawn_unchecked::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h0dbd0c1b33aa8f6a at mod.rs:470:16 [opt]
    frame #11: 0x000000010a79cf27 test-desktop`_$LT$std..panic..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::h3145a96354848384(self=<unavailable>, _args=<unavailable>) at panic.rs:315:8 [opt]
    frame #12: 0x000000010a7c9017 test-desktop`std::panicking::try::do_call::hf5ead24062a0b84a(data=<unavailable>) at panicking.rs:296:39 [opt]
    frame #13: 0x000000010a8cf47f test-desktop`__rust_maybe_catch_panic at lib.rs:80:7 [opt]
    frame #14: 0x000000010a7c8f06 test-desktop`std::panicking::try::h1a92081908bf3b80(f=<unavailable>) at panicking.rs:275:12 [opt]
    frame #15: 0x000000010a79cf87 test-desktop`std::panic::catch_unwind::h362c35e8f3a05ee7(f=<unavailable>) at panic.rs:394:8 [opt]
    frame #16: 0x000000010a7afb31 test-desktop`std::thread::Builder::spawn_unchecked::_$u7b$$u7b$closure$u7d$$u7d$::hcc81242fee004cf7 at mod.rs:469:29 [opt]
    frame #17: 0x000000010a8bc77e test-desktop`_$LT$alloc..boxed..Box$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::hc8e54c901e83587c at boxed.rs:922:8 [opt]
    frame #18: 0x000000010a8ced0e test-desktop`std::sys::unix::thread::Thread::new::thread_start::h52cdb32292a4bc87 [inlined] _$LT$alloc..boxed..Box$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::hefb3e6f10e6b13a2 at boxed.rs:922:8 [opt]
    frame #19: 0x000000010a8ced02 test-desktop`std::sys::unix::thread::Thread::new::thread_start::h52cdb32292a4bc87 [inlined] std::sys_common::thread::start_thread::h588c9bae9dd3729b at thread.rs:13 [opt]
    frame #20: 0x000000010a8cec8e test-desktop`std::sys::unix::thread::Thread::new::thread_start::h52cdb32292a4bc87 at thread.rs:79 [opt]
    frame #21: 0x00007fff6dfebd76 libsystem_pthread.dylib`_pthread_start + 125
    frame #22: 0x00007fff6dfe85e3 libsystem_pthread.dylib`thread_start + 15
  thread #18
    frame #0: 0x00007fff6df2af4a libsystem_kernel.dylib`__psynch_mutexwait + 10
    frame #1: 0x00007fff6dfe9a4a libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 96
    frame #2: 0x00007fff6dfe7769 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 222
    frame #3: 0x000000010a7c2385 test-desktop`std::sync::mutex::Mutex$LT$T$GT$::lock::h13745d6d73d39497(self=0x00007faea6731380) at mutex.rs:220:12 [opt]
    frame #4: 0x000000010a7c9692 test-desktop`notify::debounce::timer::LoggingMutex$LT$T$GT$::lock::h7d5e14a913f8e32b(self=0x00007faea6731380) at timer.rs:24:8 [opt]
    frame #5: 0x000000010a7a00a5 test-desktop`notify::debounce::timer::WatchTimer::ignore::hfa4765c6fbd481ee(self=<unavailable>, id=<unavailable>) at timer.rs:222:25 [opt]
    frame #6: 0x000000010a7b6168 test-desktop`notify::debounce::restart_timer::hf02fd7b09cfbb09d(timer_id=0x00007faea8b06340, path=PathBuf @ 0x0000700007c97820, timer=0x00007faea6736f48) at mod.rs:506:8 [opt]
    frame #7: 0x000000010a7b5c5a test-desktop`notify::debounce::Debounce::event::h304ff5a9c2a8fbb4(self=0x00007faea6736f18, path=PathBuf @ 0x0000700007c978e0, op=<unavailable>, cookie=<unavailable>) at mod.rs:465:32 [opt]
    frame #8: 0x000000010a7b474d test-desktop`notify::debounce::EventTx::send::h979500e8b59b8377(self=<unavailable>, event=<unavailable>) at mod.rs:46:24 [opt]
    frame #9: 0x000000010a7ae477 test-desktop`notify::fsevent::send_pending_rename_event::hf8882bfb28062e1b(event=Option<notify::RawEvent> @ 0x0000700007c97b40, event_tx=<unavailable>) at fsevent.rs:76:8 [opt]
    frame #10: 0x000000010a7aeec9 test-desktop`notify::fsevent::callback::h6f970d2ae789d1a1(stream_ref=<unavailable>, info=<unavailable>, num_events=12, event_paths=<unavailable>, event_flags=<unavailable>, event_ids=<unavailable>) at fsevent.rs:348:8 [opt]
    frame #11: 0x00007fff382098eb FSEvents`implementation_callback_rpc + 2911
    frame #12: 0x00007fff38208d13 FSEvents`_Xcallback_rpc + 233
    frame #13: 0x00007fff38208c0d FSEvents`FSEventsD2F_server + 55
    frame #14: 0x00007fff3820b68d FSEvents`FSEventsClientProcessMessageCallback + 46
    frame #15: 0x00007fff36a35f73 CoreFoundation`__CFMachPortPerform + 250
    frame #16: 0x00007fff36a35e6f CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE1_PERFORM_FUNCTION__ + 41
    frame #17: 0x00007fff36a35dbf CoreFoundation`__CFRunLoopDoSource1 + 541
    frame #18: 0x00007fff36a1e0dc CoreFoundation`__CFRunLoopRun + 2612
    frame #19: 0x00007fff36a1d423 CoreFoundation`CFRunLoopRunSpecific + 499
    frame #20: 0x00007fff36a1d1fa CoreFoundation`CFRunLoopRun + 40
    frame #21: 0x000000010a7c8e01 test-desktop`notify::fsevent::FsEventWatcher::run::_$u7b$$u7b$closure$u7d$$u7d$::hb7e959d675ad69ca at fsevent.rs:240:16 [opt]
    frame #22: 0x000000010a7c8265 test-desktop`std::sys_common::backtrace::__rust_begin_short_backtrace::hd22343935848eab5(f=<unavailable>) at backtrace.rs:77:4 [opt]
    frame #23: 0x000000010a79d015 test-desktop`std::thread::Builder::spawn_unchecked::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h30b537835a8eb003 at mod.rs:470:16 [opt]
    frame #24: 0x000000010a79cf55 test-desktop`_$LT$std..panic..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::h984237bcc127aa0f(self=<unavailable>, _args=<unavailable>) at panic.rs:315:8 [opt]
    frame #25: 0x000000010a7c8fd2 test-desktop`std::panicking::try::do_call::he72e0574bf681862(data=<unavailable>) at panicking.rs:296:39 [opt]
    frame #26: 0x000000010a8cf47f test-desktop`__rust_maybe_catch_panic at lib.rs:80:7 [opt]
    frame #27: 0x000000010a7c8f74 test-desktop`std::panicking::try::h2b3f0e92b094ef5d(f=<unavailable>) at panicking.rs:275:12 [opt]
    frame #28: 0x000000010a79cfb5 test-desktop`std::panic::catch_unwind::h7a7b604217f078c6(f=<unavailable>) at panic.rs:394:8 [opt]
    frame #29: 0x000000010a7afa3e test-desktop`std::thread::Builder::spawn_unchecked::_$u7b$$u7b$closure$u7d$$u7d$::hc7744707bf6d169a at mod.rs:469:29 [opt]
    frame #30: 0x000000010a8bc77e test-desktop`_$LT$alloc..boxed..Box$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::hc8e54c901e83587c at boxed.rs:922:8 [opt]
    frame #31: 0x000000010a8ced0e test-desktop`std::sys::unix::thread::Thread::new::thread_start::h52cdb32292a4bc87 [inlined] _$LT$alloc..boxed..Box$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::hefb3e6f10e6b13a2 at boxed.rs:922:8 [opt]
    frame #32: 0x000000010a8ced02 test-desktop`std::sys::unix::thread::Thread::new::thread_start::h52cdb32292a4bc87 [inlined] std::sys_common::thread::start_thread::h588c9bae9dd3729b at thread.rs:13 [opt]
    frame #33: 0x000000010a8cec8e test-desktop`std::sys::unix::thread::Thread::new::thread_start::h52cdb32292a4bc87 at thread.rs:79 [opt]
    frame #34: 0x00007fff6dfebd76 libsystem_pthread.dylib`_pthread_start + 125
    frame #35: 0x00007fff6dfe85e3 libsystem_pthread.dylib`thread_start + 15

I added some logging to confirm the deadlock:

ThreadId(18) about to lock: 0x7faea6728190
// ThreadId(18) about to lock: 0x7faea6731380
// ThreadId(18) about to unlock: 0x7faea6731380 (these two cancel out)
ThreadId(3)  about to lock: 0x7faea6731380
ThreadId(3)  about to lock: 0x7faea6728190
ThreadId(18) about to lock: 0x7faea6731380

Diving deeper, the operations buffer lock, and the events lock are held simultaneously, but grabbed in different orders from the two threads.

These lines from thread 4 (id 3):

// attempts to grab operations buffer lock
    frame #5: 0x000000010a79f831 test-desktop`notify::debounce::timer::ScheduleWorker::fire_event::h7e0fa0746004f739(self=0x0000700006876bc0, ev=<unavailable>) at timer.rs:98:36 [opt]
// successfully acquires events lock
    frame #6: 0x000000010a79f70e test-desktop`notify::debounce::timer::ScheduleWorker::fire_due_events::h68f460ada7f4030b(self=0x0000700006876bc0, now=<unavailable>) at timer.rs:85:16 [opt]

And these lines from thread 18 (id 18):

// attempts to grab events lock
    frame #5: 0x000000010a7a00a5 test-desktop`notify::debounce::timer::WatchTimer::ignore::hfa4765c6fbd481ee(self=<unavailable>, id=<unavailable>) at timer.rs:222:25 [opt]
    frame #6: 0x000000010a7b6168 test-desktop`notify::debounce::restart_timer::hf02fd7b09cfbb09d(timer_id=0x00007faea8b06340, path=PathBuf @ 0x0000700007c97820, timer=0x00007faea6736f48) at mod.rs:506:8 [opt]
// successfully acquires operations buffer lock
    frame #7: 0x000000010a7b5c5a test-desktop`notify::debounce::Debounce::event::h304ff5a9c2a8fbb4(self=0x00007faea6736f18, path=PathBuf @ 0x0000700007c978e0, op=<unavailable>, cookie=<unavailable>) at mod.rs:465:32 [opt]

@mtak-
Copy link

mtak- commented Sep 12, 2019

Possibly fixed by #210.

@itkovian
Copy link
Contributor Author

For now, I went with 5.0.0-pre.2, which has no RawEvents, and used new_immediate to get a watcher. I did the delay manually, keeping track of timestamps at the event receipt time and waiting a bit in case the file I expect to find in the directory is not there yet.

I am fine with that for the moment, and as I understood the debounced part was up for a major rewrite at some point anyway.

@0xpr03
Copy link
Member

0xpr03 commented Oct 16, 2019

Some deadlocks are fixed now by #210 in the 4.X branch. We'll need some input if that resolves the problems and up-port this to 5.X

@xudesheng
Copy link

I encountered exactly same issue on 5.0.0-pre.1.
I do need debouned event and I may not be able to move to 5.0.0-pre.2.

stuhood pushed a commit to stuhood/pants that referenced this issue May 13, 2020
…the Graph, and we want to avoid things like notify-rs/notify#205.

# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]
stuhood pushed a commit to stuhood/pants that referenced this issue May 14, 2020
…the Graph, and we want to avoid things like notify-rs/notify#205.

# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]
stuhood pushed a commit to stuhood/pants that referenced this issue May 14, 2020
…the Graph, and we want to avoid things like notify-rs/notify#205.

# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]
stuhood pushed a commit to pantsbuild/pants that referenced this issue May 14, 2020
### Problem

We're on a slightly older version of the `notify` crate because we wanted to use the "debounced" API, which has a thread that attempts to batch events and dedupe them where possible. But the implementation of debouncing has issues like notify-rs/notify#205.

### Solution

Move away from the debounced watcher towards the trustier direct delivery of events: post #9636 we do our own debouncing via the Graph not double-cleaning `Nodes`, and via the `--loop` delay.

Additionally, fix two cases where code and tests used to race pants' invalidation and assume (intentionally or otherwise) that pants would not notice created files. 

### Result

Reduced risk of lost `notify` events.

[ci skip-jvm-tests]
stuhood pushed a commit to pantsbuild/pants that referenced this issue May 14, 2020
### Problem

We're on a slightly older version of the `notify` crate because we wanted to use the "debounced" API, which has a thread that attempts to batch events and dedupe them where possible. But the implementation of debouncing has issues like notify-rs/notify#205.

### Solution

Move away from the debounced watcher towards the trustier direct delivery of events: post #9636 we do our own debouncing via the Graph not double-cleaning `Nodes`, and via the `--loop` delay.

Additionally, fix two cases where code and tests used to race pants' invalidation and assume (intentionally or otherwise) that pants would not notice created files. 

### Result

Reduced risk of lost `notify` events.

[ci skip-jvm-tests]
@0xpr03 0xpr03 added the B-debouncer debouncer related label Mar 6, 2021
@0xpr03
Copy link
Member

0xpr03 commented Mar 22, 2021

Note that since 5.0.0-pre.2 none of the original debouncer code exists any more (and in fact no debouncer for the time being). So this is only v4 related for now. (I'm adding a v4 tag so we can track v5 issues separately.)

@0xpr03 0xpr03 added the V-4 Only on major version 4 label Mar 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
B-debouncer debouncer related V-4 Only on major version 4
Projects
None yet
Development

No branches or pull requests

6 participants