Move file invalidation handling to rust #9636

stuhood · 2020-04-26T06:10:30Z

Problem

A few different kinds of file watching span the boundary between the SchedulerService and FSEventService:

pantsd invalidation globs - how pantsd detects that its implementing code or config has changed
pidfile - watches pantsd's pidfile to ensure that the daemon exits if it loses exclusivity
graph invalidation - any files changing in the workspace should invalidate the engine's Graph
--loop - implemented directly in the SchedulerService

Because of the semi-cyclic nature of the relationship between the SchedulerService and FSEventService, it's challenging to understand the interplay of these usecases. And, unsurprisingly, that lead to the notify crate implementation only satisfying one of them.

Solution

The fundamental change in this PR is to add support for two new parameters to engine executions which are implemented by the Graph:

poll: bool - When poll is enabled, a product_request will wait for the requested Nodes to have changed from their last-observed values before returning. When a poll waits, an optional poll_delay is applied before it returns to "debounce" polls.
timeout: Optional[Duration] - When a timeout is set, a product_request will wait up to the given duration for the requested Node(s) to complete (including any time polling).

These features are then used by:

--loop: uses poll (with a poll_delay, but without a timeout) to immediately re-run a Goal when its inputs have changed.
invalidation globs and pidfile watching: use poll (with no poll_delay) and timeout to block their SchedulerService thread and watch for changes to those files.

Result

The FSEventService and SchedulerService are decoupled, and each now interacts only with the Scheduler: FSEventService to push watchman events to the Graph, and the SchedulerService to pull invalidation information from the Graph.

Because all events now flow through the Graph, the notify crate has reached feature parity with watchman.

In followup changes we can remove the experimental flag, disable watchman (and thus the FSEventService) by default, and remove the dependency between --loop and pantsd.

stuhood · 2020-05-01T20:46:50Z

Ok, this is finally ready for review. The commits are useful to look at independently.

stuhood · 2020-05-02T05:13:00Z

src/rust/engine/src/scheduler.rs

+  // complete. The second request will check whether the roots have changed, and if they haven't
+  // changed, will wait until they have (or until the timeout elapses) before re-requesting them.
+  //
+  // TODO: The `poll`, `poll_delay`, and `timeout` parameters exist to support a coarse-grained API


So, not going to lie: I'm not happy with how large this API ended up being. In a near-term followup, I'd like to move (the loop around) run_goal_rules into the rust Scheduler. But right now that would interfere with Greg's UI work.

…mplify tests slightly. [ci skip-jvm-tests]

[ci skip-jvm-tests]

…ation, and add long-polling of invalidation globs to SchedulerService. [ci skip-jvm-tests]

…:SessionId, and rename to RunId to avoid overloading with the engine's Session. [ci skip-jvm-tests]

hrfuller

LGTM Stu. Thanks.

hrfuller · 2020-05-06T18:42:41Z

src/rust/engine/watch/src/lib.rs

+    let mut inner = self.0.lock();
+    let (ignorer, canonical_build_root, liveness_sender, watch_receiver) = inner
+      .background_task_inputs
+      .take()


This moves the contents out of the Option right?

Yep. The next line asserts that start is not called more than once.

hrfuller · 2020-05-06T18:53:56Z

src/rust/engine/watch/src/lib.rs

-            .spawn_blocking(move || watcher_lock.watch(path, RecursiveMode::NonRecursive))
-            .await
-        }
-        Err(()) => Err(notify::Error::new(notify::ErrorKind::Generic(


So locking is infallible now?

I switched to parking_lot because I couldn't reproduce the issue you were seeing.

hrfuller · 2020-05-06T22:06:32Z

src/rust/engine/graph/src/entry.rs

+  /// currently to clean it).
+  fn poll_should_wait(&self, context: &N::Context) -> bool {
+    match self {
+      EntryResult::Uncacheable(_, session_id) => context.session_id() == session_id,


If we are polling an uncacheable node and the result was for the same session we are currently in, we wait for the uncacheable node to finish?

poll waits for something to change (unlike waiters, which are about waiting for something to finish). If the uncacheable node matches our session/run-id, then it is already up to date.

A few different kinds of file watching span the boundary between the `SchedulerService` and `FSEventService`: 1. pantsd invalidation globs - how `pantsd` detects that its implementing code or config has changed 2. pidfile - watches `pantsd`'s pidfile to ensure that the daemon exits if it loses exclusivity 3. graph invalidation - any files changing in the workspace should invalidate the engine's `Graph` 4. `--loop` - implemented directly in the `SchedulerService` Because of the semi-cyclic nature of the relationship between the `SchedulerService` and `FSEventService`, it's challenging to understand the interplay of these usecases. And, unsurprisingly, that lead to the `notify` crate implementation only satisfying one of them. The fundamental change in this PR is to add support for two new parameters to engine executions which are implemented by the `Graph`: * `poll: bool` - When `poll` is enabled, a `product_request` will wait for the requested Nodes to have changed from their last-observed values before returning. When a poll waits, an optional `poll_delay` is applied before it returns to "debounce" polls. * `timeout: Optional[Duration]` - When a `timeout` is set, a `product_request` will wait up to the given duration for the requested Node(s) to complete (including any time `poll`ing). These features are then used by: * `--loop`: uses `poll` (with a `poll_delay`, but without a `timeout`) to immediately re-run a `Goal` when its inputs have changed. * invalidation globs and pidfile watching: use `poll` (with no `poll_delay`) and `timeout` to block their `SchedulerService` thread and watch for changes to those files. The `FSEventService` and `SchedulerService` are decoupled, and each now interacts only with the `Scheduler`: `FSEventService` to push `watchman` events to the `Graph`, and the `SchedulerService` to pull invalidation information from the `Graph`. Because all events now flow through the `Graph`, the `notify` crate has reached feature parity with `watchman`. In followup changes we can remove the experimental flag, disable `watchman` (and thus the `FSEventService`) by default, and remove the dependency between `--loop` and `pantsd`.

### Problem We're on a slightly older version of the `notify` crate because we wanted to use the "debounced" API, which has a thread that attempts to batch events and dedupe them where possible. But the implementation of debouncing has issues like notify-rs/notify#205. ### Solution Move away from the debounced watcher towards the trustier direct delivery of events: post #9636 we do our own debouncing via the Graph not double-cleaning `Nodes`, and via the `--loop` delay. Additionally, fix two cases where code and tests used to race pants' invalidation and assume (intentionally or otherwise) that pants would not notice created files. ### Result Reduced risk of lost `notify` events. [ci skip-jvm-tests]

@hrfuller

* Port to tokio 0.2, and to stdlib futures for fs and task_executor (#9071) We're on an older version of tokio, which doesn't smoothly support usage of async/await. Switch to tokio 0.2, which supports directly spawning and awaiting (via its macros) stdlib futures, which is an important step toward being able to utilize async/await more broadly. Additionally, port the `fs` and `task_executor` crates to stdlib futures. Finally, transitively fixup for the new APIs: in particular, since both `task_executor` and `tokio` now consume stdlib futures to spawn tasks, we switch all relevant tests and main methods to use the `tokio::main` and `tokio::test` macros, which annotate async methods and spawn a runtime to allow for `await`ing futures inline. Progress toward more usage of async/await! * Add notify fs watcher to engine. (#9318) * Use the notify crate to implement an `InvalidationWatcher` for Graph operations. * Make watch async, and watcher log to pantsd.log. Relativize paths returned from notify to the build_root. Refactor invalidate method to be an associated method on the InvalidationWatcher. * Respond to feedback. * Use spawn on io pool instead of custom future impl * Write python fs tests * Relativize paths to invalidate to build root * invalidate nodes with parent paths. * Comments * Add rust tests. Make some things public so we can use them in tests. Use canonical path to build root for relativizing changed paths. * Refactor Python tests. Return watch errors as core::Failure all the way to user. Move task executor onto invalidation watcher. Move test_support trait impl into test_support mod. * use futures lock on watcher * Platform specific watching behavior. On Darwin recursively watch the build root at startup. On Linux watch individual directory roots. Co-authored-by: Stu Hood <[email protected]> * Ignore notify events for pants_ignore patterns. (#9406) * Create a git ignorer on the context object. Adjust all call sites which create a posix fs to pass in an ignorer. * Ignore fsevents from files that match pants_ignore patterns. * Always pass is_dir = false to ignorer to avoid stat-ing every path the event watch thread sees. * Add a feature gate to disable the engine fs watcher introduced in #9318 (#9416) * Add a feature gate to disable the engine fs watcher introduced in #9318 by default, to mitigate issues seen in #9415 until a fix is in place. * Don't rerun uncachable nodes if they are dirtied while running. (#9452) * Don't rerun uncachable nodes if they are dirtied while running. - Retry dependencies of uncacheable nodes a few times to get a result until we are exhausted from trying too many times. - Bubble uncacheable node retry errors up to the user, tell them things were chaning to much. - Don't propagate dirtiness past uncacheable nodes when invalidating from changed roots. Otherwise dirty dependents of uncacheable nodes will need to re-run. * enable the engine fs watcher by default, now that it won't cause issues. Remove execution option override from tests. * use reference to self in stop_walk_predicate closure * invalidate often enough that test doesn't flake * Add a flag to prevent the FsEventService and watchman from starting (#9487) * add --watchman-enable flag * disable watchman when flag is false * Don't wait for the initial watchman event if we aren't using watchman. * check invalidation watcher liveness from scheduler service * Extract a `watch` crate. (#9635) The `watch` module directly accesses the `engine` crate's `Graph`, which makes it more challenging to test. Extract a `watch` crate which is used via an `trait Invalidatable` which is implemented for the engine's `Graph`, as well as independently in tests. [ci skip-jvm-tests] * Simplify Scheduler::execute and unify Graph retry (#9674) Both the `Graph` and the `Scheduler` implemented retry for requested Nodes, but the `Scheduler` implementation was pre-async-await and much more complicated. Unify the retry implementations into `Graph::get` (either roots or uncacheable nodes are retried), and simplify the `Scheduler`'s loop down to: ``` let maybe_display_handle = Self::maybe_display_initialize(&session); let result = loop { if let Ok(res) = receiver.recv_timeout(refresh_interval) { break Ok(res); } else if let Err(e) = Self::maybe_display_render(&session, &mut tasks) { break Err(e); } }; Self::maybe_display_teardown(session, maybe_display_handle); result ``` A single, more modern retry implementation (thanks @hrfuller!), and a cleaner `Scheduler::execute` loop. * Move file invalidation handling to rust (#9636) A few different kinds of file watching span the boundary between the `SchedulerService` and `FSEventService`: 1. pantsd invalidation globs - how `pantsd` detects that its implementing code or config has changed 2. pidfile - watches `pantsd`'s pidfile to ensure that the daemon exits if it loses exclusivity 3. graph invalidation - any files changing in the workspace should invalidate the engine's `Graph` 4. `--loop` - implemented directly in the `SchedulerService` Because of the semi-cyclic nature of the relationship between the `SchedulerService` and `FSEventService`, it's challenging to understand the interplay of these usecases. And, unsurprisingly, that lead to the `notify` crate implementation only satisfying one of them. The fundamental change in this PR is to add support for two new parameters to engine executions which are implemented by the `Graph`: * `poll: bool` - When `poll` is enabled, a `product_request` will wait for the requested Nodes to have changed from their last-observed values before returning. When a poll waits, an optional `poll_delay` is applied before it returns to "debounce" polls. * `timeout: Optional[Duration]` - When a `timeout` is set, a `product_request` will wait up to the given duration for the requested Node(s) to complete (including any time `poll`ing). These features are then used by: * `--loop`: uses `poll` (with a `poll_delay`, but without a `timeout`) to immediately re-run a `Goal` when its inputs have changed. * invalidation globs and pidfile watching: use `poll` (with no `poll_delay`) and `timeout` to block their `SchedulerService` thread and watch for changes to those files. The `FSEventService` and `SchedulerService` are decoupled, and each now interacts only with the `Scheduler`: `FSEventService` to push `watchman` events to the `Graph`, and the `SchedulerService` to pull invalidation information from the `Graph`. Because all events now flow through the `Graph`, the `notify` crate has reached feature parity with `watchman`. In followup changes we can remove the experimental flag, disable `watchman` (and thus the `FSEventService`) by default, and remove the dependency between `--loop` and `pantsd`. * pin tokio at exactly 0.2.13 * fix lint issues * fix mypy typing issues * Move away from the debounced notify watcher #9754 * Remove test that has raced file invalidation ever since the notify backend was added, but which will now fairly consistently lose that race. # Delete this line to force CI to run the JVM tests. [ci skip-jvm-tests] * As explained in the comment: we can no longer create duplicate parallel BUILD files and hope that pants does not notice them before we scan the directory again! # Delete this line to force CI to run the JVM tests. [ci skip-jvm-tests] Co-authored-by: Stu Hood <[email protected]> Co-authored-by: Stu Hood <[email protected]>

stuhood requested review from hrfuller, Eric-Arellano and gshuflin April 26, 2020 06:10

This comment has been minimized.

Sign in to view

stuhood force-pushed the stuhood/pantsd-invalidation-in-rust branch from 0bef48b to efcfdc1 Compare April 26, 2020 18:52

stuhood marked this pull request as draft April 26, 2020 22:38

stuhood removed request for hrfuller, Eric-Arellano and gshuflin April 26, 2020 22:38

This comment has been minimized.

Sign in to view

stuhood force-pushed the stuhood/pantsd-invalidation-in-rust branch 2 times, most recently from 3a8935a to 0903f71 Compare April 29, 2020 06:51

stuhood changed the title ~~Move pantsd invalidation globs to rust~~ Move file invalidation handling to rust Apr 29, 2020

stuhood force-pushed the stuhood/pantsd-invalidation-in-rust branch 3 times, most recently from 4c12d90 to 563a1de Compare April 30, 2020 22:43

stuhood mentioned this pull request Apr 30, 2020

Stabilize the cron CI test suite #9581

Merged

This comment has been minimized.

Sign in to view

stuhood force-pushed the stuhood/pantsd-invalidation-in-rust branch 2 times, most recently from b15d3d6 to cb237a3 Compare May 1, 2020 20:44

stuhood marked this pull request as ready for review May 1, 2020 20:45

stuhood requested review from Eric-Arellano, gshuflin and jsirois May 1, 2020 20:45

stuhood force-pushed the stuhood/pantsd-invalidation-in-rust branch 2 times, most recently from f58dcbe to ab561a1 Compare May 2, 2020 04:43

stuhood commented May 2, 2020

View reviewed changes

stuhood requested review from hrfuller and removed request for Eric-Arellano May 2, 2020 19:25

stuhood force-pushed the stuhood/pantsd-invalidation-in-rust branch from ab561a1 to faa4eb4 Compare May 2, 2020 20:44

stuhood added the needs-cherrypick label May 2, 2020

stuhood added this to the 1.26.x milestone May 2, 2020

stuhood force-pushed the stuhood/pantsd-invalidation-in-rust branch from faa4eb4 to 3998e08 Compare May 2, 2020 21:48

stuhood and others added 4 commits May 3, 2020 09:24

Propagate InvalidationWatcher exit state to the is_valid call, and si…

390548a

…mplify tests slightly. [ci skip-jvm-tests]

Add Graph::poll to allow for waiting until a Node has changed.

170bff6

[ci skip-jvm-tests]

Isolate FSEventService to one event type and direct Scheduler invalid…

a971228

…ation, and add long-polling of invalidation globs to SchedulerService. [ci skip-jvm-tests]

Get --loop working atop polling. Adjust the meaning of Graph::Node:…

90614ba

…:SessionId, and rename to RunId to avoid overloading with the engine's Session. [ci skip-jvm-tests]

stuhood force-pushed the stuhood/pantsd-invalidation-in-rust branch from 3998e08 to 90614ba Compare May 3, 2020 16:29

hrfuller approved these changes May 6, 2020

View reviewed changes

hrfuller reviewed May 6, 2020

View reviewed changes

stuhood merged commit 9979b78 into pantsbuild:master May 6, 2020

stuhood deleted the stuhood/pantsd-invalidation-in-rust branch May 6, 2020 23:19

This was referenced May 13, 2020

Move away from the debounced notify watcher #9754

Merged

Narrow or remove the hole that pantsd punches in the ignore patterns for .pids #9758

Closed

stuhood removed the needs-cherrypick label May 14, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Move file invalidation handling to rust #9636

Move file invalidation handling to rust #9636

stuhood commented Apr 26, 2020 •

edited

Loading

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

stuhood commented May 1, 2020

stuhood May 2, 2020

hrfuller left a comment

hrfuller May 6, 2020

stuhood May 6, 2020

hrfuller May 6, 2020

stuhood May 6, 2020

hrfuller May 6, 2020

stuhood May 6, 2020

Move file invalidation handling to rust #9636

Move file invalidation handling to rust #9636

Conversation

stuhood commented Apr 26, 2020 • edited Loading

Problem

Solution

Result

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

stuhood commented May 1, 2020

stuhood May 2, 2020

Choose a reason for hiding this comment

hrfuller left a comment

Choose a reason for hiding this comment

hrfuller May 6, 2020

Choose a reason for hiding this comment

stuhood May 6, 2020

Choose a reason for hiding this comment

hrfuller May 6, 2020

Choose a reason for hiding this comment

stuhood May 6, 2020

Choose a reason for hiding this comment

hrfuller May 6, 2020

Choose a reason for hiding this comment

stuhood May 6, 2020

Choose a reason for hiding this comment

stuhood commented Apr 26, 2020 •

edited

Loading