# Move file invalidation handling to rust #9636
## Conversation
Ok, this is finally ready for review. The commits are useful to look at independently.
```rust
// complete. The second request will check whether the roots have changed, and if they haven't
// changed, will wait until they have (or until the timeout elapses) before re-requesting them.
//
// TODO: The `poll`, `poll_delay`, and `timeout` parameters exist to support a coarse-grained API
```
So, not going to lie: I'm not happy with how large this API ended up being. In a near-term followup, I'd like to move (the loop around) `run_goal_rules` into the rust `Scheduler`. But right now that would interfere with Greg's UI work.
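The comment in the diff above describes generation-style long-polling: a first request with a stale token returns immediately, and subsequent requests block until the watched value changes or the timeout elapses. Here is a minimal, self-contained sketch of that pattern using only the standard library; the `Polled` type and its fields are illustrative, not the engine's actual `Graph` machinery:

```rust
use std::sync::{Condvar, Mutex};
use std::time::{Duration, Instant};

/// A toy long-pollable value: callers pass the generation they last observed,
/// and `poll` blocks until the generation advances or the timeout elapses.
struct Polled<T> {
    state: Mutex<(u64, T)>, // (generation, value)
    changed: Condvar,
}

impl<T: Clone> Polled<T> {
    fn publish(&self, value: T) {
        let mut state = self.state.lock().unwrap();
        state.0 += 1;
        state.1 = value;
        self.changed.notify_all();
    }

    /// Returns `None` on timeout, or the new (generation, value) once it has
    /// changed from `last_seen`. A caller passing a stale generation returns
    /// immediately: the "first request" case from the comment above.
    fn poll(&self, last_seen: u64, timeout: Duration) -> Option<(u64, T)> {
        let deadline = Instant::now() + timeout;
        let mut state = self.state.lock().unwrap();
        // The loop also absorbs spurious Condvar wakeups.
        while state.0 == last_seen {
            let now = Instant::now();
            if now >= deadline {
                return None;
            }
            state = self.changed.wait_timeout(state, deadline - now).unwrap().0;
        }
        Some((state.0, state.1.clone()))
    }
}
```

The `poll_delay` debounce described in this PR's description would amount to an extra sleep before returning from the changed branch.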
LGTM Stu. Thanks.
```rust
let mut inner = self.0.lock();
let (ignorer, canonical_build_root, liveness_sender, watch_receiver) = inner
  .background_task_inputs
  .take()
```
This moves the contents out of the `Option`, right?
Yep. The next line asserts that `start` is not called more than once.
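For readers unfamiliar with the pattern being confirmed here: `Option::take` swaps `None` into place and returns the previous contents, which makes it a cheap way to enforce that one-shot inputs are consumed exactly once. A hypothetical sketch (placeholder types, not the engine's actual code):

```rust
use std::sync::Mutex;

// Placeholder for whatever the background task needs; the real struct bundles
// an ignorer, the canonical build root, and channel endpoints.
type TaskInputs = (String, u32);

struct Watcher {
    background_task_inputs: Mutex<Option<TaskInputs>>,
}

impl Watcher {
    fn start(&self) {
        let inputs = self
            .background_task_inputs
            .lock()
            .unwrap()
            .take()
            // A second call to `start` finds `None` here and panics:
            .expect("start may only be called once");
        // ... spawn the background task with `inputs` ...
        drop(inputs);
    }
}
```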
```rust
      .spawn_blocking(move || watcher_lock.watch(path, RecursiveMode::NonRecursive))
      .await
  }
  Err(()) => Err(notify::Error::new(notify::ErrorKind::Generic(
```
So locking is infallible now?
I switched to `parking_lot` because I couldn't reproduce the issue you were seeing.
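Context for why the lock call can no longer fail: `std::sync::Mutex` poisons itself when a holder panics, so its `lock()` returns a `Result`, while `parking_lot::Mutex` has no poisoning and returns the guard directly. A small illustration, assuming the `parking_lot` crate is on the dependency list (as it is in the engine):

```rust
// With the standard library, lock acquisition returns a Result because the
// mutex may have been poisoned by a panicking holder:
fn increment_std(counter: &std::sync::Mutex<u64>) {
    let mut guard = counter.lock().expect("poisoned");
    *guard += 1;
}

// With parking_lot there is no poisoning, so `lock()` is infallible and
// returns the guard directly:
fn increment_parking_lot(counter: &parking_lot::Mutex<u64>) {
    let mut guard = counter.lock();
    *guard += 1;
}
```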
`src/rust/engine/graph/src/entry.rs`
```rust
/// currently to clean it).
fn poll_should_wait(&self, context: &N::Context) -> bool {
  match self {
    EntryResult::Uncacheable(_, session_id) => context.session_id() == session_id,
```
If we are polling an uncacheable node and the result was for the same session we are currently in, we wait for the uncacheable node to finish?
`poll` waits for something to change (unlike `waiters`, which are about waiting for something to finish). If the uncacheable node matches our session/run-id, then it is already up to date.
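In other words, a poll should block exactly when the stored result is still valid from the caller's perspective. A self-contained reduction of that rule; the uncacheable arm mirrors the diff above, while the `Clean` arm is an assumption for illustration rather than the engine's actual code:

```rust
// Hypothetical stand-ins for the engine's types:
#[derive(PartialEq)]
struct SessionId(u64);

enum EntryResult {
    /// A result that is only valid within the session/run that produced it.
    Uncacheable(String, SessionId),
    /// A clean, cacheable result.
    Clean(String),
}

impl EntryResult {
    /// `true` means a poll should block waiting for a change, because the
    /// stored value is already up to date for this caller.
    fn poll_should_wait(&self, current_session: &SessionId) -> bool {
        match self {
            // Same session: the uncacheable value cannot have changed yet,
            // so wait for it to change rather than returning it again.
            EntryResult::Uncacheable(_, session_id) => session_id == current_session,
            // Assumed for illustration: clean results wait until dirtied.
            EntryResult::Clean(_) => true,
        }
    }
}
```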
### Problem

We're on a slightly older version of the `notify` crate because we wanted to use the "debounced" API, which has a thread that attempts to batch events and dedupe them where possible. But the implementation of debouncing has issues like notify-rs/notify#205.

### Solution

Move away from the debounced watcher toward the more trustworthy direct delivery of events: post #9636, we do our own debouncing via the `Graph` not double-cleaning `Nodes`, and via the `--loop` delay. Additionally, fix two cases where code and tests used to race pants' invalidation and assume (intentionally or otherwise) that pants would not notice created files.

### Result

Reduced risk of lost `notify` events.

[ci skip-jvm-tests]
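For context on the two delivery modes in question, here is a sketch against the notify 4.x API (the exact API in use at the time may differ; the watched path is arbitrary):

```rust
use std::sync::mpsc::channel;
use std::time::Duration;

use notify::{raw_watcher, watcher, RecursiveMode, Watcher};

fn main() -> notify::Result<()> {
    // Debounced mode: a background thread batches and dedupes events, then
    // delivers them after the given delay. That thread is where the batching
    // bugs referenced above live.
    let (debounced_tx, _debounced_rx) = channel();
    let mut debounced = watcher(debounced_tx, Duration::from_millis(250))?;
    debounced.watch("/tmp", RecursiveMode::Recursive)?;

    // Raw mode: events are delivered directly as the OS reports them; any
    // debouncing becomes the consumer's responsibility (here, the Graph's
    // cleaning behavior and the `--loop` delay).
    let (raw_tx, raw_rx) = channel();
    let mut raw = raw_watcher(raw_tx)?;
    raw.watch("/tmp", RecursiveMode::Recursive)?;

    for event in raw_rx.iter().take(1) {
        println!("{:?}", event);
    }
    Ok(())
}
```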
* Port to tokio 0.2, and to stdlib futures for fs and task_executor (#9071)

  We're on an older version of tokio, which doesn't smoothly support usage of async/await. Switch to tokio 0.2, which supports directly spawning and awaiting (via its macros) stdlib futures, an important step toward being able to utilize async/await more broadly. Additionally, port the `fs` and `task_executor` crates to stdlib futures. Finally, transitively fix up for the new APIs: in particular, since both `task_executor` and `tokio` now consume stdlib futures to spawn tasks, we switch all relevant tests and main methods to use the `tokio::main` and `tokio::test` macros, which annotate async methods and spawn a runtime to allow for `await`ing futures inline. Progress toward more usage of async/await!

* Add notify fs watcher to engine. (#9318)

  * Use the notify crate to implement an `InvalidationWatcher` for Graph operations.
  * Make watch async, and make the watcher log to pantsd.log. Relativize paths returned from notify to the build_root. Refactor the invalidate method to be an associated method on the InvalidationWatcher.
  * Respond to feedback.
  * Use spawn on the io pool instead of a custom future impl.
  * Write python fs tests.
  * Relativize paths to invalidate to the build root.
  * Invalidate nodes with parent paths.
  * Comments.
  * Add rust tests. Make some things public so we can use them in tests. Use the canonical path to the build root for relativizing changed paths.
  * Refactor Python tests. Return watch errors as core::Failure all the way to the user. Move the task executor onto the invalidation watcher. Move the test_support trait impl into a test_support mod.
  * Use a futures lock on the watcher.
  * Platform-specific watching behavior: on Darwin, recursively watch the build root at startup; on Linux, watch individual directory roots.

  Co-authored-by: Stu Hood <[email protected]>

* Ignore notify events for pants_ignore patterns. (#9406)

  * Create a git ignorer on the context object. Adjust all call sites which create a posix fs to pass in an ignorer.
  * Ignore fsevents from files that match pants_ignore patterns.
  * Always pass is_dir = false to the ignorer to avoid stat-ing every path the event watch thread sees.

* Add a feature gate to disable the engine fs watcher introduced in #9318 (#9416)

  Disable the watcher by default, to mitigate issues seen in #9415 until a fix is in place.

* Don't rerun uncacheable nodes if they are dirtied while running. (#9452)

  * Retry dependencies of uncacheable nodes a few times to get a result, until we are exhausted from trying too many times.
  * Bubble uncacheable node retry errors up to the user, telling them that things were changing too much.
  * Don't propagate dirtiness past uncacheable nodes when invalidating from changed roots; otherwise dirty dependents of uncacheable nodes would need to re-run.
  * Enable the engine fs watcher by default, now that it won't cause issues, and remove the execution option override from tests.
  * Use a reference to self in the stop_walk_predicate closure.
  * Invalidate often enough that the test doesn't flake.

* Add a flag to prevent the FsEventService and watchman from starting (#9487)

  * Add a `--watchman-enable` flag, and disable watchman when the flag is false.
  * Don't wait for the initial watchman event if we aren't using watchman.
  * Check invalidation watcher liveness from the scheduler service.

* Extract a `watch` crate. (#9635)

  The `watch` module directly accesses the `engine` crate's `Graph`, which makes it more challenging to test. Extract a `watch` crate which is used via a `trait Invalidatable`, implemented for the engine's `Graph` as well as independently in tests (a sketch of this trait-based decoupling follows this list). [ci skip-jvm-tests]

* Simplify Scheduler::execute and unify Graph retry (#9674)

  Both the `Graph` and the `Scheduler` implemented retry for requested Nodes, but the `Scheduler` implementation was pre-async-await and much more complicated. Unify the retry implementations into `Graph::get` (either roots or uncacheable nodes are retried), and simplify the `Scheduler`'s loop down to:

  ```
  let maybe_display_handle = Self::maybe_display_initialize(&session);
  let result = loop {
    if let Ok(res) = receiver.recv_timeout(refresh_interval) {
      break Ok(res);
    } else if let Err(e) = Self::maybe_display_render(&session, &mut tasks) {
      break Err(e);
    }
  };
  Self::maybe_display_teardown(session, maybe_display_handle);
  result
  ```

  A single, more modern retry implementation (thanks @hrfuller!), and a cleaner `Scheduler::execute` loop.

* Move file invalidation handling to rust (#9636)

  This PR; its full description appears below.

* Pin tokio at exactly 0.2.13.

* Fix lint issues.

* Fix mypy typing issues.

* Move away from the debounced notify watcher #9754

* Remove a test that has raced file invalidation ever since the notify backend was added, but which will now fairly consistently lose that race. As explained in the comment: we can no longer create duplicate parallel BUILD files and hope that pants does not notice them before we scan the directory again!

[ci skip-jvm-tests]

Co-authored-by: Stu Hood <[email protected]>
Co-authored-by: Stu Hood <[email protected]>
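As promised above, a sketch of the `Invalidatable` indirection from #9635, which is what lets the `watch` crate be tested without the engine. The names and the exact signature here are assumptions based on the commit message, not the crate's real definition:

```rust
use std::collections::HashSet;
use std::path::PathBuf;
use std::sync::Mutex;

/// The watcher invalidates "something" when paths change; depending on this
/// trait rather than on the engine's Graph keeps the `watch` crate decoupled.
pub trait Invalidatable: Send + Sync {
    /// Invalidate state derived from the given paths, returning a count of
    /// affected entries.
    fn invalidate(&self, paths: &HashSet<PathBuf>) -> usize;
}

/// A test double that just records what would have been invalidated.
#[derive(Default)]
struct RecordingInvalidatable {
    invalidated: Mutex<Vec<PathBuf>>,
}

impl Invalidatable for RecordingInvalidatable {
    fn invalidate(&self, paths: &HashSet<PathBuf>) -> usize {
        let mut recorded = self.invalidated.lock().unwrap();
        recorded.extend(paths.iter().cloned());
        paths.len()
    }
}
```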
### Problem

A few different kinds of file watching span the boundary between the `SchedulerService` and the `FSEventService`:

1. pantsd invalidation globs - how `pantsd` detects that its implementing code or config has changed
2. pidfile - watches `pantsd`'s pidfile to ensure that the daemon exits if it loses exclusivity
3. graph invalidation - any files changing in the workspace should invalidate the engine's `Graph`
4. `--loop` - implemented directly in the `SchedulerService`

Because of the semi-cyclic relationship between the `SchedulerService` and the `FSEventService`, it's challenging to understand the interplay of these use cases. And, unsurprisingly, that led to the `notify` crate implementation satisfying only one of them.

### Solution

The fundamental change in this PR is to add support for two new parameters to engine executions, implemented by the `Graph`:

* `poll: bool` - When `poll` is enabled, a `product_request` will wait for the requested Nodes to have changed from their last-observed values before returning. When a poll waits, an optional `poll_delay` is applied before it returns, to "debounce" polls.
* `timeout: Optional[Duration]` - When a `timeout` is set, a `product_request` will wait up to the given duration for the requested Node(s) to complete (including any time spent `poll`ing).

These features are then used by:

* `--loop`: uses `poll` (with a `poll_delay`, but without a `timeout`) to immediately re-run a `Goal` when its inputs have changed.
* invalidation globs and pidfile watching: use `poll` (with no `poll_delay`) and `timeout` to block their `SchedulerService` thread and watch for changes to those files.

A sketch of these two call shapes follows the Result section below.

### Result

The `FSEventService` and the `SchedulerService` are decoupled, and each now interacts only with the `Scheduler`: the `FSEventService` pushes `watchman` events to the `Graph`, and the `SchedulerService` pulls invalidation information from the `Graph`.

Because all events now flow through the `Graph`, the `notify` crate has reached feature parity with `watchman`. In followup changes we can remove the experimental flag, disable `watchman` (and thus the `FSEventService`) by default, and remove the dependency between `--loop` and `pantsd`.
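As promised above, a toy model of the two call shapes. `observe` stands in for a `product_request` with `poll` enabled; it is a self-contained illustration of the semantics described in the Solution section, not the engine's API:

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Block until `value()` differs from `last_seen`, or until `timeout` (if
/// any) elapses. On a change, apply the optional `poll_delay` first to
/// debounce rapid successive changes.
fn observe(
    last_seen: u64,
    value: impl Fn() -> u64,
    poll_delay: Option<Duration>,
    timeout: Option<Duration>,
) -> Option<u64> {
    let deadline = timeout.map(|t| Instant::now() + t);
    loop {
        if value() != last_seen {
            if let Some(delay) = poll_delay {
                sleep(delay); // the "debounce" described above
            }
            return Some(value());
        }
        if deadline.map_or(false, |d| Instant::now() >= d) {
            return None; // timed out without observing a change
        }
        sleep(Duration::from_millis(10)); // busy-wait in the toy only
    }
}

fn main() {
    let start = Instant::now();
    let seconds = || start.elapsed().as_secs();

    // `--loop` shape: a poll_delay but no timeout, so this blocks until the
    // value changes, then debounces before returning.
    let looped = observe(0, seconds, Some(Duration::from_millis(50)), None);

    // Invalidation-glob/pidfile shape: no poll_delay, but a timeout, so the
    // blocked thread wakes up periodically even when nothing has changed.
    let watched = observe(seconds(), seconds, None, Some(Duration::from_millis(200)));

    println!("looped={:?} watched={:?}", looped, watched);
}
```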