Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move file invalidation handling to rust #9636

Merged

Conversation

stuhood
Copy link
Member

@stuhood stuhood commented Apr 26, 2020

Problem

A few different kinds of file watching span the boundary between the SchedulerService and FSEventService:

  1. pantsd invalidation globs - how pantsd detects that its implementing code or config has changed
  2. pidfile - watches pantsd's pidfile to ensure that the daemon exits if it loses exclusivity
  3. graph invalidation - any files changing in the workspace should invalidate the engine's Graph
  4. --loop - implemented directly in the SchedulerService

Because of the semi-cyclic nature of the relationship between the SchedulerService and FSEventService, it's challenging to understand the interplay of these usecases. And, unsurprisingly, that lead to the notify crate implementation only satisfying one of them.

Solution

The fundamental change in this PR is to add support for two new parameters to engine executions which are implemented by the Graph:

  • poll: bool - When poll is enabled, a product_request will wait for the requested Nodes to have changed from their last-observed values before returning. When a poll waits, an optional poll_delay is applied before it returns to "debounce" polls.
  • timeout: Optional[Duration] - When a timeout is set, a product_request will wait up to the given duration for the requested Node(s) to complete (including any time polling).

These features are then used by:

  • --loop: uses poll (with a poll_delay, but without a timeout) to immediately re-run a Goal when its inputs have changed.
  • invalidation globs and pidfile watching: use poll (with no poll_delay) and timeout to block their SchedulerService thread and watch for changes to those files.

Result

The FSEventService and SchedulerService are decoupled, and each now interacts only with the Scheduler: FSEventService to push watchman events to the Graph, and the SchedulerService to pull invalidation information from the Graph.

Because all events now flow through the Graph, the notify crate has reached feature parity with watchman.

In followup changes we can remove the experimental flag, disable watchman (and thus the FSEventService) by default, and remove the dependency between --loop and pantsd.

@stuhood

This comment has been minimized.

@stuhood stuhood force-pushed the stuhood/pantsd-invalidation-in-rust branch from 0bef48b to efcfdc1 Compare April 26, 2020 18:52
@stuhood stuhood marked this pull request as draft April 26, 2020 22:38
@stuhood

This comment has been minimized.

@stuhood stuhood force-pushed the stuhood/pantsd-invalidation-in-rust branch 2 times, most recently from 3a8935a to 0903f71 Compare April 29, 2020 06:51
@stuhood stuhood changed the title Move pantsd invalidation globs to rust Move file invalidation handling to rust Apr 29, 2020
@stuhood stuhood force-pushed the stuhood/pantsd-invalidation-in-rust branch 3 times, most recently from 4c12d90 to 563a1de Compare April 30, 2020 22:43
@stuhood

This comment has been minimized.

@stuhood stuhood force-pushed the stuhood/pantsd-invalidation-in-rust branch 2 times, most recently from b15d3d6 to cb237a3 Compare May 1, 2020 20:44
@stuhood stuhood marked this pull request as ready for review May 1, 2020 20:45
@stuhood
Copy link
Member Author

stuhood commented May 1, 2020

Ok, this is finally ready for review. The commits are useful to look at independently.

@stuhood stuhood force-pushed the stuhood/pantsd-invalidation-in-rust branch 2 times, most recently from f58dcbe to ab561a1 Compare May 2, 2020 04:43
// complete. The second request will check whether the roots have changed, and if they haven't
// changed, will wait until they have (or until the timeout elapses) before re-requesting them.
//
// TODO: The `poll`, `poll_delay`, and `timeout` parameters exist to support a coarse-grained API
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, not going to lie: I'm not happy with how large this API ended up being. In a near-term followup, I'd like to move (the loop around) run_goal_rules into the rust Scheduler. But right now that would interfere with Greg's UI work.

@stuhood stuhood requested review from hrfuller and removed request for Eric-Arellano May 2, 2020 19:25
@stuhood stuhood force-pushed the stuhood/pantsd-invalidation-in-rust branch from ab561a1 to faa4eb4 Compare May 2, 2020 20:44
@stuhood stuhood added this to the 1.26.x milestone May 2, 2020
@stuhood stuhood force-pushed the stuhood/pantsd-invalidation-in-rust branch from faa4eb4 to 3998e08 Compare May 2, 2020 21:48
stuhood and others added 4 commits May 3, 2020 09:24
…ation, and add long-polling of invalidation globs to SchedulerService.

[ci skip-jvm-tests]
…:SessionId, and rename to RunId to avoid overloading with the engine's Session.

[ci skip-jvm-tests]
@stuhood stuhood force-pushed the stuhood/pantsd-invalidation-in-rust branch from 3998e08 to 90614ba Compare May 3, 2020 16:29
Copy link
Member

@hrfuller hrfuller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM Stu. Thanks.

let mut inner = self.0.lock();
let (ignorer, canonical_build_root, liveness_sender, watch_receiver) = inner
.background_task_inputs
.take()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This moves the contents out of the Option right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. The next line asserts that start is not called more than once.

.spawn_blocking(move || watcher_lock.watch(path, RecursiveMode::NonRecursive))
.await
}
Err(()) => Err(notify::Error::new(notify::ErrorKind::Generic(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So locking is infallible now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched to parking_lot because I couldn't reproduce the issue you were seeing.

/// currently to clean it).
fn poll_should_wait(&self, context: &N::Context) -> bool {
match self {
EntryResult::Uncacheable(_, session_id) => context.session_id() == session_id,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are polling an uncacheable node and the result was for the same session we are currently in, we wait for the uncacheable node to finish?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

poll waits for something to change (unlike waiters, which are about waiting for something to finish). If the uncacheable node matches our session/run-id, then it is already up to date.

@stuhood stuhood merged commit 9979b78 into pantsbuild:master May 6, 2020
@stuhood stuhood deleted the stuhood/pantsd-invalidation-in-rust branch May 6, 2020 23:19
hrfuller pushed a commit to twitter/pants that referenced this pull request May 11, 2020
A few different kinds of file watching span the boundary between the `SchedulerService` and `FSEventService`:
1. pantsd invalidation globs - how `pantsd` detects that its implementing code or config has changed
2. pidfile - watches `pantsd`'s pidfile to ensure that the daemon exits if it loses exclusivity
3. graph invalidation - any files changing in the workspace should invalidate the engine's `Graph`
4. `--loop` - implemented directly in the `SchedulerService`

Because of the semi-cyclic nature of the relationship between the `SchedulerService` and `FSEventService`, it's challenging to understand the interplay of these usecases. And, unsurprisingly, that lead to the `notify` crate implementation only satisfying one of them.

The fundamental change in this PR is to add support for two new parameters to engine executions which are implemented by the `Graph`:
* `poll: bool` - When `poll` is enabled, a `product_request` will wait for the requested Nodes to have changed from their last-observed values before returning. When a poll waits, an optional `poll_delay` is applied before it returns to "debounce" polls.
* `timeout: Optional[Duration]` - When a `timeout` is set, a `product_request` will wait up to the given duration for the requested Node(s) to complete (including any time `poll`ing).

These features are then used by:
* `--loop`: uses `poll` (with a `poll_delay`, but without a `timeout`) to immediately re-run a `Goal` when its inputs have changed.
* invalidation globs and pidfile watching: use `poll` (with no `poll_delay`) and `timeout` to block their `SchedulerService` thread and watch for changes to those files.

The `FSEventService` and `SchedulerService` are decoupled, and each now interacts only with the `Scheduler`: `FSEventService` to push `watchman` events to the `Graph`, and the `SchedulerService` to pull invalidation information from the `Graph`.

Because all events now flow through the `Graph`, the `notify` crate has reached feature parity with `watchman`.

In followup changes we can remove the experimental flag, disable `watchman` (and thus the `FSEventService`) by default, and remove the dependency between `--loop` and `pantsd`.
stuhood pushed a commit that referenced this pull request May 14, 2020
### Problem

We're on a slightly older version of the `notify` crate because we wanted to use the "debounced" API, which has a thread that attempts to batch events and dedupe them where possible. But the implementation of debouncing has issues like notify-rs/notify#205.

### Solution

Move away from the debounced watcher towards the trustier direct delivery of events: post #9636 we do our own debouncing via the Graph not double-cleaning `Nodes`, and via the `--loop` delay.

Additionally, fix two cases where code and tests used to race pants' invalidation and assume (intentionally or otherwise) that pants would not notice created files. 

### Result

Reduced risk of lost `notify` events.

[ci skip-jvm-tests]
stuhood pushed a commit that referenced this pull request May 14, 2020
### Problem

We're on a slightly older version of the `notify` crate because we wanted to use the "debounced" API, which has a thread that attempts to batch events and dedupe them where possible. But the implementation of debouncing has issues like notify-rs/notify#205.

### Solution

Move away from the debounced watcher towards the trustier direct delivery of events: post #9636 we do our own debouncing via the Graph not double-cleaning `Nodes`, and via the `--loop` delay.

Additionally, fix two cases where code and tests used to race pants' invalidation and assume (intentionally or otherwise) that pants would not notice created files. 

### Result

Reduced risk of lost `notify` events.

[ci skip-jvm-tests]
hrfuller pushed a commit that referenced this pull request May 15, 2020
* Port to tokio 0.2, and to stdlib futures for fs and task_executor (#9071)

We're on an older version of tokio, which doesn't smoothly support usage of async/await.

Switch to tokio 0.2, which supports directly spawning and awaiting (via its macros) stdlib futures, which is an important step toward being able to utilize async/await more broadly. Additionally, port the `fs` and `task_executor` crates to stdlib futures.

Finally, transitively fixup for the new APIs: in particular, since both `task_executor` and `tokio` now consume stdlib futures to spawn tasks, we switch all relevant tests and main methods to use the `tokio::main` and `tokio::test` macros, which annotate async methods and spawn a runtime to allow for `await`ing futures inline.

Progress toward more usage of async/await!

* Add notify fs watcher to engine. (#9318)

* Use the notify crate to implement an `InvalidationWatcher` for Graph operations.

* Make watch async, and watcher log to pantsd.log.
Relativize paths returned from notify to the build_root.
Refactor invalidate method to be an associated method on the
InvalidationWatcher.

* Respond to feedback.
* Use spawn on io pool instead of custom future impl
* Write python fs tests
* Relativize paths to invalidate to build root
* invalidate nodes with parent paths.
* Comments

* Add rust tests.
Make some things public so we can use them in tests.
Use canonical path to build root for relativizing changed paths.

* Refactor Python tests.
Return watch errors as core::Failure all the way to user.
Move task executor onto invalidation watcher.
Move test_support trait impl into test_support mod.

* use futures lock on watcher

* Platform specific watching behavior. On Darwin recursively watch the
build root at startup. On Linux watch individual directory roots.

Co-authored-by: Stu Hood <[email protected]>

* Ignore notify events for pants_ignore patterns. (#9406)

* Create a git ignorer on the context object. Adjust all call sites which
create a posix fs to pass in an ignorer.

* Ignore fsevents from files that match pants_ignore patterns.

* Always pass is_dir = false to ignorer to avoid stat-ing every path the
event watch thread sees.

* Add a feature gate to disable the engine fs watcher introduced in #9318 (#9416)

* Add a feature gate to disable the engine fs watcher introduced in #9318
by default, to mitigate issues seen in #9415 until a fix is in place.

* Don't rerun uncachable nodes if they are dirtied while running. (#9452)

* Don't rerun uncachable nodes if they are dirtied while running.
- Retry dependencies of uncacheable nodes a few times to get a result
  until we are exhausted from trying too many times.
- Bubble uncacheable node retry errors up to the user, tell them things
  were chaning to much.
- Don't propagate dirtiness past uncacheable nodes when invalidating
  from changed roots. Otherwise dirty dependents of uncacheable nodes
  will need to re-run.

* enable the engine fs watcher by default, now that it won't cause issues.
Remove execution option override from tests.

* use reference to self in stop_walk_predicate closure

* invalidate often enough that test doesn't flake

* Add a flag to prevent the FsEventService and watchman from starting (#9487)

* add --watchman-enable flag
* disable watchman when flag is false
* Don't wait for the initial watchman event if we aren't using watchman.
* check invalidation watcher liveness from scheduler service

* Extract a `watch` crate. (#9635)

The `watch` module directly accesses the `engine` crate's `Graph`, which makes it more challenging to test.

Extract a `watch` crate which is used via an `trait Invalidatable` which is implemented for the engine's `Graph`, as well as independently in tests.

[ci skip-jvm-tests]

* Simplify Scheduler::execute and unify Graph retry (#9674)

Both the `Graph` and the `Scheduler` implemented retry for requested Nodes, but the `Scheduler` implementation was pre-async-await and much more complicated.

Unify the retry implementations into `Graph::get` (either roots or uncacheable nodes are retried), and simplify the `Scheduler`'s loop down to:
```
let maybe_display_handle = Self::maybe_display_initialize(&session);
let result = loop {
  if let Ok(res) = receiver.recv_timeout(refresh_interval) {
    break Ok(res);
  } else if let Err(e) = Self::maybe_display_render(&session, &mut tasks) {
    break Err(e);
  }
};
Self::maybe_display_teardown(session, maybe_display_handle);
result
```

A single, more modern retry implementation (thanks @hrfuller!), and a cleaner `Scheduler::execute` loop.

* Move file invalidation handling to rust (#9636)

A few different kinds of file watching span the boundary between the `SchedulerService` and `FSEventService`:
1. pantsd invalidation globs - how `pantsd` detects that its implementing code or config has changed
2. pidfile - watches `pantsd`'s pidfile to ensure that the daemon exits if it loses exclusivity
3. graph invalidation - any files changing in the workspace should invalidate the engine's `Graph`
4. `--loop` - implemented directly in the `SchedulerService`

Because of the semi-cyclic nature of the relationship between the `SchedulerService` and `FSEventService`, it's challenging to understand the interplay of these usecases. And, unsurprisingly, that lead to the `notify` crate implementation only satisfying one of them.

The fundamental change in this PR is to add support for two new parameters to engine executions which are implemented by the `Graph`:
* `poll: bool` - When `poll` is enabled, a `product_request` will wait for the requested Nodes to have changed from their last-observed values before returning. When a poll waits, an optional `poll_delay` is applied before it returns to "debounce" polls.
* `timeout: Optional[Duration]` - When a `timeout` is set, a `product_request` will wait up to the given duration for the requested Node(s) to complete (including any time `poll`ing).

These features are then used by:
* `--loop`: uses `poll` (with a `poll_delay`, but without a `timeout`) to immediately re-run a `Goal` when its inputs have changed.
* invalidation globs and pidfile watching: use `poll` (with no `poll_delay`) and `timeout` to block their `SchedulerService` thread and watch for changes to those files.

The `FSEventService` and `SchedulerService` are decoupled, and each now interacts only with the `Scheduler`: `FSEventService` to push `watchman` events to the `Graph`, and the `SchedulerService` to pull invalidation information from the `Graph`.

Because all events now flow through the `Graph`, the `notify` crate has reached feature parity with `watchman`.

In followup changes we can remove the experimental flag, disable `watchman` (and thus the `FSEventService`) by default, and remove the dependency between `--loop` and `pantsd`.

* pin tokio at exactly 0.2.13

* fix lint issues

* fix mypy typing issues

* Move away from the debounced notify watcher #9754

* Remove test that has raced file invalidation ever since the notify backend was added, but which will now fairly consistently lose that race.
# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]

* As explained in the comment: we can no longer create duplicate parallel BUILD files and hope that pants does not notice them before we scan the directory again!
# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]

Co-authored-by: Stu Hood <[email protected]>
Co-authored-by: Stu Hood <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants