Implement the WaitExecution #177
Conversation
Force-pushed from 9ad6c6d to c91aaa5 (compare)
If you think it's better to copy this code into the two methods below rather than this abomination of generics, let me know. However, I kinda just wanted to do it to see if I could...
Reviewed 1 of 8 files at r1.
Reviewable status: 1 of 8 files reviewed, 2 unresolved discussions (waiting on @chrisstaite-menlo)
cas/grpc_service/execution_server.rs
line 166 at r3 (raw file):
}

pub fn into_services(self) -> (Server<ExecutionServer>, OperationsServer<ExecutionServer>) {
At first I was thinking, this should be a separate service and not share with ExecutionServer. Then I was like... "Nah, it makes sense to put them together"... Then I started reading the API and now I think they should be separate.
ExecutionServer only serves execution requests. OperationServer is designed to serve other kinds of operations too like "WaitForWorkerToComeOnline" or "DrainAndShutdown".
Because of this, and the unlikely but possible additional features, let's break this into its own grpc_service. All our operations will happen on the scheduler anyway, so interacting should be easy, since the scheduler can be referenced in both services.
Reviewable status: 1 of 8 files reviewed, 3 unresolved discussions (waiting on @chrisstaite-menlo)
cas/grpc_service/execution_server.rs
line 224 at r3 (raw file):
// Very inefficient having to search through all instances, however
// the protobuf spec doesn't state that the instance_name can be
// retrieved from the path.
Actually it can be retrieved from the path if it matches as outlined here:
https://github.com/allada/turbo-cache/blob/master/proto/build/bazel/remote/execution/v2/remote_execution.proto#L124
Reviewable status: 1 of 8 files reviewed, 3 unresolved discussions (waiting on @chrisstaite-menlo)
cas/grpc_service/execution_server.rs
line 224 at r3 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
Actually it can be retrieved from the path if it matches as outlined here:
https://github.com/allada/turbo-cache/blob/master/proto/build/bazel/remote/execution/v2/remote_execution.proto#L124
Wait, no... We choose the name, so we choose the filter. In our case we can require any kind of filter format we want.
Reviewed all commit messages.
Reviewable status: 1 of 8 files reviewed, 5 unresolved discussions (waiting on @chrisstaite-menlo)
cas/grpc_service/execution_server.rs
line 216 at r3 (raw file):
}

async fn lookup_action<'a, Fut, F, G, R, S>(&'a self, operation: F, mut filter: G) -> Result<S, Status>
nit: Let's rename this to filter_actions() and add a limit argument. This will make it trivial to implement ListOperations() in the future.
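For illustration only, a much-simplified, non-async sketch of that shape; `filter_actions`, the generic `I` instance type, and the `limit` semantics are assumptions here, not the code in this PR:

```rust
use tonic::Status;

/// Sketch of a `filter_actions()`-style helper with a `limit` argument.
/// `I` stands in for whatever per-instance state the server keeps.
fn filter_actions<I, S>(
    instances: &[I],
    mut filter: impl FnMut(&I) -> Option<S>,
    limit: Option<usize>,
) -> Result<Vec<S>, Status> {
    let mut results = Vec::new();
    for instance in instances {
        if let Some(item) = filter(instance) {
            results.push(item);
            // Returning early once `limit` is reached is what would make a
            // future ListOperations() implementation trivial.
            if limit.map_or(false, |l| results.len() >= l) {
                return Ok(results);
            }
        }
    }
    Ok(results)
}
```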
cas/grpc_service/execution_server.rs
line 225 at r3 (raw file):
// the protobuf spec doesn't state that the instance_name can be
// retrieved from the path.
self.instance_infos
nit: This is super confusing; the Stream API makes it hard to understand what is happening. Can we expand this into a `while let Some(item) = stream.next().await {` loop?
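For illustration, the expanded form might look roughly like this; the stream item type and `target_name` are placeholders, not the PR's actual types:

```rust
use futures::StreamExt;

/// Sketch: walk a stream of (instance name, item) pairs with an explicit loop
/// instead of a combinator chain, so the control flow is easy to follow.
async fn find_match<S, T>(mut stream: S, target_name: &str) -> Option<T>
where
    S: futures::Stream<Item = (String, T)> + Unpin,
{
    while let Some((name, item)) = stream.next().await {
        if name == target_name {
            return Some(item);
        }
    }
    None
}
```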
Reviewable status: 1 of 8 files reviewed, 6 unresolved discussions (waiting on @chrisstaite-menlo)
a discussion (no related file):
Give me some time on this, I'm going to hack up your branch to get a better feel for it. There's a lot going on and I want to see how it goes.
Right now I'm leaning towards the operation service being completely separate from the execution service. Requiring a configuration and all.
Reviewable status: 1 of 8 files reviewed, 5 unresolved discussions (waiting on @chrisstaite-menlo)
cas/grpc_service/execution_server.rs
line 166 at r3 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
At first I was thinking, this should be a separate service and not share with ExecutionServer. Then I was like... "Nah, it makes sense to put them together"... Then I started reading the API and now I think they should be separate.
ExecutionServer only serves execution requests. OperationServer is designed to serve other kinds of operations too like "WaitForWorkerToComeOnline" or "DrainAndShutdown".
Because of this, and the unlikely but possible additional features, let's break this into its own grpc_service. All our operations will happen on the scheduler anyway, so interacting should be easy, since the scheduler can be referenced in both services.
Disregard. I experimented with this idea for a few hours and came to the conclusion we should not try to recycle Operation for anything but BRE-required stuff. It is a decent interface, but tonic doesn't really support HTTP1 + JSON endpoints yet. It's close, but I did some testing with tonic-web and it didn't really work for what I wanted, since it just encodes everything to a proto and requires input as a proto.
Reviewable status: 1 of 8 files reviewed, 7 unresolved discussions (waiting on @chrisstaite-menlo)
cas/cas_main.rs
line 217 at r3 (raw file):
.map_or(Ok(None), |cfg| {
    ExecutionServer::new(&cfg, &action_schedulers, &store_manager).map(|v| {
        let (mut execution_service, mut operations_service) = v.into_services();
I tried really really hard to make this interface not suck. Well I couldn't figure out a way to abstract this out without making a new config, which is pointless for us.
cas/grpc_service/execution_server.rs
line 241 at r3 (raw file):
let name = request.into_inner().name;
self.lookup_action(
    |instance_info| instance_info.scheduler.find_existing_action(&name),
Since this file is the gateway to all things execution related, let's rewrite all output Operations so they have the instance name prefixed to the operation name. Then just look up which scheduler it should be on here.
Reviewable status: 1 of 8 files reviewed, 8 unresolved discussions (waiting on @chrisstaite-menlo)
cas/grpc_service/execution_server.rs
line 237 at r3 (raw file):
async fn inner_wait_execution(
    &self,
    request: Request<WaitExecutionRequest>,
nit: Let's just pass in name as a string here and then this function is 100% compatible with wait_operation().
Reviewable status: 1 of 8 files reviewed, 9 unresolved discussions (waiting on @chrisstaite-menlo)
cas/grpc_service/execution_server.rs
line 308 at r3 (raw file):
}

async fn cancel_operation(&self, request: Request<CancelOperationRequest>) -> Result<Response<()>, Status> {
I'm really struggling on this one. Do we need/want to implement cancel_operation? I understand removing delete_operation, since it allows the scheduler to stop the action if it wants. But from what I can tell in the documentation, cancel_operation pretty much says: "Kill it, don't ask questions" and gives little flexibility.
The reason I'm worried here is that the scheduler multiplexes the actions, meaning if two requests from two different clients come in for the same action, it will only execute the action one time and send the results to both clients. If we implement cancel_action we are allowing any client to terminate another client's action.
If the argument is made that that's an implementation detail, that the client should not be able to see other clients' jobs, and that it should only terminate the job if no one else is waiting on that job, I feel that's more of what DeleteOperation is for. But then again, DeleteOperation might be more about actions that change external state (like uploading files, installing new software, etc.), which is why the client would not care about the results.
Previously, allada (Nathan (Blaise) Bruer) wrote…
I think I'm going to drop the operations service. It was mostly for the use case that someone changes a header that's used in a lot of files and it's bad, so cancels the build job but we still execute a bunch of jobs that are going to fail. However, I don't think it really gains us all that much, so I think I'm going to drop it and we can bring it in a separate branch later if we want to.
Force-pushed from c91aaa5 to 84e99a2 (compare)
Reviewable status: 0 of 8 files reviewed, 5 unresolved discussions (waiting on @allada)
cas/grpc_service/execution_server.rs
line 224 at r3 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
Wait, no... We choose the name, so we choose the filter. In our case we can require any kind of filter format we want.
Think that falls down with the GrpcScheduler, right?
cas/grpc_service/execution_server.rs
line 237 at r3 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit: Let's just pass in name as a string here and then this function is 100% compatible with wait_operation().
Got rid of operation server, so probably not worth doing.
cas/grpc_service/execution_server.rs
line 241 at r3 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
Since this file is the gateway to all things execution related, let's rewrite all output Operations so they have the instance name prefixed to the operation name. Then just look up which scheduler it should be on here.
Got rid of the operations server...
cas/grpc_service/execution_server.rs
line 308 at r3 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
I'm really struggling on this one. Do we need/want to implement cancel_operation? I understand removing delete_operation, since it allows the scheduler to stop the action if it wants. But from what I can tell in the documentation, cancel_operation pretty much says: "Kill it, don't ask questions" and gives little flexibility.
The reason I'm worried here is that the scheduler multiplexes the actions, meaning if two requests from two different clients come in for the same action, it will only execute the action one time and send the results to both clients. If we implement cancel_action we are allowing any client to terminate another client's action.
If the argument is made that that's an implementation detail, that the client should not be able to see other clients' jobs, and that it should only terminate the job if no one else is waiting on that job, I feel that's more of what DeleteOperation is for. But then again, DeleteOperation might be more about actions that change external state (like uploading files, installing new software, etc.), which is why the client would not care about the results.
Done.
Force-pushed from 6b5f80f to e6ff75b (compare)
Reviewed 1 of 9 files at r4.
Reviewable status: 1 of 9 files reviewed, 6 unresolved discussions (waiting on @chrisstaite-menlo)
-- commits
line 1 at r4:
I think something went wrong in your rebase.
Force-pushed from 04b492a to 2c81ef1 (compare)
Previously, allada (Nathan (Blaise) Bruer) wrote…
Whoops....
Reviewable status: 1 of 10 files reviewed, 3 unresolved discussions (waiting on @chrisstaite-menlo)
a discussion (no related file):
Previously, chrisstaite-menlo (Chris Staite) wrote…
I think I'm going to drop the operations service. It was mostly for the use case that someone changes a header that's used in a lot of files and it's bad, so cancels the build job but we still execute a bunch of jobs that are going to fail. However, I don't think it really gains us all that much, so I think I'm going to drop it and we can bring it in a separate branch later if we want to.
Yeah, actually this brings up a feature I was planning on doing but haven't yet: a config that times out jobs. This is common for these kinds of services, where after some configured time, let's say 60 seconds after all clients disconnect from a job, it will terminate the job.
To do this we'll need to wrap add_action()'s watch::Receiver<Arc<ActionState>> with a wrapper that, if dropped, will notify the scheduler that a client disconnected, so it can start a counter if it was the last one connected.
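A minimal sketch of that wrapper idea; `ActionState`, the struct name, and the unbounded disconnect channel are all placeholders rather than the PR's actual types:

```rust
use std::ops::{Deref, DerefMut};
use std::sync::Arc;
use tokio::sync::{mpsc, watch};

struct ActionState; // placeholder for the real scheduler state type

/// Wraps the `watch::Receiver` handed out by `add_action()` so that dropping
/// the client's subscription tells the scheduler a listener went away.
struct ClientActionStateReceiver {
    inner: watch::Receiver<Arc<ActionState>>,
    // Unbounded so the notification can be sent from `drop()` without awaiting.
    disconnect_tx: mpsc::UnboundedSender<()>,
}

impl Drop for ClientActionStateReceiver {
    fn drop(&mut self) {
        // Ignore errors: if the scheduler is already gone there is nothing to do.
        let _ = self.disconnect_tx.send(());
    }
}

impl Deref for ClientActionStateReceiver {
    type Target = watch::Receiver<Arc<ActionState>>;
    fn deref(&self) -> &Self::Target {
        &self.inner
    }
}

impl DerefMut for ClientActionStateReceiver {
    fn deref_mut(&mut self) -> &mut Self::Target {
        &mut self.inner
    }
}
```

On the scheduler side, the receiving end of `disconnect_tx` could start the configured countdown once the last outstanding receiver for a job is dropped.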
cas/grpc_service/execution_server.rs
line 224 at r3 (raw file):
Previously, chrisstaite-menlo (Chris Staite) wrote…
Think that falls down with the GrpcScheduler, right?
I don't think so. If this is a problem in GrpcScheduler it'll just need to do an instance_name -> instance_name mapping.
We need to make the name of the operation contain the instance name. Something like:
{instance_name}/{action_info_hash}
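For illustration, composing and splitting such a name could look like this; the function names are made up, and splitting on the last '/' is just one way to cope with instance names that themselves contain slashes:

```rust
/// Sketch: build an operation name that carries the instance name, so the
/// server can route WaitExecution/GetOperation calls to the right scheduler.
fn make_operation_name(instance_name: &str, action_info_hash: &str) -> String {
    format!("{}/{}", instance_name, action_info_hash)
}

/// Split the name back apart; returns None if there is no '/' at all.
/// Uses rsplit_once because instance names may contain '/' themselves.
fn parse_operation_name(name: &str) -> Option<(&str, &str)> {
    name.rsplit_once('/')
}
```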
Reviewable status: 1 of 10 files reviewed, 2 unresolved discussions (waiting on @chrisstaite-menlo)
cas/grpc_service/execution_server.rs
line 224 at r3 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
I don't think so. If this is a problem in GrpcScheduler it'll just need to do an instance_name -> instance_name mapping.
We need to make the name of the operation contain the instance name. Something like:
{instance_name}/{action_info_hash}
I think #186 will solve much of this.
Force-pushed from 2c81ef1 to cc51fe5 (compare)
Reviewable status: 1 of 21 files reviewed, 2 unresolved discussions (waiting on @allada)
a discussion (no related file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
Yeah, actually this brings up a feature I was planning on doing but haven't yet: a config that times out jobs. This is common for these kinds of services, where after some configured time, let's say 60 seconds after all clients disconnect from a job, it will terminate the job.
To do this we'll need to wrap add_action()'s watch::Receiver<Arc<ActionState>> with a wrapper that, if dropped, will notify the scheduler that a client disconnected, so it can start a counter if it was the last one connected.
It will also need to be cancelled on a new subscribe().
cas/grpc_service/execution_server.rs
line 224 at r3 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
I think #186 will solve much of this.
Done.
Force-pushed from cc51fe5 to a675e66 (compare)
Reviewable status: 1 of 21 files reviewed, 2 unresolved discussions (waiting on @chrisstaite-menlo)
a discussion (no related file):
Previously, chrisstaite-menlo (Chris Staite) wrote…
It will also need to be cancelled on a new subscribe().
Does it? If we ever get to a point where a user wants to subscribe() and the task happens to be running, we always feed the results of the running job.
Previously, allada (Nathan (Blaise) Bruer) wrote…
This situation:
Dismissed @allada from a discussion.
Reviewable status: 18 of 21 files reviewed, 8 unresolved discussions (waiting on @allada and @chrisstaite)
cas/grpc_service/execution_server.rs
line 213 at r8 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit: `&request.into_inner().name` looks better.
the trait `From<&String>` is not implemented for `ActionInfoHashKey`
cas/grpc_service/execution_server.rs
line 219 at r8 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
I wounder if it'd be a good idea to then try and find the item in the action cache before rejecting it?
I don't think the spec says anything about this, but I think it makes sense. From looking at the code, we'd need to modify the scheduler so before it fully drops
ActionResult
, it'll upload it to the CAS (not AC).No need to do this now, but if you don't want to do it, lets make a TODO. The cool part is that it'll allow users to run an action, go away, and come back later to get the results (assuming the timeout is properly permissive).
Isn't the idea that if you do that you should just look it up in the Action Cache yourself?
cas/scheduler/cache_lookup_scheduler.rs
line 158 at r8 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
There's a bit of a race condition here because we can have two clients at this point trying to add the same unique_qualifier. This would cause some very bizarre things to happen. We should do a cache lookup here first, within the same lock, and feed the existing item if it exists.
Done.
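A minimal sketch of the check-then-insert-under-one-lock pattern being asked for here; every type and field name is a placeholder, not the actual cache_lookup_scheduler code:

```rust
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::{watch, Mutex};

#[derive(Clone, PartialEq, Eq, Hash)]
struct UniqueQualifier(String); // placeholder
struct ActionState; // placeholder

type Subscriptions = HashMap<UniqueQualifier, watch::Sender<Arc<ActionState>>>;

/// Do the "already in flight?" check and the insert while holding the same
/// lock, so two clients adding the same qualifier cannot race each other.
async fn subscribe_or_insert(
    subscriptions: &Mutex<Subscriptions>,
    qualifier: UniqueQualifier,
    initial_state: Arc<ActionState>,
) -> watch::Receiver<Arc<ActionState>> {
    let mut guard = subscriptions.lock().await;
    if let Some(existing_tx) = guard.get(&qualifier) {
        // A second client asked for the same action: feed it the existing channel.
        return existing_tx.subscribe();
    }
    let (tx, rx) = watch::channel(initial_state);
    guard.insert(qualifier, tx);
    rx
}
```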
cas/scheduler/cache_lookup_scheduler.rs
line 181 at r8 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
Hmmm, minor bug here, where we only notify the downstream scheduler that the client has disconnected when the worker sends a notification to the client and tx.send(action_state).is_err() triggers an error and drops the Sender, which means the downstream scheduler only then gets to know the client disconnected.
This currently is not a big problem, since it only causes a very minor amount of additional needless memory, but if/when we implement the ability for the scheduler to time out tasks because no listeners are active it'll become an issue.
This can be fixed by using a select!{} over tx.closed() and watch_stream.next(). Not sure if we should do this in this PR, but wanted to make sure it was known.
This is a good example of why I think we should try and avoid using spawns as much as possible. Granted, here we need to use a spawn (unless we change function signatures), but otherwise this would auto-propagate everything down because the future would get dropped, which would call Drop on everything downstream.
Done.
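For reference, the select!{} shape suggested above might look roughly like this; the channel types and names are assumptions and the real signatures in the PR differ:

```rust
use std::sync::Arc;
use tokio::sync::mpsc;
use tokio_stream::{wrappers::WatchStream, StreamExt};

struct ActionState; // placeholder

/// Forward scheduler updates to the client, but also notice immediately when
/// the client side of the channel is dropped.
async fn forward_updates(
    tx: mpsc::Sender<Arc<ActionState>>,
    mut watch_stream: WatchStream<Arc<ActionState>>,
) {
    loop {
        tokio::select! {
            // The client hung up: stop right away instead of waiting for the
            // next worker update to discover the broken channel.
            _ = tx.closed() => return,
            maybe_state = watch_stream.next() => {
                match maybe_state {
                    Some(state) => {
                        if tx.send(state).await.is_err() {
                            return;
                        }
                    }
                    // The scheduler dropped the watch sender; nothing more to forward.
                    None => return,
                }
            }
        }
    }
}
```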
cas/scheduler/cache_lookup_scheduler.rs
line 208 at r8 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
I created an upstream ticket for this: tokio-rs/tokio#5871. Can we make a TODO referencing this ticket?
Done.
cas/scheduler/grpc_scheduler.rs
line 42 at r8 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
nit: Looks unrelated (possibly future PR?).
This made more sense when I was adding in the OperationsClient too; now I've taken that out it makes a little less sense, but I think this is worth adding as I believe balance_list will re-try a failed connection whereas the existing code won't. Happy to move this out to another PR, but it's a bit mixed in now...
cas/scheduler/simple_scheduler.rs
line 166 at r8 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
Can we do this part of the change in a followup PR? I want to think about this one a bit because it plays into both debugging and caching. Also, I'm not sure if the scheduler should do this here or if maybe some kind of logging service might be better to solve this for the debugging case.
We also can probably solve the non-debug use case by doing what is commented on above (store the ActionResult in CAS). I am pretty sure buildbarn does this.
For fast build actions this is required for Goma, otherwise the action disappears between it calling add_action and calling wait_execution. That's why I had to implement it. We could pull it out, but both are necessary for this code to work, so it doesn't make much sense to, I don't think?
Previously, allada (Nathan (Blaise) Bruer) wrote…
Looking at the source of
Reviewable status: 14 of 22 files reviewed, 8 unresolved discussions (waiting on @chrisstaite and @chrisstaite-menlo)
cas/grpc_service/execution_server.rs
line 213 at r8 (raw file):
Previously, chrisstaite-menlo (Chris Staite) wrote…
the trait `From<&String>` is not implemented for `ActionInfoHashKey`
blah. Probably needs Borrow<&T> magic in the TryFrom trait.
cas/grpc_service/execution_server.rs
line 219 at r8 (raw file):
Previously, chrisstaite-menlo (Chris Staite) wrote…
Isn't the idea that if you do that you should just look it up in the Action Cache yourself?
Yes and no. If caching was enabled 100% of the time, yes, but in our case we can send requests that will not be saved in the action cache. In the event the no_cache flag is sent on the initial request, by design the ActionInfoHashKey will be different for every request, but if no_cache is not set, then it may use the same ActionInfoHashKey and thus the same Operation Id.
Let's not get caught up on this in this PR and if possible let's split off this part into a different PR. I will look into how other projects do this. I do remember buildbarn having a flag somewhere saying something like: "Store action results for non-cachable items in the CAS. This breaks the 'HASH(value) -> value' guarantee though".
Force-pushed from f34ef3e to facbb06 (compare)
Reviewed 3 of 5 files at r9.
Reviewable status: 16 of 25 files reviewed, 5 unresolved discussions (waiting on @chrisstaite and @chrisstaite-menlo)
cas/scheduler/grpc_scheduler.rs
line 42 at r8 (raw file):
Previously, chrisstaite-menlo (Chris Staite) wrote…
This made more sense when I was adding in the OperationsClient too; now I've taken that out it makes a little less sense, but I think this is worth adding as I believe balance_list will re-try a failed connection whereas the existing code won't. Happy to move this out to another PR, but it's a bit mixed in now...
Yeah, I agree we should use balance_list, but we need to modify the config file too, which is technically a breaking change, so it needs to be documented in the commit title.
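For illustration, building the load-balanced channel that way might look like this; the helper name and URLs are made up, and the endpoint list would come from the new config field being discussed:

```rust
use tonic::transport::{Channel, Endpoint};

/// Sketch: connect to several scheduler endpoints behind one channel.
fn connect_balanced(urls: &[&str]) -> Result<Channel, tonic::transport::Error> {
    let endpoints = urls
        .iter()
        .map(|url| Endpoint::from_shared((*url).to_string()))
        .collect::<Result<Vec<_>, _>>()?;
    // Per the discussion above, the balanced channel should keep reconnecting
    // to endpoints rather than failing permanently on the first broken connection.
    Ok(Channel::balance_list(endpoints.into_iter()))
}
```

A caller would pass something like `&["http://scheduler-a:50051", "http://scheduler-b:50051"]`; with a single entry it still goes through the balancer, which is where the retry behaviour mentioned above comes from.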
Previously, allada (Nathan (Blaise) Bruer) wrote…
Not at all, I just add a single endpoint to the |
Reviewable status: 16 of 25 files reviewed, 5 unresolved discussions (waiting on @chrisstaite and @chrisstaite-menlo)
cas/scheduler/simple_scheduler.rs
line 166 at r8 (raw file):
Previously, chrisstaite-menlo (Chris Staite) wrote…
For fast build actions this is required for Goma, otherwise the action disappears between it calling add_action and calling wait_execution. That's why I had to implement it. We could pull it out, but both are necessary for this code to work, so it doesn't make much sense to, I don't think?
I don't want to block this PR on this, especially since you are able to get Goma working. So I created a ticket: #192
Can you just make a note in the code referencing this ticket here? Something like:
// TODO(#192) Revisit if this is the best way to handle recently completed actions.
Reviewable status: 16 of 25 files reviewed, 5 unresolved discussions (waiting on @chrisstaite and @chrisstaite-menlo)
cas/scheduler/simple_scheduler.rs
line 283 at r8 (raw file):
Previously, chrisstaite-menlo (Chris Staite) wrote…
Looking at the source of HashSet, it calls self.base.get where base is hashbrown...
Can we add a comment here with something like:
// TODO: We can lookup the action without needing to iterate the maps but there's no API for it.
// Maybe when: https://github.com/rust-lang/rust/issues/56167 is done?
Previously, allada (Nathan (Blaise) Bruer) wrote…
Worth a new issue?
Force-pushed from facbb06 to 0cbd85d (compare)
Reviewable status: 16 of 25 files reviewed, 4 unresolved discussions (waiting on @allada and @chrisstaite)
cas/grpc_service/execution_server.rs
line 213 at r8 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
blah. Probably needs Borrow<&T> magic in the TryFrom trait.
Did some messing and Googling and found there's not a nice way to do this...
cas/scheduler/simple_scheduler.rs
line 166 at r8 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
I don't want to block this PR on this, especially since you are able to get Goma working. So I created a ticket: #192
Can you just make a note in the code referencing this ticket here? Something like:
// TODO(#192) Revisit if this is the best way to handle recently completed actions.
Done.
Previously, allada (Nathan (Blaise) Bruer) wrote…
I already swapped to
Reviewable status: 16 of 25 files reviewed, 4 unresolved discussions (waiting on @chrisstaite and @chrisstaite-menlo)
cas/scheduler/simple_scheduler.rs
line 283 at r8 (raw file):
Previously, chrisstaite-menlo (Chris Staite) wrote…
I already swapped to hashbrown. Do you think I should put it back with an issue?
Wait, it was really that easy!?!?!
cas/scheduler/simple_scheduler.rs
line 168 at r12 (raw file):
// keep their completion state around for a while to send back.
// TODO(#192) Revisit if this is the best way to handle recently completed actions.
recently_completed_actions: Vec<Arc<ActionState>>,
Can't this be a HashSet now so we can just grab it without iterating the vector?
cas/scheduler/simple_scheduler.rs
line 255 at r12 (raw file):
self.recently_completed_actions.retain(|action| match &action.stage {
    ActionStage::Completed(state) => {
        if state.execution_metadata.worker_completed_timestamp > expiry_time {
I think we should wrap ActionStage so we can store the timestamp, because I don't think we should blindly trust workers, and I think we should hold on to non-ActionStage::Completed results (like errors) for some set duration too.
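A rough sketch of what that wrapper could look like; this is a guess at the shape of the `CompletedAction` type mentioned later in the thread, with assumed field names. Hashing and equality only on the action's identity is what would let it sit in a `HashSet` and be found without scanning a `Vec`:

```rust
use std::hash::{Hash, Hasher};
use std::sync::Arc;
use std::time::SystemTime;

struct ActionState; // placeholder for the scheduler's real state type

/// Completed-action record that stores the completion time on the scheduler
/// side (so we do not have to trust the worker's timestamp) and can also hold
/// non-Completed terminal stages such as errors.
struct CompletedAction {
    completed_time: SystemTime,
    name: String, // stands in for the action's unique qualifier
    state: Arc<ActionState>,
}

impl Hash for CompletedAction {
    fn hash<H: Hasher>(&self, hasher: &mut H) {
        // Only the action identity participates in hashing/equality.
        self.name.hash(hasher);
    }
}

impl PartialEq for CompletedAction {
    fn eq(&self, other: &Self) -> bool {
        self.name == other.name
    }
}

impl Eq for CompletedAction {}
```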
Reviewable status: 16 of 25 files reviewed, 5 unresolved discussions (waiting on @chrisstaite and @chrisstaite-menlo)
cas/scheduler/simple_scheduler.rs
line 252 at r12 (raw file):
fn clean_recently_completed_actions(&mut self) {
    // Keep for a minute maximum.
    let expiry_time = SystemTime::now().checked_sub(Duration::new(60, 0)).unwrap();
Can we move this setting to a config?
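A sketch of the configurable version; `retain_for_s` is an assumed config field name, mirroring the hard-coded 60-second retention shown above:

```rust
use std::time::{Duration, SystemTime};

struct CompletedAction {
    completed_time: SystemTime, // only the field used here is shown
}

/// Keep completed actions for a configurable duration instead of a fixed minute.
fn clean_recently_completed_actions(
    recently_completed_actions: &mut Vec<CompletedAction>,
    retain_for_s: u64,
) {
    let expiry_time = SystemTime::now()
        .checked_sub(Duration::from_secs(retain_for_s))
        .unwrap_or(SystemTime::UNIX_EPOCH);
    recently_completed_actions.retain(|action| action.completed_time > expiry_time);
}
```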
Previously, allada (Nathan (Blaise) Bruer) wrote…
Seemingly?
Reviewable status: 16 of 25 files reviewed, 7 unresolved discussions (waiting on @chrisstaite and @chrisstaite-menlo)
cas/scheduler/simple_scheduler.rs
line 280 at r12 (raw file):
// Clean up the actions list.
self.clean_recently_completed_actions();
let's not do this here; we'll do it similarly to how we deal with remove_timedout_workers().
cas/scheduler/simple_scheduler.rs
line 521 at r12 (raw file):
.push(running_action.action.current_state);
// Clean up any old actions.
self.clean_recently_completed_actions();
let's not do this here; we'll do it similarly to how we deal with remove_timedout_workers().
Previously, allada (Nathan (Blaise) Bruer) wrote…
No, because it's
Reviewed 2 of 8 files at r7, 2 of 5 files at r9, 4 of 5 files at r10, all commit messages.
Reviewable status: 23 of 25 files reviewed, 7 unresolved discussions (waiting on @chrisstaite and @chrisstaite-menlo)
cas/scheduler/simple_scheduler.rs
line 516 at r12 (raw file):
self.active_actions.insert(action_info, running_action);
return Ok(());
} else if send_result.is_err() {
Shouldn't we add the item to recently_completed_actions regardless of whether the result was able to be sent? I.e., there may be multiple clients listening; the first client succeeded, but the second client then calls WaitExecution and fails because the task was multiplexed.
Reviewed 1 of 1 files at r12.
Reviewable status: 24 of 25 files reviewed, 7 unresolved discussions (waiting on @chrisstaite and @chrisstaite-menlo)
Sorry about doing this in phases. It was a pretty difficult one to review. I finished going over everything. The things commented on are the only things left. Before it was: "I can't really review this until the last comment is addressed".
Reviewable status: 24 of 25 files reviewed, 7 unresolved discussions (waiting on @chrisstaite and @chrisstaite-menlo)
Reviewable status: 24 of 25 files reviewed, 7 unresolved discussions (waiting on @chrisstaite and @chrisstaite-menlo)
cas/scheduler/simple_scheduler.rs
line 168 at r12 (raw file):
Previously, chrisstaite-menlo (Chris Staite) wrote…
No, because it's ActionState not ActionInfo :(. I suppose we could update ActionState to also be hashed in the same way as ActionInfo...
Yeah, actually it can. ActionState just needs the same hash function that ActionInfo has. Ironically, ActionState owns the properties that ActionInfo uses to hash.
Force-pushed from 0cbd85d to 6fec564 (compare)
Reviewable status: 16 of 26 files reviewed, 7 unresolved discussions (waiting on @allada and @chrisstaite)
cas/scheduler/simple_scheduler.rs
line 168 at r12 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
Yeah, actually it can. ActionState just needs the same hash function that ActionInfo has. Ironically, ActionState owns the properties that ActionInfo uses to hash.
I actually did this in a different way, since I was wrapping in the completed time anyway, and implemented it for CompletedAction...
cas/scheduler/simple_scheduler.rs
line 252 at r12 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
Can we move this setting to a config?
Done.
cas/scheduler/simple_scheduler.rs
line 255 at r12 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
I think we should wrap ActionStage so we can store the timestamp, because I don't think we should blindly trust workers, and I think we should hold on to non-ActionStage::Completed results (like errors) for some set duration too.
Done.
cas/scheduler/simple_scheduler.rs
line 280 at r12 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
let's not do this here; we'll do it similarly to how we deal with remove_timedout_workers().
Done.
cas/scheduler/simple_scheduler.rs
line 516 at r12 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
Shouldn't we add the item to recently_completed_actions regardless of whether the result was able to be sent? I.e., there may be multiple clients listening; the first client succeeded, but the second client then calls WaitExecution and fails because the task was multiplexed.
Done.
cas/scheduler/simple_scheduler.rs
line 521 at r12 (raw file):
Previously, allada (Nathan (Blaise) Bruer) wrote…
let's not do this here; we'll do it similarly to how we deal with remove_timedout_workers().
Done.
I knew it was going to be a horrid one, because it really should have been more than one PR, but it was a mangled mess!
Goma requires WaitExecution to function correctly. Implement this and store recently completed actions.
Force-pushed from 6fec564 to 436642d (compare)
Woot! This was a tough one.
Reviewed 1 of 3 files at r5, 9 of 9 files at r13, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @chrisstaite and @chrisstaite-menlo)
cas/scheduler/default_scheduler_factory.rs
line 70 at r13 (raw file):
fn start_cleanup_timer(action_scheduler: &Arc<dyn ActionScheduler>) {
    let weak_scheduler = Arc::downgrade(action_scheduler);
    tokio::spawn(async move {
SchedulerManager, so we don't need these spawns to be running all over the place.
No action needed.
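For illustration, the weak-reference cleanup-timer pattern from the snippet above, sketched end to end; the trait here is a stand-in for `ActionScheduler` and the 30-second interval is arbitrary:

```rust
use std::sync::{Arc, Weak};
use std::time::Duration;

/// Placeholder trait; only the method the timer needs is shown.
trait ActionScheduler: Send + Sync + 'static {
    fn clean_recently_completed_actions(&self);
}

/// The spawned task only holds a `Weak`, so it exits once the scheduler itself
/// is dropped instead of keeping it alive forever.
fn start_cleanup_timer(action_scheduler: &Arc<dyn ActionScheduler>) {
    let weak_scheduler: Weak<dyn ActionScheduler> = Arc::downgrade(action_scheduler);
    tokio::spawn(async move {
        let mut interval = tokio::time::interval(Duration::from_secs(30));
        loop {
            interval.tick().await;
            match weak_scheduler.upgrade() {
                Some(scheduler) => scheduler.clean_recently_completed_actions(),
                // Scheduler was dropped; stop the timer task.
                None => return,
            }
        }
    });
}
```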
These functions are used by Goma.
This also resolves the issue of holding on to abandoned queued jobs.
I probably could break this down into a few more PRs, but I'd likely break something along the way.