Implement the WaitExecution #177

Merged

Conversation

@chrisstaite-menlo (Collaborator) commented Jul 14, 2023:

These functions are used by Goma.

This also resolves the issue of holding on to abandoned queued jobs.

I probably could break this down into a few more PRs, but I'd likely break something along the way.



@chrisstaite-menlo chrisstaite-menlo force-pushed the wait_execution branch 2 times, most recently from 9ad6c6d to c91aaa5 Compare July 14, 2023 16:19
@chrisstaite-menlo (Collaborator, Author) commented:

cas/grpc_service/execution_server.rs line 216 at r3 (raw file):

    }

    async fn lookup_action<'a, Fut, F, G, R, S>(&'a self, operation: F, mut filter: G) -> Result<S, Status>

If you think it's better to copy this code into the below two methods rather than this abomination of generics, let me know. However, I kinda just wanted to do it to see if I could...

@allada (Member) left a comment:

Reviewed 1 of 8 files at r1.
Reviewable status: 1 of 8 files reviewed, 2 unresolved discussions (waiting on @chrisstaite-menlo)


cas/grpc_service/execution_server.rs line 166 at r3 (raw file):

    }

    pub fn into_services(self) -> (Server<ExecutionServer>, OperationsServer<ExecutionServer>) {

At first I was thinking this should be a separate service and not share with ExecutionServer. Then I was like... "Nah, it makes sense to put them together"... Then I started reading the API and now I think they should be separate.

ExecutionServer only serves execution requests. OperationsServer is designed to serve other kinds of operations too, like "WaitForWorkerToComeOnline" or "DrainAndShutdown".

Because of this, and because of unlikely-but-possible additional features, let's break this into its own grpc_service. All our operations will happen on the scheduler anyway, so interacting should be easy, since the scheduler can be referenced in both services.

@allada (Member) left a comment:

Reviewable status: 1 of 8 files reviewed, 3 unresolved discussions (waiting on @chrisstaite-menlo)


cas/grpc_service/execution_server.rs line 224 at r3 (raw file):

        // Very inefficient having to search through all instances, however
        // the protobuf spec doesn't state that the instance_name can be
        // retrieved from the path.

Actually it can be retrieved from the path if it matches as outlined here:
https://github.com/allada/turbo-cache/blob/master/proto/build/bazel/remote/execution/v2/remote_execution.proto#L124

@allada (Member) left a comment:

Reviewable status: 1 of 8 files reviewed, 3 unresolved discussions (waiting on @chrisstaite-menlo)


cas/grpc_service/execution_server.rs line 224 at r3 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Actually it can be retrieved from the path if it matches as outlined here:
https://github.com/allada/turbo-cache/blob/master/proto/build/bazel/remote/execution/v2/remote_execution.proto#L124

Wait, no... We choose the name, so we choose the filter. In our case we can require any kind of filter format we want.

@allada (Member) left a comment:

Reviewed all commit messages.
Reviewable status: 1 of 8 files reviewed, 5 unresolved discussions (waiting on @chrisstaite-menlo)


cas/grpc_service/execution_server.rs line 216 at r3 (raw file):

    }

    async fn lookup_action<'a, Fut, F, G, R, S>(&'a self, operation: F, mut filter: G) -> Result<S, Status>

nit: Let's rename this to filter_actions() and add a limit argument. This will make it trivial to implement ListOperations() in the future.


cas/grpc_service/execution_server.rs line 225 at r3 (raw file):

        // the protobuf spec doesn't state that the instance_name can be
        // retrieved from the path.
        self.instance_infos

nit: This is super confusing; the Stream API makes it hard to understand what is happening. Can we expand this into a while let Some(item) = stream.next().await { ... } loop?
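For illustration, a minimal sketch of that expansion on a toy stream (the item type and function names are hypothetical, not the ones in execution_server.rs); chained combinators become an explicit loop, which makes the early-return control flow obvious:

```rust
use futures::stream::{self, StreamExt};

// Toy example: find the first even value in an async stream using the
// explicit while-let form instead of chained Stream combinators.
async fn first_even(values: Vec<u32>) -> Option<u32> {
    let mut stream = stream::iter(values);
    while let Some(item) = stream.next().await {
        if item % 2 == 0 {
            // First match wins; the remaining items are never polled.
            return Some(item);
        }
    }
    None
}

#[tokio::main]
async fn main() {
    assert_eq!(first_even(vec![1, 3, 4, 5]).await, Some(4));
}
```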

@allada (Member) left a comment:

Reviewable status: 1 of 8 files reviewed, 6 unresolved discussions (waiting on @chrisstaite-menlo)

a discussion (no related file):

Give me some time on this, I'm going to hack up your branch to get a better feel for it. There's a lot going on and I want to see how it goes.

Right now I'm leaning towards the operation service being completely separate from the execution service. Requiring a configuration and all.


@allada (Member) left a comment:

Reviewable status: 1 of 8 files reviewed, 5 unresolved discussions (waiting on @chrisstaite-menlo)


cas/grpc_service/execution_server.rs line 166 at r3 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

At first I was thinking this should be a separate service and not share with ExecutionServer. Then I was like... "Nah, it makes sense to put them together"... Then I started reading the API and now I think they should be separate.

ExecutionServer only serves execution requests. OperationsServer is designed to serve other kinds of operations too, like "WaitForWorkerToComeOnline" or "DrainAndShutdown".

Because of this, and because of unlikely-but-possible additional features, let's break this into its own grpc_service. All our operations will happen on the scheduler anyway, so interacting should be easy, since the scheduler can be referenced in both services.

Disregard. I experimented with this idea for a few hours and came to the conclusion that we should not try to recycle Operation for anything but BRE-required stuff. It is a decent interface, but tonic doesn't really support HTTP1 + JSON endpoints yet. It's close, but I did some testing with tonic-web and it didn't really work for what I wanted, since it just encodes everything to a proto and requires input as a proto.

@allada (Member) left a comment:

Reviewable status: 1 of 8 files reviewed, 7 unresolved discussions (waiting on @chrisstaite-menlo)


cas/cas_main.rs line 217 at r3 (raw file):

                    .map_or(Ok(None), |cfg| {
                        ExecutionServer::new(&cfg, &action_schedulers, &store_manager).map(|v| {
                            let (mut execution_service, mut operations_service) = v.into_services();

I tried really, really hard to make this interface not suck, but I couldn't figure out a way to abstract this out without making a new config, which would be pointless for us.


cas/grpc_service/execution_server.rs line 241 at r3 (raw file):

        let name = request.into_inner().name;
        self.lookup_action(
            |instance_info| instance_info.scheduler.find_existing_action(&name),

Since this file is the gateway to all things execution related, let's rewrite all output Operations so the operation name is prefixed with the instance name. Then just look up which scheduler it should be on here.

@allada (Member) left a comment:

Reviewable status: 1 of 8 files reviewed, 8 unresolved discussions (waiting on @chrisstaite-menlo)


cas/grpc_service/execution_server.rs line 237 at r3 (raw file):

    async fn inner_wait_execution(
        &self,
        request: Request<WaitExecutionRequest>,

nit: Let's just pass name in as a string here; then this function is 100% compatible with wait_operation().

@allada (Member) left a comment:

Reviewable status: 1 of 8 files reviewed, 9 unresolved discussions (waiting on @chrisstaite-menlo)


cas/grpc_service/execution_server.rs line 308 at r3 (raw file):

    }

    async fn cancel_operation(&self, request: Request<CancelOperationRequest>) -> Result<Response<()>, Status> {

I'm really struggling with this one. Do we need/want to implement cancel_operation? I understand removing delete_operation, since that allows the scheduler to stop the action if it wants. But from what I can tell in the documentation, cancel_operation pretty much says "Kill it, don't ask questions" and gives little flexibility.

The reason I'm worried here is that the scheduler multiplexes the actions, meaning that if two requests from two different clients come in for the same action, it will only execute the action one time and send the results to both clients. If we implement cancel_operation, we are allowing any client to terminate another client's action.

If the argument is made that that's an implementation detail, that the client should not be able to see other clients' jobs, and that it should only terminate the job if no one else is waiting on it, I feel that's more what DeleteOperation is for. But then again, DeleteOperation might be more about actions that change external state (like uploading files, installing new software, etc.), which is why the client would not care about the results.

@chrisstaite-menlo (Collaborator, Author) commented:

Previously, allada (Nathan (Blaise) Bruer) wrote…

Give me some time on this, I'm going to hack up your branch to get a better feel for it. There's a lot going on and I want to see how it goes.

Right now I'm leaning towards the operation service being completely separate from the execution service. Requiring a configuration and all.

I think I'm going to drop the operations service. It was mostly for the use case where someone changes a header that's used in a lot of files, realizes it's bad, and cancels the build job, but we still execute a bunch of jobs that are going to fail. However, I don't think it really gains us all that much, so I'm going to drop it; we can bring it in on a separate branch later if we want to.

@chrisstaite-menlo (Collaborator, Author) left a comment:

Reviewable status: 0 of 8 files reviewed, 5 unresolved discussions (waiting on @allada)


cas/grpc_service/execution_server.rs line 224 at r3 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Wait, no... We choose the name, so we choose the filter. In our case we can require any kind of filter format we want.

Think that falls down with the GrpcScheduler, right?


cas/grpc_service/execution_server.rs line 237 at r3 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: Let's just pass name in as a string here; then this function is 100% compatible with wait_operation().

Got rid of the operations server, so probably not worth doing.


cas/grpc_service/execution_server.rs line 241 at r3 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Since this file is the gateway to all things execution related, let's rewrite all output Operations so the operation name is prefixed with the instance name. Then just look up which scheduler it should be on here.

Got rid of the operations server...


cas/grpc_service/execution_server.rs line 308 at r3 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

I'm really struggling with this one. Do we need/want to implement cancel_operation? I understand removing delete_operation, since that allows the scheduler to stop the action if it wants. But from what I can tell in the documentation, cancel_operation pretty much says "Kill it, don't ask questions" and gives little flexibility.

The reason I'm worried here is that the scheduler multiplexes the actions, meaning that if two requests from two different clients come in for the same action, it will only execute the action one time and send the results to both clients. If we implement cancel_operation, we are allowing any client to terminate another client's action.

If the argument is made that that's an implementation detail, that the client should not be able to see other clients' jobs, and that it should only terminate the job if no one else is waiting on it, I feel that's more what DeleteOperation is for. But then again, DeleteOperation might be more about actions that change external state (like uploading files, installing new software, etc.), which is why the client would not care about the results.

Done.

@chrisstaite-menlo chrisstaite-menlo force-pushed the wait_execution branch 3 times, most recently from 6b5f80f to e6ff75b Compare July 15, 2023 13:04
@allada (Member) left a comment:

Reviewed 1 of 9 files at r4.
Reviewable status: 1 of 9 files reviewed, 6 unresolved discussions (waiting on @chrisstaite-menlo)


-- commits line 1 at r4:
I think something went wrong in your rebase.

@chrisstaite-menlo chrisstaite-menlo force-pushed the wait_execution branch 2 times, most recently from 04b492a to 2c81ef1 Compare July 15, 2023 19:22
@chrisstaite-menlo (Collaborator, Author) commented:

-- commits line 1 at r4:

Previously, allada (Nathan (Blaise) Bruer) wrote…

I think something went wrong in your rebase.

Whoops....

@allada (Member) left a comment:

Reviewable status: 1 of 10 files reviewed, 3 unresolved discussions (waiting on @chrisstaite-menlo)

a discussion (no related file):

Previously, chrisstaite-menlo (Chris Staite) wrote…

I think I'm going to drop the operations service. It was mostly for the use case where someone changes a header that's used in a lot of files, realizes it's bad, and cancels the build job, but we still execute a bunch of jobs that are going to fail. However, I don't think it really gains us all that much, so I'm going to drop it; we can bring it in on a separate branch later if we want to.

Yeah, actually this brings up a feature I was planning on doing but haven't yet... a config that times out jobs. This is common for these kinds of services: after some configured duration, let's say 60 seconds after all clients have disconnected from a job, it will terminate the job.

To do this we'll need to wrap add_action()'s watch::Receiver<Arc<ActionState>> in a wrapper that, when dropped, will notify the scheduler that a client disconnected, so it can start a timer if that was the last connected client.
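A minimal sketch of that wrapper idea, assuming a hypothetical disconnect channel into the scheduler (DisconnectMsg, the field names, and the String stand-in for ActionState are all illustrative, not the real API):

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, watch};

// Hypothetical message the wrapper sends to the scheduler on client drop.
struct DisconnectMsg {
    action_id: String,
}

// Wraps the watch::Receiver handed out by add_action(). When the client
// drops it, the scheduler is notified so it can start its disconnect timer
// if this was the last connected client.
struct ClientStateReceiver {
    inner: watch::Receiver<Arc<String>>, // stand-in for Arc<ActionState>
    action_id: String,
    on_drop: mpsc::UnboundedSender<DisconnectMsg>,
}

impl ClientStateReceiver {
    async fn changed(&mut self) -> Result<(), watch::error::RecvError> {
        self.inner.changed().await
    }
}

impl Drop for ClientStateReceiver {
    fn drop(&mut self) {
        // Ignore send errors: the scheduler may already be shut down.
        let _ = self.on_drop.send(DisconnectMsg {
            action_id: self.action_id.clone(),
        });
    }
}
```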



cas/grpc_service/execution_server.rs line 224 at r3 (raw file):

Previously, chrisstaite-menlo (Chris Staite) wrote…

Think that falls down with the GrpcScheduler, right?

I don't think so. If this is a problem in GrpcScheduler it'll just need to do an instance_name -> instance_name mapping.

We need to make the name of the operation contain the instance name. Something like:

{instance_name}/{action_info_hash}
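A tiny sketch of that naming scheme (the exact format and helper names are assumptions):

```rust
// Compose operation names of the form "{instance_name}/{action_info_hash}".
fn make_operation_name(instance_name: &str, action_info_hash: &str) -> String {
    format!("{instance_name}/{action_info_hash}")
}

// The instance name may itself contain '/', so split on the last one.
fn parse_operation_name(name: &str) -> Option<(&str, &str)> {
    name.rsplit_once('/')
}

fn main() {
    let name = make_operation_name("main/ci", "deadbeef");
    assert_eq!(parse_operation_name(&name), Some(("main/ci", "deadbeef")));
}
```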

@allada (Member) left a comment:

Reviewable status: 1 of 10 files reviewed, 2 unresolved discussions (waiting on @chrisstaite-menlo)


cas/grpc_service/execution_server.rs line 224 at r3 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

I don't think so. If this is a problem in GrpcScheduler it'll just need to do an instance_name -> instance_name mapping.

We need to make the name of the operation contain the instance name. Something like:

{instance_name}/{action_info_hash}

I think #186 will solve much of this.

@chrisstaite-menlo chrisstaite-menlo changed the title Implement the WaitExecution and CancelOperation Implement the WaitExecution Jul 16, 2023
@chrisstaite-menlo (Collaborator, Author) left a comment:

Reviewable status: 1 of 21 files reviewed, 2 unresolved discussions (waiting on @allada)

a discussion (no related file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Yeah, actually this brings up a feature I was planning on doing but haven't yet... a config that times out jobs. This is common for these kinds of services: after some configured duration, let's say 60 seconds after all clients have disconnected from a job, it will terminate the job.

To do this we'll need to wrap add_action()'s watch::Receiver<Arc<ActionState>> in a wrapper that, when dropped, will notify the scheduler that a client disconnected, so it can start a timer if that was the last connected client.

It will also need to be cancelled on a new subscribe().



cas/grpc_service/execution_server.rs line 224 at r3 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

I think #186 will solve much of this.

Done.

@allada (Member) left a comment:

Reviewable status: 1 of 21 files reviewed, 2 unresolved discussions (waiting on @chrisstaite-menlo)

a discussion (no related file):

Previously, chrisstaite-menlo (Chris Staite) wrote…

It will also need to be cancelled on a new subscribe().

Does it? If we ever get to the point where a user wants to subscribe() and the task happens to be running, we always feed the results of the running job.


@chrisstaite-menlo (Collaborator, Author) commented:

Previously, allada (Nathan (Blaise) Bruer) wrote…

Does it? If we ever get to the point where a user wants to subscribe() and the task happens to be running, we always feed the results of the running job.

This situation:

1. User starts a job and disconnects.
2. Timer starts because there are no connections.
3. User uses WaitExecution to subscribe.
4. Timer fires and kills the job being listened to.

@chrisstaite-menlo (Collaborator, Author) left a comment:

Dismissed @allada from a discussion.
Reviewable status: 18 of 21 files reviewed, 8 unresolved discussions (waiting on @allada and @chrisstaite)


cas/grpc_service/execution_server.rs line 213 at r8 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: &request.into_inner().name looks better.

the trait `From<&String>` is not implemented for `ActionInfoHashKey`


cas/grpc_service/execution_server.rs line 219 at r8 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

I wonder if it'd be a good idea to try to find the item in the action cache before rejecting it?

I don't think the spec says anything about this, but I think it makes sense. From looking at the code, we'd need to modify the scheduler so that before it fully drops the ActionResult, it uploads it to the CAS (not the AC).

No need to do this now, but if you don't want to do it, let's make a TODO. The cool part is that it'll allow users to run an action, go away, and come back later to get the results (assuming the timeout is properly permissive).

Isn't the idea that if you do that you should just look it up in the Action Cache yourself?


cas/scheduler/cache_lookup_scheduler.rs line 158 at r8 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

There's a bit of a race condition here, because at this point we can have two clients trying to add the same unique_qualifier. This would cause some very bizarre things to happen. We should do a cache lookup here first and feed the existing item if it exists, within the same lock.

Done.


cas/scheduler/cache_lookup_scheduler.rs line 181 at r8 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Hmmm, minor bug here: we only notify the downstream scheduler that the client has disconnected when the worker next sends a notification to the client, tx.send(action_state).is_err() triggers an error, and the Sender is dropped; only then is the downstream scheduler able to know the client disconnected.

This currently is not a big problem, since it only costs a small amount of needless memory, but if/when we implement the ability for the scheduler to time out tasks because no listeners are active, it'll become an issue.

This can be fixed by using a select!{} over tx.closed() and watch_stream.next(). Not sure if we should do this in this PR, but I wanted to make sure it was known.

This is a good example of why I think we should try to avoid using spawns as much as possible. Granted, here we need a spawn (unless we change function signatures), but otherwise this would auto-propagate everything down, because the future would get dropped, which would call Drop on everything downstream.
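For reference, a minimal sketch of that select! pattern with stand-in channel types (the real types in cache_lookup_scheduler.rs differ):

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, watch};
use tokio_stream::{wrappers::WatchStream, StreamExt};

// Forward watch updates to the client, but exit as soon as the client's
// Receiver is dropped (tx.closed() resolves), so the downstream scheduler
// learns about the disconnect immediately instead of on the next send.
async fn forward_updates(tx: mpsc::Sender<Arc<String>>, rx: watch::Receiver<Arc<String>>) {
    let mut watch_stream = WatchStream::new(rx);
    loop {
        tokio::select! {
            _ = tx.closed() => break, // client disconnected
            maybe_state = watch_stream.next() => match maybe_state {
                Some(state) => {
                    if tx.send(state).await.is_err() {
                        break; // client dropped between poll and send
                    }
                }
                None => break, // watch sender side dropped
            },
        }
    }
}
```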

Done.


cas/scheduler/cache_lookup_scheduler.rs line 208 at r8 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

I created an upstream ticket for this:
tokio-rs/tokio#5871

Can we make a TODO referencing this ticket?

Done.


cas/scheduler/grpc_scheduler.rs line 42 at r8 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: Looks unrelated (possibly future PR?).

This made more sense when I was adding the OperationsClient too; now that I've taken that out it makes a little less sense, but I think it's worth adding because I believe the balance_list will retry a failed connection, whereas the existing code won't. Happy to move this out to another PR, but it's a bit mixed in now...


cas/scheduler/simple_scheduler.rs line 166 at r8 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Can we do this part of the change in a followup PR? I want to think about this one a bit because it plays into both debugging and caching. Also, I'm not sure if the scheduler should do this here or if maybe some kind of logging service might be better to solve this for the debugging case.

We also can probably solve the non-debug use case by doing what is commented on above (store the ActionResult in CAS). I am pretty sure buildbarn does this.

For fast build actions this is required for Goma; otherwise the action disappears between it calling add_action and calling wait_execution. That's why I had to implement it. We could pull it out, but both are necessary for this code to work, so I don't think it makes much sense to.

@chrisstaite-menlo (Collaborator, Author) commented:

cas/scheduler/simple_scheduler.rs line 283 at r8 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Ugh, these values are already indexed by the ActionInfoHashKey, but Rust's HashMap does not support getting them by this value without rust-lang/rust#56167.

It does look like we can use hashbrown to solve it though, because it uses Equivalent<K> for lookups instead of forcing K: Borrow<Q>, but it's untested:
https://docs.rs/hashbrown/latest/hashbrown/hash_map/struct.HashMap.html#method.get

Looking at the source of HashSet it calls self.base.get where base is hashbrown...
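For reference, a self-contained sketch of the Equivalent-based lookup (the types are simplified stand-ins for the real ones in cas/scheduler):

```rust
use hashbrown::{Equivalent, HashMap};
use std::hash::{Hash, Hasher};

#[derive(PartialEq, Eq, Hash, Clone)]
struct ActionInfoHashKey(u64);

struct ActionInfo {
    key: ActionInfoHashKey,
    priority: u32,
}

// ActionInfo must hash exactly like its key so the lookup lands in the
// same bucket.
impl Hash for ActionInfo {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.key.hash(state);
    }
}
impl PartialEq for ActionInfo {
    fn eq(&self, other: &Self) -> bool {
        self.key == other.key
    }
}
impl Eq for ActionInfo {}

// The part std's HashMap can't express: lookup by a type that is merely
// Equivalent to the key, with no Borrow impl required.
impl Equivalent<ActionInfo> for ActionInfoHashKey {
    fn equivalent(&self, info: &ActionInfo) -> bool {
        *self == info.key
    }
}

fn main() {
    let mut map: HashMap<ActionInfo, &str> = HashMap::new();
    map.insert(
        ActionInfo { key: ActionInfoHashKey(7), priority: 1 },
        "running",
    );
    assert_eq!(map.get(&ActionInfoHashKey(7)), Some(&"running"));
}
```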

@allada (Member) left a comment:

Reviewable status: 14 of 22 files reviewed, 8 unresolved discussions (waiting on @chrisstaite and @chrisstaite-menlo)


cas/grpc_service/execution_server.rs line 213 at r8 (raw file):

Previously, chrisstaite-menlo (Chris Staite) wrote…

the trait `From<&String>` is not implemented for `ActionInfoHashKey`

blah. Probably needs Borrow<&T> magic in the TryFrom trait.


cas/grpc_service/execution_server.rs line 219 at r8 (raw file):

Previously, chrisstaite-menlo (Chris Staite) wrote…

Isn't the idea that if you do that you should just look it up in the Action Cache yourself?

Yes and no. If caching were enabled 100% of the time, yes, but in our case we can send requests that will not be saved in the action cache. If the no_cache flag is sent on the initial request, by design the ActionInfoHashKey will be different for every request; but if no_cache is not set, then it may use the same ActionInfoHashKey and thus the same Operation Id.

Let's not get caught up on this in this PR; if possible, let's split this part off into a different PR. I will look into how other projects do this. I do remember buildbarn having a flag somewhere saying something like: store action results for non-cachable items in the CAS. That breaks the "HASH(value) -> value" guarantee, though.

@chrisstaite-menlo chrisstaite-menlo force-pushed the wait_execution branch 2 times, most recently from f34ef3e to facbb06 Compare July 16, 2023 19:35
@allada (Member) left a comment:

Reviewed 3 of 5 files at r9.
Reviewable status: 16 of 25 files reviewed, 5 unresolved discussions (waiting on @chrisstaite and @chrisstaite-menlo)


cas/scheduler/grpc_scheduler.rs line 42 at r8 (raw file):

Previously, chrisstaite-menlo (Chris Staite) wrote…

This made more sense when I was adding the OperationsClient too; now that I've taken that out it makes a little less sense, but I think it's worth adding because I believe the balance_list will retry a failed connection, whereas the existing code won't. Happy to move this out to another PR, but it's a bit mixed in now...

Yeah, I agree we should use balance_list, but we need to modify the config file too, which is technically a breaking change, so it needs to be documented in the commit title.

@chrisstaite-menlo (Collaborator, Author) commented:

cas/scheduler/grpc_scheduler.rs line 42 at r8 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Yeah, I agree we should use balance_list, but we need to modify the config file too, which is technically a breaking change, so it needs to be documented in the commit title.

Not at all, I just add a single endpoint to the balance_list as shown here...

@allada (Member) left a comment:

Reviewable status: 16 of 25 files reviewed, 5 unresolved discussions (waiting on @chrisstaite and @chrisstaite-menlo)


cas/scheduler/simple_scheduler.rs line 166 at r8 (raw file):

Previously, chrisstaite-menlo (Chris Staite) wrote…

For fast build actions this is required for Goma; otherwise the action disappears between it calling add_action and calling wait_execution. That's why I had to implement it. We could pull it out, but both are necessary for this code to work, so I don't think it makes much sense to.

I don't want to block this PR on this, especially since you are able to get Goma working. So I created a ticket:
#192

Can you just make a note in the code referencing this ticket here?

Something like:

// TODO(#192) Revisit if this is the best way to handle recently completed actions.

@allada (Member) left a comment:

Reviewable status: 16 of 25 files reviewed, 5 unresolved discussions (waiting on @chrisstaite and @chrisstaite-menlo)


cas/scheduler/simple_scheduler.rs line 283 at r8 (raw file):

Previously, chrisstaite-menlo (Chris Staite) wrote…

Looking at the source of HashSet it calls self.base.get where base is hashbrown...

Can we add a comment here with something like:

// TODO: We can lookup the action without needing to iterate the maps but there's no API for it.
// Maybe when: https://github.com/rust-lang/rust/issues/56167 is done?

@chrisstaite-menlo (Collaborator, Author) commented:

cas/grpc_service/execution_server.rs line 219 at r8 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Yes and no. If caching were enabled 100% of the time, yes, but in our case we can send requests that will not be saved in the action cache. If the no_cache flag is sent on the initial request, by design the ActionInfoHashKey will be different for every request; but if no_cache is not set, then it may use the same ActionInfoHashKey and thus the same Operation Id.

Let's not get caught up on this in this PR; if possible, let's split this part off into a different PR. I will look into how other projects do this. I do remember buildbarn having a flag somewhere saying something like: store action results for non-cachable items in the CAS. That breaks the "HASH(value) -> value" guarantee, though.

Worth a new issue?

@chrisstaite-menlo (Collaborator, Author) left a comment:

Reviewable status: 16 of 25 files reviewed, 4 unresolved discussions (waiting on @allada and @chrisstaite)


cas/grpc_service/execution_server.rs line 213 at r8 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

blah. Probably needs Borrow<&T> magic in the TryFrom trait.

Did some messing around and Googling and found there's no nice way to do this...


cas/scheduler/simple_scheduler.rs line 166 at r8 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

I don't want to block this PR on this, especially since you are able to get Goma working. So I created a ticket:
#192

Can you just make a note in the code referencing this ticket here?

Something like:

// TODO(#192) Revisit if this is the best way to handle recently completed actions.

Done.

@chrisstaite-menlo (Collaborator, Author) commented:

cas/scheduler/simple_scheduler.rs line 283 at r8 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Can we add a comment here with something like:

// TODO: We can lookup the action without needing to iterate the maps but there's no API for it.
// Maybe when: https://github.com/rust-lang/rust/issues/56167 is done?

I already swapped to hashbrown. Do you think I should put it back with an issue?

@allada (Member) left a comment:

Reviewable status: 16 of 25 files reviewed, 4 unresolved discussions (waiting on @chrisstaite and @chrisstaite-menlo)


cas/scheduler/simple_scheduler.rs line 283 at r8 (raw file):

Previously, chrisstaite-menlo (Chris Staite) wrote…

I already swapped to hashbrown. Do you think I should put it back with an issue?

Wait, it was really that easy!?!?!


cas/scheduler/simple_scheduler.rs line 168 at r12 (raw file):

    // keep their completion state around for a while to send back.
    // TODO(#192) Revisit if this is the best way to handle recently completed actions.
    recently_completed_actions: Vec<Arc<ActionState>>,

Can't this be a HashSet now so we can just grab it without iterating the vector?


cas/scheduler/simple_scheduler.rs line 255 at r12 (raw file):

        self.recently_completed_actions.retain(|action| match &action.stage {
            ActionStage::Completed(state) => {
                if state.execution_metadata.worker_completed_timestamp > expiry_time {

I think we should wrap ActionStage so we can store the timestamp ourselves, because I don't think we should blindly trust workers, and I think we should hold on to non-ActionStage::Completed results (like errors) for some set duration too.
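One hedged sketch of what that wrapping could look like (names are illustrative; the actual CompletedAction in this PR may differ):

```rust
use std::sync::Arc;
use std::time::{Duration, SystemTime};

struct ActionState; // stand-in for the real ActionState

// Wrap the completed state with a timestamp recorded by the scheduler
// itself, rather than trusting the worker's execution_metadata clock.
struct CompletedAction {
    completed_time: SystemTime, // set when the scheduler saw the completion
    state: Arc<ActionState>,
}

// Expire by the scheduler-side timestamp, keeping errors and successful
// completions alike for the configured duration.
fn clean_recently_completed_actions(actions: &mut Vec<CompletedAction>, keep_for: Duration) {
    let expiry_time = SystemTime::now().checked_sub(keep_for).unwrap();
    actions.retain(|action| action.completed_time > expiry_time);
}
```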

@allada (Member) left a comment:

Reviewable status: 16 of 25 files reviewed, 5 unresolved discussions (waiting on @chrisstaite and @chrisstaite-menlo)


cas/scheduler/simple_scheduler.rs line 252 at r12 (raw file):

    fn clean_recently_completed_actions(&mut self) {
        // Keep for a minute maximum.
        let expiry_time = SystemTime::now().checked_sub(Duration::new(60, 0)).unwrap();

Can we move this setting to a config?

@chrisstaite-menlo (Collaborator, Author) commented:

cas/scheduler/simple_scheduler.rs line 283 at r8 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Wait, it was really that easy!?!?!

Seemingly?

@allada (Member) left a comment:

Reviewable status: 16 of 25 files reviewed, 7 unresolved discussions (waiting on @chrisstaite and @chrisstaite-menlo)


cas/scheduler/simple_scheduler.rs line 280 at r12 (raw file):

        // Clean up the actions list.
        self.clean_recently_completed_actions();

Let's not do this here; we'll do it similarly to how we deal with remove_timedout_workers().


cas/scheduler/simple_scheduler.rs line 521 at r12 (raw file):

                .push(running_action.action.current_state);
            // Clean up any old actions.
            self.clean_recently_completed_actions();

Let's not do this here; we'll do it similarly to how we deal with remove_timedout_workers().

@chrisstaite-menlo (Collaborator, Author) commented:

cas/scheduler/simple_scheduler.rs line 168 at r12 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Can't this be a HashSet now so we can just grab it without iterating the vector?

No, because it's ActionState not ActionInfo :(. I suppose we could update ActionState to also be hashed in the same way as ActionInfo...

@allada (Member) left a comment:

Reviewed 2 of 8 files at r7, 2 of 5 files at r9, 4 of 5 files at r10, all commit messages.
Reviewable status: 23 of 25 files reviewed, 7 unresolved discussions (waiting on @chrisstaite and @chrisstaite-menlo)


cas/scheduler/simple_scheduler.rs line 516 at r12 (raw file):

            self.active_actions.insert(action_info, running_action);
            return Ok(());
        } else if send_result.is_err() {

Shouldn't we add the item to recently_completed_actions regardless of whether the result was able to be sent? I.e., there may be multiple clients listening; the first client succeeded, but a second client then calls WaitExecution and fails because the task was multiplexed.

@allada (Member) left a comment:

Reviewed 1 of 1 files at r12.
Reviewable status: 24 of 25 files reviewed, 7 unresolved discussions (waiting on @chrisstaite and @chrisstaite-menlo)

@allada (Member) left a comment:

Sorry about doing this in phases. It was a pretty difficult one to review. I finished going over everything. The things commented on are the only things left. Before it was: "I can't really review this until the last comment is addressed".

Reviewable status: 24 of 25 files reviewed, 7 unresolved discussions (waiting on @chrisstaite and @chrisstaite-menlo)

@allada (Member) left a comment:

Reviewable status: 24 of 25 files reviewed, 7 unresolved discussions (waiting on @chrisstaite and @chrisstaite-menlo)


cas/scheduler/simple_scheduler.rs line 168 at r12 (raw file):

Previously, chrisstaite-menlo (Chris Staite) wrote…

No, because it's ActionState not ActionInfo :(. I suppose we could update ActionState to also be hashed in the same way as ActionInfo...

Yeah, actually it can. ActionState just needs the same hash function that ActionInfo has. Ironically, ActionState owns the properties that ActionInfo uses to hash.

@chrisstaite-menlo (Collaborator, Author) left a comment:

Reviewable status: 16 of 26 files reviewed, 7 unresolved discussions (waiting on @allada and @chrisstaite)


cas/scheduler/simple_scheduler.rs line 168 at r12 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Yeah, actually it can. ActionState just needs the same hash function that ActionInfo has. Ironically, ActionState owns the properties that ActionInfo uses to hash.

I actually did this in a different way: since I was wrapping in the completed time anyway, I implemented it for CompletedAction...


cas/scheduler/simple_scheduler.rs line 252 at r12 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Can we move this setting to a config?

Done.


cas/scheduler/simple_scheduler.rs line 255 at r12 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

I think we should wrap ActionStage so we can store the timestamp ourselves, because I don't think we should blindly trust workers, and I think we should hold on to non-ActionStage::Completed results (like errors) for some set duration too.

Done.


cas/scheduler/simple_scheduler.rs line 280 at r12 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Let's not do this here; we'll do it similarly to how we deal with remove_timedout_workers().

Done.


cas/scheduler/simple_scheduler.rs line 516 at r12 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Shouldn't we add the item to recently_completed_actions regardless of whether the result was able to be sent? I.e., there may be multiple clients listening; the first client succeeded, but a second client then calls WaitExecution and fails because the task was multiplexed.

Done.


cas/scheduler/simple_scheduler.rs line 521 at r12 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Let's not do this here; we'll do it similarly to how we deal with remove_timedout_workers().

Done.

@chrisstaite-menlo (Collaborator, Author) commented:

I knew it was going to be a horrid one, because it really should have been more than one PR, but it was a mangled mess!

Goma requires WaitExecution to function correctly.  Implement this and
store recently completed actions.
@allada (Member) left a comment:

Woot! This was a tough one.

:lgtm_strong:

Reviewed 1 of 3 files at r5, 9 of 9 files at r13, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @chrisstaite and @chrisstaite-menlo)


cas/scheduler/default_scheduler_factory.rs line 70 at r13 (raw file):

fn start_cleanup_timer(action_scheduler: &Arc<dyn ActionScheduler>) {
    let weak_scheduler = Arc::downgrade(action_scheduler);
    tokio::spawn(async move {

☹️ It looks like it's time for a SchedulerManager so we don't need these spawns to be running all over the place.

No action needed.
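For context, a sketch of how such a weak-handle cleanup loop typically completes the excerpt above (the interval, trait shape, and method name are assumptions):

```rust
use std::sync::Arc;
use std::time::Duration;

trait ActionScheduler: Send + Sync + 'static {
    fn clean_recently_completed_actions(&self);
}

// The spawned task holds only a Weak handle, so the loop (and the task)
// ends once the scheduler itself is dropped.
fn start_cleanup_timer(action_scheduler: &Arc<dyn ActionScheduler>) {
    let weak_scheduler = Arc::downgrade(action_scheduler);
    tokio::spawn(async move {
        loop {
            tokio::time::sleep(Duration::from_secs(30)).await;
            match weak_scheduler.upgrade() {
                Some(scheduler) => scheduler.clean_recently_completed_actions(),
                None => return, // scheduler dropped; end the task
            }
        }
    });
}
```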

@chrisstaite-menlo chrisstaite-menlo merged commit 857bf40 into TraceMachina:master Jul 17, 2023
@chrisstaite-menlo chrisstaite-menlo deleted the wait_execution branch September 8, 2023 08:05