-
Notifications
You must be signed in to change notification settings - Fork 911
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix errcheck in service/history/shard #3755
Conversation
@@ -1505,18 +1503,24 @@ func (s *ContextImpl) createEngine() Engine { | |||
|
|||
// start should only be called by the controller. | |||
func (s *ContextImpl) start() { | |||
s.transition(contextRequestAcquire{}) | |||
if err := s.transition(contextRequestAcquire{}); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s.transition() already emits logs if the transition is invalid I believe.
@@ -1436,8 +1437,7 @@ func (s *ContextImpl) handleReadError(err error) error { | |||
case *persistence.ShardOwnershipLostError: | |||
// Shard is stolen, trigger shutdown of history engine. | |||
// Handling of max read level doesn't matter here. | |||
s.transition(contextRequestStop{}) | |||
return err | |||
return multierr.Combine(err, s.transition(contextRequestStop{})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can/should combine the errors here (and two more places below).
- Many places in our code path is still checking error type directly not via errors.As, so those places might break.
- To me, the error from s.transition is internal to the shard context impl and upper layer should not know.
cc @dnr Would you mind also take a look?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this PR doesn't make much sense. I think you should revert these changes.
The error returned from transition is really only meaningful for contextRequestAcquired, the others will always return nil and the result doesn't have to be checked. I know errcheck isn't smart enough to figure that out but we can just manually ignore them.
As Yichao said, transition already logs so callers should not.
And I agree the multierr stuff is not appropriate here.
@@ -1436,8 +1437,7 @@ func (s *ContextImpl) handleReadError(err error) error { | |||
case *persistence.ShardOwnershipLostError: | |||
// Shard is stolen, trigger shutdown of history engine. | |||
// Handling of max read level doesn't matter here. | |||
s.transition(contextRequestStop{}) | |||
return err | |||
return multierr.Combine(err, s.transition(contextRequestStop{})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this PR doesn't make much sense. I think you should revert these changes.
The error returned from transition is really only meaningful for contextRequestAcquired, the others will always return nil and the result doesn't have to be checked. I know errcheck isn't smart enough to figure that out but we can just manually ignore them.
As Yichao said, transition already logs so callers should not.
And I agree the multierr stuff is not appropriate here.
@@ -790,7 +790,9 @@ func (s *controllerSuite) TestShardControllerFuzz() { | |||
shardID := int32(rand.Intn(int(s.config.NumberOfShards))) + 1 | |||
switch rand.Intn(5) { | |||
case 0: | |||
s.shardController.GetShardByID(shardID) | |||
if _, err := s.shardController.GetShardByID(shardID); err != nil { | |||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
returning defeats the purpose of this code, which is to generate load on the shard controller. this shouldn't return an error but if it does the worker should continue to run.
Thanks for spotting these issues. I have a PR to revert these changes here: #3842 |
What changed?
Why?
How did you test it?
Potential risks
Is hotfix candidate?