-
Notifications
You must be signed in to change notification settings - Fork 911
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds update state indexing to MutableState #4224
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please split this into separate PRs?
workflowCtx = workflow.NewContext(c.shard, key, c.logger) | ||
elem, err := c.PutIfNotExist(key, workflowCtx) | ||
newCtx := workflow.NewContext(c.shardCtx, key, c.logger) | ||
if err := c.tryRestore(ctx, c.shardCtx, newCtx); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we do lazy restore only when UpdateRegistry is retrieved from workflow context?
implicitly doing a DB load when locking the workflow concerns me a little bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be only when we fault the workflow.Context into cache, not every time we lock it. The reason I was ok with the load here is that I didn't seem like there would be many (any?) cases where we loaded the context object into cache but then didn't use MutableState in some way. Does that happen?
On the other hand, a lazy restore does allow me to remove that other trick to get avoid breaking all the unit tests. I'll see how it looks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the thing about lazy load is that it makes the Workflow.UpdateRegistry function pretty ugly. If we may need to restore the registry from mutable state at that point, we'll need to take a context.Context as a parameter (not too bad) but we'll also need to return an error in case either the mutable state load fails or the incomplete update events can't be loaded. It's weird to have a function that should just be an accessor that needs to take a context and return an error.
Only way around that I see would be a second implementation of update.Registry that returned an error from every call to indicate a broken state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've remove update restoration from this PR pursuant to @MichaelSnowden's request above that the PR be limited in scope. I will follow up with the same thing later. FWIW, it looks like restoring the UpdateRegistry might be best done in the ConsistencyChecker which is also where we load MutableState for a given workflow.Context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's weird to have a function that should just be an accessor that needs to take a context and return an error.
I don't feel strongly about it actually, especially when comparing to other approaches I can think of.
re. consistency checker, I am not sure that's the only place workflow.Context is created.
(My original concern for adding the logic inside workflow cache is that it changes the rest of the system's view on this component & how the error from it should be handled. Previously it's only for in memory locking, so timeout simply means workflow is busy, and not persistence timeout. But I guess that doesn't matter any more. :) )
This PR now includes only the code required to add the updateRecords field to MutableState and to durably CRUD the same. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please move the resetBlobMap and error-handling changes to a separate PR?
proto/internal/temporal/server/api/persistence/v1/workflow_mutable_state.proto
Outdated
Show resolved
Hide resolved
} | ||
compPtr := rec.GetCompletedPointer() | ||
if compPtr == nil { | ||
// not an error because update was found, but update has not completed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
return event, nil | ||
} | ||
|
||
func (ms *MutableStateImpl) GetAcceptedWorkflowExecutionUpdates( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know there's no caller for this methods so this is more for the next PR.
Do we have any limit on the # of accepted update for a workflow? If no or the limit is large, then I don't think this can work because there can be one history branch read request for each accepted update.
Even it's like 10 accepted update, it means 10 more read when loading mutable state.
I am also curious to know for an accepted update why do we have to load it's accepted event, the updated request is not used from what I can tell. but maybe I missed something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've also been discussing the restoration strategy with Alex. It turns out that there's not single point or place in code where it's optimal to restore the full UpdateRegistry state for a given workflow. The next thing I intend to explore is lazily fetching individual update state from MutableState on a per-update basis as messages are processed (i.e., if there is no existing update for a given ID, peek into MS to see if something can be restored before proceeding.
To answer your first question directly: there is a limit to the total number of entries in this index but it's high (2000 has been suggested as the default).
For the last question - are you suggesting then that rather than storing a pointer into history for updates in the Accepted state, we "inline" that data into the UpdateInfo value so that no history fetch is required? Because I think that could work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we could store the pointer "just in case" but not dereference it here.
A mechanism for storing pointers into the workflow history representing update accepted or completed events. Storage added for Cassandra, PosgtreSQL, MySQL, and SQLite.
Describes what string to use as the key in a map
Note that update accepted and completed events are cached.
No change to functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The behavior of the code itself looks good to me, but I have a few other concerns:
- We have no test coverage for a lot of the queries here
- I think we need to more clearly distinguish the database record update and workflow update concepts in our comments
- There's also just some other nits I have
} | ||
|
||
if _, err := tx.ReplaceIntoUpdateInfoMaps(ctx, rows); err != nil { | ||
return serviceerror.NewUnavailable(fmt.Sprintf("Failed to update update records. Failed to execute update query. Error: %v", err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you specify what updates are being updated to prevent stuttering here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's important to maintain the format shared with with all the other map types (activity info, signal info, &c) so that when this line is eventually grepped out of a log it matches the pattern.
const Version = "1.9" | ||
const Version = "1.10" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the schema version, not the Postgres version, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Schema version, correct.
@@ -1105,6 +1192,18 @@ func resetSignalInfoMap( | |||
return sMap, encoding, nil | |||
} | |||
|
|||
func resetBlobMap[K comparable]( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a reset function? It looks more like it's just mapping all the values to the .Data field
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Named this way because it serves the same purpose as resetTimerInfoMap, resetSignalInfoMap, and resetActivityInfoMap that already exist. But this one is generic over the map key type.
out := make(map[K][]byte, len(m)) | ||
var encoding enumspb.EncodingType | ||
for k, blob := range m { | ||
encoding = blob.EncodingType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this part belongs in this function. It's probably up to the caller to interpret whether values with different encoding types are acceptable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see the 5 other functions that this one is a clone of. This is how it's done and we want the same behavior.
@@ -933,6 +952,7 @@ func (m *executionManagerImpl) toWorkflowMutableState(internState *InternalWorkf | |||
SignalRequestedIds: internState.SignalRequestedIDs, | |||
NextEventId: internState.NextEventID, | |||
BufferedEvents: make([]*historypb.HistoryEvent, len(internState.BufferedEvents)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there's a point in pre-allocating this because we just assign a different slice to it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay but this isn't new? After previous comments I've been keeping the changeset to the minimum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was meant to contrast with the one below. Not an actionable comment
I want to point out that this is in large part a copy/paste exercise and the difficulty is in just finding the right places to make changes. I've tried to address your comments where possible but in many cases the less-than-ideal code is preferred as it conforms to the pre-existing expectation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can understand the desire to maintain conventions with some of these changes, but it would be better to do some judicial preparatory refactoring (fixing the previous issues in an upstream PR), so that we don't perpetuate mistakes.
I also don't buy the argument of continuing to string-substitute hollow comments in order to follow suit. Since there's no risk of behavioral effects, we should add informative documentation, regardless of how heterogeneous it makes the code, since it's likely that the original author, you, will have the best understanding for a long time.
If we add better documentation to the schemas and protos, then this change looks good to me.
shard_id INT NOT NULL, | ||
namespace_id BINARY(16) NOT NULL, | ||
workflow_id VARCHAR(255) NOT NULL, | ||
run_id BINARY(16) NOT NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like there's tabs here and spaces are used elsewhere in this file
@@ -933,6 +952,7 @@ func (m *executionManagerImpl) toWorkflowMutableState(internState *InternalWorkf | |||
SignalRequestedIds: internState.SignalRequestedIDs, | |||
NextEventId: internState.NextEventID, | |||
BufferedEvents: make([]*historypb.HistoryEvent, len(internState.BufferedEvents)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was meant to contrast with the one below. Not an actionable comment
Add UpdateInfo records to MutableState A mechanism for storing pointers into the workflow history representing update accepted or completed events. Storage added for Cassandra, PosgtreSQL, MySQL, and SQLite.
This reverts commit 5783e78.
* Revert "Capture UpdateInfos in MutableState mutation (#4372)" This reverts commit c719eac. * Revert "Adds update state indexing to MutableState (#4224)" This reverts commit 5783e78. * Add UpdateInfo map to WorkflowExecutionInfo This inlines the UpdateInfo index into the pre-existing execution blob so that downstream persistence implementations won't see any change.
What changed?
Adds update state indexing to MutableState and recover of update registry on page-in from data stored in MS.
Why?
Part of update feature for outcome polling. Also, long-running updates are likely to encounter shard reloads and/or workflow cache evictions so it is important to keep and restore in-flight accepted updates into the update registry.
How did you test it?
Unit tests
Potential risks
Is hotfix candidate?
No