-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Repair invalid ModelRecord lastUsed values in registry #36
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: njhill The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Motivation Some ModelRecords were observed in the registry with an invalid lastUsed time of Long.MAX_VALUE, effectively pinning them at the front of the cache and causing them to remain loaded even if not used. It's not clear whether the bug causing this has yet been fixed - for now we'll repair the entries and can monitor logs for the warning message below. Modification Add a check in the reaper loop over all models in the registry, and update invalid records with a lastUsed time of "3 hours ago". Result Avoid "stuck" models due to incorrect lastUsed time. Signed-off-by: Nick Hill <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the model continues to have a last used time of Long.MAX_VALUE
does that mean that it is never being used? If so, would it make more sense to set the last used time to something that indicates that it was never used?
Other than that question (and wondering why this would happen in practice) the changes look fine to me.
Thanks @tjohnson31415.
No since when the lastUsed field is updated it will only be changed if the to-be-set value is higher than the existing value. So we don't know from that alone whether or not such models were used recently. |
/lgtm |
Motivation Some model records were observed in the registry with an invalid lastUsed time of Long.MAX_VALUE, effectively pinning them at the front of the cache and causing them to remain loaded even if not used. An attempt to fix this was made in #36, but this didn't end up repairing things in the affected environments since the invalid lastUsed value was still in the local cache(s) and ended up being synchronized back to the registry. Modifications - Add some logic to the janitor processing to detect and update cache entries with a Long.MAX_VALUE lastUsed value - This requires a new forceSetLastUsedTime() method in ConcurrentLinkedHashMap since it's otherwise not possible to decrease the current value - Never sync this value back to the registry - Avoid creating failure placeholder records with a Long.MAX_VALUE lastUsed value Result Avoid "stuck" models due to incorrect lastUsed time. We still haven't determined the root cause (which might already be fixed) but will be able to see whether the issue arises again once this fix is deployed. Signed-off-by: Nick Hill <[email protected]>
#### Motivation Some model records were observed in the registry with an invalid `lastUsed` time of `Long.MAX_VALUE`, effectively pinning them at the front of the cache and causing them to remain loaded even if not used. An attempt to fix this was made in #36, but this didn't end up repairing things in the affected environments since the invalid `lastUsed` value was still in the local cache(s) and ended up being synchronized back to the registry. #### Modifications - Add some logic to the janitor processing to detect and update cache entries with a `Long.MAX_VALUE` `lastUsed` value - This requires a new `forceSetLastUsedTime()` method in `ConcurrentLinkedHashMap` since it's otherwise not possible to decrease the current value - Never sync this value back to the registry - Avoid creating failure placeholder records with a `Long.MAX_VALUE` `lastUsed` value #### Result Avoid "stuck" models due to incorrect `lastUsed` time. We still haven't determined the root cause (which might already be fixed) but will be able to see whether the issue arises again once this fix is deployed. Signed-off-by: Nick Hill <[email protected]>
Motivation
Some model records were observed in the registry with an invalid
lastUsed
time ofLong.MAX_VALUE
, effectively pinning them at the front of the cache and causing them to remain loaded even if not used.It's not clear whether the bug causing this has yet been fixed - for now we'll repair the entries and can monitor logs for the warning message below.
Modification
Add a check in the reaper loop over all models in the registry, and update invalid records with a
lastUsed
time of "3 hours ago".Result
Avoid "stuck" models due to incorrect
lastUsed
time.