Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Repair invalid ModelRecord lastUsed values in registry #36

Merged
merged 1 commit into from
May 31, 2022

Conversation

njhill
Copy link
Member

@njhill njhill commented May 27, 2022

Motivation

Some model records were observed in the registry with an invalid lastUsed time of Long.MAX_VALUE, effectively pinning them at the front of the cache and causing them to remain loaded even if not used.

It's not clear whether the bug causing this has yet been fixed - for now we'll repair the entries and can monitor logs for the warning message below.

Modification

Add a check in the reaper loop over all models in the registry, and update invalid records with a lastUsed time of "3 hours ago".

Result

Avoid "stuck" models due to incorrect lastUsed time.

@kserve-oss-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: njhill

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Motivation

Some ModelRecords were observed in the registry with an invalid lastUsed time of Long.MAX_VALUE, effectively pinning them at the front of the cache and causing them to remain loaded even if not used.

It's not clear whether the bug causing this has yet been fixed - for now we'll repair the entries and can monitor logs for the warning message below.

Modification

Add a check in the reaper loop over all models in the registry, and update invalid records with a lastUsed time of "3 hours ago".

Result

Avoid "stuck" models due to incorrect lastUsed time.

Signed-off-by: Nick Hill <[email protected]>
@njhill njhill force-pushed the repair-lastused branch from 8f05a4f to 2641c33 Compare May 27, 2022 23:24
Copy link
Contributor

@tjohnson31415 tjohnson31415 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the model continues to have a last used time of Long.MAX_VALUE does that mean that it is never being used? If so, would it make more sense to set the last used time to something that indicates that it was never used?

Other than that question (and wondering why this would happen in practice) the changes look fine to me.

@njhill
Copy link
Member Author

njhill commented May 31, 2022

Thanks @tjohnson31415.

If the model continues to have a last used time of Long.MAX_VALUE does that mean that it is never being used?

No since when the lastUsed field is updated it will only be changed if the to-be-set value is higher than the existing value. So we don't know from that alone whether or not such models were used recently.

@tjohnson31415
Copy link
Contributor

/lgtm

@kserve-oss-bot kserve-oss-bot merged commit 4ac6757 into main May 31, 2022
@tjohnson31415 tjohnson31415 deleted the repair-lastused branch June 1, 2022 06:01
njhill added a commit that referenced this pull request Jun 28, 2022
Motivation

Some model records were observed in the registry with an invalid lastUsed time of Long.MAX_VALUE, effectively pinning them at the front of the cache and causing them to remain loaded even if not used.

An attempt to fix this was made in #36, but this didn't end up repairing things in the affected environments since the invalid lastUsed value was still in the local cache(s) and ended up being synchronized back to the registry.

Modifications

- Add some logic to the janitor processing to detect and update cache entries with a Long.MAX_VALUE lastUsed value
   - This requires a new forceSetLastUsedTime() method in ConcurrentLinkedHashMap since it's otherwise not possible to decrease the current value
- Never sync this value back to the registry
- Avoid creating failure placeholder records with a Long.MAX_VALUE lastUsed value

Result

Avoid "stuck" models due to incorrect lastUsed time. We still haven't determined the root cause (which might already be fixed) but will be able to see whether the issue arises again once this fix is deployed.

Signed-off-by: Nick Hill <[email protected]>
kserve-oss-bot pushed a commit that referenced this pull request Jun 30, 2022
#### Motivation

Some model records were observed in the registry with an invalid `lastUsed` time of `Long.MAX_VALUE`, effectively pinning them at the front of the cache and causing them to remain loaded even if not used.

An attempt to fix this was made in #36, but this didn't end up repairing things in the affected environments since the invalid `lastUsed` value was still in the local cache(s) and ended up being synchronized back to the registry.

#### Modifications

- Add some logic to the janitor processing to detect and update cache entries with a `Long.MAX_VALUE` `lastUsed` value
   - This requires a new `forceSetLastUsedTime()` method in `ConcurrentLinkedHashMap` since it's otherwise not possible to decrease the current value
- Never sync this value back to the registry
- Avoid creating failure placeholder records with a `Long.MAX_VALUE` `lastUsed` value

#### Result

Avoid "stuck" models due to incorrect `lastUsed` time. We still haven't determined the root cause (which might already be fixed) but will be able to see whether the issue arises again once this fix is deployed.


Signed-off-by: Nick Hill <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants