Make dedicated cache’s size adjustable #70
Conversation
Each detector has its dedicated cache that stores ten entities' states per node. A detector's hottest entities load their states into the dedicated cache. Other detectors cannot use space reserved by a detector's dedicated cache. Previously, the size of the dedicated cache was not dynamically adjustable by users. This PR creates a dynamic setting AnomalyDetectorSettings.DEDICATED_CACHE_SIZE to make the dedicated cache's size flexible. When that setting is changed: if the size decreases, we release memory if required (e.g., when a user also decreased AnomalyDetectorSettings.MODEL_MAX_SIZE_PERCENTAGE, the max memory percentage that AD can use); if the size increases, we may reject the setting change if we cannot fulfill the request (e.g., when it would use more memory than allowed for AD).
Testing done:
* manual testing by increasing/decreasing the setting value
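For readers unfamiliar with dynamic settings, the general OpenSearch pattern for such a setting looks roughly like the sketch below. This is only an illustration: the setting key, default, bounds, and the memory check are assumptions, not necessarily what AnomalyDetectorSettings.DEDICATED_CACHE_SIZE actually uses.

import org.opensearch.cluster.service.ClusterService;
import org.opensearch.common.settings.Setting;

// Illustrative sketch only; the key, default, and bounds here are assumptions.
public class DedicatedCacheSizeSettingSketch {
    public static final Setting<Integer> DEDICATED_CACHE_SIZE = Setting
        .intSetting("plugins.anomaly_detection.dedicated_cache_size", 10, 0, Setting.Property.NodeScope, Setting.Property.Dynamic);

    // volatile because the settings-update consumer writes this from the cluster applier thread
    private volatile int dedicatedCacheSize = 10;

    public DedicatedCacheSizeSettingSketch(ClusterService clusterService) {
        clusterService.getClusterSettings().addSettingsUpdateConsumer(DEDICATED_CACHE_SIZE, newSize -> {
            // consumer: apply the accepted value; on a decrease, release reserved memory if needed
            this.dedicatedCacheSize = newSize;
        }, newSize -> {
            // validator: throwing here rejects the settings-update request up front,
            // e.g. when a larger reservation would exceed AD's memory budget
            if (newSize > this.dedicatedCacheSize && !enoughMemoryFor(newSize)) {
                throw new IllegalArgumentException("cannot reserve memory for dedicated cache size " + newSize);
            }
        });
    }

    // hypothetical check against the AD memory budget (MODEL_MAX_SIZE_PERCENTAGE)
    private boolean enoughMemoryFor(int newSize) {
        return true;
    }
}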
if (reserved) {
    // release in reserved memory
    memoryTracker.releaseMemory(memoryConsumptionPerEntity, true, Origin.HC_DETECTOR);
} else {
    // release in shared memory
    memoryTracker.releaseMemory(memoryConsumptionPerEntity, false, Origin.HC_DETECTOR);
}
[minor]
Could be simplified to:
memoryTracker.releaseMemory(memoryConsumptionPerEntity, reserved, Origin.HC_DETECTOR);
I realized I shouldn't release reserved memory. Only the !reserved branch is kept now. Please check the new version.
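For reference, the new version presumably collapses to something like this (a sketch based on the snippet above; the surrounding code is not shown in this diff):

if (!reserved) {
    // only memory tracked as shared is released; reserved memory stays reserved
    memoryTracker.releaseMemory(memoryConsumptionPerEntity, false, Origin.HC_DETECTOR);
}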
if (modelRemoved.getRcf() == null || modelRemoved.getThreshold() == null) {
    // only have samples. If we don't save, we throw the new samples and might
    // never be able to initialize the model
    checkpointWriteQueue.write(valueRemoved, true, SegmentPriority.MEDIUM);
} else {
    checkpointWriteQueue.write(valueRemoved, false, SegmentPriority.MEDIUM);
}
[minor] Similar to above, could simplify to:
// not sure what to name this variable since I can't find the class CheckpointWriteQueue and read the function signature
boolean isNullModel = modelRemoved.getRcf() == null || modelRemoved.getThreshold() == null;
checkpointWriteQueue.write(valueRemoved, isNullModel, SegmentPriority.MEDIUM);
Yeah, changed to use your version.
import org.opensearch.ad.ratelimit.CheckpointWriteQueue;
import org.opensearch.ad.ratelimit.SegmentPriority;
Are these classes defined somewhere? I can't find them in the current repo nor in this diff.
Yes, they are defined and were included in separate PRs. Please use this branch to check: https://github.com/kaituo/anomaly-detection-1/tree/multiCat
} else if (random.nextInt(6) == 0) {
    // checkpoint is relatively big compared to other queued requests
    // save checkpoints with 1/6 probability as we expect to save
    // all every 6 hours statistically
Any reason to use randomness to decide whether we should save model checkpoints rather than some deterministic way?
For this case: if some model checkpoints are missing from the index and we need to analyze why they weren't saved, it will be hard to tell whether the random draw simply didn't hit or the save failed.
And as you said the checkpoint is relatively big, so I guess there will be some spike in metrics like disk write/CPU/memory when we save checkpoints? If we need to analyze the metrics, random spikes may be very hard to explain.
We save a checkpoint when:
1) removing the model from cache;
2) doing cold start;
3) we have no complete model, only a few samples (if we don't save new samples, we will never be able to accumulate enough samples for a trained model);
4) periodically, in case of exceptions.
This branch is doing 4). Previously, I did it every hour for all in-cache models. Considering we are moving to 1M entities, that would put the cluster under heavy load every hour. That's why I am doing it randomly (expected every 6 hours for each checkpoint, statistically).
I am doing it randomly since maintaining state about which models have been saved and which haven't is not cheap. Also, the models in the cache can change dynamically, so we would have to maintain that state in the removal logic too. Random is a lazy way to deal with this, as it is stateless and statistically sound; see the sketch below.
We will still have a checkpoint; it's just that if a checkpoint does not fall into the 6-hour bucket in a particular scenario, the model is stale (i.e., we don't recover from the freshest model in a disaster).
Yes, this change is mostly for performance and easy maintenance.
Yes, the solution has flaws. It is a tradeoff. Any other suggestions?
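To illustrate the stateless scheme described above (a sketch, not the plugin's actual maintenance code): each hourly maintenance pass saves any given in-cache model with probability 1/6, so the expected interval between saves of one model is six hours, with no bookkeeping of which models were already saved.

import java.util.Map;
import java.util.Random;

// Sketch only: the write(...) signature and SegmentPriority mirror names in this diff
// but are stand-ins, not the real classes.
public class RandomCheckpointSketch<T> {
    private final Random random = new Random();

    // called once per hour per node for the models currently in cache
    void hourlyMaintenance(Map<String, T> inCacheModels, CheckpointWriter<T> checkpointWriteQueue) {
        for (T modelState : inCacheModels.values()) {
            // 1/6 chance each hour => expected once every 6 hours per model, stateless
            if (random.nextInt(6) == 0) {
                checkpointWriteQueue.write(modelState, false, SegmentPriority.MEDIUM);
            }
        }
    }

    interface CheckpointWriter<T> {
        void write(T state, boolean forceWrite, SegmentPriority priority);
    }

    enum SegmentPriority { LOW, MEDIUM, HIGH }
}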
Thanks, looks good as a tradeoff.
    return items.values().stream().collect(Collectors.toList());
}

public PriorityTracker getPriorityTracker() {
    return priorityTracker;
}

public void setMinimumCapacity(int minimumCapacity) {
Seems the minimumCapacity means minimum entity count? Add some comments or change the variable name?
It means the reserved cache size. So no matter how many entities there are, we keep room for at least minimumCapacity entities. Added the comments.
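Something along these lines, presumably (the comment wording and the body here are illustrative, not necessarily what was committed):

/**
 * Sets the reserved (dedicated) cache size: regardless of how many entities a
 * detector has, room for at least this many entity states is kept for it on the node.
 */
public void setMinimumCapacity(int minimumCapacity) {
    // body not shown in this diff; a plain assignment is assumed here
    this.minimumCapacity = minimumCapacity;
}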
public class PriorityCache implements EntityCache {
    private final Logger LOG = LogManager.getLogger(PriorityCache.class);

    // detector id -> CacheBuffer, weight based
    private final Map<String, CacheBuffer> activeEnities;
    private final CheckpointDao checkpointDao;
-   private final int dedicatedCacheSize;
+   private int dedicatedCacheSize;
Add volatile?
Sure. Just curious, has a bug related to this happened?
Before this PR we just used a final value for dedicatedCacheSize. We need to add volatile as we add a setting update consumer in this PR.
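A minimal sketch of the concern (class and method names here are illustrative, not the plugin's):

public class VolatileSettingSketch {
    // written by the settings-update consumer on the cluster applier thread,
    // read by query/maintenance threads; volatile makes the new value visible
    // to readers without further synchronization
    private volatile int dedicatedCacheSize = 10;

    void onSettingChange(int newSize) {
        this.dedicatedCacheSize = newSize;
    }

    int currentDedicatedCacheSize() {
        return dedicatedCacheSize;
    }
}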
Not particular to this PR. Just wondering, since you commented about this in other PRs too.
This PR is a conglomerate of the following PRs: #60 #64 #65 #67 #68 #69 #70 #71 #74 #75 #76 #77 #78 #79 #82 #83 #84 #92 #94 #93 #95 kaituo#1 kaituo#2 kaituo#3 kaituo#4 kaituo#5 kaituo#6 kaituo#7 kaituo#8 kaituo#9 kaituo#10. This spreadsheet contains the mappings from files to PR number (bug fixes in my AD fork and tests are not included): https://gist.github.com/kaituo/9e1592c4ac4f2f449356cb93d0591167
Note: since there are a lot of dependencies, I only list the main class and test code to save reviewers' time. The build will fail due to missing dependencies. I will use that PR just for review and will not merge it. I will have a big one in the end and merge it once all review PRs get approved. For now the code is missing unit tests. I will add unit tests, run performance tests, and fix bugs before the official release.
Description
This PR also changes the behavior of the PriorityCache.get method. Previously, it would auto-load models if space was available or the external entity's priority was high enough. Since we have moved the logic of loading models to rate-limited queues, we move the model-loading logic to a separate method and let the callers (i.e., the queues) decide how to deal with models. PriorityCache.get now mostly returns the existing state and updates priority.
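A rough sketch of that split (everything beyond the get method name is an assumption about shape, not the plugin's real signatures):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the described behavior change; ModelState and CacheBuffer are placeholders.
public class PriorityCacheGetSketch {
    static class ModelState { }

    static class CacheBuffer {
        private final Map<String, ModelState> states = new ConcurrentHashMap<>();
        ModelState get(String entityModelId) { return states.get(entityModelId); }
        void updatePriority(String entityModelId) { /* bump the entity's priority */ }
    }

    // detector id -> CacheBuffer
    private final Map<String, CacheBuffer> activeEnities = new ConcurrentHashMap<>();

    // After this PR: return the cached state (if any) and refresh its priority; never load here.
    public ModelState get(String entityModelId, String detectorId) {
        CacheBuffer buffer = activeEnities.get(detectorId);
        if (buffer == null) {
            return null;
        }
        buffer.updatePriority(entityModelId);
        return buffer.get(entityModelId);
    }

    // Loading now lives on a separate path that the rate-limited queues call explicitly;
    // the caller decides what to do when there is no room. Name is hypothetical.
    public boolean tryLoad(String detectorId, String entityModelId, ModelState state) {
        return false;
    }
}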
Testing done:
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.