
Make dedicated cache’s size adjustable #70

Closed

Conversation


@kaituo kaituo commented May 25, 2021

Note: since there are a lot of dependencies, I only list the main class and test code to save reviewers' time. The build will fail due to missing dependencies. This PR is just for review and will not be merged. I will combine everything into one big PR at the end and merge it once all review PRs are approved. The code is currently missing unit tests; I will add unit tests, run performance tests, and fix bugs before the official release.

Description

Each detector has a dedicated cache that stores ten entities' states per node. A detector's hottest entities load their states into the dedicated cache. Other detectors cannot use the space reserved by a detector's dedicated cache. Previously, the size of the dedicated cache was not dynamically adjustable by users. This PR adds a dynamic setting, AnomalyDetectorSettings.DEDICATED_CACHE_SIZE, to make the dedicated cache's size flexible. When that setting changes, if the size decreases, we release memory if required (e.g., when a user also decreased AnomalyDetectorSettings.MODEL_MAX_SIZE_PERCENTAGE, the maximum memory percentage that AD can use); if the size increases, we may reject the setting change if we cannot fulfill the request (e.g., when it would use more memory than AD is allowed).
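For context, a dynamic OpenSearch setting of this kind is typically defined and wired up roughly as below. This is a minimal sketch: the setting key, default value, and the consumer body are assumptions for illustration, not the exact code in this PR.

import org.opensearch.common.settings.Setting;

// Hypothetical sketch: a node-scoped, dynamically updatable integer setting.
public static final Setting<Integer> DEDICATED_CACHE_SIZE = Setting
    .intSetting(
        "plugins.anomaly_detection.dedicated_cache_size",  // assumed key
        10,                                                 // assumed default: ten entities per node
        0,                                                  // minimum allowed value
        Setting.Property.NodeScope,
        Setting.Property.Dynamic
    );

// Registered during cache construction so changes take effect at runtime:
clusterService
    .getClusterSettings()
    .addSettingsUpdateConsumer(DEDICATED_CACHE_SIZE, newSize -> {
        // on decrease: release reserved memory if required;
        // on increase: the change can be rejected if AD's memory budget cannot fit it
        this.setDedicatedCacheSize(newSize);                // assumed helper
    });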

This PR also changes the behavior of the PriorityCache.get method. Previously, it would auto-load models if space was available or the entity's priority was high enough. Since the model-loading logic has moved to rate-limited queues, we move it into a separate method and let the callers (i.e., the queues) decide how to deal with models. PriorityCache.get now mostly returns the existing state and updates priority.
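A rough sketch of that split, assuming simplified types and a hypothetical name (hostIfPossible) for the separate loading method; the actual signatures in the PR may differ.

// Hypothetical sketch: get only returns existing state and updates priority.
@Override
public ModelState<EntityModel> get(String modelId, AnomalyDetector detector) {
    CacheBuffer buffer = activeEnities.get(detector.getDetectorId());
    if (buffer == null) {
        return null;
    }
    // returns the existing state if present and bumps the entity's priority;
    // a missing model is no longer auto-loaded here
    return buffer.get(modelId);
}

// Loading becomes a separate call that the rate-limited queues invoke,
// deciding themselves what to do when the cache cannot host the model.
public boolean hostIfPossible(AnomalyDetector detector, ModelState<EntityModel> toLoad) {
    CacheBuffer buffer = computeBufferIfAbsent(detector);       // assumed helper
    if (!buffer.canHost(toLoad.getPriority())) {                 // assumed capacity/priority check
        return false;
    }
    buffer.put(toLoad.getModelId(), toLoad);
    return true;
}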

Testing done:

  • manual testing by increasing/decreasing the setting value

Check List

  • [ Y ] Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@kaituo kaituo requested review from ylwu-amzn and LiuJoyceC May 25, 2021 23:17
Comment on lines 244 to 250
if (reserved) {
// release in reserved memory
memoryTracker.releaseMemory(memoryConsumptionPerEntity, true, Origin.HC_DETECTOR);
} else {
// release in shared memory
memoryTracker.releaseMemory(memoryConsumptionPerEntity, false, Origin.HC_DETECTOR);
}
@LiuJoyceC LiuJoyceC May 27, 2021

[minor]
Could be simplified to:
memoryTracker.releaseMemory(memoryConsumptionPerEntity, reserved, Origin.HC_DETECTOR);

Collaborator Author

I realized I shouldn't release reserved memory. I only keep the !reserved branch now. Please check the new version.
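For reference, the updated logic presumably looks roughly like this (a sketch of the new version, not the exact diff):

if (!reserved) {
    // only shared memory is released; reserved (dedicated) memory stays
    // allocated to the detector even after this entity's state is evicted
    memoryTracker.releaseMemory(memoryConsumptionPerEntity, false, Origin.HC_DETECTOR);
}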

Comment on lines 254 to 260
if (modelRemoved.getRcf() == null || modelRemoved.getThreshold() == null) {
// only have samples. If we don't save, we throw the new samples and might
// never be able to initialize the model
checkpointWriteQueue.write(valueRemoved, true, SegmentPriority.MEDIUM);
} else {
checkpointWriteQueue.write(valueRemoved, false, SegmentPriority.MEDIUM);
}
@LiuJoyceC LiuJoyceC May 28, 2021

[minor] Similar to above, could simplify to:

// not sure what to name this variable since I can't find the class CheckpointWriteQueue and read the function signature
boolean isNullModel = modelRemoved.getRcf() == null || modelRemoved.getThreshold() == null;
checkpointWriteQueue.write(valueRemoved, isNullModel, SegmentPriority.MEDIUM);

Collaborator Author

Yeah, changed to use your version.

Comment on lines 50 to 51
import org.opensearch.ad.ratelimit.CheckpointWriteQueue;
import org.opensearch.ad.ratelimit.SegmentPriority;

Are these classes defined somewhere? I can't find them in the current repo or in this diff.

Collaborator Author

Yes, they are defined and were included in separate PRs. Please use this branch to check: https://github.com/kaituo/anomaly-detection-1/tree/multiCat

} else if (random.nextInt(6) == 0) {
// checkpoint is relatively big compared to other queued requests
// save checkpoints with 1/6 probability as we expect to save
// all every 6 hours statistically
Collaborator

Any reason to use random to decide whether we should save model checkpoints rather than some deterministic way?
For this case: if some model checkpoints are missing in the index and we need to analyze why they weren't saved, it will be hard to tell whether the save was skipped by the random check or whether it failed.
And as you said, "checkpoint is relatively big", so I guess there will be some spike in metrics like disk write/CPU/memory when we save checkpoints? If we need to analyze the metrics, the random spikes may be very hard to explain.

Collaborator Author

We will save a checkpoint when

  1. removing the model from cache.
  2. cold start.
  3. we have no complete model, only a few samples. If we don't save the new samples, we will never accumulate enough samples for a trained model.
  4. periodically, in case of exceptions.

This branch handles 4). Previously, I did it every hour for all in-cache models. Considering we are moving to 1M entities, that would put the cluster under heavy load every hour. That's why I am doing it randomly (each checkpoint is expected to be saved every 6 hours statistically); see the sketch below.

I am doing it randomly because maintaining state about which model has been saved and which hasn't is not cheap. Also, the models in the cache can change dynamically, so we would have to maintain that state in the removal logic too. Random is a lazy way to deal with this as it is stateless and statistically sound.

We will still have a checkpoint; it's just that if a checkpoint does not fall into the 6-hour bucket in a particular scenario, the model is stale (i.e., we don't recover from the freshest model in a disaster).

Yes, this change is mostly for performance and easy maintenance.

Yes, the solution has flaws. It is a tradeoff. Any other suggestions?
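For context, the maintenance branch being discussed presumably does something like the following (a sketch; the method name and exact write priority are assumptions). An hourly pass that saves each in-cache model with probability 1/6 persists every model about once per 6 hours on average, with no per-model bookkeeping:

// Hypothetical sketch of the periodic save (case 4 above), run roughly hourly.
private void maybeSaveCheckpoints(Collection<ModelState<EntityModel>> inCacheModels) {
    for (ModelState<EntityModel> modelState : inCacheModels) {
        // each model is written with probability 1/6 per hourly pass, so the
        // expected interval between checkpoints for a given model is 6 hours
        if (random.nextInt(6) == 0) {
            checkpointWriteQueue.write(modelState, false, SegmentPriority.MEDIUM);
        }
    }
}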

Collaborator

Thanks, looks good as a tradeoff.

return items.values().stream().collect(Collectors.toList());
}

public PriorityTracker getPriorityTracker() {
return priorityTracker;
}

public void setMinimumCapacity(int minimumCapacity) {
Collaborator

It seems minimumCapacity means the minimum entity count? Add some comments or change the variable name?

Collaborator Author

It means the reserved cache size. So no matter how many entities there are, we will keep room for minimumCapacity entities. Added the comments.
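The added comment presumably reads something like this (a sketch; the bookkeeping inside the setter is an assumption):

/**
 * Sets the reserved (dedicated) cache size. No matter how many entities a
 * detector has, the cache keeps room for at least minimumCapacity entities'
 * states on this node.
 */
public void setMinimumCapacity(int minimumCapacity) {
    this.minimumCapacity = minimumCapacity;
    this.reservedBytes = memoryConsumptionPerEntity * minimumCapacity;  // assumed bookkeeping
}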


public class PriorityCache implements EntityCache {
private final Logger LOG = LogManager.getLogger(PriorityCache.class);

// detector id -> CacheBuffer, weight based
private final Map<String, CacheBuffer> activeEnities;
private final CheckpointDao checkpointDao;
- private final int dedicatedCacheSize;
+ private int dedicatedCacheSize;
Collaborator

Add volatile?

Collaborator Author

Sure. Just curious, has a bug related to this happened?

Collaborator

@ylwu-amzn ylwu-amzn Jun 21, 2021

Before this PR, we just used a final value for dedicatedCacheSize. We need to add volatile as we add a setting update consumer in this PR.
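In other words, the field is now written by the settings-applier thread and read by request-handling threads, so the declaration needs volatile (a sketch):

// Written when the DEDICATED_CACHE_SIZE setting update consumer fires,
// read concurrently by threads serving detection requests; volatile ensures
// readers observe the new value without extra synchronization.
private volatile int dedicatedCacheSize;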

Collaborator Author

Not particular to this PR. Just wondering, since you commented about this in other PRs too.

@kaituo kaituo closed this Jun 22, 2021
kaituo added a commit that referenced this pull request Jul 12, 2021
This PR is a conglomerate of the following PRs.

#60
#64
#65
#67
#68
#69
#70
#71
#74
#75
#76
#77
#78
#79
#82
#83
#84
#92
#94
#93
#95
kaituo#1
kaituo#2
kaituo#3
kaituo#4
kaituo#5
kaituo#6
kaituo#7
kaituo#8
kaituo#9
kaituo#10

This spreadsheet contains the mappings from files to PR number (bug fix in my AD fork and tests are not included):
https://gist.github.com/kaituo/9e1592c4ac4f2f449356cb93d0591167
ohltyler pushed a commit that referenced this pull request Sep 1, 2021