
Make dedicated cache’s size adjustable #70

Closed

Conversation


@kaituo kaituo commented May 25, 2021

Note: since there are a lot of dependencies, I only list the main class and test code to save reviewers' time. The build will fail due to missing dependencies. This PR is just for review and will not be merged. I will combine everything into one big PR at the end and merge it once all review PRs are approved. The code is currently missing unit tests; I will add unit tests, run performance tests, and fix bugs before the official release.

Description

Each detector has a dedicated cache that stores ten entities' states per node. A detector's hottest entities load their states into the dedicated cache. Other detectors cannot use the space reserved by a detector's dedicated cache. Previously, the size of the dedicated cache was not dynamically adjustable by users. This PR adds a dynamic setting, AnomalyDetectorSettings.DEDICATED_CACHE_SIZE, to make the dedicated cache's size flexible. When that setting changes, if the size decreases, we release memory if required (e.g., when a user also decreased AnomalyDetectorSettings.MODEL_MAX_SIZE_PERCENTAGE, the maximum memory percentage that AD can use); if the size increases, we may reject the setting change if we cannot fulfill the request (e.g., when it would use more memory than AD is allowed).
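For context, a dynamic OpenSearch setting of this kind is typically defined and wired up roughly as below. This is a minimal sketch: the setting key, default value, and the consumer body are assumptions for illustration, not the exact code in this PR.

import org.opensearch.common.settings.Setting;

// Hypothetical sketch: a node-scoped, dynamically updatable integer setting.
public static final Setting<Integer> DEDICATED_CACHE_SIZE = Setting
    .intSetting(
        "plugins.anomaly_detection.dedicated_cache_size",  // assumed key
        10,                                                 // assumed default: ten entities per node
        0,                                                  // minimum allowed value
        Setting.Property.NodeScope,
        Setting.Property.Dynamic
    );

// Registered during cache construction so changes take effect at runtime:
clusterService
    .getClusterSettings()
    .addSettingsUpdateConsumer(DEDICATED_CACHE_SIZE, newSize -> {
        // on decrease: release reserved memory if required;
        // on increase: the change can be rejected if AD's memory budget cannot fit it
        this.setDedicatedCacheSize(newSize);                // assumed helper
    });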

This PR also changes the behavior of the PriorityCache.get method. Previously, it would auto-load models if space was available or the entity's priority was high enough. Since the model-loading logic has moved to rate-limited queues, we move it into a separate method and let the callers (i.e., the queues) decide how to deal with models. PriorityCache.get now mostly returns the existing state and updates priority.
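A rough sketch of that split, assuming simplified types and a hypothetical name (hostIfPossible) for the separate loading method; the actual signatures in the PR may differ.

// Hypothetical sketch: get only returns existing state and updates priority.
@Override
public ModelState<EntityModel> get(String modelId, AnomalyDetector detector) {
    CacheBuffer buffer = activeEnities.get(detector.getDetectorId());
    if (buffer == null) {
        return null;
    }
    // returns the existing state if present and bumps the entity's priority;
    // a missing model is no longer auto-loaded here
    return buffer.get(modelId);
}

// Loading becomes a separate call that the rate-limited queues invoke,
// deciding themselves what to do when the cache cannot host the model.
public boolean hostIfPossible(AnomalyDetector detector, ModelState<EntityModel> toLoad) {
    CacheBuffer buffer = computeBufferIfAbsent(detector);       // assumed helper
    if (!buffer.canHost(toLoad.getPriority())) {                 // assumed capacity/priority check
        return false;
    }
    buffer.put(toLoad.getModelId(), toLoad);
    return true;
}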

Testing done:

  • manual testing by increasing/decreasing the setting value

Check List

  • [ Y ] Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@kaituo kaituo requested review from ylwu-amzn and LiuJoyceC May 25, 2021 23:17
Comment on lines 244 to 250
if (reserved) {
// release in reserved memory
memoryTracker.releaseMemory(memoryConsumptionPerEntity, true, Origin.HC_DETECTOR);
} else {
// release in shared memory
memoryTracker.releaseMemory(memoryConsumptionPerEntity, false, Origin.HC_DETECTOR);
}
@LiuJoyceC LiuJoyceC May 27, 2021

[minor]
Could be simplified to:
memoryTracker.releaseMemory(memoryConsumptionPerEntity, reserved, Origin.HC_DETECTOR);

Collaborator Author

I realized I shouldn't release reserved memory. I only keep the !reserved branch now. Please check the new version.
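For reference, the updated logic presumably looks roughly like this (a sketch of the new version, not the exact diff):

if (!reserved) {
    // only shared memory is released; reserved (dedicated) memory stays
    // allocated to the detector even after this entity's state is evicted
    memoryTracker.releaseMemory(memoryConsumptionPerEntity, false, Origin.HC_DETECTOR);
}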

Comment on lines 254 to 260
if (modelRemoved.getRcf() == null || modelRemoved.getThreshold() == null) {
// only have samples. If we don't save, we throw the new samples and might
// never be able to initialize the model
checkpointWriteQueue.write(valueRemoved, true, SegmentPriority.MEDIUM);
} else {
checkpointWriteQueue.write(valueRemoved, false, SegmentPriority.MEDIUM);
}
@LiuJoyceC LiuJoyceC May 28, 2021

[minor] Similar to above, could simplify to:

// not sure what to name this variable since I can't find the class CheckpointWriteQueue and read the function signature
boolean isNullModel = modelRemoved.getRcf() == null || modelRemoved.getThreshold() == null;
checkpointWriteQueue.write(valueRemoved, isNullModel, SegmentPriority.MEDIUM);

Collaborator Author

Yeah, changed to use your version.

Comment on lines 50 to 51
import org.opensearch.ad.ratelimit.CheckpointWriteQueue;
import org.opensearch.ad.ratelimit.SegmentPriority;

Are these classes defined somewhere? I can't find them in the current repo or in this diff.

Collaborator Author

Yes, they are defined and were included in separate PRs. Please use this branch to check: https://github.com/kaituo/anomaly-detection-1/tree/multiCat

} else if (random.nextInt(6) == 0) {
// checkpoint is relatively big compared to other queued requests
// save checkpoints with 1/6 probability as we expect to save
// all every 6 hours statistically
Collaborator

Any reason to use random to decide whether we should save model checkpoints rather than some deterministic way?
For this case: if some model checkpoints are missing in the index and we need to analyze why they weren't saved, it will be hard to tell whether the save was skipped by the random check or whether it failed.
And as you said, "checkpoint is relatively big", so I guess there will be some spike in metrics like disk write/CPU/memory when we save checkpoints? If we need to analyze the metrics, the random spikes may be very hard to explain.

Collaborator Author

We will save a checkpoint when

  1. removing the model from cache.
  2. cold start.
  3. we have no complete model, only a few samples. If we don't save the new samples, we will never accumulate enough samples for a trained model.
  4. periodically, in case of exceptions.

This branch handles 4). Previously, I did it every hour for all in-cache models. Considering we are moving to 1M entities, that would put the cluster under heavy load every hour. That's why I am doing it randomly (each checkpoint is expected to be saved every 6 hours statistically); see the sketch below.

I am doing it randomly because maintaining state about which model has been saved and which hasn't is not cheap. Also, the models in the cache can change dynamically, so we would have to maintain that state in the removal logic too. Random is a lazy way to deal with this as it is stateless and statistically sound.

We will still have a checkpoint; it's just that if a checkpoint does not fall into the 6-hour bucket in a particular scenario, the model is stale (i.e., we don't recover from the freshest model in a disaster).

Yes, this change is mostly for performance and easy maintenance.

Yes, the solution has flaws. It is a tradeoff. Any other suggestions?
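For context, the maintenance branch being discussed presumably does something like the following (a sketch; the method name and exact write priority are assumptions). An hourly pass that saves each in-cache model with probability 1/6 persists every model about once per 6 hours on average, with no per-model bookkeeping:

// Hypothetical sketch of the periodic save (case 4 above), run roughly hourly.
private void maybeSaveCheckpoints(Collection<ModelState<EntityModel>> inCacheModels) {
    for (ModelState<EntityModel> modelState : inCacheModels) {
        // each model is written with probability 1/6 per hourly pass, so the
        // expected interval between checkpoints for a given model is 6 hours
        if (random.nextInt(6) == 0) {
            checkpointWriteQueue.write(modelState, false, SegmentPriority.MEDIUM);
        }
    }
}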

Collaborator

Thanks, looks good as a tradeoff.

return items.values().stream().collect(Collectors.toList());
}

public PriorityTracker getPriorityTracker() {
return priorityTracker;
}

public void setMinimumCapacity(int minimumCapacity) {
Collaborator

It seems minimumCapacity means the minimum entity count? Add some comments or change the variable name?

Collaborator Author

It means the reserved cache size. So no matter how many entities there are, we will keep room for minimumCapacity entities. Added the comments.
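The added comment presumably reads something like this (a sketch; the bookkeeping inside the setter is an assumption):

/**
 * Sets the reserved (dedicated) cache size. No matter how many entities a
 * detector has, the cache keeps room for at least minimumCapacity entities'
 * states on this node.
 */
public void setMinimumCapacity(int minimumCapacity) {
    this.minimumCapacity = minimumCapacity;
    this.reservedBytes = memoryConsumptionPerEntity * minimumCapacity;  // assumed bookkeeping
}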


public class PriorityCache implements EntityCache {
private final Logger LOG = LogManager.getLogger(PriorityCache.class);

// detector id -> CacheBuffer, weight based
private final Map<String, CacheBuffer> activeEnities;
private final CheckpointDao checkpointDao;
- private final int dedicatedCacheSize;
+ private int dedicatedCacheSize;
Collaborator

Add volatile?

Collaborator Author

Sure. Just curious, has a bug related to this happened?

Collaborator

@ylwu-amzn ylwu-amzn Jun 21, 2021

Before this PR, we just used a final value for dedicatedCacheSize. We need to add volatile as we add a setting update consumer in this PR.
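In other words, the field is now written by the settings-applier thread and read by request-handling threads, so the declaration needs volatile (a sketch):

// Written when the DEDICATED_CACHE_SIZE setting update consumer fires,
// read concurrently by threads serving detection requests; volatile ensures
// readers observe the new value without extra synchronization.
private volatile int dedicatedCacheSize;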

Collaborator Author

Not particular to this PR. Just wondering, since you commented about this in other PRs too.

@kaituo kaituo closed this Jun 22, 2021
kaituo added a commit that referenced this pull request Jul 12, 2021
This PR is a conglomerate of the following PRs.

#60
#64
#65
#67
#68
#69
#70
#71
#74
#75
#76
#77
#78
#79
#82
#83
#84
#92
#94
#93
#95
kaituo#1
kaituo#2
kaituo#3
kaituo#4
kaituo#5
kaituo#6
kaituo#7
kaituo#8
kaituo#9
kaituo#10

This spreadsheet contains the mappings from files to PR number (bug fix in my AD fork and tests are not included):
https://gist.github.com/kaituo/9e1592c4ac4f2f449356cb93d0591167
ohltyler pushed a commit that referenced this pull request Sep 1, 2021