ECS reporter: Minimize API calls by caching task and service data #2065
Conversation
Tested, added a ton of debug logging and fixed a bunch of bugs. In this example system, we have one service with 3 tasks across 3 machines, so one task on our example machine. The first report works out to a total of 3 calls (1 DescribeTasks, 1 ListServices, 1 DescribeServices); subsequent reports make a total of 1 call (DescribeServices).
Shouldn't this be size-bound instead of time-bound? Also, 1 minute seems very conservative (1 minute in scheduler time is very small).
Would this fix #2085? (i.e. what does "known" mean here? Previously referenced?)
I won't either
@@ -78,39 +78,51 @@ func getLabelInfo(rpt report.Report) map[string]map[string]*taskLabelInfo {

// Reporter implements Tagger, Reporter
type Reporter struct {
	clients map[string]*ecsClient
client, ok := r.clients[cluster]
if !ok {
	log.Debugf("Creating new ECS client")
	var err error // can't use := on the next line without shadowing outer client var
}

// New creates a new Reporter
func New() Reporter {
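A minimal sketch of the per-cluster client reuse discussed in this thread; `newClient` and the stripped-down `Reporter`/`ecsClient` types below are stand-ins, not the PR's exact code:

```go
// Sketch: one ECS client per cluster, created lazily and reused across reports
// so connections and the per-cluster caches carry over.
type ecsClient struct{ cluster string }

// newClient stands in for the real constructor, which also wires up the caches.
func newClient(cluster string) (*ecsClient, error) {
	return &ecsClient{cluster: cluster}, nil
}

type Reporter struct {
	clients map[string]*ecsClient
}

func (r *Reporter) clientFor(cluster string) (*ecsClient, error) {
	client, ok := r.clients[cluster]
	if !ok {
		var err error // can't use := below without shadowing the outer client var
		client, err = newClient(cluster)
		if err != nil {
			return nil, err
		}
		r.clients[cluster] = client
	}
	return client, nil
}
```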
client *ecs.ECS
cluster string
taskCache map[string]ecsTask
serviceCache map[string]ecsService |
}

// Returns an ecsInfo struct containing data needed for a report.
func (c ecsClient) getInfo(taskARNs []string) ecsInfo { |
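Roughly the flow `getInfo` is meant to implement, sketched with stub helpers (the method names and signatures below are illustrative, not the PR's actual ones):

```go
// Condensed sketch of the per-report flow: describe unseen tasks, refresh the
// services we already know are referenced, and only list + describe everything
// when some task is still unmatched. The helpers are stubs.
type taskServiceMap map[string]string // task ARN -> service name

type sketchClient struct{}

// DescribeTasks, but only for ARNs never seen before (tasks are immutable for
// our purposes, so cached entries never go stale).
func (c sketchClient) ensureTasks(taskARNs []string) {}

// Refresh every service already matched to some task; returns the matches and
// the task ARNs that still have no service.
func (c sketchClient) refreshReferencedServices() (taskServiceMap, []string) {
	return taskServiceMap{}, nil
}

// Full ListServices + DescribeServices sweep; the expensive path.
func (c sketchClient) describeAllServices(matched taskServiceMap) taskServiceMap {
	return matched
}

func (c sketchClient) getInfo(taskARNs []string) taskServiceMap {
	c.ensureTasks(taskARNs)
	matched, unmatched := c.refreshReferencedServices()
	if len(unmatched) > 0 {
		// Only pay for the full sweep when a task references a service we
		// have never seen (or one that fell out of the cache).
		matched = c.describeAllServices(matched)
	}
	return matched
}
```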
Can you give a rough estimate of how many clusters/container instances/services could be run with these improvements? Did you test this in AWS with a few clusters? Also, could you please add some unit tests?
servicesChan <- c.getServices()
}()
// Evict entries from the caches which have not been used within the eviction interval. | ||
func (c ecsClient) evictOldCacheItems() { |
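For illustration, a minimal sketch of time-based eviction over a hand-rolled map cache, which is roughly what this early revision does before the PR later switches to a size-bounded cache library; the field names are assumptions:

```go
import "time"

const evictionInterval = 1 * time.Minute

type cachedTask struct {
	lastUsed time.Time
	// plus the immutable task fields, elided here
}

// evictOldCacheItems drops entries that have not been touched within the
// eviction interval, bounding memory without an explicit size limit.
func evictOldCacheItems(taskCache map[string]cachedTask) {
	cutoff := time.Now().Add(-evictionInterval)
	for arn, task := range taskCache {
		if task.lastUsed.Before(cutoff) {
			delete(taskCache, arn) // deleting while ranging is safe in Go
		}
	}
}
```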
PTAL
lock := sync.Mutex{} // lock mediates access to results
// Returns (input, done) channels. Service ARNs given to input are batched and details are fetched,
// with full ecsService objects being put into the cache. Closes done when finished.
func (c ecsClient) describeServices() (chan<- string, <-chan bool) { |
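A self-contained sketch of the channel shape being reviewed here: ARNs pushed into the returned input channel are accumulated into batches of up to ten and described as each batch fills, with `done` closed once the input is drained. The real code fires each batch in its own goroutine and guards shared results with the mutex shown above; this sketch keeps it sequential for brevity:

```go
// Sketch of the batching pipeline: callers push service ARNs into the returned
// input channel, batches of up to maxServices are described as they fill, and
// done is closed when everything has been processed.
func describeServicesBatched(describeBatch func([]string)) (chan<- string, <-chan bool) {
	const maxServices = 10 // DescribeServices accepts at most 10 services per call

	input := make(chan string)
	done := make(chan bool)

	go func() {
		batch := make([]string, 0, maxServices)
		for arn := range input {
			batch = append(batch, arn)
			if len(batch) == maxServices {
				describeBatch(batch)
				batch = make([]string, 0, maxServices)
			}
		}
		if len(batch) > 0 {
			describeBatch(batch) // final partial batch
		}
		close(done)
	}()

	return input, done
}
```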

func newECSService(service *ecs.Service) ecsService {
now := time.Now()
deploymentIDs := make([]string, 0, len(service.Deployments)) |
for _, failure := range resp.Failures {
	log.Warnf("Failed to describe ECS service %s, ECS service report may be incomplete: %s", failure.Arn, failure.Reason)
}
describePage := func(arns []string) { |
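For reference, the shape of a single DescribeServices page against aws-sdk-go, including the per-service Failures handling warned about above; the function name and the idea of passing in a prebuilt `*ecs.ECS` client are assumptions for the sketch:

```go
import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ecs"
	log "github.com/sirupsen/logrus"
)

// describeServicesPage fetches details for up to 10 service ARNs in one call,
// logging per-service failures instead of aborting the whole report.
func describeServicesPage(svc *ecs.ECS, cluster string, arns []string) ([]*ecs.Service, error) {
	resp, err := svc.DescribeServices(&ecs.DescribeServicesInput{
		Cluster:  aws.String(cluster),
		Services: aws.StringSlice(arns),
	})
	if err != nil {
		return nil, err
	}
	for _, failure := range resp.Failures {
		log.Warnf("Failed to describe ECS service %s, report may be incomplete: %s",
			aws.StringValue(failure.Arn), aws.StringValue(failure.Reason))
	}
	return resp.Services, nil
}
```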
I gave it another read. I asked some questions in my first comments that remain unanswered.
Taking the socks shop as an (admittedly poor) example. Note it's unclear whether rate limits are per actual HTTP request, or per thing being described (i.e. whether a DescribeServices call with 10 services is 10x the 'cost' of a DescribeServices call with 1 service, or the same 'cost').

Old:

New:

In summary: we make 1/4 the calls. But note this is somewhat of an ideal scenario, since most services aren't present on most machines.
Force-pushed from 7d698d3 to 7a73532
rebased, resolved trivial conflict
I'm running a test with 3 socks shop clusters in one region, to see if I hit rate limits. Leaving it 12h, which should be enough to hit any longer-term limits.
Ran for about 15h with no rate limiting. Can I get a PTAL, modulo the missing unit tests?
Please rebase, #2094 has messed up your commits and it's hard to review.
Also, there's a pending question about size-based eviction vs time-based eviction (I would go for the first one or a combination of the two if you want to refine it).
Due to AWS API rate limits, we need to minimize API calls as much as possible.

Our stated objectives:

* for all displayed tasks and services to have up-to-date metadata
* for all tasks to map to services if able

My approach here:

* Tasks only contain immutable fields (that we care about). We cache tasks forever. We only DescribeTasks the first time we see a new task.
* We attempt to match tasks to services with what info we have. Any "referenced" service, i.e. a service with at least one matching task, needs to be updated to refresh changing data.
* In the event that a task doesn't match any of the (updated) services, i.e. a new service entirely needs to be found, we do a full list and detail of all services (we don't re-detail ones we just refreshed).
* To avoid unbounded memory usage, we evict tasks and services from the cache after 1 minute without use. This should be long enough for things like temporary failures to be glossed over.

This gives us exactly one call per task, and one call per referenced service per report, which is unavoidable to maintain fresh data. Expensive "describe all" service queries happen only when newly referenced services appear, which should be rare.

We could make a few very minor improvements here, such as trying to refresh unreferenced but known services before doing a list query, or getting details one by one when "describing all" and stopping when all matches have been found, but I believe these would produce very minor, if any, gains in the number of calls while having an unjustifiable effect on latency, since we wouldn't be able to do requests as concurrently.

Speaking of which, this change has a minor performance impact. Even though we're now making fewer calls, we can't make them as concurrently.

Old code:

    concurrently:
        describe tasks (1 call)
        sequentially:
            list services (1 call)
            describe services (N calls, concurrently)

Assuming full concurrency, total latency: 2 end-to-end calls.

New code (worst case):

    sequentially:
        describe tasks (1 call)
        describe services (N calls, concurrently)
        list services (1 call)
        describe services (N calls, concurrently)

Assuming full concurrency, total latency: 4 end-to-end calls.

In practical terms, I don't expect this to matter.
Not only does this allow us to re-use connections, but vitally it allows us to make use of the new task and service caching within the client object.
instead of our own hand-rolled size-unbound cache
This requires making All The Things public. Yuck.
PTAL.
(I've tested manually, and I'm just fixing some linter pedantry right now)
}
func newECSTask(task *ecs.Task) EcsTask {
	return EcsTask{
		TaskARN: *task.TaskArn,
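A sketch of the kind of conversion happening here: only fields that don't change over a task's lifetime are copied out of the SDK type, which is what makes caching tasks forever safe. The exact field set in the PR may differ:

```go
import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ecs"
)

// EcsTask holds only fields that never change for a given task, so a cached
// entry never goes stale and we never need to re-describe it.
type EcsTask struct {
	TaskARN           string
	CreatedAt         time.Time
	TaskDefinitionARN string
	StartedBy         string // deployment ID when the task was started by a service
}

func newECSTask(task *ecs.Task) EcsTask {
	return EcsTask{
		TaskARN:           aws.StringValue(task.TaskArn),
		CreatedAt:         aws.TimeValue(task.CreatedAt),
		TaskDefinitionARN: aws.StringValue(task.TaskDefinitionArn),
		StartedBy:         aws.StringValue(task.StartedBy), // empty for tasks not run by a service
	}
}
```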
}(batch)
count += len(batch)
calls++
batch = make([]string, 0, maxServices) |

go func() {
const maxServices = 10 // How many services we can put in one Describe command |
continue
}
task := taskRaw.(EcsTask)
if !strings.HasPrefix(task.StartedBy, servicePrefix) { |
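The matching idea behind this check, sketched with plain structs: a task started by a service carries an `ecs-svc`-prefixed deployment ID in StartedBy, so it can be matched against each service's known deployment IDs without any extra API call. The struct shapes here are simplified stand-ins:

```go
import "strings"

const servicePrefix = "ecs-svc" // prefix ECS puts on StartedBy for service-run tasks

type EcsTask struct {
	TaskARN   string
	StartedBy string
}

type EcsService struct {
	ServiceName   string
	DeploymentIDs []string
}

// matchTaskToService maps a task to a service by comparing the task's
// StartedBy deployment ID against each service's known deployment IDs,
// so no extra API call is needed for already-cached services.
func matchTaskToService(task EcsTask, services []EcsService) (string, bool) {
	if !strings.HasPrefix(task.StartedBy, servicePrefix) {
		return "", false // not started by a service; nothing to match
	}
	for _, service := range services {
		for _, deploymentID := range service.DeploymentIDs {
			if deploymentID == task.StartedBy {
				return service.ServiceName, true
			}
		}
	}
	return "", false
}
```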
return results, unmatched
}

func (c ecsClientImpl) ensureTasks(taskARNs []string) { |
for _, taskARN := range taskARNs {
// It's possible that tasks could still be missing from the cache if describe tasks failed.
// We'll just pretend they don't exist.
if taskRaw, err := c.taskCache.Get(taskARN); err == nil { |
@@ -287,6 +289,8 @@ func main() {

// AWS ECS
flag.BoolVar(&flags.probe.ecsEnabled, "probe.ecs", false, "Collect ecs-related attributes for containers on this node")
flag.IntVar(&flags.probe.ecsCacheSize, "probe.ecs.cache.size", 1024*1024, "Max size of cached info for each ECS cluster")
flag.DurationVar(&flags.probe.ecsCacheExpiry, "probe.ecs.cache.expiry", time.Hour, "How long to keep cached ECS info") |
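These two flags describe a cache bounded both by entry count and by age; a sketch of how they might feed a cache library such as github.com/bluele/gcache (the exact wiring and library choice here are assumptions):

```go
import (
	"time"

	"github.com/bluele/gcache"
)

// newTaskCache builds an LRU cache bounded both by entry count and by age,
// mirroring the probe.ecs.cache.size and probe.ecs.cache.expiry flags.
func newTaskCache(cacheSize int, cacheExpiry time.Duration) gcache.Cache {
	return gcache.New(cacheSize).
		LRU().
		Expiration(cacheExpiry).
		Build()
}
```

Lookups come back as (interface{}, error), with a miss reported as an error, which lines up with the c.taskCache.Get(taskARN) pattern quoted earlier.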
Thanks for changing the cache and for the tests. How and for how long did you test this? (Not that I would have been able to make it simpler, but the code is quite involved)
3 clusters, right? With how many instances each?
PTAL. Yes, 3 clusters, each with the defaults for the socks shop demo, which is 3 instances and about 10 or so services with one task per service.
Note, the code has changed since then, but not in ways that matter to the rate-limiting. And I've since re-tested that it still works, but haven't repeated the 15h test.
servicesRefreshed[serviceName] = true
}
close(toDescribe)
<-done |
by removing the ability to stream results between them, since this is such a minor optimization and greatly complicates the code.
PTAL
calls++
}
// split into batches
batches := make([][]string, 0, len(services)/maxServices+1) |
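The batch-splitting quoted here, as a standalone sketch; the `+1` capacity just avoids a reallocation when the count isn't an exact multiple of ten:

```go
const maxServices = 10 // DescribeServices allows at most 10 services per request

// splitIntoBatches breaks a list of service ARNs into chunks of at most
// maxServices, each destined for one DescribeServices call.
func splitIntoBatches(services []string) [][]string {
	batches := make([][]string, 0, len(services)/maxServices+1)
	for len(services) > maxServices {
		batches = append(batches, services[:maxServices])
		services = services[maxServices:]
	}
	if len(services) > 0 {
		batches = append(batches, services)
	}
	return batches
}
```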
LGTM. That last comment is only a recommendation.
Fixes #2050

Due to AWS API rate limits, we need to minimize API calls as much as possible.

Our stated objectives:

* for all displayed tasks and services to have up-to-date metadata
* for all tasks to map to services if able

My approach here:

* Tasks only contain immutable fields (that we care about). We cache tasks forever. We only DescribeTasks the first time we see a new task.
* We attempt to match tasks to services with what info we have. Any "referenced" service, i.e. a service with at least one matching task, needs to be updated to refresh changing data.
* In the event that a task doesn't match any of the (updated) services, i.e. a new service entirely needs to be found, we do a full list and detail of all services (we don't re-detail ones we just refreshed).
* To avoid unbounded memory usage, we evict tasks and services from the cache after 1 minute without use. This should be long enough for things like temporary failures to be glossed over.

This gives us exactly one call per task, and one call per referenced service per report, which is unavoidable to maintain fresh data. Expensive "describe all" service queries happen only when newly referenced services appear, which should be rare.

We could make a few very minor improvements here, such as trying to refresh unreferenced but known services before doing a list query, or getting details one by one when "describing all" and stopping when all matches have been found, but I believe these would produce very minor, if any, gains in the number of calls while having an unjustifiable effect on latency, since we wouldn't be able to do requests as concurrently.

Speaking of which, this change has a minor performance impact. Even though we're now making fewer calls, we can't make them as concurrently.

Old code:

    concurrently:
        describe tasks (1 call)
        sequentially:
            list services (1 call)
            describe services (N calls, concurrently)

Assuming full concurrency, total latency: 2 end-to-end calls.

New code (worst case):

    sequentially:
        describe tasks (1 call)
        describe services (N calls, concurrently)
        list services (1 call)
        describe services (N calls, concurrently)

Assuming full concurrency, total latency: 4 end-to-end calls.

In practical terms, I don't expect this to matter.
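A sketch of the worst-case sequencing described above, with stubs standing in for the AWS requests; the point is that each sequential stage adds roughly one round trip of latency even when the calls inside a stage run concurrently (function names here are illustrative):

```go
import "sync"

func describeTasks(arns []string) {} // stands in for one DescribeTasks call

func listServices() []string { return nil } // stands in for paging through ListServices

// describeServices fans one DescribeServices call out per batch; a stage of N
// batched calls still costs roughly one round trip of latency.
func describeServices(batches [][]string) {
	var wg sync.WaitGroup
	for _, batch := range batches {
		wg.Add(1)
		go func(batch []string) {
			defer wg.Done()
			_ = batch // one DescribeServices call per batch would go here
		}(batch)
	}
	wg.Wait()
}

// Worst case for the new code: four sequential stages (~4 round trips), versus
// two for the old code, because the cheap cached path is tried before falling
// back to a full list + describe.
func worstCaseReport(newTaskARNs []string, referencedServiceBatches [][]string) {
	describeTasks(newTaskARNs)                 // stage 1: only tasks we've never seen
	describeServices(referencedServiceBatches) // stage 2: refresh referenced services
	remaining := listServices()                // stage 3: full list, only when tasks are unmatched
	describeServices([][]string{remaining})    // stage 4: describe the newly found services
}
```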