
ECS reporter: Minimize API calls by caching task and service data #2065

Merged: 19 commits from mike/ecs/caching into master on Jan 24, 2017

Conversation

@ekimekim (Contributor) commented Dec 7, 2016

Fixes #2050

Due to AWS API rate limits, we need to minimize API calls as much as
possible.

Our stated objectives:

  • for all displayed tasks and services to have up-to-date metadata
  • for all tasks to map to services if able

My approach here:

  • Tasks only contain immutable fields (that we care about). We cache tasks forever.
    We only DescribeTasks the first time we see a new task.
  • We attempt to match tasks to services with the info we already have. Any "referenced" service,
    i.e. a service with at least one matching task, needs to be refreshed, since its data changes over time.
  • If a task doesn't match any of the (freshly updated) services, i.e. an entirely new service
    needs to be found, we do a full list and detail of all services (without re-detailing the ones we just refreshed).
  • To avoid unbounded memory usage, we evict tasks and services from the cache after 1 minute without use.
    This should be long enough for things like temporary failures to be glossed over.

This gives us exactly one call per task, and one call per referenced service per report, which is the minimum needed to keep the data fresh. Expensive "describe all" service queries happen only when newly-referenced services appear, which should be rare.
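Sketched in Go, the per-report flow looks roughly like this. All names and helper bodies here are hypothetical illustrations, not the PR's actual code (the real matching uses task metadata such as StartedBy against service deployment IDs):

```go
package main

type ecsTask struct {
	ServiceName string // derived from immutable task fields, so cached forever
}

type ecsClient struct {
	taskCache    map[string]ecsTask // task ARN -> task, never re-fetched
	serviceCache map[string]bool    // service name -> known service
}

// Stand-ins for the AWS API calls; the real code batches and parallelizes these.
func (c *ecsClient) describeTasks(arns []string)                     {}
func (c *ecsClient) refreshServices(names map[string]bool)           {}
func (c *ecsClient) listAndDescribeAllServices(skip map[string]bool) {}

func (c *ecsClient) getInfo(taskARNs []string) {
	// 1. DescribeTasks only for ARNs we have never seen; cached task data
	// is immutable, so it never needs refreshing.
	var unseen []string
	for _, arn := range taskARNs {
		if _, ok := c.taskCache[arn]; !ok {
			unseen = append(unseen, arn)
		}
	}
	c.describeTasks(unseen)

	// 2. Match tasks against cached services. Every "referenced" service
	// must be refreshed, because its deployment data changes over time.
	referenced := map[string]bool{}
	unmatched := false
	for _, arn := range taskARNs {
		if name := c.taskCache[arn].ServiceName; c.serviceCache[name] {
			referenced[name] = true
		} else {
			unmatched = true
		}
	}
	c.refreshServices(referenced)

	// 3. Only when some task references an unknown service do we pay for
	// the expensive full list+describe, skipping services just refreshed.
	if unmatched {
		c.listAndDescribeAllServices(referenced)
	}
}

func main() {}
```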

We could make a few very minor improvements here, such as trying to refresh unreferenced but known services before doing a list query, or fetching details one by one when "describing all" and stopping once all matches are found. But I believe these would yield very minor gains, if any, in the number of calls, while having an unjustifiable effect on latency, since we would no longer be able to issue requests as concurrently.

Speaking of which, this change has a minor performance impact: even though we now make fewer calls, we can't make them as concurrently.

Old code:

concurrently:
	describe tasks (1 call)
	sequentially:
		list services (1 call)
		describe services (N calls concurrently)

Assuming full concurrency, total latency: 2 end-to-end calls

New code (worst case):

sequentially:
	describe tasks (1 call)
	describe services (N calls concurrently)
	list services (1 call)
	describe services (N calls concurrently)

Assuming full concurrency, total latency: 4 end-to-end calls

In practical terms, I don't expect this to matter.
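As an aside, each "describe services (N calls concurrently)" step above batches service ARNs into DescribeServices calls of at most 10 services each (the per-call limit, as the PR's `maxServices` constant notes) and issues the batches in parallel. A minimal aws-sdk-go sketch of that pattern, not the PR's actual code (cluster name is illustrative):

```go
package main

import (
	"log"
	"sync"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ecs"
)

const maxServices = 10 // DescribeServices accepts at most 10 services per call

func describeServicesConcurrently(svc *ecs.ECS, cluster string, arns []string) {
	var wg sync.WaitGroup
	for start := 0; start < len(arns); start += maxServices {
		end := start + maxServices
		if end > len(arns) {
			end = len(arns)
		}
		wg.Add(1)
		go func(batch []string) {
			defer wg.Done()
			resp, err := svc.DescribeServices(&ecs.DescribeServicesInput{
				Cluster:  aws.String(cluster),
				Services: aws.StringSlice(batch),
			})
			if err != nil {
				log.Printf("describe services failed: %v", err)
				return
			}
			_ = resp // real code would cache resp.Services and log resp.Failures
		}(arns[start:end])
	}
	wg.Wait()
}

func main() {
	svc := ecs.New(session.Must(session.NewSession()))
	describeServicesConcurrently(svc, "my-cluster", nil /* service ARNs */)
}
```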

@ekimekim (Contributor, Author)

Tested, added a ton of debug logging, and fixed a bunch of bugs.
I can confirm we make far fewer calls now.

In this example system, we have one service with 3 tasks across 3 machines, so one task on our example machine.
This is what the first report on startup looks like: (many extra logs elided)

<probe> DEBU: 2016/12/10 02:09:06.235489 Creating new ECS client
<probe> DEBU: 2016/12/10 02:09:06.324142 Describing 1 ECS tasks
<probe> DEBU: 2016/12/10 02:09:06.367489 Refreshing ECS services
<probe> DEBU: 2016/12/10 02:09:06.367542 Described 0 services in 0 calls
<probe> DEBU: 2016/12/10 02:09:06.367569 After refreshing services, 1 tasks unmatched
<probe> DEBU: 2016/12/10 02:09:06.367574 Listing ECS services
<probe> DEBU: 2016/12/10 02:09:06.376347 Listed 1 services
<probe> DEBU: 2016/12/10 02:09:06.376414 Described 1 services in 1 calls
<probe> DEBU: 2016/12/10 02:09:06.387699 Got info from ECS: 1 tasks, 1 services

for a total of 3 calls (1 DescribeTasks, 1 ListServices, 1 DescribeServices)

In subsequent reports:

<probe> DEBU: 2016/12/10 02:09:07.162438 Refreshing ECS services
<probe> DEBU: 2016/12/10 02:09:07.162537 Described 1 services in 1 calls
<probe> DEBU: 2016/12/10 02:09:07.178871 After refreshing services, 0 tasks unmatched
<probe> DEBU: 2016/12/10 02:09:07.179110 Got info from ECS: 1 tasks, 1 services

for a total of 1 call (DescribeServices)

@2opremio self-assigned this Dec 12, 2016
@2opremio (Contributor)

> To avoid unbounded memory usage, we evict tasks and services from the cache after 1 minute without use.

Shouldn't this be size-bound instead of time-bound? Also, 1 minute seems very conservative (1 minute in scheduler time is very short).

@2opremio (Contributor)

> such as trying to refresh unreferenced but known services before doing a list query

Would this fix #2085? (i.e. what does "known" mean here? Previously referenced?)

@2opremio (Contributor)

> In practical terms, I don't expect this to matter.

Neither do I.

@2opremio (Contributor)

Can you make a rough estimate of how many clusters/container instances/services could be run with these improvements?

Did you test this in AWS with a few clusters?

Also, could you please add some unit tests?


@ekimekim (Contributor, Author)

PTAL


@2opremio (Contributor)

I gave it another read. I asked some questions in my first comments that remain unanswered.

@ekimekim (Contributor, Author)

> Can you make a rough estimate of how many clusters/container instances/services could be run with these improvements?

Taking socks shop as an (admittedly poor) example:
Socks shop has 14 services and 1 task per service, across 3 machines.
Let's take the average of 14/3 = 4.6 tasks per machine and round up to 5.

Note that it's unclear whether rate limits apply per actual HTTP request or per thing being described (i.e. whether a describe services call with 10 services is 10x the 'cost' of a describe services call with 1 service, or the same 'cost').

Old:
for each machine, list and describe all services, and describe all tasks on that machine
= 3 machines * (1 list services + 14 describe services + 5 describe tasks)
or 3 machines * (1 list services + 2 describe service calls + 1 describe tasks call)
= 60 list or describe operations
or 12 list or describe calls

New:
first report is as per old. for subsequent reports:
for each machine, describe all services on that machine
= 3 machines * 5 describe services
or 3 machines * 1 describe service call
= 15 describe operations
or 3 describe calls

In summary: we make 1/4 of the calls. But note this is somewhat of an ideal scenario, since most services aren't present on most machines.
Still, I'd expect we'll always save at least half of all calls, just by virtue of caching all tasks.
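To generalize the steady-state arithmetic, here's a hedged back-of-envelope helper (numbers illustrative; it assumes the 10-services-per-DescribeServices batch limit noted elsewhere in this PR):

```go
// With all tasks cached, each machine pays one DescribeServices call per
// batch of up to 10 referenced services, per report.
package main

import "fmt"

func steadyStateCallsPerReport(machines, servicesPerMachine int) int {
	batches := (servicesPerMachine + 9) / 10 // ceil(services / 10)
	return machines * batches
}

func main() {
	// Socks shop example: 3 machines, ~5 referenced services each.
	fmt.Println(steadyStateCallsPerReport(3, 5)) // prints 3
}
```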

@ekimekim (Contributor, Author)

Rebased, resolved a trivial conflict.

@ekimekim (Contributor, Author)

I'm running a test with 3 socks shop clusters in one region, to see if I hit rate limits. Leaving it for 12h, which should be enough to hit any longer-term limits.

@ekimekim (Contributor, Author)

Ran for about 15h with no rate limiting.

Can I get a PTAL, modulo the missing unit tests?

@2opremio (Contributor)

Please rebase, #2094 has messed up your commits and it's hard to review.

@2opremio (Contributor)

Also, there's a pending question about size-based vs time-based eviction (I would go for the former, or a combination of the two if you want to refine it).

@ekimekim (Contributor, Author)

PTAL.
I now use gcache with an expiring LRU, with user-controllable settings (default 1MiB, 1 hour). It doesn't "refresh" the expiry time on use, so there'll be a spike of requests every hour, but in practice this shouldn't matter.
I've added tests that exercise the reporter code by mocking the EcsClient. The client code has no coverage right now; I think this is acceptable given the complexity of mocking out the AWS API.
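For context, a minimal sketch of building such an expiring LRU with bluele/gcache, using the defaults mentioned above (size 1024*1024, 1-hour expiry, configurable via the probe.ecs.cache.size / probe.ecs.cache.expiry flags added in this PR); keys and values here are illustrative:

```go
package main

import (
	"fmt"
	"time"

	"github.com/bluele/gcache"
)

func main() {
	// Expiring LRU: entries are evicted when the cache is full (LRU order)
	// or once they are older than the expiry, whichever comes first. Get
	// does not refresh the expiry, matching the behavior described above.
	taskCache := gcache.New(1024 * 1024).
		LRU().
		Expiration(time.Hour).
		Build()

	taskCache.Set("arn:aws:ecs:...:task/example", "task details") // hypothetical key
	if v, err := taskCache.Get("arn:aws:ecs:...:task/example"); err == nil {
		fmt.Println("cache hit:", v)
	} else {
		// gcache returns an error on miss or expiry; the reporter treats a
		// miss as "task unknown" and falls back to DescribeTasks.
		fmt.Println("cache miss")
	}
}
```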

@ekimekim (Contributor, Author)

(I've tested manually, and I'm just fixing some linter pedantry right now)


@2opremio (Contributor)

Thanks for changing the cache and for the tests. How, and for how long, did you test this? (Not that I would have been able to make it simpler, but the code is quite involved.)

@2opremio (Contributor)

> Ran for about 15h with no rate limiting.

3 clusters, right? With how many instances each?

@ekimekim (Contributor, Author)

PTAL

Yes, 3 clusters, each with the defaults for the socks shop demo: 3 instances and about 10 services with one task per service.

@ekimekim (Contributor, Author)

Note: the code has changed since then, but not in ways that affect the rate limiting. I've since re-tested that it still works, but haven't repeated the 15h test.

@ekimekim (Contributor, Author)

PTAL

@2opremio (Contributor) commented Jan 24, 2017

LGTM. That last comment is only a recommendation.

@ekimekim merged commit dee274e into master on Jan 24, 2017
@ekimekim deleted the mike/ecs/caching branch on January 24, 2017