
ECS reporter: Minimize API calls by caching task and service data #2065

Merged: 19 commits from mike/ecs/caching into master on Jan 24, 2017

Conversation

@ekimekim (Contributor) commented Dec 7, 2016

Fixes #2050

Due to AWS API rate limits, we need to minimize API calls as much as
possible.

Our stated objectives:

  • for all displayed tasks and services to have up-to-date metadata
  • for all tasks to map to services if able

My approach here:

  • Tasks only contain immutable fields (that we care about). We cache tasks forever.
    We only DescribeTasks the first time we see a new task.
  • We attempt to match tasks to services with the info we already have. Any "referenced" service,
    i.e. a service with at least one matching task, needs to be refreshed, since its data changes over time.
  • If a task doesn't match any of the (freshly updated) services, i.e. an entirely new service
    needs to be found, we do a full list and detail of all services (without re-detailing the ones we just refreshed).
  • To avoid unbounded memory usage, we evict tasks and services from the cache after 1 minute without use.
    This should be long enough for things like temporary failures to be glossed over.

This gives us exactly one call per task, and one call per referenced service per report, which is the minimum needed to keep the data fresh. Expensive "describe all" service queries happen only when newly-referenced services appear, which should be rare.
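Sketched in Go, the per-report flow looks roughly like this. All names and helper bodies here are hypothetical illustrations, not the PR's actual code (the real matching uses task metadata such as StartedBy against service deployment IDs):

```go
package main

type ecsTask struct {
	ServiceName string // derived from immutable task fields, so cached forever
}

type ecsClient struct {
	taskCache    map[string]ecsTask // task ARN -> task, never re-fetched
	serviceCache map[string]bool    // service name -> known service
}

// Stand-ins for the AWS API calls; the real code batches and parallelizes these.
func (c *ecsClient) describeTasks(arns []string)                     {}
func (c *ecsClient) refreshServices(names map[string]bool)           {}
func (c *ecsClient) listAndDescribeAllServices(skip map[string]bool) {}

func (c *ecsClient) getInfo(taskARNs []string) {
	// 1. DescribeTasks only for ARNs we have never seen; cached task data
	// is immutable, so it never needs refreshing.
	var unseen []string
	for _, arn := range taskARNs {
		if _, ok := c.taskCache[arn]; !ok {
			unseen = append(unseen, arn)
		}
	}
	c.describeTasks(unseen)

	// 2. Match tasks against cached services. Every "referenced" service
	// must be refreshed, because its deployment data changes over time.
	referenced := map[string]bool{}
	unmatched := false
	for _, arn := range taskARNs {
		if name := c.taskCache[arn].ServiceName; c.serviceCache[name] {
			referenced[name] = true
		} else {
			unmatched = true
		}
	}
	c.refreshServices(referenced)

	// 3. Only when some task references an unknown service do we pay for
	// the expensive full list+describe, skipping services just refreshed.
	if unmatched {
		c.listAndDescribeAllServices(referenced)
	}
}

func main() {}
```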

We could make a few very minor improvements here, such as trying to refresh unreferenced but known services before doing a list query, or fetching details one by one when "describing all" and stopping once all matches are found. But I believe these would yield very minor gains, if any, in the number of calls, while having an unjustifiable effect on latency, since we would no longer be able to issue requests as concurrently.

Speaking of which, this change has a minor performance impact: even though we now make fewer calls, we can't make them as concurrently.

Old code:

concurrently:
	describe tasks (1 call)
	sequentially:
		list services (1 call)
		describe services (N calls concurrently)

Assuming full concurrency, total latency: 2 end-to-end calls

New code (worst case):

sequentially:
	describe tasks (1 call)
	describe services (N calls concurrently)
	list services (1 call)
	describe services (N calls concurrently)

Assuming full concurrency, total latency: 4 end-to-end calls

In practical terms, I don't expect this to matter.
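As an aside, each "describe services (N calls concurrently)" step above batches service ARNs into DescribeServices calls of at most 10 services each (the per-call limit, as the PR's `maxServices` constant notes) and issues the batches in parallel. A minimal aws-sdk-go sketch of that pattern, not the PR's actual code (cluster name is illustrative):

```go
package main

import (
	"log"
	"sync"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ecs"
)

const maxServices = 10 // DescribeServices accepts at most 10 services per call

func describeServicesConcurrently(svc *ecs.ECS, cluster string, arns []string) {
	var wg sync.WaitGroup
	for start := 0; start < len(arns); start += maxServices {
		end := start + maxServices
		if end > len(arns) {
			end = len(arns)
		}
		wg.Add(1)
		go func(batch []string) {
			defer wg.Done()
			resp, err := svc.DescribeServices(&ecs.DescribeServicesInput{
				Cluster:  aws.String(cluster),
				Services: aws.StringSlice(batch),
			})
			if err != nil {
				log.Printf("describe services failed: %v", err)
				return
			}
			_ = resp // real code would cache resp.Services and log resp.Failures
		}(arns[start:end])
	}
	wg.Wait()
}

func main() {
	svc := ecs.New(session.Must(session.NewSession()))
	describeServicesConcurrently(svc, "my-cluster", nil /* service ARNs */)
}
```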

@ekimekim (Contributor, Author)

Tested, added a ton of debug logging, and fixed a bunch of bugs.
I can confirm we make far fewer calls now.

In this example system, we have one service with 3 tasks across 3 machines, so one task on our example machine.
This is what the first report on startup looks like: (many extra logs elided)

<probe> DEBU: 2016/12/10 02:09:06.235489 Creating new ECS client
<probe> DEBU: 2016/12/10 02:09:06.324142 Describing 1 ECS tasks
<probe> DEBU: 2016/12/10 02:09:06.367489 Refreshing ECS services
<probe> DEBU: 2016/12/10 02:09:06.367542 Described 0 services in 0 calls
<probe> DEBU: 2016/12/10 02:09:06.367569 After refreshing services, 1 tasks unmatched
<probe> DEBU: 2016/12/10 02:09:06.367574 Listing ECS services
<probe> DEBU: 2016/12/10 02:09:06.376347 Listed 1 services
<probe> DEBU: 2016/12/10 02:09:06.376414 Described 1 services in 1 calls
<probe> DEBU: 2016/12/10 02:09:06.387699 Got info from ECS: 1 tasks, 1 services

for a total of 3 calls (1 DescribeTasks, 1 ListServices, 1 DescribeServices)

In subsequent reports:

<probe> DEBU: 2016/12/10 02:09:07.162438 Refreshing ECS services
<probe> DEBU: 2016/12/10 02:09:07.162537 Described 1 services in 1 calls
<probe> DEBU: 2016/12/10 02:09:07.178871 After refreshing services, 0 tasks unmatched
<probe> DEBU: 2016/12/10 02:09:07.179110 Got info from ECS: 1 tasks, 1 services

for a total of 1 call (DescribeServices)

@2opremio self-assigned this Dec 12, 2016
@2opremio (Contributor)

> To avoid unbounded memory usage, we evict tasks and services from the cache after 1 minute without use.

Shouldn't this be size-bound instead of time-bound? Also, 1 minute seems very conservative (1 minute in scheduler time is very short).

@2opremio (Contributor)

> such as trying to refresh unreferenced but known services before doing a list query

Would this fix #2085? (i.e. what does "known" mean here? Previously referenced?)

@2opremio (Contributor)

> In practical terms, I don't expect this to matter.

Neither do I.

@2opremio (Contributor)

Can you make a rough estimate of how many clusters/container instances/services could be run with these improvements?

Did you test this in AWS with a few clusters?

Also, could you please add some unit tests?


@ekimekim (Contributor, Author)

PTAL


@2opremio (Contributor)

I gave it another read. I asked some questions in my first comments that remain unanswered.

@ekimekim (Contributor, Author)

> Can you make a rough estimate of how many clusters/container instances/services could be run with these improvements?

Taking socks shop as an (admittedly poor) example:
Socks shop has 14 services and 1 task per service, across 3 machines.
Let's take the average of 14/3 = 4.6 tasks per machine and round up to 5.

Note that it's unclear whether rate limits apply per actual HTTP request or per thing being described (i.e. whether a describe services call with 10 services is 10x the 'cost' of a describe services call with 1 service, or the same 'cost').

Old:
for each machine, list and describe all services, and describe all tasks on that machine
= 3 machines * (1 list services + 14 describe services + 5 describe tasks)
or 3 machines * (1 list services + 2 describe service calls + 1 describe tasks call)
= 60 list or describe operations
or 12 list or describe calls

New:
first report is as per old. for subsequent reports:
for each machine, describe all services on that machine
= 3 machines * 5 describe services
or 3 machines * 1 describe service call
= 15 describe operations
or 3 describe calls

In summary: we make 1/4 of the calls. But note this is somewhat of an ideal scenario, since most services aren't present on most machines.
Still, I'd expect we'll always save at least half of all calls, just by virtue of caching all tasks.
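To generalize the steady-state arithmetic, here's a hedged back-of-envelope helper (numbers illustrative; it assumes the 10-services-per-DescribeServices batch limit noted elsewhere in this PR):

```go
// With all tasks cached, each machine pays one DescribeServices call per
// batch of up to 10 referenced services, per report.
package main

import "fmt"

func steadyStateCallsPerReport(machines, servicesPerMachine int) int {
	batches := (servicesPerMachine + 9) / 10 // ceil(services / 10)
	return machines * batches
}

func main() {
	// Socks shop example: 3 machines, ~5 referenced services each.
	fmt.Println(steadyStateCallsPerReport(3, 5)) // prints 3
}
```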

@ekimekim (Contributor, Author)

Rebased, resolved a trivial conflict.

@ekimekim (Contributor, Author)

I'm running a test with 3 socks shop clusters in one region, to see if I hit rate limits. Leaving it for 12h, which should be enough to hit any longer-term limits.

@ekimekim (Contributor, Author)

Ran for about 15h with no rate limiting.

Can I get a PTAL, modulo the missing unit tests?

@2opremio (Contributor)

Please rebase, #2094 has messed up your commits and it's hard to review.

@2opremio (Contributor)

Also, there's a pending question about size-based vs time-based eviction (I would go for the former, or a combination of the two if you want to refine it).

@ekimekim (Contributor, Author)

PTAL.
I now use gcache with an expiring LRU, with user-controllable settings (default 1MiB, 1 hour). It doesn't "refresh" the expiry time on use, so there'll be a spike of requests every hour, but in practice this shouldn't matter.
I've added tests that exercise the reporter code by mocking the EcsClient. The client code has no coverage right now; I think this is acceptable given the complexity of mocking out the AWS API.
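For context, a minimal sketch of building such an expiring LRU with bluele/gcache, using the defaults mentioned above (size 1024*1024, 1-hour expiry, configurable via the probe.ecs.cache.size / probe.ecs.cache.expiry flags added in this PR); keys and values here are illustrative:

```go
package main

import (
	"fmt"
	"time"

	"github.com/bluele/gcache"
)

func main() {
	// Expiring LRU: entries are evicted when the cache is full (LRU order)
	// or once they are older than the expiry, whichever comes first. Get
	// does not refresh the expiry, matching the behavior described above.
	taskCache := gcache.New(1024 * 1024).
		LRU().
		Expiration(time.Hour).
		Build()

	taskCache.Set("arn:aws:ecs:...:task/example", "task details") // hypothetical key
	if v, err := taskCache.Get("arn:aws:ecs:...:task/example"); err == nil {
		fmt.Println("cache hit:", v)
	} else {
		// gcache returns an error on miss or expiry; the reporter treats a
		// miss as "task unknown" and falls back to DescribeTasks.
		fmt.Println("cache miss")
	}
}
```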

@ekimekim (Contributor, Author)

(I've tested manually, and I'm just fixing some linter pedantry right now)


@2opremio (Contributor)

Thanks for changing the cache and for the tests. How, and for how long, did you test this? (Not that I would have been able to make it simpler, but the code is quite involved.)

@2opremio (Contributor)

> Ran for about 15h with no rate limiting.

3 clusters, right? With how many instances each?

@ekimekim (Contributor, Author)

PTAL

Yes, 3 clusters, each with the defaults for the socks shop demo: 3 instances and about 10 services with one task per service.

@ekimekim (Contributor, Author)

Note: the code has changed since then, but not in ways that affect the rate limiting. I've since re-tested that it still works, but haven't repeated the 15h test.

@ekimekim (Contributor, Author)

PTAL

@2opremio (Contributor) commented Jan 24, 2017

LGTM. That last comment is only a recommendation.

@ekimekim merged commit dee274e into master on Jan 24, 2017
@ekimekim deleted the mike/ecs/caching branch on January 24, 2017