Node level fault isolation #862
@esrishi This is a great question. If you want fault isolation at that fine a granularity, Hystrix currently forces you to have a HystrixCommand per backend JVM (otherwise faults would not be isolated). This imposes a large overhead in terms of monitoring. For thread pools, you could group many commands into the same thread pool to reduce the number of thread pools. For instance, you could have 200 thread pools, each with 10 commands, 1 per JVM. Commands would fail independently, but could, in this case, get blocked by running out of thread pool space. A related open issue is #26, which would allow for a more reasonable model of a case like this. I will say that one reason it hasn't been done yet is that, for internal Netflix usage, we try very hard to avoid cases like this. Instead, we attempt to build stateless services such that routing to any node is equivalent. Of course, cases like yours are completely valid and worth solving; we just haven't put time into them because we don't see them in daily usage.
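To illustrate the trade-off mentioned above (commands fail independently, but a saturated shared pool rejects further work), here is a minimal, hypothetical sketch, not code from this thread: a command with a fallback, where the caller can check whether the response came from a pool rejection. The group key name is an assumption for illustration.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class BackendNodeCommand extends HystrixCommand<String> {

    public BackendNodeCommand() {
        // Hypothetical group key; in the shared-pool layout this would be per backend host.
        super(HystrixCommandGroupKey.Factory.asKey("backend-host-42"));
    }

    @Override
    protected String run() {
        // Placeholder for the real call to the backend JVM.
        return "value-from-backend";
    }

    @Override
    protected String getFallback() {
        // Returned when the call fails, times out, the circuit is open,
        // or the shared thread pool has no capacity left.
        return "fallback-value";
    }

    public static void main(String[] args) {
        BackendNodeCommand cmd = new BackendNodeCommand();
        String result = cmd.execute();
        if (cmd.isResponseRejected()) {
            // The shared pool (or semaphore) was full; only commands using this
            // pool are affected, pools for other hosts keep running.
            System.out.println("rejected by the shared pool, fell back to: " + result);
        } else {
            System.out.println("got: " + result);
        }
    }
}
```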
Hi @mattrjacobs, thanks for your response; the open issue #26 is exactly the situation I am working to solve. We have multiple clusters with sharded data. As much as we want it to be stateless, eventually we are bound to an endpoint where the end-user data lives (e.g. Solr or MySQL). Since I am new to Hystrix and still learning the ropes, could you elaborate on what you mean by "you could have 200 thread pools, each with 10 commands, 1 per JVM"? Here is a sample HystrixCommand initialization I have; it will lead to ~2000 PerNodeHystrixCommands. If I understand you right, is the shared thread pool approach something like this? class PerNodeHystrixCommand extends HystrixCommand {
Yes, so the principle is to share thread pools between commands. There are 3 levels of naming:

- command key (HystrixCommandKey)
- thread pool key (HystrixThreadPoolKey)
- group key (HystrixCommandGroupKey)

Generally, these are hierarchical, so in your case I would have:

- command key: one per backend JVM (~2000 of these), so each node fails independently
- thread pool key: one per backend host (~200 of these)
- group key: one per backend host

Note that, if left unset, the thread pool key defaults to the group key, so you can safely omit the thread pool key. If you use those settings, the 10 commands that talk to the same host will use the same thread pool (and have the same group name).
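To make the hierarchy concrete, here is a minimal sketch (hypothetical names, not code from this thread) of a per-node command whose command key identifies one backend JVM while the thread pool and group keys identify the host; the run() body is a placeholder for the actual Solr/MySQL call:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandKey;
import com.netflix.hystrix.HystrixThreadPoolKey;

public class PerNodeCommand extends HystrixCommand<String> {

    private final String host;
    private final int jvmPort;

    public PerNodeCommand(String host, int jvmPort) {
        super(Setter
                // group key: one per backend host (~200 of these)
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey(host))
                // command key: one per backend JVM (~2000 of these), so each
                // node gets its own circuit breaker and metrics
                .andCommandKey(HystrixCommandKey.Factory.asKey(host + ":" + jvmPort))
                // thread pool key: one per host, so the ~10 commands that talk to
                // the same host share a pool; this line could be omitted since it
                // defaults to the group key
                .andThreadPoolKey(HystrixThreadPoolKey.Factory.asKey(host)));
        this.host = host;
        this.jvmPort = jvmPort;
    }

    @Override
    protected String run() throws Exception {
        // Replace with the real client call (Solr, MySQL, etc.) to host:jvmPort.
        return "response-from-" + host + ":" + jvmPort;
    }
}
```

With keys set this way, circuit-breaker state and metrics are tracked per command key (per backend JVM), while the thread pool acts as a shared per-host bulkhead, matching the 200-pools / 10-commands-per-pool layout described above.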
Thanks, Matt, for clarifying the naming and hierarchy.
Unfortunately, there's not really an analytic way to arrive at an optimal sizing of thread pools. It's based upon the flow of traffic through commands and their distributions of interarrival times and execution times. Moreover, the calculations change for different machines/OSes in terms of number of cores and so on. In practice, we set our thread pool sizes empirically: start with a reasonable default, watch the actual concurrency the pool sees in production, and adjust from there.
There's more context on the problem here: #131, if you're interested. I will also say that on the high-volume system I work on at Netflix, we've never tuned a thread pool above 28, so that might give you a sense of absolute numbers (though of course our domains/system characteristics are likely very different).
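As a reference point for the configuration side, here is a hedged sketch of setting an explicit core size on a shared per-host pool; the helper name, key derivation, and the example size are assumptions for illustration, not recommendations from this thread:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixThreadPoolProperties;

public final class ThreadPoolSizing {

    // Builds a Setter shared by all commands that talk to one backend host.
    public static HystrixCommand.Setter perHostSetter(String host, int coreSize) {
        return HystrixCommand.Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey(host))
                .andThreadPoolPropertiesDefaults(
                        HystrixThreadPoolProperties.Setter()
                                // threads available to this host's commands; tune from
                                // observed concurrency in production
                                .withCoreSize(coreSize)
                                // -1 keeps the SynchronousQueue, so saturation surfaces
                                // immediately as rejections instead of queueing
                                .withMaxQueueSize(-1));
    }
}
```

A command constructed with, say, perHostSetter("solr-host-17", 15) would then share that 15-thread pool with every other command keyed to the same host.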
Thanks @mattrjacobs, I will try out a higher number of thread pools and see if that option pans out well for our deployment.
Hi,
I am new to Hystrix and trying to read up on the documentation and examples.
I have a back-end dependency with requests routed to ~200 hosts (×10 JVMs per host).
Each JVM holds unique data; using a lookup table, my service has to connect to any of the 2000 JVMs to respond to a client request.
I am looking for fault isolation at the backend-JVM level, and if I understand the documentation I have read so far, it would require my application to have ~2000 thread pools using HystrixCommand.
A single host has multiple instances running on it, so host-level isolation is not the right solution.
Maybe I am missing something; any suggestions would be great.
Thanks,
Rishi.
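Tying the lookup-table routing described above back to the per-node command sketched earlier in the thread, a hypothetical dispatcher might look like the following; Node and lookupNode() are assumed helpers, not part of Hystrix or this thread:

```java
// Hypothetical dispatcher building on the PerNodeCommand sketch shown earlier.
public class ShardedDispatcher {

    // Minimal holder for the routing result from the lookup table.
    static final class Node {
        final String host;
        final int jvmPort;
        Node(String host, int jvmPort) { this.host = host; this.jvmPort = jvmPort; }
    }

    public String respond(String userId) {
        Node node = lookupNode(userId);   // assumed shard lookup
        // Per-JVM command key, per-host thread pool: a slow or failing node only
        // trips its own circuit breaker and only saturates its host's pool.
        return new PerNodeCommand(node.host, node.jvmPort).execute();
    }

    private Node lookupNode(String userId) {
        // Placeholder for the real lookup-table implementation.
        throw new UnsupportedOperationException("wire up the lookup table here");
    }
}
```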