Node level fault isolation #862
@esrishi This is a great question. If you want fault isolation at that fine a granularity, Hystrix currently forces you to have a HystrixCommand per backend JVM (otherwise faults would not be isolated). This imposes a large overhead in terms of monitoring. For thread pools, you could group many commands into the same thread pool to reduce the number of thread pools. For instance, you could have 200 thread pools, each with 10 commands, 1 per JVM. Commands would fail independently, but could, in this case, get blocked by running out of thread pool space. A related open issue is #26, which would allow for a more reasonable model of a case like this. I will say that one reason it hasn't been done yet is that, for internal Netflix usage, we try very hard to avoid cases like this. Instead, we attempt to build stateless services such that routing to any node is equivalent. Of course, cases like yours are completely valid and worth solving; we just haven't put time into them because we don't see them in daily usage.
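To illustrate the trade-off mentioned above (commands fail independently, but a saturated shared pool rejects further work), here is a minimal, hypothetical sketch, not code from this thread: a command with a fallback, where the caller can check whether the response came from a pool rejection. The group key name is an assumption for illustration.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class BackendNodeCommand extends HystrixCommand<String> {

    public BackendNodeCommand() {
        // Hypothetical group key; in the shared-pool layout this would be per backend host.
        super(HystrixCommandGroupKey.Factory.asKey("backend-host-42"));
    }

    @Override
    protected String run() {
        // Placeholder for the real call to the backend JVM.
        return "value-from-backend";
    }

    @Override
    protected String getFallback() {
        // Returned when the call fails, times out, the circuit is open,
        // or the shared thread pool has no capacity left.
        return "fallback-value";
    }

    public static void main(String[] args) {
        BackendNodeCommand cmd = new BackendNodeCommand();
        String result = cmd.execute();
        if (cmd.isResponseRejected()) {
            // The shared pool (or semaphore) was full; only commands using this
            // pool are affected, pools for other hosts keep running.
            System.out.println("rejected by the shared pool, fell back to: " + result);
        } else {
            System.out.println("got: " + result);
        }
    }
}
```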
Hi @mattrjacobs, thanks for your response; the open issue #26 is exactly the situation I am working to solve. We have multiple clusters with sharded data. As much as we want it to be stateless, eventually we are bound to an endpoint where the end-user data lives (e.g. Solr or MySQL). Since I am new to Hystrix and still learning the ropes, could you elaborate on what you mean by "you could have 200 thread pools, each with 10 commands, 1 per JVM"? Here is a sample HystrixCommand initialization I have; it will lead to ~2000 PerNodeHystrixCommands. If I understand you right, is the shared thread pool approach something like this? class PerNodeHystrixCommand extends HystrixCommand {
Yes, so the principle is to share thread pools between commands. There are 3 levels of naming:

- command key (HystrixCommandKey)
- thread pool key (HystrixThreadPoolKey)
- group key (HystrixCommandGroupKey)

Generally, these are hierarchical, so in your case I would have:

- command key: one per backend JVM (~2000 of these), so each node fails independently
- thread pool key: one per backend host (~200 of these)
- group key: one per backend host

Note that, if left unset, the thread pool key defaults to the group key, so you can safely omit the thread pool key. If you use those settings, the 10 commands that talk to the same host will use the same thread pool (and have the same group name).
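To make the hierarchy concrete, here is a minimal sketch (hypothetical names, not code from this thread) of a per-node command whose command key identifies one backend JVM while the thread pool and group keys identify the host; the run() body is a placeholder for the actual Solr/MySQL call:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandKey;
import com.netflix.hystrix.HystrixThreadPoolKey;

public class PerNodeCommand extends HystrixCommand<String> {

    private final String host;
    private final int jvmPort;

    public PerNodeCommand(String host, int jvmPort) {
        super(Setter
                // group key: one per backend host (~200 of these)
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey(host))
                // command key: one per backend JVM (~2000 of these), so each
                // node gets its own circuit breaker and metrics
                .andCommandKey(HystrixCommandKey.Factory.asKey(host + ":" + jvmPort))
                // thread pool key: one per host, so the ~10 commands that talk to
                // the same host share a pool; this line could be omitted since it
                // defaults to the group key
                .andThreadPoolKey(HystrixThreadPoolKey.Factory.asKey(host)));
        this.host = host;
        this.jvmPort = jvmPort;
    }

    @Override
    protected String run() throws Exception {
        // Replace with the real client call (Solr, MySQL, etc.) to host:jvmPort.
        return "response-from-" + host + ":" + jvmPort;
    }
}
```

With keys set this way, circuit-breaker state and metrics are tracked per command key (per backend JVM), while the thread pool acts as a shared per-host bulkhead, matching the 200-pools / 10-commands-per-pool layout described above.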
Thanks, Matt, for clarifying the naming and hierarchy.
Unfortunately, there's not really an analytic way to arrive at an optimal sizing of thread pools. It's based upon the flow of traffic through commands and their distributions of interarrival times and execution times. Moreover, the calculations change for different machines/OSes in terms of number of cores and so on. In practice, we set our thread pool sizes empirically: start with a reasonable default, watch the actual concurrency the pool sees in production, and adjust from there.
There's more context on the problem here: #131, if you're interested. I will also say that on the high-volume system I work on at Netflix, we've never tuned a thread pool above 28, so that might give you a sense of absolute numbers (though of course our domains/system characteristics are likely very different).
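As a reference point for the configuration side, here is a hedged sketch of setting an explicit core size on a shared per-host pool; the helper name, key derivation, and the example size are assumptions for illustration, not recommendations from this thread:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixThreadPoolProperties;

public final class ThreadPoolSizing {

    // Builds a Setter shared by all commands that talk to one backend host.
    public static HystrixCommand.Setter perHostSetter(String host, int coreSize) {
        return HystrixCommand.Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey(host))
                .andThreadPoolPropertiesDefaults(
                        HystrixThreadPoolProperties.Setter()
                                // threads available to this host's commands; tune from
                                // observed concurrency in production
                                .withCoreSize(coreSize)
                                // -1 keeps the SynchronousQueue, so saturation surfaces
                                // immediately as rejections instead of queueing
                                .withMaxQueueSize(-1));
    }
}
```

A command constructed with, say, perHostSetter("solr-host-17", 15) would then share that 15-thread pool with every other command keyed to the same host.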
Thanks @mattrjacobs, I will try out a higher number of thread pools and see if that option pans out well for our deployment.
Hi,
I am new to Hystrix and trying to read up on the documentation and examples.
I have a back-end dependency with requests routed to ~200 hosts (×10 JVMs per host).
Each JVM holds unique data; using a lookup table, my service has to connect to any of the 2000 JVMs to respond to a client request.
I am looking for fault isolation at the backend-JVM level, and if I understand the documentation I have read so far, it would require my application to have ~2000 thread pools using HystrixCommand.
A single host has multiple instances running on it, so host-level isolation is not the right solution.
Maybe I am missing something; any suggestions would be great.
Thanks,
Rishi.
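Tying the lookup-table routing described above back to the per-node command sketched earlier in the thread, a hypothetical dispatcher might look like the following; Node and lookupNode() are assumed helpers, not part of Hystrix or this thread:

```java
// Hypothetical dispatcher building on the PerNodeCommand sketch shown earlier.
public class ShardedDispatcher {

    // Minimal holder for the routing result from the lookup table.
    static final class Node {
        final String host;
        final int jvmPort;
        Node(String host, int jvmPort) { this.host = host; this.jvmPort = jvmPort; }
    }

    public String respond(String userId) {
        Node node = lookupNode(userId);   // assumed shard lookup
        // Per-JVM command key, per-host thread pool: a slow or failing node only
        // trips its own circuit breaker and only saturates its host's pool.
        return new PerNodeCommand(node.host, node.jvmPort).execute();
    }

    private Node lookupNode(String userId) {
        // Placeholder for the real lookup-table implementation.
        throw new UnsupportedOperationException("wire up the lookup table here");
    }
}
```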