[FIXED] LeafNode's queue group load balancing and Sublist.NumInterest #5982
While writing the test, I needed to make sure that each server in the hub has registered interest for 2 queue subscribers from the same group. I noticed that `Sublist.NumInterest()` (which I was invoking from `Account.Interest()`) was returning 1, even after I knew that the propagation should have happened. It turns out that `NumInterest()` was returning the number of queue groups, not the number of queue subs in all those queue groups.

For the leafnode queue balancing issue, the code was favoring local/routed queue subscriptions, so in the described issue, the message would always go from HUB1->HUB2->LEAF2->QSub instead of HUB1->LEAF1->QSub.

We had another test that was a bit reversed: a HUB with LEAF1<->LEAF2 connecting to HUB, a qsub on both HUB and LEAF1, and requests originating from LEAF2. There we were expecting all responses to come from LEAF1 (instead of the responder on HUB). So I went with the following approach:

If the message originates from a client that connects to a server that has a connection from a remote LEAF, then we pick that LEAF the same as if it was a local client or routed server.

However, if the client connects to a server that has a leaf connection to another server, then we keep track of the sub but do not send to that one if we have local or routed qsubs.

This makes the 2 tests pass, solving the new test and maintaining the behavior for the old test.

Signed-off-by: Ivan Kozlovic <[email protected]>
@neilalexander I believe there was an issue with `Sublist.NumInterest` for queue subs, since it looked like it was simply counting the number of groups, not the total number of queue subscriptions. Let me know if I misunderstood the intent. @derekcollison Please review the PR description and see if the choice I made is ok.
You can review the first commit for the leafnode/sublist issues. The second is simply a bunch of missing `defer nc.Close()` and the like.
LGTM.
I notice looking back at #5918 that even before `NumInterest()` was added, the `Account.Interest()` function was still returning `len(res.psubs) + len(res.qsubs)`, so I think the bug is not new and I've just ported it over to the new code as-is.
That said, I think what you're proposing here makes sense, particularly if we're relying on the number of subscriptions to balance in this way.
Something else that's just occurred to me is that @derekcollison Don't know whether we want to cherry-pick in
@neilalexander let's pull those into 2.10.22 from main once this lands.
LGTM - Thanks @kozlovic
Includes the following:
- #5918
- #5982
- #5983 (although only the 1.22.8 upgrade, since 1.21.x is no longer receiving updates)

Signed-off-by: Neil Twigg <[email protected]>
There were multiple issues, but basically the fact that we did not store the routed subscriptions with the origin of the LEAF they came from made the server unable to differentiate those from "local" routed subscriptions, which in some cases (like a server restart and the resend of subscriptions) could lead to servers incorrectly sending subscription interest to leaf connections.

We are now storing the subscriptions that have an origin with that origin as part of the key. This makes it possible to differentiate "regular" routed subs from the ones on behalf of a leafnode.

An INFO boolean, `LNOCU`, is added to indicate support for origin in the `LS-` protocol, which is required to properly handle the removal. Therefore, if a route does not have `LNOCU`, the server will behave like an old server and store with the key that does not contain the origin, so that it can be removed when getting an `LS-` without the origin. Note that in the case of a mix of servers in the same cluster, some of the issues this PR is trying to fix will still be present (since the server will basically behave like a server without the fix).

Having different routed subs for leaf connections allows us to revisit the fix #5982 that was done for issue #5972, which was about a more fair queue distribution to a cluster of leaf connections. We always wanted to favor queue subscriptions of the cluster where the message is produced, and that fix possibly changed this behavior. With this current PR, the server can now know whether a remote queue sub is for a "local" queue sub there or on behalf of a leaf, and therefore will not favor that route compared to a leaf subscription that it may have directly attached.

Resolves #5972
Resolves #6148

Signed-off-by: Ivan Kozlovic <[email protected]>
There were multiple issues, but basically the fact that we did not store the routed subscriptions with the origin of the LEAF they came from made the server unable to differentiate those from "local" routed subscriptions, which in some cases (like a server restart and the resend of subscriptions) could lead to servers incorrectly sending subscription interest to leaf connections.

We are now storing the subscriptions with a sub type indicator and the origin (for leaf subscriptions) as part of the key. This makes it possible to differentiate "regular" routed subs from the ones on behalf of a leafnode.

An INFO boolean, `LNOCU`, is added to indicate support for origin in the `LS-` protocol, which is required to properly handle the removal. Therefore, if a route does not have `LNOCU`, the server will behave like an old server and store with the key that does not contain the origin, so that it can be removed when getting an `LS-` without the origin. Note that in the case of a mix of servers in the same cluster, some of the issues this PR is trying to fix will still be present (since the server will basically behave like a server without the fix).

Having different routed subs for leaf connections allows us to revisit the fix #5982 that was done for issue #5972, which was about a more fair queue distribution to a cluster of leaf connections. We always wanted to favor queue subscriptions of the cluster where the message is produced, and that fix possibly changed this behavior. With this current PR, the server can now know whether a remote queue sub is for a "local" queue sub there or on behalf of a leaf, and therefore will not favor that route compared to a leaf subscription that it may have directly attached.

Resolves #5972
Resolves #6148

Signed-off-by: Ivan Kozlovic <[email protected]>
While writing the test, I needed to make sure that each server in
the hub has registered interest for 2 queue subscribers from the
same group. I noticed that `Sublist.NumInterest()` (that I was
invoking from `Account.Interest()`) was returning 1, even after
I knew that the propagation should have happened. It turns out
that `NumInterest()` was returning the number of queue groups, not
the number of queue subs in all those queue groups.
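
The difference between the two counts can be sketched as follows. The `subscription` and `result` types below are simplified stand-ins, not the server's real types; only the `psubs`/`qsubs` field names are taken from the expression `len(res.psubs) + len(res.qsubs)` quoted in the review, and the assumption is that `qsubs` holds one slice per queue group.

```go
package main

import "fmt"

// subscription is a hypothetical, minimal stand-in for a server subscription.
type subscription struct {
	subject string
	queue   string
}

// result is a simplified match result: plain subs plus queue subs,
// grouped with one inner slice per queue group (an assumption).
type result struct {
	psubs []*subscription
	qsubs [][]*subscription
}

// countGroups mirrors the old behavior: each queue group counts once.
func countGroups(r *result) int {
	return len(r.psubs) + len(r.qsubs)
}

// countSubs counts every queue subscriber in every group.
func countSubs(r *result) int {
	n := len(r.psubs)
	for _, group := range r.qsubs {
		n += len(group)
	}
	return n
}

func main() {
	// Two queue subscribers in the same group on subject "foo".
	r := &result{
		qsubs: [][]*subscription{{
			{subject: "foo", queue: "workers"},
			{subject: "foo", queue: "workers"},
		}},
	}
	fmt.Println(countGroups(r), countSubs(r)) // prints "1 2"
}
```

This matches the symptom described: with 2 queue subscribers in one group, the group-based count reports 1.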
For the leafnode queue balancing issue, the code was favoring
local/routed queue subscriptions, so in the described issue,
the message would always go from HUB1->HUB2->LEAF2->QSub instead
of HUB1->LEAF1->QSub.
Since we had another test that was a bit reversed where we had
a HUB and LEAF1<->LEAF2 connecting to HUB and a qsub on both
HUB and LEAF1 and requests originated from LEAF2, and we were
expecting all responses to come from LEAF1 (instead of the
responder on HUB), I went with the following approach:
If the message originates from a client that connects to a server
that has a connection from a remote LEAF, then we pick that LEAF the
same as if it was a local client or routed server.
However, if the client connects to a server that has a leaf
connection to another server, then we keep track of the sub
but do not send to that one if we have local or routed qsubs.
This makes the 2 tests pass, solving the new test and maintaining
the behavior for the old test.
Resolves #5972
Signed-off-by: Ivan Kozlovic [email protected]