[FIXED] LeafNode's queue group load balancing and Sublist.NumInterest #5982

Merged
merged 2 commits into from
Oct 10, 2024

Conversation

kozlovic
Member

While writing the test, I needed to make sure that each server in
the hub had registered interest for 2 queue subscribers from the
same group. I noticed that Sublist.NumInterest() (which I was
invoking from Account.Interest()) was returning 1, even after
I knew that the propagation should have happened. It turns out
that NumInterest() was returning the number of queue groups, not
the number of queue subs in all those queue groups.
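
To illustrate the counting difference, here is a minimal sketch. The `sub` and `result` types and both function names are hypothetical stand-ins, not the server's actual code; the only assumption is that queue subscriptions are held as one inner slice per queue group.

```go
package main

import "fmt"

// sub is a hypothetical stand-in for a subscription.
type sub struct {
	subject string
	queue   string
}

// result is a simplified, hypothetical match result: plain subs in psubs,
// and one inner slice per queue group in qsubs.
type result struct {
	psubs []*sub
	qsubs [][]*sub
}

// countByGroups counts each queue group once, regardless of its size.
func countByGroups(r *result) int {
	return len(r.psubs) + len(r.qsubs)
}

// countAllSubs counts every queue subscriber in every queue group.
func countAllSubs(r *result) int {
	n := len(r.psubs)
	for _, group := range r.qsubs {
		n += len(group)
	}
	return n
}

func main() {
	// Two queue subscribers in the same group on the same subject.
	r := &result{
		qsubs: [][]*sub{{
			{subject: "foo", queue: "workers"},
			{subject: "foo", queue: "workers"},
		}},
	}
	fmt.Println(countByGroups(r)) // 1
	fmt.Println(countAllSubs(r))  // 2
}
```

With two members in one group, counting groups yields 1 while counting members yields 2, which matches the mismatch seen in the test.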

For the leafnode queue balancing issue, the code was favoring
local/routed queue subscriptions, so in the described issue,
the message would always go from HUB1->HUB2->LEAF2->QSub instead
of HUB1->LEAF1->QSub.

We also had another test with a somewhat reversed topology: a HUB
with LEAF1<->LEAF2 connecting to it, a qsub on both HUB and LEAF1,
and requests originating from LEAF2. That test expects all
responses to come from LEAF1 (instead of the responder on HUB),
so I went with the following approach:

If the message originates from a client that connects to a server
that has a connection from a remote LEAF, then we pick that LEAF the
same as if it were a local client or routed server.
However, if the client connects to a server that has a leaf
connection to another server, then we keep track of the sub
but do not send to it if we have local or routed qsubs.
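
The rule above can be sketched roughly as follows. This is a simplified illustration, not the server's selection code: the `queueSub` type, its `spoke` field, and `pickQueueSub` are hypothetical names, and local, routed, and hub-side leaf members are treated here as one equal pool.

```go
package main

import (
	"fmt"
	"math/rand"
)

type subKind int

const (
	kindClient subKind = iota
	kindRouted
	kindLeaf
)

// queueSub is a hypothetical queue group member.
type queueSub struct {
	name string
	kind subKind
	// spoke is true when this server created the leaf connection to
	// another server (assumption: false means a remote LEAF connected
	// to this server).
	spoke bool
}

// pickQueueSub chooses one member of a queue group. Leaf subs reached over
// a connection this server created are only tracked as a fallback; all other
// members are candidates on equal footing. pick selects an index in [0, n).
func pickQueueSub(group []*queueSub, pick func(n int) int) *queueSub {
	var candidates, fallback []*queueSub
	for _, s := range group {
		if s.kind == kindLeaf && s.spoke {
			fallback = append(fallback, s)
			continue
		}
		candidates = append(candidates, s)
	}
	if len(candidates) > 0 {
		return candidates[pick(len(candidates))]
	}
	if len(fallback) > 0 {
		return fallback[pick(len(fallback))]
	}
	return nil
}

func main() {
	// Hub side: the remote LEAF competes with the routed qsub.
	hubSide := []*queueSub{
		{name: "route-to-HUB2", kind: kindRouted},
		{name: "LEAF1", kind: kindLeaf},
	}
	// Spoke side: the leaf connection is used only if nothing else exists.
	spokeSide := []*queueSub{
		{name: "local-client", kind: kindClient},
		{name: "leaf-to-HUB", kind: kindLeaf, spoke: true},
	}
	fmt.Println(pickQueueSub(hubSide, rand.Intn).name)
	fmt.Println(pickQueueSub(spokeSide, rand.Intn).name) // always local-client
}
```

Passing the index picker as a parameter keeps the sketch deterministic under test; `rand.Intn` is just one possible choice.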

This makes the 2 tests pass, solving the new test and maintaining
the behavior for the old test.

Resolves #5972

Signed-off-by: Ivan Kozlovic [email protected]

@kozlovic kozlovic requested a review from a team as a code owner October 10, 2024 04:33
@kozlovic
Member Author

@neilalexander I believe there was an issue with Sublist.NumInterest for queue subs since it looked like it was simply counting the number of groups, not the total number of queue subscriptions. Let me know if I misunderstood the intent.

@derekcollison Please review the PR description and see if the choice I made is ok.

@kozlovic
Member Author

You can review the first commit for the leafnode/sublist issues. The second is simply a bunch of missing "defer nc.Close()" and the like.

Member

@neilalexander neilalexander left a comment

LGTM.

I notice looking back at #5918 that even before NumInterest() was added, the Account.Interest() function was still returning len(res.psubs) + len(res.qsubs), so I think the bug is not new and I've just ported it over to the new code as-is.

That said, I think what you're proposing here makes sense, particularly if we're relying on the number of subscriptions to balance in this way.

@neilalexander
Member

Something else that's just occurred to me is that NumInterest() was never backported into 2.10.x, so if there's a problem on those versions too (as opposed to just on main), it's probably because of Account.Interest() doing len(res.psubs) + len(res.qsubs).

@derekcollison Don't know whether we want to cherry-pick NumInterest() into 2.10.x and apply this on top, or if we want to raise a separate PR against the release/v2.10.22 branch to just fix Account.Interest()?

@derekcollison
Member

@neilalexander let's pull those into 2.10.22 from main once this lands.

Member

@derekcollison derekcollison left a comment

LGTM - Thanks @kozlovic

@derekcollison derekcollison merged commit 7e9c93f into main Oct 10, 2024
5 checks passed
@derekcollison derekcollison deleted the fix_5972 branch October 10, 2024 14:23
neilalexander added a commit that referenced this pull request Oct 10, 2024
Includes the following:

- #5918
- #5982
- #5983 (although only the 1.22.8 upgrade, since 1.21.x is no longer
receiving updates)

Signed-off-by: Neil Twigg <[email protected]>
kozlovic added a commit that referenced this pull request Nov 19, 2024
There were multiple issues, but basically the fact that we would
not store the routed subscriptions with the origin of the LEAF they
came from made the server unable to differentiate them from
"local" routed subscriptions, which in some cases (like a server
restart and the resend of subscriptions) could lead to servers
incorrectly sending subscription interest to leaf connections.

We are now storing the subscriptions that have an origin with that
origin as part of the key. This makes it possible to differentiate
"regular" routed subs from the ones on behalf of a leafnode.
An INFO boolean, `LNOCU`, is added to indicate support for origin
in the `LS-` protocol, which is required to properly handle the
removal. Therefore, if a route does not have `LNOCU`, the server
will behave like an old server and store with the key that does
not contain the origin, so that it can be removed when getting
an `LS-` without the origin. Note that in the case of a mix of servers
in the same cluster, some of the issues this PR is trying to fix
will still be present (since the server will basically behave like a
server without the fix).
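
As a rough illustration of keying with and without the origin, consider the sketch below. The `routedSubKey` struct, `keyFor`, and the `peerHasLNOCU` parameter are hypothetical; the server's real key format is not shown in this conversation and may differ.

```go
package main

import "fmt"

// routedSubKey is a hypothetical map key for a routed subscription.
// origin is empty for "regular" routed subs.
type routedSubKey struct {
	account, subject, queue, origin string
}

// keyFor builds the key. When the peer route did not advertise LNOCU,
// the origin is left out so that an LS- without origin can still find
// and remove the entry.
func keyFor(account, subject, queue, origin string, peerHasLNOCU bool) routedSubKey {
	k := routedSubKey{account: account, subject: subject, queue: queue}
	if peerHasLNOCU {
		k.origin = origin
	}
	return k
}

func main() {
	// Peer supports LNOCU: the regular routed sub and the sub on behalf
	// of a leaf are stored under distinct keys.
	subs := map[routedSubKey]int{}
	subs[keyFor("A", "foo", "workers", "", true)]++
	subs[keyFor("A", "foo", "workers", "LEAF1", true)]++
	fmt.Println(len(subs)) // 2

	// Removing the leaf-origin sub leaves the regular one in place.
	delete(subs, keyFor("A", "foo", "workers", "LEAF1", true))
	fmt.Println(len(subs)) // 1

	// Peer without LNOCU: both collapse onto the same key, as before the fix.
	old := map[routedSubKey]int{}
	old[keyFor("A", "foo", "workers", "", false)]++
	old[keyFor("A", "foo", "workers", "LEAF1", false)]++
	fmt.Println(len(old)) // 1
}
```

The last case shows why a mixed cluster still exhibits some of the original problems: without the origin in the key, the two kinds of subscription cannot be told apart.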

Having separate routed subs for leaf connections allows us to revisit
the fix #5982 that was done for issue #5972, which was about
a fairer queue distribution to a cluster of leaf connections.
That fix actually introduced a behavior change: we have always wanted
to favor queue subscriptions of the cluster where the message is
produced, and that fix possibly changed this. With this PR, the server
can now tell whether a remote queue sub is for a queue sub "local" to
that server or on behalf of a leaf, and therefore will not favor that
route over a leaf subscription that it may have directly attached.

Resolves #5972
Resolves #6148

Signed-off-by: Ivan Kozlovic <[email protected]>
kozlovic added a commit that referenced this pull request Nov 20, 2024
kozlovic added a commit that referenced this pull request Nov 22, 2024
There were multiple issues, but basically the fact that we would
not store the routed subscriptions with the origin of the LEAF they
came from made the server unable to differentiate them from
"local" routed subscriptions, which in some cases (like a server
restart and the resend of subscriptions) could lead to servers
incorrectly sending subscription interest to leaf connections.

We are now storing the subscriptions with a sub type indicator and
the origin (for leaf subscriptions) as part of the key. This makes
it possible to differentiate "regular" routed subs from the ones
on behalf of a leafnode.
An INFO boolean, `LNOCU`, is added to indicate support for origin
in the `LS-` protocol, which is required to properly handle the
removal. Therefore, if a route does not have `LNOCU`, the server
will behave like an old server and store with the key that does
not contain the origin, so that it can be removed when getting
an `LS-` without the origin. Note that in the case of a mix of servers
in the same cluster, some of the issues this PR is trying to fix
will still be present (since the server will basically behave like a
server without the fix).

Having separate routed subs for leaf connections allows us to revisit
the fix #5982 that was done for issue #5972, which was about
a fairer queue distribution to a cluster of leaf connections.
That fix actually introduced a behavior change: we have always wanted
to favor queue subscriptions of the cluster where the message is
produced, and that fix possibly changed this. With this PR, the server
can now tell whether a remote queue sub is for a queue sub "local" to
that server or on behalf of a leaf, and therefore will not favor that
route over a leaf subscription that it may have directly attached.

Resolves #5972
Resolves #6148

Signed-off-by: Ivan Kozlovic <[email protected]>
kozlovic added a commit that referenced this pull request Nov 22, 2024
derekcollison added a commit that referenced this pull request Nov 22, 2024
neilalexander pushed a commit that referenced this pull request Nov 22, 2024
Development

Successfully merging this pull request may close these issues.

Queue Groups on leaf clusters not balancing correctly when messages are routed in from hub cluster [v2.10.21]