Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

3.13.x: fix direct-reply-to crash and lost reply in 3.12 / 3.13 mixed version clusters #11401

Merged
merged 3 commits into from
Jun 7, 2024

Conversation

ansd
Copy link
Member

@ansd ansd commented Jun 6, 2024

Fixes #11380

This PR should only go into v3.13.x.
#11397 should only go into main.

@ansd ansd added this to the 3.13.4 milestone Jun 6, 2024
@mergify mergify bot added the bazel label Jun 6, 2024
@michaelklishin michaelklishin changed the title Fix direct-reply-to crash and lost reply in 3.12 / 3.13 mixed version clusters 3.13.x: fix direct-reply-to crash and lost reply in 3.12 / 3.13 mixed version clusters Jun 6, 2024
Fix dialyzer error:
```
rabbit_channel.erl:309:1: The attempt to match a term of type
          mc:state() against the pattern
          {'basic_message', _, _, _, _, _} breaks the opacity of the term
```
@michaelklishin
Copy link
Member

Tested this PR with the following setup:

  1. A two node cluster (good enough for this test), one node is a v3.13.x one and the other is a 3.12.8 one
  2. Client connects to the 3.12 node
  3. Server connects to the 3.13.x node

Here is the client code, using Bunny:

#!/usr/bin/env ruby

require "bundler"
require "bunny"

c = Bunny.new(port: 5673); c.start
ch = c.create_channel
ch.basic_consume("amq.rabbitmq.reply-to", "", true) do |delivery_info, metadata, payload|
  puts "Response: #{payload}"
end

loop do
  ch.basic_publish(Time.now.to_i.to_s, "", "qq.1", reply_to: "amq.rabbitmq.reply-to")
  sleep 1
end

Here is the server code:

#!/usr/bin/env ruby

require "bundler"
require "bunny"

c = Bunny.new(port: 5672); c.start
ch = c.create_channel
ch.quorum_queue("qq.1").subscribe(manual_ack: false) do |delivery_info, metadata, payload|
  print "."
  ch.basic_publish("#{payload}.response.#{rand}", "", metadata.reply_to)
end

loop do
  sleep 1
end

All common feature flags between the nodes were enabled after cluster formation.

The clients communicate just fine using Direct Reply to.

@michaelklishin
Copy link
Member

Apparently in #11380 the request-reply client is connected to the 3.13.x node.

After swapping client and server connection ports I still observe no exceptions:

#!/usr/bin/env ruby

require "bundler"
require "bunny"

c = Bunny.new(port: 5672); c.start
ch = c.create_channel
ch.basic_consume("amq.rabbitmq.reply-to", "", true) do |delivery_info, metadata, payload|
  puts "Response: #{payload}"
end

loop do
  ch.basic_publish(Time.now.to_i.to_s, "", "qq.1", reply_to: "amq.rabbitmq.reply-to")
  sleep 1
end

And the server code:

#!/usr/bin/env ruby

require "bundler"
require "bunny"

c = Bunny.new(port: 5673); c.start
ch = c.create_channel
ch.quorum_queue("qq.1").subscribe(manual_ack: false) do |delivery_info, metadata, payload|
  print "."
  ch.basic_publish("#{payload}.response.#{rand}", "", metadata.reply_to)
end

loop do
  sleep 1
end

@michaelklishin michaelklishin merged commit bb95794 into v3.13.x Jun 7, 2024
18 checks passed
@michaelklishin michaelklishin deleted the direct-reply-to-3.13.x branch June 7, 2024 06:26
@ansd
Copy link
Member Author

ansd commented Jun 7, 2024

For the record, this PR contains a test which reproduces the issue:

git checkout 990eb1b1b8aceb6994bce1cbd983ac7c3ad246d4
bazel test //deps/rabbit:amqpl_direct_reply_to_SUITE-mixed -t- --test_sharding_strategy=disabled --test_env FOCUS="-group [cluster_size_3] -case rpc_new_to_old_node"

makes the test fail with the following crash:

2024-06-07 07:33:46.814420+00:00 [error] <0.1117.0>   crasher:
2024-06-07 07:33:46.814420+00:00 [error] <0.1117.0>     initial call: rabbit_channel:init/1
2024-06-07 07:33:46.814420+00:00 [error] <0.1117.0>     pid: <0.1117.0>
2024-06-07 07:33:46.814420+00:00 [error] <0.1117.0>     registered_name: []
2024-06-07 07:33:46.814420+00:00 [error] <0.1117.0>     exception exit: {function_clause,
2024-06-07 07:33:46.814420+00:00 [error] <0.1117.0>                      [{mc_compat,convert_to,
2024-06-07 07:33:46.814420+00:00 [error] <0.1117.0>                        [mc_amqpl,
2024-06-07 07:33:46.814420+00:00 [error] <0.1117.0>                         {delivery,false,false,<32789.1572.0>,
2024-06-07 07:33:46.814420+00:00 [error] <0.1117.0>                          {basic_message,

The other way around makes the test fail as well due to lost relpy (but without a crash in the logs):

bazel test //deps/rabbit:amqpl_direct_reply_to_SUITE-mixed -t- --test_sharding_strategy=disabled --test_env FOCUS="-group [cluster_size_3] -case rpc_old_to_new_node"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants