Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consumer connection recovery does not stop on success when RabbitMQ node is restarted #1061

Closed
Inlustris opened this issue Jul 31, 2021 · 22 comments
Assignees
Milestone

Comments

@Inlustris
Copy link

For easy understanding of the problem.
With the connectionfactory i create a Consumer and Publisher connection. No changes in the default settings of the connection or connectionfactory. Which should result in auto recovery of the connection.

ConsumerConnection = AutoRecoverable

  • 3 active consumers (channels)

PublisherConnection = AutoRecoverable

  • none active channels

For testing purposes i was restarting rabbitMQ.

While logging for the ShutdownConnection event resulted in "CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'". (this is off course a logical outcome)

Within rabbitMQ my connection looks like the following table:

Network Name User name State SSL / TLS Protocol Channels From client To client
172.17.0.1:40406 Consumers user running AMQP 0-9-1 3 2 B/s 0 B/s
172.17.0.1:40412 Publishers user running AMQP 0-9-1 0 2 B/s 0 B/s

When i restart rabbitMQ it start trying to recover like in the document:

  • Reconnect
  • Restore connection listeners
  • Re-open channels
  • Restore channel listeners
  • Restore channel basic.qos setting, publisher confirms and transaction settings

The result is that it recreates my connections for this application with the channels.
But the result is that my publishing connection is reacreated with 0 channels which is right. but my Consumer connection gets recreated with 3 channels. Which is at this point right too.

Now my issue it doesn't recover this consumer connection correctly. It is being recreated like every 5 to 10 seconds. Which results in x amount of connections. All with the correct 3 channels but none of them function like fully recovered.

Example of the results after restart: (1 Publish Connection, X amount of consumer Connections)

Network Name User name State SSL / TLS Protocol Channels From client To client
172.17.0.1:40586 Publishers user running AMQP 0-9-1 0 2 iB/s 0 B/s
172.17.0.1:40592 Consumers user running AMQP 0-9-1 3 2 iB/s 0 B/s
172.17.0.1:40604 Consumers user running AMQP 0-9-1 3 0 B/s 0 B/s
172.17.0.1:40610 Consumers user running AMQP 0-9-1 3 0 B/s 0 B/s
172.17.0.1:40616 Consumers user running AMQP 0-9-1 3 2 iB/s 0 B/s
172.17.0.1:40628 Consumers user running AMQP 0-9-1 3 0 B/s 2 iB/s
172.17.0.1:40634 Consumers user running AMQP 0-9-1 3 0 B/s 0 B/s
172.17.0.1:40640 Consumers user running AMQP 0-9-1 3 2 iB/s 0 B/s
172.17.0.1:40646 Consumers user running AMQP 0-9-1 3 0 B/s 0 B/s

RabbitMQ logging repeating itself:

2021-07-29 20:02:58.084 [info] <0.2125.0> accepting AMQP connection <0.2125.0> (172.17.0.1:38276 -> 172.17.0.4:5672)
2021-07-29 20:02:58.086 [info] <0.2125.0> Connection <0.2125.0> (172.17.0.1:38276 -> 172.17.0.4:5672) has a client-provided name: Consumers
2021-07-29 20:02:58.088 [info] <0.2125.0> connection <0.2125.0> (172.17.0.1:38276 -> 172.17.0.4:5672 - Consumers): user 'user' authenticated and granted access to vhost '/'

2021-07-29 20:03:03.142 [info] <0.2150.0> accepting AMQP connection <0.2150.0> (172.17.0.1:38282 -> 172.17.0.4:5672)
2021-07-29 20:03:03.961 [info] <0.2150.0> Connection <0.2150.0> (172.17.0.1:38282 -> 172.17.0.4:5672) has a client-provided name: Consumers
2021-07-29 20:03:03.963 [info] <0.2150.0> connection <0.2150.0> (172.17.0.1:38282 -> 172.17.0.4:5672 - Consumers): user 'user' authenticated and granted access to vhost '/'

@michaelklishin
Copy link
Member

What version of the client is used? Can you share an example repository that we can use to reproduce?

@michaelklishin michaelklishin changed the title Recover issue when stopping rabbitMQ and restarting Consumer connection recovery does not stop on success when RabbitMQ node is restarted Jul 31, 2021
@Inlustris
Copy link
Author

It is not completely a succes recovery. The consumers are not yet consuming when the loop is restarted. The connection and channels are recreated but the consumers are not. On which i think it is going wrong on restoring the channel listeners.

Client version used:
RabbitMQ.Client = 6.2.2

private readonly ILogger<ConnectionManager> logger;
private readonly IExchangeManager ExchangeManager;
private readonly IQueueManager QueueManager;
private readonly IConnectionFactory connectionFactory;
private readonly IConnection ConsumerConnection;
private readonly IConnection PublishingConnection;

public ConnectionManager(ILoggerFactory loggerFactory, IConnectionFactory connectionFactory, IEnumerable<IRabbitMQSetup> rabbitMQSetups)
{
	this.logger = loggerFactory.CreateLogger<ConnectionManager>();
	this.connectionFactory = connectionFactory;

	ConsumerConnection = GetConnection(System.AppDomain.CurrentDomain.FriendlyName + " Consumers");
	PublishingConnection = GetConnection(System.AppDomain.CurrentDomain.FriendlyName + " Publishers");

	ExchangeManager = new ExchangeManager(loggerFactory.CreateLogger<ExchangeManager>(), ConsumerConnection);
	QueueManager = new QueueManager(loggerFactory.CreateLogger<QueueManager>(), ConsumerConnection);
	if (!SetupRabbitMQ(rabbitMQSetups))
	{
		throw new Exception("Setup of RabbitMQ failed for current configuration.");
	}
	logger.LogInformation("Setup Rabbit MQ succeeded");
}

public IModel GetNewConsumerChannel()
{
	return ConsumerConnection.CreateModel();
}

public IModel GetNewPublishingChannel()
{
	return PublishingConnection.CreateModel();
}

private IConnection GetConnection(string clientName)
{
	var policy = Policy.Handle<RabbitMQ.Client.Exceptions.BrokerUnreachableException>().WaitAndRetry(8, retryAttempt => TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)), (exception, timeSpan, retryCount, context) =>
	{
		logger.LogCritical("Not able to establish a connection. {TimeSpan}, {RetryCount}", timeSpan, retryCount);
	});

	var connection = policy.Execute(() => connectionFactory.CreateConnection(clientName));
	connection.ConnectionShutdown += Connection_ConnectionShutdown;

	return connection;
}
		
private void Connection_ConnectionShutdown(object sender, ShutdownEventArgs e)
{
	logger.LogCritical("Connection broke! {ReplyText} {Cause}", e.ReplyText, e.Cause);
}

The code above is the basic part of the implementation i'm using. Where as the ConsumerConnection is used for declaring Exchanges and Queues.

Until now i was focused on the consuming side of the application. And have not yet tested the Publishing side. (If this connection is restored correctly) Probably this will work because there is no channel / listener to be restored. I will do this now and come back with the results.

@Inlustris
Copy link
Author

Inlustris commented Aug 1, 2021

  1. The connection for Publishing message is restored correctly and is able to send messages even after RabbitMQ node Restart.

  2. Bringing the application down to just one connection doesn't fix the issue.

Two possible issues could be i think.

  1. I'm using Dependency Injection
  2. I'm storing my Channels(Consumers) in a concurrent dictionary

I'm trying to rule out several options by re-building my solution directly in one application. But until now the use of DI did not change anything from a working application in a bad way.

private readonly ConcurrentDictionary<string, List<IModel>> channelsForClass = new ConcurrentDictionary<string, List<IModel>>();

I think the ConcurrentDictionary is preventing the recovery from completing. In combination with DI.

@Inlustris
Copy link
Author

I came to the conclusion i was not knowing how to debug some stuff of the rabbitMQ client. (Which i found out just yet)

For better knowledge on what is happening, the following exception occurs.

Error: Exception: RabbitMQ.Client.Exceptions.TopologyRecoveryException
Caught an exception while recovering binding between HTMLPDFDirect and HTMLPDFQueue: Already closed: The AMQP operation was interrupted: AMQP close-reason, initiated by Application, code=200, text='Goodbye', classId=0, methodId=0
RabbitMQ.Client.Exceptions.AlreadyClosedException: Already closed: The AMQP operation was interrupted: AMQP close-reason, initiated by Application, code=200, text='Goodbye', classId=0, methodId=0
   at RabbitMQ.Client.Impl.SessionBase.Transmit(OutgoingCommand& cmd)
   at RabbitMQ.Client.Impl.ModelBase.ModelRpc(MethodBase method, ContentHeaderBase header, Byte[] body)
   at RabbitMQ.Client.Framing.Impl.Model._Private_QueueBind(String queue, String exchange, String routingKey, Boolean nowait, IDictionary`2 arguments)
   at RabbitMQ.Client.Impl.ModelBase.QueueBind(String queue, String exchange, String routingKey, IDictionary`2 arguments)
   at RabbitMQ.Client.Impl.RecordedQueueBinding.Recover()
   at RabbitMQ.Client.Framing.Impl.AutorecoveringConnection.RecoverBindings()
Error: Exception when recovering connection. Will try again after retry interval.
Error: Exception: RabbitMQ.Client.Exceptions.TopologyRecoveryException
Caught an exception while recovering binding between HTMLPDFDirect and HTMLPDFQueue: Already closed: The AMQP operation was interrupted: AMQP close-reason, initiated by Application, code=200, text='Goodbye', classId=0, methodId=0

   at RabbitMQ.Client.Framing.Impl.AutorecoveringConnection.HandleTopologyRecoveryException(TopologyRecoveryException e)
   at RabbitMQ.Client.Framing.Impl.AutorecoveringConnection.RecoverBindings()
   at RabbitMQ.Client.Framing.Impl.AutorecoveringConnection.RecoverEntities()
   at RabbitMQ.Client.Framing.Impl.AutorecoveringConnection.TryPerformAutomaticRecovery()
InnerException:
RabbitMQ.Client.Exceptions.AlreadyClosedException: Already closed: The AMQP operation was interrupted: AMQP close-reason, initiated by Application, code=200, text='Goodbye', classId=0, methodId=0
   at RabbitMQ.Client.Impl.SessionBase.Transmit(OutgoingCommand& cmd)
   at RabbitMQ.Client.Impl.ModelBase.ModelRpc(MethodBase method, ContentHeaderBase header, Byte[] body)
   at RabbitMQ.Client.Framing.Impl.Model._Private_QueueBind(String queue, String exchange, String routingKey, Boolean nowait, IDictionary`2 arguments)
   at RabbitMQ.Client.Impl.ModelBase.QueueBind(String queue, String exchange, String routingKey, IDictionary`2 arguments)
   at RabbitMQ.Client.Impl.RecordedQueueBinding.Recover()
   at RabbitMQ.Client.Framing.Impl.AutorecoveringConnection.RecoverBindings()

I can't get a clue where i missed something. It looks like it is trying to recover from the connection which was shutdown. And not the recovered one.

@Inlustris
Copy link
Author

Inlustris commented Aug 2, 2021

Got a bit further. But now i'm stuck at the following log:

Informational: Performing automatic recovery
Informational: Connection recovery completed
Error: Topology recovery exception
Error: Exception: RabbitMQ.Client.Exceptions.TopologyRecoveryException

This is weird because of the sequence i got them in. If i look at the code in AutorecoveringConnection. I should not get this log line "Informational: Connection recovery completed" before everything is recovered am i right?

private bool TryPerformAutomaticRecovery()
        {
            ESLog.Info("Performing automatic recovery");

            try
            {
                ThrowIfDisposed();
                if (TryRecoverConnectionDelegate())
                {
                    ThrowIfDisposed();
                    RecoverModels();
                    if (_factory.TopologyRecoveryEnabled)
                    {
                        // The recovery sequence is the following:
                        //
                        // 1. Recover exchanges
                        // 2. Recover queues
                        // 3. Recover bindings
                        // 4. Recover consumers
                        RecoverExchanges();
                        RecoverQueues();
                        RecoverBindings();
                        RecoverConsumers();
                    }

                    ESLog.Info("Connection recovery completed");
                    ThrowIfDisposed();
                    _recoverySucceededWrapper.Invoke(this, EventArgs.Empty);

                    return true;
                }

                ESLog.Warn("Connection delegate was manually closed. Aborted recovery.");
            }
            catch (Exception e)
            {
                ESLog.Error("Exception when recovering connection. Will try again after retry interval.", e);
            }

            return false;
        }

@michaelklishin
Copy link
Member

If any of the Recover… methods fail and handle/log their exceptions, or operate asynchronously, the result would be logged.

See server log for clues as most topology recovery [protocol-level] exceptions produce specific error messages.

@Inlustris
Copy link
Author

It's like in the opening message of this issue. I do not get errors or connection issues. Is it possible that i miss logging? (Docker running RabbitMQ looking at Logs in the container)

It looks like because there is no channel or connection the software is not able to connect to RabbitMQ to declare the binding.

What connection or channel is the client trying to use to redeclare the topology?

  • One specifically create for the recovery
  • Or one recreated for the application?

@Inlustris
Copy link
Author

@michaelklishin I think i found my issue.

In my package I separated several dependencies. Where as the declaration and binding of the application is done in one go on one channel. This channel is done at the end of my start-up so I close it.

To check my case i closed this channel with my own "ReplyCode" and "ReplyText". For Example:

  • ReplyCode : 200
  • ReplyText : "RabbitMQ setup completed."

This result in the following error on recovery.

Informational: Performing automatic recovery
Error: Topology recovery exception
Error: Exception: RabbitMQ.Client.Exceptions.TopologyRecoveryException
Caught an exception while recovering binding between HTMLPDFDirect and HTMLPDFQueue: Already closed: The AMQP operation was interrupted: AMQP close-reason, initiated by Application, code=200, text='RabbitMQ setup completed.', classId=0, methodId=0

Can i conclude this is a bug? Or am I using it wrong? Because of the Channel-per-Thread limitation and the use of DI to declare all of the Exchanges, Queues and Bindings on start-up. I'm not able to keep it thread safe if i need to do this with the channels I use for consuming. Let alone my publishing channels will need some sort of declaration too which channels don't exist until i need them.

@Inlustris
Copy link
Author

I took some time refining the story. (Sorry for all messages. I'm learning by playing with it.)

I figured out that my channel is closed at the point where it is tying to recover. I closed it because i was done with configuring. The issue occurring "Not able to recover binding" is blocking the completion of the full recovery. Stopping before getting to recover the Consumers. All Connections and Channels are recovered at this point. (this repeats every 5 seconds)

While looking into it i thought how can it be this class(old channel) still exists? I used it within a constructor. It should be disposed. Trying to resolve this i created an using statement around this code for the channel. Which results in a correctly disposed channel. This results in the continuation of the recovery in case of the consumers. (And not repeating the recover process every 5 seconds)

Which brings me to the following exception within the recovery process. Where it is trying to recover bindings from the IModel that now is disposed. Because this is an unknown error for the recovery process it ignores this state and continues. This makes the recovery complete.

it's not a known problem with connectivty, ignoring it

This still results in an uncomplete recovery because the binding was not able to recover. Reasons:

  • Channel = Closed
  • Channel = Disposed

Both of them result in the effect that the binding is not recovered.

  • Where as Closed results in failing of recovery at all with 5 secondsrepeat
  • Where as Disposed results in full recovery except the bindings
Informational: Performing automatic recovery
Informational: Connection recovery completed
Error: Topology recovery exception
Error: Exception: RabbitMQ.Client.Exceptions.TopologyRecoveryException
Caught an exception while recovering binding between HTMLPDFDirect and HTMLPDFQueue: Cannot access a disposed object.
Object name: 'RabbitMQ.Client.Impl.AutorecoveringModel'.

InnerException:
System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'RabbitMQ.Client.Impl.AutorecoveringModel'.
   at RabbitMQ.Client.Impl.AutorecoveringModel.get_Delegate()
   at RabbitMQ.Client.Impl.RecordedEntity.get_ModelDelegate()
   at RabbitMQ.Client.Impl.RecordedQueueBinding.Recover()
   at RabbitMQ.Client.Framing.Impl.AutorecoveringConnection.RecoverBindings()
Informational: Will not retry recovery because of System.ObjectDisposedException: it's not a known problem with connectivty, ignoring it
Informational: Connection recovery completed

@Inlustris
Copy link
Author

I would like to help make a solution to this but i need to know if you are in agreement this is a bug.
Because of the reason there need to be made a significant amount of change to the code for restoring configurations.

Creating a new IModel for configuring all settings (RecoveryChannel). So the user of the package is not required to keep such channels open for the lifetime of their application.

But I'm not totally clear where it needs to be added.

@michaelklishin Can you give me some information on this issue.

@bollhals
Copy link
Contributor

Hmmmmmmm What I'm wondering is this:

  • Channel_1 is creating some queues and bindings, then gets closed / disposed.
  • Connection drops, auto recovery is started.

Now what happens?

  • Recovery will open a new connection
  • But no new channel since it was already closed. (<- this is still fine)
  • It will try to recover the queues / bindings, but since they reference the original model that is now closed / disposed -> Boom

Question is what should be the correct behavior?

In general it doesn't matter which channel creates exchanges, queues, bindings, it does matter for consumers (as the channel is then the one receiving things). But somehow we're remembering which channel did create them.
(The only thing that does matter is that it is done on the same connection for things like autodelete or exclusive, right?)

Couldn't we switch to a model where:
If there is at least one channel open, we use that one to recover everything. If there isn't one open, we create a new temporary one.

Sidenote: This behavior is like that since a long time, I'm somewhat surprised we got this far without someone running into this issue.

@michaelklishin @stebet Tagging you to verify my assumption and verify the proposal for fixing it.

@bollhals
Copy link
Contributor

Also after scanning through the PR of @Inlustris, it's basically already implemented what I suggested. (Except the use an existing channel if there is one instead of creating a temporary one)

I think this is a significant bug in the library, but since it hasn't come up earlier, I guess it's not that common that one defines things on a channel that gets closed later. So I guess we'd only fix it in 7.0 and not 6.x?

@Inlustris
Copy link
Author

@bollhals I get why it would be ideal to use an existing one thinking about recourses. And it would be possible. But by using always a new one you are in control for the recover proces. The newly created channel isn't the property of the user but of the client itself. This results in a more stable recovery. And could not be interrupted by the user by closing the channel while recovering.

Can i ask when the 7.0 release is due? I'm implementing on 6.2.2 now and would like not to program around the issue.

Can you also take a look at my second issue i created. #1067 It is a follow up issue which is related to this one. This won't fix the whole problem and is only a partial fix.

@bollhals
Copy link
Contributor

bollhals commented Aug 13, 2021

@bollhals I get why it would be ideal to use an existing one thinking about recourses. And it would be possible. But by using always a new one you are in control for the recover proces. The newly created channel isn't the property of the user but of the client itself. This results in a more stable recovery. And could not be interrupted by the user by closing the channel while recovering.

Well if that happens it will just retry again. So no harm done, but surely opening up a new one works as well. I guess it's up to the maintainer what they prefer.

Can i ask when the 7.0 release is due? I'm implementing on 6.2.2 now and would like not to program around the issue.

I do not yet know, last thing I've read was that prior to it releasing, there should be some betas first. but @michaelklishin should be able to answer that.

Can you also take a look at my second issue i created. #1067 It is a follow up issue which is related to this one. This won't fix the whole problem and is only a partial fix.

I will.

@michaelklishin
Copy link
Member

Once we are done with #1067 and this one (that does not mean we find the perfect solution, just improve things meaningfully) we can cut a 7.0 preview release.

@michaelklishin
Copy link
Member

By the way, I'd be up for producing a 6.3 with a backport if we consider the solution to be reasonably safe.

@goldenbull
Copy link

I come with a similar issue. A simple way to reproduce the bug:

  1. Create a AutoRecover Connection.
  2. Enter a function.
  3. Create a model from the connection.
  4. Close the model, but leave it not disposed.
  5. Return from the function.
  6. Force close the connection from RabbitMQ ManagementUI
  7. Then unexpected infinite connections will be created.

Just FYI

@lukebakken
Copy link
Contributor

lukebakken commented Feb 25, 2022

It would be great if people who commented on this issue could try to reproduce it using version 6.2.4 of this library. As suggested in this comment, PR #1145 may have fixed this issue as well.

cc @goldenbull @Inlustris @bollhals

@Inlustris
Copy link
Author

@lukebakken i see milestone 6.2.5 is added. You are asking for 6.2.4. is this fixed in 6.2.4?

@lukebakken
Copy link
Contributor

is this fixed in 6.2.4?

Yes, please test. I put this in 6.2.5 so I wouldn't forget about it.

@Inlustris
Copy link
Author

@lukebakken
Sorry, for the time it took but i can confirm both issues are solved.
I have replicated both issues and have confirmed both cases as solved.

  • repeating recovery loop creating multiple connections.
    • The loop does not occur anymore.
  • not finishing all recoveries successfully but telling it has.
    • Bindings now do successfully get restored

@lukebakken
Copy link
Contributor

lukebakken commented Mar 2, 2022

Great thank you! There's never a rush on testing things out 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants