Consumer connection recovery does not stop on success when RabbitMQ node is restarted #1061
Comments
What version of the client is used? Can you share an example repository that we can use to reproduce? |
It is not a completely successful recovery. The consumers are not yet consuming when the loop restarts. The connection and channels are recreated, but the consumers are not. I think it goes wrong when restoring the channel listeners. Client version used:
The code above is the basic part of the implementation I'm using, where the ConsumerConnection is used for declaring exchanges and queues. Until now I have focused on the consuming side of the application and have not yet tested the publishing side (whether that connection is restored correctly). It will probably work, because there is no channel or listener to restore. I will test this now and come back with the results. |
I think there are two possible issues.
I'm trying to rule out several options by rebuilding my solution directly in one application. So far, the use of DI has not broken anything in an otherwise working application.
I think the ConcurrentDictionary, in combination with DI, is preventing the recovery from completing. |
I came to the conclusion that I did not know how to debug some parts of the RabbitMQ client (which I have only just figured out). To give a better picture of what is happening, the following exception occurs.
I can't work out what I missed. It looks like it is trying to recover from the connection that was shut down, not the recovered one. |
I got a bit further, but now I'm stuck at the following log:
This is odd because of the order in which I got the entries. Looking at the code in AutorecoveringConnection, I should not get the log line "Informational: Connection recovery completed" before everything is recovered, am I right?
|
If any of the See server log for clues as most topology recovery [protocol-level] exceptions produce specific error messages. |
It's like in the opening message of this issue: I do not get errors or connection issues. Is it possible that I am missing logging? (RabbitMQ runs in Docker and I am looking at the logs in the container.) It looks like the client cannot reach RabbitMQ to declare the binding because there is no channel or connection. Which connection or channel is the client trying to use to redeclare the topology?
|
@michaelklishin I think I found my issue. In my package I separated several dependencies, and the declaration and binding for the application are done in one go on one channel. This channel is finished at the end of my start-up, so I close it. To check my case, I closed this channel with my own "ReplyCode" and "ReplyText". For example:
This results in the following error on recovery.
Can I conclude this is a bug, or am I using it wrong? Because of the channel-per-thread limitation and the use of DI to declare all of the exchanges, queues and bindings on start-up, I cannot keep it thread safe if I have to do this with the channels I use for consuming. On top of that, my publishing channels will need some sort of declaration too, and those channels don't exist until I need them. |
I took some time to refine the story. (Sorry for all the messages; I'm learning by playing with it.)
I figured out that my channel is already closed at the point where recovery tries to use it. I closed it because I was done configuring. The resulting "Not able to recover binding" error blocks the full recovery from completing, so it stops before it gets to the consumers. All connections and channels are recovered at this point, and the process repeats every 5 seconds.
While looking into it I wondered how this class (the old channel) could still exist. I used it within a constructor, so it should have been disposed. To resolve this I wrapped the channel code in a using statement, which disposes the channel correctly. With that change the recovery continues on to the consumers and no longer repeats every 5 seconds.
That brings me to the following exception within the recovery process, where it tries to recover bindings from the IModel that is now disposed. Because this is an unknown error for the recovery process, it ignores this state and continues, which lets the recovery complete.
This still results in an incomplete recovery, because the binding could not be recovered. Reasons:
Both of them result in the effect that the binding is not recovered.
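The scenario described above can be sketched roughly as follows. This is only an illustration, assuming RabbitMQ.Client 6.x and a broker on localhost; the names `demo.exchange`, `demo.queue` and `demo.key` are hypothetical and do not come from the original application.

```csharp
using System;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

public static class ClosedSetupChannelSketch
{
    public static void Main()
    {
        // Hypothetical broker address; everything else uses client defaults.
        var factory = new ConnectionFactory { HostName = "localhost" };
        using IConnection connection = factory.CreateConnection("consumer");

        // Topology is declared on a short-lived configuration channel...
        using (IModel setup = connection.CreateModel())
        {
            setup.ExchangeDeclare("demo.exchange", ExchangeType.Direct, durable: true);
            setup.QueueDeclare("demo.queue", durable: true, exclusive: false, autoDelete: false);
            setup.QueueBind("demo.queue", "demo.exchange", "demo.key");
        } // ...which is disposed here, long before any broker restart.

        // Consumers live on a separate channel that stays open.
        IModel consuming = connection.CreateModel();
        var consumer = new EventingBasicConsumer(consuming);
        consumer.Received += (sender, ea) =>
            Console.WriteLine($"received {ea.Body.Length} bytes");
        consuming.BasicConsume("demo.queue", true, consumer);

        Console.WriteLine("Restart the broker, then check whether the binding and consumer come back.");
        Console.ReadLine();
    }
}
```

Restarting the broker while this runs should exercise the path discussed in this thread, where the binding was recorded against a channel that no longer exists.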
|
I would like to help build a solution for this, but I need to know whether you agree that this is a bug. My idea is to create a new IModel for replaying all topology settings (a RecoveryChannel), so that users of the package are not required to keep such channels open for the lifetime of their application. It is not yet clear to me where this needs to be added. @michaelklishin Can you give me some information on this issue? |
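Until the library changes, one possible user-side workaround is to keep a single configuration channel open for the lifetime of the connection and serialize access to it. The sketch below assumes, as this thread suggests, that 6.x replays a binding through the channel that declared it. `TopologyChannel` is a hypothetical helper and not part of RabbitMQ.Client.

```csharp
using System;
using RabbitMQ.Client;

// Hypothetical helper, not part of RabbitMQ.Client: owns one long-lived
// channel that is used only for declarations and bindings.
public sealed class TopologyChannel : IDisposable
{
    private readonly IModel _channel;
    private readonly object _gate = new object();

    public TopologyChannel(IConnection connection)
    {
        _channel = connection.CreateModel();
    }

    public void DeclareAndBind(string exchange, string queue, string routingKey)
    {
        // IModel instances should not be used concurrently, so every
        // declaration goes through the same lock.
        lock (_gate)
        {
            _channel.ExchangeDeclare(exchange, ExchangeType.Direct, durable: true);
            _channel.QueueDeclare(queue, durable: true, exclusive: false, autoDelete: false);
            _channel.QueueBind(queue, exchange, routingKey);
        }
    }

    public void Dispose()
    {
        // Dispose only when the application shuts down, not after start-up.
        lock (_gate)
        {
            _channel.Dispose();
        }
    }
}
```

Registering this as a singleton in a DI container would let start-up code and later publishers share one declaring channel, at the cost of holding one extra channel open.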
Hmmmmmmm What I'm wondering is this:
Now what happens?
The question is what the correct behavior should be. In general it doesn't matter which channel creates exchanges, queues and bindings; it does matter for consumers (as that channel is the one receiving deliveries). But somehow we're remembering which channel created them. Couldn't we switch to a model where: Sidenote: The behavior has been like this for a long time, and I'm somewhat surprised we got this far without someone running into this issue. @michaelklishin @stebet Tagging you to verify my assumption and the proposal for fixing it. |
Also, after scanning through the PR of @Inlustris: it basically already implements what I suggested (except for using an existing channel if there is one instead of creating a temporary one). I think this is a significant bug in the library, but since it hasn't come up earlier, I guess it's not that common to define things on a channel that gets closed later. So I guess we'd only fix it in 7.0 and not 6.x? |
@bollhals I get why it would be ideal to use an existing one when thinking about resources, and it would be possible. But by always using a new one you stay in control of the recovery process. The newly created channel belongs to the client itself, not to the user. This results in a more stable recovery that the user cannot interrupt by closing the channel while recovering. Can I ask when the 7.0 release is due? I'm implementing on 6.2.2 now and would rather not program around the issue. Can you also take a look at the second issue I created, #1067? It is a related follow-up issue. This won't fix the whole problem and is only a partial fix. |
Well if that happens it will just retry again. So no harm done, but surely opening up a new one works as well. I guess it's up to the maintainer what they prefer.
I do not know yet. The last thing I read was that there should be some betas before it is released, but @michaelklishin should be able to answer that.
I will. |
Once we are done with #1067 and this one (that does not mean we find the perfect solution, just improve things meaningfully) we can cut a 7.0 preview release. |
By the way, I'd be up for producing a |
I have a similar issue. A simple way to reproduce the bug:
Just FYI |
It would be great if people who commented on this issue could try to reproduce it using version 6.2.4 of this library. As suggested in this comment, PR #1145 may have fixed this issue as well. |
@lukebakken I see milestone 6.2.5 was added, but you are asking about 6.2.4. Is this fixed in 6.2.4? |
Yes, please test. I put this in 6.2.5 so I wouldn't forget about it. |
@lukebakken
|
Great thank you! There's never a rush on testing things out 😄 |
To make the problem easier to understand:
With the ConnectionFactory I create a consumer connection and a publisher connection. The default settings of the connection and the ConnectionFactory are unchanged, which should result in automatic recovery of the connections.
ConsumerConnection = AutoRecoverable
PublisherConnection = AutoRecoverable
For testing purposes I restarted RabbitMQ.
Logging in the connection shutdown event handler produced "CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'" (which is, of course, the expected outcome).
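A setup like the one described can be sketched as below, mainly to show where shutdown and recovery events can be logged. It assumes RabbitMQ.Client 6.x and a broker on localhost; the recovery properties are set explicitly here although they are believed to match the 6.x defaults, and the connection names are only illustrative.

```csharp
using System;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

public static class ConnectionSetupSketch
{
    public static void Main()
    {
        var factory = new ConnectionFactory
        {
            HostName = "localhost",                 // hypothetical broker address
            AutomaticRecoveryEnabled = true,        // reconnect after a broker restart
            TopologyRecoveryEnabled = true,         // redeclare exchanges, queues, bindings, consumers
            NetworkRecoveryInterval = TimeSpan.FromSeconds(5)
        };

        using IConnection consumerConnection = factory.CreateConnection("consumer");
        using IConnection publisherConnection = factory.CreateConnection("publisher");

        LogRecovery("consumer", consumerConnection);
        LogRecovery("publisher", publisherConnection);

        Console.WriteLine("Restart the broker now; press Enter to exit.");
        Console.ReadLine();
    }

    private static void LogRecovery(string name, IConnection connection)
    {
        connection.ConnectionShutdown += (sender, args) =>
            Console.WriteLine($"[{name}] shutdown: {args.ReplyCode} {args.ReplyText}");

        // The recovery events live on the auto-recovering interface in 6.x.
        if (connection is IAutorecoveringConnection recovering)
        {
            recovering.RecoverySucceeded += (sender, args) =>
                Console.WriteLine($"[{name}] recovery succeeded");
            recovering.ConnectionRecoveryError += (sender, args) =>
                Console.WriteLine($"[{name}] recovery error: {args.Exception.Message}");
        }
    }
}
```

Logging `ConnectionRecoveryError` alongside the shutdown event makes it easier to see whether a reconnect attempt failed or whether the connection came back and topology recovery then went wrong.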
Within RabbitMQ my connections look like the following table:
When I restart RabbitMQ, the client starts trying to recover as described in the documentation:
The result is that it recreates the connections for this application together with their channels.
My publishing connection is recreated with 0 channels, which is correct. My consumer connection is recreated with 3 channels, which at this point is correct too.
My issue is that it does not recover this consumer connection correctly. It is recreated every 5 to 10 seconds, which results in X connections, all with the correct 3 channels, but none of them behaves as fully recovered.
Example of the results after restart: (1 Publish Connection, X amount of consumer Connections)
RabbitMQ logging repeating itself: