Jobs stuck in delayed state indefinitely #1656
Comments
I have tracked down a possible cause of this issue to this other one: redis/ioredis#1718
Ok, I checked now and it seems that a new release has been published. Version 3.6.2 should solve this issue, but only if you do not use a custom connection retry strategy that retries very quickly, within 50-100ms or so.
I'll upgrade to version 3.6.2 and post an update here if this fixes the issue, which most probably is the case! Thank you!
I am experiencing an issue that seems related. In my case jobs don't get processed in time after a temporary disconnection to Redis. I have made a demo repo to demonstrate the bug. I have updated bullmq to 3.6.2 but the problem is still there: https://github.com/butavicius/bullmq-test
@butavicius In your repo it seems like jobs are processed after reconnection. If this is the case, then it is not the same issue.
@butavicius I have tried your repo and I cannot reproduce the issue, but I have not used the 5-minute repetition rate, as that takes too long; 5 seconds works as it should. Is it only reproducible with 5-minute repetitions? Btw, can you create a separate issue for this?
@manast Unfortunately 3.6.2 did not solve the issue, so this must be something else. I'll try to reproduce it locally and share a repo.
@paweltatarczuk Strange, because what you write in the issue matches 100% the symptoms of this issue, i.e. BRPOPLPUSH hanging forever and the worker being correctly reconnected.
I'll investigate this in more depth in the upcoming days and get back with, I hope, new information. The worst part is that it only happens in production, randomly, after a couple of hours, so reproducing this properly is difficult.
You are right, my issue seems to be separate. Leaving a link to it in case someone wants to follow it: #1658
This happened again, but this time I double checked whether there is an active connection for the queue, and there is not. I can see multiple … Does it mean that the worker is not running? I have a health check in place that checks if …
@paweltatarczuk What settings are you using for your Worker connections?
There is a new chapter in the documentation that has some information regarding reconnections that could be useful in these cases: https://docs.bullmq.io/guide/going-to-production#automatic-reconnections
I'm using …
Here's a repository that reproduces the issue: … It freezes every time I run it.
In my case the worker freezes right after the first job and never picks up the second one, even though it should after 10 seconds. Now that I'm able to replicate this issue, I'll try to investigate it deeper over the weekend.
Thanks for taking the time to provide easy-to-reproduce test code. What I could see is that if you change to a slower retry strategy, it works properly:
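The exact snippet from this comment was not preserved in this copy of the thread; based on the surrounding discussion (retries in the 50-100ms range trigger the hang, while retries of a couple of seconds do not), a slower retry strategy might look roughly like this, with illustrative numbers:

```ts
import { Worker } from 'bullmq';

// Sketch of a "slower" retry strategy along the lines described above.
// retryStrategy returns the delay in ms before the next reconnection attempt.
const worker = new Worker(
  'my-queue',
  async (job) => {
    // ... process the job
  },
  {
    connection: {
      host: 'localhost',
      port: 6379,
      // Wait at least 2 seconds between reconnection attempts,
      // backing off up to 20 seconds.
      retryStrategy: (times: number) => Math.min(Math.max(times * 1000, 2000), 20000),
    },
  },
);
```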
So it seems like the default retryStrategy provided by ioredis already suffers from this issue, which is not good at all: redis/ioredis#1718
I think I can create a workaround for these kinds of issues, like making sure the retry time is always larger than 2 seconds, but it may not work at all times since we do not yet know the underlying cause of the issue. It may be a hazard that manifests itself easily with fast retries, but it could also be the case that it manifests itself in other situations even with a larger retry time.
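As a rough, hypothetical sketch of that idea on the user side (not something BullMQ implements), a custom retry strategy could be clamped to a 2-second minimum like this:

```ts
// Hypothetical helper: wrap any ioredis retryStrategy so it never retries
// faster than a given minimum delay, mirroring the workaround idea above.
type RetryStrategy = (times: number) => number | void | null;

function withMinimumDelay(strategy: RetryStrategy, minMs = 2000): RetryStrategy {
  return (times) => {
    const delay = strategy(times);
    // Pass through "stop retrying" signals (null/undefined) unchanged.
    if (typeof delay !== 'number') return delay;
    return Math.max(delay, minMs);
  };
}

// Usage: clamp an aggressive strategy that would otherwise retry every 50-100ms.
const fastStrategy: RetryStrategy = (times) => Math.min(times * 50, 100);
const safeStrategy = withMinimumDelay(fastStrategy);
```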
I can confirm that adjusting the retry strategy "resolves" the issue. I'll keep an eye on redis/ioredis#1718 then.
Turns out the …
Closing this now since the issue has been resolved in ioredis version 5.3.1.
I have a queue with mostly delayed jobs which get stuck in the delayed state indefinitely. Restarting the worker solves the issue temporarily, but it recurs on a regular basis.
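For context, the delayed jobs are added with BullMQ's standard delay option, roughly along these lines (a simplified sketch; the queue name, payload, and delay are only illustrative):

```ts
import { Queue } from 'bullmq';

// Simplified sketch of how the delayed jobs are created; names and values
// are illustrative, not taken from the actual application.
const queue = new Queue('delayed-queue', {
  connection: { host: 'localhost', port: 6379 },
});

// The job should move from the delayed state to wait after 5 minutes,
// but in this report it stays delayed indefinitely.
await queue.add('send-reminder', { userId: 123 }, { delay: 5 * 60 * 1000 });
```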
I was trying to replicate the issue, but the cause is unclear to me and I'm not sure how to proceed with investigating what is happening here.
From what I understand, everything stalls when the worker is waiting for a job and calls BRPOPLPUSH. The client list shows there is a connection for this command. There are no errors being emitted by the worker either. It looks like BRPOPLPUSH is not timing out properly, and unless a new job in the active/wait state is added or the worker is restarted, it never ends.
Is there anything I can do to prevent this from happening, or to find the root cause of this issue? I'm happy to work on a PR once it is clear what is causing this behavior.
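The client-list output itself was not captured in this copy of the issue; a check along these lines (a sketch using ioredis, with placeholder connection details) shows whether a client is currently blocked on BRPOPLPUSH:

```ts
import Redis from 'ioredis';

// Sketch: inspect connected clients and look for one blocked on BRPOPLPUSH.
// CLIENT LIST output contains a "cmd=" field with each client's last command.
const redis = new Redis({ host: 'localhost', port: 6379 });

const clients = String(await redis.client('LIST'));
const blocked = clients
  .split('\n')
  .filter((line) => line.includes('cmd=brpoplpush'));

console.log(blocked.length > 0 ? blocked.join('\n') : 'no BRPOPLPUSH client found');

await redis.quit();
```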
Redis version: 7.0.5 (DigitalOcean Managed Redis)
BullMQ version: 3.5.11