
Delayed jobs not processed on time after temporary loss of connection with Redis #1658

Closed
simas-b opened this issue Feb 6, 2023 · 2 comments
Labels
bug Something isn't working

Comments


simas-b commented Feb 6, 2023

I have a repeatable job with repeat: { pattern: "0 0 4 * * *" } (every day at 4 AM). The problem is that if the connection to the Redis database is lost at least once during the day, the job is not processed at 4 AM. It is processed at a later time, possibly many hours later, depending on when the connection was lost (see below).

I have made a demo GitHub repository to demonstrate the problem without having to wait such a long time. Instead of repeat: { pattern: "0 0 4 * * *" } I have used repeat: { every: 5 * 60 * 1000 } (every 5 minutes), and the problem still manifests itself.
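For `every`-style repeats, the timestamps in the logs below land exactly on 5-minute boundaries, which suggests runs are aligned to multiples of the interval since the epoch. A minimal sketch of that scheduling arithmetic (the helper name is hypothetical, not BullMQ API):

```javascript
// Hypothetical helper (not part of BullMQ): compute the next run time for
// an `every`-style repeat, aligned to multiples of the interval since epoch.
function nextRunTime(everyMs, nowMs) {
  return Math.floor(nowMs / everyMs) * everyMs + everyMs;
}

const everyMs = 5 * 60 * 1000; // every 5 minutes
const nowMs = Date.parse("2023-02-03T12:02:30.000Z");
console.log(new Date(nextRunTime(everyMs, nowMs)).toISOString());
// Next 5-minute boundary after 12:02:30 is 12:05:00
```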

Expected behavior

A repeatable job should be executed ("Job done" logged to console) every 5 minutes. If you simulate the temporary loss of connection to Redis, the job should still be executed on the next schedule.

For example:

2023-02-03T12:00:00.000Z: Job done
2023-02-03T12:05:00.000Z: Job done
2023-02-03T12:10:00.000Z: Job done
2023-02-03T12:13:00.000Z: (Temporary loss of connection to Redis lasting 10 seconds)
2023-02-03T12:15:00.000Z: Job done
2023-02-03T12:20:00.000Z: Job done

Actual behavior

If the connection to Redis is lost, the repeatable job is eventually executed, but not on the next scheduled run. Instead, it is executed one full interval (here, 5 minutes) after the connection problem occurs.

For example:

2023-02-03T12:00:00.000Z: Job done
2023-02-03T12:05:00.000Z: Job done
2023-02-03T12:10:00.000Z: Job done
2023-02-03T12:13:00.000Z: (Temporary loss of connection to Redis lasting 10 seconds)
2023-02-03T12:18:00.000Z: Job done <--- This job is 3 minutes late.
2023-02-03T12:20:00.000Z: Job done
2023-02-03T12:25:00.000Z: Job done

While a 3-minute delay does not seem like a big deal, it can turn into hours if the interval is 24 hours instead of 5 minutes.
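The arithmetic behind that observation can be sketched as follows: if, after a reconnect, the job fires one full interval after the disconnect instead of at the next scheduled boundary, the observed lateness is the gap between those two times (all names here are illustrative, not BullMQ code):

```javascript
// Illustrative model of the buggy behaviour described above: after a
// reconnect, the job fires a full interval after the disconnect time,
// not at the next scheduled interval boundary.
function observedDelayMs(disconnectMs, everyMs) {
  const actualRun = disconnectMs + everyMs; // buggy: wait restarts from zero
  const scheduledRun = Math.floor(disconnectMs / everyMs) * everyMs + everyMs; // expected boundary
  return actualRun - scheduledRun;
}

const everyMs = 5 * 60 * 1000;
const disconnectMs = Date.parse("2023-02-03T12:13:00.000Z");
console.log(observedDelayMs(disconnectMs, everyMs) / 60000); // 3 (minutes late)
```

With a 24-hour interval, the same formula can yield a delay of almost a full day, matching the original 4 AM report.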

More details and instructions on how to reproduce the problem are in the repo’s readme: https://github.com/butavicius/bullmq-test


manast commented Feb 8, 2023

I have nailed down the cause of this issue. When you issue a BRPOPLPUSH command to Redis (via IORedis), the command blocks until either an item is available in the source list or it times out. BullMQ uses that timeout value to wait until the next delayed job is due to be processed. However, when there is a disconnection followed by a reconnection, IORedis simply re-issues the same command with the same timeout; it does not account for the time the command had already waited before the disconnection, nor for the time it was disconnected. This kind of makes sense from IORedis's perspective, so I do not think it can be classed as a bug there. I need to find a workaround though. The easiest would be to cap the timeout at a small number of seconds, on the order of 10 seconds. A proper solution, however, would be to detect the disconnection, cancel the BRPOPLPUSH command, wait for reconnection, and issue a new BRPOPLPUSH with an updated timeout value.
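The behaviour described above can be sketched as plain arithmetic, assuming the client re-issues the blocking command with its original timeout after reconnecting (all names here are illustrative, not IORedis API):

```javascript
// Illustrative simulation (not IORedis code): total time a worker ends up
// blocked when the client re-issues BRPOPLPUSH with the ORIGINAL timeout
// after a disconnect, versus re-issuing with only the REMAINING timeout.
function totalWaitMs(timeoutMs, waitedBeforeDropMs, downtimeMs) {
  // Naive reconnect: the full timeout starts over from zero.
  const naive = waitedBeforeDropMs + downtimeMs + timeoutMs;
  // Adjusted reconnect: subtract the time already spent waiting.
  const remaining = Math.max(0, timeoutMs - waitedBeforeDropMs - downtimeMs);
  const adjusted = waitedBeforeDropMs + downtimeMs + remaining;
  return { naive, adjusted };
}

// 5-minute timeout, disconnect after 3 minutes of waiting, 10 s of downtime.
const result = totalWaitMs(5 * 60 * 1000, 3 * 60 * 1000, 10 * 1000);
console.log(result.naive / 60000, result.adjusted / 60000);
// naive is about 8.17 minutes; adjusted is 5 minutes
```

The naive path is what produces the "full interval after the disconnect" behaviour in the logs above; the adjusted path corresponds to the proper solution sketched in the comment.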


manast commented May 3, 2024

There is a fix for this issue now in #2543, which forces the blocking command to be re-issued if it has been blocking for 1 second longer than expected. So for cases where a reconnection happens during a call to a blocking command, there will be at most a 1-second delay.
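The watchdog idea behind that fix can be sketched as a simple overdue check (this is a simplified illustration, not the actual #2543 code):

```javascript
// Simplified sketch of the watchdog idea (not the actual BullMQ #2543 code):
// if a blocking call has been outstanding for longer than its timeout plus a
// 1-second grace period, force a reconnect so the command is re-issued with
// a fresh, correct timeout.
const GRACE_MS = 1000;

function shouldForceReconnect(startedAtMs, timeoutMs, nowMs) {
  return nowMs - startedAtMs > timeoutMs + GRACE_MS;
}

console.log(shouldForceReconnect(0, 5000, 5500)); // false: still within grace
console.log(shouldForceReconnect(0, 5000, 6500)); // true: 1.5 s overdue
```

Because the check fires at most one grace period after the timeout should have expired, the worst-case extra delay is bounded by that 1 second, as the comment above states.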

@manast manast closed this as completed May 3, 2024