I have a repeatable job with repeat: { pattern: "0 0 4 * * *" } (every day at 4 AM). The problem is that if the connection to the Redis database is lost at least once during the day, the job is not processed at 4 AM. It is processed later, possibly many hours later, depending on when the connection was lost (see below).
I have made a demo GitHub repository to demonstrate the problem without having to wait such a long time. Instead of repeat: { pattern: "0 0 4 * * *" } I have used repeat: { every: 5 * 60 * 1000 } (every 5 minutes), and the problem still manifests itself.
Expected behavior
A repeatable job should be executed ("Job done" logged to console) every 5 minutes. If you simulate the temporary loss of connection to Redis, the job should still be executed on the next schedule.
For example:
2023-02-03T12:00:00.000Z: Job done
2023-02-03T12:05:00.000Z: Job done
2023-02-03T12:10:00.000Z: Job done
2023-02-03T12:13:00.000Z: (Temporary loss of connection to Redis lasting 10 seconds)
2023-02-03T12:15:00.000Z: Job done
2023-02-03T12:20:00.000Z: Job done
Actual behavior:
If the connection to Redis is lost, the repeatable job is eventually executed, but not on the next schedule. Instead, it is executed 5 minutes after the Redis connection problem occurs.
For example:
2023-02-03T12:00:00.000Z: Job done
2023-02-03T12:05:00.000Z: Job done
2023-02-03T12:10:00.000Z: Job done
2023-02-03T12:13:00.000Z: (Temporary loss of connection to Redis lasting 10 seconds)
2023-02-03T12:18:00.000Z: Job done <--- This job is 3 minutes late.
2023-02-03T12:20:00.000Z: Job done
2023-02-03T12:25:00.000Z: Job done
While the delay of 3 minutes does not seem like a big deal, it can turn into hours if the interval is 24 hours instead of 5 minutes.
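To make the arithmetic concrete, here is a minimal sketch of the behavior observed above: the job runs one full interval after the connection problem instead of one interval after the previous run. The helper `observedRun` is hypothetical (it is not part of BullMQ), the 10-second outage duration is ignored for simplicity, and the daily example assumes a disconnection at 15:00 UTC purely for illustration.

```typescript
// Hypothetical helper, not a BullMQ API: models the behavior described in this report.
// The scheduled run is lastRun + interval, but the observed run is disconnect + interval.
function observedRun(lastRunIso: string, disconnectIso: string, intervalMs: number) {
  const scheduled = Date.parse(lastRunIso) + intervalMs;
  const observed = Date.parse(disconnectIso) + intervalMs;
  return {
    scheduled: new Date(scheduled).toISOString(),
    observed: new Date(observed).toISOString(),
    lateMs: observed - scheduled,
  };
}

const MINUTE = 60 * 1000;
const HOUR = 60 * MINUTE;

// The 5-minute example from this report: last run at 12:10, disconnection at 12:13.
const fiveMinute = observedRun(
  "2023-02-03T12:10:00.000Z",
  "2023-02-03T12:13:00.000Z",
  5 * MINUTE,
);
console.log(fiveMinute.observed, fiveMinute.lateMs / MINUTE); // 12:18, 3 minutes late

// An assumed daily case: last run at 04:00, disconnection at 15:00 the same day.
const daily = observedRun(
  "2023-02-03T04:00:00.000Z",
  "2023-02-03T15:00:00.000Z",
  24 * HOUR,
);
console.log(daily.observed, daily.lateMs / HOUR); // 15:00 the next day, 11 hours late
```

In this model the lateness equals the time between the previous run and the disconnection, so the later in the interval the connection drops, the longer the delay.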
I have tracked down the cause of this issue. When you issue a BRPOPLPUSH command to Redis (using IORedis), the command blocks until either an item is available in the source list or the timeout expires. BullMQ uses this timeout to wait until the next delayed job is due to be processed. However, when the connection drops and is later re-established, IORedis simply re-issues the same command with the same timeout; it does not take into account the time the command had already waited before the disconnection, or the time spent disconnected. This makes sense from IORedis's perspective, so I do not think it can be classed as a bug there, but I still need to find a workaround. The easiest option would be to cap the timeout at a small number of seconds, on the order of 10 seconds. A proper solution, however, would be to detect the disconnection, cancel the BRPOPLPUSH command, wait for the reconnection, and issue a new BRPOPLPUSH command with an updated timeout value.
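The capped-timeout workaround can be illustrated with a small simulation. This is only a sketch under stated assumptions, not BullMQ's implementation: `nextBlockTimeoutMs` and `simulateWakeup` are hypothetical names, time is a simulated millisecond clock starting at 0, and the reconnection is modeled as replaying the in-flight blocking command with its original timeout, as described above.

```typescript
// Hypothetical helper: how long the next blocking call may wait.
// The remaining time is recomputed on every call and capped at maxBlockMs.
// A result of 0 means "the job is due, do not block": passing 0 to BRPOPLPUSH
// itself would block indefinitely, so the caller must handle it separately.
function nextBlockTimeoutMs(nextJobAtMs: number, nowMs: number, maxBlockMs = 10 * 1000): number {
  const remaining = Math.max(0, nextJobAtMs - nowMs);
  return Math.min(remaining, maxBlockMs);
}

// Simulates a worker loop with a single reconnection at reconnectAtMs.
// Returns the simulated time at which the worker notices the job is due.
function simulateWakeup(nextJobAtMs: number, reconnectAtMs: number, maxBlockMs: number): number {
  let now = 0;
  let replayed = false;
  while (true) {
    const timeout = nextBlockTimeoutMs(nextJobAtMs, now, maxBlockMs);
    if (timeout === 0) return now; // job is due
    const end = now + timeout;
    if (!replayed && now <= reconnectAtMs && reconnectAtMs < end) {
      // The in-flight command is re-issued on reconnection with the same timeout.
      now = reconnectAtMs + timeout;
      replayed = true;
    } else {
      now = end;
    }
  }
}

const FIVE_MIN = 5 * 60 * 1000;
const uncapped = simulateWakeup(FIVE_MIN, 3 * 60 * 1000, Infinity);
const capped = simulateWakeup(FIVE_MIN, 3 * 60 * 1000, 10 * 1000);
const cappedWorstCase = simulateWakeup(FIVE_MIN, 295 * 1000, 10 * 1000);
console.log(uncapped - FIVE_MIN);        // lateness with a single long blocking call
console.log(capped - FIVE_MIN);          // lateness with a 10-second cap
console.log(cappedWorstCase - FIVE_MIN); // reconnection shortly before the job is due
```

With a single long blocking call, a reconnection at the 3-minute mark delays the wake-up by 3 minutes, matching the report. With the cap, the delay cannot exceed the cap itself, at the cost of issuing a blocking command every few seconds.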
There is now a fix for this issue, #2543, which forces a timeout on the blocking command if it has been blocking for 1 second longer than expected. So when a reconnection happens during a call to a blocking command, the delay will be at most 1 second.
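The idea behind such a watchdog can be sketched as follows. This is an illustration of the general technique only, not the code from #2543: the `BlockingCall` type, the function names, and the choice to measure from the moment the command was first issued are all assumptions.

```typescript
// Hypothetical model of an in-flight blocking command.
interface BlockingCall {
  issuedAtMs: number; // when the command was first issued
  timeoutMs: number;  // the timeout it was issued with
}

// Latest moment the call is allowed to keep blocking: expected end plus a grace period.
function forceDeadlineMs(call: BlockingCall, graceMs = 1000): number {
  return call.issuedAtMs + call.timeoutMs + graceMs;
}

// True once the call has been blocking longer than expected plus the grace period,
// for example because it was silently re-issued after a reconnection.
function mustForceTimeout(call: BlockingCall, nowMs: number, graceMs = 1000): boolean {
  return nowMs > forceDeadlineMs(call, graceMs);
}

// The example from this report: command issued at 12:10 with a 5-minute timeout.
const issuedAt = Date.parse("2023-02-03T12:10:00.000Z");
const call: BlockingCall = { issuedAtMs: issuedAt, timeoutMs: 5 * 60 * 1000 };

console.log(new Date(forceDeadlineMs(call)).toISOString());
console.log(mustForceTimeout(call, Date.parse("2023-02-03T12:15:00.500Z")));
console.log(mustForceTimeout(call, Date.parse("2023-02-03T12:15:02.000Z")));
```

Because the deadline is anchored to the original issue time rather than to the reconnection, the extra delay in this model is bounded by the grace period regardless of when the connection dropped.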
More details and instructions on how to reproduce the problem are in the repo’s readme: https://github.com/butavicius/bullmq-test