
Delayed jobs not processed on time after temporary loss of connection with Redis #1658

Closed
simas-b opened this issue Feb 6, 2023 · 2 comments
Labels
bug Something isn't working

Comments


simas-b commented Feb 6, 2023

I have a repeatable job with repeat: { pattern: "0 0 4 * * *" } (every day at 4 AM). The problem is that if the connection to the Redis database is lost at least once during the day, the job is not processed at 4 AM. It is processed at a later time, possibly many hours later, depending on when the connection was lost (see below).

I have made a demo GitHub repository to demonstrate the problem without having to wait such a long time. Instead of repeat: { pattern: "0 0 4 * * *" } I have used repeat: { every: 5 * 60 * 1000 } (every 5 minutes), and the problem still manifests itself.
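For `every`-style repeats, the timestamps in the logs below land exactly on 5-minute boundaries, which suggests runs are aligned to multiples of the interval since the epoch. A minimal sketch of that scheduling arithmetic (the helper name is hypothetical, not BullMQ API):

```javascript
// Hypothetical helper (not part of BullMQ): compute the next run time for
// an `every`-style repeat, aligned to multiples of the interval since epoch.
function nextRunTime(everyMs, nowMs) {
  return Math.floor(nowMs / everyMs) * everyMs + everyMs;
}

const everyMs = 5 * 60 * 1000; // every 5 minutes
const nowMs = Date.parse("2023-02-03T12:02:30.000Z");
console.log(new Date(nextRunTime(everyMs, nowMs)).toISOString());
// Next 5-minute boundary after 12:02:30 is 12:05:00
```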

Expected behavior

A repeatable job should be executed ("Job done" logged to console) every 5 minutes. If you simulate the temporary loss of connection to Redis, the job should still be executed on the next schedule.

For example:

2023-02-03T12:00:00.000Z: Job done
2023-02-03T12:05:00.000Z: Job done
2023-02-03T12:10:00.000Z: Job done
2023-02-03T12:13:00.000Z: (Temporary loss of connection to Redis lasting 10 seconds)
2023-02-03T12:15:00.000Z: Job done
2023-02-03T12:20:00.000Z: Job done

Actual behavior

If the connection to Redis is lost, the repeatable job is eventually executed, but not on the next scheduled run. Instead, it is executed one full interval (here, 5 minutes) after the connection problem occurs.

For example:

2023-02-03T12:00:00.000Z: Job done
2023-02-03T12:05:00.000Z: Job done
2023-02-03T12:10:00.000Z: Job done
2023-02-03T12:13:00.000Z: (Temporary loss of connection to Redis lasting 10 seconds)
2023-02-03T12:18:00.000Z: Job done <--- This job is 3 minutes late.
2023-02-03T12:20:00.000Z: Job done
2023-02-03T12:25:00.000Z: Job done

While a 3-minute delay does not seem like a big deal, it can turn into hours if the interval is 24 hours instead of 5 minutes.
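The arithmetic behind that observation can be sketched as follows: if, after a reconnect, the job fires one full interval after the disconnect instead of at the next scheduled boundary, the observed lateness is the gap between those two times (all names here are illustrative, not BullMQ code):

```javascript
// Illustrative model of the buggy behaviour described above: after a
// reconnect, the job fires a full interval after the disconnect time,
// not at the next scheduled interval boundary.
function observedDelayMs(disconnectMs, everyMs) {
  const actualRun = disconnectMs + everyMs; // buggy: wait restarts from zero
  const scheduledRun = Math.floor(disconnectMs / everyMs) * everyMs + everyMs; // expected boundary
  return actualRun - scheduledRun;
}

const everyMs = 5 * 60 * 1000;
const disconnectMs = Date.parse("2023-02-03T12:13:00.000Z");
console.log(observedDelayMs(disconnectMs, everyMs) / 60000); // 3 (minutes late)
```

With a 24-hour interval, the same formula can yield a delay of almost a full day, matching the original 4 AM report.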

More details and instructions on how to reproduce the problem are in the repo’s readme: https://github.com/butavicius/bullmq-test


manast commented Feb 8, 2023

I have nailed down the cause of this issue. When you issue a BRPOPLPUSH command to Redis (via IORedis), the command blocks until either an item is available in the source list or it times out. BullMQ uses that timeout value to wait until the next delayed job is due to be processed. However, when there is a disconnection followed by a reconnection, IORedis simply re-issues the same command with the same timeout; it does not account for the time the command had already waited before the disconnection, nor for the time it was disconnected. This kind of makes sense from IORedis's perspective, so I do not think it can be classed as a bug there. I need to find a workaround though. The easiest would be to cap the timeout at a small number of seconds, on the order of 10 seconds. A proper solution, however, would be to detect the disconnection, cancel the BRPOPLPUSH command, wait for reconnection, and issue a new BRPOPLPUSH with an updated timeout value.
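The behaviour described above can be sketched as plain arithmetic, assuming the client re-issues the blocking command with its original timeout after reconnecting (all names here are illustrative, not IORedis API):

```javascript
// Illustrative simulation (not IORedis code): total time a worker ends up
// blocked when the client re-issues BRPOPLPUSH with the ORIGINAL timeout
// after a disconnect, versus re-issuing with only the REMAINING timeout.
function totalWaitMs(timeoutMs, waitedBeforeDropMs, downtimeMs) {
  // Naive reconnect: the full timeout starts over from zero.
  const naive = waitedBeforeDropMs + downtimeMs + timeoutMs;
  // Adjusted reconnect: subtract the time already spent waiting.
  const remaining = Math.max(0, timeoutMs - waitedBeforeDropMs - downtimeMs);
  const adjusted = waitedBeforeDropMs + downtimeMs + remaining;
  return { naive, adjusted };
}

// 5-minute timeout, disconnect after 3 minutes of waiting, 10 s of downtime.
const result = totalWaitMs(5 * 60 * 1000, 3 * 60 * 1000, 10 * 1000);
console.log(result.naive / 60000, result.adjusted / 60000);
// naive is about 8.17 minutes; adjusted is 5 minutes
```

The naive path is what produces the "full interval after the disconnect" behaviour in the logs above; the adjusted path corresponds to the proper solution sketched in the comment.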


manast commented May 3, 2024

There is a fix for this issue now in #2543, which forces the blocking command to be re-issued if it has been blocking for 1 second longer than expected. So for cases where a reconnection happens during a call to a blocking command, there will be at most a 1-second delay.
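The watchdog idea behind that fix can be sketched as a simple overdue check (this is a simplified illustration, not the actual #2543 code):

```javascript
// Simplified sketch of the watchdog idea (not the actual BullMQ #2543 code):
// if a blocking call has been outstanding for longer than its timeout plus a
// 1-second grace period, force a reconnect so the command is re-issued with
// a fresh, correct timeout.
const GRACE_MS = 1000;

function shouldForceReconnect(startedAtMs, timeoutMs, nowMs) {
  return nowMs - startedAtMs > timeoutMs + GRACE_MS;
}

console.log(shouldForceReconnect(0, 5000, 5500)); // false: still within grace
console.log(shouldForceReconnect(0, 5000, 6500)); // true: 1.5 s overdue
```

Because the check fires at most one grace period after the timeout should have expired, the worst-case extra delay is bounded by that 1 second, as the comment above states.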

@manast manast closed this as completed May 3, 2024