Jobs stuck in delayed state indefinitely #1656
Comments
I have tracked down a possible cause of this issue to this other one: redis/ioredis#1718
Ok, I checked now and it seems that a new release has been published. Version 3.6.2 should solve this issue, but only if you do not use a custom connection retry strategy that retries very quickly, within 50-100ms or so.
I'll upgrade to version 3.6.2 and post an update here if this fixes the issue, which most probably is the case! Thank you!
I am experiencing an issue that seems related. In my case jobs don't get processed in time after a temporary disconnection to Redis. I have made a demo repo to demonstrate the bug. I have updated bullmq to 3.6.2 but the problem is still there: https://github.com/butavicius/bullmq-test
@butavicius In your repo it seems like jobs are processed after reconnection. If this is the case, then it is not the same issue.
@butavicius I have tried your repo and I cannot reproduce the issue, but I have not used the 5-minute repetition rate, as that takes too long; 5 seconds works as it should. Is it only reproducible with 5-minute repetitions? Btw, can you create a separate issue for this?
@manast Unfortunately 3.6.2 did not solve the issue, so this must be something else. I'll try to reproduce it locally and share a repo.
@paweltatarczuk Strange, because what you write in the issue matches 100% the symptoms of this issue, i.e. BRPOPLPUSH hanging forever and the worker being correctly reconnected.
I'll investigate this in more depth in the upcoming days and get back with, I hope, new information. The worst part is that it only happens in production, randomly, after a couple of hours, so reproducing this properly is difficult.
You are right, my issue seems to be separate. Leaving a link to it in case someone wants to follow it: #1658
This happened again, but this time I double checked whether there is an active connection for the queue, and there is not. I can see multiple … Does it mean that the worker is not running? I have a health check in place that checks if …
@paweltatarczuk What settings are you using for your Worker connections?
There is a new chapter in the documentation that has some information regarding reconnections that could be useful in these cases: https://docs.bullmq.io/guide/going-to-production#automatic-reconnections
I'm using …
Here's a repository that reproduces the issue: … It freezes every time I run it.
In my case the worker freezes right after the first job and never picks up the second one, even though it should after 10 seconds. Now that I'm able to replicate this issue, I'll try to investigate it deeper over the weekend.
Thanks for taking the time to provide easy-to-reproduce test code. What I could see is that if you change to a slower retry strategy, it works properly:
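The exact snippet from this comment was not preserved in this copy of the thread; based on the surrounding discussion (retries in the 50-100ms range trigger the hang, while retries of a couple of seconds do not), a slower retry strategy might look roughly like this, with illustrative numbers:

```ts
import { Worker } from 'bullmq';

// Sketch of a "slower" retry strategy along the lines described above.
// retryStrategy returns the delay in ms before the next reconnection attempt.
const worker = new Worker(
  'my-queue',
  async (job) => {
    // ... process the job
  },
  {
    connection: {
      host: 'localhost',
      port: 6379,
      // Wait at least 2 seconds between reconnection attempts,
      // backing off up to 20 seconds.
      retryStrategy: (times: number) => Math.min(Math.max(times * 1000, 2000), 20000),
    },
  },
);
```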
So it seems like the default retryStrategy provided by ioredis already suffers from this issue, which is not good at all: redis/ioredis#1718
I think I can create a workaround for these kinds of issues, like making sure the retry time is always larger than 2 seconds, but it may not work at all times since we do not yet know the underlying cause of the issue. It may be a hazard that manifests itself easily with fast retries, but it could also be the case that it manifests itself in other situations even with a larger retry time.
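As a rough, hypothetical sketch of that idea on the user side (not something BullMQ implements), a custom retry strategy could be clamped to a 2-second minimum like this:

```ts
// Hypothetical helper: wrap any ioredis retryStrategy so it never retries
// faster than a given minimum delay, mirroring the workaround idea above.
type RetryStrategy = (times: number) => number | void | null;

function withMinimumDelay(strategy: RetryStrategy, minMs = 2000): RetryStrategy {
  return (times) => {
    const delay = strategy(times);
    // Pass through "stop retrying" signals (null/undefined) unchanged.
    if (typeof delay !== 'number') return delay;
    return Math.max(delay, minMs);
  };
}

// Usage: clamp an aggressive strategy that would otherwise retry every 50-100ms.
const fastStrategy: RetryStrategy = (times) => Math.min(times * 50, 100);
const safeStrategy = withMinimumDelay(fastStrategy);
```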
I can confirm that adjusting the retry strategy "resolves" the issue. I'll keep an eye on redis/ioredis#1718 then.
Turns out the …
Closing this now since the issue has been resolved in ioredis version 5.3.1.
I have a queue with mostly delayed jobs which get stuck in the delayed state indefinitely. Restarting the worker solves the issue temporarily, but it recurs on a regular basis.
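For context, the delayed jobs are added with BullMQ's standard delay option, roughly along these lines (a simplified sketch; the queue name, payload, and delay are only illustrative):

```ts
import { Queue } from 'bullmq';

// Simplified sketch of how the delayed jobs are created; names and values
// are illustrative, not taken from the actual application.
const queue = new Queue('delayed-queue', {
  connection: { host: 'localhost', port: 6379 },
});

// The job should move from the delayed state to wait after 5 minutes,
// but in this report it stays delayed indefinitely.
await queue.add('send-reminder', { userId: 123 }, { delay: 5 * 60 * 1000 });
```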
I was trying to replicate the issue, but the cause is unclear to me and I'm not sure how to proceed with investigating what is happening here.
From what I understand, everything stalls when the worker is waiting for a job and calls BRPOPLPUSH. The client list shows there is a connection for this command. There are no errors being emitted by the worker either. It looks like BRPOPLPUSH is not timing out properly, and unless a new job in the active/wait state is added or the worker is restarted, it never ends.
Is there anything I can do to prevent this from happening, or to find the root cause of this issue? I'm happy to work on a PR once it is clear what is causing this behavior.
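The client-list output itself was not captured in this copy of the issue; a check along these lines (a sketch using ioredis, with placeholder connection details) shows whether a client is currently blocked on BRPOPLPUSH:

```ts
import Redis from 'ioredis';

// Sketch: inspect connected clients and look for one blocked on BRPOPLPUSH.
// CLIENT LIST output contains a "cmd=" field with each client's last command.
const redis = new Redis({ host: 'localhost', port: 6379 });

const clients = String(await redis.client('LIST'));
const blocked = clients
  .split('\n')
  .filter((line) => line.includes('cmd=brpoplpush'));

console.log(blocked.length > 0 ? blocked.join('\n') : 'no BRPOPLPUSH client found');

await redis.quit();
```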
Redis version: 7.0.5 (DigitalOcean Managed Redis)
BullMQ version: 3.5.11