
Long-running tasks are duplicated multiple times in multi-cluster environment when no timeout is set #307

Closed
jonathan-golorry opened this issue Jul 21, 2018 · 9 comments


@jonathan-golorry

Running version 0.9.4 in a 2-cluster environment.

I wrote a task to email myself and scheduled it to run once a day at 10am. The task contained a very poorly scaling query, so it took a few hours to run. I didn't have a timeout set, and I'd get anywhere from 5 to 20 emails a day. Here's a typical set of emails:
5:40am
5:57am
9:34am
9:51am
7:38pm
7:40pm
12:05am (next day)
12:05am (next day)

The emails from between 12am and 10am are presumably from the previous day's tasks, but they are date-stamped with when the task started (the timestamp is generated as part of the task).

The admin page does not show duplicate tasks, only one task per day. This day's task claims to have stopped at 7:38pm.

I found it interesting that I was running 2 clusters and the emails seem to come in pairs. I saw some old posts about tasks being duplicated in multi-cluster environments, but those seemed to describe bugs that were fixed sometime before 0.9.4.

I've since fixed the slow query, and so far that seems to have stopped the duplicates.

@edmenendez

I'm also seeing long-running tasks being picked up more than once. It's one cluster of 6 workers using the ORM, and it seems to pick up the same job multiple times.

I've worked around it by adding a flag when the job is picked up. Not ideal obviously.
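The flag itself isn't shown in the thread; here's a minimal in-process sketch of that kind of one-shot claim guard, purely illustrative. In a real Django deployment the flag would have to live in shared state instead, e.g. the database or an atomic `cache.add`, since workers are separate processes.

```python
import threading

# Hypothetical in-process stand-in for a shared "already picked up" flag.
# In the actual workaround this would live in the database or cache so
# that all workers see it.
_claimed = set()
_lock = threading.Lock()

def claim(task_id):
    """Return True only for the first worker that claims task_id."""
    with _lock:
        if task_id in _claimed:
            return False
        _claimed.add(task_id)
        return True

def run_task(task_id):
    """Skip the job body if another worker already claimed this task."""
    if not claim(task_id):
        return "skipped duplicate"
    # ... actual long-running work would go here ...
    return "ran"
```

This makes duplicate deliveries harmless rather than preventing them, which is why it's described as a workaround and not a fix.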

Is anyone else seeing this?

@jonathan-golorry what backend are you using?

@jonathan-golorry

I'm using the Django ORM.

@kbuilds

kbuilds commented Jan 2, 2019

I am also seeing this issue. Using the Django ORM.

django_q version = 1.0.1

settings.py:

Q_CLUSTER = {
    'name': 'import',
    'workers': int(os.getenv("WORKER_COUNT", 10)),
    'timeout': 1200,
    'retry': 120,
    'queue_limit': 100,
    'bulk': 10,
    'orm': 'default',
}

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.db.DatabaseCache',
        'LOCATION': 'cache_table',
    },
}

@kbuilds

kbuilds commented Jan 2, 2019

I did a bit of poking on this one. It looks like the second worker starts the duplicate task at exactly 120 seconds for me, which happens to be my retry setting.

Is it possible that the retry mechanism is duplicating the running task?
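That would match how a visibility-timeout style broker behaves: a task is only acknowledged when it finishes, and until then the broker redelivers it every `retry` seconds on the assumption that the previous worker died. A toy simulation of that timing (the numbers come from this thread; the function is an illustration, not django-q's actual code):

```python
def deliveries(task_duration, retry, horizon):
    """Return the times at which a broker (re)delivers an unacknowledged task.

    The task is acknowledged only when it finishes at `task_duration`;
    until then the broker hands it out again every `retry` seconds.
    `horizon` just bounds how far the simulation runs.
    """
    times = []
    t = 0
    while t < min(task_duration, horizon):
        times.append(t)
        t += retry
    return times

# A 1200-second task with retry=120 keeps getting handed out:
long_task = deliveries(task_duration=1200, retry=120, horizon=600)

# A task that finishes before the retry window is delivered exactly once:
short_task = deliveries(task_duration=60, retry=120, horizon=600)
```

With `retry=120` and a multi-hour task, a duplicate every 120 seconds across several workers lines up with the paired, repeated emails reported above.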

@kbuilds

kbuilds commented Jan 2, 2019

I reduced the retry setting to 60, and the duplicate tasks are now created 60 seconds after the first one starts, so I think the retry system is to blame here.

I tried fixing the issue by increasing the retry to be the same as my timeout setting (1200), but now tasks are no longer being picked up by the workers...

@kbuilds

kbuilds commented Jan 2, 2019

This behavior is documented at https://django-q.readthedocs.io/en/latest/brokers.html

@kbuilds

kbuilds commented Jan 4, 2019

Tried this again, and it seems to magically be working.

@edmenendez @jonathan-golorry Have you guys checked to see if the retry setting is related to your duplicate tasks?

@jonathan-golorry

Thanks for checking this. It looks like the default timeout setting is to never time out, while the default retry setting is to retry after 60 seconds. The docs say never to set retry lower than the timeout, but I was mostly using default settings, so I missed it.

Can retry be set to None? The docs don't mention it.

@edmenendez

I'm setting timeout to 900 but not setting retry, and I think retry defaults to 60 seconds, so that's probably the issue. A warning to the console about that would be nice :-)
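Putting the thread's findings together, a safe configuration keeps retry comfortably above timeout, so a task is killed before the broker ever considers redelivering it. A sketch (the specific numbers here are illustrative, not taken from the thread):

```python
# Sketch of a Q_CLUSTER config that honors the documented invariant.
Q_CLUSTER = {
    'name': 'default',
    'workers': 4,
    'timeout': 900,    # kill any task after 15 minutes
    'retry': 1200,     # only redeliver well after the timeout would fire
    'orm': 'default',
}

# The invariant implied by the brokers docs: retry must exceed timeout,
# or a still-running task can be handed to a second worker.
assert Q_CLUSTER['retry'] > Q_CLUSTER['timeout']
```

Note this only works if timeout is actually set; with the default timeout of None a long-running task can always outlive any finite retry window.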

janneronkko added a commit to janneronkko/django-q that referenced this issue Feb 6, 2019
This issue has been reported many times in Django-Q's issue tracker:
Koed00#183
Koed00#180
Koed00#307

All these issues have been closed, and the responses note that retry should be set higher than the timeout or the duration of any task.