
Long-running tasks are duplicated multiple times in multi-cluster environment when no timeout is set #307

Closed
jonathan-golorry opened this issue Jul 21, 2018 · 9 comments


@jonathan-golorry

Running version 0.9.4 in a 2-cluster environment.

I wrote a task to email myself and scheduled it to run once a day at 10am. The task contained a very poorly scaling query, so it took a few hours to run. I didn't have a timeout set, and I'd get anywhere from 5 to 20 emails a day. Here's a typical set of emails:
5:40am
5:57am
9:34am
9:51am
7:38pm
7:40pm
12:05am (next day)
12:05am (next day)

The emails from between 12am and 10am are presumably from the previous day's tasks, but they are date-stamped with when the task started (the timestamp is generated as part of the task).

The admin page does not show duplicate tasks, only one task per day. This day's task claims to have stopped at 7:38pm.

I found it interesting that I was running 2 clusters and the emails seem to come in pairs. I saw some old posts about tasks being duplicated in multi-cluster environments, but those seemed to describe bugs that were fixed sometime before 0.9.4.

I've since fixed the slow query, and so far that seems to have stopped the duplicates.

@edmenendez

I'm also seeing long-running tasks being picked up more than once. It's one cluster of 6 workers using the ORM, and it seems to pick up the same job multiple times.

I've worked around it by adding a flag when the job is picked up. Not ideal obviously.
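The flag itself isn't shown in the thread; here's a minimal in-process sketch of that kind of one-shot claim guard, purely illustrative. In a real Django deployment the flag would have to live in shared state instead, e.g. the database or an atomic `cache.add`, since workers are separate processes.

```python
import threading

# Hypothetical in-process stand-in for a shared "already picked up" flag.
# In the actual workaround this would live in the database or cache so
# that all workers see it.
_claimed = set()
_lock = threading.Lock()

def claim(task_id):
    """Return True only for the first worker that claims task_id."""
    with _lock:
        if task_id in _claimed:
            return False
        _claimed.add(task_id)
        return True

def run_task(task_id):
    """Skip the job body if another worker already claimed this task."""
    if not claim(task_id):
        return "skipped duplicate"
    # ... actual long-running work would go here ...
    return "ran"
```

This makes duplicate deliveries harmless rather than preventing them, which is why it's described as a workaround and not a fix.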

Is anyone else seeing this?

@jonathan-golorry what backend are you using?

@jonathan-golorry

I'm using the Django ORM.

@kbuilds

kbuilds commented Jan 2, 2019

I am also seeing this issue. Using the Django ORM.

django_q version = 1.0.1

settings.py:

Q_CLUSTER = {
    'name': 'import',
    'workers': int(os.getenv("WORKER_COUNT", 10)),
    'timeout': 1200,
    'retry': 120,
    'queue_limit': 100,
    'bulk': 10,
    'orm': 'default',
}

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.db.DatabaseCache',
        'LOCATION': 'cache_table',
    },
}

@kbuilds

kbuilds commented Jan 2, 2019

I did a bit of poking on this one. It looks like the second worker starts the duplicate task at exactly 120 seconds for me, which happens to be my retry setting.

Is it possible that the retry mechanism is duplicating the running task?
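That would match how a visibility-timeout style broker behaves: a task is only acknowledged when it finishes, and until then the broker redelivers it every `retry` seconds on the assumption that the previous worker died. A toy simulation of that timing (the numbers come from this thread; the function is an illustration, not django-q's actual code):

```python
def deliveries(task_duration, retry, horizon):
    """Return the times at which a broker (re)delivers an unacknowledged task.

    The task is acknowledged only when it finishes at `task_duration`;
    until then the broker hands it out again every `retry` seconds.
    `horizon` just bounds how far the simulation runs.
    """
    times = []
    t = 0
    while t < min(task_duration, horizon):
        times.append(t)
        t += retry
    return times

# A 1200-second task with retry=120 keeps getting handed out:
long_task = deliveries(task_duration=1200, retry=120, horizon=600)

# A task that finishes before the retry window is delivered exactly once:
short_task = deliveries(task_duration=60, retry=120, horizon=600)
```

With `retry=120` and a multi-hour task, a duplicate every 120 seconds across several workers lines up with the paired, repeated emails reported above.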

@kbuilds

kbuilds commented Jan 2, 2019

I reduced the retry setting to 60, and the duplicate tasks are now created 60 seconds after the first one starts, so I think the retry system is to blame here.

I tried fixing the issue by increasing the retry to be the same as my timeout setting (1200), but now tasks are no longer being picked up by the workers...

@kbuilds

kbuilds commented Jan 2, 2019

This behavior is documented at https://django-q.readthedocs.io/en/latest/brokers.html

@kbuilds

kbuilds commented Jan 4, 2019

Tried this again, and it seems to magically be working.

@edmenendez @jonathan-golorry Have you guys checked to see if the retry setting is related to your duplicate tasks?

@jonathan-golorry

Thanks for checking this. It looks like the default timeout setting is to never time out, while the default retry setting is to retry after 60 seconds. The docs say never to set retry lower than the timeout, but I was mostly using default settings, so I missed it.

Can retry be set to None? The docs don't mention it.

@edmenendez

I'm setting timeout to 900 but not setting retry, and I think retry defaults to 60 seconds, so that's probably the issue. A warning to the console about that would be nice :-)
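Putting the thread's findings together, a safe configuration keeps retry comfortably above timeout, so a task is killed before the broker ever considers redelivering it. A sketch (the specific numbers here are illustrative, not taken from the thread):

```python
# Sketch of a Q_CLUSTER config that honors the documented invariant.
Q_CLUSTER = {
    'name': 'default',
    'workers': 4,
    'timeout': 900,    # kill any task after 15 minutes
    'retry': 1200,     # only redeliver well after the timeout would fire
    'orm': 'default',
}

# The invariant implied by the brokers docs: retry must exceed timeout,
# or a still-running task can be handed to a second worker.
assert Q_CLUSTER['retry'] > Q_CLUSTER['timeout']
```

Note this only works if timeout is actually set; with the default timeout of None a long-running task can always outlive any finite retry window.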

janneronkko added a commit to janneronkko/django-q that referenced this issue Feb 6, 2019
This issue has been reported many times in Django-Q's issue tracker:
Koed00#183
Koed00#180
Koed00#307

All these issues have been closed, and the responses note that retry should be set higher than the timeout or the duration of any task.