Bug Report

Description
For tables in which the make() function can take a long time, it is often desirable to ensure that only a small number of keys is handled before moving on to other tables.
Datajoint provides two ways to implement this, which can behave unexpectedly differently, for reasons that can only be discovered by examining the datajoint source code in detail:
table.populate(limit=x)
table.populate(max_calls=x)
I believe that the way the limit keyword currently operates does not align well with either its implied meaning (based on what the word "limit" means in the English language) or how it is described in the docstring (:param limit: if not None, check at most this many keys).
Within Autopopulate.populate (https://github.com/datajoint/datajoint-python/blob/master/datajoint/autopopulate.py#L153), the handling goes as follows (a toy simulation of this ordering follows the list):
1. Determine the set of incomplete keys for table (L212)
2. Fetch up to limit keys from this set
3. If reserve_jobs, remove any keys from that group that already exist in the jobs table (L218)
4. If max_calls, truncate the list of remaining keys to max_calls
5. For each remaining key, call table.make(key) (L248)
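To make the ordering concrete, here is a small, self-contained toy simulation of the steps above. This is illustrative only: simulate_populate is not a real datajoint function, and the real implementation operates on query expressions and the schema.jobs table rather than plain Python lists.

# Toy simulation of the ordering inside populate(), assuming reserve_jobs=True.
def simulate_populate(incomplete_keys, reserved_keys, *, limit=None, max_calls=None):
    keys = list(incomplete_keys)                        # step 1: keys missing from the target table
    if limit is not None:
        keys = keys[:limit]                             # step 2: limit applied BEFORE the jobs check
    keys = [k for k in keys if k not in reserved_keys]  # step 3: drop keys already in the jobs table
    if max_calls is not None:
        keys = keys[:max_calls]                         # step 4: max_calls applied AFTER the jobs check
    return keys                                         # step 5 would call table.make() on each of these

# Ten incomplete keys, of which key 0 is already reserved by another worker:
print(simulate_populate(range(10), {0}, limit=1))      # [] -- this worker is handed nothing to do
print(simulate_populate(range(10), {0}, max_calls=1))  # [1] -- this worker still gets one key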
In the case where limit is small (equal to or smaller than the number of workers simultaneously executing table.populate(reserve_jobs=True, limit=x)), this can produce a situation in which there are many potential keys to be processed, yet each additional worker fetches a set of keys that are all already present in the schema.jobs table. That set is therefore reduced to an empty set before the worker ever reaches the table.make(key) part of the .populate() call.
The "correct" answer to this case is to use max_calls instead, as the limit will then apply after checking whether keys are already in progress (or error, or any other form of unavailability due to the jobs table).
However, today marks the fourth time I have spent more than an hour chasing down weird behaviour in which multiple workers were idling and ignoring available work because I had misunderstood or misremembered the meaning of this keyword.
I have several suggestions for fixing this behaviour, mostly depending on the degree to which it is intentional/desired:
1. Assuming it is unintentional, make limit and max_calls equivalent by moving the limit handling within Autopopulate.populate to after the jobs-table check instead of before it (see the sketch after this list).
2. Change the logic for determining the set of available keys to take into account keys already in flight (i.e. listed in the jobs table with whatever status). This would combine well with issue Provide a way to list specific jobs #873, although in practice it has the same outcome as (1).
3. If it is intended, change the keyword to something that better reflects its actual role. To me the literal meaning of the word "limit" is much more closely aligned with the logical behaviour of the keyword max_calls, and vice versa. It is not intuitive to me that limit applies before ignoring keys that are already in progress; I anticipate (every single time!) that keys already in error/reserved have already been excluded from the set of keys available to be processed, at least when the reserve_jobs flag is used.
4. Change the docstring to better reflect the actual behaviour. This obviously involves the smallest change in compatibility or behaviour for everyone who already uses this logic, but it still falls victim to the point I make above in (3). I would suggest something like: if not None, attempt to process this many keys, which may include keys already reserved.
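For suggestion (1), the change amounts to reordering the toy simulation shown earlier so that the jobs-table check happens before the limit is taken. Again, this is illustrative only and not a patch against the actual source:

# Toy variant with suggestion (1) applied: reserved keys are removed first.
def simulate_populate_suggested(incomplete_keys, reserved_keys, *, limit=None, max_calls=None):
    keys = [k for k in incomplete_keys if k not in reserved_keys]  # jobs check first
    if limit is not None:
        keys = keys[:limit]          # limit now behaves the way max_calls does today
    if max_calls is not None:
        keys = keys[:max_calls]
    return keys

print(simulate_populate_suggested(range(10), {0}, limit=1))  # [1] -- the worker is no longer starved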
Reproducibility
Include:
Any OS
Any Python version
Any MySQL version
Datajoint <=0.14.3 (not tested with later versions, but the problematic logic still exists in master as of the posting date)
Create two tables: one listing keys, the other representing some long-running job
# my_test_schema.py
import time

import datajoint as dj

schema = dj.schema("test_schema")


@schema
class TableOne(dj.Manual):
    definition = """
    id: int
    ---
    """
    contents = [(i,) for i in range(10)]  # ten keys for TableTwo to process


@schema
class TableTwo(dj.Computed):
    definition = """
    -> TableOne
    ---
    some_value: int
    """

    def make(self, key):
        print(f"working on {key}\n")
        time.sleep(30)  # stand-in for a long-running computation
        self.insert1({**key, "some_value": key["id"]})
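Create a script to populate TableTwo. The original my_test_script.py is not reproduced above, so the following is a minimal reconstruction consistent with the output shown below; the command-line worker label and the exact way "available keys" is counted are assumptions on my part:

# my_test_script.py -- minimal reconstruction, not the original script
import sys

from my_test_schema import schema, TableOne, TableTwo

# Hypothetical: a label passed on the command line so the two workers can be told apart.
worker = sys.argv[1] if len(sys.argv) > 1 else "?"

while len(TableOne() - TableTwo()) > 0:
    # Count keys that are neither populated nor currently reserved in schema.jobs.
    n_available = len(TableOne() - TableTwo()) - len(schema.jobs & 'status = "reserved"')
    print(f"{worker} : available keys: {n_available}")
    # limit=1 caps the keys fetched before reserved keys are filtered out,
    # so this call can return without doing anything even though work is available.
    TableTwo().populate(reserve_jobs=True, limit=1)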
Run at least two copies of my_test_script.py simultaneously. The output will look something like this:
1 : available keys: 10
working on {"id":0}
2 : available keys 9
2 : available keys 9
2 : available keys 9
2 : available keys 9
....
i.e. the second script will rapidly iterate through its while loop, ignoring the nine other available jobs, while the first actually does something (in this test example, sleeping). There is no way to predict which process will pick up the second job once the first is complete, but the same pattern will play out, with the other script idling.
Expected Behavior
I expect the limit keyword to behave the way that the max_calls keyword actually does.
Alternatively, I expect either a choice of keyword that better reflects the actual logical behaviour, or a more descriptive docstring that better explains that behaviour.
It should not be necessary to examine the source code of the autopopulate.py component of the Datajoint library to understand what the difference between the keywords limit and max_calls is and why one should choose one or the other.