
Unexpected / poorly documented behaviour of keyword limit in table.populate() #1203

Open
simon-ball opened this issue Feb 10, 2025 · 1 comment
simon-ball commented Feb 10, 2025

Bug Report

Description

For certain tables whose make() function can take a long time, it is often desirable to ensure that only a small number of keys is processed before moving on to other tables.

DataJoint provides two ways to implement this, which can behave unexpectedly differently for reasons that can only be discovered by examining the DataJoint source code in detail:

  • table.populate(limit=x)
  • table.populate(max_calls=x)

I believe that the way the limit keyword currently operates aligns neither with its implied meaning (based on what the word "limit" means in the English language) nor with how it is described in the docstring (:param limit: if not None, check at most this many keys).

Within AutoPopulate.populate (https://github.com/datajoint/datajoint-python/blob/master/datajoint/autopopulate.py#L153), the handling proceeds as follows (sketched in code after this list):

  • Determine the set of incomplete keys for the table (L212)
  • Fetch up to limit keys from this set
  • If reserve_jobs, remove any keys from that group that already exist in the jobs table (L218)
  • If max_calls, truncate the list of remaining keys to max_calls
  • For each remaining key, call table.make(key) (L248)
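
In code, the ordering is roughly the following (a condensed paraphrase of the steps just described, not the verbatim source; todo and already_reserved_or_errored() are hypothetical stand-ins for the actual key source and jobs-table check):

# Condensed paraphrase of AutoPopulate.populate() key handling (not verbatim source)
keys = todo.fetch("KEY", limit=limit)  # limit applies here, BEFORE the jobs-table check
if reserve_jobs:
    # drop keys already reserved/errored by other workers
    keys = [k for k in keys if not already_reserved_or_errored(k)]
if max_calls is not None:
    keys = keys[:max_calls]  # max_calls applies here, AFTER the jobs-table check
for key in keys:
    table.make(key)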

In the case where limit is small (equal to or smaller than the number of workers that might be executing table.populate(reserve_jobs=True, limit=x) simultaneously), this can result in a situation where there are many potential keys to be processed, but each additional worker is handed a set of keys that are all already in the schema.jobs table, so that set is reduced to empty before the table.make(key) part of the .populate() call is ever reached.

The "correct" answer to this case is to use max_calls instead, as the limit will then apply after checking whether keys are already in progress (or error, or any other form of unavailability due to the jobs table).

However, today marks the 4th time I have spent >1h chasing down weird behaviour in which multiple workers were idling/ignoring available work because of misunderstanding/misremembering the meaning of this keyword.

I have several suggestions for fixing this behaviour, mostly depending on the degree to which it is intentional/desired:

  1. Assuming it is unintentional, make limit and max_calls equivalent by moving the limit handling within AutoPopulate.populate after the jobs-table check instead of before it (sketched after this list)

  2. Change the logic for determining the set of available keys to take keys already in flight into account (i.e. exclude keys listed in the jobs table with whatever status). This would combine well with issue Provide a way to list specific jobs #873, although in practice it has the same outcome as (1)

  3. If it is intended, change the keyword to something that better reflects its actual role. To me, the literal meaning of the word "limit" is much more closely aligned with the logical behaviour of the keyword max_calls, and vice versa. It is not intuitive that limit applies before ignoring keys that are already in progress; I anticipate (every single time!) that keys already in error/reserved have already been excluded from the set of keys available to be processed, at least when the reserve_jobs flag is set.

  4. Change the docstring to better reflect the actual behaviour. This obviously involves the smallest compatibility or behaviour changes for everyone who already uses this logic, but still falls victim to the point I make above in (3). I would suggest something like: if not None, attempt to process this many keys, which may include keys already reserved
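
To illustrate suggestion (1), the reordering would look roughly like this (a sketch against the paraphrased structure above, not a tested patch; todo and already_reserved_or_errored() are again hypothetical stand-ins):

# Proposed ordering for suggestion (1): apply limit AFTER the jobs-table check
keys = todo.fetch("KEY")  # fetch all pending keys
if reserve_jobs:
    keys = [k for k in keys if not already_reserved_or_errored(k)]
if limit is not None:
    keys = keys[:limit]  # limit now behaves the way max_calls does today
for key in keys:
    table.make(key)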

Reproducibility

Environment:

  • Any OS
  • Any Python version
  • Any MySQL version
  • DataJoint <=0.14.3 (not tested with later versions, but the problematic logic still exists in master as of the posting date)

Create two tables: one listing keys, the other representing some long-running job:

# my_test_schema.py

import time
import datajoint as dj

schema = dj.schema("test_schema")

@schema
class TableOne(dj.Manual):
    definition = """
    id: int
    ---
    """

    contents = [(i,) for i in range(10)]  # rows must be tuples/dicts, one per id

@schema
class TableTwo(dj.Computed):
    definition = """
    -> TableOne
    ---
    some_value: int
    """

    def make(self, key):
        print(f"working on {key}\n")
        time.sleep(30)  # simulate a long-running computation
        self.insert1({**key, "some_value": key["id"]})

Create a script to populate TableTwo:

# my_test_script.py
import os
import time
from my_test_schema import *

def test():
    pid = os.getpid()
    while True:
        in_flight_keys = (schema.jobs & {"table_name": TableTwo.table_name}).fetch("key")
        available_keys = TableTwo.key_source - TableTwo - in_flight_keys
        print(f"{pid} : available keys: {len(available_keys)}")
        TableTwo.populate(reserve_jobs=True, limit=1)
        time.sleep(1)   # Reduce screen spam

if __name__ == "__main__":
    test()

Run at least two copies of my_test_script.py simultaneously. The output will look something like this:

1 : available keys: 10
working on {"id":0}

2 : available keys: 9
2 : available keys: 9
2 : available keys: 9
2 : available keys: 9
....

i.e. the second script rapidly spins through the while loop, ignoring the 9 other available jobs, while the first actually does something (in this test example, sleeping). There is no way to predict which process will pick up the second job once the first completes, but the same pattern will play out, with the other script idling.

Expected Behavior

I expect the limit keyword to behave the way that the max_calls keyword actually does.

Alternatively, I expect either a choice of keyword that better reflects the actual logical behaviour, or a more descriptive docstring that better explains the actual logical behaviour.

It should not be necessary to examine the source code of the autopopulate.py component of the Datajoint library to understand what the difference between the keywords limit and max_calls is and why one should choose one or the other.

@simon-ball simon-ball added the bug label Feb 10, 2025
@dimitri-yatsenko dimitri-yatsenko self-assigned this Feb 10, 2025
dimitri-yatsenko (Member) commented
Agreed with your assessment @simon-ball
