Add an interface to Python API for a set of files and jobspecs #158

Merged
merged 26 commits into from
Oct 23, 2024

Conversation

edknv
Collaborator

@edknv edknv commented Oct 14, 2024

Description

Closes #78.

  • Introduces BatchJobSpec for generating jobs from a set of files and JobSpecs.
  • Enables creating the file list via path globbing or a dataset specification.
  • Makes the in-flight batch size configurable.
  • Adds built-in fetch retry logic and lets the end user specify fetch timeouts.
  • Adds the ability to check for and handle failed jobs (those where 'metadata.status' == failed).
  • Leaves native async calls, in addition to the existing Futures-based async functionality, for future work.
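
As a rough illustration of the file-list and batch-size ideas in the list above, the sketch below expands glob patterns with the standard library and yields batches of a fixed size. The helper names expand_patterns and in_batches are hypothetical and are not part of nv_ingest_client; how BatchJobSpec handles this internally may differ.

```python
import glob
from itertools import islice
from typing import Iterable, Iterator, List


def expand_patterns(patterns: Iterable[str]) -> List[str]:
    """Expand glob patterns into a sorted, de-duplicated list of paths.

    Hypothetical helper for illustration only.
    """
    matched = set()
    for pattern in patterns:
        matched.update(glob.glob(pattern, recursive=True))
    return sorted(matched)


def in_batches(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most batch_size items.

    Hypothetical helper for illustration only.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch
```

Keeping the batch size separate from the file list means the same list can be resubmitted with a smaller batch if the service is under load.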

In the following example from the README, BatchJobSpec is used interchangeably with JobSpec in the main API.

from nv_ingest_client.client import NvIngestClient
from nv_ingest_client.primitives import BatchJobSpec
from nv_ingest_client.primitives.tasks import ExtractTask

batch_job_spec = BatchJobSpec(
    [
        "data/bo_20/*.pdf",
    ]
)

extract_task = ExtractTask(document_type="pdf", extract_text=True, extract_images=False, extract_tables=False)

batch_job_spec.add_task(extract_task)

client = NvIngestClient(
    message_client_hostname="nv-ingest-ms-runtime",  # Host where nv-ingest-ms-runtime is running
    message_client_port=7670,  # REST port, defaults to 7670
)
job_ids = client.add_job(batch_job_spec)

client.submit_job(job_ids, "morpheus_task_queue", batch_size=10)
result = client.fetch_job_result(job_ids, timeout=60, verbose=True)
print(f"Got {len(result)} results")
(nv_ingest) root@06dc675b3b16:/work/git/nv-ingest# python3 test_python_client.py 
Job 14 is not ready. Retrying 1/∞ after 1 seconds.
Job 14 is not ready. Retrying 2/∞ after 1 seconds.
Got 20 results
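
The retry messages in the output above suggest a poll-until-ready loop. A minimal sketch of that pattern follows, assuming the timeout is an overall deadline and that an unset retry limit means retry indefinitely; the actual client may interpret timeout differently. NotReady and fetch_with_retry are hypothetical names, not part of nv_ingest_client.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class NotReady(Exception):
    """Hypothetical signal that a result is not available yet."""


def fetch_with_retry(
    fetch: Callable[[], T],
    timeout: float = 60.0,
    retry_delay: float = 1.0,
    max_retries: Optional[int] = None,
    verbose: bool = False,
) -> T:
    """Call fetch() until it returns, the retry limit is hit, or time runs out.

    Assumption: timeout is treated as an overall deadline in seconds.
    """
    deadline = time.monotonic() + timeout
    attempt = 0
    while True:
        try:
            return fetch()
        except NotReady:
            attempt += 1
            if max_retries is not None and attempt > max_retries:
                raise TimeoutError(f"gave up after {max_retries} retries")
            # Stop early if the next sleep would overrun the deadline.
            if time.monotonic() + retry_delay > deadline:
                raise TimeoutError(f"no result within {timeout} seconds")
            if verbose:
                limit = "∞" if max_retries is None else str(max_retries)
                print(f"Not ready. Retrying {attempt}/{limit} after {retry_delay} seconds.")
            time.sleep(retry_delay)
```

Only NotReady is retried here; any other exception raised by fetch propagates immediately so that real failures are not hidden behind the retry loop.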

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@edknv edknv marked this pull request as ready for review October 17, 2024 07:42
@randerzander
Collaborator

Thanks for doing this work, @edknv

I was attempting to try it out, but:

>>> from nv_ingest_client.primitives import BatchJobSpec
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'BatchJobSpec' from 'nv_ingest_client.primitives' (/home/nfs/rgelhausen/projects/nv-ingest/venv/lib/python3.11/site-packages/nv_ingest_client/primitives/__init__.py)

Does our init logic need an update to make BatchJobSpec importable?

@edknv
Collaborator Author

edknv commented Oct 17, 2024

@randerzander There has been an update to the client library, so I think it needs a reinstall.

(nv-ingest-dev) $ python3
Python 3.10.15 (main, Oct  3 2024, 07:27:34) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from nv_ingest_client.primitives import BatchJobSpec
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'BatchJobSpec' from 'nv_ingest_client.primitives' (/home/edwardk/.local/lib/python3.10/site-packages/nv_ingest_client/primitives/__init__.py)
>>> from nv_ingest_client.primitives import JobSpec
>>> exit()
(nv-ingest-dev) $ pip install client/ --quiet

[notice] A new release of pip is available: 23.0.1 -> 24.2
[notice] To update, run: /usr/bin/python3 -m pip install --upgrade pip
(nv-ingest-dev) $ python3
Python 3.10.15 (main, Oct  3 2024, 07:27:34) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from nv_ingest_client.primitives import BatchJobSpec
>>> 

But it raises another question. Every time we have an update in the client, customers will have the same issue. How do we make sure they are using the latest version of the client library?
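
One way to approach that question is to have users print which distribution version is installed and where the module is actually loaded from, since a stale copy in a user site-packages directory can shadow a fresh install. The sketch below uses only the standard library; the distribution name nv-ingest-client is taken from the install log in this thread, and both helper functions are hypothetical.

```python
from importlib import metadata
from importlib.util import find_spec
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def module_location(module_name: str) -> Optional[str]:
    """Return the file a module would be imported from, or None if not found."""
    try:
        spec = find_spec(module_name)
    except (ImportError, ValueError):
        return None
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    # Distribution name as shown by the install log; the import name differs.
    print("version:", installed_version("nv-ingest-client"))
    print("loaded from:", module_location("nv_ingest_client"))
```

A version string alone may not distinguish two development builds, so the reported location is often the more useful of the two values.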

@randerzander
Collaborator

I believe I'm installing the client library from your commit:

rgelhausen@a4u8g-0132:~/projects/nv-ingest$ git fetch origin pull/158/head:batch_api
rgelhausen@a4u8g-0132:~/projects/nv-ingest$ git checkout batch_api
M       docker-compose.yaml
Switched to branch 'batch_api'
rgelhausen@a4u8g-0132:~/projects/nv-ingest$ git rev-parse HEAD
22899fd7d145f71130aa8fcdfaeb7a0b3d5d6cb6

Creating the env and installing:

rgelhausen@a4u8g-0132:~/projects/nv-ingest$ uv venv --python 3.11 venv
Using CPython 3.11.10
Creating virtual environment at: venv
Activate with: source venv/bin/activate
rgelhausen@a4u8g-0132:~/projects/nv-ingest$ source venv/bin/activate
(venv) rgelhausen@a4u8g-0132:~/projects/nv-ingest$ cd client
(venv) rgelhausen@a4u8g-0132:~/projects/nv-ingest/client$ uv pip install -r requirements.txt 
Resolved 25 packages in 31ms
Installed 25 packages in 1.18s
 + annotated-types==0.7.0
 + anyio==4.6.2.post1
 + certifi==2024.8.30
 + charset-normalizer==3.4.0
 + click==8.1.7
 + h11==0.14.0
 + httpcore==1.0.6
 + httpx==0.27.2
 + idna==3.10
 + lxml==5.3.0
 + pillow==11.0.0
 + pydantic==2.9.2
 + pydantic-core==2.23.4
 + pypdfium2==4.30.0
 + python-docx==1.1.2
 + python-magic==0.4.27
 + python-pptx==0.6.23
 + redis==5.0.8
 + requests==2.32.3
 + setuptools==75.2.0
 + sniffio==1.3.1
 + tqdm==4.66.5
 + typing-extensions==4.12.2
 + urllib3==2.2.3
 + xlsxwriter==3.2.0
(venv) rgelhausen@a4u8g-0132:~/projects/nv-ingest/client$ uv pip install .
Resolved 26 packages in 95ms
Installed 1 package in 136ms
 + nv-ingest-client==2024.10.4.dev0 (from file:///home/nfs/rgelhausen/projects/nv-ingest/client)
(venv) rgelhausen@a4u8g-0132:~/projects/nv-ingest/client$ python
Python 3.11.10 (main, Sep  9 2024, 22:11:19) [Clang 18.1.8 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from nv_ingest_client.primitives import BatchJobSpec
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'BatchJobSpec' from 'nv_ingest_client.primitives' (/home/nfs/rgelhausen/projects/nv-ingest/venv/lib/python3.11/site-packages/nv_ingest_client/primitives/__init__.py)

@edknv
Collaborator Author

edknv commented Oct 17, 2024

Hmm, I'm trying to reproduce this but can't, even when running all the commands verbatim, from git fetch origin pull/158/head:batch_api through the final uv pip install step.

@randerzander
Collaborator

ok, you can ignore my feedback. I must have messed up my git index, it's working for me now :)

batch = job_indices[batch_start:batch_end]

# Submit each batch of jobs
batch_results = [self._submit_job(job_id, job_queue_id) for job_id in batch]
Collaborator

I'm just noticing this now, but we probably need to handle exceptions from _submit_job better. Currently, if we submit a batch of jobs and one fails, the rest of the items in the batch fail as well.

Collaborator Author

Updated in f2cc046 and 631d80a.
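
The concern in this thread, that one failing submission takes down the rest of its batch, is commonly addressed by catching exceptions per job. The sketch below shows that general pattern only; it is not the change made in the commits mentioned above, and submit_batch and submit_one are hypothetical names.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def submit_batch(
    job_ids: Iterable[str],
    submit_one: Callable[[str], None],
) -> Tuple[List[str], Dict[str, Exception]]:
    """Submit each job independently and report failures per job.

    Hypothetical helper: submit_one stands in for a single-job submit call.
    """
    submitted: List[str] = []
    failed: Dict[str, Exception] = {}
    for job_id in job_ids:
        try:
            submit_one(job_id)
        except Exception as exc:
            # Isolate the failure so the remaining jobs are still submitted.
            failed[job_id] = exc
        else:
            submitted.append(job_id)
    return submitted, failed
```

Returning the failures alongside the successes lets the caller decide whether to retry, log, or abort, instead of losing the whole batch to the first error.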

job_specs = create_job_specs_for_batch(files_batch)

job_ids = []
for job_spec in job_specs:
Collaborator

While we're in the process of turning the CLI code into a library, I'd like to be a bit more precise here and handle corner cases where more than one task of a single type is requested, or where multiple tasks of the same type are selected with different configuration parameters.

It's probably OK if we just reject duplicate tasks out of hand for now and raise an error, but it's also worth thinking through when, or if, we might want them.

Collaborator Author

Thanks. I thought about this, and in e1fceb6 I opted to reject duplicate tasks immediately and raise an error if any are present, mostly to avoid complexity.

Given the serial nature of our current pipeline, I didn't think duplicate tasks made much sense. There may still be use cases for them: for example, users might want to apply split tasks to pdf documents but not to pptx documents. Or maybe in that case two separate pipelines make more sense.
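
For illustration, rejecting duplicates by task type can be as small as the sketch below. Task and TaskList are hypothetical stand-ins and do not reflect how the referenced commit implements the check; here two tasks count as duplicates when they share a type, even if their options differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Task:
    """Hypothetical stand-in for a task such as ExtractTask."""

    kind: str
    options: Dict[str, Any] = field(default_factory=dict)


class TaskList:
    """Hypothetical container that allows at most one task per type."""

    def __init__(self) -> None:
        self._tasks: List[Task] = []

    def add_task(self, task: Task) -> None:
        # Same type counts as a duplicate, regardless of configuration.
        if any(existing.kind == task.kind for existing in self._tasks):
            raise ValueError(f"duplicate task type: {task.kind!r}")
        self._tasks.append(task)

    @property
    def tasks(self) -> List[Task]:
        # Return a copy so callers cannot bypass the duplicate check.
        return list(self._tasks)
```

Failing at add time surfaces the problem before any job is submitted, which matches the "reject immediately" choice described above.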

@edknv edknv requested a review from drobison00 October 22, 2024 04:54
@edknv edknv merged commit cdf1b64 into NVIDIA:main Oct 23, 2024
1 check passed
Development

Successfully merging this pull request may close these issues.

[FEA]: Add python multi file API for job submission
3 participants