# Implement generic processing steps (#650)
### Generic implementation of a processing graph

Remove explicit mentions of /splits and /first-rows from the code, and move them to the "processing graph":

```json
{
    "/splits": {"input_type": "dataset", "required_by_dataset_viewer": true},
    "/first-rows": {"input_type": "split", "requires": "/splits", "required_by_dataset_viewer": true}
}
```

This JSON (see libcommon.config) defines the *processing steps* (here /splits and /first-rows) and their dependency relationship (here /first-rows depends on /splits). It also defines whether a processing step is required by the Hub dataset viewer (used to fill /valid and /is-valid). A processing step is identified by the endpoint (/splits, /first-rows) where the result of the processing step can be downloaded; the endpoint value is also used as the cache key and the job type.

After this change, adding a new processing step should consist of:

- creating a new worker in the `workers/` directory
- updating the processing graph
- updating the CI, tests, docs and deployment (docker-compose files, e2e tests, docs, openapi, helm chart)

This also means that the services (API, admin) no longer contain any code that mentions splits or first-rows directly, and the splits worker contains no direct reference to first-rows.

### Other changes

- code: the libcache and libqueue libraries have been merged into libcommon
- the code that checks whether a dataset is supported (it exists, is not private, and access can be obtained programmatically if gated) has been factorized; it is now run before every processing step, and even before accepting to create a new job (through the webhook or through the /admin/force-refresh endpoint)
- add a new endpoint, /admin/cancel-jobs, which replaces the last admin scripts: it's easier to send a POST request than to call a remote script
- simplify the code of the workers by factorizing some code into libcommon:
  - the code that tests whether a job should be skipped, based on the versions of the git repository and of the worker
  - the logic to catch errors and write to the cache

  This way, the code of every worker now only contains what is specific to that worker.

### Breaking changes

- the env vars `QUEUE_MAX_LOAD_PCT`, `QUEUE_MAX_MEMORY_PCT` and `QUEUE_SLEEP_SECONDS` are renamed to `WORKER_MAX_LOAD_PCT`, `WORKER_MAX_MEMORY_PCT` and `WORKER_SLEEP_SECONDS`

---

* feat: 🎸 add the /cache-reports/parquet endpoint and parquet reports
* feat: 🎸 add the /parquet endpoint
* feat: 🎸 add the parquet worker. Note that it will not pass the CI because the CI token is not allowed to push to refs/convert/parquet (it should be in the "datasets-maintainers" org), and the refs/convert/parquet ref does not exist and cannot be created for now
* ci: 🎡 add CI for the worker
* feat: 🎸 remove the hffs dependency: we don't use it, and it's private for now
* feat: 🎸 change the response format: associate each parquet file with a split and a config (based on path parsing)
* fix: 🐛 handle the fact that "SSSSS-of-NNNNN" is optional (thanks @lhoestq)
* fix: 🐛 fill two fields to known versions in case of error
* feat: 🎸 upgrade datasets to 2.7.0
* ci: 🎡 fix the action
* feat: 🎸 create refs/convert/parquet if it does not exist
* feat: 🎸 update pytest (see pytest-dev/py#287 (comment))
* feat: 🎸 unlock access to gated datasets. Gated datasets with extra fields are not supported. Note also that only one token is used now
* feat: 🎸 check if the dataset is supported only for existing ones
* fix: 🐛 fix the config
* fix: 🐛 fix the branch argument + fix the case where the ref is created
* fix: 🐛 fix the logic of the worker to ensure we get the git sha; also fix the tests, and disable gated+private for now
* fix: 🐛 fix gated datasets and update the tests
* test: 💍 assert that gated datasets with extra fields are not supported
* fix: 🐛 add controls on the dataset_git_revision
* feat: 🎸 upgrade datasets
* feat: 🎸 add a script to refresh the parquet response
* fix: 🐛 fix the condition that tests whether the split exists in the list; also rename functions to be more accurate
* refactor: 💡 use exceptions to make the flow clearer
* feat: 🎸 add processing_steps
* fix: 🐛 fix a signature
* chore: 🤖 adapt to poetry 1.2, use pip-audit
* feat: 🎸 use ProcessingStep in the api service
* feat: 🎸 use ProcessingStep in the admin service, and replace the last scripts with the /cancel-jobs/xxx endpoints
* style: 💄 fix style
* feat: 🎸 update libcommon (use processing_step)
* refactor: 💡 merge libcache and libqueue into libcommon
* feat: 🎸 upgrade to libcommon 0.4
* feat: 🎸 upgrade to libcommon 0.4
* fix: 🐛 upgrade poetry
* feat: 🎸 use processing_step in the workers
* feat: 🎸 implement should_skip_job and process in the generic Worker; this makes the code of the workers simpler
* feat: 🎸 handle CustomError from the workers, with a specific code
* feat: 🎸 simplify the compute method
* refactor: 💡 fix typing
* fix: 🐛 remove an erroneous control
* feat: 🎸 update libcommon to 0.4.2
* feat: 🎸 update to libcommon 0.4.2
* ci: 🎡 fix the CI
* docs: ✏️ fix a docstring
* feat: 🎸 update to libcommon 0.4.2
* refactor: 💡 use Mapping instead of Dict
* feat: 🎸 update to libcommon 0.4.2; also replace Dict with Mapping
* fix: 🐛 use Dict because it must be mutable
* fix: 🐛 add a missing import
* feat: 🎸 replace dependency with previous_step and next_steps
* feat: 🎸 define the processing graph in the configuration
* feat: 🎸 upgrade to libcommon 0.5
* feat: 🎸 upgrade to libcommon 0.5
* feat: 🎸 upgrade to libcommon 0.5
* feat: 🎸 upgrade to libcommon 0.5.0
* feat: 🎸 upgrade to libcommon 0.5
* feat: 🎸 upgrade to libcommon 0.5
* refactor: 💡 add logic methods to simplify the services and workers
* feat: 🎸 upgrade to libcommon 0.5.1; some tests have been moved (still commented out) to e2e, since it has become hard to simulate all the Hub endpoints, and it's better to test the scenarios against the real Hub instead
* feat: 🎸 upgrade to libcommon 0.5.1
* feat: 🎸 remove the parquet processing step, since it's not in the scope of this PR
* style: 💄 fix style
* ci: 🎡 remove the parquet CI
* feat: 🎸 upgrade the docker images
* test: 💍 add some tests for the webhook
* test: 💍 update the e2e tests (and the error messages in openapi)
* style: 💄 fix style
* feat: 🎸 remove the parquet code
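The processing graph described above can be sketched in plain Python. This is an illustration only: the names `ProcessingStep`, `build_graph` and the `children` attribute are assumptions for the sketch, not the actual libcommon API.

```python
from dataclasses import dataclass, field
from typing import Optional

# Specification in the shape described above (see libcommon.config).
specification = {
    "/splits": {"input_type": "dataset", "required_by_dataset_viewer": True},
    "/first-rows": {"input_type": "split", "requires": "/splits", "required_by_dataset_viewer": True},
}


@dataclass
class ProcessingStep:
    endpoint: str  # also used as the cache key and the job type
    input_type: str  # "dataset" or "split"
    requires: Optional[str]
    required_by_dataset_viewer: bool
    children: list = field(default_factory=list)  # steps to trigger once this one is done


def build_graph(spec: dict) -> dict:
    """Build the step objects and wire the dependency edges."""
    steps = {
        endpoint: ProcessingStep(
            endpoint=endpoint,
            input_type=params["input_type"],
            requires=params.get("requires"),
            required_by_dataset_viewer=params.get("required_by_dataset_viewer", False),
        )
        for endpoint, params in spec.items()
    }
    for step in steps.values():
        if step.requires is not None:
            steps[step.requires].children.append(step)
    return steps


graph = build_graph(specification)
# When /splits finishes, the next jobs to create are its children.
print([child.endpoint for child in graph["/splits"].children])  # ['/first-rows']
```

With this structure, a service never needs to name /splits or /first-rows explicitly: it only walks the graph.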
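The "should a job be skipped" check that was factorized into libcommon can be sketched roughly as follows. The function name and parameters are hypothetical; the real implementation reads the cached response and compares more metadata.

```python
from typing import Optional


def should_skip_job(
    cached_revision: Optional[str],  # dataset git revision stored with the cached response
    current_revision: str,  # current git revision of the dataset repository
    cached_worker_version: Optional[str],  # worker version that produced the cache entry
    current_worker_version: str,
) -> bool:
    """Skip the job if the cache entry was computed from the same dataset
    revision by the same worker version; otherwise the job must run."""
    if cached_revision is None or cached_worker_version is None:
        return False  # no usable cache entry
    return (
        cached_revision == current_revision
        and cached_worker_version == current_worker_version
    )


print(should_skip_job("abc123", "abc123", "1.0.0", "1.0.0"))  # True: nothing changed
print(should_skip_job("abc123", "def456", "1.0.0", "1.0.0"))  # False: the dataset changed
```

Centralizing this check is what lets each worker contain only the code specific to its own processing step.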
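The path parsing mentioned in the parquet commits (associating each parquet file with a config and a split, with an optional "SSSSS-of-NNNNN" shard suffix) could look roughly like this. The `<config>/<split>[-SSSSS-of-NNNNN].parquet` layout is an assumption for the sketch, not the exact naming scheme used by the worker.

```python
import re
from typing import Optional

# Assumed filename shape: "<config>/<split>-SSSSS-of-NNNNN.parquet" or
# "<config>/<split>.parquet" (the shard suffix is optional).
FILENAME_RE = re.compile(
    r"^(?P<split>.+?)(?:-(?P<shard>\d{5})-of-(?P<num_shards>\d{5}))?\.parquet$"
)


def parse_parquet_path(path: str) -> Optional[dict]:
    """Extract config, split and optional shard index from a parquet file path."""
    config, _, filename = path.partition("/")
    match = FILENAME_RE.match(filename)
    if not filename or not match:
        return None
    return {"config": config, "split": match.group("split"), "shard": match.group("shard")}


print(parse_parquet_path("default/train-00000-of-00002.parquet"))
# {'config': 'default', 'split': 'train', 'shard': '00000'}
print(parse_parquet_path("default/train.parquet"))  # shard is None
```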
1 parent a252e94 · commit 8c47e92

Showing 303 changed files with 3,301 additions and 6,180 deletions.
```diff
@@ -1,15 +1,14 @@
 # SPDX-License-Identifier: Apache-2.0
 # Copyright 2022 The HuggingFace Authors.
 
 import pytest
 
 from .utils import poll
 
 
-def test_healthcheck():
-    response = poll("/healthcheck", expected_code=404)
-    assert response.status_code == 404, f"{response.status_code} - {response.text}"
-    assert "Not Found" in response.text, response.text
-
-    response = poll("/metrics", expected_code=404)
+@pytest.mark.parametrize("endpoint", ["/", "/healthcheck", "/metrics"])
+def test_healthcheck(endpoint: str) -> None:
+    # this test ensures the /healthcheck and the /metrics endpoints are hidden
+    response = poll(endpoint, expected_code=404)
     assert response.status_code == 404, f"{response.status_code} - {response.text}"
     assert "Not Found" in response.text, response.text
```