dataset-compatible-libraries gives an UnexpectedError for some datasets #2607

Closed
severo opened this issue Mar 18, 2024 · 4 comments
Labels
blocked-by-upstream (The issue must be fixed in a dependency), bug (Something isn't working), P1 (Not as needed as P0, but still important/wanted)

Comments

@severo
Collaborator

severo commented Mar 18, 2024

On https://huggingface.co/datasets/HackerNoon/tech-company-news-data-dump, the step dataset-compatible-libraries gives:

{
  "error": "Dataset at 'hf://datasets/HackerNoon/tech-company-news-data-dump' doesn't contain data files matching the patterns for config 'default', check `data_files` and `data_fir` parameters in the `configs` YAML field in README.md. ",
  "cause_exception": "EmptyDatasetError",
  "cause_message": "Dataset at 'hf://datasets/HackerNoon/tech-company-news-data-dump' doesn't contain data files matching the patterns for config 'default', check `data_files` and `data_fir` parameters in the `configs` YAML field in README.md. ",
  "cause_traceback": [
    "Traceback (most recent call last):\n",
    " File \"/src/services/worker/.venv/lib/python3.9/site-packages/datasets/load.py\", line 622, in create_builder_configs_from_metadata_configs\n else get_data_patterns(config_base_path)\n",
    " File \"/src/services/worker/.venv/lib/python3.9/site-packages/datasets/data_files.py\", line 485, in get_data_patterns\n raise EmptyDatasetError(f\"The directory at {base_path} doesn't contain any data files\") from None\n",
    "datasets.data_files.EmptyDatasetError: The directory at hf://datasets/HackerNoon/tech-company-news-data-dump doesn't contain any data files\n",
    "\nThe above exception was the direct cause of the following exception:\n\n",
    "Traceback (most recent call last):\n",
    " File \"/src/services/worker/src/worker/job_manager.py\", line 125, in process\n job_result = self.job_runner.compute()\n",
    " File \"/src/services/worker/src/worker/job_runners/dataset/compatible_libraries.py\", line 632, in compute\n response_content = compute_compatible_libraries_response(\n",
    " File \"/src/services/worker/src/worker/job_runners/dataset/compatible_libraries.py\", line 619, in compute_compatible_libraries_response\n compatible_library = get_compatible_library_for_builder[builder_name](dataset, hf_token)\n",
    " File \"/src/services/worker/src/worker/job_runners/dataset/compatible_libraries.py\", line 416, in get_compatible_libraries_for_csv\n builder_configs = get_builder_configs_with_simplified_data_files(dataset, module_name=\"csv\", hf_token=hf_token)\n",
    " File \"/src/services/worker/src/worker/job_runners/dataset/compatible_libraries.py\", line 107, in get_builder_configs_with_simplified_data_files\n builder_configs, _ = create_builder_configs_from_metadata_configs(\n",
    " File \"/src/services/worker/.venv/lib/python3.9/site-packages/datasets/load.py\", line 629, in create_builder_configs_from_metadata_configs\n raise EmptyDatasetError(\n",
    "datasets.data_files.EmptyDatasetError: Dataset at 'hf://datasets/HackerNoon/tech-company-news-data-dump' doesn't contain data files matching the patterns for config 'default', check `data_files` and `data_fir` parameters in the `configs` YAML field in README.md. \n"
  ]
}
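
To make the chained traceback above easier to read, here is a minimal sketch that pulls out the exception line of each link in the chain. The helper name `exception_chain` is hypothetical (it is not part of the worker code), and the `cause_traceback` list is an abridged copy of the response above that keeps one frame per traceback.

```python
# Marker that Python prints between chained exceptions ("raise ... from ...").
CHAIN_MARKER = "The above exception was the direct cause of the following exception:"


def exception_chain(cause_traceback: list) -> list:
    """Return the final (exception) line of each traceback in the chain, root cause first.

    Hypothetical helper for illustration only.
    """
    text = "".join(cause_traceback)
    chain = []
    for segment in text.split(CHAIN_MARKER):
        # Keep non-empty lines; the last one in each segment is the exception message.
        lines = [line.strip() for line in segment.splitlines() if line.strip()]
        if lines:
            chain.append(lines[-1])
    return chain


# Abridged copy of "cause_traceback" from the response above.
cause_traceback = [
    "Traceback (most recent call last):\n",
    ' File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/load.py", line 622, in create_builder_configs_from_metadata_configs\n else get_data_patterns(config_base_path)\n',
    "datasets.data_files.EmptyDatasetError: The directory at hf://datasets/HackerNoon/tech-company-news-data-dump doesn't contain any data files\n",
    "\nThe above exception was the direct cause of the following exception:\n\n",
    "Traceback (most recent call last):\n",
    ' File "/src/services/worker/src/worker/job_manager.py", line 125, in process\n job_result = self.job_runner.compute()\n',
    "datasets.data_files.EmptyDatasetError: Dataset at 'hf://datasets/HackerNoon/tech-company-news-data-dump' doesn't contain data files matching the patterns for config 'default', check `data_files` and `data_fir` parameters in the `configs` YAML field in README.md. \n",
]

chain = exception_chain(cause_traceback)
for depth, message in enumerate(chain):
    print(f"[{depth}] {message}")
```

The first entry is the root cause raised in `get_data_patterns` (no data files found under the base path). The second is the error re-raised by `create_builder_configs_from_metadata_configs`, which is what the job reports.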

Some ideas to explore: the dataset is gated, and its Parquet export is only partial.

@severo added the bug and P1 labels on Mar 18, 2024
@severo
Collaborator Author

severo commented Mar 18, 2024

cc @lhoestq for viz

@lhoestq self-assigned this on Mar 19, 2024
@lhoestq added the blocked-by-upstream label on Mar 19, 2024
@lhoestq
Member

lhoestq commented Mar 19, 2024

Opened huggingface/datasets#6742 with a fix. We'll have to update datasets once it's released.

@AndreaFrancis
Contributor

Currently, there are 1959 records with this issue. I will refresh them now that huggingface/datasets#6742 and #2739 have been merged.

@AndreaFrancis
Contributor

Done. All the entries have been fixed, so I'm closing this issue.

datasets_server_cache> db.cachedResponsesBlue.countDocuments({kind:"dataset-compatible-libraries", http_status:{$ne:200}, "details.cause_exception": "EmptyDatasetError","details.copied_from_artifact":{$exists:false}})
0
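
For anyone who prefers to run the same check from Python, here is a rough sketch of the shell query above. The filter is copied from that query; the function names are hypothetical, and the connection URI in the comments is an assumption. `count_documents` is the standard PyMongo collection method.

```python
from typing import Any, Dict


def remaining_errors_filter() -> Dict[str, Any]:
    """Same filter as the shell query above, written as a Python dict."""
    return {
        "kind": "dataset-compatible-libraries",
        "http_status": {"$ne": 200},
        "details.cause_exception": "EmptyDatasetError",
        "details.copied_from_artifact": {"$exists": False},
    }


def count_remaining_errors(collection) -> int:
    """Count cache entries that still carry the error.

    `collection` is expected to behave like a pymongo Collection,
    i.e. to expose `count_documents(filter)`.
    """
    return collection.count_documents(remaining_errors_filter())


# Usage sketch (not executed here; the URI is an assumption):
# from pymongo import MongoClient
# client = MongoClient("mongodb://localhost:27017")
# collection = client["datasets_server_cache"]["cachedResponsesBlue"]
# print(count_remaining_errors(collection))
```

Keeping the filter in its own function makes it easy to reuse for a `find` call if the affected datasets need to be listed before a refresh.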
