Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet uses row group row count if missing from header #13712

Merged

Conversation

hyperbolic2346
Copy link
Contributor

Description

When investigating this issue I noticed that the file provided has 0 rows in the header. This caused cudf's parquet reader to fail at reading the file, but other tools such as parq and parquet-tools had no issues reading the file. This change counts up the number of rows in the row groups of the file and will complain loudly if the number differ, but not if the main header is 0. This allows us to properly read the data inside this file. Note that it will not properly parse it as a list of structs yet, that will be fixed in another PR. I didn't add a test since this is the only file I have seen with this issue and we can't read it yet in cudf. A test will be added for reading this file, which will test this change as well, with the PR for that issue.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@hyperbolic2346 hyperbolic2346 added bug Something isn't working 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue tech debt non-breaking Non-breaking change labels Jul 17, 2023
@hyperbolic2346 hyperbolic2346 self-assigned this Jul 17, 2023
@hyperbolic2346 hyperbolic2346 requested a review from a team as a code owner July 17, 2023 22:31
Copy link
Contributor

@karthikeyann karthikeyann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@hyperbolic2346
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 9fe1270 into rapidsai:branch-23.08 Jul 18, 2023
@hyperbolic2346 hyperbolic2346 deleted the mwilson/parquet_num_rows_fix branch July 18, 2023 07:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants