Parquet uses row group row count if missing from header #13712
Merged
Description
While investigating this issue I noticed that the provided file reports 0 rows in its header. This caused cudf's parquet reader to fail to read the file, although other tools such as parq and parquet-tools had no trouble with it. This change sums the row counts of the file's row groups and complains loudly if that total differs from the header's count, unless the header count is 0, in which case the row-group total is used. This allows us to read the data inside this file. Note that the file will not yet be parsed properly as a list of structs; that will be fixed in another PR. I didn't add a test, since this is the only file I have seen with this issue and cudf can't read it yet. A test for reading this file, which will also cover this change, will be added with the PR for that issue.
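
The row-count rule described above can be illustrated with a minimal Python sketch. It is not the code from this PR (cudf's reader is C++); the function name `resolve_num_rows` and its parameters are hypothetical, and it only assumes that the header's row count and the per-row-group row counts have already been read from the file metadata.

```python
from typing import Sequence


def resolve_num_rows(header_num_rows: int, row_group_num_rows: Sequence[int]) -> int:
    """Pick the row count to trust for a Parquet file (illustrative only).

    header_num_rows: row count stored in the file-level header.
    row_group_num_rows: row count stored in each row group.
    """
    # Count up the rows across all row groups.
    row_group_total = sum(row_group_num_rows)

    # A header count of 0 is treated as "missing": use the row-group total.
    if header_num_rows == 0:
        return row_group_total

    # Any other disagreement is reported loudly instead of being ignored.
    if header_num_rows != row_group_total:
        raise ValueError(
            f"header reports {header_num_rows} rows, "
            f"but row groups contain {row_group_total}"
        )

    return header_num_rows
```

With this rule, a file whose header says 0 rows but whose row groups hold 3 and 4 rows is read as having 7 rows, while a header of 5 for the same row groups raises an error.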