[BUG] delta_lake_delete_test.py failed assertion [DATAGEN_SEED=1701225104, IGNORE_ORDER... #9884
Comments
I cannot reproduce this, and AFAIK it has not been seen since it occurred. My guess is that the data was somehow spread across tasks differently between the two runs: the data was verified to be correct (i.e., all rows were accounted for, ignoring ordering) before the subsequent metadata equality check failed.
I found a way to reliably reproduce this.
As suspected, the table contents as a whole are correct, but some of the rows have been swizzled between the two output files across the CPU and GPU runs. AFAICT nothing is semantically wrong with the output produced by the GPU relative to the CPU, but it would be good to understand how we're reliably getting the rows crossed here. I suspect that for some odd reason the GPU run is getting a different set of input files per task than the CPU run does.
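To make that failure mode concrete, here is a minimal Python sketch (hypothetical file names and row values, not taken from the test) of how a whole-table data check can pass while a per-file metadata comparison fails:

```python
# Two runs write the same logical rows but split them across files
# differently; the overall row multiset matches while per-file
# metadata (e.g. row counts) does not.
from collections import Counter

cpu_files = {"f0.parquet": [1, 2, 3], "f1.parquet": [4, 5]}
gpu_files = {"f0.parquet": [1, 2], "f1.parquet": [3, 4, 5]}

all_cpu = Counter(r for rows in cpu_files.values() for r in rows)
all_gpu = Counter(r for rows in gpu_files.values() for r in rows)
assert all_cpu == all_gpu  # data check passes (ignoring order)

rows_per_file_cpu = {f: len(r) for f, r in cpu_files.items()}
rows_per_file_gpu = {f: len(r) for f, r in gpu_files.items()}
assert rows_per_file_cpu != rows_per_file_gpu  # metadata check fails
```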
Debugged why this is failing for certain datagen seeds. The problem can occur when a particular datagen seed causes two or more Parquet files within a table to be generated with the same file size. ext4 and most other Linux filesystems return a directory listing in an order influenced by the order in which the files were written to the directory, and it is not deterministic which files are written before others as tasks execute in parallel. The input files are sorted in descending order by file size, but when two or more files are the same size, the sorted ordering can differ between two directories that contain the same files but were created in a different order.

I do not see a way to really fix this other than to use a datagen seed that is known to produce files that can be deterministically sorted, as was done in #10009, or to change the test to not compare metadata. The latter would allow a lot of subtle bugs to slip through, so I think using a fixed datagen seed is the better route.

Closing this bug, as setting a fixed seed as done by #10009 is the long-term solution.
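A minimal sketch of the tie-breaking problem, using hypothetical file names and sizes: Python's `sorted` is stable, so files of equal size keep their listing order, and the listing order itself depends on write order.

```python
# Hypothetical (name, size) directory listings of the same table,
# produced by two runs whose parallel tasks finished in different orders.
listing_run_a = [("part-00000.parquet", 1024),
                 ("part-00001.parquet", 1024),
                 ("part-00002.parquet", 512)]
listing_run_b = [("part-00001.parquet", 1024),
                 ("part-00000.parquet", 1024),
                 ("part-00002.parquet", 512)]

def plan_order(listing):
    # Sort by file size descending only; the sort is stable, so
    # equal-size files remain in whatever order the listing returned.
    return [name for name, _size in sorted(listing, key=lambda f: -f[1])]

print(plan_order(listing_run_a))  # part-00000 comes before part-00001
print(plan_order(listing_run_b))  # part-00001 comes before part-00000

# A deterministic alternative would break ties on the file name:
#   sorted(listing, key=lambda f: (-f[1], f[0]))
```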
Describe the bug
First seen in rapids_integration-dev-github, build ID 863 (JDK 11 runtime + Spark 3.3.0).
Mismatched CPU and GPU output:
others
Steps/Code to reproduce bug
Please provide a list of steps or a code sample to reproduce the issue.
Avoid posting private or sensitive data.
Expected behavior
A clear and concise description of what you expected to happen.
Environment details (please complete the following information)
Additional context
Add any other context about the problem here.