DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k #3040
Describe the bug, including details regarding any error messages, version, and platform.
Background
I received some data loss reports after upgrading our internal Spark's Parquet dependency from 1.13.1 to 1.14.3. After some experiments, I believe this is a bug on the Parquet side; it can be worked around by disabling `spark.sql.parquet.filterPushdown`.
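As a minimal sketch of the workaround (the config key is the standard Spark SQL one; the class name and setup around it are illustrative, not from this report), the setting can be disabled per session through Spark's Java API:

```java
import org.apache.spark.sql.SparkSession;

public class DisablePushdown {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();
    // Workaround sketch: turn off Parquet filter pushdown for this session so
    // the affected dictionary-filter path is not consulted (at the cost of
    // scanning more data).
    spark.conf().set("spark.sql.parquet.filterPushdown", "false");
  }
}
```

The same effect can be had in SQL with `SET spark.sql.parquet.filterPushdown=false;`.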
Analysis
With some debugging, I think the issue was introduced by PARQUET-2432 (#1278).
The issue is: during the evaluation of `DictionaryFilter.canDrop` (this happens when reading a column that has `PLAIN_DICTIONARY` encoding with pushed-down predicates), when the dict size exceeds 8k, only the head 8k is copied:

parquet-java/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/DictionaryPageReader.java, line 113 in 274dc51

[screenshot: the correct data]
[screenshot: the copied data]
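To make the failure mode concrete, here is a self-contained sketch (the class, values, and simplified `canDrop` are hypothetical stand-ins, not the actual Parquet implementation) of how a dictionary that lost its tail turns a "can we drop this row group?" check into a false positive:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TruncatedDictDemo {
  // Hypothetical stand-in for the core idea of DictionaryFilter.canDrop:
  // drop the row group when the searched value is absent from the dictionary.
  static boolean canDrop(Set<String> dictValues, String predicateValue) {
    return !dictValues.contains(predicateValue);
  }

  public static void main(String[] args) {
    Set<String> fullDict = new HashSet<>(Arrays.asList("a", "b", "z"));
    // Simulate the bug: only the head of the dictionary page was copied,
    // so values near the end of the page are missing.
    Set<String> truncatedDict = new HashSet<>(Arrays.asList("a", "b"));

    System.out.println(canDrop(fullDict, "z"));      // false: keep the row group
    System.out.println(canDrop(truncatedDict, "z")); // true: wrongly dropped -> data loss
  }
}
```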
The root cause is that

parquet-java/parquet-common/src/main/java/org/apache/parquet/bytes/BytesInput.java, line 379 in 274dc51

may not read fully if the underlying `InputStream`'s `available()` method always returns 0.
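As a general illustration of that `InputStream` contract (this is not the actual `BytesInput` code; `StingyStream` and its 8-byte cap are invented for the demo), a single `read(byte[])` call may legally return fewer bytes than requested, and `available()` is only a hint that may always be 0, so copy logic must loop until the buffer is filled, e.g. via `DataInputStream.readFully`:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class PartialReadDemo {
  // Wrapper that returns at most 8 bytes per read() call and reports
  // available() == 0, mimicking e.g. a decompressing or remote stream.
  static class StingyStream extends FilterInputStream {
    StingyStream(InputStream in) { super(in); }
    @Override public int read(byte[] b, int off, int len) throws IOException {
      return super.read(b, off, Math.min(len, 8));
    }
    @Override public int available() { return 0; }
  }

  public static void main(String[] args) throws IOException {
    byte[] data = new byte[100];

    // Fragile pattern: a single read() call may stop after 8 bytes here,
    // silently leaving the rest of the destination buffer untouched.
    byte[] once = new byte[data.length];
    int n = new StingyStream(new ByteArrayInputStream(data)).read(once);
    System.out.println("single read() copied " + n + " of " + data.length);

    // Robust pattern: loop until the requested length is filled.
    byte[] full = new byte[data.length];
    new DataInputStream(new StingyStream(new ByteArrayInputStream(data)))
        .readFully(full); // throws EOFException if the stream ends early
    System.out.println("readFully copied " + full.length + " bytes");
  }
}
```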
Component(s)

Core