
DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k #3040

Closed
pan3793 opened this issue Nov 3, 2024 · 0 comments · Fixed by #3041


pan3793 commented Nov 3, 2024

Describe the bug, including details regarding any error messages, version, and platform.

Background

I received data loss reports after upgrading our internal Spark's Parquet dependency from 1.13.1 to 1.14.3. After some experiments, I believe this is a bug on the Parquet side; it can be worked around by disabling spark.sql.parquet.filterPushdown.
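For reference, a minimal sketch of the workaround (assuming Spark 3.x; the class name and local master are only for illustration, and the same property can also be passed via --conf at submit time):

    import org.apache.spark.sql.SparkSession;

    public final class DisableParquetFilterPushdown {
      public static void main(String[] args) {
        // Illustration only: disable Parquet filter pushdown so the dictionary-based
        // row-group pruning (DictionaryFilter.canDrop) is never consulted.
        SparkSession spark = SparkSession.builder()
            .appName("parquet-filter-pushdown-workaround")
            .master("local[*]")
            .getOrCreate();
        spark.conf().set("spark.sql.parquet.filterPushdown", "false");
        // ... run the affected read queries here ...
        spark.stop();
      }
    }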

Analysis

After some debugging, I believe the issue was introduced by PARQUET-2432 (#1278).

The issue is that during the evaluation of DictionaryFilter.canDrop (which happens when reading a column encoded with PLAIN_DICTIONARY while predicates are pushed down), only the first 8 KiB of the dictionary page is copied when the dictionary size exceeds 8 KiB:

return new DictionaryPage(dict.getBytes().copy(releaser), dict.getDictionarySize(), dict.getEncoding());

The correct data:

...
|00001fd0| 34 32 35 34 30 39 0a 00 00 00 37 32 39 33 39 38 |425409....729398|
|00001fe0| 39 35 33 30 0a 00 00 00 39 38 32 30 31 36 39 36 |9530....98201696|
|00001ff0| 34 34 0a 00 00 00 37 33 39 31 32 34 37 38 30 36 |44....7391247806|
|00002000| 0b 00 00 00 31 32 35 32 35 31 36 31 34 31 36 0a |....12525161416.|
|00002010| 00 00 00 38 37 38 35 35 34 34 36 34 35 0b 00 00 |...8785544645...|
|00002020| 00 31 32 32 38 30 38 37 35 39 30 32 0b 00 00 00 |.12280875902....|
...

The copied data:

...
|00001fd0| 34 32 35 34 30 39 0a 00 00 00 37 32 39 33 39 38 |425409....729398|
|00001fe0| 39 35 33 30 0a 00 00 00 39 38 32 30 31 36 39 36 |9530....98201696|
|00001ff0| 34 34 0a 00 00 00 37 33 39 31 32 34 37 38 30 36 |44....7391247806|
|00002000| 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
|00002010| 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
|00002020| 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
...

The root cause is in the JDK's Channels$ReadableByteChannelImpl (the channel returned by Channels.newChannel(InputStream)): it may not read fully if the underlying InputStream's available() method always returns 0, because the read loop blocks at most once:

    private static class ReadableByteChannelImpl
        extends AbstractInterruptibleChannel    // Not really interruptible
        implements ReadableByteChannel
    {
     ...
        private static final int TRANSFER_SIZE = 8192;

        public int read(ByteBuffer dst) throws IOException {
            if (!isOpen()) {
                throw new ClosedChannelException();
            }

            int len = dst.remaining();
            int totalRead = 0;
            int bytesRead = 0;
            synchronized (readLock) {
                while (totalRead < len) {
                    int bytesToRead = Math.min((len - totalRead),
                                               TRANSFER_SIZE);
                    if (buf.length < bytesToRead)
                        buf = new byte[bytesToRead];
                    if ((totalRead > 0) && !(in.available() > 0))
                        break; // block at most once
                    try {
                        begin();
                        bytesRead = in.read(buf, 0, bytesToRead);
                    } finally {
                        end(bytesRead > 0);
                    }
                    if (bytesRead < 0)
                        break;
                    else
                        totalRead += bytesRead;
                    dst.put(buf, 0, bytesRead);
                }
                if ((bytesRead < 0) && (totalRead == 0))
                    return -1;

                return totalRead;
            }
        }
        ...
    }
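As a standalone illustration (not part of the original report), the short read is easy to reproduce: wrap an InputStream whose available() always returns 0, and a single read() stops after TRANSFER_SIZE (8192) bytes even though more data is buffered. Class and variable names below are only for the demo:

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.Channels;
    import java.nio.channels.ReadableByteChannel;

    public final class ShortReadDemo {
      public static void main(String[] args) throws IOException {
        byte[] data = new byte[32 * 1024];
        // Mimic streams that cannot report how much data is still buffered.
        ByteArrayInputStream in = new ByteArrayInputStream(data) {
          @Override
          public synchronized int available() {
            return 0;
          }
        };
        ReadableByteChannel channel = Channels.newChannel(in);
        ByteBuffer dst = ByteBuffer.allocate(data.length);
        int read = channel.read(dst);
        // Prints 8192 on OpenJDK: the "block at most once" branch exits the loop early.
        System.out.println("single read() returned " + read + " of " + data.length + " bytes");
      }
    }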

Component(s)

Core
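For completeness, a defensive pattern that avoids relying on a single read() call is to loop until the destination buffer is filled or EOF is reached. This is only a sketch of the idea; the helper below is hypothetical and not the actual patch in #3041:

    import java.io.EOFException;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.ReadableByteChannel;

    final class ChannelUtil {
      private ChannelUtil() {}

      // Keep reading until dst is full, so a single short read cannot silently
      // truncate the data; fail loudly if the channel ends too early.
      static void readFully(ReadableByteChannel channel, ByteBuffer dst) throws IOException {
        while (dst.hasRemaining()) {
          if (channel.read(dst) < 0) {
            throw new EOFException("Stream ended with " + dst.remaining() + " bytes still expected");
          }
        }
      }
    }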

pan3793 added a commit to pan3793/parquet-java that referenced this issue on Nov 3, 2024

Fokko added a commit that referenced this issue on Nov 4, 2024
    Noticed this while reviewing #3040

RussellSpitzer added commits to RussellSpitzer/iceberg and apache/iceberg that referenced this issue on Nov 4, 2024
    This reverts commit b8c2b20.
    apache/parquet-java#3040
    Was discovered by @pan3793 in Parquet 1.14.(0,1,2,3).

Fokko pushed commits that referenced this issue on Nov 6, 2024
    GH-3040: DictionaryFilter.canDrop may return false positive result when dict size exceeds 8k (#3041)
    * style
    * check bytesRead
    * import
Fokko added commits that referenced this issue on Nov 7, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this issue on Dec 23, 2024