Prevent writing a file with a footer size exceeding the Int max value #2986

ConeyLiu · 2024-08-13T02:32:08Z

Describe the bug, including details regarding any error messages, version, and platform.

The footer size is assumed to fit in an int:

BytesUtils.writeIntLittleEndian(out, (int) (out.getPos() - footerIndex));

This forced cast is not safe. If the footer is larger than the Int max value, the cast silently overflows, the file is still written, and reading it back fails with a corrupted-file error:

java.lang.RuntimeException: corrupted file: the footer index is not within the file: 10200584257
        |       at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:571)
        |       at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:799)
        |       at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:666)
        |       at org.apache.iceberg.parquet.ReadConf.newReader(ReadConf.java:240)
        |       at org.apache.iceberg.parquet.ReadConf.<init>(ReadConf.java:81)
        |       at org.apache.iceberg.parquet.VectorizedParquetReader.init(VectorizedParquetReader.java:90)
        |       at org.apache.iceberg.parquet.VectorizedParquetReader.iterator(VectorizedParquetReader.java:99)
        |       at org.apache.iceberg.spark.source.BatchDataReader.open(BatchDataReader.java:195)
        |       at org.apache.iceberg.spark.source.BatchDataReader.open(BatchDataReader.java:49)
        |       at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:150)
        |       at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:119)
        |       at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:156)
        |       at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:63)
        |       at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:63)
        |       at scala.Option.exists(Option.scala:376)
        |       at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
        |       at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:97)
        |       at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
        |       at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
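To make the failure mode concrete, here is a minimal sketch of the overflow and of one way a writer could refuse to emit such a footer. The class and method names (`FooterSizeCheck`, `checkedFooterSize`) and the exception type are hypothetical and are not taken from parquet-java or from the eventual fix; the sketch only assumes that the footer size is computed as the difference of two `long` stream positions, as in the line quoted above.

```java
// Hypothetical helper, not part of parquet-java.
final class FooterSizeCheck {

    // Returns the footer size as an int, or fails loudly instead of
    // letting a narrowing cast wrap around to a negative value.
    static int checkedFooterSize(long footerIndex, long currentPos) {
        long size = currentPos - footerIndex;
        if (size < 0 || size > Integer.MAX_VALUE) {
            throw new IllegalStateException(
                "Footer size " + size + " does not fit in a signed 32-bit int");
        }
        return (int) size;
    }

    public static void main(String[] args) {
        long tooBig = Integer.MAX_VALUE + 1L;

        // The unchecked cast keeps only the low 32 bits and wraps around.
        System.out.println((int) tooBig); // prints -2147483648

        // The checked version rejects the same value.
        try {
            checkedFooterSize(0L, tooBig);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

`Math.toIntExact(long)` would give a similar guarantee with an `ArithmeticException`; the explicit check is used here only so the error message can name the footer.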

Component(s)

No response


dylanburati · 2024-08-26T00:33:01Z

I have the same issue: a file is reported as corrupted because this field overflowed. The file was created with the Rust parquet crate, which uses an unsigned int for this field, and it is readable with pyarrow. I'm wondering whether this specific field could be treated as unsigned in Java as well, since the format specification doesn't seem to describe it as an i32.

Using parquet-cli 1.14.1:

$ tail -c 64 ~/Downloads/enwiki/20240620/enwiki_20240620.parquet | xxd -g 4                                                                                                                        
00000000: 41414141 41414141 41454141 41414141  AAAAAAAAAEAAAAAA                                                                                                                                    
00000010: 67414141 476c6b41 41413d00 18197061  gAAAGlkAAA=...pa                                                                                                                                    
00000020: 72717565 742d7273 20766572 73696f6e  rquet-rs version                                                                                                                                    
00000030: 2033342e 302e3000 e755eb8a 50415231   34.0.0..U..PAR1                                                                                                                                    

$ parquet pages ~/Downloads/enwiki/20240620/enwiki_20240620.parquet                                                                                                                                
Unknown error                                   
java.lang.RuntimeException: corrupted file: the footer index is not within the file: 39975304334                                                                                                   
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:608)                                                                                                      
        at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:902)                                                                                                          
        at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:659)                                                                                                            
        at org.apache.parquet.cli.commands.ShowPagesCommand.run(ShowPagesCommand.java:93)                                                                                                          
        at org.apache.parquet.cli.Main.run(Main.java:163)                                        
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)                                                                                                                               
        at org.apache.parquet.cli.Main.main(Main.java:191)                                                                                                                                         

$ python -c "print($(stat -c %s ~/Downloads/enwiki/20240620/enwiki_20240620.parquet) - 8 - (-0x10000_0000 + 0x8aeb_55e7))"                                                                         
39975304334                                     

$ python -c 'import pyarrow.parquet as pq; f = pq.ParquetFile("~/Downloads/enwiki/20240620/enwiki_20240620.parquet"); print(f.metadata)'
<pyarrow._parquet.FileMetaData object at 0x729a06892a70>
  created_by: parquet-rs version 34.0.0
  num_columns: 6
  num_rows: 23802888
  num_row_groups: 238062
  format_version: 1.0
  serialized_size: 2330678759
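The arithmetic above can also be sketched in Java. The last 8 bytes of the file are the 4-byte little-endian footer length followed by the `PAR1` magic, so the bytes `e7 55 eb 8a` from the `xxd` dump decode to a negative number when read as a signed int and to pyarrow's `serialized_size` when read as unsigned. The class and method names below (`FooterLengthDemo`, `readSigned`, `readUnsigned`, `footerIndex`) are hypothetical and only illustrate the idea; this is not how `ParquetFileReader` is written.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical demo class, not part of parquet-java.
final class FooterLengthDemo {

    // Reads the 4 footer-length bytes as a signed little-endian int.
    static int readSigned(byte[] littleEndian) {
        return ByteBuffer.wrap(littleEndian).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    // Reinterprets the same 32 bits as an unsigned value held in a long.
    static long readUnsigned(byte[] littleEndian) {
        return Integer.toUnsignedLong(readSigned(littleEndian));
    }

    // Footer start = file length - 4 (length field) - 4 (magic) - footer length.
    static long footerIndex(long fileLength, long footerLength) {
        return fileLength - 8 - footerLength;
    }

    public static void main(String[] args) {
        // Footer-length bytes taken from the xxd output above.
        byte[] tail = {(byte) 0xe7, 0x55, (byte) 0xeb, (byte) 0x8a};

        System.out.println(readSigned(tail));   // prints -1964288537
        System.out.println(readUnsigned(tail)); // prints 2330678759
    }
}
```

With the signed value, `footerIndex` lands past the end of the file, which is the condition the reader reports as "the footer index is not within the file"; with the unsigned value it lands inside the file.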

ConeyLiu · 2024-08-27T02:17:57Z

Sounds reasonable, let me investigate it.


ConeyLiu added the Type: bug label Aug 13, 2024

ConeyLiu mentioned this issue Aug 13, 2024

GH-2986: Fails the file writing when footer size exceeds int max value #2987

Merged

wgtmac pushed a commit that referenced this issue Aug 29, 2024

GH-2986: Fails the file writing when footer size exceeds int max value (#2987) fafd9b0

wgtmac closed this as completed in #2987 Aug 29, 2024
