Prevent writing a file with a footer size exceeding the Int max value #2986

Closed
ConeyLiu opened this issue Aug 13, 2024 · 2 comments · Fixed by #2987

Comments

@ConeyLiu
Contributor

Describe the bug, including details regarding any error messages, version, and platform.

The footer size is assumed to fit in an int:

BytesUtils.writeIntLittleEndian(out, (int) (out.getPos() - footerIndex));

This forced cast is not safe. For example, if the footer is larger than Integer.MAX_VALUE bytes, the cast silently overflows, the wrong length is written, and the resulting file is reported as corrupted when read:

java.lang.RuntimeException: corrupted file: the footer index is not within the file: 10200584257
        at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:571)
        at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:799)
        at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:666)
        at org.apache.iceberg.parquet.ReadConf.newReader(ReadConf.java:240)
        at org.apache.iceberg.parquet.ReadConf.<init>(ReadConf.java:81)
        at org.apache.iceberg.parquet.VectorizedParquetReader.init(VectorizedParquetReader.java:90)
        at org.apache.iceberg.parquet.VectorizedParquetReader.iterator(VectorizedParquetReader.java:99)
        at org.apache.iceberg.spark.source.BatchDataReader.open(BatchDataReader.java:195)
        at org.apache.iceberg.spark.source.BatchDataReader.open(BatchDataReader.java:49)
        at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:150)
        at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:119)
        at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:156)
        at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:63)
        at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:63)
        at scala.Option.exists(Option.scala:376)
        at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
        at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:97)
        at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
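
To make the failure mode concrete, here is a minimal Java sketch of the narrowing cast and of one way a writer could refuse to emit an oversized footer length. The class and method names (`FooterLengthGuard`, `encodeFooterLength`) are hypothetical and are not part of parquet-java, the 3,000,000,000-byte length is an arbitrary illustrative value, and the actual fix in #2987 may look different.

```java
public class FooterLengthGuard {

    // Hypothetical helper (not the parquet-java API): encodes the footer length
    // as 4 little-endian bytes, rejecting values that do not fit in a signed int.
    public static byte[] encodeFooterLength(long footerLength) {
        if (footerLength < 0 || footerLength > Integer.MAX_VALUE) {
            throw new IllegalArgumentException(
                "footer length " + footerLength + " does not fit in a signed 32-bit int");
        }
        int v = (int) footerLength; // safe: the range was checked above
        return new byte[] {
            (byte) v, (byte) (v >>> 8), (byte) (v >>> 16), (byte) (v >>> 24)
        };
    }

    public static void main(String[] args) {
        long footerLength = 3_000_000_000L; // illustrative value above Integer.MAX_VALUE

        // The unchecked cast keeps only the low 32 bits and wraps to a negative int.
        System.out.println((int) footerLength); // prints -1294967296

        // The guarded version fails at write time instead of producing a bad file.
        try {
            encodeFooterLength(footerLength);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing at write time trades a late "corrupted file" error on the reader side for an early, explicit error on the writer side.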

Component(s)

No response

@dylanburati

dylanburati commented Aug 26, 2024

I have the same issue: a file is reported as corrupted because this field overflows. The file was created with the Rust parquet crate, which uses unsigned ints for this field, and it is readable with pyarrow. I'm wondering whether this specific field could be treated as unsigned in Java as well, since the format specification doesn't seem to describe it as an i32.

Using parquet-cli 1.14.1:

$ tail -c 64 ~/Downloads/enwiki/20240620/enwiki_20240620.parquet | xxd -g 4                                                                                                                        
00000000: 41414141 41414141 41454141 41414141  AAAAAAAAAEAAAAAA                                                                                                                                    
00000010: 67414141 476c6b41 41413d00 18197061  gAAAGlkAAA=...pa                                                                                                                                    
00000020: 72717565 742d7273 20766572 73696f6e  rquet-rs version                                                                                                                                    
00000030: 2033342e 302e3000 e755eb8a 50415231   34.0.0..U..PAR1                                                                                                                                    

$ parquet pages ~/Downloads/enwiki/20240620/enwiki_20240620.parquet                                                                                                                                
Unknown error                                   
java.lang.RuntimeException: corrupted file: the footer index is not within the file: 39975304334                                                                                                   
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:608)                                                                                                      
        at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:902)                                                                                                          
        at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:659)                                                                                                            
        at org.apache.parquet.cli.commands.ShowPagesCommand.run(ShowPagesCommand.java:93)                                                                                                          
        at org.apache.parquet.cli.Main.run(Main.java:163)                                        
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)                                                                                                                               
        at org.apache.parquet.cli.Main.main(Main.java:191)                                                                                                                                         

$ python -c "print($(stat -c %s ~/Downloads/enwiki/20240620/enwiki_20240620.parquet) - 8 - (-0x10000_0000 + 0x8aeb_55e7))"                                                                         
39975304334                                     

$ python -c 'import pyarrow.parquet as pq; f = pq.ParquetFile("~/Downloads/enwiki/20240620/enwiki_20240620.parquet"); print(f.metadata)'
<pyarrow._parquet.FileMetaData object at 0x729a06892a70>
  created_by: parquet-rs version 34.0.0
  num_columns: 6
  num_rows: 23802888
  num_row_groups: 238062
  format_version: 1.0
  serialized_size: 2330678759
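
The arithmetic in the shell session above can be reproduced in Java. This is only a sketch of the unsigned interpretation suggested in this comment, not parquet-java's actual reader code: the class and method names are hypothetical, the four length bytes are the ones shown before `PAR1` in the `xxd` output, and the file length of 38011015805 bytes is an assumption derived by inverting the `python -c` expression (it is not printed anywhere in the session).

```java
public class UnsignedFooterLength {

    // Decodes 4 little-endian bytes into a Java int; the result is negative
    // whenever the highest bit of the last byte is set.
    public static int readIntLittleEndian(byte[] b, int off) {
        return (b[off] & 0xFF)
            | ((b[off + 1] & 0xFF) << 8)
            | ((b[off + 2] & 0xFF) << 16)
            | ((b[off + 3] & 0xFF) << 24);
    }

    // Footer start = file length - 8 trailing bytes (4-byte length + "PAR1") - footer length.
    public static long footerIndex(long fileLength, long footerLength) {
        return fileLength - 8 - footerLength;
    }

    public static void main(String[] args) {
        // Length bytes from the xxd dump: e7 55 eb 8a, immediately before "PAR1".
        byte[] lengthBytes = {(byte) 0xe7, 0x55, (byte) 0xeb, (byte) 0x8a};
        long fileLength = 38011015805L; // assumption: derived from the python one-liner

        int signed = readIntLittleEndian(lengthBytes, 0);
        long unsigned = Integer.toUnsignedLong(signed);

        System.out.println(signed);   // prints -1964288537
        System.out.println(unsigned); // prints 2330678759, matching serialized_size

        // Signed interpretation: index lands past the end of the file (the reported error).
        System.out.println(footerIndex(fileLength, signed));   // prints 39975304334
        // Unsigned interpretation: index falls inside the file.
        System.out.println(footerIndex(fileLength, unsigned)); // prints 35680337038
    }
}
```

`Integer.toUnsignedLong` reinterprets the same 32 bits without changing what is stored on disk, so a reader could accept such files even if the writer side keeps rejecting footers above `Integer.MAX_VALUE`.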

@ConeyLiu
Contributor Author

Sounds reasonable, let me investigate it.
