This repository has been archived by the owner on Feb 2, 2023. It is now read-only.

setting --max-read-records 0 reads zero records #99

Open
cardi opened this issue Oct 11, 2022 · 5 comments

Comments

@cardi commented Oct 11, 2022

Thank you for this great tool!

The documentation notes that setting --max-read-records to 0 will stop schema inference and set the type for all columns to strings:

json2parquet/Readme.md, lines 32 to 33 at commit 27dfa6a:

--max-read-records <MAX_READ_RECORDS>
The number of records to infer the schema from. All rows if not present. Setting max-read-records to zero will stop schema inference and all columns will be string typed

The behavior I'm seeing is that json2parquet will only read the number of records set by --max-read-records.

Is there another way to stop schema inference?

Some examples demonstrating the behavior:

$ cat foo.json
{"key1":"value1"}
{"key2":"value2"}
{"key3":"value3"}
{"key4":"value4"}
{"key5":"value5"}

$ ./json2parquet -V
json2parquet 0.6.0

$ ./json2parquet --max-read-records 0 foo.json foo.parquet -p
Schema:
{
  "fields": []
}

$ du -h --apparent-size foo.parquet
183     foo.parquet

$ ./json2parquet --max-read-records 2 foo.json foo.parquet -p
Schema:
{
  "fields": [
    {
      "name": "key1",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key2",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    }
  ]
}

$ du -h --apparent-size foo.parquet
751     foo.parquet

$ ./json2parquet foo.json foo.parquet -p
Schema:
{
  "fields": [
    {
      "name": "key1",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key2",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key3",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key4",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key5",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    }
  ]
}

$ du -h --apparent-size foo.parquet
1.6K    foo.parquet
@domoritz (Owner) commented
Huh, strange. max_read_records should only be used for schema inference. See:

match arrow::json::reader::infer_json_schema(&mut buf_reader, opts.max_read_records) {

Do you see a bug there? Could you try stepping through the code and also actually check what is in the parquet file?

@cardi (Author) commented Oct 11, 2022

Sure, let me take a look.

I don't have any experience with Rust, so it may take a little while for me to get familiar with the code.

@cardi (Author) commented Oct 11, 2022

[...] check what is in the parquet file?

$ cat generate_parquets.sh
#!/usr/bin/env bash
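
# generate a Parquet file from foo.json for each --max-read-records setting
# (and one run with no limit at all)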

./json2parquet --max-read-records 0 foo.json foo-0.parquet
./json2parquet --max-read-records 1 foo.json foo-1.parquet
./json2parquet --max-read-records 2 foo.json foo-2.parquet
./json2parquet --max-read-records 3 foo.json foo-3.parquet
./json2parquet --max-read-records 4 foo.json foo-4.parquet
./json2parquet                      foo.json foo-5.parquet

read_parquet.py:

#!/usr/bin/env python3

import pandas as pd
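
# read back each generated Parquet file (foo-0.parquet through foo-5.parquet)
# and print its contents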

for i in range(6):
  fn = "foo-" + str(i) + ".parquet"
  df = pd.read_parquet(fn)
  print(fn)
  print(df)
  print()

output:

$ ./read_parquet.py
foo-0.parquet
Empty DataFrame
Columns: []
Index: []

foo-1.parquet
     key1
0  value1
1    None
2    None
3    None
4    None

foo-2.parquet
     key1    key2
0  value1    None
1    None  value2
2    None    None
3    None    None
4    None    None

foo-3.parquet
     key1    key2    key3
0  value1    None    None
1    None  value2    None
2    None    None  value3
3    None    None    None
4    None    None    None

foo-4.parquet
     key1    key2    key3    key4
0  value1    None    None    None
1    None  value2    None    None
2    None    None  value3    None
3    None    None    None  value4
4    None    None    None    None

foo-5.parquet
     key1    key2    key3    key4    key5
0  value1    None    None    None    None
1    None  value2    None    None    None
2    None    None  value3    None    None
3    None    None    None  value4    None
4    None    None    None    None  value5

@cardi (Author) commented Oct 12, 2022

I did some digging into arrow::json::reader::infer_json_schema, and it seems like it needs to read some records to build a schema and does not build a generic schema of string types.

(It's possible that arrow has some way of building a generic schema of all Strings, but I'd expect that it would still have to read through the entire JSON file.)

The definition of arrow::json::reader::infer_json_schema calls arrow::json::reader::infer_json_schema_from_iterator, which might be where columns are coerced into Strings if they don't match certain types?

I've tested this with the following JSON file:

{"key1":"value1"}
{"key2":"value2"}
{"key3":"value3"}
{"key4":"value4"}
{"key5":"value5"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}

and with --max-read-records 1, the resulting Parquet is:

      key1
0   value1
1     None
2     None
3     None
4     None
5   value1
6   value1
7   value1
8   value1
9   value1
10  value1
11  value1
12  value1

For context, my use case for json2parquet is part of a workflow to convert network packet capture (.pcap) files --> JSON --> Parquet.

The converted JSON doesn't seem to have a consistent schema that can be inferred, e.g.,

Error: General("Error inferring schema: Json error: Expected scalar or scalar array JSON type, found: Object({ [...]

(It's very possible that I'm asking json2parquet to do something outside of a reasonable scope—I might have better success with selecting a few fields to extract from pcaps --> csv --> Parquet, which I'll try next.)

@cardi (Author) commented Oct 12, 2022

Oops—I might not have answered your earlier question.

My read of the code is that the following snippet stores the inferred schema in schema:

json2parquet/src/main.rs, lines 141 to 150 at commit 27dfa6a:

match arrow::json::reader::infer_json_schema(&mut buf_reader, opts.max_read_records) {
    Ok(schema) => {
        input.seek(SeekFrom::Start(0))?;
        Ok(schema)
    }
    Err(error) => Err(ParquetError::General(format!(
        "Error inferring schema: {}",
        error
    ))),
}

schema is then used later to build the JSON reader:

json2parquet/src/main.rs, lines 165 to 167 at commit 27dfa6a:

let schema_ref = Arc::new(schema);
let builder = ReaderBuilder::new().with_schema(schema_ref);
let reader = builder.build(input)?;

I think the issue is that arrow::json::reader::infer_json_schema does not return a "generic" schema of String types when max_read_records is set to 0; instead it returns an empty schema (and the resulting Parquet file is empty):

{
  "fields": []
}

This seems like the expected (if not documented) behavior for arrow::json::reader::infer_json_schema, so perhaps it's just the usage statement that needs to be updated. (But it would be great to be able to generate a "generic" schema of sorts!)
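
As a possible stopgap, here is a minimal sketch (not from json2parquet; the build_all_string_schema helper, the use of serde_json, and the assumption of flat newline-delimited JSON objects are all my own) of how such a "generic" all-string schema could be built by scanning the input once and then handed to ReaderBuilder::new().with_schema(...) in place of the inferred one:

use std::fs::File;
use std::io::{BufRead, BufReader};

use arrow::datatypes::{DataType, Field, Schema};

// Sketch: scan newline-delimited JSON once, collect every top-level key,
// and build a schema in which each column is a nullable Utf8 (string) field.
// Assumes flat objects and serde_json as a dependency; build_all_string_schema
// is a hypothetical helper, not part of json2parquet.
fn build_all_string_schema(path: &str) -> Result<Schema, Box<dyn std::error::Error>> {
    let reader = BufReader::new(File::open(path)?);
    let mut keys: Vec<String> = Vec::new();

    for line in reader.lines() {
        let line = line?;
        if line.trim().is_empty() {
            continue;
        }
        let value: serde_json::Value = serde_json::from_str(&line)?;
        if let Some(object) = value.as_object() {
            for key in object.keys() {
                if !keys.contains(key) {
                    keys.push(key.clone());
                }
            }
        }
    }

    let fields = keys
        .into_iter()
        .map(|name| Field::new(name.as_str(), DataType::Utf8, true))
        .collect::<Vec<_>>();
    Ok(Schema::new(fields))
}

This still requires a full pass over the file, as noted above, but it would avoid the empty schema produced when max_read_records is 0.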
