This repository has been archived by the owner on Feb 2, 2023. It is now read-only.

setting --max-read-records 0 reads zero records #99

Open
cardi opened this issue Oct 11, 2022 · 5 comments

Comments

@cardi commented Oct 11, 2022

Thank you for this great tool!

The documentation notes that setting --max-read-records to 0 will stop schema inference and set the type for all columns to strings:

json2parquet/Readme.md, lines 32 to 33 at commit 27dfa6a:

--max-read-records <MAX_READ_RECORDS>
The number of records to infer the schema from. All rows if not present. Setting max-read-records to zero will stop schema inference and all columns will be string typed

The behavior I'm seeing is that json2parquet will only read the number of records set by --max-read-records.

Is there another way to stop schema inference?

Some examples demonstrating the behavior:

$ cat foo.json
{"key1":"value1"}
{"key2":"value2"}
{"key3":"value3"}
{"key4":"value4"}
{"key5":"value5"}

$ ./json2parquet -V
json2parquet 0.6.0

$ ./json2parquet --max-read-records 0 foo.json foo.parquet -p
Schema:
{
  "fields": []
}

$ du -h --apparent-size foo.parquet
183     foo.parquet

$ ./json2parquet --max-read-records 2 foo.json foo.parquet -p
Schema:
{
  "fields": [
    {
      "name": "key1",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key2",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    }
  ]
}

$ du -h --apparent-size foo.parquet
751     foo.parquet

$ ./json2parquet foo.json foo.parquet -p
Schema:
{
  "fields": [
    {
      "name": "key1",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key2",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key3",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key4",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    },
    {
      "name": "key5",
      "data_type": "Utf8",
      "nullable": true,
      "dict_id": 0,
      "dict_is_ordered": false
    }
  ]
}

$ du -h --apparent-size foo.parquet
1.6K    foo.parquet
@domoritz (Owner) commented
Huh, strange. max_read_records should only be used for schema inference. See:

match arrow::json::reader::infer_json_schema(&mut buf_reader, opts.max_read_records) {

Do you see a bug there? Could you try stepping through the code and also actually check what is in the parquet file?

@cardi (Author) commented Oct 11, 2022

Sure, let me take a look.

I don't have any experience with Rust, so it may take a little while for me to get familiar with the code.

@cardi (Author) commented Oct 11, 2022

[...] check what is in the parquet file?

$ cat generate_parquets.sh
#!/usr/bin/env bash
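
# generate a Parquet file from foo.json for each --max-read-records setting
# (and one run with no limit at all)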

./json2parquet --max-read-records 0 foo.json foo-0.parquet
./json2parquet --max-read-records 1 foo.json foo-1.parquet
./json2parquet --max-read-records 2 foo.json foo-2.parquet
./json2parquet --max-read-records 3 foo.json foo-3.parquet
./json2parquet --max-read-records 4 foo.json foo-4.parquet
./json2parquet                      foo.json foo-5.parquet

read_parquet.py:

#!/usr/bin/env python3

import pandas as pd
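
# read back each generated Parquet file (foo-0.parquet through foo-5.parquet)
# and print its contents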

for i in range(6):
  fn = "foo-" + str(i) + ".parquet"
  df = pd.read_parquet(fn)
  print(fn)
  print(df)
  print()

output:

$ ./read_parquet.py
foo-0.parquet
Empty DataFrame
Columns: []
Index: []

foo-1.parquet
     key1
0  value1
1    None
2    None
3    None
4    None

foo-2.parquet
     key1    key2
0  value1    None
1    None  value2
2    None    None
3    None    None
4    None    None

foo-3.parquet
     key1    key2    key3
0  value1    None    None
1    None  value2    None
2    None    None  value3
3    None    None    None
4    None    None    None

foo-4.parquet
     key1    key2    key3    key4
0  value1    None    None    None
1    None  value2    None    None
2    None    None  value3    None
3    None    None    None  value4
4    None    None    None    None

foo-5.parquet
     key1    key2    key3    key4    key5
0  value1    None    None    None    None
1    None  value2    None    None    None
2    None    None  value3    None    None
3    None    None    None  value4    None
4    None    None    None    None  value5

@cardi (Author) commented Oct 12, 2022

I did some digging into arrow::json::reader::infer_json_schema, and it seems like it needs to read some records to build a schema and does not build a generic schema of string types.

(It's possible that arrow has some way of building a generic schema of all Strings, but I'd expect that it would still have to read through the entire JSON file.)

The definition of arrow::json::reader::infer_json_schema calls arrow::json::reader::infer_json_schema_from_iterator, which might be where columns are coerced into Strings if they don't match certain types?

I've tested this with the following JSON file:

{"key1":"value1"}
{"key2":"value2"}
{"key3":"value3"}
{"key4":"value4"}
{"key5":"value5"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}

and with --max-read-records 1, the resulting Parquet is:

      key1
0   value1
1     None
2     None
3     None
4     None
5   value1
6   value1
7   value1
8   value1
9   value1
10  value1
11  value1
12  value1

For context, my use case for json2parquet is part of a workflow to convert network packet capture (.pcap) files --> JSON --> Parquet.

The converted JSON doesn't seem to have a consistent schema that can be inferred, e.g.,

Error: General("Error inferring schema: Json error: Expected scalar or scalar array JSON type, found: Object({ [...]

(It's very possible that I'm asking json2parquet to do something outside of a reasonable scope—I might have better success with selecting a few fields to extract from pcaps --> csv --> Parquet, which I'll try next.)

@cardi (Author) commented Oct 12, 2022

Oops—I might not have answered your earlier question.

My read of the code is that the following snippet stores the inferred schema in schema:

json2parquet/src/main.rs, lines 141 to 150 at commit 27dfa6a:

match arrow::json::reader::infer_json_schema(&mut buf_reader, opts.max_read_records) {
    Ok(schema) => {
        input.seek(SeekFrom::Start(0))?;
        Ok(schema)
    }
    Err(error) => Err(ParquetError::General(format!(
        "Error inferring schema: {}",
        error
    ))),
}

schema is then used later to build the JSON reader:

json2parquet/src/main.rs, lines 165 to 167 at commit 27dfa6a:

let schema_ref = Arc::new(schema);
let builder = ReaderBuilder::new().with_schema(schema_ref);
let reader = builder.build(input)?;

I think the issue is that arrow::json::reader::infer_json_schema does not return a "generic" schema of String types when max_read_records is set to 0; instead it returns an empty schema (and the resulting Parquet file is empty):

{
  "fields": []
}

This seems like the expected (if not documented) behavior for arrow::json::reader::infer_json_schema, so perhaps it's just the usage statement that needs to be updated. (But it would be great to be able to generate a "generic" schema of sorts!)
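
As a possible stopgap, here is a minimal sketch (not from json2parquet; the build_all_string_schema helper, the use of serde_json, and the assumption of flat newline-delimited JSON objects are all my own) of how such a "generic" all-string schema could be built by scanning the input once and then handed to ReaderBuilder::new().with_schema(...) in place of the inferred one:

use std::fs::File;
use std::io::{BufRead, BufReader};

use arrow::datatypes::{DataType, Field, Schema};

// Sketch: scan newline-delimited JSON once, collect every top-level key,
// and build a schema in which each column is a nullable Utf8 (string) field.
// Assumes flat objects and serde_json as a dependency; build_all_string_schema
// is a hypothetical helper, not part of json2parquet.
fn build_all_string_schema(path: &str) -> Result<Schema, Box<dyn std::error::Error>> {
    let reader = BufReader::new(File::open(path)?);
    let mut keys: Vec<String> = Vec::new();

    for line in reader.lines() {
        let line = line?;
        if line.trim().is_empty() {
            continue;
        }
        let value: serde_json::Value = serde_json::from_str(&line)?;
        if let Some(object) = value.as_object() {
            for key in object.keys() {
                if !keys.contains(key) {
                    keys.push(key.clone());
                }
            }
        }
    }

    let fields = keys
        .into_iter()
        .map(|name| Field::new(name.as_str(), DataType::Utf8, true))
        .collect::<Vec<_>>();
    Ok(Schema::new(fields))
}

This still requires a full pass over the file, as noted above, but it would avoid the empty schema produced when max_read_records is 0.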
