setting --max-read-records 0 reads zero records #99
Huh, strange. max_read_records should only be used for schema inference. See Line 141 in 27dfa6a.
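To make that distinction concrete, here is a stdlib-only Python sketch of what record-limited schema inference does (this is an illustration, not the Rust code; `infer_fields` is a hypothetical name): it unions the keys seen in at most `max_read_records` records, so a limit of 0 yields an empty schema, while reading the rows themselves should be unaffected.

```python
import json
from itertools import islice

def infer_fields(ndjson_lines, max_read_records=None):
    # Union the keys seen in at most `max_read_records` records
    # (None means "read everything"), preserving first-seen order.
    fields = []
    for line in islice(ndjson_lines, max_read_records):
        for key in json.loads(line):
            if key not in fields:
                fields.append(key)
    return fields

lines = ['{"key1":"value1"}', '{"key2":"value2"}', '{"key3":"value3"}']
print(infer_fields(lines, 0))  # []
print(infer_fields(lines, 2))  # ['key1', 'key2']
print(infer_fields(lines))     # ['key1', 'key2', 'key3']
```

Under this reading, `--max-read-records 0` should affect only which columns get inferred, not how many rows end up in the parquet file.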
Sure, let me take a look. I don't have any experience with Rust, so it will take a while as I get familiar.
$ cat generate_parquets.sh

```sh
#!/usr/bin/env bash
./json2parquet --max-read-records 0 foo.json foo-0.parquet
./json2parquet --max-read-records 1 foo.json foo-1.parquet
./json2parquet --max-read-records 2 foo.json foo-2.parquet
./json2parquet --max-read-records 3 foo.json foo-3.parquet
./json2parquet --max-read-records 4 foo.json foo-4.parquet
./json2parquet foo.json foo-5.parquet
```
```python
#!/usr/bin/env python3
import pandas as pd

for i in range(6):
    fn = "foo-" + str(i) + ".parquet"
    df = pd.read_parquet(fn)
    print(fn)
    print(df)
    print()
```

output:

```
$ ./read_parquet.py
foo-0.parquet
Empty DataFrame
Columns: []
Index: []
foo-1.parquet
     key1
0  value1
1    None
2    None
3    None
4    None
foo-2.parquet
     key1    key2
0  value1    None
1    None  value2
2    None    None
3    None    None
4    None    None
foo-3.parquet
     key1    key2    key3
0  value1    None    None
1    None  value2    None
2    None    None  value3
3    None    None    None
4    None    None    None
foo-4.parquet
     key1    key2    key3    key4
0  value1    None    None    None
1    None  value2    None    None
2    None    None  value3    None
3    None    None    None  value4
4    None    None    None    None
foo-5.parquet
     key1    key2    key3    key4    key5
0  value1    None    None    None    None
1    None  value2    None    None    None
2    None    None  value3    None    None
3    None    None    None  value4    None
4    None    None    None    None  value5
```
I did some digging into (it's possible that the definition for arrow::json::reader::infer_json_schema, which calls arrow::json::reader::infer_json_schema_from_iterator, is where columns are coerced into Strings if they don't match certain types?). I've tested this with the following JSON file:

```json
{"key1":"value1"}
{"key2":"value2"}
{"key3":"value3"}
{"key4":"value4"}
{"key5":"value5"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
{"key1":"value1"}
```

and with

For context, my use case for

The converted JSON doesn't seem to have a consistent schema that can be inferred, e.g.,

(It's very possible that I'm asking
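For what it's worth, the None-heavy tables above are consistent with projecting each record onto a fixed inferred schema. A stdlib-only Python sketch (`project_rows` is a hypothetical name for illustration, not the arrow API): once the schema is fixed, keys missing from a record become None, and keys outside the schema are silently dropped.

```python
import json

def project_rows(ndjson_lines, fields):
    # Project each NDJSON record onto a fixed list of fields:
    # missing keys become None, unknown keys are dropped.
    return [{f: json.loads(line).get(f) for f in fields}
            for line in ndjson_lines]

lines = ['{"key1":"value1"}', '{"key2":"value2"}']
print(project_rows(lines, ["key1"]))
# [{'key1': 'value1'}, {'key1': None}]
```

This would explain why foo-2.parquet keeps all five rows but only fills in the cells whose keys appeared in the first two records.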
Oops, I might not have answered your earlier question. My read of the code is that the following snippet puts a schema into Lines 141 to 150 in 27dfa6a

Lines 165 to 167 in 27dfa6a

I think the issue is

```json
{
  "fields": []
}
```

This seems like the expected (if not documented) behavior for
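Following that reading, an empty `{"fields": []}` schema would project every record down to nothing, which would plausibly produce the empty foo-0.parquet above. A stdlib Python sketch of that consequence (an assumption about the behavior, not verified against the Rust code):

```python
# With an empty schema, every record projects to an empty row, so the
# written table ends up with no columns and nothing to show when read
# back -- matching the "Empty DataFrame" output for foo-0.parquet.
records = [{"key1": "value1"}, {"key2": "value2"}]
fields = []  # i.e. the {"fields": []} schema
projected = [{f: r.get(f) for f in fields} for r in records]
print(projected)  # [{}, {}]
```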
Thank you for this great tool!

The documentation notes that setting --max-read-records to 0 will stop schema inference and set the type for all columns to strings: json2parquet/Readme.md, Lines 32 to 33 in 27dfa6a

The behavior I'm seeing is that json2parquet will only read the number of records set by --max-read-records. Is there another way to stop schema inference?

Some examples demonstrating the behavior: