Parquet files generated by DataFusion cannot be read by Apache Spark #4782

Closed
andygrove opened this issue Dec 31, 2022 · 1 comment
Labels
bug Something isn't working

Comments

@andygrove
Member

Describe the bug
I generated TPC-H data and converted it to Parquet using DataFusion. Here is the nation table:

$ ls -l /tmp/tpch-parquet/nation.parquet/
total 4
drwxrwxr-x 2 andy andy 4096 Dec 31 09:25 part-0.parquet

I can read the schema fine with bdt (which uses DataFusion):

$ bdt schema /tmp/tpch-parquet/nation.parquet
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| n_nationkey | Int64     | NO          |
| n_name      | Utf8      | NO          |
| n_regionkey | Int64     | NO          |
| n_comment   | Utf8      | NO          |
+-------------+-----------+-------------+

Spark fails with:

scala> val df = spark.read.parquet("/tmp/tpch-parquet/nation.parquet")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.

However, if I ask Spark to read the one partition file directly rather than the directory, it works, which confuses me:

scala> val df = spark.read.parquet("/tmp/tpch-parquet/nation.parquet/part-0.parquet")
df: org.apache.spark.sql.DataFrame = [n_nationkey: bigint, n_name: string ... 2 more fields]

scala> df.schema
res0: org.apache.spark.sql.types.StructType = StructType(StructField(n_nationkey,LongType,true), StructField(n_name,StringType,true), StructField(n_regionkey,LongType,true), StructField(n_comment,StringType,true))

To Reproduce

Expected behavior

Additional context

@andygrove added the bug label on Dec 31, 2022
@andygrove
Member Author

Never mind, this was user error: there was a nested directory containing the partitions.
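
The nesting is already visible in the `ls -l` output in the description, where `part-0.parquet` is listed with a leading `d`, meaning it is a directory rather than a file. As a rough sketch of how one might check a dataset directory for this kind of layout before handing it to Spark, the two shell functions below list nested directories and the regular files beneath them. The function names `list_nested_dirs` and `list_data_files` are hypothetical and not part of DataFusion, bdt, or Spark; the sketch assumes a `find` that supports `-mindepth` (GNU and BSD do).

```shell
#!/bin/sh
# list_nested_dirs is a hypothetical helper name (an assumption for this sketch).
# It prints every directory found below the given dataset directory.
# An empty result means the data files sit directly in the dataset directory.
list_nested_dirs() {
  # -mindepth 1 skips the dataset directory itself
  find "$1" -mindepth 1 -type d
}

# list_data_files (also hypothetical) prints every regular file at any depth,
# which shows where the actual Parquet files live.
list_data_files() {
  find "$1" -type f
}

# Example invocation with the path from this issue:
#   list_nested_dirs /tmp/tpch-parquet/nation.parquet
#   list_data_files /tmp/tpch-parquet/nation.parquet
```

If `list_nested_dirs` prints anything, the directory passed to the reader is not the one that directly contains the data files, which matches the situation described in this issue.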
