Make AvroArrowArrayReader possible to scan Avro backed table which contains nested records #7525

sarutak · 2023-09-11T19:27:57Z

Which issue does this PR close?

Closes #7524

Rationale for this change

This PR fixes the issue described in #7524.

What changes are included in this PR?

There are two causes:

schema_lookup considers the lookup table only for the root record. Child records have their own lookup tables, so those should be considered too.
The logic for reading arrays of records is wrong.

This change fixes both.
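To illustrate the first cause, here is a minimal sketch of a lookup table that also covers child records. It is not the code in AvroArrowArrayReader: the `Schema` enum, `record`, `collect`, and `example_lookup` are hypothetical names, and the key format (bare names for root fields, namespace-qualified names for child fields) is an assumption modeled on the column names in the query output shown in this PR.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for an Avro schema (not the apache_avro type).
enum Schema {
    Primitive,
    Record { fullname: String, fields: Vec<(String, Schema)> },
    Array(Box<Schema>),
}

/// Convenience constructor for a record schema.
fn record(fullname: &str, fields: Vec<(&str, Schema)>) -> Schema {
    Schema::Record {
        fullname: fullname.to_string(),
        fields: fields.into_iter().map(|(n, s)| (n.to_string(), s)).collect(),
    }
}

/// Map every field to its position within its own record, descending
/// into nested records and into the item type of arrays.
fn collect(schema: &Schema, is_root: bool, lookup: &mut BTreeMap<String, usize>) {
    match schema {
        Schema::Record { fullname, fields } => {
            for (position, (name, field_schema)) in fields.iter().enumerate() {
                // Assumption: root fields use bare names, child fields qualified names.
                let key = if is_root {
                    name.clone()
                } else {
                    format!("{}.{}", fullname, name)
                };
                lookup.insert(key, position);
                collect(field_schema, false, lookup);
            }
        }
        Schema::Array(items) => collect(items, false, lookup),
        Schema::Primitive => {}
    }
}

/// Build the lookup for a schema shaped like the test schema in this PR.
fn example_lookup() -> BTreeMap<String, usize> {
    let schema = record("ns1.record1", vec![
        ("f1", record("ns2.record2", vec![
            ("f1_1", Schema::Primitive),
            ("f1_2", Schema::Primitive),
            ("f1_3", record("ns3.record3", vec![("f1_3_1", Schema::Primitive)])),
        ])),
        ("f2", Schema::Array(Box::new(record("ns4.record4", vec![
            ("f2_1", Schema::Primitive),
            ("f2_2", Schema::Primitive),
        ])))),
    ]);
    let mut lookup = BTreeMap::new();
    collect(&schema, true, &mut lookup);
    lookup
}
```

A lookup built only from the root record would contain just `f1` and `f2`; with the recursion, fields such as `ns2.record2.f1_3` and `ns4.record4.f2_2` can be resolved as well.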

Are these changes tested?

I prepared an Avro file for testing.
Its schema is as follows.

{
    "name": "record1",
    "namespace": "ns1",
    "type": "record",
    "fields": [
        {
            "name": "f1",
            "type": {
                "name": "record2",
                "namespace": "ns2",
                "type": "record",
                "fields": [
                    {
                        "name": "f1_1",
                        "type": "string"
                    },  {
                        "name": "f1_2",
                        "type": "int"
                    },  {
                        "name": "f1_3",
                        "type": {
                            "name": "record3",
                            "namespace": "ns3",
                            "type": "record",
                            "fields": [
                                {
                                    "name": "f1_3_1",
                                    "type": "double"
                                }
                            ]
                        }
                    }
                ]
            }
        },  {
            "name": "f2",
            "type": "array",
            "items": {
                "name": "record4",
                "namespace": "ns4",
                "type": "record",
                "fields": [
                    {
                        "name": "f2_1",
                        "type": "boolean"
                    },  {
                        "name": "f2_2",
                        "type": "float"
                    }
                ]
            }
        }
    ]
}

And the JSON representation of the Avro format file is as follows.

{"f1":{"f1_1":"aaa","f1_2":10,"f1_3":{"f1_3_1":3.14}},"f2":[{"f2_1":true,"f2_2":1.2},{"f2_1":true,"f2_2":2.2}]}
{"f1":{"f1_1":"bbb","f1_2":20,"f1_3":{"f1_3_1":3.14}},"f2":[{"f2_1":false,"f2_2":10.2}]}

Using this data, I created a table and scanned it.

CREATE EXTERNAL TABLE mytbl STORED AS AVRO LOCATION '/path/to/nested_records.avro';
SELECT * FROM mytbl;

+---------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
| f1                                                                                          | f2                                                                                                 |
+---------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
| {ns2.record2.f1_1: aaa, ns2.record2.f1_2: 10, ns2.record2.f1_3: {ns3.record3.f1_3_1: 3.14}} | [{ns4.record4.f2_1: true, ns4.record4.f2_2: 1.2}, {ns4.record4.f2_1: true, ns4.record4.f2_2: 2.2}] |
| {ns2.record2.f1_1: bbb, ns2.record2.f1_2: 20, ns2.record2.f1_3: {ns3.record3.f1_3_1: 3.14}} | [{ns4.record4.f2_1: false, ns4.record4.f2_2: 10.2}]                                                |
+---------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.006 seconds.

The result seems as expected.
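For context on the array case: Arrow stores a list column as an offsets buffer plus one flattened child array, so the `f2` values above become three child records with offsets marking where each row's list starts and ends. The following is a minimal sketch of that layout only; `Record4`, `flatten_lists`, and `example_f2` are hypothetical names, and this is not the reader's actual implementation.

```rust
/// Hypothetical row type standing in for the `ns4.record4` records in `f2`.
#[derive(Debug, Clone, PartialEq)]
struct Record4 {
    f2_1: bool,
    f2_2: f32,
}

/// Flatten per-row lists into (offsets, child values).
/// Row i owns the child values in the range offsets[i]..offsets[i + 1].
fn flatten_lists(rows: &[Vec<Record4>]) -> (Vec<i32>, Vec<Record4>) {
    let mut offsets = vec![0i32];
    let mut values = Vec::new();
    for row in rows {
        values.extend(row.iter().cloned());
        offsets.push(values.len() as i32);
    }
    (offsets, values)
}

/// The two `f2` values from the sample data in this PR.
fn example_f2() -> Vec<Vec<Record4>> {
    vec![
        vec![
            Record4 { f2_1: true, f2_2: 1.2 },
            Record4 { f2_1: true, f2_2: 2.2 },
        ],
        vec![Record4 { f2_1: false, f2_2: 10.2 }],
    ]
}
```

Calling `flatten_lists(&example_f2())` yields offsets `[0, 2, 3]` and three child records: the first row owns the first two, the second row owns the last one.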

After this change is merged, I'll open a PR to add the test data to arrow-testing. Then, I'll open a followup PR to add tests to avro.slt.

Are there any user-facing changes?

No.

alamb

The code looks good to me, though I am not an expert. I think this PR needs a test so that we don't accidentally break the feature during a future refactor.

alamb · 2023-09-11T20:13:09Z

Thank you for the contribution @sarutak

sarutak · 2023-09-11T20:23:11Z

@alamb
Test data suitable for this change is not present in testing, so I'm planning to add test data to arrow-testing if this change looks good and is merged. I'll then open a followup PR to add a test to avro.slt.

Or, is it better to add the test data to arrow-testing first?

alamb · 2023-09-12T13:56:19Z

Thanks @sarutak -- that makes sense.

> Or, is it better to add the test data to arrow-testing first?

I suggest we add the data to arrow-testing first. I would feel much more comfortable merging code into datafusion that is tested, not only to prevent regressions but also because, when reviewing unfamiliar code, having a test demonstrating that it works is a major part of evaluating its suitability.

sarutak · 2023-09-13T05:52:24Z

@alamb All right. I've opened a PR in arrow-testing.
apache/arrow-testing#91

After the test data is added, I'll modify this PR to add tests.

This PR proposes adding Avro test data that contains nested records. This data is necessary for testing the change proposed in [this PR](apache/datafusion#7525).

The schema of this test data is as follows.

```
{
    "name": "record1",
    "namespace": "ns1",
    "type": "record",
    "fields": [
        {
            "name": "f1",
            "type": {
                "name": "record2",
                "namespace": "ns2",
                "type": "record",
                "fields": [
                    {
                        "name": "f1_1",
                        "type": "string"
                    },  {
                        "name": "f1_2",
                        "type": "int"
                    },  {
                        "name": "f1_3",
                        "type": {
                            "name": "record3",
                            "namespace": "ns3",
                            "type": "record",
                            "fields": [
                                {
                                    "name": "f1_3_1",
                                    "type": "double"
                                }
                            ]
                        }
                    }
                ]
            }
        },  {
            "name": "f2",
            "type": "array",
            "items": {
                "name": "record4",
                "namespace": "ns4",
                "type": "record",
                "fields": [
                    {
                        "name": "f2_1",
                        "type": "boolean"
                    },  {
                        "name": "f2_2",
                        "type": "float"
                    }
                ]
            }
        }
    ]
}
```

And the JSON representation of the Avro format file is as follows.

```
{"f1":{"f1_1":"aaa","f1_2":10,"f1_3":{"f1_3_1":3.14}},"f2":[{"f2_1":true,"f2_2":1.2},{"f2_1":true,"f2_2":2.2}]}
{"f1":{"f1_1":"bbb","f1_2":20,"f1_3":{"f1_3_1":3.14}},"f2":[{"f2_1":false,"f2_2":10.2}]}
```

alamb · 2023-09-13T19:18:40Z

apache/arrow-testing#91

Has been merged

sarutak · 2023-09-13T21:35:54Z

@alamb Thank you!
I've added a test for this change.

alamb

Looks good to me -- thank you @sarutak

Fix for nested Avro records

f8c073c

github-actions bot added the core Core DataFusion crate label Sep 11, 2023

alamb reviewed Sep 11, 2023

sarutak mentioned this pull request Sep 13, 2023

Add an Avro test data containing nested records apache/arrow-testing#91

Merged

Add test for nested records

6180341

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Sep 13, 2023

alamb approved these changes Sep 14, 2023

alamb merged commit 58ddcee into apache:main Sep 14, 2023

sarutak mentioned this pull request Sep 29, 2023

fix: avro_to_arrow: Handle avro nested nullable struct (union) #7663

Merged

andygrove added the enhancement New feature or request label Oct 7, 2023
