Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Schema error creating logical/physical plan for TPC-DS q10, q35 #4794

Closed
andygrove opened this issue Jan 3, 2023 · 7 comments · Fixed by #9829
Closed

Schema error creating logical/physical plan for TPC-DS q10, q35 #4794

andygrove opened this issue Jan 3, 2023 · 7 comments · Fixed by #9829
Labels
bug Something isn't working

Comments

@andygrove
Copy link
Member

andygrove commented Jan 3, 2023

Describe the bug

Creating a physical plan for q10 fails with:

Error: SchemaError(FieldNotFound { field: Column { relation: Some("web_sales"), name: "ws_bill_customer_sk" }, valid_fields: Some([Column { relation: Some("customer"), name: "c_customer_sk" }, Column { relation: Some("customer"), name: "c_current_cdemo_sk" }, Column { relation: Some("customer"), name: "c_current_addr_sk" }]) })

Here is the part of the logical plan that failed in create_intial_plan:

FAILED ON SubqueryAlias: c [c_customer_sk:Int32, c_current_cdemo_sk:Int32, c_current_addr_sk:Int32]
  LeftSemi Join: web_sales.ws_bill_customer_sk = c.c_customer_sk [c_customer_sk:Int32, c_current_cdemo_sk:Int32, c_current_addr_sk:Int32]
    TableScan: customer projection=[c_customer_sk, c_current_cdemo_sk, c_current_addr_sk] [c_customer_sk:Int32, c_current_cdemo_sk:Int32, c_current_addr_sk:Int32]
    Projection: web_sales.ws_bill_customer_sk [ws_bill_customer_sk:Int32]
      Inner Join: web_sales.ws_sold_date_sk = date_dim.d_date_sk [ws_sold_date_sk:Int32, ws_bill_customer_sk:Int32, d_date_sk:Int32]
        TableScan: web_sales projection=[ws_sold_date_sk, ws_bill_customer_sk] [ws_sold_date_sk:Int32, ws_bill_customer_sk:Int32]
        Projection: date_dim.d_date_sk [d_date_sk:Int32]
          Filter: CAST(date_dim.d_year AS Int64) = Int64(2002) AND CAST(date_dim.d_moy AS Int64)date_dim.d_moy >= Int64(4) AND CAST(date_dim.d_moy AS Int64)date_dim.d_moy <= Int64(7) [CAST(date_dim.d_moy AS Int64)date_dim.d_moy:Int64, d_date_sk:Int32, d_year:Int32]
            Projection: CAST(date_dim.d_moy AS Int64) AS CAST(date_dim.d_moy AS Int64)date_dim.d_moy, date_dim.d_date_sk, date_dim.d_year [CAST(date_dim.d_moy AS Int64)date_dim.d_moy:Int64, d_date_sk:Int32, d_year:Int32]
              TableScan: date_dim projection=[d_date_sk, d_year, d_moy] [d_date_sk:Int32, d_year:Int32, d_moy:Int32]

To Reproduce
Unigore test tpcds_physical_q10.

Expected behavior
A clear and concise description of what you expected to happen.

Additional context
Add any other context about the problem here.

@andygrove andygrove added the bug Something isn't working label Jan 3, 2023
@andygrove
Copy link
Member Author

andygrove commented Jan 3, 2023

There are multiple bugs, but the first is that the join keys are assumed to be in the same order as the left and right relations but in this case the left key is in the right schema.

left schema = ["customer.c_customer_sk", "customer.c_current_cdemo_sk", "customer.c_current_addr_sk"]
right schema = ["web_sales.ws_bill_customer_sk"]
l = web_sales.ws_bill_customer_sk, r = c.c_customer_sk

Filed as #4795

@andygrove andygrove self-assigned this Jan 3, 2023
@andygrove
Copy link
Member Author

The second bug is that the schema has customer.c_customer_sk but the join condition has c.c_customer_sk

@andygrove andygrove removed their assignment Jan 3, 2023
@askoa
Copy link
Contributor

askoa commented Jan 14, 2023

The second bug is that the schema has customer.c_customer_sk but the join condition has c.c_customer_sk

The expression in question was originally part of EXISTS(<subquery>). Few iterations before the failure, a push_down_filter pushes the EXISTS(<subquery>) down. The push_down_filter uses ExprRewriter to rewrite the expressions. The ExprRewriter does not rewrite subquery in any expression that has a subquery. In this case, the expression with subquery is EXISTS. Thus the column c.c_customer_sk does not get modified to customer.c_customer_sk.

push_down_filter rewriting expressions:
https://github.com/apache/arrow-datafusion/blob/a9ddcd3a7558437361835120659b946b903468e1/datafusion/optimizer/src/push_down_filter.rs#L785

ExprRewriter does not rewrite subqueries:
https://github.com/apache/arrow-datafusion/blob/a9ddcd3a7558437361835120659b946b903468e1/datafusion/expr/src/expr_rewriter.rs#L116

Relevant snippet of logical plan before `push_down_filter`
          Filter: c.c_current_addr_sk = ca.ca_address_sk AND ca.ca_county IN ([Utf8("Walker County"), Utf8("Richland County"), Utf8("Gaines County"), Utf8("Douglas County"), Utf8("Dona Ana County")]) AND customer_demographics.cd_demo_sk = c.c_current_cdemo_sk AND (EXISTS (<subquery>) AS EXISTS OR EXISTS (<subquery>) AS EXISTS)
            Projection: EXISTS (<subquery>) AS EXISTS (<subquery>), c.c_customer_sk, c.c_customer_id, c.c_current_cdemo_sk, c.c_current_hdemo_sk, c.c_current_addr_sk, c.c_first_shipto_date_sk, c.c_first_sales_date_sk, c.c_salutation, c.c_first_name, c.c_last_name, c.c_preferred_cust_flag, c.c_birth_day, c.c_birth_month, c.c_birth_year, c.c_birth_country, c.c_login, c.c_email_address, c.c_last_review_date_sk, ca.ca_address_sk, ca.ca_address_id, ca.ca_street_number, ca.ca_street_name, ca.ca_street_type, ca.ca_suite_number, ca.ca_city, ca.ca_county, ca.ca_state, ca.ca_zip, ca.ca_country, ca.ca_gmt_offset, ca.ca_location_type, customer_demographics.cd_demo_sk, customer_demographics.cd_gender, customer_demographics.cd_marital_status, customer_demographics.cd_education_status, customer_demographics.cd_purchase_estimate, customer_demographics.cd_credit_rating, customer_demographics.cd_dep_count, customer_demographics.cd_dep_employed_count, customer_demographics.cd_dep_college_count
              Subquery:
                Projection: web_sales.ws_sold_date_sk, web_sales.ws_sold_time_sk, web_sales.ws_ship_date_sk, web_sales.ws_item_sk, web_sales.ws_bill_customer_sk, web_sales.ws_bill_cdemo_sk, web_sales.ws_bill_hdemo_sk, web_sales.ws_bill_addr_sk, web_sales.ws_ship_customer_sk, web_sales.ws_ship_cdemo_sk, web_sales.ws_ship_hdemo_sk, web_sales.ws_ship_addr_sk, web_sales.ws_web_page_sk, web_sales.ws_web_site_sk, web_sales.ws_ship_mode_sk, web_sales.ws_warehouse_sk, web_sales.ws_promo_sk, web_sales.ws_order_number, web_sales.ws_quantity, web_sales.ws_wholesale_cost, web_sales.ws_list_price, web_sales.ws_sales_price, web_sales.ws_ext_discount_amt, web_sales.ws_ext_sales_price, web_sales.ws_ext_wholesale_cost, web_sales.ws_ext_list_price, web_sales.ws_ext_tax, web_sales.ws_coupon_amt, web_sales.ws_ext_ship_cost, web_sales.ws_net_paid, web_sales.ws_net_paid_inc_tax, web_sales.ws_net_paid_inc_ship, web_sales.ws_net_paid_inc_ship_tax, web_sales.ws_net_profit, date_dim.d_date_sk, date_dim.d_date_id, date_dim.d_date, date_dim.d_month_seq, date_dim.d_week_seq, date_dim.d_quarter_seq, date_dim.d_year, date_dim.d_dow, date_dim.d_moy, date_dim.d_dom, date_dim.d_qoy, date_dim.d_fy_year, date_dim.d_fy_quarter_seq, date_dim.d_fy_week_seq, date_dim.d_day_name, date_dim.d_quarter_name, date_dim.d_holiday, date_dim.d_weekend, date_dim.d_following_holiday, date_dim.d_first_dom, date_dim.d_last_dom, date_dim.d_same_day_ly, date_dim.d_same_day_lq, date_dim.d_current_day, date_dim.d_current_week, date_dim.d_current_month, date_dim.d_current_quarter, date_dim.d_current_year
                  Filter: c.c_customer_sk = web_sales.ws_bill_customer_sk AND web_sales.ws_sold_date_sk = date_dim.d_date_sk AND CAST(date_dim.d_year AS Int64) = Int64(2002) AND CAST(date_dim.d_moy AS Int64) BETWEEN Int64(4) AND Int64(4) + Int64(3)
                    CrossJoin:
                      TableScan: web_sales
                      TableScan: date_dim
              LeftSemi Join: c.c_customer_sk = store_sales.ss_customer_sk
                CrossJoin:
                  CrossJoin:
                    SubqueryAlias: c
                      TableScan: customer
Relevant snippet of logical plan after `push_down_filter`
        Projection: c.c_customer_sk, c.c_customer_id, c.c_current_cdemo_sk, c.c_current_hdemo_sk, c.c_current_addr_sk, c.c_first_shipto_date_sk, c.c_first_sales_date_sk, c.c_salutation, c.c_first_name, c.c_last_name, c.c_preferred_cust_flag, c.c_birth_day, c.c_birth_month, c.c_birth_year, c.c_birth_country, c.c_login, c.c_email_address, c.c_last_review_date_sk, ca.ca_address_sk, ca.ca_address_id, ca.ca_street_number, ca.ca_street_name, ca.ca_street_type, ca.ca_suite_number, ca.ca_city, ca.ca_county, ca.ca_state, ca.ca_zip, ca.ca_country, ca.ca_gmt_offset, ca.ca_location_type, customer_demographics.cd_demo_sk, customer_demographics.cd_gender, customer_demographics.cd_marital_status, customer_demographics.cd_education_status, customer_demographics.cd_purchase_estimate, customer_demographics.cd_credit_rating, customer_demographics.cd_dep_count, customer_demographics.cd_dep_employed_count, customer_demographics.cd_dep_college_count
          Projection: EXISTS (<subquery>) AS EXISTS (<subquery>), c.c_customer_sk, c.c_customer_id, c.c_current_cdemo_sk, c.c_current_hdemo_sk, c.c_current_addr_sk, c.c_first_shipto_date_sk, c.c_first_sales_date_sk, c.c_salutation, c.c_first_name, c.c_last_name, c.c_preferred_cust_flag, c.c_birth_day, c.c_birth_month, c.c_birth_year, c.c_birth_country, c.c_login, c.c_email_address, c.c_last_review_date_sk, ca.ca_address_sk, ca.ca_address_id, ca.ca_street_number, ca.ca_street_name, ca.ca_street_type, ca.ca_suite_number, ca.ca_city, ca.ca_county, ca.ca_state, ca.ca_zip, ca.ca_country, ca.ca_gmt_offset, ca.ca_location_type, customer_demographics.cd_demo_sk, customer_demographics.cd_gender, customer_demographics.cd_marital_status, customer_demographics.cd_education_status, customer_demographics.cd_purchase_estimate, customer_demographics.cd_credit_rating, customer_demographics.cd_dep_count, customer_demographics.cd_dep_employed_count, customer_demographics.cd_dep_college_count
            Subquery:
              Projection: web_sales.ws_sold_date_sk, web_sales.ws_sold_time_sk, web_sales.ws_ship_date_sk, web_sales.ws_item_sk, web_sales.ws_bill_customer_sk, web_sales.ws_bill_cdemo_sk, web_sales.ws_bill_hdemo_sk, web_sales.ws_bill_addr_sk, web_sales.ws_ship_customer_sk, web_sales.ws_ship_cdemo_sk, web_sales.ws_ship_hdemo_sk, web_sales.ws_ship_addr_sk, web_sales.ws_web_page_sk, web_sales.ws_web_site_sk, web_sales.ws_ship_mode_sk, web_sales.ws_warehouse_sk, web_sales.ws_promo_sk, web_sales.ws_order_number, web_sales.ws_quantity, web_sales.ws_wholesale_cost, web_sales.ws_list_price, web_sales.ws_sales_price, web_sales.ws_ext_discount_amt, web_sales.ws_ext_sales_price, web_sales.ws_ext_wholesale_cost, web_sales.ws_ext_list_price, web_sales.ws_ext_tax, web_sales.ws_coupon_amt, web_sales.ws_ext_ship_cost, web_sales.ws_net_paid, web_sales.ws_net_paid_inc_tax, web_sales.ws_net_paid_inc_ship, web_sales.ws_net_paid_inc_ship_tax, web_sales.ws_net_profit, date_dim.d_date_sk, date_dim.d_date_id, date_dim.d_date, date_dim.d_month_seq, date_dim.d_week_seq, date_dim.d_quarter_seq, date_dim.d_year, date_dim.d_dow, date_dim.d_moy, date_dim.d_dom, date_dim.d_qoy, date_dim.d_fy_year, date_dim.d_fy_quarter_seq, date_dim.d_fy_week_seq, date_dim.d_day_name, date_dim.d_quarter_name, date_dim.d_holiday, date_dim.d_weekend, date_dim.d_following_holiday, date_dim.d_first_dom, date_dim.d_last_dom, date_dim.d_same_day_ly, date_dim.d_same_day_lq, date_dim.d_current_day, date_dim.d_current_week, date_dim.d_current_month, date_dim.d_current_quarter, date_dim.d_current_year
                Filter: c.c_customer_sk = web_sales.ws_bill_customer_sk AND web_sales.ws_sold_date_sk = date_dim.d_date_sk AND CAST(date_dim.d_year AS Int64) = Int64(2002) AND CAST(date_dim.d_moy AS Int64) BETWEEN Int64(4) AND Int64(4) + Int64(3)
                  CrossJoin:
                    TableScan: web_sales
                    TableScan: date_dim
            LeftSemi Join: c.c_customer_sk = store_sales.ss_customer_sk
              Filter: customer_demographics.cd_demo_sk = c.c_current_cdemo_sk
                CrossJoin:
                  Filter: c.c_current_addr_sk = ca.ca_address_sk
                    CrossJoin:
                      SubqueryAlias: c
                        Filter: EXISTS (<subquery>) OR EXISTS (<subquery>)
                          Subquery:
                            Projection: web_sales.ws_sold_date_sk, web_sales.ws_sold_time_sk, web_sales.ws_ship_date_sk, web_sales.ws_item_sk, web_sales.ws_bill_customer_sk, web_sales.ws_bill_cdemo_sk, web_sales.ws_bill_hdemo_sk, web_sales.ws_bill_addr_sk, web_sales.ws_ship_customer_sk, web_sales.ws_ship_cdemo_sk, web_sales.ws_ship_hdemo_sk, web_sales.ws_ship_addr_sk, web_sales.ws_web_page_sk, web_sales.ws_web_site_sk, web_sales.ws_ship_mode_sk, web_sales.ws_warehouse_sk, web_sales.ws_promo_sk, web_sales.ws_order_number, web_sales.ws_quantity, web_sales.ws_wholesale_cost, web_sales.ws_list_price, web_sales.ws_sales_price, web_sales.ws_ext_discount_amt, web_sales.ws_ext_sales_price, web_sales.ws_ext_wholesale_cost, web_sales.ws_ext_list_price, web_sales.ws_ext_tax, web_sales.ws_coupon_amt, web_sales.ws_ext_ship_cost, web_sales.ws_net_paid, web_sales.ws_net_paid_inc_tax, web_sales.ws_net_paid_inc_ship, web_sales.ws_net_paid_inc_ship_tax, web_sales.ws_net_profit, date_dim.d_date_sk, date_dim.d_date_id, date_dim.d_date, date_dim.d_month_seq, date_dim.d_week_seq, date_dim.d_quarter_seq, date_dim.d_year, date_dim.d_dow, date_dim.d_moy, date_dim.d_dom, date_dim.d_qoy, date_dim.d_fy_year, date_dim.d_fy_quarter_seq, date_dim.d_fy_week_seq, date_dim.d_day_name, date_dim.d_quarter_name, date_dim.d_holiday, date_dim.d_weekend, date_dim.d_following_holiday, date_dim.d_first_dom, date_dim.d_last_dom, date_dim.d_same_day_ly, date_dim.d_same_day_lq, date_dim.d_current_day, date_dim.d_current_week, date_dim.d_current_month, date_dim.d_current_quarter, date_dim.d_current_year
                              Filter: c.c_customer_sk = web_sales.ws_bill_customer_sk AND web_sales.ws_sold_date_sk = date_dim.d_date_sk AND CAST(date_dim.d_year AS Int64) = Int64(2002) AND CAST(date_dim.d_moy AS Int64) BETWEEN Int64(4) AND Int64(4) + Int64(3)
                                CrossJoin:
                                  TableScan: web_sales
                                  TableScan: date_dim
                          Subquery:
                            Projection: web_sales.ws_sold_date_sk, web_sales.ws_sold_time_sk, web_sales.ws_ship_date_sk, web_sales.ws_item_sk, web_sales.ws_bill_customer_sk, web_sales.ws_bill_cdemo_sk, web_sales.ws_bill_hdemo_sk, web_sales.ws_bill_addr_sk, web_sales.ws_ship_customer_sk, web_sales.ws_ship_cdemo_sk, web_sales.ws_ship_hdemo_sk, web_sales.ws_ship_addr_sk, web_sales.ws_web_page_sk, web_sales.ws_web_site_sk, web_sales.ws_ship_mode_sk, web_sales.ws_warehouse_sk, web_sales.ws_promo_sk, web_sales.ws_order_number, web_sales.ws_quantity, web_sales.ws_wholesale_cost, web_sales.ws_list_price, web_sales.ws_sales_price, web_sales.ws_ext_discount_amt, web_sales.ws_ext_sales_price, web_sales.ws_ext_wholesale_cost, web_sales.ws_ext_list_price, web_sales.ws_ext_tax, web_sales.ws_coupon_amt, web_sales.ws_ext_ship_cost, web_sales.ws_net_paid, web_sales.ws_net_paid_inc_tax, web_sales.ws_net_paid_inc_ship, web_sales.ws_net_paid_inc_ship_tax, web_sales.ws_net_profit, date_dim.d_date_sk, date_dim.d_date_id, date_dim.d_date, date_dim.d_month_seq, date_dim.d_week_seq, date_dim.d_quarter_seq, date_dim.d_year, date_dim.d_dow, date_dim.d_moy, date_dim.d_dom, date_dim.d_qoy, date_dim.d_fy_year, date_dim.d_fy_quarter_seq, date_dim.d_fy_week_seq, date_dim.d_day_name, date_dim.d_quarter_name, date_dim.d_holiday, date_dim.d_weekend, date_dim.d_following_holiday, date_dim.d_first_dom, date_dim.d_last_dom, date_dim.d_same_day_ly, date_dim.d_same_day_lq, date_dim.d_current_day, date_dim.d_current_week, date_dim.d_current_month, date_dim.d_current_quarter, date_dim.d_current_year
                              Filter: c.c_customer_sk = web_sales.ws_bill_customer_sk AND web_sales.ws_sold_date_sk = date_dim.d_date_sk AND CAST(date_dim.d_year AS Int64) = Int64(2002) AND CAST(date_dim.d_moy AS Int64) BETWEEN Int64(4) AND Int64(4) + Int64(3)
                                CrossJoin:
                                  TableScan: web_sales
                                  TableScan: date_dim
                          TableScan: customer

We could see from the second snippet that the filter EXISTS (<subquery>) OR EXISTS (<subquery>) pushed below Subqueryalias: c. The column name in the subquery of EXISTS is not changed from c to customer.

The fix?

I would like to discuss the fix here. I don't think we can change impl ExprRewriter for Expr as it's used in many places. May be write an exclusive rewriter for push_down_filter which rewrites subqueries?

@jackwener
Copy link
Member

jackwener commented Feb 19, 2023

tpcds-35 also have this problem.

After set pub skip_failed_rules: bool, default = false, we can find it.

@jackwener
Copy link
Member

@askoa , I also investigated this problem.
Your comments are very useful to solve this problem, thank you.

@askoa
Copy link
Contributor

askoa commented Feb 19, 2023

Your comments are very useful to solve this problem, thank you.

No problem. I am glad it was helpful. BTW, q10 has one more issue #4887

@Dandandan
Copy link
Contributor

Seems the schema error when creating the logical / physical plan has been resolved in the meantime.

@Dandandan Dandandan changed the title Schema error creating physical plan for TPC-DS q10 Schema error creating physical plan for TPC-DS q10, q35 Mar 27, 2024
@Dandandan Dandandan changed the title Schema error creating physical plan for TPC-DS q10, q35 Schema error creating logical/physical plan for TPC-DS q10, q35 Mar 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants