Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HashJoin order fixing #7155

Merged
merged 5 commits into from
Aug 2, 2023
Merged

HashJoin order fixing #7155

merged 5 commits into from
Aug 2, 2023

Conversation

metesynnada
Copy link
Contributor

Which issue does this PR close?

Closes #7113.

Rationale for this change

Fixing the HashJoin order.

What changes are included in this PR?

Reversing indices.

Are these changes tested?

Yes, with existing tests.

Are there any user-facing changes?

No

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Jul 31, 2023
Copy link
Contributor

@Dandandan Dandandan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -757,7 +757,7 @@ pub fn build_equal_condition_join_indices(
let mut build_indices = UInt64BufferBuilder::new(0);
let mut probe_indices = UInt32BufferBuilder::new(0);
// Visit all of the probe rows
for (row, hash_value) in hash_values.iter().enumerate() {
for (row, hash_value) in hash_values.iter().enumerate().rev() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this need to be reverted as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah because otherwise the order of is wrong - I get it. Could we add some comments for both changes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I am on it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for this issue #7113
I don't know why this issue is a bug?
Does it get the error result for the hash join?

@Dandandan @metesynnada

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there seems there was no test for checking the ordering of the hash join results (only for the matching rows to be correct) @metesynnada changed one to test this. This is important if the execution plan depends on the output ordering (and doesn't add extra sort).

The matching rows themselves are correct even before this PR.

Copy link
Contributor

@liukun4515 liukun4515 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't merge this PR, I have some confused issue about this fix.

@liukun4515 liukun4515 requested a review from Dandandan August 1, 2023 08:41
Copy link
Contributor

@Dandandan Dandandan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👌 excellent

@metesynnada
Copy link
Contributor Author

@liukun4515 Is there anything I can clarify for you about some points?

Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@metesynnada and I discussed this in detail and this PR LGTM as well. @liukun4515, do you still have any questions or worries?

@Dandandan Dandandan merged commit 1e9816b into apache:main Aug 2, 2023
@Dandandan
Copy link
Contributor

Thanks @metesynnada and @ozankabak for the extra review.

I felt certain that this is the correct change so went ahead with the merge. @liukun4515 let us know if you have any worries.

@liukun4515
Copy link
Contributor

Thanks @metesynnada and @ozankabak for the extra review.

I felt certain that this is the correct change so went ahead with the merge. @liukun4515 let us know if you have any worries.

In this issue #7113

@metesynnada describe the issue

In the current implementation of calculate_hash_join_output_order, we operate under the assumption that the order of the build side is also preserved lexicographically.

But I review the code of calculate_join_output_ordering, and find the build side are not preserved the order
Does we need to preserve the order for the build side?

From the implementation of calculate_join_output_ordering, I think we should just maintain the order of the probe side.
Does I miss something?
@Dandandan
@ozankabak

@metesynnada
Copy link
Contributor Author

metesynnada commented Aug 2, 2023

In the JoinType::Inner match arm, we also preserve the left ordering.

JoinType::Inner => {
    // We modify the indices of the right order columns because their
    // columns are appended to the right side of the left schema.
    let mut adjusted_right_order =
        adjust_right_order(right_order, left_len)?;
    if let Some(left_order) = maybe_left_order {
        adjusted_right_order.extend_from_slice(left_order);
    }
    Some(adjusted_right_order)
}

Since the function is changed into calculate_join_output_ordering into calculate_join_output_order in a previous PR, this function is not currently in main. However, the new function is also adding left order in case of the inner join.

(false, true) => {
      // Special case, we can prefix ordering of left side with the ordering of right side.
      if join_type == JoinType::Inner && probe_side == Some(JoinSide::Right) {
          merge_vectors(&right_ordering, left_ordering)
      } else {
          right_ordering
      }
  }

@liukun4515
Copy link
Contributor

In the JoinType::Inner match arm, we also preserve the left ordering.

JoinType::Inner => {
    // We modify the indices of the right order columns because their
    // columns are appended to the right side of the left schema.
    let mut adjusted_right_order =
        adjust_right_order(right_order, left_len)?;
    if let Some(left_order) = maybe_left_order {
        adjusted_right_order.extend_from_slice(left_order);
    }
    Some(adjusted_right_order)
}

Since the function is changed into calculate_join_output_ordering into calculate_join_output_order in a previous PR, this function is not currently in main. However, the new function is also adding left order in case of the inner join.

(false, true) => {
      // Special case, we can prefix ordering of left side with the ordering of right side.
      if join_type == JoinType::Inner && probe_side == Some(JoinSide::Right) {
          merge_vectors(&right_ordering, left_ordering)
      } else {
          right_ordering
      }
  }

I know this changes about calculate_join_output_ordering.

If we don't consider the special case

      if join_type == JoinType::Inner && probe_side == Some(JoinSide::Right) {
          merge_vectors(&right_ordering, left_ordering)

Does we can get the right result?
@metesynnada

@mustafasrepo
Copy link
Contributor

mustafasrepo commented Aug 2, 2023

If we don't consider the special case ...
Does we can get the right result?

Yes, the result will be correct but it would be sub-optimal. Consider the hash join where left table ordering is left ASC and right table ordering is right ASC. If the join type is right output ordering of the hash join would be right ASC. If the join type were, inner join output ordering would be right ASC, left ASC. (lexicographical). This special case handles this. If we do not use it, we would produce right ASC for inner join. This is correct also.

@liukun4515
Copy link
Contributor

If we don't consider the special case ...
Does we can get the right result?

Yes, the result will be correct but it would be sub-optimal. Consider the hash join where left table ordering is left ASC and right table ordering is right ASC. If the join type is right output ordering of the hash join would be right ASC. If the join type were, inner join output ordering would be right ASC, left ASC. (lexicographical). This special case handles this. If we do not use it, we would produce right ASC for inner join. This is correct also.

Thanks for your explanation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HashJoinExec output order is not correct
5 participants