Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix cached hash clash in Fixpoint #31140

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

ggevay
Copy link
Contributor

@ggevay ggevay commented Jan 22, 2025

Fixes https://github.com/MaterializeInc/database-issues/issues/8906

TransformCtx gives Transforms the hash of the plan that is the input to the transform. The hash that TransformCtx gives out is the one stored in last_hash (the "cached hash"), which has to be updated at various moments. We forgot about one such moment: Fixpoint sometimes jumps back to the beginning as part of its cycle detection mechanism (which, somewhat confusingly, also involves hashing, but is not currently connected to the last_hash mechanism). Therefore, in this case, Transform::transform's soft_assert_eq_no_log! detects the wrong hash and reports a "cached hash clash". (And if we were to go through this soft_assert_eq_no_log! in prod, then the wrong hash would lead to wrong values for the transform_hits metric. In the future, we'd like to use these hashes for more mission-critical things than just metrics, so we'd like them to be correct.)

The first commit just fixes the bug, by updating last_hash to the hash of the plan that was at the beginning of the fixpoint loop.

The second commit factors out the updating of last_hash into a function, because it now happens in 3 places.

The third commit downgrades a soft_panic_or_log! to a tracing::error! to stop https://github.com/MaterializeInc/database-issues/issues/8197 from flaking nightly, because we have accumulated enough examples to debug this when we get to this issue. The tracing:error! would still show up in Sentry if this happens in prod, which would affect the prioritization of the issue. (So far, it hasn't happened in production at all in the 7 months since I added this soft_panic_or_log.)

The fourth commit adds a regression test. (This would fail with the above soft_panic_or_log!, so this is one more reason I wanted to downgrade this now to an error log.)

cc @mgree

Motivation

Tips for reviewer

I suggest reviewing commit by commit.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

I'd like to stop MaterializeInc#8197 from flaking Nightly, because we have
accumulated enough examples to debug this when we get to this issue.
But at the same time, I would still be interested in if this happens
in production, because that would affect the prioritization of the
issue. (So far, it hasn't happened in production at all in the
7 months since I added this soft_panic_or_log.)
@ggevay ggevay requested a review from a team as a code owner January 22, 2025 13:53
@ggevay ggevay added A-optimization Area: query optimization and transformation A-CLUSTER Topics related to the CLUSTER layer labels Jan 22, 2025
@ggevay ggevay requested a review from mgree January 24, 2025 09:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-CLUSTER Topics related to the CLUSTER layer A-optimization Area: query optimization and transformation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant