Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix IAE on cross_fields query introduced in 7.0.1 #41938

Merged
merged 1 commit into from
May 9, 2019

Conversation

jimczi
Copy link
Contributor

@jimczi jimczi commented May 8, 2019

If the max doc in the index is greater than the minimum total term frequency
among the requested fields we need to adjust max doc to be equal to the min ttf.
This was removed by mistake when fixing #41125.

Closes #41934

If the max doc in the index is greater than the minimum total term frequency
among the requested fields we need to adjust max doc to be equal to the min ttf.
This was removed by mistake when fixing elastic#41125.

Closes elastic#41934
@jimczi jimczi added >bug :Search/Search Search-related issues that do not fall into other categories v8.0.0 v7.2.0 labels May 8, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search

@jimczi
Copy link
Contributor Author

jimczi commented May 8, 2019

@elasticmachine run elasticsearch-ci/1

@TheRealChrisS
Copy link

@jimczi, thanks for fixing that. This pull request is labeled with v7.2.0. Does it mean this fix will come with v7.2.0 only?

Copy link
Contributor

@markharwood markharwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can verify that this fixes the TTF>=DF assertion failure but I'm not certain whether lowering DF or raising TTF would be the right fix scoring-wise.

The main motivation for BlendedTermQuery was to fix the problem of multi-field-search favouring the wrong context for a term (default Lucene prefers lastName:mark over firstName:mark). To fix that we had to pick the most common field for a term and use that as the choice of DF.
So we want to take max(DF) of the fields containing a term to make it more boring.

Perhaps the fix should be to raise TTF rather than lower DF given our general motivation is to raise DF.

I confess I don't know how TTF is generally used in scoring - I know higher TFs are preferred because it indicates a single doc is repeatedly using a term but not sure what value is attached to TTF.

My assumption is fixing the assertion break by raising TTF is preferable to lowering DF given we want to make terms seem more boring, not rarer and therefore more interesting.

@jimczi
Copy link
Contributor Author

jimczi commented May 9, 2019

Perhaps the fix should be to raise TTF rather than lower DF given our general motivation is to raise DF.

I'd like to merge this fix first since this is a bug introduced in a patch version and we can discuss better ways to handle this in a follow up ? We also want to switch to the new BM25FQuery in Lucene which doesn't alter the ttf or df artificially.

@jimczi
Copy link
Contributor Author

jimczi commented May 9, 2019

This pull request is labeled with v7.2.0. Does it mean this fix will come with v7.2.0 only?

That's the plan, yes. It's too late for 7.1.0 but the fix should be in the next release after that.

@jimczi
Copy link
Contributor Author

jimczi commented May 9, 2019

@elasticmachine run elasticsearch-ci/1

@jimczi jimczi merged commit 166a921 into elastic:master May 9, 2019
@jimczi jimczi deleted the bug/blended_min_ttf branch May 9, 2019 12:25
jimczi added a commit that referenced this pull request May 9, 2019
If the max doc in the index is greater than the minimum total term frequency
among the requested fields we need to adjust max doc to be equal to the min ttf.
This was removed by mistake when fixing #41125.

Closes #41934
jimczi added a commit that referenced this pull request May 9, 2019
If the max doc in the index is greater than the minimum total term frequency
among the requested fields we need to adjust max doc to be equal to the min ttf.
This was removed by mistake when fixing #41125.

Closes #41934
@rightaway
Copy link

@jimczi Why is this not being put into the very next patch release? The very point of a patch release is to fix bugs.

Including this severe bug that has made search completely unusable for us. This bug was introduced in a patch release and has no good reason not to be fixed in a patch release.

What was supposed to be a painless upgrade to 7.0.1 has literally broken search for us and we are supposed to wait for 2 more minor releases?

Forget 7.2.0 and even 7.1.0, this needs to be a priority to make it into 7.0.2!

gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
If the max doc in the index is greater than the minimum total term frequency
among the requested fields we need to adjust max doc to be equal to the min ttf.
This was removed by mistake when fixing elastic#41125.

Closes elastic#41934
jtibshirani added a commit that referenced this pull request Sep 23, 2022
In #89016 we adjusted the `cross_fields` scoring formula to prevent negative
scores. This fix accidentally dropped another important fix that was added in
#41938. Specifically, we need to make sure to take the minimum between the
document frequency (`actualDf`) and the minimum total term frequency
(`minTTF`). Otherwise, we can produce invalid term statistics where the total
term frequency is less than the document frequency.

Fixes #90275
jtibshirani added a commit that referenced this pull request Sep 23, 2022
In #89016 we adjusted the `cross_fields` scoring formula to prevent negative
scores. This fix accidentally dropped another important fix that was added in
#41938. Specifically, we need to make sure to take the minimum between the
document frequency (`actualDf`) and the minimum total term frequency
(`minTTF`). Otherwise, we can produce invalid term statistics where the total
term frequency is less than the document frequency.

Fixes #90275
jtibshirani added a commit to jtibshirani/elasticsearch that referenced this pull request Sep 23, 2022
In elastic#89016 we adjusted the `cross_fields` scoring formula to prevent negative
scores. This fix accidentally dropped another important fix that was added in
elastic#41938. Specifically, we need to make sure to take the minimum between the
document frequency (`actualDf`) and the minimum total term frequency
(`minTTF`). Otherwise, we can produce invalid term statistics where the total
term frequency is less than the document frequency.

Fixes elastic#90275
jtibshirani added a commit that referenced this pull request Sep 23, 2022
In #89016 we adjusted the `cross_fields` scoring formula to prevent negative
scores. This fix accidentally dropped another important fix that was added in
#41938. Specifically, we need to make sure to take the minimum between the
document frequency (`actualDf`) and the minimum total term frequency
(`minTTF`). Otherwise, we can produce invalid term statistics where the total
term frequency is less than the document frequency.

Fixes #90275
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Search/Search Search-related issues that do not fall into other categories v7.1.0 v7.2.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

“totalTermFreq must be at least docFreq” error after upgrading to 7.0.1
7 participants