Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Security for _field_names field should not override field statistics #33261

Merged
merged 5 commits into from
Sep 3, 2018

Conversation

jimczi
Copy link
Contributor

@jimczi jimczi commented Aug 30, 2018

In Lucene 8 the statistics for a field (doc_count, sum_doc_count, ...) are
checked and invalid values (v < 0) are rejected. Though for the _field_names
field we hide the statistics of the field if security is enabled since
some terms (field names) may be filtered. However this statistics are never
used, this field is not used for ranking and cannot be used to generate
term vectors. For these reasons this commit restores the original statistics
for the field in order to be compliant with Lucene 8.

In Lucene 8 the statistics for a field (doc_count, sum_doc_count, ...) are
checked and invalid values (v < 0) are rejected. Though for the _field_names
field we hide the statistics of the field if security is enabled since
some terms (field names) may be filtered. However this statistics are never
used, this field is not used for ranking and cannot be used to generate
term vectors. For these reasons this commit restores the original statistics
for the field in order to be compliant with Lucene 8.
@jimczi jimczi added >non-issue :Search/Search Search-related issues that do not fall into other categories v7.0.0 team-discuss labels Aug 30, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search-aggs

@jimczi jimczi mentioned this pull request Aug 30, 2018
18 tasks
@@ -109,11 +114,13 @@ public CacheHelper getReaderCacheHelper() {
private final FieldInfos fieldInfos;
/** An automaton that only accepts authorized fields. */
private final CharacterRunAutomaton filter;
/** {@link Terms} cache with filtered stats for the {@link FieldNamesFieldMapper} field. */
private Terms fieldNamesFilterTerms;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's make it final?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

@@ -371,37 +375,47 @@ private Terms wrapTerms(Terms terms, String field) {
* representing fields that should not be visible in this reader.
*/
class FieldNamesTerms extends FilterTerms {
long size = 0;
long sumDocFreq;
int docCount;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we make them final somehow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

while (e.next() != null) {
size ++;
sumDocFreq += e.docFreq();
docCount = Math.max(e.docFreq(), docCount);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is correct... Maybe we should assume docCount = maxDoc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oups thanks, I changed it to return maxDoc instead

@jimczi
Copy link
Contributor Author

jimczi commented Aug 31, 2018

We discussed with @jpountz and he proposed that we recompute the stats for this field rather than keeping the wrong stats. This should be fast since the number of terms is bounded by the number of fields in the index and the result can be cached per leaf reader. This idea implemented in c3c8ca3.

@jimczi
Copy link
Contributor Author

jimczi commented Aug 31, 2018

run gradle build tests

@jimczi jimczi merged commit f0a61b6 into elastic:master Sep 3, 2018
@jimczi jimczi deleted the field_names_field_subset branch September 3, 2018 07:36
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Sep 3, 2018
* master: (197 commits)
  Prevent NPE parsing the stop datafeed request. (elastic#33347)
  HLRC: Add ML get overall buckets API (elastic#33297)
  Core: Fix epoch millis java time formatter (elastic#33302)
  [Docs] Improve tuning for speed advice (elastic#33315)
  [Rollup] Fix Caps Comparator to handle calendar/fixed time (elastic#33336)
  [CI] Mute  IndexShardTests#testIndexCheckOnStartup fails elastic#33345
  [CI] Mute LuceneChangesSnapshotTests#testUpdateAndReadChangesConcurrently
  Security for _field_names field should not override field statistics (elastic#33261)
  Add early termination support to BucketCollector (elastic#33279)
  Fix extractjar task  ci  (elastic#33272)
  Mute testFollowIndexAndCloseNode
  Logging: Drop Settings from some logging ctors (elastic#33332)
  HLREST: add update by query API (elastic#32760)
  TEST: Increase timeout testFollowIndexAndCloseNode (elastic#33333)
  HLRC: ML Flush job (elastic#33187)
  HLRC: Adding ML Job stats (elastic#33183)
  LLREST: Drop deprecated methods (elastic#33223)
  Mute testSyncerOnClosingShard
  [DOCS] Moves machine learning APIs to docs folder (elastic#31118)
  Mute test watcher usage stats output
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>non-issue :Search/Search Search-related issues that do not fall into other categories v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants