FilterPolicy API changes for 7.0 #9501
Conversation
Summary:

* Inefficient block-based filter is no longer customizable in the public API, though (for now) it can still be enabled.
* Removed deprecated FilterPolicy::CreateFilter() and FilterPolicy::KeyMayMatch().
* Removed `rocksdb_filterpolicy_create()` from the C API.
* Changed the meaning of a nullptr return from GetBuilderWithContext() from "use block-based filter" to "generate no filter in this case." This is a cleaner solution to the proposal in facebook#8250.
* Also, when the user specifies bits_per_key < 0.5, we now round this down to "no filter," because we expect a filter with >= 80% FP rate is unlikely to be worth the CPU cost of accessing it (especially with cache_index_and_filter_blocks=1 or partition_filters=1). See the sketch after this summary.
* bits_per_key >= 0.5 and < 1.0 is still rounded up to 1.0 (for a 62% FP rate).
* This also gives us some support for configuring filters from an OPTIONS file as currently saved: `filter_policy=rocksdb.BuiltinBloomFilter`. Opening from such an options file will enable reading filters (an improvement) but not writing new ones. (See Customizable follow-up below.)
* Also removed deprecated functions:
  * FilterBitsBuilder::CalculateNumEntry()
  * FilterPolicy::GetFilterBitsBuilder()
  * NewExperimentalRibbonFilterPolicy()
* Removed default implementations of:
  * FilterBitsBuilder::EstimateEntriesAdded()
  * FilterBitsBuilder::ApproximateNumEntries()
  * FilterPolicy::GetBuilderWithContext()
* Removed support for the "filter_policy=experimental_ribbon" configuration string.
* Allow "filter_policy=bloomfilter:n" without the bool, to discourage use of block-based filter.

Some pieces for #9389

Likely follow-up (later PRs):

* Refactoring toward FilterPolicy Customizable, so that we can generate filters with the same configuration as before when configuring from an options file.
* Remove support for the user enabling block-based filter (ignore `bool use_block_based_builder`).
* Some months after this change, we could even remove read support for block-based filter, because it is not critical to DB data preservation.
* Make FilterBitsBuilder::FinishV2 to avoid the `using FilterBitsBuilder::Finish` mess, and add support for specifying a MemoryAllocator (for cache warming).

Test Plan: A number of obsolete tests deleted and new tests or test cases added or updated.
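A minimal sketch of how the bits_per_key thresholds described above play out when configuring a table, assuming the RocksDB 7.0 headers (the rounding happens inside the policy, so the calls just show where the thresholds apply):

```cpp
#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

int main() {
  rocksdb::BlockBasedTableOptions table_options;

  // bits_per_key < 0.5 is rounded DOWN to "no filter": a filter with an
  // expected FP rate >= 80% is unlikely to repay the CPU cost of probing it.
  table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(0.4));

  // 0.5 <= bits_per_key < 1.0 is still rounded UP to 1.0 (~62% FP rate).
  table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(0.7));

  // Typical production setting: ~10 bits/key, ~1% FP rate.
  table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));

  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return 0;
}
```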
@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
util/bloom_test.cc (Outdated)

@@ -399,21 +401,22 @@ TEST_P(FullBloomTest, FilterSize) {
  bool some_computed_less_than_denoted = false;
  // Note: enforced minimum is 1 bit per key (1000 millibits), and enforced
Seems like we also need to fix this comment, since we now drop the "minimum is 1 bit per key (1000 millibits)" behavior.
- // Abstract base class for RocksDB built-in filter policies.
+ // Base class for RocksDB built-in filter policies. This can read all
+ // kinds of built-in filters (for backward compatibility with old
+ // OPTIONS files) but does not write filters.
Is "write filter" the same as "build filter"? I am more used to using the word "build" led by "FilterBitsBuilder". Is there a technical reason to use "write" instead of "build"?
Good point. I'll use consistent wording.
First round review: left 1 minor comment and 2 questions.
      : ioptions_(options_),
        env_options_(options_),
        icomp_(options_.comparator) {
-   table_options_.filter_policy.reset(filter_policy);
+   table_options_.filter_policy = std::move(filter_policy);
Any subtle reason why we are doing this trick?
`table_options_.filter_policy.reset(filter_policy.get());` would be wrong for a shared_ptr, because the object would be managed twice (double free). `table_options_.filter_policy = filter_policy;` would be OK but might unnecessarily increment+decrement the refcount on the shared_ptr (due to two potentially live local copies). `std::move` essentially guarantees the optimization by ensuring only one live local copy.

I know @mrambacher likes to use `const std::shared_ptr&` parameters, but in cases like this, where the callee will surely (or almost surely) save a copy, I prefer a by-value `std::shared_ptr`, because you can pass in a temporary like `std::make_shared<X>` and likely avoid an unnecessary extra increment+decrement of the refcount.
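A minimal standalone sketch of the three variants discussed above; `FilterPolicy`, `BlockBasedTableOptions`, and `Holder` here are simplified stand-ins for illustration, not the actual RocksDB types:

```cpp
#include <memory>
#include <utility>

struct FilterPolicy {};

struct BlockBasedTableOptions {
  std::shared_ptr<const FilterPolicy> filter_policy;
};

struct Holder {
  BlockBasedTableOptions table_options_;

  explicit Holder(std::shared_ptr<const FilterPolicy> filter_policy) {
    // Wrong: reset(filter_policy.get()) would create a second control block
    // owning the same object -> double free on destruction.
    //
    // OK: `table_options_.filter_policy = filter_policy;` is correct but pays
    // an extra refcount increment+decrement for the second live copy.
    //
    // Best here: move leaves exactly one live copy, no refcount churn.
    table_options_.filter_policy = std::move(filter_policy);
  }
};

int main() {
  // By-value parameter + move lets a temporary bind with no extra copies.
  Holder h(std::make_shared<const FilterPolicy>());
  return 0;
}
```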
Discussed offline.

For clarification: my original question was intended to ask why we add an "indirection" from the raw pointer `const FilterPolicy* filter_policy` to the shared pointer `table_options_.filter_policy` via the trick of adding another constructor. (I realize I quoted the wrong line for my question, so that's probably why it missed its point.)

Update: It's not needed in this code but might be needed for future refactoring. If that's the case, I suggest adding the "indirection" in that PR for better context.
OK, I'll take it out of this one. I just noticed I need to fix some FB lints also.
@@ -930,16 +930,13 @@ TEST_F(OptionsTest, GetBlockBasedTableOptionsFromString) {
             new_opt.cache_index_and_filter_blocks);
   ASSERT_EQ(table_opt.filter_policy, new_opt.filter_policy);

   // unrecognized filter policy config
   s = GetBlockBasedTableOptionsFromString(config_options, table_opt,
-                                          "cache_index_and_filter_blocks=1;"
Why are we removing "cache_index_and_filter_blocks=1;" in the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems irrelevant. The cases of

  ASSERT_EQ(table_opt.cache_index_and_filter_blocks,
            new_opt.cache_index_and_filter_blocks);

after

  ASSERT_NOK(GetBlockBasedTableOptionsFromString(...

seem especially useless, because they only test that a bad assignment wasn't made. They do not distinguish "no assignment" from "proper assignment before the failure." I don't know what's intended.
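For context, a hedged sketch of the test pattern in question, assuming the ConfigOptions overload of GetBlockBasedTableOptionsFromString from rocksdb/convenience.h (the option string here is made up):

```cpp
#include <cassert>

#include <rocksdb/convenience.h>
#include <rocksdb/table.h>

int main() {
  rocksdb::ConfigOptions config_options;
  rocksdb::BlockBasedTableOptions table_opt;
  rocksdb::BlockBasedTableOptions new_opt;

  // An unrecognized filter policy should make parsing fail.
  rocksdb::Status s = rocksdb::GetBlockBasedTableOptionsFromString(
      config_options, table_opt, "filter_policy=nonexistentfilter:4",
      &new_opt);
  assert(!s.ok());

  // But asserting that new_opt fields still equal table_opt's only proves
  // no *bad* assignment happened; it cannot tell "never assigned" apart
  // from "correctly assigned before the failure was detected".
  assert(table_opt.cache_index_and_filter_blocks ==
         new_opt.cache_index_and_filter_blocks);
  return 0;
}
```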
@@ -1790,8 +1785,8 @@ Status FilterPolicy::CreateFromString(
     } else {
       bloom_before_level = ParseInt(trim(value.substr(pos + 1)));
     }
     double bloom_equivalent_bits_per_key =
         ParseDouble(trim(value.substr(kRibbonName.size(), pos)));
Is the previous parsing a bug? Should it have been `ParseDouble(trim(value.substr(kRibbonName.size(), pos - kRibbonName.size())));`?
Yes, but luckily it's just a code hygiene bug. If you go past the end with substr, it just gives you a string up to the end. And ParseDouble seems to stop parsing and return what it has when it hits ':'.
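To make the clamping behavior concrete, a small standalone sketch; the value of kRibbonName ("ribbonfilter:") and the input string are assumptions for illustration:

```cpp
#include <cassert>
#include <string>

int main() {
  // Suppose kRibbonName is "ribbonfilter:" (length 13) and pos is the index
  // of the second ':' (here 16).
  const std::string value = "ribbonfilter:9.9:1";
  const std::size_t name_len = 13;
  const std::size_t pos = value.find(':', name_len);  // == 16

  // Buggy call: passes `pos` as the COUNT, not the end index, so it asks for
  // 16 chars starting at index 13. substr clamps a past-the-end count to the
  // end of the string, so this yields "9.9:1" rather than throwing.
  assert(value.substr(name_len, pos) == "9.9:1");

  // Hygienic call: count = pos - name_len yields exactly the number field.
  assert(value.substr(name_len, pos - name_len) == "9.9");

  // Either way, a strtod-style ParseDouble stops at the ':' and returns 9.9,
  // which is why the bug was harmless in practice.
  return 0;
}
```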
Second round: left 3 questions
@pdillinger has updated the pull request. You must reimport the pull request before landing.
@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
LGTM!
@pdillinger has updated the pull request. You must reimport the pull request before landing.
@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Summary: This change removes the ability to configure the deprecated, inefficient block-based filter in the public API. Options that would have enabled it now use "full" (and optionally partitioned) filters. Existing block-based filters can still be read and used, and a "back door" way to build them still exists, for testing and in case of trouble.

About the only way this removal would cause an issue for users is if temporary memory for filter construction greatly increases. In HISTORY.md we suggest a few possible mitigations: partitioned filters, smaller SST files, or setting reserve_table_builder_memory=true. Also, users who have customized a FilterPolicy using the CreateFilter/KeyMayMatch mechanism removed in #9501 will have to upgrade their code. (It's long past time for people to move to the new builder/reader customization interface.)

This change also introduces some internal-use-only configuration strings for testing specific filter implementations while bypassing some compatibility / intelligence logic. This is intended to hint at a path toward making FilterPolicy Customizable, but it also gives us a "back door" way to configure block-based filter.

Aside: updated db_bench so that -readonly implies -use_existing_db

Pull Request resolved: #9535

Test Plan: Unit tests updated. Specifically,

* BlockBasedTableTest.BlockReadCountTest is tweaked to validate the back door configuration interface and ignoring of `use_block_based_builder`.
* BlockBasedTableTest.TracingGetTest is migrated from testing the block-based filter access pattern to the full filter access pattern, by re-ordering some things.
* Options test (pretty self-explanatory)

Performance test - create with `./db_bench -db=/dev/shm/rocksdb1 -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=fillrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0` with and without `-use_block_based_filter`, which creates a DB with 21 SST files in L0. Read with `./db_bench -db=/dev/shm/rocksdb1 -readonly -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=readrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -duration=30`

Without -use_block_based_filter: readrandom 464 ops/sec, 689280 KB DB
With -use_block_based_filter: readrandom 169 ops/sec, 690996 KB DB

No consistent difference with fillrandom

Reviewed By: jay-zhuang
Differential Revision: D34153871
Pulled By: pdillinger
fbshipit-source-id: 31f4a933c542f8f09aca47fa64aec67832a69738