
perf: Adds a caching to shard management counts to alleviate shard scanning O(n) #530

Closed
wants to merge 459 commits

Conversation

@astubbs (Contributor) commented Dec 21, 2022

Description...

Checklist

  • Documentation (if applicable)
  • Changelog
  • Final speed comparison to origin

astubbs and others added 30 commits July 23, 2021 16:01
Fixes confluentinc#60 - upgrade AK to 2.7.0.

Adds 2.7.0 as a stable build.
This messes with the shutdown process of the Producer, i.e. committing final transactions, offsets etc.
refactor: Extract common Reactor and Vert.x parts
Prevents the extension modules from incorrectly inheriting core methods
that would be broken if used.

Step 1:
new parent: rename
Remove deprecated test class
Base class refactor - removes core api from extension modules
Don't know how this made it through CI.

Missing copyright and updated readme.
… see #maxConcurrency

Vert.x concurrency control previously relied on the Vert.x WebClient controlling
the concurrency setting per host. This breaks when multiple hosts are used -
now the max concurrency can go beyond the setting. This change migrates to the
new ExternalEngine system, which controls concurrency properly.
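The difference between a per-host cap and an engine-level cap can be sketched with a plain semaphore. This is an illustrative sketch only, not the real ExternalEngine code; the class and method names below are hypothetical.

```java
import java.util.concurrent.Semaphore;

// Minimal sketch: a single global permit pool caps total in-flight requests
// across ALL hosts. A per-host limit, by contrast, lets each of N hosts run
// up to the limit, so total concurrency can exceed the configured setting.
// Hypothetical illustration; not the actual parallel-consumer implementation.
class GlobalConcurrencyCap {
    private final int maxConcurrency;
    private final Semaphore permits;

    GlobalConcurrencyCap(int maxConcurrency) {
        this.maxConcurrency = maxConcurrency;
        this.permits = new Semaphore(maxConcurrency);
    }

    /** Try to start a request against any host; returns false when at the cap. */
    boolean tryStart() {
        return permits.tryAcquire();
    }

    /** Must be called when a request completes, whether it succeeded or failed. */
    void complete() {
        permits.release();
    }

    /** Requests currently in flight, for monitoring. */
    int inFlight() {
        return maxConcurrency - permits.availablePermits();
    }
}
```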

Turn off performance comparison unit test - too brittle for CI.
Under unrealistically high load with no-op processing, the broker poller unblocking a partition could cause ProcessingShard to skip forward in its entries and take work out of order.
This was discovered while fixing a synthetic high-performance benchmark, after an O(n) algorithm was fixed to O(1), creating the state for the race condition to appear. It probably could not happen without that fix, as it is related to the performance of certain parts of the system.
@what-the-diff

what-the-diff bot commented Jan 13, 2023

  • Added a new run configuration for performance tests
  • Updated the pom file to include ProgressBar class in bytecode enhancement process
  • Removed log statement from poll method of ParallelEoSStreamProcessor class as it was causing too much logging and slowing down test execution time by ~10% (Encoding back pressure system so that offset payloads are prevented from being too large #47)
  • Fixed a bug where interrupting the control thread would cause an exception if the work mailbox is not currently being polled, which could happen when no records are available on the broker or all partitions have been revoked due to a rebalance event (https://github.com/confluentinc/parallel-consumer#bugfix--interrupting-controlthread---exception)
  • Changed the offset map encoding logic so that partition state can be set into blocked mode once the payload size exceeds the pressure threshold but is still below the max allowed metadata size limit; this prevents further messages from being processed until record successes reduce the encoded payload size below the pressure threshold again - https://github.com/confluentinc/parallel-consumer#feature--offsetmapencodingbackpressurethresholds
  • The function workIsWaitingToBeProcessed() was renamed to isWorkWaitingToBeProcessed().
  • A new field, totalShardEntriesNotInFlight, was added, which tracks the number of entries across all shards that are not currently being processed by a consumer thread. It is updated when an entry is removed from a shard (onSuccess or onFailure), when an entry is added back into the retry queue after a processing failure, and decremented whenever records are taken out for processing. This keeps a running count of unprocessed records without having to iterate over every record in every shard as before, which could be very expensive.
  • To avoid starvation, if partition blocking occurs while getting work from shards, scanning stops immediately instead of continuing to the resume point. Even though other partitions may have messages ready for consumption, blocked partitions prevent progress with offset commits, so repeatedly checking those available-but-unprocessable messages would only waste CPU cycles.
  • Added logging around why certain WorkContainers aren't taken as work yet: the retry delay hasn't passed; the work already succeeded; it already failed and is waiting out its retry period; or its partition has been temporarily revoked due to too much lag behind committed offsets. Slow queues are logged separately using a rate-limiting mechanism, so at most one warning is logged every 5 seconds for the same issue, preventing the logs from filling with repeated warnings.
  • The testLargeVolumeInMemory() method was changed to use a different number of records.
  • A new function called getTotalSizeOfAllShards() was added in the WorkManager class, which returns the total size of all shards not currently being processed by any consumer instance (i.e., it is waiting for selection). This value is used as part of an if statement that determines whether or not more work should be downloaded from Kafka and put into memory before processing begins on existing data already in memory.
  • Added a new method to KafkaTestUtils, which allows the user to add records directly into the MockConsumer
  • Updated all tests that use generateRecords() in order for them not to fail due to ordering issues with offsets and partitions (see note on deprecated methods)
  • Fixed some bugs where we were using incorrect types of collections when adding/removing elements from lists or maps - this was causing test failures as well as potential race conditions if used in production code
  • Removed unused imports and variables throughout project files
  • Added @SneakyThrows to JStreamParallelEoSStreamProcessorTest.java
  • Removed @BeforeEach from ParallelEoSStreamProcessorTest.java
  • Changed new ConcurrentLinkedQueue<>() to new ConcurrentLinkedDeque<>() in OffsetEncodingBackPressureUnitTest
  • Replaced assertThat with ManagedTruth assertions for better error messages when the test fails (eos-test)
  • Added a new test to check the final rate of progress bar is at least some value.
  • Changed an assertion in WorkManagerTest class from numberOfWorkQueuedInShardsAwaitingSelection() to totalShardEntriesNotInFlight().
  • Enabled debug logging for DynamicLoadFactor and ParallelEoSStreamProcessorTest classes by removing comments before logger statements in logback-test file.
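The cached-count technique summarized in the bullets above (the core of this PR) can be sketched as follows. This is a standalone illustration, not the real WorkManager code: only totalShardEntriesNotInFlight and getTotalSizeOfAllShards come from the summary above; every other name is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Sketch of replacing an O(n) scan over every shard entry with an O(1) read
// of a counter that is maintained incrementally on every add/remove.
// Hypothetical illustration; not the actual parallel-consumer classes.
class ShardCountSketch {
    private final Map<String, Queue<Long>> shards = new HashMap<>();
    // Cached counter, updated on every mutation instead of recomputed by scanning.
    private long totalShardEntriesNotInFlight = 0;

    void addWork(String shardKey, long offset) {
        shards.computeIfAbsent(shardKey, k -> new ArrayDeque<>()).add(offset);
        totalShardEntriesNotInFlight++; // increment when work enters a shard
    }

    Long takeWork(String shardKey) {
        Queue<Long> shard = shards.get(shardKey);
        if (shard == null || shard.isEmpty()) return null;
        totalShardEntriesNotInFlight--; // decrement when work goes in-flight
        return shard.poll();
    }

    void onFailureRetry(String shardKey, long offset) {
        // Failed work re-enters its shard, so the cached count goes back up.
        addWork(shardKey, offset);
    }

    // O(1): read the cached field rather than iterating all shard entries.
    long getTotalSizeOfAllShards() {
        return totalShardEntriesNotInFlight;
    }
}
```

The trade-off is that every code path mutating a shard must remember to adjust the counter, which is one reason the real change needed the follow-up race-condition fix described elsewhere in this thread.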

@astubbs (Contributor, Author) left a comment:

notes

@@ -72,12 +72,31 @@ void backPressureShouldPreventTooManyMessagesBeingQueuedForProcessing() throws O

var completes = LongStreamEx.of(numberOfRecords).filter(x -> !blockedOffsets.contains(x)).boxed().toList();


{
var totalSizeOfAllShards = wm.getTotalSizeOfAllShards();
@astubbs (Contributor, Author) commented:
remove

@astubbs (Contributor, Author) left a comment:
...

@astubbs (Contributor, Author) left a comment:
lgtm

…e out of order processing (confluentinc#534)

Under unrealistically high load with no-op processing, broker poller unblocking a partition could cause ProcessingShard to skip forward in its entries and take work out of order.

This was discovered while fixing a synthetic high-performance benchmark, after PR #530 (an O(n) algorithm was fixed to O(1)) created the state for the race condition to appear. It probably could not happen without that fix, as it is related to the performance of certain parts of the system.
…counts

# Conflicts:
#	parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/state/PartitionState.java
#	parallel-consumer-core/src/main/java/io/confluent/parallelconsumer/state/ProcessingShard.java
@astubbs astubbs changed the title perf: Adds a caching layer to work management to alleviate O(n) counting perf: Adds a caching to shard management counts to alleviate O(n) counting Jan 25, 2023
@astubbs astubbs changed the title perf: Adds a caching to shard management counts to alleviate O(n) counting perf: Adds a caching to shard management counts to alleviate shard scanning O(n) Jan 25, 2023
@cla-assistant

cla-assistant bot commented Aug 8, 2023

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 3 committers have signed the CLA.

✅ acktsap
❌ astubbs
❌ nachomdo
