
[multistage] implement naive round robin operator chain scheduling #9753

Merged
4 commits merged into apache:master on Nov 15, 2022

Conversation

@agavra (Contributor) commented Nov 7, 2022

fixes #9615

This is a follow-up to #9711 and follows the design outlined in this design doc.

This PR implements a naive round-robin operator chain scheduling algorithm and sets up the interface for future PRs that will implement more advanced scheduling. As of this PR, all queries are guaranteed to make progress (see the change in pinot-integration-tests/src/test/java/org/apache/pinot/integration/tests/SSBQueryIntegrationTest.java; it can now be run with only 2 cores available), but the algorithm is still very hungry for CPU (queries with nothing in their mailbox will still be scheduled).

Review Guide:

  • look at OpChainSchedulerService and RoundRobinScheduler, which contain the logic for yielding threads when there's no more work to be done for a given operator chain (a simplified sketch of this idea follows this list)
  • look at WorkerQueryExecutor to see where this new scheduler is now wired in as opposed to running the work directly on the old worker pool
  • I added some logic to PhysicalPlanVisitor to collect information on mailboxes so that later we can hook up the mailboxes with the scheduling logic. Probably should have been done in a follow-up PR but 🤷 I was already at it. Let me know if you want me to split it up
  • Look at the corresponding tests
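For readers skimming the review, here is a minimal sketch (not the actual Pinot classes) of the naive round-robin idea: chains sit in a FIFO queue, a worker pulls the next chain, runs it until it has no immediate work (the real code signals this with a NOOP block), and re-queues it so every chain keeps making progress. The names Chain, doSomeWork, and NaiveRoundRobin are made up for illustration.

import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for an operator chain; not the Pinot OpChain API.
interface Chain {
  boolean doSomeWork(); // returns true once the chain is fully done
}

public final class NaiveRoundRobin {
  private final ConcurrentLinkedQueue<Chain> _queue = new ConcurrentLinkedQueue<>();
  private final ExecutorService _workers;

  public NaiveRoundRobin(int numThreads) {
    _workers = Executors.newFixedThreadPool(numThreads);
    for (int i = 0; i < numThreads; i++) {
      _workers.submit(this::workerLoop);
    }
  }

  public void register(Chain chain) {
    _queue.offer(chain);
  }

  private void workerLoop() {
    while (!Thread.currentThread().isInterrupted()) {
      Chain chain = _queue.poll();
      if (chain == null) {
        Thread.onSpinWait(); // naive and CPU-hungry, as the PR description notes
        continue;
      }
      if (!chain.doSomeWork()) {
        _queue.offer(chain); // not finished: yield this thread and re-register the chain
      }
    }
  }
}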

cc @walterddr @61yao

}
}

// TODO: remove this method after we pipe down the proper executor pool to the v1 engine
Contributor Author:

this is a bit unfortunate, but it's how the existing code works and refactoring it would be out of scope for this PR. While it's not particularly efficient, it also isn't dangerous - V1 queries are non-blocking, so using the same worker pool for executing V1 queries (that are issued as part of V2) and V2 intermediate queries does not threaten liveness.

Contributor:

lol should've read this first. good call out.

another way is to simply decouple the executor service used by v1 from v2. I am not sure which is better.

Contributor Author:

yeah, I think we should split the v1/v2 executor pools - that's probably the safest option. Alternatively we may also want three pools: a v2-intermediate pool, a v1-via-v2 pool, and a v1-vanilla pool. That would let us make sure that clusters running an existing v1 vanilla workload are not exposed to the v2 engine at all.
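As an illustration of that idea only (none of this is in the PR; the pool names and sizes are made up), splitting into three independent pools might look roughly like:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical wiring: three independent pools so a v1-only workload never
// shares threads with the v2 engine. Names and sizes are illustrative.
final class QueryExecutorPools {
  private static ThreadFactory named(String prefix) {
    AtomicInteger counter = new AtomicInteger();
    return r -> new Thread(r, prefix + "-" + counter.getAndIncrement());
  }

  final ExecutorService v2Intermediate = Executors.newFixedThreadPool(4, named("v2-intermediate"));
  final ExecutorService v1ViaV2 = Executors.newFixedThreadPool(4, named("v1-via-v2"));
  final ExecutorService v1Vanilla = Executors.newFixedThreadPool(4, named("v1-vanilla"));
}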

Contributor:

+1, but for another day

// not complete, needs to re-register for scheduling
register(operatorChain);
} else {
LOGGER.info("Execution time: " + timer.getThreadTimeNs());
Contributor Author:

for more complex scheduling algorithms, we will add a callback here to complete or unregister an operator chain. That requires a unique way to identify operator chains, which adds a bit more code, so I avoided it in this PR.
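A minimal sketch of what such an identifier plus completion callback could look like (purely hypothetical names, not code from this PR):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical: identify chains by (requestId, stageId) so a completion
// callback can unregister exactly the chain that finished.
record OpChainKey(long requestId, int stageId) { }

final class CompletionTracker {
  private final Map<OpChainKey, Runnable> _onComplete = new ConcurrentHashMap<>();

  void register(OpChainKey key, Runnable callback) {
    _onComplete.put(key, callback);
  }

  void complete(OpChainKey key) {
    Runnable callback = _onComplete.remove(key);
    if (callback != null) {
      callback.run(); // e.g. deregister the chain from the scheduler
    }
  }
}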

Contributor:

+1. good splitting point

@codecov-commenter commented Nov 8, 2022

Codecov Report

Merging #9753 (77c080a) into master (2f640ff) will decrease coverage by 0.02%.
The diff coverage is 94.80%.

@@             Coverage Diff              @@
##             master    #9753      +/-   ##
============================================
- Coverage     70.08%   70.05%   -0.03%     
- Complexity     4980     5396     +416     
============================================
  Files          1951     1957       +6     
  Lines        104561   104878     +317     
  Branches      15836    15874      +38     
============================================
+ Hits          73279    73477     +198     
- Misses        26155    26245      +90     
- Partials       5127     5156      +29     
Flag Coverage Δ
integration1 25.36% <0.00%> (-0.08%) ⬇️
integration2 24.56% <0.00%> (-0.08%) ⬇️
unittests1 67.58% <94.80%> (+0.03%) ⬆️
unittests2 15.68% <94.80%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...ot/query/runtime/executor/RoundRobinScheduler.java 87.50% <87.50%> (ø)
...g/apache/pinot/query/runtime/operator/OpChain.java 87.50% <87.50%> (ø)
...uery/runtime/executor/OpChainSchedulerService.java 95.12% <95.12%> (ø)
...va/org/apache/pinot/query/runtime/QueryRunner.java 81.55% <100.00%> (ø)
...ot/query/runtime/executor/WorkerQueryExecutor.java 100.00% <100.00%> (+7.69%) ⬆️
...query/runtime/operator/MailboxReceiveOperator.java 79.66% <100.00%> (+0.71%) ⬆️
.../pinot/query/runtime/plan/PhysicalPlanVisitor.java 97.22% <100.00%> (+0.44%) ⬆️
...va/org/apache/pinot/query/service/QueryServer.java 73.80% <100.00%> (+1.30%) ⬆️
...data/manager/realtime/DefaultSegmentCommitter.java 0.00% <0.00%> (-80.00%) ⬇️
...er/api/resources/LLCSegmentCompletionHandlers.java 43.56% <0.00%> (-18.82%) ⬇️
... and 57 more


Comment on lines -69 to -72
if (Runtime.getRuntime().availableProcessors() < MIN_AVAILABLE_CORE_REQUIREMENT) {
throw new SkipException("Skip SSB query testing. Insufficient core count: "
+ Runtime.getRuntime().availableProcessors());
}
Contributor:

freaking awesome!!!

Comment on lines 67 to 68
public void processQuery(DistributedStagePlan queryRequest, Map<String, String> requestMetadataMap,
ExecutorService executorService) {
OpChainSchedulerService scheduler) {
Contributor:

not necessarily needed in this PR: let's change this API to directly take operator chain as input

Suggested change
public void processQuery(DistributedStagePlan queryRequest, Map<String, String> requestMetadataMap,
ExecutorService executorService) {
OpChainSchedulerService scheduler) {
public void processQuery(OpChain opChain, OpChainSchedulerService scheduler) {

and construct the OpChain outside, so that we can return early if there's any error during OpChain construction

Contributor Author (@agavra, Nov 14, 2022):

I just deleted WorkerQueryExecutor altogether, it really doesn't make sense to have that and the scheduler.

Constructing the OpChain on the request thread should be a different PR.

@@ -128,7 +129,7 @@ public void processQuery(DistributedStagePlan distributedStagePlan, ExecutorServ
for (ServerPlanRequestContext requestContext : serverQueryRequests) {
ServerQueryRequest request = new ServerQueryRequest(requestContext.getInstanceRequest(),
new ServerMetrics(PinotMetricUtils.getPinotMetricsRegistry()), System.currentTimeMillis());
serverQueryResults.add(processServerQuery(request, executorService));
serverQueryResults.add(processServerQuery(request, scheduler.getWorkerPool()));
Contributor:

this means leaf stages are scheduled directly on top of the worker pool?

Contributor Author:

yes, IIRC this is the same as existing behavior if you follow where executorService is created and passed down

*
* @param mailbox the mailbox ID
*/
void onDataAvailable(MailboxIdentifier mailbox);
Contributor:

is this the only trigger? (other than register)

Contributor Author:

there are three possible triggers:

  • register
  • onDataAvailable
  • next().getRoot().nextBlock() completes

Triggers are defined in the implementation rather than the interface (a rough sketch of that split follows).
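To make that split concrete, here is a rough sketch of the shape (simplified, hypothetical method names; the real interface is OpChainScheduler, with RoundRobinScheduler as the implementation):

// The interface only exposes the external triggers; what each trigger does is
// left to the implementation. C is the operator chain type, M the mailbox ID type.
interface Scheduler<C, M> {
  void register(C opChain);          // trigger 1: a new chain arrives
  void onDataAvailable(M mailboxId); // trigger 2: a mailbox received data
  C next();                          // the scheduler service asks for the next chain to run
  // trigger 3 -- nextBlock() returning without finishing the chain -- is handled by
  // the caller re-invoking register(opChain) on the same scheduler instance.
}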

// so long as there's work to be done, keep getting the next block
// when the operator chain returns a NOOP block, then yield the execution
// of this to another worker
TransferableBlock result = operatorChain.getRoot().nextBlock();
Contributor:

IIUC, for the current mechanism, this will return a NO-OP every time it reaches all the way down to the mailbox receive and the buffer is empty. Yes?

Contributor Author:

👍 yup that's correct
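A stripped-down version of that loop, with placeholder types (Block, OperatorChain, and SchedulerCallbacks are not the real TransferableBlock/OpChain classes, and isNoOp/isEndOfStream are assumed helpers):

interface Block {
  boolean isNoOp();
  boolean isEndOfStream();
}

interface ChainRoot {
  Block nextBlock();
}

interface OperatorChain {
  ChainRoot getRoot();
}

interface SchedulerCallbacks {
  void reRegister(OperatorChain chain);
  void deregister(OperatorChain chain);
}

final class YieldOnNoOp {
  static void runOnce(OperatorChain chain, SchedulerCallbacks scheduler) {
    while (true) {
      Block block = chain.getRoot().nextBlock();
      if (block.isNoOp()) {
        scheduler.reRegister(chain); // mailbox empty right now: yield this worker thread
        return;
      }
      if (block.isEndOfStream()) {
        scheduler.deregister(chain); // chain finished: do not schedule it again
        return;
      }
      // otherwise keep pulling blocks while there is immediate work to do
    }
  }
}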

LOGGER.info("Initialized QueryWorker on port: {} with numWorkerThreads: {}", port,
ResourceManager.DEFAULT_QUERY_WORKER_THREADS);
}

public void start() {
LOGGER.info("Starting QueryWorker");
try {
_scheduler.startAsync().awaitRunning();
Contributor:

shutdown?

Contributor Author:

oops! good catch

@@ -218,6 +221,7 @@ public void start()
_queryRunner = new QueryRunner();
_queryRunner.init(configuration, _instanceDataManager, _helixManager, mockServiceMetrics());
_queryRunner.start();
_scheduler.startAsync().awaitRunning();
Contributor:

shutdown?

Contributor (@walterddr) left a comment:

LGTM overall. Minor comments, please take a look.


public static Operator<TransferableBlock> build(StageNode node, PlanRequestContext context) {
return node.visit(INSTANCE, context);
private List<MailboxIdentifier> _inputMailboxIds = new ArrayList<>();
Contributor:

can we move this into the PlanRequestContext? otherwise, we need to adjust the comment for the static usage of this visitor class

Comment on lines 54 to 56
public List<MailboxIdentifier> getInputMailboxes() {
return _inputMailboxes;
}
Contributor:

it doesn't seem like this is being used. From what I understand of the triggering mechanism:

  • register
  • onDataAvailable
  • next().getRoot().nextBlock() completes

all of these should be able to identify which opChain to call based on the jobID alone. Do we need the input mailbox ID list?

Contributor Author:

I think I see what you're saying. Since this isn't used in this PR I'll just clean it up for now and pipe it back in when I add the PR which triggers the scheduler via mailbox data available.

Contributor:

thank you. Or adding a unit test to explain it is also good. I am assuming it will be related to #9753 (comment) but I am not sure exactly, which is why I raised the question.

Comment on lines 50 to 51
private final Monitor _monitor = new Monitor();
protected final Monitor.Guard _hasNextOrClosing = new Monitor.Guard(_monitor) {
Contributor:

this is more of a question: why is one of these private and the other protected?

Contributor Author:

that is a very good question... I think it was an autocomplete typo

_timer = Suppliers.memoize(ThreadTimer::new)::get;
}

public Operator<TransferableBlock> getRoot() {
Contributor:

question for follow-up PRs:
assuming the opChain is going to be invoked via this root operator API, I was wondering how we can inform the scheduled opChain, from the scheduler.onDataAvailable(mailboxId) API, which mailboxId has new data, so we can do better than round-robin checking of each mailbox on the list

Contributor Author:

that's exactly the plan for the next PR - in fact it'll be even better than that: it won't schedule anything unless there's data available at all (and it'll sleep until it's notified of available data)
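A minimal sketch of that data-driven idea, assuming a hypothetical mapping from input mailbox IDs to chains (this is not the follow-up implementation, just the general shape):

import java.util.Map;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical event-driven variant: a chain is only (re)queued when one of its
// input mailboxes reports data, so idle chains never burn CPU.
final class DataAvailableScheduler<C, M> {
  private final Map<M, C> _chainByMailbox = new ConcurrentHashMap<>();
  private final Set<C> _queued = ConcurrentHashMap.newKeySet();
  private final BlockingQueue<C> _ready = new LinkedBlockingQueue<>();

  void register(C chain, Iterable<M> inputMailboxes) {
    for (M mailbox : inputMailboxes) {
      _chainByMailbox.put(mailbox, chain);
    }
    enqueue(chain); // run at least once so chains with no pending input can still finish
  }

  void onDataAvailable(M mailbox) {
    C chain = _chainByMailbox.get(mailbox);
    if (chain != null) {
      enqueue(chain);
    }
  }

  C next() throws InterruptedException {
    C chain = _ready.take(); // workers sleep here until some mailbox has data
    _queued.remove(chain);
    return chain;
  }

  private void enqueue(C chain) {
    if (_queued.add(chain)) { // avoid queueing the same chain twice
      _ready.offer(chain);
    }
  }
}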

@walterddr added the multi-stage label (Related to the multi-stage query engine) on Nov 15, 2022

private static final Logger LOGGER = LoggerFactory.getLogger(OpChainSchedulerService.class);

private final OpChainScheduler _scheduler;
Contributor:

Add a comment saying this is guarded by monitor below?

// not complete, needs to re-register for scheduling
register(operatorChain);
} else {
LOGGER.info("Execution time: " + timer.getThreadTimeNs());
Contributor:

Is this logging expensive? I feel there would be a lot of logs if we log the pause every time. Can we have some class such as OpChainStats to hold the data, and decide later where to report it?
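For illustration, such a stats holder could be as small as the following (hypothetical; nothing like it exists in this PR):

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical OpChainStats-style holder: accumulate timings per chain and let
// the caller decide later whether and where to report them.
final class OpChainStats {
  private final AtomicLong _totalThreadTimeNs = new AtomicLong();
  private final AtomicLong _timesScheduled = new AtomicLong();

  void recordRun(long threadTimeNs) {
    _totalThreadTimeNs.addAndGet(threadTimeNs);
    _timesScheduled.incrementAndGet();
  }

  @Override
  public String toString() {
    return "threadTimeNs=" + _totalThreadTimeNs.get() + ", runs=" + _timesScheduled.get();
  }
}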

Contributor:

oops good callout. this could be a debug log

* An {@code OpChain} represents a chain of operators that are separated
* by send/receive stages.
*/
public class OpChain {
Contributor (@61yao, Nov 15, 2022):

Ideally this OpChain should capture the timeout info rather than relying on the MailboxSendOperator timeout. That makes it clear when we should time out, instead of saying "if the root operator times out correctly, this will work".

Contributor (@walterddr, Nov 15, 2022):

I am not sure how to interpret this comment. Eventually the timeout needs to be returned via an error block in the current architecture. Maybe we can clarify with a concrete example of what other routes the opChain can use to bubble up the timeout.

Contributor:

This means we don't schedule the chain anymore once we detect that the OpChain has timed out. Say we have a deadline in the OpChain; then, while scheduling, we can do:

if (now > deadline) {
  // discard the scheduling
}

Contributor:

ah! i see.

Contributor Author:

I like that suggestion a lot, definitely a good improvement
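Sketched out, the deadline check could be as simple as this (assuming a hypothetical getDeadlineMs() on the chain; not part of this PR):

// Hypothetical check before (re)scheduling a chain: if the query deadline has
// passed, drop the chain instead of queueing it again.
interface DeadlineAware {
  long getDeadlineMs(); // absolute wall-clock deadline for the whole query
}

final class DeadlineCheck {
  static boolean shouldSchedule(DeadlineAware chain) {
    return System.currentTimeMillis() <= chain.getDeadlineMs();
  }
}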

* An {@code OpChain} represents a chain of operators that are separated
* by send/receive stages.
*/
public class OpChain {
Contributor:

I feel we should also have an OpChain ID somewhere. Maybe in future PRs :)

Contributor:

currently an OpChain is equivalent to a stage. We can add it later once we have the split logic.

Contributor:

I meant even for stage one, we should have an ID for easier debugging.

Contributor Author:

big +1! I actually started adding that but decided that should be done in future PRs (see #9753 (comment))

@walterddr walterddr merged commit 342b6a5 into apache:master Nov 15, 2022
Comment on lines +102 to +104
} catch (Exception e) {
LOGGER.error("Failed to execute query!", e);
}
Contributor:

I don't think this is returnable via data blocks. Echoing back to @61yao's comment on timeout: maybe we need other ways to indicate out-of-norm failures. Let's follow up in another PR (I don't think we handle these correctly right now either).

Labels: multi-stage (Related to the multi-stage query engine)
Projects: None yet
Development: Successfully merging this pull request may close these issues: "Support non-blocking MailboxReceivedOperator"
4 participants