We noticed that in some cases, when a pinot-server hosting segments in CONSUMING state is restarted, it shows high query latencies. The high latency correlates with a high consumption rate from streams (a restart causes a consuming segment to re-consume from its starting offset until it reaches the latest event).
A few different options to address this problem:

1. Save state before shutdown. In this case, that means saving all the events consumed so far (and the last offset consumed) before shutdown. Completing the segment is one way of doing that, but it not only takes a while to complete the segment; the next replica to restart (in a rolling restart) will go through the same motions, producing a very small segment in the process. So, saving state locally is better, but it is a complex matter. It could be easier with mmapped consuming segments, but we still keep inverted indexes on heap.
2. Re-construct state after startup, which is what we do now (consuming from the starting offset).
3. Stop routing query requests while consuming segments are busy catching up.

The last option seems to be the simplest.
There are a few different ways of implementing that. We currently have a way to stop routing queries: we do not reset the isShuttingDown flag until segments are loaded. We can extend the criteria so that "loading" a consuming segment effectively means the consuming segment has "caught up".
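As a rough sketch of that extended criterion (the class and method names below are hypothetical, not Pinot's actual API), the server would report itself ready to serve only once every CONSUMING segment has caught up; completed segments pass trivially:

```java
import java.util.List;

public class SegmentLoadGate {
    // Minimal view of a segment for this sketch; hypothetical interface.
    interface Segment {
        boolean isConsuming();
        boolean isCaughtUp(); // trivially true for completed (ONLINE) segments
    }

    // Helper to build a segment instance for illustration.
    static Segment segment(boolean consuming, boolean caughtUp) {
        return new Segment() {
            public boolean isConsuming() { return consuming; }
            public boolean isCaughtUp() { return caughtUp; }
        };
    }

    /** Queries should be routed to this server only once this returns true. */
    static boolean readyToServe(List<Segment> segments) {
        for (Segment s : segments) {
            if (s.isConsuming() && !s.isCaughtUp()) {
                return false; // a consuming segment is still replaying from its start offset
            }
        }
        return true;
    }
}
```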
It seems best to include this as a part of ServiceStatus.
It is a little hard to determine whether the consuming segments have caught up. The best way is to query the stream for the current offset or timestamp and compare it against the segment's position. Another way is to add up the consumption rates of the different consuming segments and, as long as the total consumption rate is below a certain threshold, announce that we are ready. Some streams may also throttle consumption, which would skew rate-based checks.
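The offset-based check could look something like the following sketch (the offsets would come from the stream consumer; the lag tolerance is an assumed knob, not an existing config):

```java
public class CatchUpCheck {
    /**
     * A consuming segment is considered caught up when its current offset is
     * within maxAllowedLag events of the stream's latest offset. A small
     * tolerance avoids chasing a head that keeps moving while we check.
     */
    static boolean hasCaughtUp(long currentOffset, long latestStreamOffset, long maxAllowedLag) {
        return latestStreamOffset - currentOffset <= maxAllowedLag;
    }
}
```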
For now, we should start by integrating ServiceStatus to also query consuming segments (perhaps through the TableDataManager). Perhaps we can start with a (configured) timeout, and then figure out better ways to determine whether consuming segments have caught up.
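A timeout-based first cut might look like this sketch (names are illustrative, not Pinot's actual ServiceStatus API): report STARTING until either all consuming segments say they have caught up or the configured timeout expires, after which the server declares itself GOOD regardless:

```java
import java.util.function.BooleanSupplier;

public class ConsumingSegmentsServiceStatus {
    enum Status { STARTING, GOOD }

    private final BooleanSupplier _allConsumingSegmentsCaughtUp; // e.g. polled via the table data manager
    private final long _deadlineMillis;                          // startup time plus configured timeout

    ConsumingSegmentsServiceStatus(BooleanSupplier allCaughtUp, long startMillis, long timeoutMillis) {
        _allConsumingSegmentsCaughtUp = allCaughtUp;
        _deadlineMillis = startMillis + timeoutMillis;
    }

    /** GOOD once segments have caught up, or unconditionally after the timeout. */
    Status getStatus(long nowMillis) {
        if (_allConsumingSegmentsCaughtUp.getAsBoolean() || nowMillis >= _deadlineMillis) {
            return Status.GOOD;
        }
        return Status.STARTING;
    }
}
```

The timeout acts as a safety valve: even if a catch-up signal never arrives (or a stream keeps throttling), the server eventually rejoins the routing table instead of staying out indefinitely.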