We noticed that in some cases, when a pinot-server hosting segments in CONSUMING state is restarted, it shows high query latencies. The high latency correlates with a high consumption rate from streams (a restart causes a consuming segment to re-consume from its starting offset until it reaches the latest event).
A few different options to address this problem:

1. Save state before shutdown. In this case, that means saving all the events consumed so far (and the last offset consumed) before shutdown. Completing the segment is one way of doing that, but it not only takes a while to complete the segment; the next replica to restart (in a rolling restart) will go through the same motions, producing a very small segment in the process. So, saving state locally is better, but it is a complex matter. It could be easier with mmapped consuming segments, but we still keep inverted indexes on heap.
2. Re-construct state after startup, which is what we do now (consuming from the starting offset).
3. Stop routing query requests while consuming segments are busy catching up.

The last option seems to be the simplest.
There are a few different ways of implementing that. We currently have a way to stop routing queries: we do not reset the isShuttingDown flag until segments are loaded. We can extend the criteria so that "loading" a consuming segment effectively means the consuming segment has "caught up".
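As a rough sketch of that extended criterion (the class and method names below are hypothetical, not Pinot's actual API), the server would report itself ready to serve only once every CONSUMING segment has caught up; completed segments pass trivially:

```java
import java.util.List;

public class SegmentLoadGate {
    // Minimal view of a segment for this sketch; hypothetical interface.
    interface Segment {
        boolean isConsuming();
        boolean isCaughtUp(); // trivially true for completed (ONLINE) segments
    }

    // Helper to build a segment instance for illustration.
    static Segment segment(boolean consuming, boolean caughtUp) {
        return new Segment() {
            public boolean isConsuming() { return consuming; }
            public boolean isCaughtUp() { return caughtUp; }
        };
    }

    /** Queries should be routed to this server only once this returns true. */
    static boolean readyToServe(List<Segment> segments) {
        for (Segment s : segments) {
            if (s.isConsuming() && !s.isCaughtUp()) {
                return false; // a consuming segment is still replaying from its start offset
            }
        }
        return true;
    }
}
```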
It seems best to include this as a part of ServiceStatus.
It is a little hard to determine whether the consuming segments have caught up. The best way is to query the stream for the current offset or timestamp and compare it against the segment's position. Another way is to add up the consumption rates of the different consuming segments and, as long as the total consumption rate is below a certain threshold, announce that we are ready. Some streams may also throttle consumption, which would skew rate-based checks.
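The offset-based check could look something like the following sketch (the offsets would come from the stream consumer; the lag tolerance is an assumed knob, not an existing config):

```java
public class CatchUpCheck {
    /**
     * A consuming segment is considered caught up when its current offset is
     * within maxAllowedLag events of the stream's latest offset. A small
     * tolerance avoids chasing a head that keeps moving while we check.
     */
    static boolean hasCaughtUp(long currentOffset, long latestStreamOffset, long maxAllowedLag) {
        return latestStreamOffset - currentOffset <= maxAllowedLag;
    }
}
```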
For now, we should start by integrating ServiceStatus to also query consuming segments (perhaps through the TableDataManager). Perhaps we can start with a (configured) timeout, and then figure out better ways to determine whether consuming segments have caught up.
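A timeout-based first cut might look like this sketch (names are illustrative, not Pinot's actual ServiceStatus API): report STARTING until either all consuming segments say they have caught up or the configured timeout expires, after which the server declares itself GOOD regardless:

```java
import java.util.function.BooleanSupplier;

public class ConsumingSegmentsServiceStatus {
    enum Status { STARTING, GOOD }

    private final BooleanSupplier _allConsumingSegmentsCaughtUp; // e.g. polled via the table data manager
    private final long _deadlineMillis;                          // startup time plus configured timeout

    ConsumingSegmentsServiceStatus(BooleanSupplier allCaughtUp, long startMillis, long timeoutMillis) {
        _allConsumingSegmentsCaughtUp = allCaughtUp;
        _deadlineMillis = startMillis + timeoutMillis;
    }

    /** GOOD once segments have caught up, or unconditionally after the timeout. */
    Status getStatus(long nowMillis) {
        if (_allConsumingSegmentsCaughtUp.getAsBoolean() || nowMillis >= _deadlineMillis) {
            return Status.GOOD;
        }
        return Status.STARTING;
    }
}
```

The timeout acts as a safety valve: even if a catch-up signal never arrives (or a stream keeps throttling), the server eventually rejoins the routing table instead of staying out indefinitely.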