fix: Send correct batch stats when SendBatchMaxSize is set #5385

Merged: 1 commit merged into open-telemetry:main on Jun 2, 2022

Conversation

@njvrzm (Contributor) commented May 18, 2022

Description:
This fixes a bug with the batch processor's batch_send_size and batch_send_size_bytes metrics. Their values were being calculated before SendBatchMaxSize was applied.

We observed this issue during performance testing. With SendBatchMaxSize set to a small enough value that it almost always took effect, graphs of batch_send_size showed an odd sawtooth pattern, and the totals were far in excess of the number of items actually sent. Looking at individual measurements made the issue clear: with a SendBatchMaxSize of 100, for instance, we'd see batch_send_size values like 1000, 900, 800, 700..., because the full size of the pending batch queue was recorded on each send even though only 100 items actually went out each time.
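To make the failure mode concrete, here is a small self-contained sketch (not code from the collector) that prints the values the metric used to report versus what it should report, for 1000 queued spans and a SendBatchMaxSize of 100:

```go
package main

import "fmt"

func main() {
	queued, maxSize := 1000, 100
	for queued > 0 {
		sent := queued
		if sent > maxSize {
			sent = maxSize
		}
		// Old behaviour: batch_send_size recorded the pre-split queue length
		// (1000, 900, 800, ...). New behaviour: it records what was actually
		// sent (100 on every iteration).
		fmt.Printf("old batch_send_size=%d  new batch_send_size=%d\n", queued, sent)
		queued -= sent
	}
}
```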

With this change, each export method reports the actual count of items sent (and the byte size sent, when requested), and the sendItems method records those values.
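A minimal sketch of the recording side, using OpenCensus stats (the stats.Record mentioned in the review thread below); the measure definitions here are illustrative stand-ins for the processor's real batch_send_size and batch_send_size_bytes measures, not code from this PR:

```go
package batchsketch

import (
	"context"

	"go.opencensus.io/stats"
)

// Illustrative measures; the batch processor defines its own equivalents.
var (
	statBatchSendSize      = stats.Int64("batch_send_size", "Number of items in the batch that was sent", stats.UnitDimensionless)
	statBatchSendSizeBytes = stats.Int64("batch_send_size_bytes", "Number of bytes in the batch that was sent", stats.UnitBytes)
)

// recordSendStats mirrors the shape of the fix: the recorded values come from
// export's return values (what was actually sent), not from the size of the
// pending queue before SendBatchMaxSize was applied.
func recordSendStats(ctx context.Context, sent, bytes int, detailed bool) {
	stats.Record(ctx, statBatchSendSize.M(int64(sent)))
	if detailed {
		stats.Record(ctx, statBatchSendSizeBytes.M(int64(bytes)))
	}
}
```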

Testing:
I added a test called TestBatchProcessorSentBySize_withMaxSize, based on TestBatchProcessorSentBySize but with SendBatchMaxSize set and with all spans delivered in a single request so that the batch size is predictable. It does not attempt to validate the batch_send_size_bytes metric - the splitting of batches makes the method used in the original test fail due to different amounts of overhead.

(Incidentally, TestBatchProcessorSentBySize itself is rather brittle. Changing either sendBatchSize or spansPerRequest so that the former is not a multiple of the latter makes the test fail in several ways.)

@njvrzm njvrzm requested review from a team and bogdandrutu May 18, 2022 03:58
linux-foundation-easycla bot commented May 18, 2022

CLA Signed. The following committers are authorized under a signed CLA:
  • ✅ login: njvrzm / name: Nathan Vērzemnieks (7ef957c)

@@ -244,17 +245,24 @@ func (bt *batchTraces) add(item interface{}) {
 	td.ResourceSpans().MoveAndAppendTo(bt.traceData.ResourceSpans())
 }
 
-func (bt *batchTraces) export(ctx context.Context, sendBatchMaxSize int) error {
+func (bt *batchTraces) export(ctx context.Context, sendBatchMaxSize int, returnBytes bool) (int, int, error) {
Contributor:

question: should we do the stats.Record call within export rather than returning it? That way we don't need to make any function signature changes?

Contributor:

I can see why that would be annoying, because then, instead of a single stats record call, you have to make one in each batch processor. What do you think?

Contributor Author (njvrzm):

Yeah, I'd rather change the signature, especially since it's only used in one place, than duplicate the stats recording code.
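For context on that trade-off: the per-signal batches (traces, metrics, logs) all satisfy a small internal batch interface, so widening export once keeps the stats recording at a single call site. A rough sketch of the widened interface, reconstructed from the traces hunk above rather than quoted from the PR:

```go
package batchsketch

import "context"

// Approximate shape only; the real definition lives in batch_processor.go.
type batch interface {
	// export sends up to sendBatchMaxSize items and reports how many items
	// (and, if returnBytes is set, how many bytes) actually went out.
	export(ctx context.Context, sendBatchMaxSize int, returnBytes bool) (sentItems int, sentBytes int, err error)
	itemCount() int
	add(item interface{})
}
```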

 }
 
-	if err := bp.batch.export(bp.exportCtx, bp.sendBatchMaxSize); err != nil {
+	detailed := bp.telemetryLevel == configtelemetry.LevelDetailed
Contributor:

question: should we call out that we may want to remove this in a future PR, given it's barely used and almost always desired?

Contributor Author (njvrzm):

It looks like the default for this setting is LevelBasic, so eliminating the check here would be a behavior change.
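Put differently, byte-size accounting stays opt-in: the caller only asks export to compute sizes when the configured telemetry level will actually record batch_send_size_bytes. A minimal sketch of that call site, assuming the batchProcessor fields visible in the hunk above:

```go
// detailed is false under the default LevelBasic, so export can skip the
// extra work of measuring the request's byte size when nobody will record it.
detailed := bp.telemetryLevel == configtelemetry.LevelDetailed
sent, bytes, err := bp.batch.export(bp.exportCtx, bp.sendBatchMaxSize, detailed)
```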

codecov bot commented May 19, 2022

Codecov Report

Merging #5385 (3a85f7a) into main (528fd56) will decrease coverage by 0.00%.
The diff coverage is 91.66%.

❗ Current head 3a85f7a differs from pull request most recent head 18aebcb. Consider uploading reports for the commit 18aebcb to get more accurate results

@@            Coverage Diff             @@
##             main    #5385      +/-   ##
==========================================
- Coverage   90.89%   90.88%   -0.01%     
==========================================
  Files         191      190       -1     
  Lines       11421    11446      +25     
==========================================
+ Hits        10381    10403      +22     
- Misses        819      822       +3     
  Partials      221      221              
| Impacted Files | Coverage Δ |
|---|---|
| processor/batchprocessor/batch_processor.go | 88.94% <91.66%> (-2.59%) ⬇️ |
| service/service.go | 41.79% <0.00%> (-4.64%) ⬇️ |
| service/zpages.go | 70.08% <0.00%> (-1.69%) ⬇️ |
| pdata/internal/common.go | 94.61% <0.00%> (-0.77%) ⬇️ |
| service/host.go | 100.00% <0.00%> (ø) |
| config/common.go | 100.00% <0.00%> (ø) |
| config/exporter.go | 90.90% <0.00%> (ø) |
| config/receiver.go | 90.90% <0.00%> (ø) |
| config/extension.go | 90.90% <0.00%> (ø) |

... and 25 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@codeboten (Contributor) left a comment

Thanks for catching and fixing this. Please add a changelog entry. Also, can you confirm whether or not this addresses #3262?

@njvrzm (Contributor Author) commented May 19, 2022

Thanks for having a look, @codeboten!

@codeboten (Contributor) left a comment

@njvrzm please rebase and push again; there was a change that broke the build for the collector. We can then get this merged.

The stat was getting sent before the max batch size was taken into account.
@codeboten codeboten merged commit 65b7b1b into open-telemetry:main Jun 2, 2022
@njvrzm njvrzm deleted the njvrzm/fix_batch_processor_stats_with_max_size branch June 3, 2022 16:37