[v2][adjuster] Implement adjuster for deduplicating spans #6391

Merged: 18 commits, Dec 22, 2024
cmd/query/app/querysvc/adjuster/hash.go (83 additions, 0 deletions)

@@ -0,0 +1,83 @@
// Copyright (c) 2024 The Jaeger Authors.
// SPDX-License-Identifier: Apache-2.0

package adjuster

import (
	"fmt"
	"hash/fnv"

	"go.opentelemetry.io/collector/pdata/ptrace"

	"github.com/jaegertracing/jaeger/internal/jptrace"
)

var _ Adjuster = (*SpanHashDeduper)(nil)

// SpanHash creates an adjuster that deduplicates spans by removing all but one span
// with the same hash code. This is particularly useful for scenarios where spans
// may be duplicated during archival, for example when spans are archived to Elasticsearch.
//
// The hash code is generated by serializing the span into protobuf bytes and applying
// the FNV hashing algorithm to the serialized data.
//
// To ensure consistent hash codes, this adjuster should be executed after
// SortAttributesAndEvents, which normalizes the order of collections within the span.
Review comment (Member):

A couple of thoughts on this:

  1. Some storage backends (Cassandra, in particular) perform similar deduping by computing a hash before the span is saved and using it as part of the partition key (this creates tombstones if an identical span is saved two or more times, but no duplicates appear on read). So we could make this hashing part of the ingestion pipeline (e.g. in sanitizers) and simply store the hash as an attribute on the span. This adjuster would then be "lazy": it would only recompute the hash if it doesn't already exist in the storage.

  2. If we do this on the write path, we would want it to be as efficient as possible, so we would need to implement manual hashing by iterating through the attributes (pre-sorting them to avoid ordering dependencies) and manually going through all fields of the Span / SpanEvent / SpanLink. The reason I was reluctant to do that in the past was to avoid unintended bugs if the data model changed, e.g. a new field being added that we'd forget to include in the hash function. To protect against that we could probably use fuzzing tests: set/unset each field individually and make sure the hash code changes as a result.

We don't have to do it now, but let's open a ticket for a future improvement (I think it could be a good first issue).
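The "lazy" adjuster suggested in point 1 could be sketched as follows. This is a simplified stdlib-only model: a plain map stands in for the span's attribute collection (the real code would use `ptrace.Span.Attributes()` from the pdata API), and the attribute key `span.hash` is a hypothetical name, not something Jaeger defines today.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// span is a simplified stand-in for ptrace.Span: an attribute map plus
// the span's serialized protobuf bytes.
type span struct {
	attrs map[string]uint64
	body  []byte
}

// hashAttrKey is a hypothetical attribute name for illustration only.
const hashAttrKey = "span.hash"

// lazyHash returns the hash stored on the span if the ingestion pipeline
// already computed it on the write path, and only hashes the serialized
// bytes otherwise, caching the result as an attribute.
func lazyHash(s *span) uint64 {
	if h, ok := s.attrs[hashAttrKey]; ok {
		return h // cheap path: hash was stored at write time
	}
	hasher := fnv.New64a()
	hasher.Write(s.body)
	h := hasher.Sum64()
	s.attrs[hashAttrKey] = h // cache for subsequent reads
	return h
}

func main() {
	s := &span{attrs: map[string]uint64{}, body: []byte("serialized-span")}
	first := lazyHash(s)  // computes and stores the hash
	second := lazyHash(s) // reads the stored value
	fmt.Println(first == second) // prints "true"
}
```

The design point is that the expensive marshal-and-hash step runs at most once per span, and readers of archived data get deduplication without re-serializing anything already hashed at write time.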

func SpanHash() SpanHashDeduper {
	return SpanHashDeduper{
		marshaler: &ptrace.ProtoMarshaler{},
	}
}

type SpanHashDeduper struct {
	marshaler ptrace.Marshaler
}

func (s *SpanHashDeduper) Adjust(traces ptrace.Traces) {
	spansByHash := make(map[uint64]ptrace.Span)
	resourceSpans := traces.ResourceSpans()
Review comment (Member): I'd recommend going forward to use the terms resources and scopes. It makes the code more readable.

Reply (Collaborator, Author): Sounds good; I can open a cleanup PR.
	for i := 0; i < resourceSpans.Len(); i++ {
		rs := resourceSpans.At(i)
		scopeSpans := rs.ScopeSpans()
		hashTrace := ptrace.NewTraces()
		hashResourceSpan := hashTrace.ResourceSpans().AppendEmpty()
		hashScopeSpan := hashResourceSpan.ScopeSpans().AppendEmpty()
		hashSpan := hashScopeSpan.Spans().AppendEmpty()
		rs.Resource().Attributes().CopyTo(hashResourceSpan.Resource().Attributes())
		for j := 0; j < scopeSpans.Len(); j++ {
			ss := scopeSpans.At(j)
			spans := ss.Spans()
			ss.Scope().Attributes().CopyTo(hashScopeSpan.Scope().Attributes())
			dedupedSpans := ptrace.NewSpanSlice()
			for k := 0; k < spans.Len(); k++ {
				span := spans.At(k)
				span.CopyTo(hashSpan)
				h, err := s.computeHashCode(hashTrace)
				if err != nil {
					jptrace.AddWarning(span, fmt.Sprintf("failed to compute hash code: %v", err))
					span.CopyTo(dedupedSpans.AppendEmpty())
					continue
				}
				if _, ok := spansByHash[h]; !ok {
					spansByHash[h] = span
					span.CopyTo(dedupedSpans.AppendEmpty())
				}
			}
			dedupedSpans.CopyTo(spans)
		}
	}
}

func (s *SpanHashDeduper) computeHashCode(hashTrace ptrace.Traces) (uint64, error) {
	b, err := s.marshaler.MarshalTraces(hashTrace)
	if err != nil {
		return 0, err
	}
	hasher := fnv.New64a()
	hasher.Write(b) // Write in the hash.Hash interface never returns an error
	return hasher.Sum64(), nil
}
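The core of Adjust is a keep-first-by-hash loop: marshal each span, hash the bytes with FNV-1a, and keep only the first span seen for each distinct hash. A stdlib-only sketch of that logic, with a plain string standing in for the marshaled span bytes (the real code marshals a single-span ptrace.Traces via the ProtoMarshaler):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashOf mirrors computeHashCode: FNV-1a over the span's serialized form.
func hashOf(serialized string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(serialized))
	return h.Sum64()
}

// dedupe keeps the first span seen for each distinct hash, mirroring the
// spansByHash map and dedupedSpans slice in Adjust.
func dedupe(spans []string) []string {
	seen := make(map[uint64]bool)
	var out []string
	for _, s := range spans {
		h := hashOf(s)
		if !seen[h] {
			seen[h] = true
			out = append(out, s)
		}
	}
	return out
}

func main() {
	// "span-a" appears twice, as it might after being archived twice.
	spans := []string{"span-a", "span-b", "span-a"}
	fmt.Println(dedupe(spans)) // prints "[span-a span-b]"
}
```

Note that, as in the PR, deduplication is global across the trace (one map for all resources and scopes), so a span duplicated under two different scopes is still collapsed to a single copy, provided its resource and scope attributes, which are copied into the hash input, also match.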