
ddtrace/tracer: add integration tag to spans_started/finished #3023

Open · wants to merge 41 commits into main

Conversation

@hannahkm (Contributor) commented Dec 10, 2024

What does this PR do?

Add an integration tag to the existing datadog.tracer.spans_started and datadog.tracer.spans_finished metrics. The value of the tag is the name of the component from which the span was started: for spans created by a contrib, it is the name of the contrib package (chi, net/http, etc.); for spans created manually, the tag is manual.
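
For illustration, the tagged submissions would look roughly like this (the statsd variable, integration names, and counts below are made up; the integration:<name> tag format matches the implementation discussed in this PR):

```go
// Hypothetical illustration of the per-integration submissions at flush time.
statsd.Count("datadog.tracer.spans_started", 42, []string{"integration:net/http"}, 1)
statsd.Count("datadog.tracer.spans_started", 7, []string{"integration:manual"}, 1)
```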

Motivation

In addition to knowing when a span is started, we want to know where it originated: a contrib integration or manual instrumentation.

Reviewer's Checklist

  • Changed code has unit tests for its functionality at or near 100% coverage.
  • System-Tests covering this feature have been added and enabled with the va.b.c-dev version tag.
  • There is a benchmark for any new code, or changes to existing code.
  • If this interacts with the agent in a new way, a system test has been added.
  • Add an appropriate team label so this PR gets put in the right place for the release notes.
  • Non-trivial go.mod changes, e.g. adding new modules, are reviewed by @DataDog/dd-trace-go-guild.
  • For internal contributors, a matching PR should be created to the v2-dev branch and reviewed by @DataDog/apm-go.

Unsure? Have a question? Request a review!

datadog-datadog-prod-us1 (bot) commented Dec 10, 2024

Datadog Report

Branch report: apm-rd/span-source-health-metric
Commit report: f769f19
Test service: dd-trace-go

✅ 0 Failed, 5132 Passed, 70 Skipped, 2m 34.62s Total Time

pr-commenter (bot) commented Dec 10, 2024

Benchmarks

Benchmark execution time: 2025-01-10 22:39:38

Comparing candidate commit ce1b7eb in PR branch apm-rd/span-source-health-metric with baseline commit 27624f6 in branch main.

Found 0 performance improvements and 8 performance regressions! Performance is the same for 50 metrics; 1 metric is unstable.

scenario:BenchmarkOTelApiWithCustomTags/otel_api-24

  • 🟥 execution_time [+256.708ns; +318.292ns] or [+3.625%; +4.494%]

scenario:BenchmarkPartialFlushing/Disabled-24

  • 🟥 execution_time [+10.226ms; +13.079ms] or [+3.823%; +4.889%]

scenario:BenchmarkPartialFlushing/Enabled-24

  • 🟥 execution_time [+11.316ms; +13.791ms] or [+4.153%; +5.061%]

scenario:BenchmarkSetTagMetric-24

  • 🟥 execution_time [+7.336ns; +10.444ns] or [+6.069%; +8.641%]

scenario:BenchmarkSingleSpanRetention/no-rules-24

  • 🟥 execution_time [+9.100µs; +10.826µs] or [+3.867%; +4.601%]

scenario:BenchmarkSingleSpanRetention/with-rules/match-all-24

  • 🟥 execution_time [+8.779µs; +10.800µs] or [+3.708%; +4.562%]

scenario:BenchmarkSingleSpanRetention/with-rules/match-half-24

  • 🟥 execution_time [+8.647µs; +10.617µs] or [+3.643%; +4.473%]

scenario:BenchmarkTracerAddSpans-24

  • 🟥 execution_time [+146.280ns; +200.520ns] or [+3.796%; +5.204%]

@hannahkm changed the title from "ddtrace/tracer: add source tag to spans_started health metric" to "ddtrace/tracer: add integration tag to spans_started/finished" on Dec 12, 2024
github-actions (bot) added the apm:ecosystem (contrib/* related feature requests or bugs) label on Dec 12, 2024
@mtoffl01 (Contributor) left a comment:

Ok, so you're reporting spansStarted/spansFinished on span.Start/span.Finish if the integration is not empty, and leaving the chunk-based reporting to any spans that are manual... I understand why you did this, but I'm not totally sure about the approach.

span.Start and span.Finish are typically called quite frequently, so if a majority of the spans come from automatic integrations, this will be very noisy (and defeats the purpose of reporting the metrics at a specified interval, which is meant to reduce noise).

One alternative idea:
Change the way we track spansStarted and spansFinished to some kind of counter map keyed by integration name, e.g. map[string]uint32 where the key is the integration name and the value is the count of spans started/finished for that integration. Then, in the reporting goroutine, we iterate over the map and report the spans started/finished per integration.
(Or some other idea I haven't thought of?)
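
A minimal sketch of this counter-map idea, assuming a simple mutex-guarded map owned by the tracer (the type and method names are illustrative, not the final implementation):

```go
package tracer

import "sync"

// spanCounter is an illustrative, mutex-guarded per-integration counter
// (not the final implementation); the metrics goroutine flushes it on the
// usual reporting interval.
type spanCounter struct {
	mu    sync.Mutex
	spans map[string]uint32 // integration name -> spans counted since last flush
}

func newSpanCounter() *spanCounter {
	return &spanCounter{spans: make(map[string]uint32)}
}

func (c *spanCounter) inc(integration string) {
	c.mu.Lock()
	c.spans[integration]++
	c.mu.Unlock()
}

// flush calls report once per integration and resets each count.
func (c *spanCounter) flush(report func(integration string, count int64)) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for name, v := range c.spans {
		report(name, int64(v))
		c.spans[name] = 0
	}
}
```

The existing reporting goroutine would call flush on its usual interval, which keeps the per-integration submissions bounded rather than emitting on every span start/finish.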

@hannahkm (Contributor, Author) replied:

@mtoffl01 Good points! A map would probably work better; I was hesitant at first since I didn't want to change too much of what already exists, but knowing that these metrics are pretty old... I'm more down to change it up now.

@hannahkm hannahkm marked this pull request as ready for review December 19, 2024 21:27
@hannahkm hannahkm requested review from a team as code owners December 19, 2024 21:27
@darccio (Member) commented Dec 20, 2024:

@hannahkm I'm approving this, but we should investigate why the benchmarks report increased allocations.

@mtoffl01 (Contributor) left a comment:

Overall, I definitely have some concerns 🤔 Maybe you can write some additional tests to provide peace of mind:

  1. Tests designed to try to make the system fail: what happens when multiple goroutines call the start-span/finish-span methods concurrently? Can we prove that we've protected against a race condition? (See the sketch below.)
  2. Dedicated benchmarks to show how much performance is impacted.
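
A minimal sketch of what such a test and benchmark could look like, using the public dd-trace-go v1 tracer API (the test names and loop counts are illustrative; the concurrency check relies on running go test -race):

```go
package tracer_test

import (
	"sync"
	"testing"

	"gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer"
)

// Run with `go test -race` so the race detector can flag unsynchronized
// access to the per-integration counters.
func TestConcurrentStartFinish(t *testing.T) {
	tracer.Start()
	defer tracer.Stop()

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				sp := tracer.StartSpan("test.op")
				sp.Finish()
			}
		}()
	}
	wg.Wait()
}

// A parallel benchmark to quantify overhead added to the span hot path.
func BenchmarkConcurrentStartFinish(b *testing.B) {
	tracer.Start()
	defer tracer.Stop()

	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			sp := tracer.StartSpan("bench.op")
			sp.Finish()
		}
	})
}
```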

@mtoffl01 (Contributor) left a comment:

Still have some of the same concerns as before regarding lock contention 🤔. Maybe try some of Felix's ideas!

t.spansStarted.mu.Lock()
for name, v := range t.spansStarted.spans {
	t.statsd.Count("datadog.tracer.spans_started", int64(v), []string{"integration:" + name}, 1)
	t.spansStarted.spans[name] = 0
Contributor:

Interesting. At first I was going to suggest using delete instead, since delete is generally preferable (constant time complexity, and it removes the key from the map, freeing up memory). But in this case it's possible, even likely, that the same key will be reused in the future, so setting the value to 0 may be more performant, since you won't need to re-allocate space for the key in the map. I'm curious whether you chose to set the value to 0, rather than use delete, for this reason?

Contributor Author:

Yeah, I assumed it would be more likely for a key to be reused than to be a one-off. But I do wonder if we have data about which situation is more common (i.e. many spans being created for a certain integration vs. just a few, all at once).
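
For reference, the two reset strategies under discussion look like this on the flush path (a sketch against a plain counter map; report stands in for the statsd submission, and the trade-off carries over to the xsync-based version discussed later):

```go
package tracer

// flushAndDelete removes each key after reporting; the entry's memory can be
// reclaimed, but the next span from that integration re-inserts the key.
func flushAndDelete(counts map[string]uint32, report func(integration string, n int64)) {
	for name, v := range counts {
		report(name, int64(v))
		delete(counts, name)
	}
}

// flushAndZero keeps the key and resets its count; the entry stays allocated,
// which is cheaper when the same integrations keep producing spans.
func flushAndZero(counts map[string]uint32, report func(integration string, n int64)) {
	for name, v := range counts {
		report(name, int64(v))
		counts[name] = 0
	}
}
```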

t.statsd.Count("datadog.tracer.spans_finished", int64(atomic.SwapUint32(&t.spansFinished, 0)), nil, 1)
t.spansStarted.mu.Lock()
for name, v := range t.spansStarted.spans {
	t.statsd.Count("datadog.tracer.spans_started", int64(v), []string{"integration:" + name}, 1)
Contributor:

I think you may want to use Incr here instead.

Contributor Author:

Hm, why use Incr? If there has been more than one span started from a certain integration, the value of v should be greater than 1, so we would want to use Count.
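
For context, Incr submits a count of 1 per call while Count submits an arbitrary aggregated value, so flushing a per-integration total needs Count. A minimal sketch, assuming the github.com/DataDog/datadog-go/v5/statsd client (which exposes both methods):

```go
package main

import "github.com/DataDog/datadog-go/v5/statsd"

func main() {
	client, err := statsd.New("127.0.0.1:8125")
	if err != nil {
		panic(err)
	}
	defer client.Close()

	tags := []string{"integration:net/http"}

	// Incr submits a count of exactly 1 per call.
	_ = client.Incr("datadog.tracer.spans_started", tags, 1)

	// Count submits an aggregated value, which is what the interval-based
	// flush needs (v spans since the last flush).
	var v int64 = 42
	_ = client.Count("datadog.tracer.spans_started", v, tags, 1)
}
```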

t.spansStarted.mu.Unlock()
t.spansFinished.mu.Lock()
for name, v := range t.spansFinished.spans {
	t.statsd.Count("datadog.tracer.spans_finished", int64(v), []string{"integration:" + name}, 1)
Contributor:

Same as line 98: Use Incr.

spansStarted, spansFinished, tracesDropped uint32
// These maps keep count of the number of spans started and finished from
// each component, including contribs and "manual" spans.
spansStarted, spansFinished *xsync.MapOf[string, int64]
@felixge (Member) commented Jan 7, 2025:

To fix the perf issue you're seeing, try using an *atomic.Int64 instead of int64 as the map value type. Also use the Load+LoadOrStore methods on the map to do the update, as shown in the example I posted earlier.

After the initial phase of initializing the map keys, the code will end up taking only the Load path, which is much faster than Compute. The latter always needs to take a lock, whereas the former uses optimistic concurrency control that requires no locks or retries during steady state.

Let me know if that works.

(I know you mentioned you tried Load+Store in your Slack DM, but I suspect you didn't try it with an *atomic.Int64 pointer like I'm suggesting?)
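
A minimal sketch of the suggested pattern, assuming github.com/puzpuzpuz/xsync/v3 for the concurrent map (the type and method names here are illustrative):

```go
package tracer

import (
	"sync/atomic"

	"github.com/puzpuzpuz/xsync/v3"
)

type spanCounters struct {
	started *xsync.MapOf[string, *atomic.Int64]
}

func newSpanCounters() *spanCounters {
	return &spanCounters{started: xsync.NewMapOf[string, *atomic.Int64]()}
}

// inc takes the lock-free Load fast path once an integration's key exists;
// LoadOrStore (which may lock) is only hit the first time a key is seen.
func (c *spanCounters) inc(integration string) {
	v, ok := c.started.Load(integration)
	if !ok {
		v, _ = c.started.LoadOrStore(integration, new(atomic.Int64))
	}
	v.Add(1)
}
```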

if !ok {
	v, _ = t.spansStarted.LoadOrStore(span.integration, new(atomic.Int64))
}
v.Add(1)
Contributor:

Just confirming that at this point, v is of type *atomic.Int64, and not interface{}?

Contributor Author:

Correct!

// reset the count
t.spansStarted.Range(func(key string, value *atomic.Int64) bool {
	err := t.statsd.Count("datadog.tracer.spans_started", value.Load(), []string{"integration:" + key}, 1)
	t.spansStarted.Delete(key)
Contributor:

I wrote a comment about this on a previous commit. I wonder whether you want to Delete here, which [if xsync.MapOf follows Go's standard map behavior] will deallocate the data at this key, or whether you want to just set the value at the key to 0.

As I understand it: Delete is "greedy" and good if we want to release memory as soon as possible, but could be bad if we frequently re-allocate data at the same key over and over again.

Setting the value to 0 is bad if we won't re-use this key in the future (because we're letting unused memory sit there), but good if we frequently re-use it.

Given that our tracer supports a set number of integrations, I imagine we will re-use the same keys quite often, and would therefore lean toward the second option (keep the key, set it to 0).

Curious what others (@darccio, @felixge?) think.
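
For the *atomic.Int64-valued map, the keep-the-key variant could look roughly like this, using Swap to read and reset each counter in one atomic step (a sketch, not the merged implementation; it assumes the same xsync/v3 map type):

```go
package tracer

import (
	"sync/atomic"

	"github.com/puzpuzpuz/xsync/v3"
)

// flushStarted reports and atomically resets the per-integration counters,
// keeping map entries alive so frequently used keys are not re-allocated.
func flushStarted(m *xsync.MapOf[string, *atomic.Int64], report func(integration string, n int64)) {
	m.Range(func(key string, value *atomic.Int64) bool {
		if n := value.Swap(0); n > 0 { // read the count since the last flush and reset it
			report(key, n)
		}
		return true // keep iterating over all integrations
	})
}
```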

Labels: apm:ecosystem (contrib/* related feature requests or bugs)
4 participants