forked from djluck/prometheus-net.DotNetRuntime
# Summary for V4
Main goals of next release:
- As this library is now in greater use, improve the currently poor baseline performance
- Fix longstanding issues (e.g. ThreadPoolStatsCollector doesn't work)
- Make use of all events currently being collected by current level of verbosity
- Keep compatibility with .NET Core 3
- Remove support for v2 of prometheus-net
- Fix long-running perf bug
### v4.0.0 release
- Figure out release process
- Fix build pipeline (no need for v2/ v3 anymore)
- Perf comparison when running default and all between v2 + v3
- Long running perf test vs v3
- Review PR
- Update README (note on performance, choosing levels, metrics exposed?, example docker-compose stack, recycling, compatibility with prior configuration)
- Perform metric diff vs v3
- Use all vs default (to note what metrics are not present by default)
- Will be useful to note what metrics are missing now
- Write release notes (dropped support for prom v2)
- Publish prerelease version
### v4.1+ release
- DNS events
- HTTP events
- So many events in .NET 5!
- Counter-based metrics for contention + JIT
- Review all existing events and look for improvements/ updates made in .NET 5
- Rethink sampling implementation (consumption is actually single-threaded; what about the circular buffer idea, where time is either actual time or estimated max time?)
- Review if V3 JIT/ Contention was as wrong as V4. If not, fix later. Otherwise fix JIT + Contention CPU bug (time measurements are way off, even with sampling off)
## Splitting up work
Need to focus on getting event counters in and reducing overhead by default.
1. Reduce overhead + sane defaults
2. Improve documentation (generate documentation automatically?)
3. Improve data collected
4. Dynamic switching
5. Advanced collection? e.g. GCSampledObjectAllocation perhaps?
## IObservable
- Separate out event producers and metric producers
- EventProducers vs EventConsumers
- EventListeners
- MetricProducers
- IEventProducers
- Metric producers
- Different observables for different
## Example GC metric producer
- Focusing on allocation rate
- will consume IObservable<RuntimeMetrics>, GcEvents.Verbose
- Example impl. for GcEvents.Info:
```
// AllocTick is the event type for GC allocation ticks (assumed to be defined elsewhere).
public class Info
{
    public IObservable<AllocTick> AllocTick { get; set; }
}
```
- Subscribe to both observables
- Enabled/ Disabled is good- allows us to set up metrics correctly
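The producer/consumer split described above could look roughly like the following sketch. Every name here (`AllocTick.AllocatedBytes`, `AllocTickProducer`, `Publish`, `AllocationMetricProducer`) is hypothetical and only illustrates the shape; it is not the library's API, and it is not thread-safe.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical event type: one GC allocation tick and the bytes it reports.
public readonly struct AllocTick
{
    public AllocTick(long allocatedBytes) => AllocatedBytes = allocatedBytes;
    public long AllocatedBytes { get; }
}

// Event producer: publishes events and knows nothing about metrics.
public sealed class AllocTickProducer : IObservable<AllocTick>
{
    private readonly List<IObserver<AllocTick>> _observers = new List<IObserver<AllocTick>>();

    public IDisposable Subscribe(IObserver<AllocTick> observer)
    {
        _observers.Add(observer);
        return new Subscription(() => _observers.Remove(observer));
    }

    // Assumption: in a real implementation an EventListener callback would call this.
    public void Publish(AllocTick tick)
    {
        // Copy the list so observers may unsubscribe while being notified.
        foreach (var observer in _observers.ToArray())
            observer.OnNext(tick);
    }

    private sealed class Subscription : IDisposable
    {
        private readonly Action _onDispose;
        public Subscription(Action onDispose) => _onDispose = onDispose;
        public void Dispose() => _onDispose();
    }
}

// Metric producer: consumes events and maintains a value a metric could expose.
public sealed class AllocationMetricProducer : IObserver<AllocTick>
{
    public long TotalAllocatedBytes { get; private set; }
    public void OnNext(AllocTick tick) => TotalAllocatedBytes += tick.AllocatedBytes;
    public void OnError(Exception error) { }
    public void OnCompleted() { }
}
```

Subscribing an `AllocationMetricProducer` to an `AllocTickProducer` and disposing the returned subscription maps naturally onto the Enabled/ Disabled idea: the metric only changes while the subscription is alive.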
## Improving performance
The main cost is the .NET runtime producing events, not this library processing them, so we need to be smarter about when we enable the more verbose event sources. Ideas include:
- Using event counters
- Dynamic switching of metric sources
### Using event counters
Event counters should place much less stress on the runtime, so using them should help.
See https://docs.microsoft.com/en-us/dotnet/core/diagnostics/event-counters#sample-code.
Counter implementation concerns:
- Rate is fixed ahead of time (min frequency = 1 sec)
- Counter values are collected separately from other events (need to provide mechanism for other profilers to consume counter values)
- Perhaps event listeners consume counters?
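As a rough sketch of the "event listeners consume counters" idea, the listener below follows the pattern in the linked Microsoft sample: enable the `System.Runtime` source with an `EventCounterIntervalSec` argument and read the `EventCounters` payload. The `CounterUpdated` event and the `TryReadCounter` helper are hypothetical names added for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Tracing;
using System.Globalization;

public sealed class RuntimeCounterListener : EventListener
{
    // Hypothetical hook so other collectors can consume counter values.
    public event Action<string, double> CounterUpdated;

    protected override void OnEventSourceCreated(EventSource source)
    {
        if (source.Name != "System.Runtime") return;
        // The refresh interval is fixed here, at the moment the source is enabled.
        EnableEvents(source, EventLevel.Verbose, EventKeywords.All,
            new Dictionary<string, string> { ["EventCounterIntervalSec"] = "1" });
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        if (eventData.EventName != "EventCounters" || eventData.Payload == null) return;
        foreach (var item in eventData.Payload)
        {
            if (item is IDictionary<string, object> payload &&
                TryReadCounter(payload, out var name, out var value))
            {
                CounterUpdated?.Invoke(name, value);
            }
        }
    }

    // Extracts a counter name and value from one payload dictionary.
    public static bool TryReadCounter(IDictionary<string, object> payload, out string name, out double value)
    {
        name = string.Empty;
        value = 0;
        if (!payload.TryGetValue("Name", out var rawName)) return false;
        var counterName = rawName as string;
        if (counterName == null) return false;
        // Mean-style counters report "Mean"; incrementing counters report "Increment".
        object raw;
        if (!payload.TryGetValue("Mean", out raw) && !payload.TryGetValue("Increment", out raw)) return false;
        name = counterName;
        value = Convert.ToDouble(raw, CultureInfo.InvariantCulture);
        return true;
    }
}
```

Keeping the payload parsing in a static helper means it can be exercised without waiting for the runtime to emit a counter update.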
### Dynamic switching of metric sources
Overall there could be four levels of detail for most sources, ordered by perf impact (low -> high):
1. Counters
2. Events (warning)
3. Events (info)
4. Events (verbose)
Levels are hierarchical: enabling a more detailed level implies the less detailed levels are enabled too.
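One way to encode the ordering and the hierarchy is an ordered enum, sketched below. `CaptureLevel`, `Includes` and `ToEventLevel` are hypothetical names; only `EventLevel` comes from `System.Diagnostics.Tracing`.

```csharp
using System;
using System.Diagnostics.Tracing;

// Hypothetical capture levels, ordered from lowest to highest expected overhead.
public enum CaptureLevel
{
    Counters = 0,
    Warning = 1,
    Info = 2,
    Verbose = 3
}

public static class CaptureLevelExtensions
{
    // Hierarchical: an enabled level includes every level at or below it.
    public static bool Includes(this CaptureLevel enabled, CaptureLevel other) => enabled >= other;

    // EventLevel to request from the runtime; null means "counters only, no event tracing".
    public static EventLevel? ToEventLevel(this CaptureLevel level)
    {
        switch (level)
        {
            case CaptureLevel.Warning: return EventLevel.Warning;
            case CaptureLevel.Info: return EventLevel.Informational;
            case CaptureLevel.Verbose: return EventLevel.Verbose;
            default: return null;
        }
    }
}
```

Because the enum values are ordered, "more detailed implies less detailed" reduces to a single comparison.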
#### Ordering
Could start at a more verbose level and have rules to selectively enable less verbose levels. Or start at the least verbose and move towards more verbose.
#### Changing verbosity
Reasons to change include:
- A period of time has elapsed
- A counter value has changed
- Number of events being processed has changed
Depending on the information a user wishes to obtain, switching between these verbosity levels could be useful. Scenarios include:
- Disabling more detailed collectors by default
- Enabling/ disabling collectors as conditions change (e.g. disabling high-impact collection when a lot of exceptions are thrown, or enabling detailed thread stats when thread pool queue times increase)
```
DotNetRuntimeStatsBuilder.LowestImpact.StartCollecting();
DotNetRuntimeStatsBuilder.AllCollectors.StartCollecting();
// This should be good enough to start, right?
Gc.DefaultLevel(Info)
// Should we allow multiple conditions? This could help solve the problem of different evaluation timeframes
.Use.Verbose.While(x => x.allocRate > x, evalPeriod: TimeSpan.FromSeconds(5))
.Use.Verbose.While(x => x.startTime < DateTime.Now.AddMinutes(2))
.Use.Info.While(x => x.EventsSec > 100)
.Use.Counters.While(x => true)
```
#### Goals of builder
- Good documentation of what metrics can be collected at each level
- Well-typed: should not be allowed to enable a level that offers no benefit
- Easy to use!
```
// Why are we doing this?
// To control performance impact of collectors
// To enable more detail when needed
// TraceInfo, TraceVerbose
// Scenarios:
// JIT: on startup, when a lot of JIT is happening (e.g. num methods > x)
// Contention: when number of locks contended > 5, when number of locks isn't greater than 5
// GC: when LOH > blah size (enable LOH allocs)
// ThreadPool: When queue length > x, num threads > Environment.ProcessorCount
// Exceptions: When num exceptions> blah
// Profiles:
// Perf conscious- I don't want performance to be destroyed by monitoring
// Detail oriented- I want to have more insight into why my application is degrading
// Balanced- I want more detail as long as it's not impacting the perf of my application
// Levels: Info, Verbose
DotNetRuntimeStatsBuilder.Customize()
.With.Gc
.With.Jit
.With.ThreadPoolStats
.Use.Detailed.When(x => x.Blah)
.Use.Detailed.Always()
.Use.Level(Level.Info)
.Use.Normal.When(x => x.NumEventsSec > 100)
.With.ThreadPoolLatencyStats
Use.Gc
.And.Jit
.Use.Level.Info.Always // same effect as And
.And.ThreadPoolStats
.EnableLevel.Detailed.When(x => x).DisableWhen(x => x)
.EnableLevel.Info.When(x => x).DisableWhen(x => x)
.And.
```
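Whatever the builder syntax ends up being, the conditions have to be evaluated somehow. A minimal sketch, assuming rules are checked in the order they were declared and the first match wins, with a default level as fallback. `CollectorContext`, `LevelPolicy`, `Use` and `Evaluate` are hypothetical names, as are the context properties.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical capture levels, ordered from lowest to highest expected overhead.
public enum CaptureLevel { Counters, Warning, Info, Verbose }

// Hypothetical snapshot of the values a rule may inspect.
public sealed class CollectorContext
{
    public double EventsPerSecond { get; set; }
    public double AllocationRateBytesPerSecond { get; set; }
    public TimeSpan TimeSinceStart { get; set; }
}

public sealed class LevelPolicy
{
    private readonly CaptureLevel _defaultLevel;
    private readonly List<(CaptureLevel Level, Func<CollectorContext, bool> Condition)> _rules =
        new List<(CaptureLevel Level, Func<CollectorContext, bool> Condition)>();

    public LevelPolicy(CaptureLevel defaultLevel) => _defaultLevel = defaultLevel;

    // Adds a rule: use the given level while the condition holds.
    public LevelPolicy Use(CaptureLevel level, Func<CollectorContext, bool> whileTrue)
    {
        _rules.Add((level, whileTrue));
        return this;
    }

    // Rules are checked in declaration order; the first matching rule wins.
    public CaptureLevel Evaluate(CollectorContext context)
    {
        foreach (var rule in _rules)
        {
            if (rule.Condition(context)) return rule.Level;
        }
        return _defaultLevel;
    }
}
```

Declaring the "back off" rule (for example `EventsPerSecond > 100` mapping to `Counters`) before the "more detail" rules makes the policy err on the side of low overhead when several conditions hold at once.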
We are considering moving from low -> high, what about the other way around?
Start profiling with a lot of detail
Context:
- Time since app start
- Time since level last enabled
- Rate of events (disable level only)
- Relevant counters
-
How often will these be evaluated?
- Counter values are evaluated, enabling higher levels of detail
- Profilers are disabled when rate is too high or no interesting events are happening
- Need a control mechanism that says "after enabling/ disabling, do not disable/ enable for x period of time"
- Enable.When(<condition>).While(<condition>).For.AtLeast()/AtMost()
- .Default(level)
.Use(level).When(x => x.).Until(x => x.EventsSec > 100)
.Use(level).When(x => x.)
.Use(Counters).When();
- What do conditions do?
- Check rate/ sec of events (this can only be known while the thingo is active)
- Check event counter values
- How often are they evaluated?
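The "after enabling/ disabling, do not disable/ enable for x period of time" control could be a small gate that refuses a switch until a minimum dwell time has passed. This is only a sketch: `LevelSwitchGate` and `TrySwitch` are hypothetical names, and the current time is passed in explicitly so the behaviour is easy to test.

```csharp
using System;

public sealed class LevelSwitchGate
{
    private readonly TimeSpan _minimumDwell;
    private DateTime? _lastSwitchUtc;

    public LevelSwitchGate(TimeSpan minimumDwell) => _minimumDwell = minimumDwell;

    // Returns true (and records the switch) only if the minimum dwell time
    // has elapsed since the previous switch; the first switch always succeeds.
    public bool TrySwitch(DateTime nowUtc)
    {
        if (_lastSwitchUtc.HasValue && nowUtc - _lastSwitchUtc.Value < _minimumDwell)
            return false;

        _lastSwitchUtc = nowUtc;
        return true;
    }
}
```

A gate like this stops a condition that hovers around its threshold from repeatedly starting and stopping event sources.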
What is the long-term impact of starting and stopping?
IDEAS:
- State change based on repetitive duration of time
- State change based on value of event counters (e.g. bytes jitted > x for y seconds)
- State change based on rate of events received (e.g. > 100/sec, then disable events for x seconds)
- Premade profiles (e.g. perf vs investigation)
- Need to track what collectors are enabled and at what level of verbosity
- Will need to completely redesign the construction of event listeners
- Evaluate every collection (perhaps this will be too long?)
- Counters have to have their refresh frequency specified up front (default to 1 sec?)
- Counters will be updated at a fixed frequency, we can use this to inform judgments
- We need to take samples of the queue length via histogram
- E.g. thread pool, enable detail after we see a queue build up of y
- Collectors should be ignorant of verbosity changes (managed externally)
- Collectors need to expose additional information (e.g. counter values to base judgments on)
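For the "rate of events received" idea, something has to count events cheaply on the processing path and turn the count into a rate at each evaluation. A minimal sketch, with the hypothetical names `EventRateMeter`, `Record` and `SampleAndReset`:

```csharp
using System;
using System.Threading;

public sealed class EventRateMeter
{
    private long _count;

    // Called from the event-processing path; cheap and thread-safe.
    public void Record() => Interlocked.Increment(ref _count);

    // Returns events per second over the elapsed window and resets the count
    // so the next call measures a fresh window.
    public double SampleAndReset(TimeSpan elapsed)
    {
        if (elapsed <= TimeSpan.Zero)
            throw new ArgumentOutOfRangeException(nameof(elapsed));

        long count = Interlocked.Exchange(ref _count, 0);
        return count / elapsed.TotalSeconds;
    }
}
```

The collector only calls `Record()`; the decision about what to do with the rate stays outside it, in line with keeping collectors ignorant of verbosity changes.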
## Collector improvements
Overall:
- Make full use of all events captured by current verbosity levels
- Upgrade to the latest version of events
### GC
- Track finalizer processing times
- Track mark events (GcMarkWithType) to track the types of roots that hold memory
- Track pinned object heap size (heapstats v3)
- Track allocations more efficiently (don't use Verbose keyword). Can we support this in V3 of .NET Core?
- GCHeapSurvivalAndMovementKeyword to track reserved sizes of heaps and positions
- Look into GCGlobalHeapHistory_V3?
- Track compactions?
### Exceptions
- Track times spent throwing + time spent handling events
- Offer fallback to the count event counter
- No need to use Information by default- can track with Error Level
### JIT
- _ilBytesJittedCounter to track bytes spent
- Offer to track greater verbosity?
### RuntimeInformation?
### [EventSource()] in coreclr
Possible ideas:
- HttpClient (time queued, connection count, etc)
- Dependency injection
- DNS lookups
Ideas to reduce CPU consumption:
- don't track JIT on startup
- don't track TP stats unless unhealthy (e.g. too many queued tasks)
- don't track contention stats unless lots of contention
- don't track exceptions unless count is
For each source of info, offer options to:
- increase verbosity (more detailed log events, e.g. alloc by heap)
- upgrade from event counters -> event traces
- downgrade from event traces -> counters
- disable collectors entirely
# Collector improvements
## GC
- Collect heap info