monorepo, but that would have risked a bug being pushed up.

Jacob says, "This is app data which we use for alerting to understand how our
workloads are functioning in all of our environments, so it's important to not
take that down since it'd be disastrous. Same story for users, they want to know
if they move to OTel they won't lose their alerting capabilities. You want a
safe and easy migration."

His team did the feature flag-based part of the configuration in Kubernetes. He
says, "It would disable the sidecar and enable some code that would then swap
the OTel for metrics and forward it to where it's supposed to go. So that was
the path there."
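The interview doesn't include the actual wiring, but a flag-gated swap of the metrics pipeline could look roughly like the Go sketch below; the `OTEL_METRICS_ENABLED` variable, the OTLP exporter defaults, and the log messages are illustrative stand-ins rather than Lightstep's real code.

```go
package telemetry

import (
	"context"
	"log"
	"os"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

// InitMetrics picks the metrics pipeline based on a feature flag that the
// Kubernetes config exposes as an environment variable.
func InitMetrics(ctx context.Context) error {
	if os.Getenv("OTEL_METRICS_ENABLED") != "true" {
		// Flag off: leave the existing sidecar-based pipeline untouched.
		log.Println("metrics: staying on the legacy sidecar path")
		return nil
	}

	// Flag on: export metrics over OTLP instead of relying on the sidecar.
	exporter, err := otlpmetricgrpc.New(ctx)
	if err != nil {
		return err
	}
	otel.SetMeterProvider(sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exporter)),
	))
	log.Println("metrics: using the OpenTelemetry pipeline")
	return nil
}
```

Flipping the flag per environment then becomes a configuration change rather than a code change, which keeps rollback cheap if the new pipeline misbehaves.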
However, along the way, he noticed some "pretty large performance issues" as he
worked with the OTel team to alleviate some of these concerns, and found that
one of the big blockers was their heavy use of attributes on metrics.

"It was tedious to go in and figure out which metrics are using them and getting
rid of them. I had a theory that one codepath was the problem, where we're doing
the conversion from our internal tagging implementation to OTel tags, which came
with a lot of other logic and [is] expensive to do, and it was on almost every
call," he says. "No better time than now to begin another migration from
OpenCensus to OTel."

He saw this as another opportunity: "While we wait for the OTel folks on the
metrics side to push out more performant code and implementations, we could also
test out the theory of, if we migrate to OTel entirely, we're going to see more
performance benefits." Thus, they paused the metrics work and began migrating
their tracing.

code and some dangerous hacks, so that was a really good thing."

enough for it to be constant," Jacob says. "The reason you want to pick a
service like this is that if it's too low traffic, like one request every 10
minutes, you have to worry about sample rates, [and] you may not have a lot of
data to compare against – that's the big thing: you need to have some data to
compare against."

He had written a script early on for their metrics migration that queried

different types of instrumentation, so from Envoy to OTel to OpenTracing."

He explains, "What you want to see is that the trace before has the same
structure as the trace after. So I made another script that checked that those
structures were relatively the same and that they all had the same attributes as
well... That's the point of the tracing migration – what matters is that all the
attributes stayed the same."
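The comparison script itself isn't published, but the idea is simple enough to sketch. The Go program below is a hypothetical stand-in (the `Span` shape and sample data are invented): it reduces each trace to its parent/child structure plus the attribute keys on every span, then reports whether the before and after versions match.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// Span is a simplified stand-in for an exported span; a real script would
// parse whatever format the tracing backend returns.
type Span struct {
	Name          string
	ParentName    string
	AttributeKeys []string
}

// signature reduces a trace to a sorted, comparable form: one entry per span,
// combining the parent->child edge with that span's attribute keys.
func signature(trace []Span) []string {
	var sig []string
	for _, s := range trace {
		keys := append([]string(nil), s.AttributeKeys...)
		sort.Strings(keys)
		sig = append(sig, fmt.Sprintf("%s->%s %v", s.ParentName, s.Name, keys))
	}
	sort.Strings(sig)
	return sig
}

func main() {
	before := []Span{
		{Name: "GET /checkout", AttributeKeys: []string{"http.method", "http.route"}},
		{Name: "SELECT orders", ParentName: "GET /checkout", AttributeKeys: []string{"db.system"}},
	}
	after := []Span{
		{Name: "GET /checkout", AttributeKeys: []string{"http.method", "http.route"}},
		{Name: "SELECT orders", ParentName: "GET /checkout", AttributeKeys: []string{"db.system"}},
	}
	fmt.Println("structures match:", reflect.DeepEqual(signature(before), signature(after)))
}
```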
### When data goes missing

"The 'why it's missing' stories are the really complicated ones," says Jacob.
Sometimes, it's as simple as forgetting "to add something somewhere," but other
times, there could be an upstream library that doesn't emit what you expected
for OTel.

He tells a story about the time he migrated their gRPC util package (which is
now in Go contrib) and found an issue with propagation.

"I was trying to understand what's going wrong here. When I looked at the code –
this tells you how early I was doing this migration – where there was supposed
to be a propagator, there was just a 'TODO'," he shares. "It just took down our
entire services' traces in staging."
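A missing propagator fails quietly: spans are still produced, they just stop joining up across service boundaries. A common defensive habit with the Go SDK (not something specific to Lightstep) is to register the global propagator explicitly at startup instead of trusting whatever a library defaults to:

```go
package telemetry

import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

// SetupPropagation registers W3C Trace Context and Baggage propagation
// globally, so gRPC or HTTP instrumentation that pulls the global propagator
// actually has something to inject into and extract from headers.
func SetupPropagation() {
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{}, // traceparent / tracestate headers
		propagation.Baggage{},      // baggage header
	))
}
```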
He spent some time working on it, but they in turn were waiting on something
else, and so on and so forth -- Jacob says there are "endless cycles of that

Views and the use of Views is something we used heavily early in the migration."

"A Metrics View is something that is run inside of your Meter Provider in OTel,"
Jacob explains. There are many configuration options, such as dropping
attributes, which is one of the most common use cases. "For example, you're a
centralized SRE and you don't want anyone to instrument code with any user ID
attribute, because that's a high cardinality thing and it's going to explode
your metrics cost. You can make a View that gets added to your instrumentation
and tell it to not record it, to deny it."
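In the Go SDK, that deny rule is a View with an attribute filter. A minimal sketch, assuming a hypothetical `user.id` attribute key and a wildcard match on every instrument:

```go
package telemetry

import (
	"go.opentelemetry.io/otel/attribute"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

// NewMeterProvider wires in a View that strips the high-cardinality user.id
// attribute from every instrument before anything is recorded or exported.
func NewMeterProvider(reader sdkmetric.Reader) *sdkmetric.MeterProvider {
	dropUserID := sdkmetric.NewView(
		sdkmetric.Instrument{Name: "*"}, // match all instruments
		sdkmetric.Stream{
			// Keep every attribute except user.id.
			AttributeFilter: func(kv attribute.KeyValue) bool {
				return kv.Key != "user.id"
			},
		},
	)
	return sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(reader),
		sdkmetric.WithView(dropUserID),
	)
}
```

Because the View lives in the MeterProvider, a central SRE team can enforce it once without touching the instrumentation call sites.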
temporality or aggregation of your metrics. Temporality refers to whether a
metric incorporates the previous measurement or not (cumulative and delta), and
aggregation refers to how you send off the metrics.

"It's most useful for [our] histograms," says Jacob. "When you record
histograms, there are a few different kinds – DataDog and Statsd histograms are
not true histograms because what they're recording is like aggregation samples.
They give you a min, max, count, average, and P95 or something. The problem with
that is, in distributed computing, if you have multiple applications that are
reporting a P95, there's no way you can get a true P95 from that observation
with that aggregation," he continues.

"The reason for that is, if you have five P95 observations, there's not an
aggregation to say, give me the overall P95 from that. You need to have
something about the original data to recalculate it. You can get the average of
the P95s but it's not a great metric, it doesn't really tell you much. It's not
really accurate. If you're going to alert on something and page someone at
night, you should be paging on accurate measurements."
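This is where OTel histograms differ from pre-aggregated statsd-style summaries: the SDK exports bucket counts, so the backend can merge data from many pods and still compute an accurate P95. As a hedged sketch against recent versions of the Go SDK (the instrument name and bucket boundaries are just examples), a View can also pin that aggregation down explicitly:

```go
package telemetry

import (
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

// LatencyBucketsView forces request-duration histograms into explicit buckets
// (in seconds). Exporting buckets rather than pre-computed percentiles is what
// lets the backend aggregate across pods and recalculate a true P95.
func LatencyBucketsView() sdkmetric.View {
	return sdkmetric.NewView(
		sdkmetric.Instrument{Name: "http.server.request.duration"},
		sdkmetric.Stream{
			Aggregation: sdkmetric.AggregationExplicitBucketHistogram{
				Boundaries: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10},
			},
		},
	)
}
```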
Initially, they did have a few people who relied on the min, max, sum, count

other components, which was really neat."

When Jacob started the OTel migration, it was still too early for logs. "The
thing we would change," he says, "is how we collect those logs, potentially; we
previously did it using Google's log agent, basically running
[fluentbit](https://fluentbit.io) on every node in a GKE cluster and then they
send it off to GCP and we tail it there." He notes that there may have been
recent changes to this that he's not aware of at this time.

"For a long time, we've used span events and logs for a lot of things
internally," he says. "I'm a big fan of them." He is not as big a fan of
logging, sharing that he thinks they are "cumbersome and expensive." He suggests
that users opt for tracing and trace logs whenever possible, although he does
like logging for local development, and tracing for distributed development.
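As a small illustration of the span-events-over-logs habit (the event name and attribute here are invented, not taken from Lightstep's code), attaching an event to the current span with the Go API looks like this:

```go
package telemetry

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// RecordCacheMiss attaches an event to whatever span is active in ctx instead
// of emitting a free-floating log line, so the note travels with the trace and
// shows up with full request context in the backend.
func RecordCacheMiss(ctx context.Context, key string) {
	span := trace.SpanFromContext(ctx)
	span.AddEvent("cache.miss", trace.WithAttributes(
		attribute.String("cache.key", key),
	))
}
```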
components that you'll need to monitor.

"In OTel, we tack on this
[Prometheus receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/prometheusreceiver/README.md)
to get all this data, but because we want to be more efficient than Prometheus,
because we don't need to store the data, we have this component called the
Target Allocator, which goes to do the service discovery from Prometheus," says
Jacob. "It says give me all the targets I need to scrape. Then the Target
Allocator says: with these targets, distribute them evenly among the set of
collectors that's running."

That's the main thing this component does, and it also helps with job discovery.
If you're using Prometheus service monitors, which is part of the
### The Collector setup

Jacob's team runs a lot of different types of Collectors over at Lightstep. "We
run metrics things, tracing things, internal ones, external ones – there's a lot
of different collectors that are running at all times," he shares.

"It's all very in flux." They're changing things around a lot to run
experiments, since the best way for them to create features for customers and
end users is to make sure they work internally first.

"We're running in a single path where there could be two collectors in two
environments that could be running two different images and two different
versions. It gets really meta and really confusing to talk about," he says. "And
then, if you're sending Collector A across an environment to Collector B,
Collector B also emits telemetry about itself, which is then collected by
Collector C, so it just chains."

In a nutshell, you need to make sure that the collector is actually working.
"That's like the problem when we're debugging this stuff. When there's a problem
you have to think up where the problem actually is -- is it in how we collect
the data, is it in how we emit the data, is it in the source of how the data was
generated? One of a bunch of things."
### Kubernetes modes on OTel

The OTel Operator supports [four deployment modes](https://github.com/open-telemetry/opentelemetry-operator/blob/main/docs/api/opentelemetrycollectors.md) in Kubernetes:

- [Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/) -
  see example
Which ones you should use depends on what you need to do, such as how you like
to run applications for reliability.

"Sidecar is the one we use the least and is probably used the least across the
industry if I had to make a bet," Jacob says. "They're expensive. If you don't
really need them, then you shouldn't use them." An example of something run as a
sidecar is Istio, "which makes a lot of sense to run as a sidecar because it
does proxy traffic and it hooks into your container network to change how it all
does its thing."

You will get a cost hit if you sidecar your Collectors for all your services,
and you also have limited capabilities. He says, "If you're making Kubernetes
API calls or attribute enrichment, that's the thing that would get
exponentially expensive if you're running as a sidecar." He shares an example:
"...if you have sidecar [Collector using the
[k8sattributesprocessor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/k8sattributesprocessor)]
on 10k pods, then that's 10k API calls made to the K8s API. That's expensive."

On the other hand, if you have five pods deployed on StatefulSets, "that's not
that expensive." When you run in StatefulSet mode, you get an exact number of
which is why it's required. Another thing that StatefulSets guarantee is
something called in-place deployment, which is also available with DaemonSets;
this is where you take the pod down before you create a new one.

"In a Deployment you usually do a 1-up, 1-down, or what's called a
[rolling deployment](https://www.techtarget.com/searchitoperations/definition/rolling-deployment),
or rolling update," Jacob says. If you were doing this with the Target
Allocator, you are likely to get much more unreliable scrapes. This is because

the hashes you've assigned.

Whereas with StatefulSets, this isn't necessary, since you get a consistent ID
range. "So when you do 1-down, 1-up, it keeps the same targets each time. So like
a placeholder for it – you don't have to recalculate the ring," he explains.
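To make that concrete, here is a toy Go sketch of hash-based target assignment (rendezvous hashing for simplicity, not the Target Allocator's actual ring code). With StatefulSet pods the collector names are stable, so a recreated pod gets the same targets back; Deployment pods come back with new random names, so the assignment reshuffles:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// assign maps each scrape target to a collector using rendezvous hashing:
// each target goes to the collector whose combined hash score is highest.
// Assignments only move when the set of collector *names* changes.
func assign(targets, collectors []string) map[string]string {
	out := make(map[string]string)
	for _, t := range targets {
		var best string
		var bestScore uint64
		for _, c := range collectors {
			h := fnv.New64a()
			h.Write([]byte(t + "|" + c))
			if s := h.Sum64(); best == "" || s > bestScore {
				best, bestScore = c, s
			}
		}
		out[t] = best
	}
	return out
}

func main() {
	targets := []string{"pod-a:9090", "pod-b:9090", "pod-c:9090", "node-1:9100"}

	// StatefulSet pods keep stable names across restarts, so recreating
	// collector-1 reproduces exactly the same assignment.
	fmt.Println("statefulset:", assign(targets, []string{"collector-0", "collector-1", "collector-2"}))

	// Deployment pods come back with new random suffixes after a rollout,
	// so targets can land on different collectors each time.
	fmt.Println("deployment: ", assign(targets, []string{"collector-7f9c4", "collector-2b81d", "collector-9xk2p"}))
}
```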
He notes that this is really only useful for the metrics use case, where you're
scraping Prometheus; for anything else, they'd probably run it as a Deployment,
since that mode gives you most everything you would need. Collectors are usually
stateless, so there is no need for them to hold on to anything, and Deployments
are leaner as a result. "You can just run and roll out and everyone's happy," he
says. "That's how we run most of our collectors, is just as a Deployment."

For per-node scraping, DaemonSets come in handy. "This allows you to scrape the
kubelet that's run on every node, it allows you to scrape the node exporter
that's also run on every node, which is another Prometheus daemonset that most
people run," he explains.

DaemonSets are useful for scaling out, since they guarantee that you've got pods
running on every node that matches its selector. "If you have a cluster of 800+
nodes, it's more reliable to run a bunch of little collectors that get those
tiny metrics, rather than a few bigger stateful set pods because your blast
radius is much lower," he says.

"If one pod goes down, you lose just a tiny bit of data, but remember, with all
this cardinality stuff, that's a lot of memory. So if you're doing a
StatefulSet, scraping all these nodes, that's a lot of targets, that's a lot of
memory, it can go down much more easily and you can lose more data."
If a Collector goes down, it comes back up quickly, since it is usually

scale on, and you can distribute targets and load-balance.

"Pull-based is like the reason that Prometheus is so ubiquitous... because it
makes local development really easy, where you can just scrape your local
endpoint, that's what most backend development is anyway," he says. "You can hit
endpoint A and then hit your metrics endpoint. Then hit endpoint A again and
then metrics endpoint, and check that, so it's an easy developer loop. It also
means you don't have to reach outside of the network so if you have really
strict proxy requirements to send data, local dev is much easier for that.
That's why OTel now has a really good Prometheus exporter, so it can do both."
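That developer loop is easy to reproduce with the Go SDK's Prometheus exporter, which serves whatever the SDK records on a local `/metrics` endpoint. A minimal sketch (the counter name, route, and port are arbitrary):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel/exporters/prometheus"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	// The exporter acts as a Reader and registers with the default
	// Prometheus registry, so promhttp can serve what the SDK records.
	exporter, err := prometheus.New()
	if err != nil {
		log.Fatal(err)
	}
	provider := sdkmetric.NewMeterProvider(sdkmetric.WithReader(exporter))

	meter := provider.Meter("local-dev")
	requests, _ := meter.Int64Counter("app.requests")

	http.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
		requests.Add(r.Context(), 1)
		w.Write([]byte("hello\n"))
	})
	http.Handle("/metrics", promhttp.Handler())

	// Hit /hello a few times, then curl /metrics and watch the counter grow.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```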
He recommends using Dependabot, which they use in OTel. OTel packages update in
lockstep, which means you have to update "a fair amount of packages at once, but
it does do it all for you, which is nice," he says. However, you should be doing
this with all your dependencies, as "CVEs happen in the industry constantly. If
you're not staying up to date with vulnerability fixes then you're opening
yourself up to security attacks, which you don't want. 'Do something about it'
is my recommendation."

## Additional Resources
## Final Thoughts

OpenTelemetry is all about community, and we wouldn't be where we are without
our contributors, maintainers, and users. We value user feedback -- please share
your experiences and help us improve OpenTelemetry.