
Commit 165dffb

fix: update links to OpenTelemetry Operator API documentation
The OpenTelemetry Operator repository refactored its API documentation structure, making the previous links to docs/api.md invalid. This commit updates all references to point to the new API documentation locations. Fixes #6237
1 parent cd90ab7 commit 165dffb

5 files changed, +97 -102 lines changed

.htmltest.yml (-3 lines)
@@ -93,6 +93,3 @@ IgnoreURLs: # list of regexs of paths or URLs to be ignored
   - ^https://pkg.go.dev/go.opentelemetry.io/collector/config/configauth#client-authenticators
   - ^https://pkg.go.dev/go.opentelemetry.io/collector/config/configauth#server-authenticators

-  # Temporary until
-  # https://github.com/open-telemetry/opentelemetry.io/issues/6237 is resolved
-  - ^https://github.com/open-telemetry/opentelemetry-operator/blob/main/docs/api.md#

content/en/blog/2023/end-user-q-and-a-04.md (+48 -51 lines)
@@ -61,13 +61,13 @@ monorepo, but that would have risked a bug being pushed up.

 Jacob says, "This is app data which we use for alerting to understand how our
 workloads are functioning in all of our environments, so it's important to not
-take that down since itd be disastrous. Same story for users, they want to know
-if they move to OTel they wont lose their alerting capabilities. You want a
+take that down since it'd be disastrous. Same story for users, they want to know
+if they move to OTel they won't lose their alerting capabilities. You want a
 safe and easy migration."

 His team did the feature flag-based part of the configuration in Kubernetes. He
 says, "It would disable the sidecar and enable some code that would then swap
-the OTel for metrics and forward it to where its supposed to go. So that was
+the OTel for metrics and forward it to where it's supposed to go. So that was
 the path there."

 However, along the way, he noticed some "pretty large performance issues" as he
@@ -76,15 +76,15 @@ worked with the OTel team to alleviate some of these concerns, and found that
 one of the big blockers was their heavy use of attributes on metrics.

 "It was tedious to go in and figure out which metrics are using them and getting
-rid of them. I had a theory that one codepath was the problem, where were doing
+rid of them. I had a theory that one codepath was the problem, where we're doing
 the conversion from our internal tagging implementation to OTel tags, which came
 with a lot of other logic and [is] expensive to do, and it was on almost every
 call," he says. "No better time than now to begin another migration from
 OpenCensus to OTel."

 He saw this as another opportunity: "While we wait for the OTel folks on the
 metrics side to push out more performant code and implementations, we could also
-test out the theory of, if we migrate to OTel entirely, were going to see more
+test out the theory of, if we migrate to OTel entirely, we're going to see more
 performance benefits." Thus, they paused the metrics work and began on migrating
 their tracing.

@@ -111,7 +111,7 @@ code and some dangerous hacks, so that was a really good thing."
 enough for it to be constant," Jacob says. "The reason you want to pick a
 service like this is that if it's too low traffic, like one request every 10
 minutes, you have to worry about sample rates, [and] you may not have a lot of
-data to compare against – thats the big thing: you need to have some data to
+data to compare against – that's the big thing: you need to have some data to
 compare against."

 He had written a script early on for their metrics migration that queried
@@ -131,23 +131,23 @@ different types of instrumentation, so from Envoy to OTel to OpenTracing."
 He explains, "What you want to see is that the trace before has the same
 structure as the trace after. So I made another script that checked that those
 structures were relatively the same and that they all had the same attributes as
-well... Thats the point of the tracing migration – what matters is that all the
+well... That's the point of the tracing migration – what matters is that all the
 attributes stayed the same."

 ### When data goes missing

-"The why its missing stories are the really complicated ones," says Jacob.
+"The 'why it's missing' stories are the really complicated ones," says Jacob.
 Sometimes, it's as simple as forgetting "to add something somewhere," but other
 times, there could be an upstream library that doesn't emit what you expected
 for OTel.

 He tells a story about the time he migrated their gRPC util package (which is
 now in Go contrib) and found an issue with propagation.

-"I was trying to understand whats going wrong here. When I looked at the code –
+"I was trying to understand what's going wrong here. When I looked at the code –
 this tells you how early I was doing this migration – where there was supposed
 to be a propagator, there was just a 'TODO'," he shares. "It just took down our
-entire services traces in staging."
+entire services' traces in staging."

 He spent some time working on it, but they in turn were waiting on something
 else, and so on and so forth -- Jacob says there are "endless cycles of that
@@ -163,9 +163,9 @@ Views and the use of Views is something we used heavily early in the migration."

 "A Metrics View is something that is run inside of your Meter Provider in OTel,"
 Jacob explains. There are many configuration options, such as dropping
-attributes, which is one of the most common use cases. "For example, youre a
+attributes, which is one of the most common use cases. "For example, you're a
 centralized SRE and you don't want anyone to instrument code with any user ID
-attribute, because thats a high cardinality thing and its going to explode
+attribute, because that's a high cardinality thing and it's going to explode
 your metrics cost. You can make a View that gets added to your instrumentation
 and tell it to not record it, to deny it."

@@ -174,19 +174,19 @@ temporality or aggregation of your metrics. Temporality refers to whether a
 metric incorporates the previous measurement or not (cumulative and delta), and
 aggregation refers to how you send off the metrics.

-"Its most useful for [our] histograms," says Jacob. "When you record
+"It's most useful for [our] histograms," says Jacob. "When you record
 histograms, there are a few different kinds – DataDog and Statsd histograms are
-not true histograms because what theyre recording is like aggregation samples.
+not true histograms because what they're recording is like aggregation samples.
 They give you a min, max, count, average, and P95 or something. The problem with
 that is, in distributed computing, if you have multiple applications that are
-reporting a P95, theres no way you can get a true P95 from that observation
+reporting a P95, there's no way you can get a true P95 from that observation
 with that aggregation," he continues.

-"The reason for that is, if you have five P95 observations, theres not an
+"The reason for that is, if you have five P95 observations, there's not an
 aggregation to say, give me the overall P95 from that. You need to have
 something about the original data to recalculate it. You can get the average of
-the P95s but its not a great metric, it doesn't really tell you much. It's not
-really accurate. If youre going to alert on something and page someone at
+the P95s but it's not a great metric, it doesn't really tell you much. It's not
+really accurate. If you're going to alert on something and page someone at
 night, you should be paging on accurate measurements."

 Initially, they did have a few people who relied on the min, max, sum, count
@@ -207,13 +207,13 @@ other components, which was really neat."

 When Jacob started the OTel migration, it was still too early for logs. "The
 thing we would change," he says, "is how we collect those logs, potentially; we
-previously did it using Googles log agent, basically running
+previously did it using Google's log agent, basically running
 [fluentbit](https://fluentbit.io) on every node in a GKE cluster and then they
 send it off to GCP and we tail it there." He notes that there may have been
 recent changes to this that he's not aware of at this time.

-"For a long time, weve used span events and logs for a lot of things
-internally," he says. "Im a big fan of them." He is not as big a fan of
+"For a long time, we've used span events and logs for a lot of things
+internally," he says. "I'm a big fan of them." He is not as big a fan of
 logging, sharing that he thinks they are "cumbersome and expensive." He suggests
 that users opt for tracing and trace logs whenever possible, although he does
 like logging for local development, and tracing for distributed development."
@@ -276,11 +276,11 @@ components that you'll need to monitor.
 "In OTel, we tack on this
 [Prometheus receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/prometheusreceiver/README.md)
 to get all this data, but because we want to be more efficient than Prometheus,
-because we dont need to store the data, we have this component called the
+because we don't need to store the data, we have this component called the
 Target Allocator, which goes to do the service discovery from Prometheus," says
 Jacob. "It says give me all the targets I need to scrape. Then the Target
 Allocator says: with these targets, distribute them evenly among the set of
-collectors thats running."
+collectors that's running."

 That's the main thing this component does, and it also helps with job discovery.
 If you're using Prometheus service monitors, which is part of the
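As a companion to the Target Allocator description in the hunk above, here is a rough sketch of how that wiring can look in the Operator's CRD. It assumes a recent operator release with the structured v1beta1 config; the resource name and the debug exporter are placeholders rather than anything from Jacob's setup.

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: prom-scrapers # hypothetical name
spec:
  mode: statefulset # the Target Allocator pairs best with a stable, numbered set of collectors
  replicas: 3
  targetAllocator:
    enabled: true # hand out scrape targets evenly across the replicas
    prometheusCR:
      enabled: true # also discover Prometheus Operator ServiceMonitor/PodMonitor objects
  config:
    receivers:
      prometheus:
        config:
          scrape_configs: [] # jobs are fetched from the Target Allocator at runtime
    exporters:
      debug: {}
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [debug]
```

With targetAllocator.enabled set, the operator points each receiver at the allocator, so every replica only scrapes the share of targets assigned to it, which is the even distribution Jacob describes.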
@@ -303,31 +303,29 @@ to run this."
 ### The Collector setup

 Jacob's team runs a lot of different types of Collectors over at Lightstep. "We
-run metrics things, tracing things, internal ones, external ones – theres a lot
+run metrics things, tracing things, internal ones, external ones – there's a lot
 of different collectors that are running at all times", he shares.

-"Its all very in-flux." They're changing things around a lot to run
+"It's all very in-flux." They're changing things around a lot to run
 experiments, since the best way for them to create features for customers and
 end users is to make sure they work internally first.

 "We're running in a single path where there could be two collectors in two
 environments that could be running two different images and two different
 versions. It gets really meta and really confusing to talk about," he says. "And
-then, if youre sending Collector A across an environment to Collector B,
+then, if you're sending Collector A across an environment to Collector B,
 Collector B also emits telemetry about itself, which is then collected by
 Collector C, so it just chains."

 In a nutshell, you need to make sure that the collector is actually working.
-"Thats like the problem when were debugging this stuff. When theres a problem
+"That's like the problem when we're debugging this stuff. When there's a problem
 you have to think up where the problem actually is -- is it in how we collect
 the data, is it in how we emit the data, is it in the source of how the data was
 generated? One of a bunch of things."

 ### Kubernetes modes on OTel

-The OTel Operator supports four
-[deployment modes](https://github.com/open-telemetry/opentelemetry-operator/blob/main/docs/api.md#opentelemetrycollectorspec)
-for the OTel Collector in Kubernetes:
+The OTel Operator supports [four deployment modes](https://github.com/open-telemetry/opentelemetry-operator/blob/main/docs/api/opentelemetrycollectors.md) in Kubernetes.

 - [Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/) -
   see example
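As a companion to the deployment-modes list in the hunk above, a minimal sketch of how a mode is selected, using sidecar mode as the example since it comes up next. It assumes a recent operator version; the names, image, and endpoint are hypothetical.

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: sidecar # hypothetical name
spec:
  mode: sidecar # other options: deployment, daemonset, statefulset
  config:
    receivers:
      otlp:
        protocols:
          grpc: {}
    exporters:
      otlp:
        endpoint: "gateway-collector:4317" # hypothetical downstream gateway collector
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [otlp]
---
# Sidecar mode only injects into pods that opt in via annotation.
apiVersion: v1
kind: Pod
metadata:
  name: my-app # hypothetical workload
  annotations:
    sidecar.opentelemetry.io/inject: "true"
spec:
  containers:
    - name: app
      image: registry.example.com/my-app:latest # hypothetical image
```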
@@ -346,19 +344,18 @@ Which ones you should use depends on what you need to do, such as how you like
 to run applications for reliability.

 "Sidecar is the one we use the least and is probably used the least across the
-industry if I had to make a bet," Jacob says. "Theyre expensive. If you dont
-really need them, then you shouldnt use them." An example of something run as a
+industry if I had to make a bet," Jacob says. "They're expensive. If you don't
+really need them, then you shouldn't use them." An example of something run as a
 sidecar is Istio, "which makes a lot of sense to run as a sidecar because it
 does proxy traffic and it hooks into your container network to change how it all
 does its thing."

 You will get a cost hit if you sidecar your Collectors for all your services,
-and you also have limited capabilities. He says, "If you’re making Kubernetes
-APIs calls or attribute enrichment, that’s the thing that would get
-exponentially expensive if you’re running as a sidecar." He shares an example:
-"...if you have sidecar [Collector using the
+and you also have limited capabilities. He says, "If you're making Kubernetes
+APIs calls or attribute enrichment, that's the thing that would get
+exponentially expensive if you're running as a sidecar." He shares an example: "...if you have sidecar [Collector using the
 [k8sattributesprocessor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/k8sattributesprocessor)]
-on 10k pods, then thats 10k API calls made to the K8s API. That's expensive."
+on 10k pods, then that's 10k API calls made to the K8s API. That's expensive."

 On the other hand, if you have five pods deployed on StatefulSets, "that's not
 that expensive." When you run in StatefulSet mode, you get an exact number of
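To ground the k8sattributesprocessor example quoted above, a hedged sketch of the processor running in a small gateway-style Collector rather than in every sidecar, so only a handful of pods make Kubernetes API calls. The processor keys follow its contrib README; the surrounding pipeline is illustrative.

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}

processors:
  k8sattributes:
    auth_type: serviceAccount
    passthrough: false # sidecars can set passthrough: true and leave lookups to a gateway
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.deployment.name
    pod_association:
      - sources:
          - from: resource_attribute
            name: k8s.pod.ip

exporters:
  debug: {}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes]
      exporters: [debug]
```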
@@ -370,7 +367,7 @@ which is why it's required. Another thing that StatefulSets guarantee is
 something called in-place deployment, which is also available with DaemonSets;
 this is where you take the pod down before you create a new one.

-"In a deployment you usually do a 1-up, 1-down, or whats called a
+"In a deployment you usually do a 1-up, 1-down, or what's called a
 [rolling deployment](https://www.techtarget.com/searchitoperations/definition/rolling-deployment),
 or rolling update," Jacob says. If you were doing this with the Target
 Allocator, you are likely to get much more unreliable scrapes. This is because
@@ -380,30 +377,30 @@ the hashes you've assigned.

 Whereas with StatefulSets, this isn't necessary, since you get a consistent ID
 range. "So when you do 1-down 1 up, it keeps the same targets each time. So like
-a placeholder for it – you dont have to recalculate the ring," he explains.
+a placeholder for it – you don't have to recalculate the ring," he explains.

 He notes that this is really only useful as a metrics use case, where you're
 scraping Prometheus. He notes that they'd probably run it as a Deployment for
 anything else, since that mode gives you most everything you would need.
 Collectors are usually stateless, so there is no need for them to hold on to
 anything, and Deployments are leaner as a result. "You can just run and roll out
-and everyones happy," he says. "Thats how we run most of our collectors, is
+and everyone's happy," he says. "That's how we run most of our collectors, is
 just as a Deployment."

 For per-node scraping, DaemonSets come in handy. "This allows you to scrape the
-kubelet thats run on every node, it allows you to scrape the node exporter
-thats also run on every node, which is another Prometheus daemonset that most
+kubelet that's run on every node, it allows you to scrape the node exporter
+that's also run on every node, which is another Prometheus daemonset that most
 people run," he explains.

 DaemonSets are useful for scaling out, since they guarantee that you've got pods
 running on every node that matches its selector. "If you have a cluster of 800+
-nodes, its more reliable to run a bunch of little collectors that get those
+nodes, it's more reliable to run a bunch of little collectors that get those
 tiny metrics, rather than a few bigger stateful set pods because your blast
 radius is much lower," he says.

 "If one pod goes down, you lose just a tiny bit of data, but remember, with all
-this cardinality stuff, thats a lot of memory. So if youre doing a
-StatefulSet, scraping all these nodes, thats a lot of targets, thats a lot of
+this cardinality stuff, that's a lot of memory. So if you're doing a
+StatefulSet, scraping all these nodes, that's a lot of targets, that's a lot of
 memory, it can go down much more easily and you can lose more data."

 If a Collector goes down, it comes back up quickly, since it is usually
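A sketch of the per-node DaemonSet pattern described in the hunk above, here scraping the local kubelet with the kubeletstats receiver. The resource name and env-var plumbing are assumptions, and the RBAC the receiver needs (access to nodes/stats) is omitted for brevity.

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: node-agent # hypothetical name
spec:
  mode: daemonset # one collector pod per node
  env:
    - name: K8S_NODE_NAME # let each pod find its own node's kubelet
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
  config:
    receivers:
      kubeletstats:
        collection_interval: 30s
        auth_type: serviceAccount
        endpoint: "https://${env:K8S_NODE_NAME}:10250"
        insecure_skip_verify: true
    exporters:
      debug: {}
    service:
      pipelines:
        metrics:
          receivers: [kubeletstats]
          exporters: [debug]
```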
@@ -418,10 +415,10 @@ scale on, and you can distribute targets and load-balance.

 "Pull-based is like the reason that Prometheus is so ubiquitous... because it
 makes local development really easy, where you can just scrape your local
-endpoint, thats what most backend development is anyway," he says. "You can hit
+endpoint, that's what most backend development is anyway," he says. "You can hit
 endpoint A and then hit your metrics endpoint. Then hit endpoint A again and
-then metrics endpoint, and check that, so its an easy developer loop. It also
-means you dont have to reach outside of the network so if you have really
+then metrics endpoint, and check that, so it's an easy developer loop. It also
+means you don't have to reach outside of the network so if you have really
 strict proxy requirements to send data, local dev is much easier for that.
 That's why OTel now has a really good Prometheus exporter, so it can do both."

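On "it can do both": a minimal, assumed snippet of the contrib Prometheus exporter exposing a local /metrics endpoint, so a pull-based loop like the one described still works when the application itself emits OTLP.

```yaml
receivers:
  otlp:
    protocols:
      http: {}

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889" # anything pull-based can scrape this /metrics endpoint

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```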
@@ -451,8 +448,8 @@ He recommends using Dependabot, which they use in OTel. OTel packages update in
 lockstep, which means you have to update "a fair amount of packages at once, but
 it does do it all for you, which is nice," he says. However, you should be doing
 this with all your dependencies, as "CVEs happen in the industry constantly. If
-you're not staying up to date with vulnerability fixes then youre opening
-yourself up to security attacks, which you dont want. 'Do something about it'
+you're not staying up to date with vulnerability fixes then you're opening
+yourself up to security attacks, which you don't want. 'Do something about it'
 is my recommendation."

 ## Additional Resources
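Relating to the Dependabot recommendation quoted earlier in the hunk above (not to the Additional Resources heading): a hedged sketch of a dependabot.yml that groups the lockstep OTel Go modules into a single weekly update. The group name and pattern are illustrative.

```yaml
# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: gomod
    directory: /
    schedule:
      interval: weekly
    groups:
      opentelemetry: # update the lockstep OTel modules together in one PR
        patterns:
          - "go.opentelemetry.io/*"
```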
@@ -469,7 +466,7 @@ is my recommendation."

 ## Final Thoughts

-OpenTelemetry is all about community, and we wouldnt be where we are without
+OpenTelemetry is all about community, and we wouldn't be where we are without
 our contributors, maintainers, and users. We value user feedback -- please share
 your experiences and help us improve OpenTelemetry.
