Remove telegraf and use only fluent-bit for telemetry (#1030)
[comment]: # (Note that your PR title should follow the conventional
commit format: https://conventionalcommits.org/en/v1.0.0/#summary)
# PR Description
- Upgrade `fluent-bit`
  - Linux: `2.1.10` -> `3.2.2` (latest)
    - `>=3.2` is required to use the `metrics_selector` and `labels`
      processors for filtering Prometheus metrics
  - Windows: `2.1.10` -> `3.0.7` (latest)
    - `3.0` is the latest version released for Windows
- Remove `telegraf` for Linux and Windows
- Changes to `fluent-bit`
  - Built-in Plugins:
    - Use the `prometheus_scrape` input plugin to scrape the Prometheus
      metrics previously collected by telegraf
    - Use the `metrics_selector` and `labels` processors to filter certain
      metrics and drop unnecessary labels before sending to App Insights
  - Conf:
    - Use the new YAML config format, which is needed for the
      `metrics_selector` and `labels` processors
  - Custom Output Plugin:
    1. Collect the CPU and memory usage of otelcollector and
       metricsextension, which telegraf previously collected
       - This runs as a goroutine, separate from the fluent-bit pipeline
       - Use the same underlying golang package as telegraf:
         `github.com/shirou/gopsutil/v4/process`
       - Collect at the same frequency as telegraf and aggregate to p50
         and p95
       - Send the extra env vars as customDimensions, as telegraf did
    2. Decode the Prometheus metrics msgpack from fluent-bit and send it
       to App Insights in the format we want
       - Add one line to the fluent-bit `proxy_plugin` file so that the
         Prometheus metrics are allowed to flow to our golang output
         plugin:
         - `out->event_type = FLB_OUTPUT_LOGS | FLB_OUTPUT_METRICS;`
         - `fluent-bit` has the `proxy_plugin` files to allow golang
           output plugins to be built upon the C code. However, these
           files do not specify which event types the output plugin
           accepts (`logs`, `metrics`, or `traces`), so it defaults to
           routing only the `logs` type to the output plugin.
  - Build Pipeline:
    - Build `fluent-bit` with the line added above in the exact same way
      Mariner builds the package.
    - Only build `fluent-bit` with the plugins that we actually use, to
      keep our CVE surface area small.
- Main image bug fixes:
  - Use the daemonset fluent-bit config file for the daemonset.
    Previously, the replicaset config file was used for both the
    replicaset and the daemonset, so the daemonset logs weren't being
    collected
  - Fix the telemetry for `network-observability` and `acstor` that was
    previously missed
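
As a rough illustration of the "collect, then aggregate to p50 and p95" step in the custom output plugin, here is a minimal Go sketch. It is not the plugin's code: the `usageWindow` type, the `percentile` helper, the nearest-rank percentile method, and the sample values are all assumptions made for this example. The real samples would come from `github.com/shirou/gopsutil/v4/process`, which is left out so the sketch only needs the standard library.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (0 < p <= 100) of
// samples, or 0 for an empty slice. The input slice is not modified.
// Nearest-rank is an assumption; the PR does not say which method is used.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// usageWindow (hypothetical) buffers usage samples between two telemetry sends.
type usageWindow struct {
	samples []float64
}

// add records one sample, e.g. a CPU percentage read for one process.
func (w *usageWindow) add(v float64) {
	w.samples = append(w.samples, v)
}

// flush returns the p50 and p95 of the buffered samples and empties the buffer.
func (w *usageWindow) flush() (p50, p95 float64) {
	p50 = percentile(w.samples, 50)
	p95 = percentile(w.samples, 95)
	w.samples = w.samples[:0]
	return p50, p95
}

func main() {
	w := &usageWindow{}
	// Made-up CPU samples standing in for periodic process readings.
	for _, cpu := range []float64{3, 1, 4, 2, 5, 9, 7, 6, 8, 10} {
		w.add(cpu)
	}
	p50, p95 := w.flush()
	fmt.Printf("p50=%.1f p95=%.1f\n", p50, p95)
}
```

In a long-running plugin, `add` would typically be called on a ticker from the collection goroutine and `flush` on the telemetry send interval, with a mutex around the buffer if the two run on different goroutines.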
[comment]: # (The below checklist is for PRs adding new features. If a
box is not checked, add a reason why it's not needed.)
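
To show why the one-line `proxy_plugin` change matters, the following Go sketch models event-type routing as a bitmask check. It is only a model: the constant names mirror fluent-bit's `FLB_OUTPUT_*` flags, but their numeric values and the `accepts` function are assumptions for illustration and are not taken from the fluent-bit source.

```go
package main

import "fmt"

// Flags modeled on FLB_OUTPUT_LOGS, FLB_OUTPUT_METRICS, and FLB_OUTPUT_TRACES.
// The numeric values are illustrative assumptions.
const (
	outputLogs    = 1 << iota // 1
	outputMetrics             // 2
	outputTraces              // 4
)

// accepts (hypothetical) reports whether an output plugin whose event_type
// mask is pluginMask would be routed an event of type eventType.
func accepts(pluginMask, eventType int) bool {
	return pluginMask&eventType != 0
}

func main() {
	defaultMask := outputLogs                 // logs-only default
	patchedMask := outputLogs | outputMetrics // mask set by the added line

	fmt.Println(accepts(defaultMask, outputMetrics)) // false: metrics are not routed
	fmt.Println(accepts(patchedMask, outputMetrics)) // true: metrics reach the plugin
	fmt.Println(accepts(patchedMask, outputTraces))  // false: traces are still excluded
}
```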
# New Feature Checklist
- [x] Link to the one-pager about the feature:
https://msazure.visualstudio.com/InfrastructureInsights/_wiki/wikis/InfrastructureInsights.wiki/741581/TelegrafRemoval
- [x] Attach results of scale and perf testing:
<img width="1753" alt="image"
src="https://github.com/user-attachments/assets/de8a4d9a-6f0f-40ba-b55d-491dc68e503b"
/>
# Telemetry Values Comparison
- ReplicaSet
<img width="1712" alt="image"
src="https://github.com/user-attachments/assets/4ab3d9b3-51ab-4775-a5a6-872d3d0291c9"
/>
- DaemonSet
<img width="1706" alt="image"
src="https://github.com/user-attachments/assets/cd2feee3-1bfd-411c-9050-99a3681f00c8"
/>
- All extra [env var telemetry is transferred
over](https://dataexplorer.azure.com/dashboards/94da59c1-df12-4134-96bb-82c6b32e6199?p-_cluster=v-%2Fsubscriptions%2Fb9842c7c-1a38-4385-8f39-a51314758bcf%2FresourceGroups%2Fgrace-win%2Fproviders%2FMicrosoft.ContainerService%2FmanagedClusters%2Fgrace-win&p-_startTime=7days&p-_endTime=now&p-Interval=v-5m&p-AKSClusterID=v-675b8ceeceb9d100010e6fe2#9d5aa5eb-cc6f-46c9-81f6-22c8fe5357e4)