---
title: AI Agent Observability - Evolving Standards and Best Practices
author: >-
  [Guangya Liu](https://github.com/gyliu513) (IBM), [Sujay
  Solomon](https://github.com/solsu01) (Google)
linkTitle: AI Agent Observability
issue: https://github.com/open-telemetry/opentelemetry.io/issues/6389
sig: SIG GenAI Observability
date: 2025-03-06
cSpell:ignore: genai Guangya PydanticAI Sujay
---

## 2025: Year of AI agents

AI agents are becoming the next big leap in artificial intelligence in 2025.
From autonomous workflows to intelligent decision making, AI agents will power
numerous applications across industries. However, with this evolution comes the
critical need for AI agent observability, especially when scaling these agents
to meet enterprise needs. Without proper monitoring, tracing, and logging
mechanisms, diagnosing issues, improving efficiency, and ensuring reliability in
AI agent-driven applications will be challenging.

### What is an AI agent

An AI agent is an application that uses a combination of LLM capabilities, tools
to connect to the external world, and high-level reasoning to achieve a desired
end goal or state. Alternatively, agents can be treated as systems where LLMs
dynamically direct their own processes and tool usage, maintaining control over
how they accomplish tasks.

![Sample RAG based application w/ReAct reasoning/planning](ai-agent.png)

<small>_Image credit_:
[Google AI Agent Whitepaper](https://www.kaggle.com/whitepaper-agents).</small>

For more information about AI agents, see:

- [Google: What is an AI agent?](https://cloud.google.com/discover/what-are-ai-agents)
- [IBM: What are AI agents?](https://www.ibm.com/think/topics/ai-agents)
- [Microsoft: AI agents — what they are, and how they’ll change the way we work](https://news.microsoft.com/source/features/ai/ai-agents-what-they-are-and-how-theyll-change-the-way-we-work/)
- [AWS: What are AI Agents?](https://aws.amazon.com/what-is/ai-agents/)
- [Anthropic: Building effective agents](https://www.anthropic.com/research/building-effective-agents)

### Observability and beyond

Typically, telemetry from applications is used to monitor and troubleshoot them.
In the case of an AI agent, given its non-deterministic nature, telemetry is
also used as a feedback loop to continuously learn from and improve the quality
of the agent, by feeding it into evaluation tools.

Given that observability and evaluation tools for GenAI come from various
vendors, it is important to establish standards for the shape of the telemetry
generated by agent apps, to avoid lock-in caused by vendor- or
framework-specific formats.

## Current state of AI agent observability

As AI agent ecosystems continue to mature, the need for standardized and robust
observability has become more apparent. While some frameworks offer built-in
instrumentation, others rely on integration with observability tools. This
fragmented landscape underscores the importance of the
[GenAI observability project](https://github.com/open-telemetry/community/blob/main/projects/gen-ai.md)
and OpenTelemetry’s emerging semantic conventions, which aim to unify how
telemetry data is collected and reported.

### Understanding AI agent applications vs. AI agent frameworks

It is crucial to distinguish between **AI agent applications** and **AI agent
frameworks**:

- **AI agent applications** are individual AI-driven entities that perform
  specific tasks autonomously.
- **AI agent frameworks** provide the infrastructure to develop, manage, and
  deploy AI agents, often in a more streamlined way than building an agent from
  scratch. Examples include [IBM Bee AI](https://github.com/i-am-bee),
  [IBM wxFlow](https://github.com/IBM/wxflows/),
  [CrewAI](https://www.crewai.com/),
  [AutoGen](https://microsoft.github.io/autogen/dev/),
  [Semantic Kernel](https://github.com/microsoft/semantic-kernel),
  [LangGraph](https://www.langchain.com/langgraph),
  [PydanticAI](https://ai.pydantic.dev/), and more.

![AI agent application vs AI agent framework](agent-agent-framework.png)

### Establishing a standardized semantic convention

Today, the
[GenAI observability project](https://github.com/open-telemetry/community/blob/main/projects/gen-ai.md)
within OpenTelemetry is actively working on defining semantic conventions to
standardize AI agent observability. This effort is primarily driven by:

- **Agent application semantic convention** – A draft AI agent application
  semantic convention has already been established as part of the discussions
  in the
  [OpenTelemetry semantic conventions repository](https://github.com/open-telemetry/semantic-conventions/issues/1732).
  The initial AI agent semantic convention is based on
  [Google's AI agent white paper](https://www.kaggle.com/whitepaper-agents),
  providing a foundational framework for defining observability standards.
  Moving forward, we will continue to refine and enhance this initial
  convention to make it more robust and comprehensive.
- **Agent framework semantic convention** – The focus has now shifted towards
  defining a common semantic convention for all AI agent frameworks. This
  effort is being discussed in
  [this OpenTelemetry issue](https://github.com/open-telemetry/semantic-conventions/issues/1530)
  and aims to establish a standardized approach for frameworks such as IBM Bee
  Stack, IBM wxFlow, CrewAI, AutoGen, LangGraph, and others. Additionally, each
  AI agent framework will be able to define its own vendor-specific semantic
  convention while adhering to the common standard.

By establishing these conventions, we ensure that AI agent frameworks can report
standardized metrics, traces, and logs, making it easier to integrate
observability solutions and compare performance across different frameworks.

Note: Experimental conventions already exist in OpenTelemetry for models at
[GenAI semantic convention](/docs/specs/semconv/gen-ai/).
117+
### Instrumentation approaches
118+
119+
In order to make a system observable, it must be instrumented: That is, code
120+
from the system’s components must
121+
[emit traces, metrics, and logs](/docs/concepts/instrumentation/).
122+
123+
Different AI agent frameworks have varying approaches to implementing
124+
observability, mainly categorized into two options:
125+
126+

#### Option 1: Baked-in instrumentation

The first option is to build instrumentation into the framework itself, emitting
telemetry that follows the OpenTelemetry semantic conventions. Observability is
then a native feature, allowing users to seamlessly track agent performance,
task execution, and resource utilization. Some AI agent frameworks, such as
CrewAI, follow this pattern.

As a developer of an agent framework, here are some pros and cons of baked-in
instrumentation:

- Pros
  - You take on the maintenance overhead of keeping the telemetry
    instrumentation up to date, so your users don't have to.
  - Simplifies adoption for users unfamiliar with OpenTelemetry configuration.
  - You can keep new features secret while providing instrumentation for them
    on the day of release.
- Cons
  - Adds bloat to the framework for users who do not need observability
    features.
  - Risk of version lock-in if the framework’s OpenTelemetry dependencies lag
    behind upstream updates.
  - Less flexibility for advanced users who prefer custom instrumentation.
  - You may not get feedback or review from OTel contributors familiar with
    current semantic conventions.
  - Your instrumentation may lag with respect to best practices and
    conventions (not just the version of the OTel library dependencies).
- Some best practices to follow if you consider this approach:
  - Provide a configuration setting that lets users easily enable or disable
    telemetry collection from your framework's built-in instrumentation.
  - Plan ahead for users who may want to use other external instrumentation
    packages, and avoid collisions.
  - Consider listing your agent framework in the
    [OpenTelemetry registry](/ecosystem/registry/) if you choose this path.
- As a developer of an agent application, you may want to choose an agent
  framework with baked-in instrumentation if you want:
  - Minimal dependencies on external packages in your agent app code.
  - Out-of-the-box observability without manual setup.
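As a sketch of the first best practice above (an easy on/off switch for built-in telemetry), here is a minimal, hypothetical framework whose instrumentation can be disabled either in code or through an environment variable. The names `AgentFramework` and `MYAGENT_TELEMETRY` are invented for illustration, and the in-memory span list stands in for a real OpenTelemetry tracer.

```python
import os
from dataclasses import dataclass, field

@dataclass
class AgentFramework:
    """Hypothetical agent framework with baked-in, opt-out telemetry."""
    # Telemetry is on by default; users can opt out in code or by setting
    # the (invented) environment variable MYAGENT_TELEMETRY=off.
    telemetry_enabled: bool = field(
        default_factory=lambda: os.getenv("MYAGENT_TELEMETRY", "on") != "off"
    )
    spans: list = field(default_factory=list)  # stand-in for a real tracer

    def run_task(self, task: str) -> str:
        if self.telemetry_enabled:
            # In a real framework this would start/end an OTel span.
            self.spans.append({"name": "invoke_agent", "task": task})
        return f"done: {task}"

fw = AgentFramework()
fw.run_task("book a flight")
print(len(fw.spans))  # 1 -- span recorded by default

quiet = AgentFramework(telemetry_enabled=False)
quiet.run_task("book a hotel")
print(len(quiet.spans))  # 0 -- telemetry disabled, no overhead for the user
```

Gating the span creation on a single flag keeps the observability code path cheap to skip for users who never enable it, which addresses the bloat concern listed under the cons.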

#### Option 2: Instrumentation via OpenTelemetry

The second option is to publish OpenTelemetry instrumentation libraries in
separate repositories. These libraries can be imported into agents and
configured to emit telemetry per the OpenTelemetry semantic conventions.

For publishing instrumentation with OpenTelemetry, there are two options:

- Option 1: External instrumentation in your own repository and package, like
  [Traceloop OpenTelemetry Instrumentation](https://github.com/traceloop/openllmetry/tree/main/packages),
  [Langtrace OpenTelemetry Instrumentation](https://github.com/Scale3-Labs/langtrace-python-sdk/tree/main/src/langtrace_python_sdk/instrumentation),
  and so on.
- Option 2: External instrumentation in an OpenTelemetry-owned repository, such
  as
  [instrumentation-genai](https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/instrumentation-genai).

Both options work well, but the long-term goal is to host the code in
OpenTelemetry-owned repositories, as Traceloop is doing by
[donating its instrumentation code](https://github.com/open-telemetry/community/issues/2571)
to OpenTelemetry.
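Most external instrumentation libraries follow the same basic pattern: an instrumentor object patches the framework's entry points when `instrument()` is called and restores them on `uninstrument()`. The sketch below models that pattern in plain Python, with a hypothetical framework (`MyAgentLib`) and an in-memory span list standing in for a real tracer; actual OpenTelemetry instrumentations typically subclass `BaseInstrumentor` from the `opentelemetry-instrumentation` package and emit proper spans.

```python
class MyAgentLib:
    """Stand-in for a third-party agent framework with no built-in telemetry."""
    def invoke(self, task: str) -> str:
        return f"result for {task}"

class MyLibInstrumentor:
    """Hypothetical external instrumentor: wraps MyAgentLib.invoke to
    capture telemetry, without any change to the framework's own code."""
    def __init__(self):
        self._original = None
        self.spans = []  # captured telemetry, in place of a real tracer

    def instrument(self):
        if self._original is not None:
            return  # already instrumented; don't double-wrap
        self._original = MyAgentLib.invoke
        instrumentor = self

        def wrapper(self, task):
            # Record a span-like entry, then delegate to the original method.
            instrumentor.spans.append({"name": "invoke_agent", "task": task})
            return instrumentor._original(self, task)

        MyAgentLib.invoke = wrapper

    def uninstrument(self):
        if self._original is not None:
            MyAgentLib.invoke = self._original  # restore the original method
            self._original = None

inst = MyLibInstrumentor()
inst.instrument()
MyAgentLib().invoke("summarize a doc")   # traced
inst.uninstrument()
MyAgentLib().invoke("untraced call")     # not traced
print(len(inst.spans))  # 1 -- only the instrumented call was captured
```

Because the wrapping lives entirely in the external package, the framework stays free of observability dependencies, and users who want telemetry opt in by installing and enabling the instrumentor.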

As a developer of an agent framework, here are some pros and cons of
instrumentation with OpenTelemetry:

- Pros
  - Decouples observability from the core framework, reducing bloat.
  - Leverages OpenTelemetry’s community-driven maintenance for instrumentation
    updates.
  - Allows users to mix and match contrib libraries for their specific needs
    (e.g., cloud providers, LLM vendors).
  - More likely to leverage best practices around semantic conventions and
    zero-code instrumentation.
- Cons
  - Risk of fragmentation if users rely on incompatible or outdated contrib
    packages, both at install time and at runtime.
  - Development velocity slows down when there are too many PRs in the
    OpenTelemetry review queue.
- Best practices for this approach:
  - Ensure compatibility with popular OpenTelemetry contrib libraries (e.g.,
    LLM vendors, vector DBs).
  - Provide clear documentation on recommended contrib packages and
    configuration examples.
  - Avoid reinventing the wheel; align with existing OpenTelemetry standards.
- As a developer of an agent application, you may want to choose an agent
  framework instrumented via external OpenTelemetry libraries if:
  - You need fine-grained control over telemetry sources and destinations.
  - Your use case requires integrating observability with niche or custom
    tools.

**NOTE:** Regardless of the approach taken, it is essential that all AI agent
frameworks adopt the AI agent framework semantic convention to ensure
interoperability and consistency in observability data.

## Future of AI agent observability

Looking ahead, AI agent observability will continue to evolve with:

- **More robust semantic conventions** to cover edge cases and emerging AI
  agent frameworks.
- **A unified AI agent framework semantic convention** to ensure
  interoperability across different frameworks while allowing flexibility for
  vendor-specific extensions.
- **Continuous improvements to the AI agent semantic convention** to refine the
  initial standard and address new challenges as AI agents evolve.
- **Improved tooling** for monitoring, debugging, and optimizing AI agents.
- **Tighter integration with AI model observability** to provide end-to-end
  visibility into AI-powered applications.

## Role of OpenTelemetry's GenAI SIG

The
[GenAI Special Interest Group (SIG) in OpenTelemetry](https://github.com/open-telemetry/community/blob/main/projects/gen-ai.md)
is actively defining [GenAI semantic conventions](/docs/specs/semconv/gen-ai/)
that cover key areas such as:

- LLM or model semantic conventions
- VectorDB semantic conventions
- AI agent semantic conventions (a critical component within the broader GenAI
  semantic conventions)

In addition to conventions, the SIG has also expanded its scope to provide
instrumentation coverage for agents and models in Python and other languages.
As AI agents become increasingly sophisticated, observability will play a
fundamental role in ensuring their reliability, efficiency, and
trustworthiness. Establishing a standardized approach to AI agent observability
requires collaboration, and we invite contributions from the broader AI
community.

We look forward to partnering with different AI agent framework communities to
establish best practices and refine these standards together. Your insights and
contributions will help shape the future of AI observability, fostering a more
transparent and effective AI ecosystem.

Don’t miss this opportunity to help shape the future of industry standards for
GenAI observability! Join us on the [CNCF Slack](https://slack.cncf.io)
`#otel-genai-instrumentation-wg` channel, or by attending a
[GenAI SIG meeting](https://github.com/open-telemetry/community/blob/main/projects/gen-ai.md#meeting-times).
