Asserting individual metrics within evaluation unit tests can lead to a cumbersome workflow, since scores can jump around (between iterations within a single test, depending on the individual sampled LLM responses, as well as between different scenarios in the overall run). When evaluation tests are included in a CI pipeline, it is not desirable for such noise to periodically block code flow. Instead, we need some way to look at the big picture and overall trends, and to be alerted when there is a significant change.
For example, it would be desirable to have a more global mechanism to detect when the scores across multiple iterations and scenarios are trending lower. One idea would be to introduce a JSON configuration containing a set of rules/thresholds (e.g., at least x% of scenarios should produce a score > 4 for metric xyz; at least y% of scenarios in this sub-area should produce a score > 3 for metric abc; the overall scores for the metrics in this sub-area should not regress by more than z%; etc.). This configuration can then be passed to the aieval tool, which could signal (via its exit code) whether any of the configured thresholds have been breached.
This configuration could also be passed to the aieval report subcommand, in which case the generated report could include a high-level summary of how the results measure up against each configured threshold.
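As a rough sketch only, a configuration along these lines might look something like the following (the schema, property names, and metric names here are purely hypothetical and just meant to illustrate the kinds of rules described above):

```json
{
  "thresholds": [
    {
      "description": "Most scenarios should score well on relevance",
      "metric": "relevance",
      "minScore": 4,
      "minPassingScenarioPercent": 80
    },
    {
      "description": "Scenarios in this sub-area should not regress noticeably",
      "scenarioFilter": "summarization/*",
      "metric": "groundedness",
      "baseline": "previous-run",
      "maxRegressionPercent": 5
    }
  ]
}
```

The same file could then drive both the gating step (where any breached threshold is reflected in the exit code) and aieval report, so the generated report can annotate each configured threshold with a pass/fail summary.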
Related: #5934
As always, other ideas / suggestions are welcome :)
FYI @peterwald