[AI Evaluation] Introduce a mechanism to detect when evaluation scores drop below configured thresholds #6038

Open
shyamnamboodiripad opened this issue Mar 5, 2025 · 0 comments
Labels: area-ai-eval (Microsoft.Extensions.AI.Evaluation and related)

shyamnamboodiripad (Contributor) commented:
Asserting on individual metrics within evaluation unit tests can lead to a cumbersome workflow, since scores can jump around, both between iterations of a single test (depending on the individual sampled LLM responses) and between different scenarios in the overall execution. When evaluation tests are included in a CI pipeline, it is not desirable for such noise to periodically block code flow. Instead, we need some way to look at the big picture and overall trends, and to be alerted only when there is a significant change.

For example, it would be desirable to have a more global mechanism that detects when the scores for multiple iterations and scenarios are trending lower. One idea would be to introduce a JSON configuration containing a set of rules / thresholds, for example:

- at least x% of scenarios should produce a score > 4 for metric xyz,
- at least y% of scenarios in a given sub-area should produce a score > 3 for metric abc,
- the overall scores for the metrics in a given sub-area should not regress by more than z%.

This configuration could then be passed to the aieval tool, which could signal (via its exit code) whether any of the configured thresholds have been breached. A hypothetical sketch of such a configuration is shown below.
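For illustration only, here is a minimal sketch of what such a configuration might look like. Everything in it is hypothetical: the property names, the metric names, and the scenario-name filter syntax are placeholders rather than an existing schema.

```json
{
  "rules": [
    {
      "metric": "Coherence",
      "minimumScore": 4,
      "minimumPassingScenarioPercentage": 90
    },
    {
      "metric": "Relevance",
      "scenarioFilter": "MyApp.Evaluation.Search.*",
      "minimumScore": 3,
      "minimumPassingScenarioPercentage": 80
    },
    {
      "metric": "Groundedness",
      "scenarioFilter": "MyApp.Evaluation.Search.*",
      "maximumAverageScoreRegressionPercentage": 5
    }
  ]
}
```

A CI pipeline could then pass a file like this to the aieval tool (via some yet-to-be-designed option) and fail the build whenever the tool exits with a non-zero code because one of the configured rules was breached.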

This configuration could also be passed to the aieval report subcommand, in which case the generated report could include a high-level summary of how the results measure up against each configured threshold.

Related: #5934

As always, other ideas / suggestions are welcome :)

FYI @peterwald
