Asserting individual metrics within evaluation unit tests can lead to a cumbersome workflow, since scores can jump around (between iterations within a single test, depending on the individual sampled LLM responses, as well as between different scenarios in the overall run). When evaluation tests are included in a CI pipeline, it is not desirable for such noise to periodically block code flow. Instead, we need some way to look at the big picture and overall trends, and to be alerted when there is a significant change.
For example, it would be desirable to have a more global mechanism to detect when the scores across multiple iterations and scenarios are trending lower. One idea would be to introduce a JSON configuration containing a set of rules/thresholds (e.g., at least x% of scenarios should produce a score > 4 for metric xyz; at least y% of scenarios in this sub-area should produce a score > 3 for metric abc; the overall scores for the metrics in this sub-area should not regress by more than z%; etc.). This configuration can then be passed to the aieval tool, which could signal (via its exit code) whether any of the configured thresholds have been breached.
This configuration could also be passed to the aieval report subcommand, in which case the generated report could include a high-level summary of how the results measure up against each configured threshold.
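As a rough sketch only, a configuration along these lines might look something like the following (the schema, property names, and metric names here are purely hypothetical and just meant to illustrate the kinds of rules described above):

```json
{
  "thresholds": [
    {
      "description": "Most scenarios should score well on relevance",
      "metric": "relevance",
      "minScore": 4,
      "minPassingScenarioPercent": 80
    },
    {
      "description": "Scenarios in this sub-area should not regress noticeably",
      "scenarioFilter": "summarization/*",
      "metric": "groundedness",
      "baseline": "previous-run",
      "maxRegressionPercent": 5
    }
  ]
}
```

The same file could then drive both the gating step (where any breached threshold is reflected in the exit code) and aieval report, so the generated report can annotate each configured threshold with a pass/fail summary.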
Related: #5934
As always, other ideas / suggestions are welcome :)
FYI @peterwald