OpenAI released a playbook for conducting valid third-party evaluations of frontier AI models. It emphasizes that the evaluation harness (the surrounding setup that enables tool use, state management, and multi-step actions) significantly affects measured performance and must be documented explicitly. For commerce practitioners deploying AI agents in production workflows, the framework clarifies how to interpret evaluation claims, how to distinguish capability elicitation from controlled comparisons, and how to assess whether reported safety and performance metrics reflect real-world conditions or artifacts of the test environment.

OpenAI publishes framework for trustworthy third-party AI model evaluations