OpenAI published guidance on designing trustworthy third-party evaluations for frontier AI models. The guidance moves beyond simple chatbot-style prompting to account for complex agentic systems that use tools, maintain state, and operate across multi-step workflows. The playbook identifies three types of evaluation claims (capability elicitation, safeguard performance, and system comparison), each requiring different harness choices and evidence reporting. OpenAI emphasizes that harness design, including tool setup, scaffolding, and compute budget, materially changes measured capability; for example, UK AISI's cyber range evaluation showed performance improvements of up to 59% when token budgets increased from 10M to 100M.
For AI-in-commerce practitioners, this framework matters because it exposes how evaluation results can be misinterpreted or artificially inflated. The guidance warns against reward hacking, refusals that obscure behavior, training data contamination, broken problem definitions, and sandbagging, all of which can distort scores. When evaluating or deploying AI agents for order processing, customer service, or supply chain optimization, practitioners should demand evaluation reports that explicitly state the harness configuration, budget, tools available, and evidence that the measured capability generalizes to production conditions.
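One way a commerce team might put that checklist into practice is a small completeness check run against each vendor report. The sketch below is illustrative only: `EvalReport`, its field names, and `missing_disclosures` are hypothetical and do not come from OpenAI's playbook. It assumes the report has already been transcribed into structured fields.

```python
from dataclasses import dataclass
from typing import List, Optional

# The three claim types named in the guidance; the string labels are our own.
CLAIM_TYPES = {"capability_elicitation", "safeguard_performance", "system_comparison"}


@dataclass
class EvalReport:
    """Hypothetical structured summary of a vendor's evaluation report."""
    claim_type: Optional[str] = None
    harness_config: Optional[str] = None           # scaffolding and tool setup
    token_budget: Optional[int] = None             # compute budget per task
    tools: Optional[List[str]] = None              # None = not stated; [] = stated as no tools
    generalization_evidence: Optional[str] = None  # why results should transfer to production


def missing_disclosures(report: EvalReport) -> List[str]:
    """Return the names of disclosures the report fails to state."""
    missing = []
    if report.claim_type not in CLAIM_TYPES:
        missing.append("claim_type")
    if not report.harness_config:
        missing.append("harness_config")
    if report.token_budget is None or report.token_budget <= 0:
        missing.append("token_budget")
    if report.tools is None:
        missing.append("tools")
    if not report.generalization_evidence:
        missing.append("generalization_evidence")
    return missing


# Example: a report that states only the claim type and the budget.
vendor_report = EvalReport(claim_type="capability_elicitation", token_budget=10_000_000)
print(missing_disclosures(vendor_report))
# → ['harness_config', 'tools', 'generalization_evidence']
```

A check like this only confirms that each disclosure is present; judging whether the stated harness and budget resemble production conditions still requires human review.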
The playbook aligns with emerging industry standards for AI safety evaluation and signals OpenAI's commitment to transparency in third-party testing. Commerce teams should use this framework to critically assess vendor claims about AI agent performance and safety, particularly when agents operate in long-horizon, tool-using scenarios where harness choices have outsized impact on real-world outcomes.