Researchers have published AXPO (Agent eXplorative Policy Optimization), a training method aimed at a critical weakness in agentic reasoning systems: the Thinking-Acting Gap. Vision-language models with extended reasoning often fail to use external tools reliably. Under standard RL approaches such as GRPO, only ~30% of rollouts use a tool at all, and ~40% of tool-using attempts produce all-wrong responses. AXPO addresses this by locking the thinking prefix and resampling the tool calls that follow it, combined with uncertainty-based prefix selection. The reported gains are +1.8pp in Pass@1 and Pass@4 across nine multimodal benchmarks.
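The prefix-locking idea can be illustrated with a toy sketch. This is not the paper's implementation: the `Rollout` fields, the scalar `uncertainty` score, the rule of picking the highest-scoring prefix, the `toy_policy` stand-in for a model, and the `crop_image` tool name are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    prefix: str         # thinking text produced before any tool call
    continuation: str   # what followed: a tool call or a direct answer
    uncertainty: float  # hypothetical score, e.g. mean token entropy of the prefix


def select_prefix(rollouts: List[Rollout]) -> str:
    """Return the thinking prefix with the highest uncertainty score (assumed rule)."""
    return max(rollouts, key=lambda r: r.uncertainty).prefix


def resample_with_locked_prefix(
    prefix: str, sample_continuation: Callable[[str], str], n: int
) -> List[str]:
    """Hold the prefix fixed and draw n fresh continuations after it."""
    return [prefix + sample_continuation(prefix) for _ in range(n)]


def toy_policy(prefix: str, rng: random.Random) -> str:
    """Stand-in for a model: emits a tool call or a direct answer at random."""
    return rng.choice(
        ["<tool_call>crop_image</tool_call>", "<answer>unknown</answer>"]
    )


rng = random.Random(0)
group = [
    Rollout("The chart is clear. ", "<answer>A</answer>", uncertainty=0.1),
    Rollout("The label is tiny, so ", "<answer>B</answer>", uncertainty=0.9),
    Rollout("Two readings are plausible. ", "<answer>A</answer>", uncertainty=0.5),
]

locked = select_prefix(group)
extra = resample_with_locked_prefix(locked, lambda p: toy_policy(p, rng), n=4)
```

Every string in `extra` shares the same thinking prefix, so any difference in outcome comes from what the policy did after it. In a real training loop the extra samples would be scored and fed back into the policy update, which this sketch leaves out.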
For commerce practitioners, AXPO represents a practical path to more capable AI agents without expensive model scaling. An 8B-parameter model trained with SFT+AXPO outperforms a 32B base model on Pass@4, which suggests commerce teams could deploy faster, cheaper agents for product discovery, order fulfillment queries, and customer support automation. The method targets the reliability gap that has limited agentic tool use in production e-commerce systems.
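For readers unfamiliar with the metric behind that comparison, Pass@k is the probability that at least one of k sampled responses is correct. The paper's exact evaluation protocol is not described here, so the following is only a sketch of a widely used unbiased estimator, with `n` samples drawn per problem and `c` of them correct; the function name and argument checks are illustrative choices.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples per problem, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the chance that a
    random size-k subset of the n samples contains no correct response.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb returns 0 when k > n - c, so the estimate is 1.0 in that case
    return 1.0 - comb(n - c, k) / comb(n, k)


# One correct response out of four samples
p1 = pass_at_k(n=4, c=1, k=1)
p4 = pass_at_k(n=4, c=1, k=4)
```

With one correct response in four samples, `p1` is 0.25 and `p4` is 1.0. That difference is why Pass@4 rewards a model that finds the right answer at least occasionally, while Pass@1 rewards one that finds it consistently.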
The paper reports results for Qwen3-VL-Thinking at multiple scales, which suggests the method is not tied to a single model size. Commerce organizations experimenting with vision-language agents should monitor whether AXPO or similar exploration-based policy methods become standard in commercial model fine-tuning, as they could enable more autonomous and cost-efficient shopping experiences.