Multimodal AI models scale toward unified vision-language systems
Thursday, May 28, 2026

LLM, Alibaba, Qwen, AXPO, Qwen3-VL-Thinking

AXPO improves vision-language agent tool use and reasoning

Researchers introduced AXPO, a policy optimization method that targets the Thinking-Acting Gap in vision-language models, where standard RL training yields tool use on only ~30% of rollouts. AXPO raises tool utilization and success rates through thinking prefix optimization and tool call resampling. For commerce practitioners building AI agents, this points to more reliable autonomous tool use in product search, inventory queries, and customer service workflows without scaling model size.

Researchers published AXPO (Agent eXplorative Policy Optimization), a training method designed to address a critical weakness in agentic reasoning systems: the Thinking-Acting Gap. Vision-language models with extended reasoning often fail to use external tools reliably. Standard RL approaches such as GRPO achieve tool use on only ~30% of rollouts and generate all-wrong responses on ~40% of tool-using attempts. AXPO addresses this by locking the thinking prefix and resampling tool calls, with the prefix chosen by an uncertainty-based selection step. The reported gains are +1.8pp in Pass@1 and Pass@4 across nine multimodal benchmarks.
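The summary above names three mechanisms (a locked thinking prefix, resampled tool calls, and uncertainty-based prefix selection) without giving algorithmic detail. The following is only a toy sketch of how those pieces could fit together, not the paper's implementation. The names `PrefixCandidate`, `uncertainty`, `select_prefix`, and `resample_tool_calls` are hypothetical, and using the Bernoulli variance p(1-p) of past rewards as the uncertainty score is an assumption made here for illustration.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PrefixCandidate:
    """Hypothetical record: a thinking prefix and rewards of rollouts that continued it."""
    prefix: str
    rewards: List[float] = field(default_factory=list)  # 1.0 = correct, 0.0 = wrong


def uncertainty(rewards: List[float]) -> float:
    """Assumed uncertainty score: Bernoulli variance p*(1-p) of the observed rewards.

    It is 0 when every rollout agrees (all right or all wrong) and largest
    when outcomes are evenly mixed.
    """
    if not rewards:
        return 0.0
    p = sum(rewards) / len(rewards)
    return p * (1.0 - p)


def select_prefix(candidates: List[PrefixCandidate]) -> PrefixCandidate:
    """Pick the candidate whose outcomes are most mixed (ties go to the first)."""
    return max(candidates, key=lambda c: uncertainty(c.rewards))


def resample_tool_calls(
    prefix: str,
    sample_tool_call: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int,
) -> List[float]:
    """Hold the prefix fixed, draw n fresh tool calls after it, and score each one."""
    return [score(prefix, sample_tool_call(prefix)) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    candidates = [
        PrefixCandidate("always right", [1.0, 1.0, 1.0, 1.0]),
        PrefixCandidate("mixed", [1.0, 0.0, 1.0, 0.0]),
        PrefixCandidate("always wrong", [0.0, 0.0, 0.0, 0.0]),
    ]
    chosen = select_prefix(candidates)
    # Toy stand-ins for a policy's tool-call sampler and a task reward.
    rewards = resample_tool_calls(
        chosen.prefix,
        sample_tool_call=lambda p: rng.choice(["zoom", "search"]),
        score=lambda p, call: 1.0 if call == "zoom" else 0.0,
        n=4,
    )
    print(chosen.prefix, rewards)
```

In this sketch, prefixes whose rollouts are all right or all wrong score zero, so the extra sampling budget goes to prefixes where a different tool call might change the outcome.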

For commerce practitioners, AXPO represents a practical path to more capable AI agents without expensive model scaling. An 8B-parameter model trained with SFT+AXPO outperforms a 32B base model on Pass@4 metrics, meaning commerce teams can deploy faster, cheaper agents for product discovery, order fulfillment queries, and customer support automation. The method directly addresses the reliability gap that has limited agentic tool use in production e-commerce systems.
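For readers unfamiliar with the Pass@4 figure in the comparison above: Pass@k is the probability that at least one of k sampled responses is correct. The digest does not say how the paper computes it, so the sketch below uses the combinatorial estimator that is widely used for Pass@k, 1 - C(n-c, k) / C(n, k) for n samples with c correct. The function name `pass_at_k` is chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n sampled responses, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the chance that a random
    size-k subset of the n samples contains no correct response.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the result is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # One correct answer among four samples.
    print(pass_at_k(n=4, c=1, k=1))  # 0.25
    print(pass_at_k(n=4, c=1, k=4))  # 1.0
```

With one correct answer in four samples, Pass@1 is 0.25 while Pass@4 is 1.0, which is why Pass@4 rewards a model that explores varied tool calls even when most individual attempts fail.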

The paper reports results for Qwen3-VL-Thinking at multiple scales, which suggests the method is not tied to a single model size. Commerce organizations experimenting with vision-language agents should monitor whether AXPO or similar exploration-based policy methods become standard in commercial model fine-tuning, as they could unlock more autonomous and cost-efficient shopping experiences.

Hugging Face