NEO-ov is a native foundation model that replaces the modular architecture used in current vision-language models (VLMs). Instead of stitching together separate image encoders and language decoders through multi-stage alignment, NEO-ov learns cross-frame and pixel-word correspondences end-to-end, without external encoders, adapters, or post-hoc fusion. This enables unified spatiotemporal modeling across multi-image understanding, video understanding, and fine-grained visual perception tasks. The code and models are publicly available on GitHub.
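
To make the "native" idea concrete, here is a minimal sketch of one way raw pixel patches from several frames and the words of a prompt could be placed in a single token sequence, with no separate image-encoder stage in between. This is an illustration only, not NEO-ov's actual implementation: the names `Token`, `patchify`, and `build_sequence`, the patch size, and the position scheme are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Token:
    """One entry in the unified sequence (hypothetical structure)."""
    kind: str      # "patch" for pixels, "word" for text
    value: object  # a tuple-of-tuples pixel tile, or a word string
    frame: int     # frame index for patches, -1 for words
    row: int       # patch row within the frame (0 for words)
    col: int       # patch column within the frame, or word index


def patchify(frame: Sequence[Sequence[int]], patch: int) -> List[Tuple[int, int, tuple]]:
    """Split a 2D pixel grid into non-overlapping patch x patch tiles."""
    height, width = len(frame), len(frame[0])
    tiles = []
    for r in range(0, height - patch + 1, patch):
        for c in range(0, width - patch + 1, patch):
            tile = tuple(tuple(frame[r + i][c:c + patch]) for i in range(patch))
            tiles.append((r // patch, c // patch, tile))
    return tiles


def build_sequence(frames, words, patch: int = 2) -> List[Token]:
    """Place patches from every frame and all words in one sequence.

    Assumption: a single downstream model would consume this sequence
    directly, so cross-frame and pixel-word relations are learned jointly.
    """
    seq: List[Token] = []
    for t, frame in enumerate(frames):
        for r, c, tile in patchify(frame, patch):
            seq.append(Token("patch", tile, t, r, c))
    for i, word in enumerate(words):
        seq.append(Token("word", word, -1, 0, i))
    return seq


if __name__ == "__main__":
    # Two tiny 4x4 "frames" and a three-word prompt.
    frames = [[[0] * 4 for _ in range(4)], [[1] * 4 for _ in range(4)]]
    sequence = build_sequence(frames, ["red", "shoe", "?"], patch=2)
    print(len(sequence), sequence[0].kind, sequence[-1].kind)
```

Because each patch token keeps its frame, row, and column indices, a model reading this sequence would have access to both spatial and temporal position without a separate fusion step. A modular pipeline, by contrast, would pass the frames through an external encoder first and hand only its output features to the language model.
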
For AI-in-commerce practitioners, NEO-ov's native "one-vision" architecture addresses a critical pain point: the fragmentation and latency introduced by modular systems. E-commerce applications such as product visual search, video catalog understanding, and spatial intelligence for AR try-on can benefit from the model's end-to-end learning and improved fine-grained perception, since there is no separate encoder or adapter stage to act as a bottleneck. The paper's systematic architectural analyses and training recipes also lower the barrier to implementing native multimodal systems in production commerce workflows.
Two models, NEO1.5-9B-SFT and NEO1.5-2B-SFT, were released on Hugging Face within hours of publication, which signals early momentum behind the approach. Commerce teams should monitor whether native VLM architectures become the new standard, since that could reshape how product understanding and visual search pipelines are built.