StepFun introduced Step 3.7 Flash, a 198B-parameter Mixture-of-Experts vision-language model with approximately 11B activated parameters per forward pass, designed for enterprise-scale multimodal AI applications. The model supports native image and video input, three configurable reasoning levels, and a 256k context window. It is available through Hugging Face with NVFP4 quantization and can be deployed across open-source frameworks including NVIDIA TensorRT-LLM, SGLang, and vLLM to leverage NVIDIA-optimized kernels.
For commerce practitioners, Step 3.7 Flash enables production-grade agentic workflows combining perception, search, and multi-step reasoning—critical for document intelligence pipelines that extract structured insights from financial reports, invoices, and complex PDFs. NVIDIA NIM packages the model as containerized inference microservices with standardized OpenAI-compatible APIs, supporting on-premises, cloud, and hybrid deployments. The NVIDIA NeMo framework enables Day 0 fine-tuning with supervised fine-tuning (SFT) and LoRA techniques at 600 tokens/sec on Hopper GPUs, allowing teams to customize the model for domain-specific commerce use cases without checkpoint conversion overhead.
This release positions NVIDIA's ecosystem as a comprehensive stack for multimodal AI in commerce—from prototyping on build.nvidia.com endpoints through production deployment and customization. The combination of high-throughput inference, flexible deployment options, and native fine-tuning support lowers barriers for retailers and financial services firms to integrate vision-language reasoning into operational workflows.