Microsoft releases Lens, a 3.8B text-to-image model

Microsoft published Lens, a 3.8B-parameter text-to-image model that matches or exceeds larger models (6B+ parameters) while using only 19.3% of their training compute, thanks to dense captions and multi-resolution batching. For commerce teams, that means faster, cheaper image generation for product catalogs and visual search without the infrastructure cost of larger models.

Microsoft introduced Lens, a compact 3.8B-parameter text-to-image model whose performance is competitive with or superior to that of larger models (6B+ parameters) while requiring significantly less training compute. The model was trained on Lens-800M, a dataset of 800M image-text pairs with dense GPT-4.1-generated captions averaging 109 words each. Training combined this data with multi-resolution batching and architecture choices that include a semantic VAE and a strong language encoder. Lens generates 1024² images in 3.15 seconds on a single H100 GPU, and a distilled turbo variant completes 4-step generation in 0.84 seconds.
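To put the quoted latencies in capacity-planning terms, the following is a rough back-of-the-envelope sketch in Python. It uses only the two figures reported above (3.15 s and 0.84 s per 1024² image on one H100) and assumes strictly sequential generation with no batching, queueing, or model-loading overhead, so real throughput will differ. The function names are illustrative and not part of any Lens tooling.

```python
# Latencies quoted in the text: seconds per 1024x1024 image on a single H100.
BASE_LATENCY_S = 3.15   # standard Lens model
TURBO_LATENCY_S = 0.84  # distilled turbo variant, 4-step generation


def images_per_hour(latency_s: float) -> float:
    """Sequential single-GPU throughput (assumption: no batching or overhead)."""
    if latency_s <= 0:
        raise ValueError("latency must be positive")
    return 3600.0 / latency_s


def speedup(base_s: float, fast_s: float) -> float:
    """How many times faster the second latency is than the first."""
    return base_s / fast_s


base_rate = images_per_hour(BASE_LATENCY_S)
turbo_rate = images_per_hour(TURBO_LATENCY_S)
turbo_speedup = speedup(BASE_LATENCY_S, TURBO_LATENCY_S)

print(f"base:  {base_rate:.0f} images/hour")
print(f"turbo: {turbo_rate:.0f} images/hour")
print(f"turbo speedup: {turbo_speedup:.2f}x")
```

Under these assumptions the turbo variant is about 3.75 times faster per image, which is the ratio that matters most when sizing a GPU fleet for a catalog backfill.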

For commerce practitioners, Lens addresses a critical pain point: the high cost of AI image generation infrastructure. The model's compact size and training efficiency mean lower deployment costs, faster inference for real-time product visualization, and reduced GPU requirements for scaling visual content pipelines. Support for multiple languages, arbitrary aspect ratios (1:2 to 2:1), and resolutions up to 1440² makes it practical for diverse catalog and marketplace use cases without lock-in to larger, costlier models.
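As an illustration of what the aspect-ratio range means for a catalog pipeline, here is a minimal Python sketch that turns a requested ratio into pixel dimensions at a fixed pixel budget. The 1:2 to 2:1 range and the 1440² ceiling come from the text above. Everything else is an assumption made for the example: the helper name `dims_for_ratio`, the area-preserving approach, and the snapping of each side to a multiple of 16 are hypothetical and are not documented Lens behavior.

```python
import math

MIN_RATIO = 0.5          # 1:2, narrowest ratio quoted in the text
MAX_RATIO = 2.0          # 2:1, widest ratio quoted in the text
MAX_BUDGET_SIDE = 1440   # "resolutions up to 1440²", read as a pixel budget


def dims_for_ratio(ratio_w: int, ratio_h: int, side: int = 1024,
                   multiple: int = 16) -> tuple[int, int]:
    """Return (width, height) with roughly side*side pixels at the given ratio.

    Assumption: sides are snapped to a multiple of `multiple`; this is a
    common convention for latent-space models, not a confirmed Lens rule.
    """
    ratio = ratio_w / ratio_h
    if not (MIN_RATIO <= ratio <= MAX_RATIO):
        raise ValueError(f"aspect ratio {ratio_w}:{ratio_h} is outside 1:2 to 2:1")
    if side > MAX_BUDGET_SIDE:
        raise ValueError(f"pixel budget {side}² exceeds {MAX_BUDGET_SIDE}²")

    # Keep the total area constant: width * height == side * side.
    width = math.sqrt(side * side * ratio)
    height = width / ratio

    def snap(value: float) -> int:
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


for w, h in [(1, 1), (2, 1), (1, 2), (4, 3)]:
    print(f"{w}:{h} ->", dims_for_ratio(w, h))
```

A pipeline might call this once per product template (square thumbnails, wide banners, tall mobile cards) and reject any template whose ratio falls outside the supported range before a generation request is ever sent.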

The release positions Microsoft competitively against larger open-source models and proprietary APIs, while the distillation-based acceleration approach suggests a path toward even faster edge deployment for commerce applications like mobile product search and dynamic catalog generation.

Hugging Face