NVIDIA has published details on Dynamo Snapshot, a new checkpoint/restore approach for AI inference workloads running on Kubernetes. The system addresses the cold-start problem, in which an inference replica can take several minutes to initialize while its GPUs sit idle. Dynamo Snapshot combines NVIDIA's `cuda-checkpoint` tool (which serializes GPU state) with CRIU (Checkpoint/Restore in Userspace, which captures Linux process state), so that the full state of an inference worker can be checkpointed to shared storage and restored on the same or a different node.
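To make the pairing of the two tools concrete, here is a minimal Python sketch of the manual command sequence that the standalone `cuda-checkpoint` and `criu` CLIs are generally used with. It is not Dynamo Snapshot's implementation: the function names, the example PID, and the `/mnt/snapshots/worker-0` directory are hypothetical, and the sketch only builds and prints the commands rather than running them. It assumes `cuda-checkpoint --toggle --pid` moves CUDA state into host memory (and back again on the second call) and that CRIU then dumps or restores the process tree.

```python
import shlex


def checkpoint_commands(pid: int, images_dir: str) -> list[list[str]]:
    """Build the argv lists for a manual checkpoint of a CUDA process."""
    return [
        # Step 1: suspend CUDA and move device state into the process's host memory.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # Step 2: dump the process tree (GPU state now lives in host memory) to disk.
        ["criu", "dump", "--shell-job", "--images-dir", images_dir, "--tree", str(pid)],
    ]


def restore_commands(pid: int, images_dir: str) -> list[list[str]]:
    """Build the argv lists for restoring that checkpoint."""
    return [
        # Step 1: recreate the Linux process from the image files.
        ["criu", "restore", "--shell-job", "--restore-detached", "--images-dir", images_dir],
        # Step 2: toggle again so CUDA state is copied back onto the GPU and resumed.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
    ]


if __name__ == "__main__":
    # Hypothetical worker PID and shared-storage path, for illustration only.
    steps = checkpoint_commands(4242, "/mnt/snapshots/worker-0")
    steps += restore_commands(4242, "/mnt/snapshots/worker-0")
    for argv in steps:
        print(shlex.join(argv))
```

The ordering is the point of the sketch: GPU state has to be moved into host memory before CRIU dumps the process, and it is handed back to the GPU only after CRIU has rebuilt the process.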
The architecture deploys a privileged DaemonSet called `snapshot-agent` that runs on every Kubernetes node and orchestrates checkpoint/restore without requiring modifications to the container runtime. A key optimization deallocates KV cache memory before checkpointing, which reduces artifact size from ~190 GiB to ~6 GiB for the tested models while preserving virtual address stability. For commerce practitioners running dynamic, auto-scaling inference services, especially large language model APIs and recommendation engines, this targets the latency penalty and GPU waste that currently occur during traffic spikes, which should improve SLA compliance and reduce infrastructure costs.
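As a rough illustration of why the smaller artifact matters, the sketch below works through the arithmetic for the two reported sizes. The 2 GiB/s storage read bandwidth is a hypothetical figure chosen for the example, not a number from NVIDIA, and the calculation ignores everything except the raw transfer of the artifact.

```python
FULL_ARTIFACT_GIB = 190.0     # reported size with the KV cache included
TRIMMED_ARTIFACT_GIB = 6.0    # reported size after deallocating the KV cache
ASSUMED_READ_GIB_PER_S = 2.0  # hypothetical shared-storage bandwidth


def transfer_seconds(size_gib: float, bandwidth_gib_per_s: float) -> float:
    """Time to read an artifact of the given size at a fixed bandwidth."""
    if bandwidth_gib_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gib / bandwidth_gib_per_s


# How many times smaller the trimmed artifact is than the full one.
reduction_factor = FULL_ARTIFACT_GIB / TRIMMED_ARTIFACT_GIB

if __name__ == "__main__":
    full_s = transfer_seconds(FULL_ARTIFACT_GIB, ASSUMED_READ_GIB_PER_S)
    trimmed_s = transfer_seconds(TRIMMED_ARTIFACT_GIB, ASSUMED_READ_GIB_PER_S)
    print(f"size reduction: {reduction_factor:.1f}x")
    print(f"read time at assumed bandwidth: {full_s:.0f} s vs {trimmed_s:.0f} s")
```

Under that assumed bandwidth, the artifact read drops from about a minute and a half to a few seconds. Real restore time also depends on CRIU's memory restoration and on reinitializing the GPU state.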
NVIDIA notes that further CRIU optimizations for parallel memory restoration are in development and will ship once they are merged into upstream CRIU. If those gains materialize, Dynamo Snapshot could become a foundational technology for cost-efficient, responsive inference scaling in production e-commerce and marketplace environments where demand fluctuates rapidly.