NVIDIA has released v0.15 of the NVIDIA TensorRT Model Optimizer, a toolkit of model optimization techniques including quantization, sparsity, and pruning. The update aims to reduce model complexity and speed up inference for generative AI models, according to the NVIDIA Technical Blog.
Cache Diffusion
The new version includes support for cache diffusion, building on the previously established 8-bit post-training quantization (PTQ) technique. This feature accelerates diffusion models at inference time by reusing cached outputs from previous denoising steps. Methods like DeepCache and block caching optimize inference speed without additional training. This mechanism leverages the temporal consistency of high-level features between consecutive denoising steps, making it compatible with models like DiT and UNet.
Developers can enable cache diffusion by using a single ‘cachify’ instance in the Model Optimizer with the diffusion pipeline. For instance, enabling cache diffusion in a Stable Diffusion XL (SDXL) model on an NVIDIA H100 Tensor Core GPU delivers a 1.67x speedup in images per second. This speedup further increases when FP8 is also enabled.
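To make the caching idea concrete, here is a minimal toy sketch in plain Python. It does not use the Model Optimizer API; the class `CachedDenoiser`, its `interval` parameter, and the arithmetic inside `shallow` and `deep` are all hypothetical stand-ins. It assumes that the expensive "deep" part of a denoiser changes slowly between consecutive steps, so its output can be reused for a few steps while the cheap "shallow" part runs every time.

```python
class CachedDenoiser:
    """Toy denoiser that recomputes its expensive deep block only every `interval` steps."""

    def __init__(self, interval=2):
        self.interval = interval      # hypothetical knob: refresh the cache every N steps
        self.cached_deep = None       # last output of the deep block
        self.deep_calls = 0           # counts how often the expensive path actually ran

    def shallow(self, x):
        # Cheap computation that runs on every denoising step.
        return [0.9 * v for v in x]

    def deep(self, x):
        # Stand-in for the expensive high-level feature computation.
        self.deep_calls += 1
        mean = sum(x) / len(x)
        return [0.1 * mean for _ in x]

    def step(self, x, step_index):
        # Refresh the cache on the first step and then every `interval` steps;
        # otherwise reuse the cached high-level features.
        if self.cached_deep is None or step_index % self.interval == 0:
            self.cached_deep = self.deep(x)
        low = self.shallow(x)
        return [a - b for a, b in zip(low, self.cached_deep)]


def run_denoising(num_steps=10, interval=2):
    """Run the toy loop and report how many deep-block evaluations were needed."""
    model = CachedDenoiser(interval)
    x = [1.0, 2.0, 3.0, 4.0]
    for i in range(num_steps):
        x = model.step(x, i)
    return x, model.deep_calls
```

In this toy setup, `run_denoising(10, 2)` evaluates the deep block 5 times instead of 10. Real speedups depend on the model and on which blocks are cached, so the sketch illustrates only the mechanism, not the reported 1.67x figure.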
Quantization-Aware Training with NVIDIA NeMo
Quantization-aware training (QAT) simulates the effects of quantization during neural network training so that the model can recover accuracy lost to quantization. The process involves computing scaling factors and incorporating simulated quantization loss into fine-tuning. Model Optimizer uses custom CUDA kernels for simulated quantization, producing lower-precision model weights and activations for efficient hardware deployment.
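The scaling-factor and simulated-quantization steps can be sketched in a few lines of plain Python. This is an illustrative assumption, not Model Optimizer's kernels: the function names `compute_scale` and `fake_quantize` are hypothetical, the scheme is symmetric per-tensor max calibration, and only the forward pass is shown.

```python
def compute_scale(values, num_bits=8):
    """Symmetric per-tensor scale: map the largest magnitude to the top integer level."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for 8 bits
    amax = max(abs(v) for v in values)
    return amax / qmax if amax > 0 else 1.0


def fake_quantize(values, scale, num_bits=8):
    """Quantize to integer levels, then dequantize, so the rounding error shows up in floats."""
    qmax = 2 ** (num_bits - 1) - 1
    out = []
    for v in values:
        q = round(v / scale)                # snap to the nearest integer level
        q = max(-qmax, min(qmax, q))        # clamp to the representable range
        out.append(q * scale)               # back to floating point
    return out


weights = [0.5, -1.27, 0.003, 1.0]
scale = compute_scale(weights)
simulated = fake_quantize(weights, scale)
# Per-element error that training would see and learn to compensate for.
errors = [abs(w - s) for w, s in zip(weights, simulated)]
```

Because the network trains against `simulated` rather than `weights`, the loss already reflects the rounding error, which is what lets fine-tuning compensate for it. Each entry of `errors` is at most half of `scale` under this scheme.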
Model Optimizer v0.15 expands QAT integration support to include NVIDIA NeMo, an enterprise-grade platform for developing custom generative AI models. This first-class support for NeMo models allows users to fine-tune models directly with the original training pipeline. For more details, see the QAT example in the NeMo GitHub repository.
QLoRA Workflow
Quantized Low-Rank Adaptation (QLoRA) is a fine-tuning technique that reduces memory usage and computational complexity during model training. It combines quantization with Low-Rank Adaptation (LoRA), making large language model (LLM) fine-tuning more accessible. Model Optimizer now supports the QLoRA workflow with NVIDIA NeMo using the NF4 data type. For a Llama 13B model on the Alpaca dataset, QLoRA can reduce peak memory usage by 29-51% while maintaining model accuracy.
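The combination of a quantized base weight and a low-rank update can be sketched as follows. This is a simplified illustration under stated assumptions: it uses uniform 4-bit integer levels as a stand-in for NF4 (which uses a different, non-uniform set of levels), and the helper names `quantize_4bit`, `dequantize`, `matvec`, and `qlora_forward` are hypothetical rather than part of NeMo or Model Optimizer.

```python
def quantize_4bit(matrix):
    """Store a frozen weight matrix as small integer codes plus one scale (uniform, not NF4)."""
    amax = max(abs(v) for row in matrix for v in row) or 1.0
    scale = amax / 7                                   # signed levels -7..7
    codes = [[max(-7, min(7, round(v / scale))) for v in row] for row in matrix]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floating-point weights from the stored codes."""
    return [[c * scale for c in row] for row in codes]


def matvec(matrix, vec):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def qlora_forward(codes, scale, lora_a, lora_b, x):
    """Output = frozen quantized base path + trainable low-rank path B(Ax)."""
    base = matvec(dequantize(codes, scale), x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + u for b, u in zip(base, update)]


# 2x2 base weight, rank-1 adapters. B starts at zero, so the adapter is a no-op at first.
base_weight = [[0.7, -0.3], [0.1, 0.7]]
codes, scale = quantize_4bit(base_weight)
lora_a = [[0.5, 0.5]]          # shape (rank, in_features)
lora_b = [[0.0], [0.0]]        # shape (out_features, rank)
y = qlora_forward(codes, scale, lora_a, lora_b, [1.0, 2.0])
```

Only `lora_a` and `lora_b` would receive gradient updates during fine-tuning, while the base weight stays in its compact quantized form. That split is the source of the memory savings, although the actual percentage depends on the model, rank, and data type.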
Expanded Support for AI Models
The latest release also expands support for a wider suite of AI models, including Stability.ai’s Stable Diffusion 3, Google’s RecurrentGemma, Microsoft’s Phi-3, Snowflake’s Arctic 2, and Databricks’ DBRX. For more details, refer to the example scripts and support matrix available in the Model Optimizer GitHub repository.
Get Started
NVIDIA TensorRT Model Optimizer provides seamless integration with NVIDIA TensorRT-LLM and TensorRT for deployment. It is available for installation on PyPI as nvidia-modelopt. Visit the NVIDIA TensorRT Model Optimizer GitHub page for example scripts and recipes for inference optimization. Comprehensive documentation is also available.