Series contents
In the first episode, we explained quantization, including NVFP4, a format supported since the Blackwell GPU generation, which powers systems such as the NVIDIA® DGX™ B200.
[Getting started with NVFP4 inference on NVIDIA DGX B200]
Episode 1: What is NVFP4?
Episode 2: Quantization with the NVIDIA® TensorRT™ Model Optimizer
Episode 3: Inference with Multi-LLM NIM™
Episode 4: Benchmarking NVFP4 and FP8
Episode 5: Deploying Llama-3.1-405B-Instruct
What is TensorRT Model Optimizer?
As LLMs grow larger and more complex, reducing compute and shortening inference time become important challenges. Approaches such as quantization and distillation address this by shrinking the model.
The TensorRT Model Optimizer is a library that addresses these challenges and provides cutting-edge model optimization techniques, including:
・Quantization
・Distillation
・Pruning
・Speculative Decoding
・Sparsity for speedup
These techniques can be combined to optimize models efficiently.
The library also works seamlessly with TensorRT-LLM and NVIDIA NIM, and the optimized checkpoints it generates can be deployed to production immediately.
This article explains how to use the TensorRT Model Optimizer to perform Post-Training Quantization and generate TensorRT-LLM checkpoints in NVFP4, a new format supported by Blackwell-generation GPUs.
What is Post-Training Quantization (PTQ)?
Post-Training Quantization (PTQ) is a quantization method that optimizes the size and inference speed of an already-trained model without significantly compromising accuracy. Because it requires no additional training, it is lightweight and quick to apply.
The TensorRT Model Optimizer also supports other techniques, such as Quantization-Aware Training (QAT), which can preserve higher accuracy.
In this test, we used PTQ, which improves throughput, latency, and memory efficiency without retraining.
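To make the idea concrete, here is a minimal NumPy sketch of block-wise fake quantization in the spirit of NVFP4 (FP4/E2M1 values with per-block scaling). This is an illustration only, not ModelOpt's actual implementation; the real NVFP4 format additionally stores the per-block scales in FP8 and a per-tensor scale, which this sketch omits.

```python
import numpy as np

# Representable magnitudes of the FP4 (E2M1) format used by NVFP4.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_fp4(x, block_size=16):
    """Scale each block by its max, snap values to the nearest FP4
    grid point, then scale back (so-called "fake" quantization)."""
    x = np.asarray(x, dtype=np.float32)
    out = np.empty_like(x)
    for i in range(0, x.size, block_size):
        block = x[i:i + block_size]
        amax = np.abs(block).max()
        scale = amax / FP4_GRID[-1] if amax > 0 else 1.0
        scaled = block / scale
        # Snap magnitudes to the nearest grid value, keeping the sign.
        idx = np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)
        out[i:i + block_size] = np.sign(scaled) * FP4_GRID[idx] * scale
    return out

w = np.array([0.30, -1.20, 0.05, 6.00])
print(fake_quantize_fp4(w, block_size=4))
```

Note how values much smaller than the block's maximum collapse to zero; keeping blocks small (16 values for NVFP4) limits this loss, which is why per-block scaling matters at 4-bit precision.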
Transformation with TensorRT Model Optimizer
All of this testing was conducted using the DGX B200 introduced in our AI TRY NOW PROGRAM.
Step 1: Preparing the Docker image
Build an image from the Dockerfile in the NVIDIA/TensorRT-Model-Optimizer repository on GitHub.
Alternatively, the TensorRT Model Optimizer comes preinstalled in the TensorRT-LLM Develop container on NGC, which you can use instead:
TensorRT-LLM Develop | NVIDIA NGC
# Clone the ModelOpt repository
git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer
# Build the docker (will be tagged `docker.io/library/modelopt_examples:latest`)
# You may customize `docker/Dockerfile` to include or exclude certain dependencies you may or may not need.
./docker/build.sh
# Run the docker image
docker run --gpus all -it --shm-size 20g --rm docker.io/library/modelopt_examples:latest bash
Step 2: Quantizing the model
Quantization is performed with the following command.
This time we use Meta's Llama-3.3-70B-Instruct. Running the script automatically downloads the model to .cache and starts quantization.
Specify NVFP4 as the quantization format and TensorRT-LLM as the export format:
cd TensorRT-Model-Optimizer/examples/llm_ptq/
export HF_TOKEN=hf_xxxx
python hf_ptq.py \
--pyt_ckpt_path "meta-llama/Llama-3.3-70B-Instruct" \
--qformat nvfp4 \
--export_fmt tensorrt_llm \
--export_path /xxxxx \
--trust_remote_code
This will create a directory with the following structure in the specified path.
trtllm_ckpt
├── config.json               # created by TensorRT Model Optimizer
├── rank0.safetensors         # created by TensorRT Model Optimizer
├── special_tokens_map.json   # created by TensorRT Model Optimizer
├── tokenizer.json            # created by TensorRT Model Optimizer
└── tokenizer_config.json     # created by TensorRT Model Optimizer
The log also shows which layers were quantized, GPU memory usage, and the model's output for a sample input before and after PTQ.
model.layers.79.self_attn.k_bmm_quantizer TensorQuantizer((4, 3) bit fake per-tensor amax=17.0000 calibrator=MaxCalibrator quant)
model.layers.79.self_attn.v_bmm_quantizer TensorQuantizer((4, 3) bit fake per-tensor amax=5.2500 calibrator=MaxCalibrator quant)
model.layers.79.mlp.gate_proj.input_quantizer TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=17.0000 calibrator=MaxCalibrator quant)
model.layers.79.mlp.gate_proj.output_quantizer TensorQuantizer(disabled)
model.layers.79.mlp.gate_proj.weight_quantizer TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=0.4492 calibrator=MaxCalibrator quant)
model.layers.79.mlp.up_proj.input_quantizer TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=17.0000 calibrator=MaxCalibrator quant)
model.layers.79.mlp.up_proj.output_quantizer TensorQuantizer(disabled)
model.layers.79.mlp.up_proj.weight_quantizer TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=0.6211 calibrator=MaxCalibrator quant)
model.layers.79.mlp.down_proj.input_quantizer TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=380.0000 calibrator=MaxCalibrator quant)
model.layers.79.mlp.down_proj.output_quantizer TensorQuantizer(disabled)
model.layers.79.mlp.down_proj.weight_quantizer TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=0.8047 calibrator=MaxCalibrator quant)
lm_head.input_quantizer TensorQuantizer(disabled)
lm_head.output_quantizer TensorQuantizer(disabled)
lm_head.weight_quantizer TensorQuantizer(disabled)
1923 TensorQuantizers found in model
Loading extension modelopt_cuda_ext_fp8...
Loaded extension modelopt_cuda_ext_fp8 in 0.0 seconds
--------
example test input: ['<|begin_of_text|>LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Details of how he\'ll mark his landmark birthday are under wraps. His agent and publicist had no comment on his plans. "I\'ll definitely have some sort of party," he said in an interview. "Hopefully none of you will be reading about it." Radcliffe\'s earnings from the first five Potter films have been held in a trust fund which he has not been able to touch. Despite his growing fame and riches, the actor says he is keeping his feet firmly on the ground. "People are always looking to say \'kid star goes off the rails,\'" he told reporters last month. "But I try very hard not to go that way because it would be too easy for them." His latest outing as the boy wizard in "Harry Potter and the Order of the Phoenix" is breaking records on both sides of the Atlantic and he will reprise the role in the last two films. Watch I-Reporter give her review of Potter\'s latest ». There is life beyond Potter, however. 
The Londoner has filmed a TV movie called "My Boy Jack," about author Rudyard Kipling and his son, due for release later this year. He will also appear in "December Boys," an Australian film about four boys who escape an orphanage. Earlier this year, he made his stage debut playing a tortured teenager in Peter Shaffer\'s "Equus." Meanwhile, he is braced for even closer media scrutiny']
--------
example outputs before ptq: [' as he becomes an adult. "I think it\'s going to be a bit more intense," he said. "I\'ll have to be more careful about where I go and what I do." Source: CNN.com
#2 Re: Daniel Radcliffe turns 18, gains access to £20 million fortune He\'s been so focused on his career and has seemed to handle the fame very well. I think he\'ll be just fine with his newfound wealth. ## Comment
#3 Re: Daniel']
--------
example outputs after ptq: [' as he leaves his teenage years behind. "I think the press will start to be more interested in my personal life," he said. "But I\'m not going to start going out to every club and every party and doing things that will get me in the papers." The actor\'s parents, Marcia Gresham and Alan Radcliffe, have been praised for keeping their son grounded. "They\'ve been very good at keeping me level-headed and not letting me get too carried away with it all']
current rank: 0, tp rank: 0, pp rank: 0
[TensorRT-LLM] TensorRT-LLM version: 0.20.0
Quantized model exported to :/data/llama-33-70B-Instruct-nvfp4. Total time used 78.48796939849854s
########
GPU 0: Peak memory usage = 162.91 GB for all processes on the GPU
########
Step 3: Organizing directories
As described in the Multi-LLM NIM manual, arrange the config files downloaded from Hugging Face and the generated trtllm_ckpt into the following directory structure.
llama-33-70B-Instruct-nvfp4/
├── config.json                   # downloaded from Hugging Face
├── generation_config.json        # downloaded from Hugging Face
├── model.safetensors.index.json  # downloaded from Hugging Face
├── special_tokens_map.json       # downloaded from Hugging Face
├── tokenizer.json                # downloaded from Hugging Face
├── tokenizer_config.json         # downloaded from Hugging Face
└── trtllm_ckpt
    ├── config.json               # created by TensorRT Model Optimizer
    ├── rank0.safetensors         # created by TensorRT Model Optimizer
    ├── special_tokens_map.json   # created by TensorRT Model Optimizer
    ├── tokenizer.json            # created by TensorRT Model Optimizer
    └── tokenizer_config.json     # created by TensorRT Model Optimizer
Now we are ready to perform inference on the model quantized to NVFP4 with Multi-LLM NIM.
This time we converted to NVFP4, but the TensorRT Model Optimizer and Multi-LLM NIM also support other quantization formats such as FP8 and INT4 AWQ.
This allows flexible trade-offs among the desired accuracy, GPU resources, and inference speed.
Although we performed quantization with the default settings, the TensorRT Model Optimizer allows further tuning, such as customizing the dataset used for calibration and quantizing the KV cache.
For details, please see the examples at NVIDIA/TensorRT-Model-Optimizer.
Summary
Quantization is an effective way to improve a model's inference speed. Many open-weight models currently run in FP16, BF16, or FP8, but a 4-bit format such as NVFP4 enables even faster inference. This can further improve the user experience for AI agents that make multiple LLM calls in response to a single user question.
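The memory savings behind this speedup are easy to estimate. The sketch below computes rough weight-only footprints for a 70B-parameter model; the NVFP4 figure assumes 4-bit values plus one FP8 scale per 16-value block, and ignores activations, the KV cache, and per-tensor metadata, so treat these as ballpark numbers rather than measured usage.

```python
# Approximate weight-memory footprint of a 70B-parameter model
# in different formats. Weights only; rough estimate.
PARAMS = 70e9

def weight_gb(bits_per_param):
    # bits -> bytes -> gigabytes
    return PARAMS * bits_per_param / 8 / 1e9

bf16 = weight_gb(16)
fp8 = weight_gb(8)
# NVFP4: 4-bit values + one 8-bit (FP8) scale per 16-value block.
nvfp4 = weight_gb(4 + 8 / 16)

print(f"BF16: {bf16:.0f} GB, FP8: {fp8:.0f} GB, NVFP4: {nvfp4:.1f} GB")
```

Under these assumptions, NVFP4 weights fit in roughly 3.5x less memory than BF16, which also means proportionally less data to stream from HBM on every decode step.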
Next time: NVFP4 inference with Multi-LLM NIM!
In this article, we explained how to quantize a model with the TensorRT Model Optimizer.
In the next article, we will finally deploy the quantized model to Multi-LLM NIM and run inference.