NVFP4 Inference with NVIDIA DGX™ B200, Part 2: Quantization with NVIDIA® TensorRT™ Model Optimizer

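In This Article

Part 1 explained quantization, including NVFP4, a format supported starting with Blackwell-generation GPUs such as those in the NVIDIA® DGX™ B200.

Part 2 explains how to use NVIDIA® TensorRT™ Model Optimizer, a tool for optimizing LLMs published on Hugging Face Hub and elsewhere, as well as LLMs you have built yourself.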

[NVFP4 Inference with NVIDIA DGX B200]

Part 1: What is NVFP4?

Part 2: Quantization with NVIDIA® TensorRT™ Model Optimizer

Part 3: Inference with Multi-LLM NIM™

Part 4: Benchmarking NVFP4 and FP8

Part 5: Deploying Llama-3.1-405B-Instruct

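What Is TensorRT Model Optimizer?

As LLMs grow larger and more complex, reducing computation and shortening inference time have become important challenges. Approaches such as quantization and distillation address them by making the model smaller.
TensorRT Model Optimizer is a library built for these challenges. It provides state-of-the-art model optimization techniques such as the following.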

・Quantization

・Distillation

・Pruning

・Speculative decoding

・Acceleration through sparsity

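These techniques can be combined to optimize a model efficiently.
The library also integrates seamlessly with TensorRT-LLM, NVIDIA NIM, and related tools, so the optimized checkpoints it produces can be deployed to production right away.
This article uses TensorRT Model Optimizer to apply Post-Training Quantization and shows how to generate a TensorRT-LLM checkpoint in NVFP4, the format newly supported starting with Blackwell-generation GPUs.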

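What Is Post-Training Quantization (PTQ)?

Post-Training Quantization (PTQ) is a quantization technique that optimizes the size and inference speed of an already trained model without a large loss in accuracy. Because no training is required, it is lightweight and fast to apply. The result is a smaller model that can run inference faster.
Other techniques exist as well, such as Quantization-Aware Training (QAT), which is used to preserve higher accuracy, and TensorRT Model Optimizer supports these too.
For this evaluation we used Post-Training Quantization, which can improve throughput, latency, and memory efficiency without retraining.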
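The idea behind PTQ can be illustrated with a minimal sketch. It assumes the simplest possible setup: a per-tensor maximum absolute value (amax) is collected from a few calibration batches, and values are then rounded onto a symmetric 8-bit integer grid. The integer grid is chosen only for readability and is not the NVFP4 format; the function names are hypothetical and this is not how TensorRT Model Optimizer is implemented internally.

```python
import random


def calibrate_amax(calibration_batches):
    """Max calibration: track the largest absolute value seen in the calibration data."""
    return max(abs(v) for batch in calibration_batches for v in batch)


def fake_quantize_int8(values, amax):
    """Round values onto a symmetric 8-bit grid, then map them back to floats."""
    scale = amax / 127.0
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


# Hypothetical calibration data: 8 batches of 32 activations each.
random.seed(0)
calibration_batches = [[random.gauss(0.0, 1.0) for _ in range(32)] for _ in range(8)]

# Step 1: calibrate. No gradient step or retraining is involved.
amax = calibrate_amax(calibration_batches)

# Step 2: quantize with the calibrated range and measure the rounding error.
x = calibration_batches[0]
x_q = fake_quantize_int8(x, amax)
max_error = max(abs(a - b) for a, b in zip(x, x_q))
print(f"amax={amax:.4f}, max rounding error={max_error:.6f}")
```

Because every input lies inside the calibrated range, the rounding error is bounded by half a quantization step (amax / 127 / 2). No weights are updated at any point, which is why PTQ is so much cheaper than retraining.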

Converting a Model with TensorRT Model Optimizer

All of the work in this article was carried out on the DGX B200 installed in our AI TRY NOW PROGRAM.

Step 1: Install / prepare the Docker image

Build the image from the Dockerfile, following the NVIDIA/TensorRT-Model-Optimizer repository on GitHub.
Alternatively, TensorRT Model Optimizer is also installed in the TensorRT-LLM Develop container on NGC,
so you can use that container instead.
TensorRT-LLM Develop | NVIDIA NGC

# Clone the ModelOpt repository
git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer
# Build the docker (will be tagged `docker.io/library/modelopt_examples:latest`)
# You may customize `docker/Dockerfile` to include or exclude certain dependencies you may or may not need.
./docker/build.sh
# Run the docker image  
docker run --gpus all -it --shm-size 20g --rm docker.io/library/modelopt_examples:latest bash

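Step 2: Quantize the model

Run quantization with the command below.
This time we use Meta's Llama-3.3-70B-Instruct. When the script runs, the model is downloaded automatically into .cache and quantization begins.
We specify NVFP4 as the quantization format and TensorRT-LLM as the model format.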

cd TensorRT-Model-Optimizer/examples/llm_ptq/
export HF_TOKEN=hf_xxxx
python hf_ptq.py \
--pyt_ckpt_path "meta-llama/Llama-3.3-70B-Instruct" \
--qformat nvfp4 \
--export_fmt tensorrt_llm \
--export_path /xxxxx \
--trust_remote_code

A directory with the following structure is then created at the path you specified.

trtllm_ckpt #directory created by TensorRT Model Optimizer
    ├── config.json #file created by TensorRT Model Optimizer
    ├── rank0.safetensors #file created by TensorRT Model Optimizer
    ├── special_tokens_map.json #file created by TensorRT Model Optimizer
    ├── tokenizer.json #file created by TensorRT Model Optimizer
    └── tokenizer_config.json #file created by TensorRT Model Optimizer
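Before moving on, it can help to confirm that the export produced every file in the listing above. The following is a small sketch that does so; the helper name is hypothetical, and the file list simply mirrors this article's single-rank output (an export split across several ranks would presumably contain additional rank files).

```python
from pathlib import Path

# File names taken from the directory listing above (single-rank export).
EXPECTED_FILES = [
    "config.json",
    "rank0.safetensors",
    "special_tokens_map.json",
    "tokenizer.json",
    "tokenizer_config.json",
]


def missing_checkpoint_files(ckpt_dir):
    """Return the expected file names that are absent from ckpt_dir."""
    ckpt_dir = Path(ckpt_dir)
    return [name for name in EXPECTED_FILES if not (ckpt_dir / name).is_file()]
```

Calling `missing_checkpoint_files(...)` with the export directory returns an empty list when the checkpoint is complete.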

The log also lets you check which layers were quantized, the GPU memory usage, and the model's output for a sample input before and after PTQ.

model.layers.79.self_attn.k_bmm_quantizer                                        TensorQuantizer((4, 3) bit fake per-tensor amax=17.0000 calibrator=MaxCalibrator quant)
model.layers.79.self_attn.v_bmm_quantizer                                        TensorQuantizer((4, 3) bit fake per-tensor amax=5.2500 calibrator=MaxCalibrator quant)
model.layers.79.mlp.gate_proj.input_quantizer                                    TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=17.0000 calibrator=MaxCalibrator quant)
model.layers.79.mlp.gate_proj.output_quantizer                                   TensorQuantizer(disabled)
model.layers.79.mlp.gate_proj.weight_quantizer                                   TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=0.4492 calibrator=MaxCalibrator quant)
model.layers.79.mlp.up_proj.input_quantizer                                      TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=17.0000 calibrator=MaxCalibrator quant)
model.layers.79.mlp.up_proj.output_quantizer                                     TensorQuantizer(disabled)
model.layers.79.mlp.up_proj.weight_quantizer                                     TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=0.6211 calibrator=MaxCalibrator quant)
model.layers.79.mlp.down_proj.input_quantizer                                    TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=380.0000 calibrator=MaxCalibrator quant)
model.layers.79.mlp.down_proj.output_quantizer                                   TensorQuantizer(disabled)
model.layers.79.mlp.down_proj.weight_quantizer                                   TensorQuantizer((2, 1) bit fake block_sizes={-1: 16, 'type': 'dynamic', 'scale_bits': (4, 3)}, amax=0.8047 calibrator=MaxCalibrator quant)
lm_head.input_quantizer                                                          TensorQuantizer(disabled)
lm_head.output_quantizer                                                         TensorQuantizer(disabled)
lm_head.weight_quantizer                                                         TensorQuantizer(disabled)
1923 TensorQuantizers found in model
Loading extension modelopt_cuda_ext_fp8...
Loaded extension modelopt_cuda_ext_fp8 in 0.0 seconds
--------
example test input: ['<|begin_of_text|>LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Details of how he\'ll mark his landmark birthday are under wraps. His agent and publicist had no comment on his plans. "I\'ll definitely have some sort of party," he said in an interview. "Hopefully none of you will be reading about it." Radcliffe\'s earnings from the first five Potter films have been held in a trust fund which he has not been able to touch. Despite his growing fame and riches, the actor says he is keeping his feet firmly on the ground. "People are always looking to say \'kid star goes off the rails,\'" he told reporters last month. "But I try very hard not to go that way because it would be too easy for them." His latest outing as the boy wizard in "Harry Potter and the Order of the Phoenix" is breaking records on both sides of the Atlantic and he will reprise the role in the last two films.  Watch I-Reporter give her review of Potter\'s latest ». There is life beyond Potter, however. 
The Londoner has filmed a TV movie called "My Boy Jack," about author Rudyard Kipling and his son, due for release later this year. He will also appear in "December Boys," an Australian film about four boys who escape an orphanage. Earlier this year, he made his stage debut playing a tortured teenager in Peter Shaffer\'s "Equus." Meanwhile, he is braced for even closer media scrutiny']
--------
example outputs before ptq: [' as he becomes an adult. "I think it\'s going to be a bit more intense," he said. "I\'ll have to be more careful about where I go and what I do." Source: CNN.com
#2 Re: Daniel Radcliffe turns 18, gains access to £20 million fortune He\'s been so focused on his career and has seemed to handle the fame very well. I think he\'ll be just fine with his newfound wealth. ## Comment
#3 Re: Daniel']
--------
example outputs after ptq: [' as he leaves his teenage years behind. "I think the press will start to be more interested in my personal life," he said. "But I\'m not going to start going out to every club and every party and doing things that will get me in the papers." The actor\'s parents, Marcia Gresham and Alan Radcliffe, have been praised for keeping their son grounded. "They\'ve been very good at keeping me level-headed and not letting me get too carried away with it all']
current rank: 0, tp rank: 0, pp rank: 0
[TensorRT-LLM] TensorRT-LLM version: 0.20.0
Quantized model exported to :/data/llama-33-70B-Instruct-nvfp4. Total time used 78.48796939849854s
########
GPU 0: Peak memory usage = 162.91 GB for all processes on the GPU
########
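In the quantizer lines of this log, `(2, 1) bit` denotes a 4-bit float with 2 exponent bits and 1 mantissa bit (E2M1), `block_sizes={-1: 16, ...}` means that every 16 values along the last dimension share one scale, and `'scale_bits': (4, 3)` means that scale is stored as FP8 (E4M3). The sketch below illustrates the block-scaling idea in plain Python. It is a simplification under stated assumptions: the block scale is kept in full precision rather than rounded to FP8, the additional per-tensor scale used by NVFP4 is omitted, ties are broken toward the smaller grid value, and the function names are hypothetical.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16


def fake_quantize_block(block):
    """Scale one block so its largest magnitude maps to 6.0, then snap to the grid."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    scale = amax / 6.0  # simplification: real NVFP4 stores this scale in FP8 (E4M3)
    out = []
    for v in block:
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(math.copysign(nearest * scale, v))
    return out


def fake_quantize(values):
    """Apply block-wise fake quantization over consecutive blocks of 16 values."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        out.extend(fake_quantize_block(values[start:start + BLOCK_SIZE]))
    return out


# Hypothetical weights: two blocks with different value ranges.
weights = [0.1 * i for i in range(-16, 16)]
quantized = fake_quantize(weights)
for w, q in list(zip(weights, quantized))[:4]:
    print(f"{w:+.2f} -> {q:+.4f}")
```

Each block gets its own scale, so a block of small values keeps fine resolution even when another block in the same tensor contains large values. This is the motivation for block-wise formats over a single per-tensor scale.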

Step 3: Arrange the directory

As described in the Multi-LLM NIM manual, arrange the config files downloaded from Hugging Face and the generated trtllm_ckpt into the following directory structure.

llama-33-70B-Instruct-nvfp4/
├── config.json #file downloaded from Hugging Face
├── generation_config.json #file downloaded from Hugging Face
├── model.safetensors.index.json #file downloaded from Hugging Face
├── special_tokens_map.json #file downloaded from Hugging Face
├── tokenizer.json #file downloaded from Hugging Face
├── tokenizer_config.json #file downloaded from Hugging Face
└── trtllm_ckpt #directory created by TensorRT Model Optimizer
    ├── config.json #file created by TensorRT Model Optimizer
    ├── rank0.safetensors #file created by TensorRT Model Optimizer
    ├── special_tokens_map.json #file created by TensorRT Model Optimizer
    ├── tokenizer.json #file created by TensorRT Model Optimizer
    └── tokenizer_config.json #file created by TensorRT Model Optimizer
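The layout above can be assembled by hand, or with a short script such as the following sketch. The function name and its three directory arguments are hypothetical: `hf_dir` is assumed to be a local directory that already contains the files downloaded from Hugging Face, and `ckpt_dir` the directory exported in Step 2.

```python
import shutil
from pathlib import Path

# Files taken from the Hugging Face model repository, as listed in the tree above.
HF_FILES = [
    "config.json",
    "generation_config.json",
    "model.safetensors.index.json",
    "special_tokens_map.json",
    "tokenizer.json",
    "tokenizer_config.json",
]


def build_nim_layout(hf_dir, ckpt_dir, out_dir):
    """Copy the Hugging Face files and the exported checkpoint into out_dir."""
    hf_dir, ckpt_dir, out_dir = Path(hf_dir), Path(ckpt_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name in HF_FILES:
        shutil.copy2(hf_dir / name, out_dir / name)
    # The exported checkpoint goes into a subdirectory named trtllm_ckpt.
    shutil.copytree(ckpt_dir, out_dir / "trtllm_ckpt", dirs_exist_ok=True)
    return out_dir
```

The script copies rather than moves the files, so the original download and export remain intact if the layout needs to be rebuilt.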

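With this, everything is ready to run inference on the NVFP4-quantized model with Multi-LLM NIM.
We converted to NVFP4 this time, but TensorRT Model Optimizer and Multi-LLM NIM also support other quantization formats such as FP8 and INT4 AWQ.
This lets you switch formats flexibly depending on the accuracy you need, the GPU resources available, and the inference speed you are aiming for.
We also used the default settings here, but TensorRT Model Optimizer offers a range of further options, such as choosing the dataset used for calibration and quantizing the KV cache.
For details, see the examples in NVIDIA/TensorRT-Model-Optimizer.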
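To compare formats, the Step 2 command can be generated for each one. The sketch below only builds and prints the command lines; it uses exactly the flags shown in Step 2. The `fp8` and `int4_awq` values for `--qformat` are assumptions about how the script names those formats, and the export root is a hypothetical placeholder, so check both against the examples in the repository before running anything.

```python
import shlex


def build_ptq_command(model_id, qformat, export_path):
    """Build the hf_ptq.py command line from Step 2 for a given quantization format."""
    return [
        "python", "hf_ptq.py",
        "--pyt_ckpt_path", model_id,
        "--qformat", qformat,
        "--export_fmt", "tensorrt_llm",
        "--export_path", export_path,
        "--trust_remote_code",
    ]


EXPORT_ROOT = "/path/to/export"  # hypothetical placeholder

# "fp8" and "int4_awq" are assumed format names; "nvfp4" is the one used in this article.
for qformat in ("nvfp4", "fp8", "int4_awq"):
    cmd = build_ptq_command(
        "meta-llama/Llama-3.3-70B-Instruct", qformat, f"{EXPORT_ROOT}/{qformat}"
    )
    print(shlex.join(cmd))
```

Exporting each format to its own directory keeps the checkpoints separate, which makes it easier to benchmark them side by side later.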

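Summary

Quantization is one effective approach to speeding up model inference. Many open-weight models today run in FP16, BF16, or FP8, but a format such as FP4 makes even faster inference possible. For AI agents, which need multiple rounds of LLM input and output to answer a single user question, this can further improve the user experience.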
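A rough back-of-the-envelope calculation shows why a narrower format helps. The sketch below estimates weight storage only, assuming about 70 billion parameters that are all stored in the given format. For NVFP4 it counts 4 bits per value plus one 8-bit scale shared by each block of 16 values, which gives 4.5 bits per weight. Real checkpoints will differ somewhat: in the log above, for example, the lm_head quantizers are disabled, so that layer stays in higher precision.

```python
def weight_gigabytes(num_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes) for a given bit width."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 70e9  # assumption: roughly 70 billion weights, all in the same format

# NVFP4: 4-bit value plus one 8-bit scale shared by each block of 16 values.
NVFP4_BITS = 4 + 8 / 16

for name, bits in [("BF16", 16), ("FP8", 8), ("NVFP4", NVFP4_BITS)]:
    print(f"{name}: about {weight_gigabytes(NUM_PARAMS, bits):.1f} GB")
```

Fewer bytes per weight means less data to read from GPU memory for every generated token, which is one reason lower-precision formats tend to improve throughput and latency.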