In this episode
In the first episode, we explained each metric that GenAI-Perf outputs as a benchmark for generative AI applications.
In this second episode, we explain how to actually use the GenAI-Perf tool.
[Benchmarking LLM Applications]
Episode 1: What is GenAI-Perf?
Episode 2: How to use GenAI-Perf
Episode 3: NVIDIA NIM™ and vLLM Benchmark Measurement
GenAI-Perf Options
GenAI-Perf has many command-line options, so here we explain the most commonly used ones. For details, see Performance Analyzer -> GenAI-Perf -> Command Line Options.
--model <list>
Specify the AI model to be measured. (Unless you are using LoRA adapters, specify a single model.)
--tokenizer <str>
Specify the name of a tokenizer on the Hugging Face Hub, used to count the tokens in prompts and LLM responses. Example: meta-llama/Meta-Llama-3-8B-Instruct
--service-kind {triton, openai}
Specify triton if the inference server is a Triton Inference Server, or openai if the server exposes an OpenAI-compatible API.
--endpoint-type {chat, completions}
If you specify openai for --service-kind, you can specify the endpoint type: chat or completions.
--url <url>
Specify the URL of the inference server.
--concurrency <int>
Specify the number of concurrent requests to run; this corresponds to the number of end users the LLM application will serve simultaneously.
--measurement-interval <int>
Specify the measurement interval in milliseconds. Queries completed within this window are counted toward each metric calculation.
--output-tokens-mean <int>
Average number of output tokens
--output-tokens-stddev <int>
Standard deviation of output token count
Options valid when using a Hugging Face dataset for prompts
--input-dataset {openorca, cnn_dailymail}
To use a Hugging Face dataset as the prompt source, specify openorca or cnn_dailymail. If this option is omitted, synthetic data is used for the prompts.
--num-prompts <int>
Unique prompt count
Options valid when using synthetic data for prompts
--num-prompts <int>
Unique prompt count
--synthetic-input-tokens-mean <int>
Average number of input tokens per synthetic prompt
--synthetic-input-tokens-stddev <int>
Standard deviation of the input token count
--random-seed <int>
Random number generator seed value
Options valid when using user-supplied prompts
--input-file <path>
Specify the path to a JSON file containing the prompts.
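Putting the options above together, a typical invocation might look like the following. This is a sketch only: the model name, URL, and parameter values are placeholders to be replaced for your own environment.

```shell
# Benchmark an OpenAI-compatible server with synthetic 200-token prompts
# and 200-token responses (placeholder model name and URL).
genai-perf \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --tokenizer meta-llama/Meta-Llama-3-8B-Instruct \
  --service-kind openai \
  --endpoint-type chat \
  --url http://localhost:8000 \
  --concurrency 10 \
  --measurement-interval 10000 \
  --synthetic-input-tokens-mean 200 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 200 \
  --output-tokens-stddev 0 \
  --num-prompts 100 \
  --random-seed 42
```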
How to read GenAI-Perf output
This section explains how to interpret the results output when benchmarking NIM with GenAI-Perf.
The three metrics in the first table, Time to First Token, Inter Token Latency, and (end-to-end) Request Latency, are calculated for each query to the LLM, and the average, minimum, maximum, and the 99th, 95th, 90th, 75th, 50th, and 25th percentiles are reported. A percentile indicates where a value falls when the data is sorted from smallest to largest: the 99th percentile is the value below which 99% of the data falls, and the 50th percentile is the median.
Num Output Token and Num Input Token reflect the parameters the user sets according to the scenario expected for the LLM application. In this example, we set the input/output lengths to 200/200 tokens and report the results.
| Metric | avg | min | max | p99 | p95 | p90 | p75 | p50 | p25 |
|---|---|---|---|---|---|---|---|---|---|
| Time To First Token (ns) | 178859359 | 73838932 | 217734212 | 217338545 | 216559510 | 215989391 | 185805802 | 183952658 | 177839897 |
| Inter Token Latency (ns) | 28186524 | 26284341 | 30353671 | 30254216 | 29861613 | 29690036 | 29065006 | 27994706 | 27230244 |
| Request Latency (ns) | 5344653932 | 5165773417 | 5619168861 | 5619164284 | 5619095892 | 5618754623 | 5540800702 | 5285519846 | 5170483894 |
| Num Output Token | 184 | 174 | 191 | 190 | 189 | 188 | 186 | 185 | 183 |
| Num Input Token | 200 | 180 | 217 | 216 | 212 | 212 | 207 | 200 | 194 |
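To make the percentile columns concrete, here is a minimal Python sketch of how such summary statistics are computed. The latency samples below are hypothetical illustrative values, not data from a real GenAI-Perf run, and the `percentile` helper is our own linear-interpolation implementation, not GenAI-Perf code.

```python
import statistics

# Hypothetical Time-to-First-Token samples in nanoseconds
# (illustrative values only, not from an actual benchmark).
ttft_ns = sorted([217_734_212, 73_838_932, 183_952_658,
                  185_805_802, 216_559_510, 177_839_897])

def percentile(sorted_data, q):
    """Linear-interpolation percentile: the value q% of the way
    through the sorted data (q=50 gives the median)."""
    idx = (len(sorted_data) - 1) * q / 100
    lo, hi = int(idx), min(int(idx) + 1, len(sorted_data) - 1)
    frac = idx - lo
    return sorted_data[lo] * (1 - frac) + sorted_data[hi] * frac

# The same summary statistics GenAI-Perf reports per latency metric.
summary = {"avg": statistics.mean(ttft_ns),
           "min": ttft_ns[0], "max": ttft_ns[-1]}
for q in (99, 95, 90, 75, 50, 25):
    summary[f"p{q}"] = percentile(ttft_ns, q)

print(summary["p50"])  # p50 equals the median of the samples
```

Sorting once and interpolating between neighboring samples is the standard "linear" percentile method, which is why p50 matches the median exactly.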
For Output Token Throughput and Request Throughput, a single value is output per measurement.
| Metric | Value |
|---|---|
| Output Token Throughput (per sec) | 344.74 |
| Request Throughput (per sec) | 1.87 |
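As a quick sanity check on the two tables (our own reasoning, not something GenAI-Perf documents): the output token throughput should be roughly the request throughput multiplied by the average number of output tokens per request.

```python
# Values taken from the tables above.
request_throughput = 1.87   # requests per second
avg_output_tokens = 184     # average Num Output Token per request

# tokens/sec ≈ requests/sec × tokens/request
estimated_token_throughput = request_throughput * avg_output_tokens
print(round(estimated_token_throughput, 2))  # close to the reported 344.74
```

The estimate (about 344) lines up with the reported 344.74 tokens/sec, which is a useful quick check that a benchmark run is internally consistent.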
Next up: NVIDIA NIM™ and vLLM benchmarking!
In this article, we introduced the options for the GenAI-Perf tool and how to interpret its output. We hope you found it useful.
Next time, we will use the GenAI-Perf tool to benchmark NVIDIA NIM™ and vLLM.
If you are considering introducing AI, please contact us. We offer selection and support for NVIDIA GPU cards and GPU workstations, algorithms for face recognition, flow-line analysis, and skeleton detection, and training environment construction services. If you have any problems, please feel free to reach out.