Benchmarking LLM Applications Part 2: How to Use GenAI-Perf

In This Episode

In the first episode, we explained each of the metrics that GenAI-Perf outputs when benchmarking generative AI applications.

In this second episode, we explain how to actually use the GenAI-Perf tool.

 

[Benchmarking LLM Applications]
Episode 1: What is GenAI-Perf?
Episode 2: How to use GenAI-Perf
Episode 3: NVIDIA NIM™ and vLLM Benchmark Measurement

GenAI-Perf Options

GenAI-Perf has many command-line options, so here we cover the most commonly used ones. For the full list, see Performance Analyzer -> GenAI-Perf -> Command Line Options in the official documentation.

 

--model <list>
Specify the AI model to be measured. (If you do not use a LoRA adapter, specify a single model.)

--tokenizer <str>
Specify the name of the tokenizer on the Hugging Face Hub to calculate the number of tokens for prompts and LLM responses. Example: meta-llama/Meta-Llama-3-8B-Instruct

--service-kind {triton, openai}
Specify triton if the inference server is Triton Inference Server, or openai if the server exposes an OpenAI-compatible API.

--endpoint-type {chat, completions}
If you specify openai for --service-kind, you can specify the endpoint type: chat or completions.

--url <url>
Specify the URL of the inference server.

--concurrency <int>
Specify the number of concurrent requests to run, which corresponds to the number of end users the LLM application serves simultaneously.

--measurement-interval <int>
Specify the measurement interval in milliseconds. Requests completed within this window are included in each metric calculation.

--output-tokens-mean <int>
Average number of output tokens

--output-tokens-stddev <int>
Standard deviation of output token count
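Putting the common options together, an invocation against an OpenAI-compatible server might look like the following. The model name, URL, and parameter values here are illustrative assumptions, and the exact command form can vary between GenAI-Perf versions:

```shell
# Illustrative GenAI-Perf invocation (model, URL, and values are example assumptions)
genai-perf \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --tokenizer meta-llama/Meta-Llama-3-8B-Instruct \
  --service-kind openai \
  --endpoint-type chat \
  --url localhost:8000 \
  --concurrency 10 \
  --measurement-interval 10000 \
  --output-tokens-mean 200 \
  --output-tokens-stddev 0
```

This example measures a chat endpoint for 10 seconds at a time with 10 concurrent users, targeting roughly 200 output tokens per response.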

 

Valid options when using a Hugging Face dataset for prompts

 

--input-dataset {openorca, cnn_dailymail}
To use a Hugging Face dataset as the prompt source, specify openorca or cnn_dailymail. If this option is omitted, synthetic data is used for the prompts.

--num-prompts <int>
Unique prompt count
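For example, to draw 100 unique prompts from the OpenOrca dataset, the following flags could be added to a base genai-perf command (the prompt count is an illustrative assumption):

```shell
# Appended to a base genai-perf command (illustrative values)
  --input-dataset openorca \
  --num-prompts 100
```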

 

Valid options when using synthetic data for prompts

 

--num-prompts <int>
Unique prompt count

--synthetic-input-tokens-mean <int>
Average number of input tokens

--synthetic-input-tokens-stddev <int>
Standard deviation of the input token count

--random-seed <int>
Random number generator seed value
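For example, to generate 100 unique synthetic prompts averaging 200 input tokens with some variation, the following flags could be added to a base genai-perf command (all values are illustrative assumptions):

```shell
# Appended to a base genai-perf command (illustrative values)
  --num-prompts 100 \
  --synthetic-input-tokens-mean 200 \
  --synthetic-input-tokens-stddev 20 \
  --random-seed 0
```

Fixing --random-seed makes the generated prompt set reproducible across runs.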

 

Valid options when using user-supplied prompts

 

--input-file <path>
Specify the path to a JSON file containing the prompts to use.
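The expected file format is one JSON object per line. Note that the field name varies across GenAI-Perf versions (recent documentation uses a `text` field; check the documentation for your version), and the prompts below are purely illustrative:

```shell
# Create an illustrative prompt file for --input-file.
# The "text" field name and the prompts themselves are assumptions;
# check your GenAI-Perf version's documentation for the exact schema.
cat > prompts.jsonl <<'EOF'
{"text": "Summarize the plot of Hamlet in two sentences."}
{"text": "Explain what an inference server does."}
EOF
wc -l prompts.jsonl
```

You would then pass the file with `--input-file prompts.jsonl`.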

How to read GenAI-Perf output

This section explains how to interpret the results that GenAI-Perf outputs, using a benchmark of NIM as an example.

The three metrics listed in the first column, Time to First Token, Inter Token Latency, and (End-to-End) Request Latency, are calculated for each query to the LLM, and the average, minimum, maximum, and the 99th, 95th, 90th, 75th, 50th, and 25th percentiles are output. A percentile indicates where a value falls when the data is sorted from smallest to largest: the 99th percentile is the value below which 99% of the samples fall, and the 50th percentile is the median.
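As a concrete illustration of what a percentile means, the following toy calculation takes five latency samples and reports the nearest-rank p50 and p99. GenAI-Perf computes its percentiles internally and may use a different interpolation; this is only a sketch of the idea:

```shell
# Toy nearest-rank percentile over five latency samples.
# GenAI-Perf's own interpolation may differ; this only illustrates the concept.
samples='5 1 4 2 3'
for p in 50 99; do
  printf '%s\n' $samples | sort -n |
    awk -v p=$p '{v[NR]=$1} END{i=int(p/100*NR+0.999999); if(i<1)i=1; printf "p%d=%d\n", p, v[i]}'
done
# prints:
# p50=3
# p99=5
```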

Num Output Token and Num Input Token reflect the token-count parameters that the user sets according to the usage scenario expected for the LLM application. In this example, we set input/output to 200/200 tokens and obtained the results below.

Metric                     avg         min         max         p99         p95         p90         p75         p50         p25
Time To First Token (ns)   178859359   73838932    217734212   217338545   216559510   215989391   185805802   183952658   177839897
Inter Token Latency (ns)   28186524    26284341    30353671    30254216    29861613    29690036    29065006    27994706    27230244
Request Latency (ns)       5344653932  5165773417  5619168861  5619164284  5619095892  5618754623  5540800702  5285519846  5170483894
Num Output Token           184         174         191         190         189         188         186         185         183
Num Input Token            200         180         217         216         212         212         207         200         194

For Output Token Throughput and Request Throughput, a single value is output per measurement run.

Metric                             Value
Output Token Throughput (per sec)  344.74
Request Throughput (per sec)       1.87

Next up: NVIDIA NIM™ and vLLM benchmarking!

In this article, we introduced the GenAI-Perf tool's options and how to interpret its output.

Next time, we will use the GenAI-Perf tool to benchmark NVIDIA NIM™ and vLLM.

If you are considering adopting AI, please contact us.

For AI adoption, we offer selection and support for NVIDIA GPU card and GPU workstation hardware, as well as face recognition, flow-line analysis, and skeleton detection algorithms, and training environment construction services. If you have any problems, please feel free to contact us.