Benchmarking LLM Applications Part 2: How to Use GenAI-Perf

In This Episode

In the first episode, we explained each of the metrics that GenAI-Perf outputs when benchmarking generative AI applications.

In this second episode, we explain how to actually use the GenAI-Perf tool.

 

[Benchmarking LLM Applications]
Episode 1: What is GenAI-Perf?
Episode 2: How to use GenAI-Perf
Episode 3: NVIDIA NIM™ and vLLM Benchmark Measurement

GenAI-Perf Options

GenAI-Perf has many command-line options, so here we cover the most commonly used ones. For the full list, see Performance Analyzer -> GenAI-Perf -> Command Line Options in the official documentation.

 

--model <list>
Specify the AI model to be measured. (If you do not use a LoRA adapter, specify a single model.)

--tokenizer <str>
Specify the name of the tokenizer on the Hugging Face Hub to calculate the number of tokens for prompts and LLM responses. Example: meta-llama/Meta-Llama-3-8B-Instruct

--service-kind {triton, openai}
Specify triton if the inference server is Triton Inference Server, or openai if the server exposes an OpenAI-compatible API.

--endpoint-type {chat, completions}
If you specify openai for --service-kind, you can specify the endpoint type: chat or completions.

--url <url>
Specify the URL of the inference server.

--concurrency <int>
Specify the number of concurrent requests to run, which corresponds to the number of end users the LLM application serves simultaneously.

--measurement-interval <int>
Specify the measurement interval in milliseconds. Requests completed within this window are included in each metric calculation.

--output-tokens-mean <int>
Average number of output tokens

--output-tokens-stddev <int>
Standard deviation of output token count
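Putting the common options together, an invocation against an OpenAI-compatible server might look like the following. The model name, URL, and parameter values here are illustrative assumptions, and the exact command form can vary between GenAI-Perf versions:

```shell
# Illustrative GenAI-Perf invocation (model, URL, and values are example assumptions)
genai-perf \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --tokenizer meta-llama/Meta-Llama-3-8B-Instruct \
  --service-kind openai \
  --endpoint-type chat \
  --url localhost:8000 \
  --concurrency 10 \
  --measurement-interval 10000 \
  --output-tokens-mean 200 \
  --output-tokens-stddev 0
```

This example measures a chat endpoint for 10 seconds at a time with 10 concurrent users, targeting roughly 200 output tokens per response.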

 

Valid options when using a Hugging Face dataset for prompts

 

--input-dataset {openorca, cnn_dailymail}
To use a Hugging Face dataset as the prompt source, specify openorca or cnn_dailymail. If this option is omitted, synthetic data is used for the prompts.

--num-prompts <int>
Unique prompt count
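For example, to draw 100 unique prompts from the OpenOrca dataset, the following flags could be added to a base genai-perf command (the prompt count is an illustrative assumption):

```shell
# Appended to a base genai-perf command (illustrative values)
  --input-dataset openorca \
  --num-prompts 100
```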

 

Valid options when using synthetic data for prompts

 

--num-prompts <int>
Unique prompt count

--synthetic-input-tokens-mean <int>
Average number of input tokens

--synthetic-input-tokens-stddev <int>
Standard deviation of the input token count

--random-seed <int>
Random number generator seed value
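For example, to generate 100 unique synthetic prompts averaging 200 input tokens with some variation, the following flags could be added to a base genai-perf command (all values are illustrative assumptions):

```shell
# Appended to a base genai-perf command (illustrative values)
  --num-prompts 100 \
  --synthetic-input-tokens-mean 200 \
  --synthetic-input-tokens-stddev 20 \
  --random-seed 0
```

Fixing --random-seed makes the generated prompt set reproducible across runs.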

 

Valid options when using user-supplied prompts

 

--input-file <path>
Specify the path to a JSON file containing the prompts to use.
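The expected file format is one JSON object per line. Note that the field name varies across GenAI-Perf versions (recent documentation uses a `text` field; check the documentation for your version), and the prompts below are purely illustrative:

```shell
# Create an illustrative prompt file for --input-file.
# The "text" field name and the prompts themselves are assumptions;
# check your GenAI-Perf version's documentation for the exact schema.
cat > prompts.jsonl <<'EOF'
{"text": "Summarize the plot of Hamlet in two sentences."}
{"text": "Explain what an inference server does."}
EOF
wc -l prompts.jsonl
```

You would then pass the file with `--input-file prompts.jsonl`.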

How to read GenAI-Perf output

This section explains how to interpret the results that GenAI-Perf outputs, using a benchmark of NIM as an example.

The three metrics listed in the first column, Time to First Token, Inter Token Latency, and (End-to-End) Request Latency, are calculated for each query to the LLM, and the average, minimum, maximum, and the 99th, 95th, 90th, 75th, 50th, and 25th percentiles are output. A percentile indicates where a value falls when the data is sorted from smallest to largest: the 99th percentile is the value below which 99% of the samples fall, and the 50th percentile is the median.
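As a concrete illustration of what a percentile means, the following toy calculation takes five latency samples and reports the nearest-rank p50 and p99. GenAI-Perf computes its percentiles internally and may use a different interpolation; this is only a sketch of the idea:

```shell
# Toy nearest-rank percentile over five latency samples.
# GenAI-Perf's own interpolation may differ; this only illustrates the concept.
samples='5 1 4 2 3'
for p in 50 99; do
  printf '%s\n' $samples | sort -n |
    awk -v p=$p '{v[NR]=$1} END{i=int(p/100*NR+0.999999); if(i<1)i=1; printf "p%d=%d\n", p, v[i]}'
done
# prints:
# p50=3
# p99=5
```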

Num Output Token and Num Input Token reflect the token-count parameters that the user sets according to the usage scenario expected for the LLM application. In this example, we set input/output to 200/200 tokens and obtained the results below.

Metric                     avg         min         max         p99         p95         p90         p75         p50         p25
Time To First Token (ns)   178859359   73838932    217734212   217338545   216559510   215989391   185805802   183952658   177839897
Inter Token Latency (ns)   28186524    26284341    30353671    30254216    29861613    29690036    29065006    27994706    27230244
Request Latency (ns)       5344653932  5165773417  5619168861  5619164284  5619095892  5618754623  5540800702  5285519846  5170483894
Num Output Token           184         174         191         190         189         188         186         185         183
Num Input Token            200         180         217         216         212         212         207         200         194

For Output Token Throughput and Request Throughput, a single value is output per measurement run.

Metric                             Value
Output Token Throughput (per sec)  344.74
Request Throughput (per sec)       1.87

Next up: NVIDIA NIM™ and vLLM benchmarking!

In this article, we introduced the GenAI-Perf tool's options and how to interpret its output.

Next time, we will use the GenAI-Perf tool to benchmark NVIDIA NIM™ and vLLM.

If you are considering adopting AI, please contact us.

For AI adoption, we offer selection and support for NVIDIA GPU card and GPU workstation hardware, as well as face recognition, flow-line analysis, and skeleton detection algorithms, and training environment construction services. If you have any problems, please feel free to contact us.