In This Episode
In Episode 1, we covered microservices-based development and NVIDIA NIM.
In Episode 2, we explain the specific software and hardware required, as well as how to deploy them.
[RAG Chatbot Development Using NVIDIA NIM]
Episode 1: RAG system development using Microservices
Episode 2: What is the required software/hardware configuration for a RAG system?
Episode 3: RAG System Sample Code
Required software/hardware setup for RAG
This time, we will explain how to build a chatbot with the minimum configuration shown in Episode 1.
In microservices-based development, the LLM, the embedding model, and the vector database are each provided as a container.
NIM Container
In the sample code shown in Episode 3, the LLM is meta/llama-3.1-70b-instruct and the sentence embedding model is nvidia/nv-embedqa-e5-v5.
The latest hardware and software requirements for the NIM containers of these models can be found here.
Software requirements (Release 1.3.0, 2024/12/12)
・Linux operating systems (Ubuntu 20.04 or later recommended)
・NVIDIA Driver >= 560
・NVIDIA Docker >= 23.0.1
Hardware Requirements (Release 1.3.0, 2024/12/12)
meta/llama-3.1-70b-instruct (Release 1.3.0, 2024/12/12)
| GPUs | Precision | Profile | # of GPUs | Disk Space (GB) |
|---|---|---|---|---|
| H200 SXM | FP8 | Throughput | 1 | 67.87 |
| H200 SXM | FP8 | Latency | 2 | 68.2 |
| H200 SXM | BF16 | Throughput | 2 | 133.72 |
| H200 SXM | BF16 | Latency | 4 | 137.99 |
| H100 SXM | FP8 | Throughput | 2 | 68.2 |
| H100 SXM | FP8 | Throughput | 4 | 68.72 |
| H100 SXM | FP8 | Latency | 8 | 69.71 |
| H100 SXM | BF16 | Throughput | 4 | 138.39 |
| H100 SXM | BF16 | Latency | 8 | 147.66 |
| H100 NVL | FP8 | Throughput | 2 | 68.2 |
| H100 NVL | FP8 | Latency | 4 | 68.72 |
| H100 NVL | BF16 | Throughput | 2 | 133.95 |
| H100 NVL | BF16 | Throughput | 4 | 138.4 |
| H100 NVL | BF16 | Latency | 8 | 147.37 |
| A100 SXM | BF16 | Throughput | 4 | 138.53 |
| A100 SXM | BF16 | Latency | 8 | 147.44 |
| L40S | BF16 | Throughput | 4 | 138.49 |
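The disk-space figures above (taken to be in GB) can be sanity-checked with back-of-envelope arithmetic: the downloaded model weights dominate, at roughly parameter count times bytes per parameter (2 bytes for BF16, 1 byte for FP8). A quick sketch:

```python
# Rough sanity check of the disk-space column: model weights dominate,
# at (parameter count) x (bytes per parameter).
PARAMS = 70e9  # meta/llama-3.1-70b-instruct has roughly 70B parameters

def weight_size_gb(params: float, bytes_per_param: float) -> float:
    """Approximate weight size in GB (1 GB = 10**9 bytes)."""
    return params * bytes_per_param / 1e9

print(weight_size_gb(PARAMS, 2))  # BF16: ~140 GB, close to the ~138 GB rows
print(weight_size_gb(PARAMS, 1))  # FP8: ~70 GB, close to the ~68 GB rows
```

This also makes the "# of GPUs" column intuitive: a BF16 70B model does not fit in a single 80 GB GPU, hence the multi-GPU profiles.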
nvidia/nv-embedqa-e5-v5 (Release 1.2.0, 2024/12/12)
| GPUs | GPU Memory (GB) | Precision |
|---|---|---|
| A100 PCIe | 40 & 80 | FP16 |
| A100 SXM4 | 40 & 80 | FP16 |
| H100 PCIe | 80 | FP16 |
| H100 HBM3 | 80 | FP16 |
| H100 NVL | 80 | FP16 |
| L40S | 48 | FP16 |
| A10G | 24 | FP16 |
| L4 | 24 | FP16 |
How to deploy containers
Below is how to deploy meta/llama-3.1-70b-instruct.
First, log in so that you can pull containers from NGC.
```shell
$ docker login nvcr.io
Username: $oauthtoken
Password: <PASTE_API_KEY_HERE>
```
Next, set your NGC API key and launch the NVIDIA NIM container with the following commands (the image is pulled from NGC automatically on first run).
```shell
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -it --rm \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama-3.1-70b-instruct:latest
```
You have now deployed the NIM container.
Once the container is running, you can query the model with the curl command.
```shell
curl -X 'POST' \
    'http://0.0.0.0:8000/v1/chat/completions' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "meta/llama-3.1-70b-instruct",
      "messages": [{"role":"user", "content":"Write a limerick about the wonders of GPU computing."}],
      "max_tokens": 64
    }'
```
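The same request can also be issued from application code, since NIM exposes an OpenAI-compatible API. Below is a minimal sketch using only the Python standard library, assuming the container above is listening on port 8000:

```python
import json
import urllib.request

# Endpoint of the NIM container started above (assumed port mapping 8000:8000).
NIM_URL = "http://0.0.0.0:8000/v1/chat/completions"

def build_payload(prompt: str, max_tokens: int = 64) -> dict:
    """Build an OpenAI-compatible chat-completions request body."""
    return {
        "model": "meta/llama-3.1-70b-instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str) -> str:
    """POST the prompt to the NIM container and return the reply text."""
    req = urllib.request.Request(
        NIM_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Write a limerick about the wonders of GPU computing.")` reproduces the curl example; you can equally point the official `openai` client at `http://0.0.0.0:8000/v1`.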
You can deploy the sentence embedding model (nvidia/nv-embedqa-e5-v5) in a similar way.
Please check here for details.
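Once the embedding NIM is running, it is queried through its `/v1/embeddings` route. The sketch below assumes the container is mapped to port 8001 (any free port works) and that the NIM-specific `input_type` field is used to distinguish queries from passages, as described in the NIM documentation:

```python
import json
import urllib.request

# Assumed endpoint for the embedding NIM container (port mapping 8001:8000).
EMBED_URL = "http://0.0.0.0:8001/v1/embeddings"

def build_embed_payload(texts: list, input_type: str = "query") -> dict:
    """Build an embeddings request body for nv-embedqa-e5-v5."""
    return {
        "model": "nvidia/nv-embedqa-e5-v5",
        "input": texts,
        "input_type": input_type,  # "query" for questions, "passage" for documents
    }

def embed(texts: list, input_type: str = "query") -> list:
    """Return one embedding vector per input text."""
    req = urllib.request.Request(
        EMBED_URL,
        data=json.dumps(build_embed_payload(texts, input_type)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return [d["embedding"] for d in json.load(resp)["data"]]
```

In a RAG pipeline, documents are embedded with `input_type="passage"` at indexing time and user questions with `input_type="query"` at search time.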
Vector Database
We will use Milvus, an open-source vector database built for generative AI applications.
Milvus is distributed as a container, and this time we will deploy and use it that way.
For more information about Milvus, please click here.
Software requirements (version 2.5.x, 2024/12/12)
| Operating system | Software |
|---|---|
| Linux platforms | Docker 19.03 or later, Docker Compose 1.25.1 or later |
Hardware requirements
| Component | Requirement | Recommendation |
|---|---|---|
| CPU | Intel 2nd Gen Core CPU or higher, Apple Silicon | Standalone: 4 cores or more; Cluster: 8 cores or more |
| CPU instruction set | SSE4.2, AVX, AVX2, AVX-512 | SSE4.2, AVX, AVX2, AVX-512 |
| RAM | Standalone: 8 GB; Cluster: 32 GB | Standalone: 16 GB; Cluster: 128 GB |
| Hard drive | SATA 3.0 SSD or higher | NVMe SSD or higher |
How to deploy containers
Below is how to deploy a Milvus container.
The command below downloads the standalone installation script; with the default setup, Milvus listens on port 19530.

```shell
curl -sfL https://raw.githubusercontent.com/milvus-io/milvus/master/scripts/standalone_embed.sh -o standalone_embed.sh
```
You can start, stop, and delete the container with the following commands:

```shell
# Start the Docker container
$ bash standalone_embed.sh start

# Stop the Docker container
$ bash standalone_embed.sh stop

# Delete the Docker container
$ bash standalone_embed.sh delete
```
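After `standalone_embed.sh start`, applications connect to Milvus on port 19530. A minimal connection sketch, assuming the `pymilvus` client library (installed separately with `pip install pymilvus`); the import is deferred so the URI helper is usable even without it:

```python
# Default host/port of the Milvus standalone container started above.
DEFAULT_HOST = "localhost"
DEFAULT_PORT = 19530

def milvus_uri(host: str = DEFAULT_HOST, port: int = DEFAULT_PORT) -> str:
    """Build the connection URI used by MilvusClient."""
    return f"http://{host}:{port}"

def connect(host: str = DEFAULT_HOST, port: int = DEFAULT_PORT):
    """Return a MilvusClient connected to the standalone deployment."""
    # Imported here so this module loads without pymilvus installed.
    from pymilvus import MilvusClient
    return MilvusClient(uri=milvus_uri(host, port))
```

In Episode 3, this connection is where the passage embeddings from nv-embedqa-e5-v5 are stored and searched.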
Sample code released next time!
This time, we explained the hardware configuration and software deployment for building a RAG chatbot with microservices.
In Episode 3, we will walk through microservices-based development using sample code.
If you are considering introducing AI, please contact us.
We offer a wide range of AI-introduction services, including selection and support of NVIDIA GPU cards and GPU workstations, algorithms for face recognition, trajectory analysis, and skeletal detection, and construction of training environments. Please feel free to contact us with any questions.