Services that use AI (artificial intelligence) and deep learning are now a pervasive part of our lives. Users enjoy AI services through edge terminals and edge devices, but behind the scenes there are servers and data centers that provide those services, where AI is trained on large amounts of data and processes huge volumes of data to return analysis and prediction results. AI adoption advanced first in high performance computing (HPC) fields such as research and development, and the trend of enterprise cloud service providers and data center operators incorporating it into their own services is gradually gathering pace.

 

On the other hand, AI requires enormous computing resources, and because of its nature as parallel processing that uses many resources at the same time, it is becoming difficult to handle with the in-house infrastructure and computing resources we have long been familiar with. Until now, when resources ran short, a few dozen servers would be added, and then a few dozen more, expanding the system based on the idea of scale-out. For AI workloads, however, that conventional scale-out (or scale-up) approach on its own no longer works.

 

The reason lies in a difference in data processing style: conventional cloud network configurations were built around "sequential processing," whereas AI performs "parallel processing" of huge amounts of data. So, for cloud service and data center operators who are about to start using AI, this article introduces the characteristics of AI processing that you cannot afford not to know, and shows how to build smart servers and networks for AI.

Conventional cloud network and HPC/AI network

Differences between serial and distributed processing

In a typical cloud service, a client user makes some kind of request from an application. The processing is broken down by function: the first stage is executed, its result is handed to the next step, and the later stages are processed in turn. Once the processing is completed in this form, known as service chaining, the result is returned to the user. Viewed server by server, each step of the processing is completed within a single server.

AI, on the other hand, involves a huge amount of computation, so a single job is distributed across the nodes in and across servers and is constantly synchronized. Once the processing is completed in this form, known as distributed or parallel computing, the result is returned to the user. In this case, the processing spans multiple servers.

(Figure: Differences between serial and distributed processing)
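To make this "constantly synchronized" style of processing concrete, here is a minimal sketch using MPI, one common building block for distributed HPC/AI jobs (the framework choice here is purely illustrative, not something prescribed above). Each rank contributes a local partial result and receives the combined sum, the kind of exchange that drives traffic between servers.

```c
/* Illustrative only: one common way a distributed AI/HPC job keeps its
 * workers synchronized is a collective such as all-reduce, shown here with
 * MPI.  Each rank (often one per GPU or server) contributes its partial
 * result and every rank receives the combined sum. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each worker's local result, e.g. a gradient computed on its own data. */
    float local[4]  = { rank + 1.0f, rank + 1.0f, rank + 1.0f, rank + 1.0f };
    float global[4];

    /* High-frequency synchronization point: every rank exchanges data with
     * the others, which is what sends traffic out of the server cabinet. */
    MPI_Allreduce(local, global, 4, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("summed result: %.1f %.1f %.1f %.1f\n",
               global[0], global[1], global[2], global[3]);

    MPI_Finalize();
    return 0;
}
```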

The difference between the two, seen across this series of processes, comes down to:

・Whether the communication partner is inside the server cabinet or outside it

・Whether communication is low-frequency, consisting only of a request and a response, or high-frequency, with constant synchronization

 

In AI processing, a great deal of communication goes outside the server cabinet, so far more traffic passes through network switches than in a typical network. Next, let's compare communication inside the server cabinet with communication outside it, and look in detail at how much the communication bandwidth changes.

Differences in communication bandwidth between inside and outside the server cabinet

Communication within the chassis takes place between GPUs, and the theoretical bandwidth of the high-speed GPU interconnect is 600 Gbytes/sec (in the case of NVIDIA's 3rd generation NVLink). Communication outside the chassis requires network processing, so the CPU has to intervene. In that case the data passes over PCI Express, whose theoretical bandwidth is 31.51 Gbytes/sec (PCIe Gen4 x16), roughly 1/19 of the GPU-to-GPU figure. Everyone knows that GPUs compute quickly, but once you step outside the chassis you realize just how much communication speed you are giving up.

(Figure: Differences in communication bandwidth between inside and outside the server cabinet)
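To put those two figures side by side, here is a small back-of-the-envelope sketch. The 10 GB payload is just an illustrative number, and both bandwidths are theoretical peaks.

```c
/* Back-of-the-envelope comparison of the two theoretical bandwidths above.
 * The 10 GB payload is a hypothetical example; real effective throughput
 * will be lower than either peak figure. */
#include <stdio.h>

int main(void)
{
    const double nvlink_gbs = 600.0;   /* NVLink 3rd gen, Gbytes/sec */
    const double pcie_gbs   = 31.51;   /* PCIe Gen4 x16, Gbytes/sec */
    const double payload_gb = 10.0;    /* example payload size in Gbytes */

    printf("bandwidth ratio      : %.1fx\n", nvlink_gbs / pcie_gbs);
    printf("10 GB over NVLink    : %.1f ms\n", payload_gb / nvlink_gbs * 1000.0);
    printf("10 GB over PCIe Gen4 : %.1f ms\n", payload_gb / pcie_gbs * 1000.0);
    return 0;
}
```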

These figures are only theoretical values; effective throughput will be lower still. Some of the loss can be recovered with software and algorithms, but it is well worth network infrastructure engineers knowing how to improve things on the infrastructure side. Next, let's look at specific measures.

Essential technologies for building a network that does not slow down GPU computation

Based on what we have seen so far, "building a network without CPU intervention" is very important if GPU computation speed is not to be reduced. The network's communication bandwidth, in other words the diameter of the pipe, is fixed in advance, so the idea is to use it as efficiently as possible, as described below.

 

Several technologies support this, and here we introduce them based on NVIDIA's technology.

RDMA

RDMA (Remote Direct Memory Access) is a data transfer technology that eliminates CPU intervention when copying memory contents between computers.

A compatible adapter card (HCA) reads the application's memory directly and communicates with the HCA on the other side, bypassing the normal network protocol stack. This makes it possible to transfer data using only the card's hardware, without intermediate memory-to-memory copies, and to write the data directly into the application memory of the destination computer.
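As a rough illustration of what this looks like from an application, below is a minimal sketch of an RDMA write using the verbs API (libibverbs) that InfiniBand/RoCE adapters expose. It assumes a queue pair has already been created and connected and that the peer's buffer address and rkey were exchanged out of band; that setup, and error handling, are omitted here.

```c
/* Minimal sketch of an RDMA write using libibverbs.  Queue-pair creation,
 * connection setup and the out-of-band exchange of the peer's address and
 * rkey are omitted; those values are assumed to be passed in by the caller. */
#include <stdint.h>
#include <infiniband/verbs.h>

int rdma_write_example(struct ibv_pd *pd, struct ibv_qp *qp,
                       void *buf, size_t len,
                       uint64_t remote_addr, uint32_t remote_rkey)
{
    /* Register the local buffer so the HCA can DMA from it directly,
     * with no CPU copy into kernel networking buffers. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    /* RDMA WRITE: the HCA places the data straight into the remote
     * application's registered memory; the remote CPU is not involved. */
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,
        .send_flags = IBV_SEND_SIGNALED,
    };
    wr.wr.rdma.remote_addr = remote_addr; /* obtained out of band */
    wr.wr.rdma.rkey        = remote_rkey; /* obtained out of band */

    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```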

NVIDIA GPUDirect

GPUDirect RDMA allows peripheral PCI Express devices, such as the network adapter, to access GPU memory directly. Designed specifically for the needs of GPU acceleration, it provides direct communication between NVIDIA GPUs in remote systems. This eliminates the buffer copies through the system CPU and system memory that would otherwise be required, improving performance.
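Below is a minimal sketch of how this looks from the application side: GPU memory is allocated with CUDA and then registered with the same ibv_reg_mr() call used in the previous example. This assumes GPUDirect RDMA support is actually in place (a compatible NVIDIA/Mellanox HCA plus the nvidia-peermem kernel module); the helper name register_gpu_buffer is just for illustration.

```c
/* Sketch: registering GPU memory for GPUDirect RDMA.  Assumes a CUDA-capable
 * GPU, a compatible HCA, and the nvidia-peermem (nv_peer_mem) kernel module,
 * which lets the verbs stack register device memory directly. */
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t len)
{
    void *gpu_buf = NULL;

    /* Allocate the buffer in GPU memory rather than host memory. */
    if (cudaMalloc(&gpu_buf, len) != cudaSuccess)
        return NULL;

    /* With GPUDirect RDMA, the same ibv_reg_mr() call used for host memory
     * can register the device pointer; the HCA then DMAs to and from GPU
     * memory directly, with no staging copy through the CPU or system RAM. */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr)
        cudaFree(gpu_buf);
    return mr;
}
```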

Other considerations

We have introduced GPUDirect as a way to build a network that does not involve the CPU.

GPUDirect requires a server with a GPU and a compatible adapter card. Once computation is fast enough with these in place, the performance of the storage that supplies the source data also becomes important. And where clusters are built, continued improvements to computational methods and algorithms will still be needed.
(If you do not build a cluster, storage may be the only thing left to consider, but in practice there is real demand from users to build and operate clusters.)

To make an HPC/AI system a success, you need not only techniques such as the GPUDirect introduced here, but also optimization of both hardware and software.

NVIDIA technology essential for building AI systems

We have introduced the characteristics of AI processing and the technologies that will be essential for building servers and networks from now on. NVIDIA Mellanox network adapter cards (NICs) come packed with these technologies. If you are considering a build, please see the links below.

NVIDIA Mellanox BlueField® SmartNIC / ConnectX-6

NVIDIA® Mellanox® BlueField® SmartNIC for InfiniBand & Ethernet

NVIDIA MELLANOX Network Adapter Card (NIC) Products