Introduction
NVIDIA has unveiled the NVIDIA DGX Spark™, and it is seeing use in a growing variety of applications. Spark is an excellent standalone GPU workstation, and thanks to its built-in ConnectX-7 network interface it also delivers strong performance in multi-node environments.
Many people have shared examples of implementations using two DGX Spark units.
Distributed training and distributed inference are essential technologies for GPU utilization going forward, but setting up the necessary environment can be quite challenging.
This time, with an eye toward larger GPU clusters, we will show how we built a GPU cluster from four DGX Spark units and an NVIDIA® Spectrum®-2 SN3000 Ethernet switch, and ran distributed training on it.
(Photo) Bottom row: NVIDIA Spectrum-2 SN3000 Ethernet switch (1 unit)
Verification environment
Hardware
- GPU computing nodes: 4 NVIDIA DGX Spark units
- Network switch: NVIDIA Spectrum-2 SN3000 Ethernet switch (1 unit)
Software
- OS: DGX OS (Ubuntu 24.04)
- Distributed training: NeMo Automodel (Docker container)
- nvcr.io/nvidia/nemo-automodel:26.02.00
Setup
Network
First, prepare the GPU cluster itself: an environment in which each Spark is connected over a high-speed network. Because we also needed to update packages on each DGX Spark, we set up both an internet connection and the high-speed network.
This page may be helpful. The network is ready once each node can ping the others over the high-speed links.
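As a minimal sketch of the high-speed side, a static address can be assigned to each node's ConnectX-7 port as below. The interface name and addresses are assumptions for illustration; check yours with `ip link`.

```shell
# Assumed interface name (enP2p1s0f0np0) and subnet (10.0.0.0/24);
# adjust to what `ip link` shows on your node.
sudo ip addr add 10.0.0.10/24 dev enP2p1s0f0np0
sudo ip link set enP2p1s0f0np0 up

# From another node, confirm reachability over the high-speed link:
ping -c 3 10.0.0.11
```

Note that addresses set this way do not persist across reboots; for a permanent setup, configure them through the OS network configuration instead.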
GPU cluster connectivity check
nccl-tests is commonly used to verify that a GPU cluster functions correctly. Following a published two-unit DGX Spark example, we configure and verify operation with four DGX Spark units.
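If nccl-tests is not yet built on each node, a typical build with MPI support looks like the sketch below. It assumes Open MPI and the CUDA toolkit are already installed; the `MPI_HOME` path is an assumption and may differ on your system.

```shell
# Clone and build nccl-tests with MPI support on every node.
git clone https://github.com/NVIDIA/nccl-tests.git "$HOME/nccl-tests"
cd "$HOME/nccl-tests"
make MPI=1 MPI_HOME=/usr/lib/aarch64-linux-gnu/openmpi  # MPI_HOME is an assumption
# Binaries such as all_gather_perf end up in ./build/
```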
Execution command
mpirun -np 4 -H 10.0.0.10:1,10.0.0.11:1,10.0.0.12:1,10.0.0.13:1 --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH $HOME/nccl-tests/build/all_gather_perf
Execution result
Warning: Permanently added '10.0.0.11' (ED25519) to the list of known hosts.
Warning: Permanently added '10.0.0.12' (ED25519) to the list of known hosts.
Warning: Permanently added '10.0.0.13' (ED25519) to the list of known hosts.
# nccl-tests version 2.18.2 nccl-headers=22809 nccl-library=22809
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0 unalign: 0
#
# Using devices
# Rank 0 Group 0 Pid 61045 on spark-XXXX device 0 [000f:01:00] NVIDIA GB10
# Rank 1 Group 0 Pid 61611 on spark-XXXX device 0 [000f:01:00] NVIDIA GB10
# Rank 2 Group 0 Pid 63157 on spark-XXXX device 0 [000f:01:00] NVIDIA GB10
# Rank 3 Group 0 Pid 59461 on spark-XXXX device 0 [000f:01:00] NVIDIA GB10
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
33554432 2097152 float none -1 1251.56 26.81 20.11 0 1222.51 27.45 20.59 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 20.3465
#
# Collective test concluded: all_gather_perf
#
The measured average bus bandwidth of 20.3465 GB/s is about 81% of the 25 GB/s theoretical maximum of a 200 GbE link (200 Gb/s ÷ 8), which is a reasonable result for this collective.
Running distributed training
Launching containers for distributed training
Start the NeMo Automodel container on each DGX Spark node.
sudo docker run --gpus all -it --rm \
--ipc=host \
--net=host \
--privileged \
-w /opt/Automodel \
nvcr.io/nvidia/nemo-automodel:26.02.00 /bin/bash
*This command is for testing purposes only (note the `--privileged` flag); please exercise caution when using it in production.
Running a distributed training benchmark (single node)
Before testing with multiple nodes, let's first try training with a single node.
# The `meta-llama/Llama-3.1-8B` model used here must be downloaded, so log in to Hugging Face and request access to the model beforehand.
hf auth login --token hf_XXXXXXXXXXXXXX
Run on a single node
torchrun nemo_automodel/recipes/llm/benchmark.py --config examples/llm_finetune/llama3_1/llama3_1_8b_peft_benchmark.yaml
Execution results (excerpt)
2026-03-26 06:25:40 | INFO | __main__ | ============================================================
2026-03-26 06:25:40 | INFO | __main__ | Benchmarking Summary
2026-03-26 06:25:40 | INFO | __main__ | ============================================================
2026-03-26 06:25:40 | INFO | __main__ | Total setup time: 412.62 seconds
2026-03-26 06:25:40 | INFO | __main__ | Total warmup time (5 steps): 580.72 seconds
2026-03-26 06:25:40 | INFO | __main__ | Total iteration time (5 steps): 576.31 seconds
2026-03-26 06:25:40 | INFO | __main__ | Average iteration time: 115.262 seconds (excluding first 5 warmup iterations)
2026-03-26 06:25:40 | INFO | __main__ | Average MFU: 3.708642% (excluding first 5 warmup iterations)
2026-03-26 06:25:40 | INFO | __main__ | ============================================================
*If you run the sample recipe as is, Average MFU is computed against a different GPU's peak throughput, so we ignore it here. The metric we use as our indicator of training performance is Average iteration time.
Running a distributed training benchmark (multi-node)
Now we can finally benchmark distributed training across multiple nodes. On every DGX Spark, set the master node's IP address and the network interface to use.
Variable settings for distributed training (example)
export MASTER_ADDR=10.0.0.1
export MASTER_PORT=29500
export NCCL_SOCKET_IFNAME=enP2p1s0f0np0
export GLOO_SOCKET_IFNAME=enP2p1s0f0np0
export UCX_NET_DEVICES=enP2p1s0f0np0
* Please specify the interface on the high-speed network side.
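If you are unsure which interface is on the high-speed side, a small helper can pick it out of `ip -o -4 addr show` output. The 10.0.0.0/24 subnet is taken from this article; the function itself is our own sketch, not part of any NVIDIA tooling.

```shell
# find_hs_if: read `ip -o -4 addr show` output on stdin and print the
# name of the first interface with an address on the 10.0.0.0/24 subnet.
find_hs_if() {
  awk '$3 == "inet" && $4 ~ /^10\.0\.0\./ { print $2; exit }'
}

# Typical use on a node:
#   HS_IF=$(ip -o -4 addr show | find_hs_if)
#   export NCCL_SOCKET_IFNAME="$HS_IF" GLOO_SOCKET_IFNAME="$HS_IF" UCX_NET_DEVICES="$HS_IF"
```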
Benchmark execution command (master node)
torchrun \
--nnodes=4 \
--nproc-per-node=1 \
--node_rank=0 \
--master_addr=$MASTER_ADDR \
--master_port=$MASTER_PORT \
nemo_automodel/recipes/llm/benchmark.py \
--config examples/llm_finetune/llama3_1/llama3_1_8b_peft_benchmark.yaml
Benchmark execution command (worker node)
torchrun \
--nnodes=4 \
--nproc-per-node=1 \
--node_rank=<1, 2, or 3> \
--master_addr=$MASTER_ADDR \
--master_port=$MASTER_PORT \
nemo_automodel/recipes/llm/benchmark.py \
--config examples/llm_finetune/llama3_1/llama3_1_8b_peft_benchmark.yaml
* On each worker Spark, set `--node_rank` to 1, 2, or 3 respectively before running.
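To avoid editing the rank by hand on every machine, the invocation above can be wrapped in a small helper. The wrapper and the `spark-1`..`spark-3` hostnames are our own sketch, not part of the recipe; it assumes `MASTER_ADDR` and `MASTER_PORT` are exported as shown earlier.

```shell
# build_torchrun_cmd RANK: print the benchmark command for a given node rank.
build_torchrun_cmd() {
  printf '%s' "torchrun --nnodes=4 --nproc-per-node=1 --node_rank=$1 \
--master_addr=$MASTER_ADDR --master_port=$MASTER_PORT \
nemo_automodel/recipes/llm/benchmark.py \
--config examples/llm_finetune/llama3_1/llama3_1_8b_peft_benchmark.yaml"
}

# Example: start workers 1-3 over SSH (hostnames are assumptions):
#   for r in 1 2 3; do ssh "spark-$r" "$(build_torchrun_cmd "$r")" & done; wait
```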
Distributed training benchmark results
This time we ran with a data-parallel degree of DP=4.
Relative to a single Spark, training performance was approximately 1.75 times higher with two Sparks and approximately 3.44 times higher with four Sparks (about 86% scaling efficiency at four nodes).
This shows that even as the number of DGX Spark units grows, the network does not become a bottleneck, and training speed scales efficiently.
Summary
In this study, we built a GPU cluster from DGX Spark units and an SN3000 switch and compared training time across 1-, 2-, and 4-Spark configurations.
✅ Utilization of high-speed networks
- High-speed 200 GbE connectivity makes inter-node communication less likely to become a bottleneck.
- Smooth data transfer between nodes (DGX Spark) enables efficient distributed training.
✅ Nearly linear scaling
- Compared to a single Spark, the 4-Spark configuration is approximately 3.44 times faster.
- Scaling remains nearly linear as nodes are added.
Building a small cluster is excellent practice for learning distributed training and distributed inference techniques, and we encourage you to try it.
Inquiry
If you are considering implementing NVIDIA DGX Spark, please feel free to contact us.