
In This Article

In recent years, advances in generative AI have driven rapid adoption of "local LLMs (large language models)" by companies and research institutions. Even as cloud-based AI services become more common, operating LLMs in a local environment is an extremely effective option for ensuring security, enabling customization, and optimizing costs.

In this second article, we give a practical, detailed explanation of the overall system needed to build a local LLM, the hardware and software requirements, representative OSS tools, and the support Macnica provides.

[Introduction to Local LLMs]

Part 1: A complete guide to the basics and applications of local LLMs

Part 2: How do you actually build a local LLM?

Part 3: Customization and fine-tuning

Overview of the overall system

Building a local LLM requires a system design that integrates multiple technological elements. The main components are as follows:

  • Computational resources: GPU servers for model training and inference
  • Data management: Storage for saving and managing training data and model files (e.g., S3-compatible object storage)
  • Network environment: Network switches that enable communication between multiple nodes and external collaboration
  • Control and operation software: Tools for model execution, tracing, agent configuration, etc.

Combining these components enables flexible LLM operation suited to the application. It is important to design for the purpose at hand, from a lightweight inference-only configuration to a high-performance configuration that also supports training. In particular, when considering fine-tuning on in-house data or a RAG (Retrieval-Augmented Generation) configuration, integration with databases and search engines must be taken into account.

Required Hardware

The GPU server is the core of a local LLM system. While CPU and RAM are also important, GPU performance in particular is directly linked to model processing speed and accuracy. For inference only, a relatively lightweight configuration is sufficient, but handling a 70B-scale model requires multiple high-performance GPUs (e.g., NVIDIA H200/B200) and large amounts of memory.

Additionally, model quantization can be used to reduce the required resources while keeping inference accuracy high. Quantization formats such as INT8 and INT4 can significantly reduce model size while maintaining practical accuracy.
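As a rough illustration of these sizing considerations (a minimal back-of-the-envelope sketch that counts only the model weights and ignores KV cache, activations, and runtime overhead, so real requirements are higher), the weight memory footprint can be estimated from the parameter count and the bytes used per parameter:

```python
# Rough estimate of GPU memory needed just for model weights.
# Assumption: KV cache, activations, and runtime overhead are ignored,
# so actual requirements are noticeably higher than these figures.

BYTES_PER_PARAM = {
    "FP32": 4.0,
    "BF16/FP16": 2.0,
    "INT8": 1.0,
    "INT4": 0.5,
}

def weight_memory_gb(num_params_billions: float, precision: str) -> float:
    """Return approximate weight memory in GB for a given parameter count."""
    total_bytes = num_params_billions * 1e9 * BYTES_PER_PARAM[precision]
    return total_bytes / 1e9  # express in (decimal) GB

for model, size_b in [("8B", 8), ("70B", 70), ("405B", 405)]:
    line = ", ".join(
        f"{prec}: {weight_memory_gb(size_b, prec):.0f} GB" for prec in BYTES_PER_PARAM
    )
    print(f"{model} -> {line}")

# A 70B model at BF16 already needs roughly 140 GB for weights alone, which is
# why single-GPU setups fall short and INT8/INT4 quantization or multi-GPU
# configurations are used.
```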

Depending on the application, S3-compatible storage and high-speed network switches may also be introduced. Optimizing bandwidth and latency is especially important when performing distributed inference and training across multiple nodes.
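As one illustration of the storage side (a minimal sketch assuming an S3-compatible object store such as MinIO is reachable; the endpoint URL, credentials, bucket, and file names below are placeholders, not part of any specific product), model files can be moved to and from such storage with a standard S3 client:

```python
# Minimal sketch: storing and retrieving model files on S3-compatible storage.
# Assumptions: boto3 is installed and an S3-compatible endpoint (e.g., MinIO)
# is reachable; endpoint URL, credentials, bucket, and paths are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://storage.local:9000",   # placeholder endpoint
    aws_access_key_id="YOUR_ACCESS_KEY",         # placeholder credentials
    aws_secret_access_key="YOUR_SECRET_KEY",
)

bucket = "llm-models"                            # placeholder bucket name

# Upload a fine-tuned checkpoint so other nodes can load it.
s3.upload_file("checkpoints/llama-3.2-3b-ft.safetensors",
               bucket, "llama-3.2-3b-ft/model.safetensors")

# Download it later on an inference node.
s3.download_file(bucket, "llama-3.2-3b-ft/model.safetensors",
                 "models/llama-3.2-3b-ft.safetensors")
```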

The following page introduces server configurations optimized for different uses; referring to these example configurations will help you select hardware suited to your purpose and can be useful when considering implementation.


See the table below for an example of GPU requirements by model.

| Model | GPUs | Precision | Profile | # of GPUs | Disk Space (GB) |
| --- | --- | --- | --- | --- | --- |
| DeepSeek R1 Distill Llama 8B RTX | RTX 6000 Ada Generation | INT4 AWQ | Throughput | 1 | 5.42 |
| | GeForce RTX 5090 | INT4 AWQ | Throughput | 1 | 5.42 |
| | GeForce RTX 5080 | INT4 AWQ | Throughput | 1 | 5.42 |
| | GeForce RTX 4090 | INT4 AWQ | Throughput | 1 | 5.42 |
| | GeForce RTX 4080 | INT4 AWQ | Throughput | 1 | 5.42 |
| Llama 3.2 1B Instruct | L40S | FP8 | Throughput | 1 | 1.47 |
| | | FP8 | Throughput LoRA | 1 | 2.89 |
| | | BF16 | Throughput | 1 | 2.9 |
| | | BF16 | Throughput LoRA | 1 | 1.97 |
| Llama 3.2 3B Instruct | L40S | FP8 | Throughput | 1 | 4.17 |
| | | FP8 | Throughput LoRA | 1 | 4.17 |
| | | FP16 | Throughput | 1 | 6.79 |
| | | FP16 | Throughput LoRA | 1 | 6.79 |
| Llama 3.3 70B Instruct | H200 | FP8 | Throughput | 1 | 67.88 |
| | | FP8 | Latency | 2 | 68.23 |
| | | BF16 | Throughput | 2 | 68.23 |
| | | FP8 | Throughput | 2 | 68.1 |
| | | FP8 | Throughput LoRA | 2 | 68.22 |
| | | BF16 | Latency | 4 | 68.82 |
| | | FP8 | Latency | 4 | 68.68 |
| | | BF16 | Throughput | 4 | 68.76 |
| Llama 3.1 405B Instruct | H100 SXM | FP8 | Latency | 8 | 388.75 |
| | | FP16 | Latency | 16 | 794.9 |
| | A100 SXM | FP16 | Latency | 16 | 798.2 |

* Source: NVIDIA NIM™ for Large Language Models (LLMs), Supported Models: https://docs.nvidia.com/nim/large-language-models/latest/supported-models.html#llama-3-3-70b-instruct
* As of August 2025

Techniques to reduce memory usage and computational load: quantization and data type innovations

Inference and training of LLMs (large language models) require large amounts of memory and computational resources, so several techniques are used to improve efficiency. The weight data (parameters) of an LLM are typically represented as 16-bit (BF16) or 32-bit floating-point numbers; during training in particular, higher-precision data types tend to be used to maintain accuracy.

On the other hand, a technique called "quantization" is often used to reduce memory usage and improve processing speed during inference. It compresses the model by reducing the number of bits used for the weight data: converting 16-bit weights to 8-bit or 4-bit, for example, cuts the required memory to one half or one quarter, respectively.
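As a small, framework-level illustration of the idea (a minimal sketch using PyTorch's dynamic INT8 quantization on a toy model; production LLM weight quantization normally uses dedicated formats such as INT4 AWQ or GGUF rather than this API):

```python
# Minimal illustration of weight quantization: PyTorch dynamic INT8
# quantization applied to a small stack of linear layers. This demonstrates
# the concept only; real LLM deployments typically use formats such as
# INT4 AWQ or GGUF instead of this generic API.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

# FP32 footprint of the weights before quantization.
fp32_bytes = sum(p.numel() * p.element_size() for p in model.parameters())

# Convert Linear weights to INT8; activations remain in floating point.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 4096)
print(quantized(x).shape)  # inference still works after quantization
print(f"FP32 weights: {fp32_bytes / 1e6:.1f} MB "
      f"(INT8 storage is roughly one quarter of this)")
```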

However, quantization also has trade-offs. Reducing the number of bits may slightly reduce the accuracy of the model, so a balance must be found depending on the application. In particular, for tasks that require high accuracy, it is important to thoroughly evaluate the impact of quantization before introducing it.

By utilizing these techniques, it becomes possible to operate large-scale models even with limited GPU resources. Understanding and utilizing these optimization methods is key to designing a practical system for building local LLMs.

Required Software

Software that supports model management, execution, and integration is essential for operating a local LLM. Representative OSS (open source software) tools include the following:

• LangChain: A Python-based framework for building flexible, advanced interactive applications by linking multiple language models with external tools (search, calculation, databases, etc.). Agent functions and chain structures allow complex processing to be executed in stages.
• llama.cpp and vLLM: OSS that provides lightweight inference environments. Both work well with quantized models and are suited to high-speed inference in a local environment (see the sketch below).
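
As a concrete example of the OSS route (a minimal sketch, assuming vLLM is installed and a suitable GPU is available; the model identifier is a placeholder for any checkpoint you are licensed to use, here an INT4 AWQ-quantized one to keep memory low):

```python
# Minimal sketch of local inference with vLLM.
# Assumptions: vLLM is installed (pip install vllm), a suitable GPU is
# available, and the model identifier below is a placeholder for a
# checkpoint you have access to (an AWQ-quantized model keeps memory low).
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",  # placeholder model id
    quantization="awq",                    # use the INT4 AWQ weights
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize the benefits of running LLMs locally."], params)

for out in outputs:
    print(out.outputs[0].text)
```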


These OSS tools are highly flexible and scalable, allowing integration with in-house systems and the addition of custom functionality. However, because operating OSS requires a certain level of technical knowledge, some companies choose products with commercial support.

In that case, one option is to use the following tools (SDKs) built on OSS and NVIDIA's own technology:

• NeMo: NVIDIA's framework for LLM development and operation. It provides pre-trained models and fine-tuning functions, enabling stable operation in commercial environments.
• NIM: Provides inference APIs as microservices, enabling scalable operation. It is highly compatible with Kubernetes environments and supports cloud-native configurations (see the sketch after this list).
• Blueprint: Provides design templates for LLM applications, supporting rapid development and facilitating the transition from PoC to production.
• Agent Toolkit: A set of tools for easily building agent configurations, letting you design AI agents that fit your business workflow.
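
To illustrate the NIM item above (a minimal sketch under the assumption that a NIM container is already running locally and exposing its OpenAI-compatible endpoint on port 8000; the URL and model name are placeholders):

```python
# Minimal sketch of calling a locally deployed NIM inference microservice
# through its OpenAI-compatible API.
# Assumptions: a NIM container is running and listening on localhost:8000,
# and the model name below matches the deployed model (placeholder here).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # placeholder local NIM endpoint
    api_key="not-needed-for-local",       # local deployments may ignore this
)

response = client.chat.completions.create(
    model="meta/llama-3.3-70b-instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Explain RAG in two sentences."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Because the interface follows the OpenAI API conventions, an application written against a standard OpenAI-style client can be pointed at the local microservice simply by changing the base URL.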

Building a local LLM is not just a matter of introducing technology; it is also an opportunity to fundamentally rethink business processes and how data is used. A comprehensive perspective is required, covering hardware selection, software configuration, security measures, and the establishment of an operational framework.

If you have any issues or concerns about the setup, please take advantage of the support provided by Macnica. We support safe and reliable implementation through setup assistance and technical consultation, mainly around NVIDIA products. For more information, please see the following page.

Next time: customizing and fine-tuning your local LLM!

In this article, we explained the overall system needed to build a local LLM, the hardware and software requirements, representative OSS tools, and the support Macnica provides.

In the next article, we will take a closer look at LoRA, a lightweight fine-tuning method; RAG (Retrieval-Augmented Generation) using in-house data; and technical considerations for security and operation.
