*This article is based on a lecture given at the Macnica Data・AI Forum 2024 Autumn held in October 2024.
Introduction
In recent years, with the advancement of generative AI technology, many companies have been trying to apply generative AI to a wide range of business problems. Since the emergence of ChatGPT in particular, generative AI has attracted widespread attention. However, when companies introduce generative AI, data protection and privacy concerns are major barriers. For companies that find it difficult to use cloud services, building a local LLM (Large Language Model) in an on-premises environment is one solution. This article covers the history and basics of generative AI, compares cloud and on-premises environments, and explains the key points of building a local LLM.
The History of Generative AI
The development of generative AI began with rule-based AI in the 1950s, followed by machine learning in the 1980s, neural networks in the 1990s and 2000s, and the Transformer model that emerged in 2017. With the emergence of the Transformer model, generative AI made a big leap forward, and now advanced models such as GPT and BERT have been developed. These models are applied to a wide range of tasks, including response systems like chatbots, text generation, and image generation.
What is generative AI?
Generative AI is a type of machine learning that generates new data and information based on existing data and information. For example, it can generate data in various formats such as text, images, and program code. When a company uses generative AI, a model with domain-specific knowledge is required. However, since general-purpose LLMs are trained on public data, they lack knowledge of specific industries and internal company operations. Customization is therefore required to specialize a model for a company's unique use cases.
Cloud vs. On-Premises
When introducing generative AI, choosing between a cloud environment and an on-premises environment is a major decision. Each has its own advantages and disadvantages.
Customizability
Cloud services typically allow customization only within the scope of the models and tools provided. On-premises environments, by contrast, allow a high degree of customization to fit a company's needs, which is crucial for optimizing generative AI around a company's own data and specifications.
Security
In a cloud environment, data is uploaded to an external server, which increases security risk. This risk cannot be ignored, especially when handling confidential or personal information. In an on-premises environment, data stays within your own network, which provides stronger assurance. Cloud providers such as Azure and AWS do offer advanced security features, but they do not match the complete control of an on-premises deployment.
Cost
Cloud services generally have low initial costs and are billed on a pay-as-you-go basis. However, costs become harder to predict as usage grows. On-premises environments, on the other hand, require a large initial investment, but operational costs tend to be stable and long-term costs are easier to estimate.
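This trade-off can be sketched as a simple break-even calculation. All figures below are illustrative assumptions, not real pricing:

```python
# Hypothetical break-even comparison between cloud and on-premises costs.
# All dollar figures are illustrative assumptions, not real pricing.

def cumulative_cost(upfront, monthly, months):
    """Total cost after a given number of months."""
    return upfront + monthly * months

def break_even_month(cloud_monthly, onprem_upfront, onprem_monthly, horizon=120):
    """First month at which on-premises becomes cheaper than cloud, or None."""
    for m in range(1, horizon + 1):
        if cumulative_cost(onprem_upfront, onprem_monthly, m) < cumulative_cost(0, cloud_monthly, m):
            return m
    return None

# Example: cloud at $10k/month vs. on-prem at $200k upfront + $3k/month ops.
month = break_even_month(cloud_monthly=10_000, onprem_upfront=200_000, onprem_monthly=3_000)
print(month)  # → 29 (on-premises overtakes cloud in month 29)
```

With these assumed numbers, on-premises becomes cheaper after roughly two and a half years; the actual crossover point depends entirely on real usage and pricing.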
Time to operationalization
Cloud services can be adopted quickly because you can start using them immediately after signing a contract. An on-premises environment takes time to purchase and configure hardware, but once built, it offers a high degree of freedom and control.
Ease of integration
Cloud services are accessed over the internet, which is convenient when company regulations impose few restrictions. On-premises environments are built within a company's own network, which makes it easier to integrate with internal systems and applications and allows centralized data management.
Cloud and on-premises each have their own advantages and disadvantages, so it is important to choose the environment that best fits your company's needs and the nature of your data. Especially when handling confidential data, the security and customizability of an on-premises environment are major advantages.
Key points for building a local LLM on-premises
Fine-Tuning vs. RAG
There are two main approaches to adapting generative AI: fine-tuning and RAG (Retrieval-Augmented Generation). Fine-tuning retrains an existing model to adapt it to a specific task or domain, and has the advantage of producing highly accurate answers. RAG, by contrast, searches an existing database and combines the retrieved information to generate answers, and has the advantage of being easy to keep up to date with the latest information. Many companies combine the two to build flexible systems.
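The RAG flow described above can be sketched in a few lines. The documents, the word-overlap scoring, and the prompt format are all illustrative assumptions; a real system would use embedding-based retrieval and an actual LLM:

```python
# Minimal RAG sketch: retrieve the most relevant document by word overlap,
# then build a prompt that combines it with the user's question.
# The documents and scoring are toy assumptions, not a production retriever.

DOCS = [
    "Expense reports must be submitted within 30 days of purchase.",
    "The on-premises GPU cluster is reserved through the internal portal.",
    "New employees receive laptops from the IT department on day one.",
]

def retrieve(question, docs):
    """Return the document sharing the most words with the question."""
    q_words = set(question.lower().split())
    return max(docs, key=lambda d: len(q_words & set(d.lower().split())))

def build_prompt(question, docs):
    """Combine the retrieved context with the question for the LLM."""
    context = retrieve(question, docs)
    return f"Answer using this context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("How do I reserve the GPU cluster?", DOCS)
print(prompt)
```

Because the answer is grounded in retrieved text, updating the knowledge base is as simple as editing the document store, which is exactly the freshness advantage noted above.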
Points that are often overlooked
When building an LLM locally, it is important to choose the model's parameter count, the inference engine, and the inference server carefully. More parameters generally mean higher model accuracy, but also more GPU memory. An inference engine is needed for speed and memory optimization, and the inference server must account for throughput and latency when many users make requests simultaneously. An appropriate system must be designed with these points in mind.
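The relationship between parameter count and GPU memory can be estimated with a back-of-the-envelope calculation. This is a rule of thumb only: actual usage also includes the KV cache, activations, and runtime overhead, so treat the numbers as lower bounds:

```python
# Rough GPU memory estimate for holding LLM weights at various precisions.
# Lower bound only: KV cache, activations, and runtime overhead come on top.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params_billion, dtype="fp16"):
    """Approximate memory (GB) needed just to store the model weights."""
    return num_params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1024**3

# A 70B-parameter model in fp16 needs roughly 130 GB for weights alone,
# so it cannot fit on a single 80 GB GPU without quantization or sharding.
print(round(weight_memory_gb(70, "fp16")))  # → 130
```

This is why quantization (int8/int4) and multi-GPU sharding are standard tools when sizing on-premises inference hardware.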
How to build a local LLM on-premises
Building with open source software (OSS)
Using open source software (OSS) allows systems to be built with a high degree of freedom at low cost. However, it demands advanced technical skills and time. In particular, building inference pipelines and customizing models require specialized knowledge, and without appropriate support there is a risk of project delays.
How to build a local LLM using NVIDIA's SDK
On the other hand, by utilizing NeMo and NIM, generative AI-related SDKs optimized for NVIDIA GPUs, you can combine ready-to-use containers and receive support from NVIDIA, enabling efficient development.
LLM Development to Deployment Platform: NVIDIA NeMo
NVIDIA NeMo is a platform that comprehensively supports LLM development and deployment. NeMo is composed of multiple microservices and provides a wide range of functions, from data curation to fine-tuning, evaluation, deployment of embedding models, and guardrails that suppress undesirable answers. This allows companies to use generative AI efficiently and safely.
NVIDIA NIM: Inference SDK to accelerate deployment of generative AI in the enterprise
NVIDIA NIM is an SDK specialized for generative AI inference, designed to make containers for foundation models and RAG models easy to use. By building on inference engines such as TensorRT-LLM and Triton Inference Server, it provides a highly efficient and scalable inference environment. It also supports industry-standard APIs, making it easy to integrate with existing systems.
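As a sketch of what that industry-standard (OpenAI-compatible) API surface looks like, the snippet below builds a chat-completion request for a locally deployed NIM container. The URL, port, and model name are assumptions for illustration; check your NIM container's documentation for the actual values:

```python
# Sketch of preparing a request for a locally deployed NIM endpoint via its
# OpenAI-compatible chat completions API. URL and model name are assumed
# placeholders for a hypothetical local deployment.
import json

NIM_URL = "http://localhost:8000/v1/chat/completions"  # assumed local endpoint

def build_request(model, user_message):
    """Build the JSON body for an OpenAI-style chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 256,
    }

body = build_request("meta/llama-3.1-8b-instruct", "Summarize our security policy.")
print(json.dumps(body, indent=2))
# Once the container is running, send it with e.g. requests.post(NIM_URL, json=body).
```

Because the request shape matches the OpenAI API, existing client code and tooling can often be pointed at the local endpoint with only a URL change.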
What kind of system is required for LLM development/inference?
The ideal system is one that supports everything consistently, from development to inference. To achieve this, it is efficient to introduce scalable infrastructure based on Kubernetes and a high-performance job scheduler such as Run:ai. Appropriate allocation of GPU resources and auto-scaling according to load are also important. This allows companies to realize a development environment with high ROI and to deploy generative AI flexibly.
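As a toy illustration of load-based auto-scaling, the rule below picks a GPU replica count from the incoming request rate. The per-replica capacity and replica cap are assumptions for illustration; real schedulers such as the Kubernetes Horizontal Pod Autoscaler or Run:ai use richer metrics (latency, queue depth, GPU utilization):

```python
# Toy autoscaling rule: choose a GPU replica count from request load.
# Capacity and cap are illustrative assumptions, not tuned values.
import math

def replicas_needed(requests_per_sec, capacity_per_replica=5, max_replicas=8):
    """Scale replicas up with load, with a floor of 1 and a GPU-count cap."""
    needed = math.ceil(requests_per_sec / capacity_per_replica)
    return max(1, min(needed, max_replicas))

print(replicas_needed(12))  # → 3 replicas for 12 req/s at 5 req/s each
```

The cap matters on-premises: unlike the cloud, the GPU pool is fixed, so the scheduler's job is to share it fairly rather than provision more.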
Summary
For companies to make use of generative AI, a high level of customization and security is required, and building a local LLM in an on-premises environment is an effective way to achieve this. By utilizing NVIDIA's SDKs, everything from development to inference can be supported efficiently, accelerating the adoption of generative AI in companies. As the technology advances, even more user-friendly solutions are expected in the future.

Macnica Inc.
First Technology Division, Technology Department 4, Section 1
Kawabe Kuga
He joined Macnica in 2023. He currently promotes NVIDIA's Jetson embedded GPUs and software development tools related to generative AI to companies in order to solve customer problems.