
Contents of this installment

When introducing a local LLM (large language model), it is essential not only to run the model but also to customize and fine-tune it to suit the business. This enables highly accurate responses that take company-specific terminology and context into account, dramatically improving the practicality of AI.

In this third installment, we take a closer look at LoRA, a lightweight fine-tuning method; RAG (Retrieval-Augmented Generation) using internal data; and technical considerations for security and operation.

 

[Introduction to Local LLM]

Part 1: A complete guide to the basics and applications of local LLMs

Part 2: How to actually build a local LLM?

Part 3: Customization and Fine-Tuning

Lightweight Fine-Tuning with LoRA

LoRA (Low-Rank Adaptation) is a technique for efficient, low-cost fine-tuning of existing LLMs. Conventional full fine-tuning requires retraining billions to tens of billions of parameters, which tends to be extremely costly in terms of GPU resources and time.

LoRA adds small low-rank update matrices alongside the model's weight matrices, keeping the original weights frozen and training only those additions.
This provides the following benefits:

- Reduced memory usage: Far fewer parameters are updated, so VRAM consumption drops significantly.
- Faster training: In many cases, fine-tuning can be completed within a few hours to a few days.
- Model reusability: Multiple LoRA adapters can be swapped in and out for different purposes while keeping the original model unchanged.


For example, LoRA is extremely effective when creating models specialized for limited contexts, such as answering internal FAQs or handling terminology for specific industries. Furthermore, LoRA can also be applied to quantized models, enabling both a lightweight inference environment and fine-tuning.
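
As a concrete illustration, the sketch below shows how a LoRA adapter can be attached to an existing model using the Hugging Face PEFT library. The base model name, rank, and target modules are placeholder assumptions and would need to be adjusted for the actual model and task.

```python
# Minimal LoRA fine-tuning setup with Hugging Face PEFT (a sketch, not a
# complete training script). Model name, rank, and target modules are
# illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base_model_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# LoRA adds trainable low-rank matrices; the base weights stay frozen.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                  # rank of the low-rank update
    lora_alpha=32,                         # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections; model-dependent
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters

# After training, only the small adapter is saved; it can be swapped per task.
model.save_pretrained("faq-lora-adapter")
```

Because only the adapter weights are saved, several adapters can be maintained for different departments or tasks and loaded onto the same base model as needed.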

RAG (Retrieval-Augmented Generation) using in-house data

RAG (Retrieval-Augmented Generation) is a technique that combines the generative capabilities of an LLM with external knowledge to produce more accurate, context-aware responses. In a local environment in particular, drawing on internal documents and knowledge bases enables flexible answers that do not depend on the model's pre-training.

The basic components of a RAG pipeline are as follows (a minimal code sketch follows the list):

  1. Embedding: Internal documents are vectorized and stored as data that captures semantic similarity.
  2. Vector DB storage: The embeddings are stored in a vector database such as Faiss, Weaviate, or Qdrant.
  3. Retrieval: The user's question is embedded and used to search for similar documents.
  4. Generation: The retrieved documents are merged into the prompt and the LLM generates the response.
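
The sketch below walks through these four steps using sentence-transformers and a simple in-memory Faiss index; the embedding model, documents, and question are placeholder assumptions, and in practice the final prompt would be passed to the local LLM.

```python
# Minimal RAG flow: embed documents, index them, retrieve, and build a prompt.
# The embedding model, documents, and question are illustrative assumptions.
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

docs = [
    "Expense claims above 50,000 yen require manager approval.",            # sample internal doc
    "Product X supports operating temperatures from 0 to 45 degrees Celsius.",
]

# 1. Embedding: vectorize the internal documents.
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
doc_vecs = embedder.encode(docs, normalize_embeddings=True).astype("float32")

# 2. Vector DB storage: a flat Faiss index stands in for a vector database here.
index = faiss.IndexFlatIP(doc_vecs.shape[1])  # inner product = cosine on normalized vectors
index.add(doc_vecs)

# 3. Retrieval: embed the question and search for similar documents.
question = "What is the approval limit for expenses?"
q_vec = embedder.encode([question], normalize_embeddings=True).astype("float32")
scores, ids = index.search(q_vec, k=1)

# 4. Generation: merge the retrieved text into the prompt for the local LLM.
context = "\n".join(docs[i] for i in ids[0])
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
# `prompt` is then sent to the local LLM for the final answer.
```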


This configuration enables accurate answers to questions that pre-training alone cannot cover, such as questions about company regulations or detailed product specifications. Furthermore, linking with S3-compatible storage makes it easier to manage and update large document collections.

RAG is particularly useful for tasks such as:

- In-house FAQ auto-reply
- Summarizing and searching technical documents
- Making use of customer-response history
- Turning product manuals into a knowledge base

Security and operational considerations

When customizing and fine-tuning a local LLM, it is essential to put security and operational controls in place. Especially when handling internal data, attention must be paid to the following technical points.

1. Data Confidentiality and Access Control

If the data used for training contains personal or confidential information, the following measures are required:

- PII masking: Remove or anonymize personally identifiable information in advance (a masking sketch follows this list).
- Access log management: Record who accessed which data.
- Permissions management: Apply role-based restrictions to access to the training and inference environments.
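
As a simple illustration of PII masking, the sketch below replaces e-mail addresses and phone numbers with typed placeholders before text is exported for training. The regex patterns are illustrative assumptions; production pipelines typically combine dictionaries, NER models, and manual review.

```python
# Rule-based PII masking before training data is exported (a sketch).
# The patterns below are illustrative assumptions, not an exhaustive list.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b0\d{1,4}-\d{1,4}-\d{3,4}\b"),  # Japanese-style numbers
}

def mask_pii(text: str) -> str:
    """Replace detected PII spans with typed placeholders."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(mask_pii("Contact taro@example.co.jp or 03-1234-5678 for details."))
# -> Contact [EMAIL] or [PHONE] for details.
```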

2. Model versioning and auditability

Fine-tuned models must be managed separately from the original model. The following practices are recommended:

- Introduce a model registry: Use tools such as MLflow or Weights & Biases to centrally manage model versions and metadata (a logging sketch follows this list).
- Record change history: Keep a record of which data and parameters were used for training.
- Maintain audit logs: Record model usage history and outputs so they can be verified later.
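
As one possible way to record a fine-tuning run, the sketch below logs parameters, a metric, and the resulting adapter with MLflow; the experiment name, values, and paths are placeholder assumptions.

```python
# Recording a fine-tuning run with MLflow (a sketch).
# Experiment name, parameter values, and paths are illustrative assumptions.
import mlflow

mlflow.set_experiment("faq-lora-finetune")

with mlflow.start_run(run_name="lora-r16-v2"):
    # Record which data and hyperparameters produced this adapter.
    mlflow.log_params({
        "base_model": "meta-llama/Llama-3.1-8B-Instruct",
        "dataset_version": "faq_2024_06",
        "lora_rank": 16,
    })
    mlflow.log_metric("eval_loss", 1.23)
    # Store the trained adapter weights as a versioned artifact.
    mlflow.log_artifacts("outputs/faq-lora-adapter", artifact_path="adapter")
```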

3. Stability and scalability of the inference environment

An inference environment must address the following technical challenges:

- Latency optimization: Use quantized models and high-speed inference engines such as TensorRT-LLM (a simple latency-measurement sketch follows this list).
- Scalable configuration: With NVIDIA NIM™ and similar technologies, the model can be packaged as microservices and run on Kubernetes.
- Anomaly detection and feedback loop: Detect incorrect responses and inappropriate outputs and feed them back into the improvement cycle.
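
As a simple starting point for latency work, the sketch below measures end-to-end response time against a local inference server that exposes an OpenAI-compatible API (as NVIDIA NIM and many local serving stacks typically do); the endpoint URL and model name are assumptions.

```python
# Measuring end-to-end latency against a local, OpenAI-compatible endpoint
# (a sketch). Endpoint URL and model name are illustrative assumptions.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",   # placeholder model name
    messages=[{"role": "user", "content": "Summarize our return policy."}],
    max_tokens=128,
)
latency = time.perf_counter() - start

print(f"end-to-end latency: {latency:.2f} s")
print(response.choices[0].message.content)
```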



Customizing and fine-tuning local LLMs is not simply a matter of introducing technology; it is the key to strengthening business processes and making better use of organizational knowledge. By combining lightweight model adaptation with LoRA, integration of in-house knowledge with RAG, and technical measures for security and operation, practical and safe use of AI becomes possible.

Macnica provides technical support for this kind of build-out and operation. Drawing on solutions centered on NVIDIA products, we provide consistent support from PoC to full-scale operation. For more information, please see the following.

AI agent construction support

Macnica supports the development of AI agents in an on-premises environment using NVIDIA NIM through a two-month, hands-on guided program. In the first month, participants learn the basics of agent development with LangGraph and verify operation using a ReAct (Reasoning and Acting) configuration. In the second month, they design and implement use cases based on business issues and learn typical workflows. The inference server is built with NIM, creating an environment in which confidential information can be handled safely. Through sample code in Jupyter Notebook format, Q&A via email and chat, regular meetings, and lectures on how to use NIM, participants learn practical and efficient AI agent development.



We can also provide support for automating RAG accuracy evaluation, as shown in the video below.

AI TRY NOW PROGRAM

Macnica's "AI TRY NOW PROGRAM" is a support service for businesses that allows them to pre-test NVIDIA's latest AI solutions in an on-premises environment. Software such as NVIDIA NeMo™, NIM, and AI Enterprise can be used in high-performance GPU environments such as the DGX B200. Macnica engineers prepare the optimal configuration in advance, eliminating the hassle of setting up the environment, allowing users to immediately begin developing and testing AI agents and physical AI. The program also provides sample code in Jupyter Notebook format, technical Q&A support via email and chat, and regular technical meetings, enabling hands-on learning and testing. Furthermore, it supports KPI measurement and ROI evaluation, allowing users to quantitatively understand the effects of AI implementation and smoothly transition from PoC to production operation. This program combines practicality and flexibility for companies looking to accelerate their AI adoption.

For quotes and inquiries, click here.