Running Japanese LLM on an edge device - Llama-v3-ELYZA-JP-8B

Narrow down by specifying conditions

現在2189件がヒットしています。check

Design AI/Artificial Intelligence Software Microcomputer/Processor/DSP Qualcomm

What is an LLM (Large-Scale Language Model)?

Large Language Models (LLMs) are AImodels thatlearn language patterns from large amounts of text data and can generate natural-sounding sentences that resemble those of humans. As a foundational technology for conversational AI, includingChatGPT, it is currently one of the most talked-about fields.

GenerallyLLMis on the cloudGPUWhile it typically runs on a server, recent advancements in model optimization technology have made it feasible to run it on edge devices.
Cloud-independent on-deviceLLMThis offers the following advantages:

perspective	CloudLLM	On-deviceLLM
privacy	The data is sent to an external server.	Everything is contained within the device. Data never leaves the device.
Offline compatible	Internet connection is required.	Works completely offline
Response delay	Network round trip takes hundreds ofmillisecondsto a few seconds.	Instant response within the device
Operating costs	APIusage fees are charged on a pay-as-you-go basis.	Only an initial investment is required. No running costs.

In environments handling highly confidential data, such as factories and medical facilities, or in environments with unstable communication infrastructure, the value of LLM (Limited Licensing Management) that can be completed entirely on-device becomes particularly high.

Regarding this demonstration

This article presentsa demonstration of running the Japanese language learningmodule"Llama-3-ELYZA-JP-8B"on the NPU (Hexagon HTP) on the "Dragonwing IQ-9075 EVK"evaluation board, which is equipped witha Qualcomm IQ-9075. We will show you how this 8billion-parameter Japanese languagelearning modulegenerates Japanese text in real time on a palm-sized edge device.

AI model used

The Llama-3-ELYZA-JP-8Bused in this project is a Japanese-language-specific LLMdeveloped byELYZA, an AIstartuporiginating from the University of Tokyo.
Based on Meta'sLlama 3architecture, and further trained with Japanese data, it enables natural-sounding Japanese dialogue.

The model is also available onQualcomm AI Hub*, and an export pipeline optimized forQualcomm SoCs is provided.

*Overview and registration for Qualcomm AI Hub
Qualcomm AI Hub - Semiconductor Business -Macnica

Qualcomm AI Hub Llama-v3-ELYZA-JP-8B model page
Llama-v3-ELYZA-JP-8B - Qualcomm AI Hub

item	content
Model name	Llama-3-ELYZA-JP-8B
Number of parameters	8billion (8B)
Base model	Meta Llama 3
Languages spoken	Japanese/English
quantization	w4a16(4-bitweight,16-bitactivation)
runtime	Genie SDK(Qualcomm LLMinference engine)

What is quantization?

LLM typically storeseach parameter as a 32-bitfloating-point number (FP32). For the8B model, FP32requiresapproximately32GB of memory, making operation on edge devices difficult.

Quantization is a technique that reduces model size and computational complexity by lowering the precision of parameters. In this case,we adopted"w4a16," which quantizes weights to 4 bitsand activations (intermediate calculation results) to 16 bits. As a result, the model size is compressed to approximately 5.7GB, allowing it to fit into the memory of edge devices.

testing environment

Evaluation board: Dragonwing IQ-9075 EVK (equipped with Qualcomm IQ-9075)

AIAccelerator: Hexagon HTP (100 TOPS,INT8)

SDK：Qualcomm AI Engine Direct (QAIRT SDK) + Genie SDK

HostPC: Ubuntu 22.04 (used for model export)

About IQ 9075

The IQ-9075isQualcomm 's high-end SoCforindustrial andIoT applications.It features a dual CDSPHexagonprocessor and boasts an NPUperformance of100 TOP (INT8).

It has the processing power to run an 8B-scaleLLM on-device.

Implementation in Edge AI

To run the 8B-parameterLLMtrained withPyTorchonthe NPUof an edge device, we optimized and deployed it in the following three steps.

Step 1: Export the model using Qualcomm AI Hub

Qualcomm AI Hubisa cloud service for optimizingAImodels for Qualcomm SoCs. By uploading models in PyTorchorONNXformat, it automatically performs quantization and compilation, generatinga context binary that can be executed on the device.

# AI Hub経由でエクスポート（量子化 + HTPコンパイル） python3 -m qai_hub_models.models.llama_v3_elyza_jp_8b.export \ --target-runtime genie \ --device "Dragonwing IQ-9075 EVK" \ --output-dir ./export_output

This single command will automatically perform the following processes:

Download model weights fromHuggingFace
Apply w4a16quantization.
CompiledforQCS9075HTP
Generatea Context Binary(.bin file).

The 8Bmodelwas too large to load intoHTP all at once, so we compiled it by splitting the model into eight parts. This ensures that the size of each partfits within the HTPbuffer limit and operates stably.

Step 2: Deployment to the device

Once the export is complete, transfer thegeneratedContext BinaryandGenie SDK runtime libraries to your device.

# SDKのランタイムを転送 adb push genie-t2t-run /data/qairt/bin/ adb push libGenie.so libQnnHtp.so ... /data/qairt/lib/ # モデルファイルを転送（合計約5.7GB） adb push ./export_output/*.bin /data/elyza/ adb push ./export_output/*.json /data/elyza/

The files to be transferred are the complete set of Genie SDKexecutable binaries and libraries, and the model files generated by the export (Context Binary 8parts+configuration file+tokenizer).

Step 3: Execution

Run the model usingthe genie-t2t-run command included inthe Genie SDK.

# 環境変数を設定して実行 export LD_LIBRARY_PATH=/data/qairt/lib cd /data/elyza genie-t2t-run --config genie_config.json --prompt_file prompt.txt

Upon startup, the model is loaded into HTPand responses to prompts are generated.

Operation results

This is the result of actuallyrunningLlama-3-ELYZA-JP-8Bonthe IQ-9075 EVK.

Less than a second elapsed between entering a prompt and receiving the first response, and subsequent text generation proceeded smoothly. Answers to short questions could be completed in just a few seconds, demonstrating sufficient responsiveness for interactive applications.

Summary

This articledemonstrated the operation of the 8billion-parameter JapaneseLLM"Llama-3-ELYZA-JP-8B"on the NPUusingthe Dragonwing IQ-9075 EVK, which is equipped witha Qualcomm IQ-9075.

The IQ-9075boasts100TOPSofNPU performance, and when combined with w4a16quantization,it enables real-time operation of8B-class JapaneseLLMs on edge devices. Qualcomm AI HubandGenie SDKallow for efficient development with a consistent workflow, from model optimization to on-device execution.

Cloud-independent on-deviceLLMsoffer significant value in environmentswith stringent security requirements, limited communication infrastructure, and a need to reduce API usage fees.

QualcommDragonwingThis series is our entry-level model.From high-endWe offer a wide range of products, allowing us to propose the optimal product for your specific use case.AIatLLM and other AIIf you are interested in using this service, please feel free to contact us.

Reference links

ELYZA-Japanese-Llama-3-8B — https://huggingface.co/elyza/Llama-3-ELYZA-JP-8B

Qualcomm AI Hub — https://aihub.qualcomm.com/models/llama_v3_elyza_jp_8b