What is an LLM (Large-Scale Language Model)?
Large Language Models (LLMs) are AImodels thatlearn language patterns from large amounts of text data and can generate natural-sounding sentences that resemble those of humans. As a foundational technology for conversational AI, includingChatGPT, it is currently one of the most talked-about fields.
GenerallyLLMis on the cloudGPUWhile it typically runs on a server, recent advancements in model optimization technology have made it feasible to run it on edge devices.
Cloud-independent on-deviceLLMThis offers the following advantages:
|
perspective |
CloudLLM |
On-deviceLLM |
|
privacy |
The data is sent to an external server. |
Everything is contained within the device. Data never leaves the device. |
|
Offline compatible |
Internet connection is required. |
Works completely offline |
|
Response delay |
Network round trip takes hundreds ofmillisecondsto a few seconds. |
Instant response within the device |
|
Operating costs |
APIusage fees are charged on a pay-as-you-go basis. |
Only an initial investment is required. No running costs. |
In environments handling highly confidential data, such as factories and medical facilities, or in environments with unstable communication infrastructure, the value of LLM (Limited Licensing Management) that can be completed entirely on-device becomes particularly high.
Regarding this demonstration
This article presentsa demonstration of running the Japanese language learningmodule"Llama-3-ELYZA-JP-8B"on the NPU (Hexagon HTP) on the "Dragonwing IQ-9075 EVK"evaluation board, which is equipped witha Qualcomm IQ-9075. We will show you how this 8billion-parameter Japanese languagelearning modulegenerates Japanese text in real time on a palm-sized edge device.
AI model used
The Llama-3-ELYZA-JP-8Bused in this project is a Japanese-language-specific LLMdeveloped byELYZA, an AIstartuporiginating from the University of Tokyo.
Based on Meta'sLlama 3architecture, and further trained with Japanese data, it enables natural-sounding Japanese dialogue.
The model is also available onQualcomm AI Hub*, and an export pipeline optimized forQualcomm SoCs is provided.
*Overview and registration for Qualcomm AI Hub
Qualcomm AI Hub - Semiconductor Business -Macnica
Qualcomm AI Hub Llama-v3-ELYZA-JP-8B model page
Llama-v3-ELYZA-JP-8B - Qualcomm AI Hub
|
item |
content |
|
Model name |
Llama-3-ELYZA-JP-8B |
|
Number of parameters |
8billion (8B) |
|
Base model |
Meta Llama 3 |
|
Languages spoken |
Japanese/English |
|
quantization |
w4a16(4-bitweight,16-bitactivation) |
|
runtime |
Genie SDK(Qualcomm LLMinference engine) |
What is quantization?
LLM typically storeseach parameter as a 32-bitfloating-point number (FP32). For the8B model, FP32requiresapproximately32GB of memory, making operation on edge devices difficult.
Quantization is a technique that reduces model size and computational complexity by lowering the precision of parameters. In this case,we adopted"w4a16," which quantizes weights to 4 bitsand activations (intermediate calculation results) to 16 bits. As a result, the model size is compressed to approximately 5.7GB, allowing it to fit into the memory of edge devices.
testing environment
Evaluation board: Dragonwing IQ-9075 EVK (equipped with Qualcomm IQ-9075)
AIAccelerator: Hexagon HTP (100 TOPS,INT8)
SDK:Qualcomm AI Engine Direct (QAIRT SDK) + Genie SDK
HostPC: Ubuntu 22.04 (used for model export)
About IQ 9075
The IQ-9075isQualcomm 's high-end SoCforindustrial andIoT applications.It features a dual CDSPHexagonprocessor and boasts an NPUperformance of100 TOP (INT8).
It has the processing power to run an 8B-scaleLLM on-device.
Implementation in Edge AI
To run the 8B-parameterLLMtrained withPyTorchonthe NPUof an edge device, we optimized and deployed it in the following three steps.
Step 1: Export the model using Qualcomm AI Hub
Qualcomm AI Hubisa cloud service for optimizingAImodels for Qualcomm SoCs. By uploading models in PyTorchorONNXformat, it automatically performs quantization and compilation, generatinga context binary that can be executed on the device.
# AI Hub経由でエクスポート(量子化 + HTPコンパイル) python3 -m qai_hub_models.models.llama_v3_elyza_jp_8b.export \ --target-runtime genie \ --device "Dragonwing IQ-9075 EVK" \ --output-dir ./export_output
This single command will automatically perform the following processes:
- Download model weights fromHuggingFace
- Apply w4a16quantization.
- CompiledforQCS9075HTP
- Generatea Context Binary(.bin file).
The 8Bmodelwas too large to load intoHTP all at once, so we compiled it by splitting the model into eight parts. This ensures that the size of each partfits within the HTPbuffer limit and operates stably.
Step 2: Deployment to the device
Once the export is complete, transfer thegeneratedContext BinaryandGenie SDK runtime libraries to your device.
# SDKのランタイムを転送 adb push genie-t2t-run /data/qairt/bin/ adb push libGenie.so libQnnHtp.so ... /data/qairt/lib/ # モデルファイルを転送(合計約5.7GB) adb push ./export_output/*.bin /data/elyza/ adb push ./export_output/*.json /data/elyza/
The files to be transferred are the complete set of Genie SDKexecutable binaries and libraries, and the model files generated by the export (Context Binary 8parts+configuration file+tokenizer).
Step 3: Execution
Run the model usingthe genie-t2t-run command included inthe Genie SDK.
# 環境変数を設定して実行 export LD_LIBRARY_PATH=/data/qairt/lib cd /data/elyza genie-t2t-run --config genie_config.json --prompt_file prompt.txt
Upon startup, the model is loaded into HTPand responses to prompts are generated.
Operation results
This is the result of actuallyrunningLlama-3-ELYZA-JP-8Bonthe IQ-9075 EVK.
Less than a second elapsed between entering a prompt and receiving the first response, and subsequent text generation proceeded smoothly. Answers to short questions could be completed in just a few seconds, demonstrating sufficient responsiveness for interactive applications.
Summary
This articledemonstrated the operation of the 8billion-parameter JapaneseLLM"Llama-3-ELYZA-JP-8B"on the NPUusingthe Dragonwing IQ-9075 EVK, which is equipped witha Qualcomm IQ-9075.
The IQ-9075boasts100TOPSofNPU performance, and when combined with w4a16quantization,it enables real-time operation of8B-class JapaneseLLMs on edge devices. Qualcomm AI HubandGenie SDKallow for efficient development with a consistent workflow, from model optimization to on-device execution.
Cloud-independent on-deviceLLMsoffer significant value in environmentswith stringent security requirements, limited communication infrastructure, and a need to reduce API usage fees.
QualcommDragonwingThis series is our entry-level model.From high-endWe offer a wide range of products, allowing us to propose the optimal product for your specific use case.AIatLLM and other AIIf you are interested in using this service, please feel free to contact us.
Reference links
ELYZA-Japanese-Llama-3-8B — https://huggingface.co/elyza/Llama-3-ELYZA-JP-8B
Qualcomm AI Hub — https://aihub.qualcomm.com/models/llama_v3_elyza_jp_8b
Inquiry
If you have any questions about the contents of this page or would like detailed product information, please contact us here.
