
*This article is based on information as of August 2025.

Introduction

We installed the Hailo-10H, an accelerator that enables generative AI on edge devices, and built an embedded device that can chat using Qwen.

Equipment used

Hailo AI Accelerator Hailo-10H

Contents of the Hailo-10H Starter Kit

Hailo has released the Hailo-10H, a new AI accelerator. Unlike the Hailo-8, which was primarily designed for CNN models, the Hailo-10H supports larger models thanks to its onboard DRAM and adds the functionality required for Transformer models, making it compatible with generative AI models. However, it is expected to support models only up to around 10B parameters.

This time we are using the 8GB memory version of Hailo-10H.

 
At this time, users cannot freely add their own models; only models prepared by Hailo can be run (fine-tuning with LoRA is possible). The supported models are listed below.

https://hailo.ai/products/hailo-software/model-explorer/generative-ai/

 

The photo on the right shows the contents of the Hailo-10H Starter Kit.

ADLINK Industrial Fanless Embedded PC MXE-310

This time, we used the ADLINK MXE-310 box PC as the host PC in which to install the Hailo-10H. It is a fanless industrial PC equipped with an Intel® Core™ series processor, and it is compact (180 × 130 × 70 mm). The CPU and memory are configurable; this time we chose the following specifications.

CPU: Intel Core i3-1315UE
Memory: 32 GB

 

The photo on the right is an external view of the MXE-310 that was actually used.

Hailo-10H installation

When you open the back of the MXE-310, you'll find an M.2 connector for expansion, which is where the Hailo-10H is inserted. The photo on the right shows the Hailo-10H inserted.

The back of the Hailo-10H is connected to a heat spreader, allowing heat to escape into the chassis.

Hailo-10H Setup

This time, we will build the Hailo-10H environment on Ubuntu 22.04 running on the MXE-310.

Hailo provides an API that is compatible with the Ollama API format, and we will use it by following the repository linked below.

https://github.com/hailo-ai/hailo_model_zoo_genai

 

I won't go into details here, but I'll briefly explain the process:

①Install HailoRT

Follow the HailoRT User Guide to install the necessary packages, then install the deb packages. Download the User Guide and the deb packages from Hailo's Developer Zone. The link to the Developer Zone is below; you will need a Hailo account, so please register first.

https://hailo.ai/developer-zone/

A reboot is required after installing HailoRT.

Commands used:

$ sudo apt update
$ sudo apt install build-essential
$ sudo apt install bison
$ sudo apt install flex
$ sudo apt install libelf-dev
$ sudo apt install dkms
$ sudo apt install curl
$ sudo apt install cmake
$ sudo apt install pip
$ sudo apt install virtualenv
$ sudo apt install systemd

$ sudo dpkg --install hailort_5.0.0_amd64.deb hailort-pcie-driver_5.0.0_all.deb
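
As noted above, a reboot is required after installing HailoRT:

$ sudo reboot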

 

②Install Hailo Model Zoo GenAI

A deb package has been prepared, so install it (the deb package can be downloaded from the Developer Zone).

Commands used:

$ sudo dpkg -i hailo_gen_ai_model_zoo_5.0.0_amd64.deb

 

The installation is now complete.

Finally, run the hailortcli scan command as shown below; if the Hailo-10H device is recognized, there should be no problem.
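
$ hailortcli scan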

Checking the operation of the ollama API

First, start Hailo's ollama.

$ hailo-ollama

When you execute the above command, hailo-ollama will start up and wait for requests.

From now on, access hailo-ollama using the curl command in another terminal window.

- Displaying the compatible models

$ curl --silent http://localhost:8000/hailo/v1/list

*Compatible models as of August 2025: {"models":["deepseek_r1_distill_qwen:1.5b","qwen2.5-coder:1.5b","qwen2.5:1.5b","qwen2:1.5b"]}

*As of August 2025, hailo-ollama supports only LLM models, not VLM models.

- Downloading the model

$ curl --silent http://localhost:8000/api/pull -H 'Content-Type: application/json' -d '{ "model": "qwen2:1.5b", "stream" : true }'

 

From here on, we will use qwen2:1.5b from the compatible models.

(I felt that this model had the best Japanese support)

 

I'll actually ask a question.

$ curl --silent http://localhost:8000/api/generate -H 'Content-Type: application/json' -d '{"model": "qwen2:1.5b", "prompt": "Why is the sky blue?", "stream":false}'

$ curl --silent http://localhost:8000/api/chat -H 'Content-Type: application/json' -d '{"model": "qwen2:1.5b", "messages": [{"role": "user", "content": "Why is the sky blue?"}]}'

Each response will be in the following format:
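
For reference, the generated text is returned in the "response" field of the JSON reply (this is the field the Python code later in this article reads). A minimal sketch of the shape for /api/generate, assuming the standard Ollama response format (the exact fields returned by hailo-ollama may differ):

{"model": "qwen2:1.5b", "response": "The sky appears blue because ...", "done": true}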

We were able to confirm that we could actually interact with the Qwen model using the curl command.

The responses were about on par with asking an LLM in the cloud, so I think it's quite impressive that this could be achieved on an embedded device with limited computing resources. However, response speed still needs improvement for offline, on-device processing, so some further tuning is needed in that area.

 

Next, since interacting via the curl command is difficult for ordinary users, I would like to create a chat box in Python that anyone can use.

Creating a chat box and checking its operation

I've created a system that lets you chat with the generative AI in your browser, so I'd like to introduce it.

The actual Python and HTML code is included below, but please note that these are merely samples and are not guaranteed to work.

 

First, the Python code:

The file name is app.py.

from flask import Flask, render_template, request, jsonify
import requests
import json

app = Flask(__name__)

# Route that displays the main page
@app.route('/')
def index():
    return render_template('index.html')

# Route that handles chat messages
@app.route('/chat', methods=['POST'])
def chat():
    user_message = request.form['message']

    # Send the request to the Ollama API
    response = requests.post(
        "http://localhost:8000/api/generate",
        json={
            "model": "qwen2:1.5b",
            "prompt": user_message,
            "stream": False
        },
        stream=True
    )

    generated_text = ""
    # Process the streaming response
    for line in response.iter_lines():
        if line:
            line_str = line.decode('utf-8')
            data = json.loads(line_str)
            if "response" in data:
                generated_text += data["response"]

    return jsonify({'response': generated_text})

if __name__ == '__main__':
    app.run(debug=True)

Next is the HTML code.

The file name is index.html.


<!DOCTYPE html>
<html lang="ja">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>チャットアプリ</title>
    <script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
    <style>
        body { font-family: Arial, sans-serif; }
        #chat { border: 1px solid #ccc; height: 300px; overflow-y: scroll; padding: 10px; }
        #user-input { margin-top: 10px; }
        #message { width: 100%; height: 50px; }  /* Set the width to 100% and the height to 50px */
        button { height: 50px; }  /* Adjust the button height */
    </style>
</head>
<body>
    <h1>Hailo-10H チャット</h1>
    <div id="chat"></div>
    <div id="user-input">
        <textarea id="message" placeholder="メッセージを入力"></textarea>  <!-- Changed from input to textarea -->
        <button id="send">送信</button>
    </div>
    <script>
        $('#send').click(function() {
            let message = $('#message').val();
            $('#chat').append('<div>あなた: ' + message + '</div>');
            $('#message').val('');
            $.post('/chat', { message: message }, function(data) {
                $('#chat').append('<div>Hailo-10H: ' + data.response + '</div>');
                $('#chat').scrollTop($('#chat')[0].scrollHeight);
            });
        });
        $('#message').keypress(function(e) {
            if (e.which == 13 && !e.shiftKey) {  // Shift + Enter inserts a newline instead
                $('#send').click();
                e.preventDefault();  // Prevent the Enter key from submitting the form
            }
        });
    </script>
</body>
</html>

Create a "templates" folder in the same folder as app.py and save index.html in it.

The folder structure is as follows:
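
app.py
templates/
└── index.html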

Run app.py with hailo-ollama running.

$ python3 app.py

Once it is running as shown above, access "http://127.0.0.1:5000" in your browser.

This will bring up the following chat box:

Try posting your question in the message box.

This time, we asked the question in Japanese: "Please tell me how to deal with the summer heat."

The answer was as above.

The Japanese itself and the content of the answer are somewhat lacking, and it seems the model needs to be fine-tuned for Japanese. Also, the answer is displayed all at once only after it has been fully generated, so it takes a fair amount of time before anything appears. The way the application is structured may also need some improvement.

 

However, the fact that an LLM can be run entirely on an embedded device suggests real potential for the future.

Summary

This time, we built a chat system in an embedded environment consisting of the MXE-310 and Hailo-10H. If you try it out, you'll find that the response is a little slow and the accuracy of the Japanese leaves something to be desired.

However, the model is evolving every day, and the techniques for using the Hailo-10H will also improve, so we can expect improvements in the future.

If you would like to try it out for yourself, please contact us using the inquiry form below.

We also have an article where we tried VLM (Qwen-VL) using the same equipment, so please take a look at that as well.
[Trial] Running a local VLM (Qwen-VL) on an embedded device with Hailo-10H

Inquiry

If you have any questions about this article, please contact us using the form below.
