Development of Edge AI Devices: Implementing AI in Embedded NPUs
Recently, the advent of AI has made many kinds of processing easier to implement. For example, security cameras are now used to detect suspicious people, and today this is typically done with centralized systems built on GPU servers. In a centralized system, all raw data from the edge devices must be collected in one central location, and because of network bandwidth/quality constraints and system costs, demand is growing for edge AI, in which AI processing runs inside the embedded device at each end (edge) of the system.
On the other hand, unlike large GPUs, edge devices are limited in processing power and power consumption, so the processing must make the most of the limited resources available.
In this article, we will use Synaptics' embedded processor as an example to explain the key points for implementing AI on edge devices.
Difference Between GPU and NPU
In centralized systems, AI processing is performed on GPU servers, but in edge devices, the processing is often done by a processor called an NPU.
GPU stands for Graphics Processing Unit and is mainly used for image processing. NPU, on the other hand, stands for Neural Processing Unit and is hardware specialized for AI processing. Compared to CPUs and DSPs, both are specialized for executing large numbers of simple operations quickly in parallel.
The main performance indicator for GPUs is FLOPS (Floating-point Operations Per Second), while for NPUs it is TOPS (Tera Operations Per Second).
Both indicate how many calculations can be performed per second, but the data types involved differ: FLOPS counts operations on floating-point types such as float32 and float64 (double), while TOPS counts operations on integer types such as INT8 and INT4.
Supplement
Even though they are both called "GPUs," the IP used in embedded SoCs differs from the IP used in PCs/servers, and care must be taken from a software perspective. For high-performance GPUs such as those in AI servers, NVIDIA's CUDA is generally used as the middleware for parallel computing. For embedded GPUs, on the other hand, OpenCL and OpenVX are used.
Most AI software is designed to run on PCs or servers, and often does not support embedded GPUs/NPUs.
For example, in the case of PyTorch, torch.device() accepts device types such as CPU and CUDA, but it does not support the Imagination GPUs or VeriSilicon NPUs that are common in embedded systems. Therefore, if AI software written for PCs/servers is used as-is on an embedded system, the calculations fall back to the CPU and the performance of the SoC is not fully utilized.
・Reference: PyTorch torch.device
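The fallback behavior described above can be sketched as a small device-selection helper. This is an illustrative sketch, not code from any vendor SDK; the function name `pick_device` is our own.

```python
# Sketch of the fallback described above: if no backend that
# torch.device() supports (e.g. CUDA) is present, processing silently
# falls back to the CPU -- the embedded NPU/GPU is never used.
def pick_device() -> str:
    """Return the best available PyTorch device name, falling back to CPU."""
    try:
        import torch  # may be absent on a minimal embedded image
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    # No CUDA backend (and stock PyTorch has no VeriSilicon NPU backend),
    # so the model will run on the CPU.
    return "cpu"

print(pick_device())
```

On a typical embedded Linux image without CUDA, this prints `cpu`, which is exactly the silent performance trap the text warns about.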
Implementing AI on an Embedded NPU
As explained in the previous chapter, most AI software is designed to run on PC/server GPUs and therefore expects an environment with floating-point instruction sets. Embedded NPUs, however, mainly provide integer arithmetic instructions, so quantization from float to integer types is required before AI processing can run on them. In general, converting from float32 to int8/int16 discards information, which reduces the accuracy of the AI computation. On the other hand, quantization also shrinks the data volume, so it is equally valuable for fitting models onto embedded systems with limited RAM. Tools that perform this conversion are mainly provided by each SoC manufacturer.
* For SoCs whose vendors do not provide such tools, users can either perform quantization themselves using OSS tools or fall back to CPU processing.
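To make the float-to-integer conversion concrete, here is a minimal sketch of affine (asymmetric) INT8 quantization, the basic scheme such tools apply to float32 weights. The helper names and values are illustrative, not taken from any specific vendor tool.

```python
# Minimal sketch of affine INT8 quantization: map a float range onto
# the 256 representable int8 values via a scale and a zero-point.

def quantize_int8(values):
    """Quantize a list of floats to int8 with a scale and zero-point."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255.0 or 1.0  # guard against a constant tensor
    zero_point = round(-lo / scale) - 128
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float values from the quantized ints."""
    return [(v - zero_point) * scale for v in q]

weights = [-1.0, -0.5, 0.0, 0.75, 1.5]
q, s, zp = quantize_int8(weights)
restored = dequantize(q, s, zp)
# Rounding limits the error to about half a quantization step (scale/2),
# which is the accuracy loss the text refers to.
print(max(abs(a - b) for a, b in zip(weights, restored)))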
Quantization Tool Examples
For example, a tool called SyNAP is available for Synaptics' ASTRA SL series SoCs, and model quantization can be performed with a single shell command.
$ synap convert --target ${CHIP_MODEL} --model example.${MODEL_FORMAT}
# $MODEL_FORMAT supports formats such as onnx, torchscript, tflite, and prototxt
*SyNAP also supports various quantization settings (INT8, INT16, mixed quantization, etc.).
Detailed manuals can be found on Synaptics AI developer (a portal site for AI developers).
Edge AI use cases
Here we will introduce some typical examples of edge AI usage, some of which have already been commercialized.
Image processing
・ISP replacement
Audio processing
・Language processing (SLM)
・Multimodal AI
Image processing
Image processing is the best-known use of edge AI. Previously, image processing such as human detection and Face ID required very heavy algorithms on DSPs or large-scale FPGAs, but with the emergence of open-source AI models such as YOLO, it can now be implemented easily even on devices with limited hardware resources.
YOLO can be tried out easily from the shell by using the GStreamer plugin included in the Synaptics SyNAP tooling mentioned above.
Example: object detection on a /dev/videoX source
$ gst-launch-1.0 v4l2src device=/dev/videoX ! \
    video/x-raw,framerate=30/1,format=YUY2,width=640,height=480 ! videoconvert ! \
    tee name=t_data \
  t_data. ! queue ! \
    synapoverlay name=overlay label=/usr/share/synap/models/object_detection/coco/info.json ! \
    videoconvert ! waylandsink \
  t_data. ! queue ! videoconvert ! videoscale ! \
    video/x-raw,width=640,height=384,format=RGB ! \
    synapinfer model=/usr/share/synap/models/object_detection/coco/model/yolov8s-640x384/model.synap \
    mode=detector frameinterval=3 ! overlay.inference_sink
https://synaptics-astra.github.io/doc/v/1.5.0/linux/index.html#gstreamer-synap-plugin
・Reference: Synaptics AI developer -YOLO
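A YOLO-style detector emits many overlapping candidate boxes, and a post-processing step on the CPU, non-maximum suppression (NMS), keeps only the highest-scoring box for each object. The sketch below is a generic illustration of that step, not the actual SyNAP/GStreamer implementation.

```python
# Generic post-processing sketch for a YOLO-style detector: compute
# box overlap (IoU) and suppress lower-scoring duplicates (NMS).

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, threshold=0.5):
    """Return indices of boxes kept after non-maximum suppression."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the two heavily overlapping boxes collapse to one
```

On an edge device this loop runs on the CPU after the NPU finishes inference, so keeping it lightweight matters.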
ISP replacement
When using a CMOS image sensor, an ISP (Image Signal Processor) is a required piece of hardware. It receives raw data from the sensor, performs adjustments such as lens distortion correction and white balance, along with data conversion and other processing, and outputs the data in common formats such as YUV and RGB. Some manufacturers also offer high-performance ISPs with functions such as HDR and autofocus. In recent years, AI processing has begun to take over parts of the ISP's work, and this may well become common in the future.
Audio processing
Conventionally, the ANC (Active Noise Cancellation) and ENC (Environmental Noise Cancellation) used in wireless microphones, earphones, headphones, and similar devices were implemented by adding a separate error microphone and feeding the noise signal it picks up back into the signal in reverse phase. The advent of AI, however, has made it possible to separate out human voice information, and even to isolate the voice of a specific person.
There are already products on the market that claim to feature AI noise cancellation, but it may be possible to implement noise cancellation functionality in even lower-priced products.
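The classical reverse-phase approach described above can be shown in a few lines. This is a sketch of the conventional (non-AI) method only: the AI-based approach instead learns to separate voice from noise, which simple phase inversion cannot do when no separate error microphone is available.

```python
# Sketch of classical noise cancellation: the noise captured by the
# error microphone is added back in reverse phase, cancelling it out.
import math

def tone(freq, n=100, rate=8000.0):
    """Generate n samples of a sine wave at the given frequency."""
    return [math.sin(2 * math.pi * freq * t / rate) for t in range(n)]

voice = tone(440)  # desired signal
noise = tone(60)   # hum picked up by the error microphone
mixed = [v + n for v, n in zip(voice, noise)]

# Adding the error-mic signal in reverse phase removes the noise term.
cleaned = [m - n for m, n in zip(mixed, noise)]

residual = max(abs(c - v) for c, v in zip(cleaned, voice))
print(residual)  # ~0: the noise component cancels completely
```

The catch is that this only works when the noise itself is captured separately; an AI model can instead estimate and remove the noise from the mixed signal alone.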
Language Processing (SLM)
Most people probably have the impression that conversational AI is always processed on a server.
AI that requires heavy processing, such as large language models (LLMs), calls for ultra-high-performance computing such as GPU clusters, but small language models (SLMs) have emerged that can run in local environments such as embedded systems.
Since conversational AI can now run locally on embedded devices, products such as smart speakers can be developed simply by tuning the dataset.
For an implementation example, please refer to the On-Device AI Voice Assistant at the link below.
・Synaptics AI developer -Llama on Astra
・Synaptics AI developer On-Device AI Voice Assistant
In addition, the source code for the On-Device Assistant above is available in a public Git repository.
Multimodal AI
Multimodal AI is an advanced form of AI that processes multiple kinds of data rather than a single kind. Currently, the common approach is to load multiple models simultaneously, one per input, and process each input with its own model. In the future, however, AI that feeds multiple kinds of data into a single model may be developed. For example, by using audio and image data together, a surveillance camera might judge the mood of its surroundings, something previously undetectable, and this could open up entirely new use cases.
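The current "one model per input" approach described above can be sketched as a simple late-fusion step: each modality's model produces class scores, and the scores are combined into one decision. The score values here are hard-coded placeholders, not real inference results.

```python
# Sketch of late fusion for the "multiple models, one input each" style:
# separate image and audio models each emit class scores, and a weighted
# average fuses them into a single decision.

def late_fusion(score_maps, weights=None):
    """Average per-modality class scores into one fused score map."""
    if weights is None:
        weights = [1.0 / len(score_maps)] * len(score_maps)
    classes = score_maps[0].keys()
    return {c: sum(w * m[c] for w, m in zip(weights, score_maps)) for c in classes}

image_scores = {"normal": 0.7, "suspicious": 0.3}  # placeholder vision output
audio_scores = {"normal": 0.2, "suspicious": 0.8}  # placeholder audio output

fused = late_fusion([image_scores, audio_scores])
print(max(fused, key=fused.get))  # -> suspicious
```

Note how the audio evidence flips a decision the vision model alone would have missed, which is exactly the surveillance-camera benefit described above.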
Synaptics ASTRA series introduction
Finally, we will introduce the Synaptics SoC/MCU series mentioned in parts of this article.
The lineup includes products equipped with NPUs, and with the SyNAP AI software development tool and comprehensive manuals available, it is well suited as hardware for implementing AI.
ASTRA SL Series
The SL series is a family of SoCs built around ARM Cortex-A cores that run embedded Linux and similar operating systems. The SL1680 and SL1640 are equipped with a VeriSilicon (VSI) NPU.
The SL1680 uses a device driver developed by Synaptics that is more hardware-optimized than the driver provided by VSI: while other companies' SoCs using the same IP achieve around 2 TOPS, the SL1680 achieves over 7.9 TOPS. The SL1680 also offers direct HDMI (TMDS) signal input/output, MIPI DSI/CSI-2 ports, and a wide range of audio input/output peripherals.
ASTRA SL Series Lineup
| Usage | Part number | Details |
| --- | --- | --- |
| AI MPU | SL1680 | ARM Cortex-A73 x4 (2.1GHz), up to 7.9+ TOPS NPU, 3D/2D GPU (Imagination GE9920) |
| AI MPU | SL1640 | ARM Cortex-A55 x4 (1.9GHz), up to 1.6+ TOPS NPU, 3D/2D GPU (Imagination GE9608), Audio DSP (Cadence® Tensilica® Dual HiFi 4) |
| General-purpose MPU | SL1620 | ARM Cortex-A55 x4 (1.9GHz), 3D/2D GPU (Imagination BXE-2-32) |
ASTRA SR Series
The SR series is a line of microcontrollers with an ARM Cortex-M main core and an ARM Ethos-U NPU. Development is done mainly with an RTOS or bare-metal software.
Its strength is low power consumption: roughly 100mW with the main core and main NPU fully active, about 10mW when running on the ARM Cortex-M4 and Synaptics µNPU in the low-power domain, and about 1mW when running the hardware accelerator in the lower always-on domain. It also has a MIPI pass-through function, making it ideal for use cases such as sitting between an existing CMOS camera sensor and a host processor as an image pre-processing microcontroller, or acting as a co-processor that wakes the host processor on motion detection.
This series was officially released in March 2025, and information such as manuals and guides on incorporating AI models into an RTOS will be added to the Synaptics website over time.
ASTRA SR Series Lineup
| Usage | Part number | Details |
| --- | --- | --- |
| AI MCU | SR110 | ARM Cortex-M55+U55 (400MHz) + M4/µNPU (100MHz), up to 100 GOPS NPU (U55), tiny package (WLCSP84, 5.2x2.7mm) |
| AI MCU | SR105 | ARM Cortex-M55+U55 (400MHz), up to 100 GOPS NPU (U55) |
| General-purpose MCU | SR102 | ARM Cortex-M55 (400MHz) |
・ SR100 Series High-Performance AI MCUs Family Product Brief
・ Synaptics ASTRA SR Series
In closing
Edge AI devices are in high demand these days, so we have introduced key points for implementing AI in embedded NPUs, along with some use cases. What did you think?
Synaptics' ASTRA series is a product that comes with a full range of evaluation kits, AI software development tools, and manuals to promote the introduction of edge AI development.
We hope this information will be helpful in your future development.