Speech recognition is now used in a wide variety of situations. Voice recognition, and voice triggers in particular, is extremely useful and in high demand, but there are several barriers to adoption. This article covers the basics of speech recognition, then introduces the issues in voice trigger development and a solution that addresses them.

■Table of contents

・Basic knowledge of speech recognition

・Barriers to introducing voice triggers

・Introducing solutions that realize voice triggers

・Summary

Basic knowledge of speech recognition

What is speech recognition?

Speech recognition is a technology that allows computers to recognize the information contained in spoken voice.

In the example shown in the image on the left, a user speaks to a smart speaker, which processes what was said. In the image on the right, the same technology is used to transcribe what was said during an online meeting.

Types of speech recognition

Generally speaking, speech recognition can be broadly divided into two types.

The first is natural speech recognition, which allows the computer to understand and process the content of free-form conversation. Because this requires high processing power, it is generally run in a cloud environment.

The second is the voice trigger. When the computer recognizes a specific keyword, it executes a preset process that corresponds to that keyword. Because this requires much less processing power than natural speech recognition, it can be implemented without the cloud.

The figure below compares how natural speech recognition and a voice trigger each process the phrase "Turn up the volume."

Barriers to introducing voice triggers

From here on, we focus on voice triggers.

To reiterate, a voice trigger is a function that executes a command, setting off a series of actions, when a preset "keyword" is detected in the input voice data.
This keyword is also called a trigger word.
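As a rough illustration of that definition, the dispatch side of a voice trigger can be sketched in C as a fixed table that pairs each preset keyword with its action. This is a minimal sketch, not Ruby Spotter or any vendor's API. The keyword detector itself is out of scope, and `on_keyword_detected()`, `volume_up()`, `volume_down()` and the string keywords are hypothetical names chosen for readability. A real detector would more likely report a numeric ID.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical action type: every trigger runs a parameterless function. */
typedef void (*trigger_action)(void);

/* Simulated device state (stands in for real volume hardware). */
static int volume = 5;

static void volume_up(void)   { if (volume < 10) volume++; }
static void volume_down(void) { if (volume > 0)  volume--; }

/* One preset keyword and the action it triggers. */
typedef struct {
    const char *keyword;
    trigger_action action;
} trigger_entry;

/* Preset trigger-word table (hypothetical keywords). */
static const trigger_entry TRIGGERS[] = {
    { "Turn up the volume",   volume_up },
    { "Turn down the volume", volume_down },
};

/* Hypothetical callback: the detector reports which keyword it heard.
   Runs the matching action and returns 1, or returns 0 if the phrase
   is not a preset keyword. */
static int on_keyword_detected(const char *keyword) {
    for (size_t i = 0; i < sizeof TRIGGERS / sizeof TRIGGERS[0]; i++) {
        if (strcmp(TRIGGERS[i].keyword, keyword) == 0) {
            TRIGGERS[i].action();
            return 1;
        }
    }
    return 0;
}
```

Calling `on_keyword_detected("Turn up the volume")` raises `volume` from 5 to 6. An unlisted phrase such as "Please make it louder" returns 0 and changes nothing. That is the key contrast with natural speech recognition: nothing interprets the sentence, and only preset phrases trigger preset actions.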

Benefits and use cases

Below are the benefits and use cases of introducing voice triggers.

Advantages of voice triggers

・Operation that feels natural
・Hands-free use
・Real-time processing without the need for the cloud
・A simple user interface

Voice trigger use cases

1. When you don't want to touch buttons

2. When your hands are full

3. When mode selection is complicated

Beyond these cases, almost any function controlled by buttons can be replaced with a voice trigger.

Development procedure

This section explains the voice trigger development procedure. Development commonly proceeds along the following flow.

First, an audio library is developed, and it is then used to develop the application.

Audio library development starts with collecting audio samples. The developing company must prepare voice samples that take into account the expected users' language and accent, the surrounding environment, and so on. Next, in model development, a library of keywords is created, and in model deployment, the created model is applied to the hardware. Finally, the application is programmed using the created library and then evaluated.

Within this flow, audio library development is particularly difficult, so it is generally outsourced to a third party.

Barriers to introducing voice triggers

Based on the above, the issues in implementing voice triggers can be summarized as follows.

1. Requires specialized knowledge
2. Requires a large number of audio data samples
3. Recognition is difficult across diverse users

First, voice triggers require specialized knowledge, which makes in-house development difficult.
Because a voice trigger does not need a network connection, it can be implemented in software running on a microcontroller. However, the process of detecting trigger words in voice data generally relies on proprietary technology, and in many cases the algorithm is not published. This makes in-house development extremely difficult, so development is commonly outsourced to a third party.

Second, some voice trigger techniques require a large number of audio samples.
One approach is to train a model on voice samples of a single trigger word spoken by many different people. In that case, preparing your own audio samples takes time and money. Trigger words are often original to a product, in which case existing audio data cannot be used and samples must be prepared from scratch.

Third, recognition is difficult because users vary widely.
To a computer, the same word can sound very different depending on the speaker's age, gender, dialect, and so on. The speech recognition engine therefore has to be adjusted to the users' attributes, and recognition becomes especially difficult for products with a wide range of potential users.

Introducing solutions that realize voice triggers

By using Renesas' voice recognition solution kit together with Hitachi Solutions Technologies' software tool, you can solve these issues and implement voice triggers.

Renesas' voice recognition solution kit simplifies system development

First, we introduce the Renesas voice recognition solution kit, a voice user interface (VUI) hardware platform equipped with the RA6E1 microcontroller. With this simple voice user interface reference kit, you can develop a system easily, even without extensive coding experience or specialized knowledge.

*The RA6E1 is a low-cost, entry-line series in Renesas' RA family of microcontrollers with Arm cores.

This hardware platform, developed by Renesas, can be used with all VUI solutions provided by partners in the Renesas Ready Partner Network. The board is equipped with microphones whose placement is suitable for beamforming for noise reduction.
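Beamforming combines the signals from several microphones so that sound from the target direction is reinforced and sound from other directions is attenuated. Its simplest form, delay-and-sum, is sketched below in C for two microphones. This sketch only illustrates the general technique. It is not the noise-reduction method used by the Renesas kit or Ruby Spotter, which this article does not describe. The function name and the assumption of a known integer sample delay are hypothetical.

```c
#include <stddef.h>

/* Two-microphone delay-and-sum beamformer (illustrative sketch).
   `delay` is the assumed number of samples by which sound from the
   target direction reaches mic B later than mic A. Writes n - delay
   output samples to `out` and returns that count (0 if delay >= n). */
static size_t delay_and_sum(const float *mic_a, const float *mic_b,
                            size_t n, size_t delay, float *out) {
    if (delay >= n) return 0;
    size_t m = n - delay;
    for (size_t i = 0; i < m; i++) {
        /* Shift mic B back by `delay` so the target signal lines up,
           then average: aligned sound adds, misaligned sound partly cancels. */
        out[i] = 0.5f * (mic_a[i] + mic_b[i + delay]);
    }
    return m;
}
```

When the target signal reaches mic B one sample later than mic A, `delay` = 1 realigns the channels and the output reproduces the signal. A sound arriving with a different delay no longer adds coherently and is partly cancelled.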

About Hitachi Solutions Technologies' software tool "Ruby Spotter"

Hitachi Solutions Technologies' "Ruby Spotter" is an application programming interface (API) that implements a voice interface able to run with a small amount of memory (a ROM size of about 200KB), such as on an MCU.

Features of "Ruby Spotter"

・Operates with a small amount of memory (a ROM size of about 200KB)

・High-quality noise reduction

・The language to be recognized can be selected from over 40 languages

・Because modeling is phoneme-based, adding languages does not increase any data other than the command list

Functional specifications of Ruby Spotter

Language: Japanese; 40 other languages available as options
Voice recognition: Word recognition, wake-up word, noise reduction
Speech synthesis: -
Supported OS: Linux, Android, Windows, iOS, RTOS, non-OS (bare metal)
CPU: 60 MIPS (or 35 MIPS*1)
Code size: 40KB
Data size: 155KB + 32B × N*2
Memory capacity: 24KB + 128B × N*2

【Remarks】
*1: When using SIMD instructions
*2: N = number of commands
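The size formulas in the specifications can be turned into a quick footprint estimate. The sketch below is a worked example under stated assumptions, not a vendor tool. It takes 1KB as 1024 bytes, treats ROM as code size plus data size, and treats "memory capacity" as the RAM requirement. The function names are hypothetical.

```c
#include <stdint.h>

/* Figures from the specifications, assuming 1KB = 1024 bytes. */
#define CODE_SIZE_BYTES   (40u * 1024u)   /* code size: 40KB           */
#define DATA_BASE_BYTES   (155u * 1024u)  /* data size: 155KB + 32B x N */
#define DATA_PER_COMMAND  32u
#define RAM_BASE_BYTES    (24u * 1024u)   /* memory: 24KB + 128B x N    */
#define RAM_PER_COMMAND   128u

/* Data size for n_commands trigger commands. */
static uint32_t data_size(uint32_t n_commands) {
    return DATA_BASE_BYTES + DATA_PER_COMMAND * n_commands;
}

/* Assumption: ROM footprint = code + data (both stored in flash). */
static uint32_t rom_size(uint32_t n_commands) {
    return CODE_SIZE_BYTES + data_size(n_commands);
}

/* Assumption: "memory capacity" is the working RAM requirement. */
static uint32_t ram_size(uint32_t n_commands) {
    return RAM_BASE_BYTES + RAM_PER_COMMAND * n_commands;
}
```

For N = 10 commands, this gives a data size of 159,040 bytes, a ROM total of 200,000 bytes, and 25,856 bytes of RAM. Under these assumptions, the ROM figure lines up with the roughly 200KB ROM size quoted for Ruby Spotter. Each additional command adds only 32 bytes of data and 128 bytes of RAM.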

How to create a voice trigger program

By combining Renesas' voice recognition solution kit with Hitachi Solutions Technologies' Ruby Spotter, you can implement a voice trigger in just three steps.

Step 1. Create a trigger word in the "Ruby Spotter" GUI (just enter the text!)

This is the "Ruby Spotter" screen.
You can create a trigger word simply by entering text in the area outlined by the red frame.

When setting a trigger word, you can also tune it using actual audio.
In the area outlined by the blue frame, assign a number called a MapID to each trigger word created here.

Step 2. Write code using the MapIDs linked to the trigger words created in Step 1

Create the program in e² studio, Renesas' integrated development environment for RA microcontrollers.
(Sample programs are also available.)

In the example above, MapID 1 is assigned to both "Light on" and "Turn on the lights,"
and MapID 2 is assigned to both "Light off" and "Turn off the lights."
Because the code only needs to branch on the MapID, programming is simple, as shown in the code on the right.
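The code on the right is not reproduced in this article, but branching on MapIDs can be sketched in C as follows. This is a minimal sketch under stated assumptions. The Ruby Spotter API calls that deliver a recognized MapID are not shown here, so `handle_map_id()`, `set_light()`, and the `MAPID_*` constants are hypothetical names. The light is simulated with a variable instead of real hardware.

```c
#include <stdbool.h>

/* Hypothetical MapIDs, matching the Step 1 example. */
enum {
    MAPID_LIGHT_ON  = 1,  /* "Light on", "Turn on the lights"   */
    MAPID_LIGHT_OFF = 2,  /* "Light off", "Turn off the lights" */
};

/* Simulated light state (stands in for a real output such as a GPIO pin). */
static bool light_is_on = false;

static void set_light(bool on) { light_is_on = on; }

/* Hypothetical handler: call it with the MapID reported by the recognizer.
   Returns true if the MapID was handled, false if it was ignored. */
static bool handle_map_id(int map_id) {
    switch (map_id) {
    case MAPID_LIGHT_ON:
        set_light(true);
        return true;
    case MAPID_LIGHT_OFF:
        set_light(false);
        return true;
    default:
        return false;  /* unknown MapID: do nothing */
    }
}
```

Calling `handle_map_id(1)` turns the simulated light on, and `handle_map_id(2)` turns it off. Any other MapID is ignored. Because the action lives in `set_light()`, a button handler could call the same function, which is one way to see why button-controlled functions can be replaced with voice triggers.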

Step 3. Import the audio library created in Step 1

Simply enter the path to the audio library created in Step 1 in the specified location to import it.

With the three steps above, you can easily create a voice trigger program.

Summary

In this article, we explained the basics of speech recognition and the barriers to introducing voice triggers. We also showed how such systems can be developed easily using Renesas' voice recognition solution kit and Hitachi Solutions Technologies' Ruby Spotter.
