Smart speakers that are operated by voice commands have become a hot topic. These devices can be operated simply by talking to them, but do you know how machines recognize speech?

This article covers the structure of a speech recognition system, how to deal with misrecognition on the input side, speech recognition technology that uses third-party IP, and speech recognition solutions recommended for those who want to reduce development time and cost.

Outline of speech recognition system

A speech recognition system mainly operates through the following processes.

General configuration diagram

1. Audio input and signal processing

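A microphone array captures the sound, and signal processing such as noise cancellation is applied to generate high-quality audio data. This improves the accuracy of the speech recognition engine.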

2. Initial speech recognition processing (voice trigger, local command)

The speech recognition engine in an embedded device is always running. It monitors the incoming audio data and triggers a series of actions when the data contains a "keyword". Registering multiple keywords makes it possible to execute different commands.
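The keyword-to-command behavior described above can be pictured as a simple lookup table. The following is a minimal sketch only: a real engine matches acoustic patterns rather than text, and the keywords, function names, and actions here are all hypothetical.

```python
from typing import Callable, Dict, Optional


def wake_assistant() -> str:
    # Hypothetical action for a wake word
    return "assistant awake"


def lights_on() -> str:
    # Hypothetical action for a local command
    return "lights on"


# Hypothetical keyword table: each trigger phrase maps to one local action.
COMMANDS: Dict[str, Callable[[], str]] = {
    "hello device": wake_assistant,
    "lights on": lights_on,
}


def handle_transcript(text: str) -> Optional[str]:
    """Run the action of the first registered keyword found in the text."""
    normalized = text.lower()
    for keyword, action in COMMANDS.items():
        if keyword in normalized:
            return action()
    return None  # no keyword found: the engine simply keeps listening
```

Adding a new local command then only requires one more entry in the table.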

3. Cloud based recognition processing

Running natural-dialogue recognition in the cloud over a network connection makes it possible to interpret more complex context and to invoke cloud-based services.

Reasons and countermeasures for misrecognition on the input side

Because the route from voice input to recognition processing in the cloud is long, misrecognition has many possible causes. Here we focus on the "audio input and signal processing" and "initial speech recognition processing" parts of the general configuration diagram shown earlier. There are three main reasons for misrecognition in these parts.

  • The speech recognition engine is not accurate enough
  • The device is used in a noisy environment
  • The speaker is too far from, or too close to, the device


From these causes, the two points to address technically are receiving the speech clearly at the input and the performance of the speech recognition engine itself.

If misrecognition occurs, first try simply changing where the device is installed or the relative position of the device and the speaker. If that does not solve the problem, the techniques and equipment listed below can help.

Microphone array

A microphone array is an arrangement of multiple microphones.
A single microphone cannot acquire spatial sound information, but by installing multiple microphones, it is possible to determine the directivity of the sound and where the sound is coming from.
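As a rough illustration of how two microphones reveal where a sound comes from, the arrival-time difference between them can be converted into an angle. This is a minimal sketch assuming a distant source (far-field) and a known microphone spacing; the function name and the spacing values are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at about 20 degrees C


def arrival_angle_deg(delay_s: float, mic_spacing_m: float) -> float:
    """Angle of arrival for a two-microphone pair, measured from broadside.

    delay_s is the arrival-time difference between the two microphones.
    Far-field model: sin(angle) = speed_of_sound * delay / spacing.
    """
    ratio = SPEED_OF_SOUND * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp against measurement noise
    return math.degrees(math.asin(ratio))
```

A delay of zero means the sound arrives from straight ahead of the pair, and larger delays correspond to sources further off to one side.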

PCM1864-Based Circular Microphone Board (CMB) Reference Design Board Image

Beamforming

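Beamforming is a technology that focuses a signal in a specific direction. It is used not only in audio equipment but also in radio applications such as smartphones and wireless LAN, where radio waves are concentrated and emitted toward a target. With a microphone array, the same idea is applied to reception: the array is made more sensitive to sound arriving from one direction.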

In the figure below, the gray areas, which are noise, are removed from the overall audio, and only the audio from the required direction (red frame) is extracted.

beamforming image
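One common way to achieve this kind of directional pickup is delay-and-sum beamforming. The sketch below is an illustration of that general technique, not the algorithm used in any particular product; it assumes the per-microphone delays for the target direction are already known and are whole numbers of samples.

```python
from typing import List, Sequence


def delay_and_sum(channels: Sequence[Sequence[float]],
                  delays: Sequence[int]) -> List[float]:
    """Align each channel by its integer sample delay, then average.

    delays[i] is how many samples later the target sound reaches
    microphone i than the earliest microphone (assumed known here).
    """
    if len(channels) != len(delays):
        raise ValueError("one delay per channel is required")
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    output = []
    for n in range(length):
        # Sound from the target direction lines up and adds coherently;
        # sound from other directions stays misaligned and averages down.
        total = sum(ch[n + d] for ch, d in zip(channels, delays))
        output.append(total / len(channels))
    return output
```

When the delays match the true direction of the talker, a pulse that reaches the microphones at different times comes out at full amplitude; with the wrong delays the same pulse is smeared and reduced.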

Noise reduction

noise image

Noise reduction is a technology that suppresses the noise contained in signals such as audio and video, making the signals easier to recognize.

As an easy-to-picture example, it can be a filter that passes sounds with a high input level and attenuates sounds with a low input level as noise.
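The level-based filter described above is often called a noise gate. The following is a minimal sketch of that idea, assuming the audio is a list of floating-point samples; the threshold, attenuation, and frame size are hypothetical values, and practical noise reduction (such as spectral methods) is considerably more elaborate.

```python
from typing import List, Sequence


def noise_gate(samples: Sequence[float], threshold: float,
               attenuation: float = 0.1, frame_size: int = 4) -> List[float]:
    """Pass frames whose RMS level reaches the threshold, attenuate the rest."""
    output: List[float] = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        rms = (sum(s * s for s in frame) / len(frame)) ** 0.5
        # Loud frames are treated as speech, quiet frames as noise
        gain = 1.0 if rms >= threshold else attenuation
        output.extend(s * gain for s in frame)
    return output
```

Working frame by frame rather than sample by sample keeps the quiet zero-crossings inside loud speech from being attenuated.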

Echo cancellation

Echo generation image

If the device also has an audio output device such as a speaker, the microphone will also pick up the sound from the speaker, causing echo and howling.

Echo cancellation is a technology that suppresses this effect. It is used in telephones and teleconferencing systems.
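Echo cancellers are commonly built around an adaptive filter that learns how the loudspeaker signal appears at the microphone and subtracts that estimate. The sketch below uses the normalized LMS (NLMS) algorithm as one well-known example; it is not taken from any specific product, and the tap count and step size are hypothetical.

```python
from typing import List, Sequence


def nlms_echo_cancel(mic: Sequence[float], speaker: Sequence[float],
                     taps: int = 4, step: float = 0.5,
                     eps: float = 1e-8) -> List[float]:
    """Subtract an adaptively estimated echo of `speaker` from `mic`."""
    weights = [0.0] * taps   # current estimate of the echo path
    history = [0.0] * taps   # recent speaker samples, newest first
    residual: List[float] = []
    for m, s in zip(mic, speaker):
        history = [s] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = m - echo_estimate  # what remains after removing the echo
        power = sum(x * x for x in history) + eps
        # NLMS update: move the weights toward values that reduce the error
        weights = [w + step * error * x / power
                   for w, x in zip(weights, history)]
        residual.append(error)
    return residual
```

The returned residual is the microphone signal with the estimated echo removed. In a real device it would still contain the talker's voice, which is the part passed on to recognition.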

Three speech recognition solutions provided by Texas Instruments

Texas Instruments (hereafter, TI) provides solutions that combine voice triggers with hardware capable of running the misrecognition-reducing algorithms introduced earlier.

In addition, some solutions include an embedded speech recognition engine from the third party "Sensory", which has already been proven in smartphones and AI speakers.

We will introduce the features of each solution and examples of applications where it can be used.

Features common to each solution

  • Multiple microphones (up to 4ch) connected to PCM1864 (Audio ADC)
  • 4ch microphone and 8ch circle (circular) microphone designs are also provided

→ Improved accuracy of audio input and signal processing

  • Transfer raw audio data from Audio ADC to DSP (C5000, C6000 series)
  • DSP performs echo cancellation, beamforming, ASNR (Adaptive Spectral Noise Reduction per microphone), etc.
  • Pass optimized audio data to a recognition engine running on a DSP or external processor

→ Improved accuracy of initial speech recognition processing

Basic configuration example

1. Voice trigger using beamforming with microphone array

This is a 3rd party speech recognition solution that leverages TI's reference design.

It supports keyword recognition with the TrulyHandsfree™ speech recognition engine and beamforming with multiple microphones. It also supports multi-source selection, which picks a high-quality signal. Together these give better performance in noisy environments and a wider pickup range.

C5517 board image

Features

  • Extract clear speech from noisy environments using a single digital signal processor (DSP) and an array of microphones
  • Remove background noise from audio sources
  • Sends clear speech to the speech recognition engine for better speech recognition

Application example

 

  • Cloud interface-based speech recognition for voice-activated digital assistant applications
  • Cloud interface-based speech recognition for smart home applications
  • Local speech recognition for voice-based home appliance control
  • Voice and speech applications


Click here for details of this reference design
TIDEP-0077 Audio Pre-processing System Reference Design for Voice-based Applications Using C5517

What is TrulyHandsfree?

TrulyHandsfree

TrulyHandsfree is an embedded speech recognition engine developed by Sensory in the United States.
When installed in embedded devices such as mobile and wearable devices, it enables voice control of the device through command triggers.

In addition, it can be activated by speech recognition while the processor is in low power consumption mode, and has features specialized for embedded devices, such as a memory-saving design, recognition in noisy environments, and fast response.

2. Voice trigger and processing reference design with cloud connectivity to IBM Watson®

This signal processing-based reference design uses multiple microphone inputs to a digital signal processor (DSP) to acquire high-quality speech signals, recognize trigger words, record voice commands, and send them to the IBM Watson® cloud-based service. Watson returns a transcription of the locally recorded speech (voice commands).

Left: Using C5545 Booster Pack / Right: Using C5517 Evaluation Module

Features

  • Multiple microphone support (2 or 4)
  • Audio pre-processing features
  • Connection to IBM Watson Cloud
  • Sensory™ TrulyHandsfree™ keyword recognition
  • Automatic audio recording after the keyword trigger
  • Speech-to-text conversion
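The trigger, record, and transcribe sequence listed above can be sketched as a small state machine. This is an illustration under stated assumptions only: `is_keyword` and `transcribe` are hypothetical callables standing in for the keyword engine and the cloud service, and the fixed recording length is an invented simplification.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

Frame = TypeVar("Frame")


def run_voice_trigger(frames: Sequence[Frame],
                      is_keyword: Callable[[Frame], bool],
                      transcribe: Callable[[List[Frame]], str],
                      record_frames: int = 3) -> List[str]:
    """Idle until a keyword frame, record the next frames, then transcribe."""
    results: List[str] = []
    recording: Optional[List[Frame]] = None  # None while waiting for a trigger
    for frame in frames:
        if recording is None:
            if is_keyword(frame):
                recording = []  # trigger detected: start capturing the command
        else:
            recording.append(frame)
            if len(recording) == record_frames:
                # Hand the recorded command to the (hypothetical) cloud service
                results.append(transcribe(recording))
                recording = None  # go back to waiting for the next trigger
    return results
```

Keeping keyword detection local and sending only the recorded command to the cloud means the network is used only after a trigger.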

Application example

  • Voice-activated digital assistant products
  • Voice-activated building automation
  • Video doorbells
  • Consumer audio products


Click here for details of this reference design
Voice Trigger and Voice Processing Reference Design with Cloud Connectivity to IBM Watson

3. Reference design for extracting clear speech and audio from noise and other clutter

Recently, there has been a demand for systems that extract clear speech from noisy environments.

This design combines multiple microphones, a beamforming algorithm, and several other processing stages to extract clear speech and audio from noise and other clutter.

66AK2G02 board image

Features

  • Extract audio from noisy environments using a single digital signal processor (DSP) and circular microphone board (CMB)
  • Remove background noise from audio sources
  • CMB enables 360° reception of audio sources

Application example

  • Cloud interface-based speech recognition for voice-activated digital assistant applications
  • Cloud interface-based speech recognition for smart home applications
  • Local speech recognition for voice-based home appliance control
  • Voice and speech applications


Click here for details of this reference design
Audio Preprocessing System Reference Design for Speech Based Applications Using 66AK2G02

Products and technologies that form the cornerstone of solutions

C5000 and C6000 series DSPs for front-end processing

Both of these DSP families are intended for front-end processing.
The C5000 series targets battery-powered portable devices, while the C6000 series targets stationary devices and devices that require network connectivity.

C5000 series

C5000 DSP (for portable devices)

  • Main products: TMS320C5515, C5517, C5535, C5545
  • Product features: low power consumption, small size, low price
  • Performance: 50 MHz to 200 MHz fixed-point DSP
  • Interface: I2S, audio serial port, UART, USB, etc.

 

C6000 series

C6000 DSP (for stationary equipment)

  • Main products: TMS320C674x DSP, 66AK2G0x
  • Product features: low power consumption, floating-point DSP, high performance
  • Performance: up to 456 MHz (C674x), up to 600 MHz (66AK2G0x)
  • Interface: I2S, audio serial port, Ethernet, USB

 

Features of the Audio ADC used

The PCM186x family of audio front-end devices takes a new approach to audio function integration. It also facilitates compliance with European ecodesign laws and enables high performance end products.

Additionally, no 5V power supply or external programmable gain amplifier is required, resulting in a smaller, smarter product at a lower cost.

PCM186x block diagram

Features

  • Small size
  • Very high recording quality
  • Supports TDM (Time Division Multiplexing) mode for connecting multiple chips
  • Supports both analog and digital microphones
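In TDM mode, samples from several channels share one serial data line, one time slot per channel. On the receiving side the stream has to be split back into per-channel buffers, which can be sketched as below. The slot order (channel 0 first, then 1, 2, and so on in every frame) is an assumption for illustration; the actual slot layout depends on how the devices are configured.

```python
from typing import List, Sequence


def deinterleave_tdm(stream: Sequence[int], channels: int) -> List[List[int]]:
    """Split an interleaved TDM sample stream into one list per channel.

    Assumes every frame carries exactly one sample per channel,
    always in the same slot order.
    """
    if channels <= 0:
        raise ValueError("channels must be positive")
    if len(stream) % channels != 0:
        raise ValueError("stream does not contain whole TDM frames")
    # Channel k occupies slot k of every frame, so take every Nth sample
    return [list(stream[ch::channels]) for ch in range(channels)]
```

For an 8-microphone board built from two 4-channel ADCs, the same function would be called with `channels=8`, again assuming the slots are laid out consecutively.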


Specifications

  • 103 dB SNR
  • ~128 mW stereo recording at 48 kHz
  • 4ch stereo analog input

 

Microphone reference design using PCM1864

TI offers reference designs that help speed up development.
The two reference designs below are both for applications that require clear speech, such as voice triggers and speech recognition.

With this design, the DSP system extracts clear audio from noisy environments and converts it into a digital stream.

Circular Microphone Array Board (CMB)

  • 7 outer microphones and 1 center microphone, implemented with 2 PCM1864 audio ADCs
  • I2S is used as the interface to the DSP
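For a circular layout like this, the delay with which a distant sound reaches each microphone depends on the direction of the source, and a beamformer can use these delays to steer toward the talker. The sketch below computes them for an idealized array with evenly spaced outer microphones around a center microphone; the even spacing, radius, and sample rate are assumed values for illustration, not figures from the reference design.

```python
import math
from typing import List

SPEED_OF_SOUND = 343.0  # m/s in air at about 20 degrees C


def circular_array_delays(num_ring_mics: int, radius_m: float,
                          source_angle_deg: float,
                          sample_rate_hz: float) -> List[float]:
    """Arrival delays in samples, relative to the center microphone.

    Index 0 is the center microphone. A negative value means the sound
    reaches that microphone earlier than it reaches the center.
    """
    theta = math.radians(source_angle_deg)
    delays = [0.0]
    for k in range(num_ring_mics):
        phi = 2.0 * math.pi * k / num_ring_mics  # angular position of mic k
        # How much closer this microphone is to a distant source
        advance_m = radius_m * math.cos(phi - theta)
        delays.append(-advance_m / SPEED_OF_SOUND * sample_rate_hz)
    return delays
```

Because the ring covers every direction, the same calculation works for any source angle, which is what allows 360° reception.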


Click here for reference design
PCM1864-Based Circular Microphone Board (CMB) Reference Design

Linear Microphone Board (LMB)

  • 4 microphones implemented with a PCM1864 audio ADC
  • I2S is used as the interface to the DSP


Click here for reference design
PCM1864-Based Linear Microphone Board (LMB) Reference Design

Left: PCM1864-Based Circular Microphone Board (CMB) Reference Design Board Image / Right: PCM1864-Based Linear Microphone Board (LMB) Reference Design Board Image

Easy voice operation with TI's solution

What did you think?
Voice operation of devices is attracting attention, and more and more of the things around us are likely to become voice-operated.

TI's hardware and 3rd party solutions make embedding command operations much easier.

Related information

Click here for recommended articles/materials

Let's make a speech recognition demo that responds to phrases
This is different from a microcomputer! What is a DSP specialized for digital signal processing?
What is a sensor? Basic knowledge for digitization and IoT

Click here to purchase products

TMDSEVM5517
EVMK2G
BOOST5545ULP
CC3220SF-LAUNCHXL

Click here for manufacturer site/other related links

DSP (Digital Signal Processor) Overview
C5000™ Ultra-Low Power DSP
C6000 Power Optimized DSP
C6000 multi-core DSP+ARM SoC
Sensory Inc.