● Recommended for:

・Those who want to learn what multimodal AI is
・Those who want to know about use cases of multimodal AI

Estimated reading time

10 minutes

Introduction

Hello! This time, our topic is "multimodal AI".

In this article, we will explain "multimodal AI", which has become a hot topic in deep learning in recent years.
It is written for readers who are hearing the term "multimodal AI" for the first time, or who are not very familiar with AI but would like to keep up with AI trends.

We hope that reading this article will give you an opportunity to think about how the AI industry will move in the future.
Let's get right into it!

What is multimodal AI?

Multimodal AI is a deep learning approach that takes multiple types of data as input and processes them in an integrated manner.
Conventional models such as convolutional neural networks (CNNs) are typically trained as classifiers that process a single type of input information.
Multimodal AI, by contrast, combines several kinds of input data within a single model.

When humans process information, they combine multiple kinds of sensory input from the outside world, the "five senses": sight, hearing, smell, taste, and touch.
Multimodal AI builds a deep learning model from multiple kinds of data, creating a classifier that works in a way similar to the information processing performed by the human brain.

A "modality" is a type of input information

For example, when judging whether an animal in an image is a dog, AI models are in most cases trained only on the image (visual information).
This type of learning and judging from only one kind of information is called "single-modal".
Multimodal AI, on the other hand, would be able to determine that it is a dog from multiple modalities corresponding to the human senses, such as visual, auditory, and olfactory information.
It may help to imagine a robot-like classifier standing in front of you, equipped with "five senses" similar to those of a human.

However, this is a vision of what multimodal AI may become; the multimodal learning actually in use today has not reached that level.
So what kind of information is actually used?
Returning to the "dog image" example, classification accuracy can be improved by also using the image's metadata (where it was taken, when it was taken, what camera was used, and so on) and attributes of the user who posted it, such as age and gender. A simple sketch of this kind of fusion is shown below.
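To make this concrete, here is a minimal, hedged sketch of how image features and tabular metadata could be fused in a single classifier. It assumes a PyTorch-style setup; the module names, feature counts, and metadata fields (location, time, camera, user age and gender) are illustrative assumptions, not a description of any specific production system.

```python
# Illustrative sketch only: an image encoder's features are concatenated with
# tabular metadata (shooting location/time, camera, user age/gender, etc.)
# before the final classification layer. All names and sizes are hypothetical.
import torch
import torch.nn as nn

class MultimodalDogClassifier(nn.Module):
    def __init__(self, num_meta_features: int = 8, num_classes: int = 2):
        super().__init__()
        # Image branch: a tiny CNN that turns an image into a feature vector.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),          # -> (batch, 16)
        )
        # Metadata branch: a small MLP for the tabular features.
        self.meta_encoder = nn.Sequential(
            nn.Linear(num_meta_features, 16), nn.ReLU(),    # -> (batch, 16)
        )
        # Fusion head: decide "dog / not dog" from the combined representation.
        self.classifier = nn.Linear(16 + 16, num_classes)

    def forward(self, image: torch.Tensor, meta: torch.Tensor) -> torch.Tensor:
        img_feat = self.image_encoder(image)
        meta_feat = self.meta_encoder(meta)
        fused = torch.cat([img_feat, meta_feat], dim=1)     # simple concatenation fusion
        return self.classifier(fused)

# Usage with dummy data: 4 RGB images of 64x64 plus 8 metadata features each.
model = MultimodalDogClassifier()
logits = model(torch.randn(4, 3, 64, 64), torch.randn(4, 8))
print(logits.shape)  # torch.Size([4, 2])
```

Concatenating per-modality feature vectors like this is only one of several possible fusion strategies; the point is simply that a single model consumes more than one kind of input.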

Advances in deep learning draw attention to multimodal AI

Thanks to dramatic advances in deep learning research in recent years, "multimodal AI" is attracting attention as a way of making judgments that feel more human.
As multimodal learning progresses, a single AI model will be able to take multiple factors into account when making a judgment, making it more likely that it can handle unexpected or anomalous patterns.
Easy-to-understand examples of such anomalous patterns include fake listings on flea market apps and fake profiles on dating apps.

History of multimodal AI

Let's take a look at how multimodal learning has evolved and the history of multimodal AI.

Research on multimodal learning dates back to 1986, with work on recognizing language from "speech" and "images" (mouth movements) and converting it to text.
In fact, in noisy environments where speech is hard to hear, humans understand language more accurately by reading mouth movements while listening.
Alongside this, technology for recognizing speech from mixed, hard-to-hear audio together with images and converting it to text was also studied. This is how multimodal learning began.

Later, around 2013, research began on systems that let users enter arbitrary text and have it spoken aloud with various emotional expressions, from happy to angry.
Research has also been conducted on recognizing human emotions (joy, sadness, anger) from both audio and images, and systems that automatically generate images from text have appeared.
In addition, there are systems in which an AI answers text questions about an image, and systems that automatically generate speech from image information.

The field continues to evolve in various ways: the cost of resources such as data collection and information processing has fallen dramatically, algorithmic advances have improved accuracy, and the technology is beginning to be used in practice.

Expanding business with multimodal AI

So, in what kinds of situations is multimodal AI actually used? Let's look at two cases.

▼ Flea Market App Company A
On the flea market app operated by Company A, users can list products 24 hours a day. The company therefore uses AI to monitor new listings and determine whether they are legitimate.
When a new listing is registered, listings that are likely to be counterfeit can be detected based on the item photos, the descriptions, and the tags (brand information, etc.) attached to the items. This lets human operators quickly review potentially counterfeit items, improving the safety of the entire app. A rough sketch of such a model follows.
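As an illustration only, the following sketch shows one way a counterfeit-listing detector could fuse photos, description text, and brand tags. The vocabulary size, tag count, and module names are assumptions made for the example and are not taken from Company A's actual system.

```python
# Hedged sketch: fuse three modalities of a listing (photo, description text,
# brand/category tags) into a single counterfeit-likelihood score.
import torch
import torch.nn as nn

class ListingFraudModel(nn.Module):
    def __init__(self, vocab_size: int = 10000, num_tags: int = 500):
        super().__init__()
        # Photo branch: tiny CNN feature extractor.
        self.photo = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),            # -> (batch, 8)
        )
        # Description branch: bag-of-words text embedding.
        self.text = nn.EmbeddingBag(vocab_size, 16)           # -> (batch, 16)
        # Tag branch: multi-hot brand/category tags.
        self.tags = nn.Linear(num_tags, 8)                    # -> (batch, 8)
        # Fusion head: probability that the listing is counterfeit.
        self.head = nn.Sequential(nn.Linear(8 + 16 + 8, 1), nn.Sigmoid())

    def forward(self, photo, token_ids, offsets, tag_vec):
        fused = torch.cat(
            [self.photo(photo), self.text(token_ids, offsets), self.tags(tag_vec)],
            dim=1,
        )
        return self.head(fused)  # score near 1.0 -> flag for human review

# Usage with dummy data for two listings.
scores = ListingFraudModel()(
    torch.randn(2, 3, 32, 32),          # two listing photos
    torch.tensor([5, 17, 42, 8, 99]),   # token ids of both descriptions, concatenated
    torch.tensor([0, 3]),               # listing 1 = tokens 0-2, listing 2 = tokens 3-4
    torch.rand(2, 500),                 # multi-hot tag vectors
)
print(scores.shape)  # torch.Size([2, 1])
```

In practice, a listing flagged with a high score would be routed to a human operator rather than removed automatically, which matches the workflow described above.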

▼ Sports game data analysis
Multimodal AI systems are also used to analyze the performance of each player from multiple angles in team sports such as soccer.
Measurement data for each athlete is collected from cameras, lasers, and wearable sensors attached to the athlete's body, analyzed in real time, and fed back to the athletes and coaches. Strategy planning and player selection, which previously relied solely on the experience and intuition of coaches and players, can now be done objectively using data.

Future Possibilities of Multimodal AI

In what direction will multimodal AI develop in the future?

▼ Evolution of input
Until now, easy-to-handle information such as images and text has mostly been used as input, but in the future, tactile and olfactory sensors mounted on robots may give rise to AI that can communicate in a way closer to human interaction.
If speech recognition AI becomes able to read a speaker's tone of voice and hold conversations attuned to their emotions, multimodal AI could become useful in many settings, including nursing care and medical care.

▼ Evolution of output
Generative AI may become able to produce images, sounds, human movements, background music, and more. For example, a sentence you type could be automatically turned into a visual scene. A world where anyone can become a film director from the comfort of their own home and instantly create and share impressive works is no longer just a dream.
Until now, generation has mostly stayed within a single modality, such as "image in → image out", but AI that more closely resembles human creative activity is now emerging, and new innovations are likely to appear in every field, including entertainment.

It is expected that multimodal AI technology will continue to attract attention in the future.
