What Is Multimodal AI? A Quick Guide to Use Cases and Future Outlook - Smart Manufacturing

● Recommended for: ●

・Those who want to learn what multimodal AI is
・Those who want to know the use cases of multimodal AI

Time needed to finish reading this article

10 minutes

Introduction

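Hello! Today's topic is multimodal AI.

This article explains multimodal AI, which has drawn particular attention within deep learning in recent years.
It is written for readers who are hearing the term "multimodal AI" for the first time, or who are not deeply familiar with AI but want to follow current AI trends.

We hope it gives you a starting point for thinking ahead about where the AI industry is going.
Let's get started!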

What is multimodal AI?

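Multimodal AI is a deep learning approach that takes multiple types of data as input and processes them in an integrated way.
Conventional convolutional neural networks (CNNs) typically build a classifier, or transform data, from a single type of input.
Multimodal AI, by contrast, combines several types of input in one model.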

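When humans process information, they combine multiple kinds of sensory input from the outside world, typified by the five senses: sight, smell, touch, taste, and hearing.
Multimodal AI takes a similar approach: it trains a deep learning model on multiple kinds of data, much like the information processing the human brain performs, and builds a classifier from that model.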
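
As a rough illustration of what "processing several kinds of input in an integrated way" can mean, the sketch below concatenates feature vectors from two modalities into one vector and scores it with a single classifier (often called early fusion). The feature values, weights, and function names are all hypothetical and made up for illustration; a real system would learn them with a deep learning model.

```python
import math


def fuse_features(*modality_features):
    """Concatenate feature vectors from several modalities into one vector."""
    fused = []
    for features in modality_features:
        fused.extend(features)
    return fused


def score(fused, weights, bias=0.0):
    """Logistic score over the fused vector (a stand-in for a trained model)."""
    if len(fused) != len(weights):
        raise ValueError("weights must match the fused feature length")
    z = sum(x * w for x, w in zip(fused, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical features for one sample (not real encoder outputs)
image_features = [0.8, 0.1]  # e.g. from an image encoder
audio_features = [0.6]       # e.g. from an audio encoder

fused = fuse_features(image_features, audio_features)
weights = [1.5, -0.5, 2.0]   # made-up weights, one per fused feature
p_dog = score(fused, weights)
print(f"fused length={len(fused)}, P(dog)={p_dog:.3f}")
```

The point of the sketch is only that one model sees all modalities at once; how the features are extracted and how the classifier is trained are left out.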

A modality is a type of input information

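For example, to determine that the animal in an image is a dog, in most cases an AI model is trained on images (visual information) alone.
Training on and classifying from a single kind of information like this is called "single-modal."
Multimodal AI, on the other hand, could determine that the animal is a dog from multiple modalities that correspond to human senses, such as visual, auditory, and olfactory information.
It may help to picture a robot-like classifier in front of you that is equipped with human-like "five-sense sensors."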

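However, this is a picture of what multimodal AI might become in the future; the multimodal learning in use today has not reached that level.
So what kinds of information does it actually use?
In the dog image example, one possibility is to improve classification accuracy by also using the image's metadata (where and when it was taken, which camera was used, and so on), the other photos the same user has taken, and user attributes such as age and gender.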
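
One simple way to combine an image-based prediction with side information such as metadata is late fusion: each source produces its own probability for the same label, and the results are combined as a weighted average. The sketch below is a minimal illustration under that assumption; the source names, probabilities, and weights are hypothetical and do not come from any real system.

```python
def late_fusion(probabilities, weights):
    """Weighted average of per-source probabilities for the same label."""
    if probabilities.keys() != weights.keys():
        raise ValueError("each source needs exactly one weight")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(probabilities[name] * weights[name] for name in probabilities)
    return weighted / total_weight


# Hypothetical "is this a dog?" probabilities from three sources
probabilities = {
    "image": 0.70,           # image classifier alone
    "photo_metadata": 0.90,  # e.g. where and when the photo was taken
    "user_history": 0.80,    # e.g. what else this user photographs
}
# Made-up weights expressing how much each source is trusted
weights = {"image": 0.6, "photo_metadata": 0.2, "user_history": 0.2}

p_dog_fused = late_fusion(probabilities, weights)
print(f"fused P(dog) = {p_dog_fused:.2f}")
```

A weighted average is only one option; in practice the combination itself is often learned from data rather than set by hand.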

Advances in deep learning draw attention to multimodal AI

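With the rapid progress of deep learning research in recent years, multimodal AI is attracting attention as a way to make judgments that are closer to human perception.
As multimodal learning advances, a single AI model will be able to assess multiple factors at once. That makes it more likely that the model can also handle anomalous patterns that are hard to anticipate.
Easy-to-understand examples of such anomalies include detecting fake listings on flea market apps and fake profiles on matching apps.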

History of multimodal AI

So how has multimodal learning developed? Let's look at the history of multimodal AI.

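Research that pioneered multimodal learning, recognizing language from speech and images and converting it into text, began in 1986.
In fact, in noisy environments where speech is hard to hear, humans process language more accurately by reading lip movements and listening to the voice at the same time.
Alongside this line of research, techniques that recognize speech from "complex audio" combined with "images" and convert it into text have also been studied.
This was the beginning of multimodal learning.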

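Later, in 2013, research began on technology that takes arbitrary text entered by a user and has it spoken with facial expressions in a range of moods, from happy to angry.
Other research recognized human emotions (joy, sadness, anger) from both audio and images. Systems also appeared that automatically generate a description for an image, or generate a matching image conditioned on text or a caption.
Others answer questions asked in text about an image, or automatically generate audio from image information.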

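Multimodal AI continues to evolve. As the cost of resources such as data collection and information processing has fallen dramatically and algorithms have become more accurate, the technology has begun to be used for a variety of business purposes.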

Expanding business with multimodal AI

So in what kinds of situations is multimodal AI actually used? Here are two examples.

▼ Flea Market App Company A
Company A operates a flea market app where items can be listed 24 hours a day, so it uses AI to monitor new listings and judge whether they are legitimate.
When a new listing is registered, the system detects likely counterfeits based on the item's photos, its description, and the tags attached to it (brand information, etc.). Human operators can then quickly check the flagged items, which improves safety across the whole app.
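
The flow described above, where signals from photos, descriptions, and tags are combined and suspicious listings are handed to a human operator, could be sketched roughly as follows. This is not Company A's actual system: the `ListingSignals` fields, the weights, and the review threshold are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ListingSignals:
    photo_score: float        # 0..1 counterfeit likelihood from a photo model (assumed)
    description_score: float  # 0..1 counterfeit likelihood from a text model (assumed)
    tag_matches_photo: bool   # does the brand tag agree with what the photo shows?


def counterfeit_risk(signals: ListingSignals) -> float:
    """Combine per-modality signals into one risk score in [0, 1]."""
    risk = 0.5 * signals.photo_score + 0.3 * signals.description_score
    if not signals.tag_matches_photo:
        risk += 0.2  # a tag/photo mismatch adds a fixed penalty
    return min(risk, 1.0)


def needs_human_review(signals: ListingSignals, threshold: float = 0.6) -> bool:
    """Flag the listing for an operator when the combined risk is high."""
    return counterfeit_risk(signals) >= threshold


# Two made-up listings
suspicious = ListingSignals(photo_score=0.8, description_score=0.7, tag_matches_photo=False)
ordinary = ListingSignals(photo_score=0.1, description_score=0.2, tag_matches_photo=True)

for name, listing in [("suspicious", suspicious), ("ordinary", ordinary)]:
    print(name, round(counterfeit_risk(listing), 2), needs_human_review(listing))
```

The model only prioritizes listings here; the final decision stays with the human operator, as in the case described.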

▼ Sports game data analysis
Multimodal AI systems are also used in team sports such as soccer to analyze each player's performance from multiple angles.
Cameras, lasers, and wearable sensors attached to the athlete's arm collect measurement data for each athlete, which is analyzed in real time and fed back to the athlete and coach. Strategy planning and player selection, which previously relied solely on the experience and intuition of coaches and players, can now be done objectively using data.

Future Possibilities of Multimodal AI

In what direction will multimodal AI develop in the future?

▼ Evolution of input
Until now, inputs have mostly been easy-to-handle information such as images and text. In the future, tactile and olfactory sensors installed in robots may give rise to AI that can communicate in a way that is closer to human interaction.
Likewise, if voice recognition AI becomes able to read the speaker's tone of voice and hold a conversation that is attuned to their emotions, multimodal AI may be useful in many settings, including nursing care and medical care.

▼ Evolution of output
Image generation AI may come to generate not only images but also sounds, human movements, background music, and more. For example, a sentence you type in could be visualized automatically. A world where anyone can become a film director from the comfort of their own home and instantly create and share amazing art is no longer a dream.
Until now, generation has mostly stayed within a single modality, such as "image input → image output," but AI that more closely resembles human intellectual production is now being created. New innovations are likely in every field, including entertainment.

It is expected that multimodal AI technology will continue to attract attention in the future.