One of the apps that has become essential when traveling abroad is an automatic translation app. In the past, people often got by with broken English, depending on the local language environment, or pointed to pictures of dishes on restaurant menus to place orders. After automatic translation apps such as Google Translate appeared, however, even simple travel conversations became much easier.
At that time, however, users still had to type the text they wanted translated. In that sense, it was somewhat inconvenient, much as Korean officials once communicated with envoys from China or Japan through written Chinese characters.
As AI technology continued to develop, translation became much more convenient. Today, simply pointing a camera at street signs or restaurant menus allows the text to be translated instantly. In addition, translation apps can recognize speech directly and translate it without the need for manual typing.
AI models can understand text in images and human speech because they are trained not only on text but also on multimodal data, including images, speech, and video. Flitto also develops speech recognition and translation AI models and provides a Live Translation service that enables real-time interpretation. As its overseas client base in the United States, Japan, China, the Middle East, and other regions continues to grow, the Live Translation service was used at 442 conferences across 13 countries in 2025.

Recently, speech recognition performance for major languages such as English, Chinese, Spanish, and French has improved significantly. As a result, demand for speech data from minority languages and regional dialects has also increased. In the case of Korean, companies building speech recognition models initially trained their systems mostly on standard Korean as spoken in the Seoul metropolitan area; over time, speech data from regional dialects such as those of the Gyeongsang, Jeolla, and Jeju regions was added. Similarly, speech data collection is now expanding to include English spoken outside the United States and the United Kingdom, Spanish spoken in Latin America, and various regional dialects of Arabic rather than only standard Arabic.
There are countless languages around the world, and speech data differs fundamentally from the parallel corpora used in machine translation: even when the same sentence is spoken, its pronunciation can vary with the speaker's region, gender, age, and other factors. For this reason, it is necessary to collect as much diverse speech data as possible.
Companies that build speech recognition models often face limitations in collecting large amounts of speech data on their own. To address this demand, Flitto has focused on collecting speech data along with parallel corpora since the early stages of the company.
Collecting Speech Using Text
One of the simplest ways to collect speech data is to provide sentences and ask participants to read them aloud. Flitto has built a data warehouse using text data collected over the years, which includes not only written sentences but also many conversational expressions. Flitto extracts sentences from this warehouse and opens speech collection events on the Arcade platform, where users participate by recording themselves reading the provided sentences.
Unlike parallel corpus tasks that require foreign language skills, speech data collection only requires participants to read sentences in their native language. This makes it possible for people of all ages and backgrounds to participate, allowing speech data to be collected quickly.
Korean sentence:
“The weather suddenly became cold today, so it feels like winter again.”
- Speech recorded by a male in his 30s: ko_speech_000001_30_M_1.wav
- Speech recorded by a male in his 30s: ko_speech_000001_30_M_2.wav
- Speech recorded by a female in her 20s: ko_speech_000001_20_F_1.wav
- Speech recorded by a male in his 40s: ko_speech_000001_40_M_1.wav
…
Example of collected Korean speech
Japanese sentence:
“今日は天気が急に寒くなって冬になったようです。”
- Speech recorded by a female in her 30s: ja_speech_000001_30_F_1.wav
- Speech recorded by a male in his 20s: ja_speech_000001_20_M_1.wav
- Speech recorded by a male in his 20s: ja_speech_000001_20_M_2.wav
- Speech recorded by a female in her 40s: ja_speech_000001_40_F_1.wav
…
Example of collected Japanese speech
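The sample filenames above suggest a consistent naming convention that encodes the language, sentence ID, speaker demographics, and take number. The pattern below is inferred from those examples only (it is not a documented Flitto schema), but it shows how such filenames can be parsed back into structured metadata:

```python
import re

# Pattern inferred from the sample filenames above (an assumption, not
# an official schema): {lang}_speech_{sentence_id}_{age_decade}_{gender}_{take}.wav
FILENAME_RE = re.compile(
    r"^(?P<lang>[a-z]{2})_speech_(?P<sentence_id>\d+)"
    r"_(?P<age>\d{2})_(?P<gender>[MF])_(?P<take>\d+)\.wav$"
)

def parse_recording(filename: str) -> dict:
    """Split a recording filename into its sentence/speaker metadata."""
    m = FILENAME_RE.match(filename)
    if m is None:
        raise ValueError(f"unexpected filename: {filename}")
    meta = m.groupdict()
    meta["age"] = int(meta["age"])    # decade of the speaker's age
    meta["take"] = int(meta["take"])  # nth recording of this sentence by this group
    return meta

print(parse_recording("ko_speech_000001_30_M_1.wav"))
# {'lang': 'ko', 'sentence_id': '000001', 'age': 30, 'gender': 'M', 'take': 1}
```

Keeping demographics in the filename makes it easy to check dataset balance (for example, how many recordings came from female speakers in their 20s) without opening any audio files.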
Readers who understand Japanese may notice that the two sentences above were originally built as parallel corpus sentences. While they existed as parallel corpora in the data warehouse, once speech recordings are collected for each sentence, they can also be used as Speech-to-Speech Translation (S2ST) data.
For real-time interpretation, the conventional process involves converting speech to text (Speech-to-Text), translating the text (Machine Translation), and then converting the translated text back into speech (Text-to-Speech). Each stage requires a separate AI model.
In contrast, S2ST translates source speech directly into target speech without an explicit machine translation stage, which reduces the number of models required and allows faster processing. Although S2ST technology still has limitations today, if larger datasets are collected and AI models continue to improve, near real-time speech translation similar to the “Babel Fish” described in The Hitchhiker’s Guide to the Galaxy may eventually become possible.
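The structural difference between the two approaches can be sketched with toy stand-in functions. Everything below is an illustrative placeholder (a one-entry dictionary instead of real models), meant only to show that the cascade chains three separate models while S2ST exposes a single speech-in, speech-out interface:

```python
# Toy stand-ins; not real models. A single Korean-to-English entry
# serves as our "translation model" for illustration.
KO_EN = {"안녕하세요": "hello"}

def speech_to_text(audio: str) -> str:
    return audio  # pretend the audio clip is already its transcript

def machine_translate(text: str) -> str:
    return KO_EN.get(text, text)

def text_to_speech(text: str) -> str:
    return f"<audio:{text}>"

def cascaded_s2st(audio: str) -> str:
    """STT -> MT -> TTS: three models, each adding latency and possible errors."""
    return text_to_speech(machine_translate(speech_to_text(audio)))

def direct_s2st(audio: str) -> str:
    """One model mapping source speech straight to target speech."""
    # In a real system this is a single neural network; here we simply
    # collapse the three stages to show the interface difference.
    return f"<audio:{KO_EN.get(audio, audio)}>"

print(cascaded_s2st("안녕하세요"))  # <audio:hello>
print(direct_s2st("안녕하세요"))    # <audio:hello>
```

Both paths produce the same result here, but the direct version has one model boundary instead of three, which is where the latency savings of S2ST come from.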
Collecting Speech Through Natural Conversation
In the early stages of speech recognition model development, speech data was collected by asking participants to read given sentences. However, because the sentences were scripted, the recordings often sounded like someone reading from a book or carefully pronouncing each word to avoid mistakes, which resulted in unnatural speech.

Speech data is used not only for speech recognition (Speech-to-Text) but also for speech synthesis (Text-to-Speech). If models are trained using unnatural recordings, the synthesized speech may also sound unnatural. Many people have experienced this when listening to mechanical voices while waiting during customer service calls.
For this reason, more recent methods involve giving participants one or two keywords and asking them to speak freely, or connecting two users and asking them to have a natural conversation similar to a phone call. These approaches allow more natural speech to be collected.
However, because no predefined text exists in these recordings, the speech must be converted into text afterward. At Flitto, speech recognition models are first used to generate transcripts, and users on the Arcade platform then review the audio and correct any errors in the text. This process helps secure more natural speech data.
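Flitto's internal review tooling is not described here, but one standard way to quantify how much human correction an ASR draft needed is word error rate (WER): the word-level edit distance between the draft and the corrected transcript, divided by the length of the corrected transcript. A minimal implementation, using an illustrative sentence from earlier in the article:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

corrected = "the weather suddenly became cold today"
asr_draft = "the weather suddenly become cold today"
print(round(word_error_rate(corrected, asr_draft), 3))  # 1 error / 6 words -> 0.167
```

Tracking WER per language or per dialect also reveals where the recognition model is weakest, and therefore where more training data is needed.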
Once the speech and text created through Arcade are registered in the data warehouse, the text can later be extracted and translated into multiple languages. As a result, speech data can be expanded into multilingual parallel corpus datasets.
How Multimodal Data Powers AI Training
Parallel corpora are created by translating sentences from one language into another. Speech data is created by reading text or speaking naturally. Although these datasets may appear to be separate, they are integrated through the data warehouse. As described above, parallel corpora can be used to generate speech data, and speech data can also be used to create parallel corpora.
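This integration can be pictured as a warehouse entry keyed by sentence ID, holding both the text in each language and the recordings collected for it. The field names below are an illustrative sketch, not Flitto's actual schema; the sentence and filenames come from the examples earlier in the article:

```python
# Illustrative sketch of one warehouse entry (not Flitto's real schema).
warehouse = {
    "000001": {
        # parallel-corpus side: the same sentence in multiple languages
        "text": {
            "en": "The weather suddenly became cold today, so it feels like winter again.",
            "ja": "今日は天気が急に寒くなって冬になったようです。",
        },
        # speech side: recordings collected per language on Arcade
        "audio": {
            "ko": ["ko_speech_000001_30_M_1.wav", "ko_speech_000001_20_F_1.wav"],
            "ja": ["ja_speech_000001_30_F_1.wav", "ja_speech_000001_20_M_1.wav"],
        },
    }
}

def s2st_pairs(entry: dict, src: str, tgt: str) -> list[tuple[str, str]]:
    """Pair every source-language recording with every target-language one,
    yielding speech-to-speech translation training pairs for this sentence."""
    return [(s, t) for s in entry["audio"][src] for t in entry["audio"][tgt]]

print(len(s2st_pairs(warehouse["000001"], "ko", "ja")))  # 2 * 2 = 4 pairs
```

Because everything hangs off the same sentence ID, adding a recording in one language immediately multiplies the number of usable S2ST pairs, which is how text-only parallel corpora grow into multimodal datasets.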
As the number of model parameters continues to increase, the amount of data required for training has also grown significantly. The Arcade platform and the data warehouse have become effective tools for meeting this growing demand.
Flitto began its data business by building parallel corpora. Over time, it expanded into images and speech, enabling the rapid creation of multimodal datasets.
In the next article, we will explore how the emergence of large language models (LLMs) has created the need for new types of datasets, and how Flitto is responding to these changes.
