{"id":1679,"date":"2026-03-12T22:00:00","date_gmt":"2026-03-12T13:00:00","guid":{"rendered":"https:\/\/datalab.flitto.com\/en\/company\/blog\/?p=1679"},"modified":"2026-03-11T10:12:34","modified_gmt":"2026-03-11T01:12:34","slug":"ai-training-data-speech-multimodal-data","status":"publish","type":"post","link":"https:\/\/datalab.flitto.com\/en\/company\/blog\/ai-training-data-speech-multimodal-data\/","title":{"rendered":"[Data Deep Dive #5] How Speech and Multimodal Data Power AI Training"},"content":{"rendered":"\n<div style=\"height:35px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>One of the apps that has become essential when traveling abroad is an automatic translation app. In the past, people often communicated in broken English depending on the language environment, or pointed to pictures of dishes on restaurant menus to place orders. However, after automatic translation apps such as Google Translate appeared, even simple travel conversations became much easier.<\/p>\n\n\n\n<p>At that time, however, users still had to type their language in order to translate it. In that sense, it was somewhat inconvenient, similar to how officials from Korea used to communicate with envoys from China or Japan through written Chinese characters.<\/p>\n\n\n\n<p>As AI technology continued to develop, translation became much more convenient. Today, simply pointing a camera at street signs or restaurant menus allows the text to be translated instantly. In addition, translation apps can recognize speech directly and translate it without the need for manual typing.<\/p>\n\n\n\n<div style=\"height:35px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p><strong>AI models can understand text in images and human speech because they are trained not only on text but also on <mark style=\"background-color:rgba(0, 0, 0, 0);color:#1978f0\" class=\"has-inline-color\">multimodal <\/mark>data, including images, speech, and video<\/strong>. 
<strong>Flitto also develops speech recognition and translation AI models and provides a <mark style=\"background-color:rgba(0, 0, 0, 0);color:#1978f0\" class=\"has-inline-color\"><a href=\"https:\/\/livetr.flitto.com\/en\/\">Live Translation service<\/a><\/mark> that enables real-time interpretation<\/strong>. With its overseas client base growing in markets such as the United States, Japan, China, and the Middle East, <strong>Flitto\u2019s Live Translation service was used at 442 conferences across 13 countries in 2025.<\/strong><\/p>\n\n\n\n<div style=\"height:35px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"327\" src=\"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/multimodal-dataset-flitto-600x327.png\" alt=\"\" class=\"wp-image-1691\" style=\"width:840px;height:auto\" srcset=\"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/multimodal-dataset-flitto-600x327.png 600w, https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/multimodal-dataset-flitto-300x164.png 300w, https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/multimodal-dataset-flitto-768x419.png 768w, https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/multimodal-dataset-flitto-1024x558.png 1024w, https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/multimodal-dataset-flitto.png 1280w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/figure>\n\n\n\n<div style=\"height:35px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Recently, speech recognition performance for major languages such as English, Chinese, Spanish, and French has improved significantly. 
As a result, <strong>demand for <mark style=\"background-color:rgba(0, 0, 0, 0);color:#1978f0\" class=\"has-inline-color\">speech data<\/mark> from minority languages and regional dialects has also increased.<\/strong> In the case of Korean, companies building speech recognition models initially trained their systems mostly on standard Korean spoken in the Seoul metropolitan area. Over time, additional speech data from regional dialects, such as those spoken in the Gyeongsang, Jeolla, and Jeju regions, was added. Similarly, speech data is now being expanded to include English spoken outside the United States and the United Kingdom, Spanish spoken in Latin America, and various regional dialects of Arabic rather than only standard Arabic.<\/p>\n\n\n\n<p>There are countless languages around the world, and speech data differs significantly from the parallel corpora used in machine translation. Even when the same sentence is spoken, the way it is pronounced can vary depending on region, gender, age, and other factors. For this reason, <strong>it is necessary to collect as much speech data as possible.<\/strong><\/p>\n\n\n\n<p>Companies that build speech recognition models often face limitations in collecting large amounts of speech data on their own. To address this need, Flitto has focused on collecting speech data along with parallel corpora since its early stages.<\/p>\n\n\n\n<div style=\"height:35px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<div style=\"height:35px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Collecting Speech Using Text<\/h2>\n\n\n\n<div style=\"height:35px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>One of the simplest ways to collect speech data is to provide sentences and ask participants to read them aloud. 
<strong><mark style=\"background-color:rgba(0, 0, 0, 0);color:#1978f0\" class=\"has-inline-color\">Flitto has built a data warehouse <\/mark><\/strong>using text data collected over the years, which includes not only written sentences but also many conversational expressions. Flitto extracts sentences from the data warehouse and opens speech collection events on the <strong><mark style=\"background-color:rgba(0, 0, 0, 0);color:#1978f0\" class=\"has-inline-color\">Arcade platform<\/mark><\/strong>, where users can participate and record their voices by reading the provided sentences.<\/p>\n\n\n\n<p>Unlike parallel corpus tasks that require foreign language skills, speech data collection only requires participants to read sentences in their native language. This makes it possible for people of all ages and backgrounds to participate, allowing speech data to be collected quickly.<\/p>\n\n\n\n<div style=\"height:35px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Korean sentence:<br>\u201cThe weather suddenly became cold today, so it feels like winter again.\u201d<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Speech recorded by a male in his 30s: ko_speech_000001_30_M_1.wav<\/li>\n\n\n\n<li>Speech recorded by a male in his 30s: ko_speech_000001_30_M_2.wav<\/li>\n\n\n\n<li>Speech recorded by a female in her 20s: ko_speech_000001_20_F_1.wav<\/li>\n\n\n\n<li>Speech recorded by a male in his 40s: ko_speech_000001_40_M_1.wav<br>\u2026<\/li>\n<\/ul>\n\n\n\n<p>Example of collected Korean speech<\/p>\n\n\n\n<div style=\"height:35px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Japanese sentence:<br>\u201c\u4eca\u65e5\u306f\u5929\u6c17\u304c\u6025\u306b\u5bd2\u304f\u306a\u3063\u3066\u51ac\u306b\u306a\u3063\u305f\u3088\u3046\u3067\u3059\u3002\u201d<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Speech recorded by a female in her 30s: ja_speech_000001_30_F_1.wav<\/li>\n\n\n\n<li>Speech recorded by a male in his 20s: 
ja_speech_000001_20_M_1.wav<\/li>\n\n\n\n<li>Speech recorded by a male in his 20s: ja_speech_000001_20_M_2.wav<\/li>\n\n\n\n<li>Speech recorded by a female in her 40s: ja_speech_000001_40_F_1.wav<br>\u2026<\/li>\n<\/ul>\n\n\n\n<p>Example of collected Japanese speech<\/p>\n\n\n\n<div style=\"height:35px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Readers who understand Japanese may notice that the two sentences above were originally built as <a href=\"https:\/\/datalab.flitto.com\/en\/company\/blog\/flitto-ai-training-data\/\" type=\"link\" id=\"https:\/\/datalab.flitto.com\/en\/company\/blog\/flitto-ai-training-data\/\">parallel corpus sentences<\/a>. While these sentences exist as parallel corpora in the data warehouse, once speech recordings are collected for each one, they can also be used as <strong><mark style=\"background-color:rgba(0, 0, 0, 0);color:#1978f0\" class=\"has-inline-color\">Speech-to-Speech Translation (S2ST) data.<\/mark><\/strong><\/p>\n\n\n\n<p>For real-time interpretation, <strong><mark style=\"background-color:rgba(0, 0, 0, 0);color:#1978f0\" class=\"has-inline-color\">the conventional process involves converting speech to text (Speech-to-Text), translating the text (Machine Translation), and then converting the translated text back into speech (Text-to-Speech).<\/mark><\/strong> Each stage requires a separate AI model.<\/p>\n\n\n\n<p>In contrast, S2ST translates speech directly without an explicit machine translation step, which reduces the number of models required and allows faster processing. 
Although S2ST technology still has limitations today, if larger datasets are collected and AI models continue to improve, near real-time speech translation similar to the \u201cBabel Fish\u201d described in <em>The Hitchhiker\u2019s Guide to the Galaxy<\/em> may eventually become possible.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<div style=\"height:35px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Collecting Speech Through Natural Conversation<\/h2>\n\n\n\n<div style=\"height:35px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>In the early stages of speech recognition model development, speech data was collected by asking participants to read given sentences. However, because the sentences were scripted, the recordings often sounded like someone reading from a book or carefully pronouncing each word to avoid mistakes, which resulted in unnatural speech.<\/p>\n\n\n\n<div style=\"height:35px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"600\" height=\"327\" src=\"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/speech-data-collection-flitto-600x327.png\" alt=\"\" class=\"wp-image-1692\" style=\"width:840px;height:auto\" srcset=\"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/speech-data-collection-flitto-600x327.png 600w, https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/speech-data-collection-flitto-300x164.png 300w, https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/speech-data-collection-flitto-768x419.png 768w, https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/speech-data-collection-flitto-1024x558.png 1024w, https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/speech-data-collection-flitto.png 1280w\" sizes=\"auto, 
(max-width: 600px) 100vw, 600px\" \/><\/figure>\n\n\n\n<div style=\"height:35px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p><strong><mark style=\"background-color:rgba(0, 0, 0, 0);color:#1978f0\" class=\"has-inline-color\">Speech data is used not only for <a href=\"https:\/\/en.wikipedia.org\/wiki\/Speech_recognition\">speech recognition<\/a> (Speech-to-Text) but also for speech synthesis (Text-to-Speech)<\/mark><\/strong>. If models are trained using unnatural recordings, the synthesized speech may also sound unnatural. Many people have experienced this when listening to mechanical voices while waiting during customer service calls.<\/p>\n\n\n\n<p>For this reason, more recent methods involve giving participants one or two keywords and asking them to speak freely, or connecting two users and asking them to have a natural conversation similar to a phone call. These approaches allow more natural speech to be collected.<\/p>\n\n\n\n<p>However, because no predefined text exists in these recordings, the speech must be converted into text afterward. At Flitto, speech recognition models are first used to generate transcripts, and users on the Arcade platform then review the audio and correct any errors in the text. <strong><mark style=\"background-color:rgba(0, 0, 0, 0);color:#1978f0\" class=\"has-inline-color\">This process helps secure more natural speech data.<\/mark><\/strong><\/p>\n\n\n\n<p>Once the speech and text created through Arcade are registered in the data warehouse, the text can later be extracted and translated into multiple languages. 
As a result, speech data can be expanded into multilingual parallel corpus datasets.<\/p>\n\n\n\n<div style=\"height:35px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>How Multimodal Data Powers AI Training<\/strong><\/h2>\n\n\n\n<div style=\"height:35px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Parallel corpora are created by translating sentences from one language into another. Speech data is created by reading text or speaking naturally. Although these datasets may appear to be separate, they are integrated through the data warehouse. As described above, parallel corpora can be used to generate speech data, and speech data can also be used to create parallel corpora.<\/p>\n\n\n\n<p>As the number of model parameters continues to increase, the amount of data required for training has also grown significantly. The Arcade platform and the data warehouse have become effective tools for meeting this growing demand.<\/p>\n\n\n\n<p>Flitto began its data business by building parallel corpora. <strong><mark style=\"background-color:rgba(0, 0, 0, 0);color:#1978f0\" class=\"has-inline-color\">Over time, it expanded into images and speech, enabling the rapid creation of <a href=\"https:\/\/en.wikipedia.org\/wiki\/Multimodal_learning\">multimodal datasets<\/a>.<\/mark><\/strong><\/p>\n\n\n\n<p>In the next article, we will explore how the emergence of large language models (LLMs) has created the need for new types of datasets, and how Flitto is responding to these changes.<\/p>\n\n\n\n<div style=\"height:35px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n","protected":false},"excerpt":{"rendered":"<p>One of the apps that has become essential when traveling abroad is an automatic translation app. In the past, people often communicated in broken English depending on the language environment, or pointed to pictures of dishes on restaurant menus to place orders. 
However, after automatic translation apps such as Google Translate appeared, even simple travel [&hellip;]<\/p>\n","protected":false},"author":8,"featured_media":1680,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_uf_show_specific_survey":0,"_uf_disable_surveys":false,"footnotes":""},"categories":[8],"tags":[97,71,123,118,49,51,31,76,154,155],"class_list":["post-1679","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-analysis","tag-ai","tag-ai-data","tag-ai-language-data","tag-ai-training-data","tag-ai-translation","tag-flitto","tag-language-data","tag-live-translation","tag-speech-data","tag-speech-to-speech-translation"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.5 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>How Speech and Multimodal Data Power AI Training<\/title>\n<meta name=\"description\" content=\"Learn how speech data, parallel corpora, and multimodal datasets are collected and managed to train AI models.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/datalab.flitto.com\/en\/company\/blog\/ai-training-data-speech-multimodal-data\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"How Speech and Multimodal Data Power AI Training\" \/>\n<meta property=\"og:description\" content=\"Learn how speech data, parallel corpora, and multimodal datasets are collected and managed to train AI models.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/datalab.flitto.com\/en\/company\/blog\/ai-training-data-speech-multimodal-data\/\" \/>\n<meta 
property=\"og:site_name\" content=\"Flitto DataLab\" \/>\n<meta property=\"article:published_time\" content=\"2026-03-12T13:00:00+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/ChatGPT-Image-Mar-10-2026-10_50_12-AM-600x400.png\" \/>\n\t<meta property=\"og:image:width\" content=\"600\" \/>\n\t<meta property=\"og:image:height\" content=\"400\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Flitto DataLab Admin\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Flitto DataLab Admin\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"6 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/ai-training-data-speech-multimodal-data\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/ai-training-data-speech-multimodal-data\\\/\"},\"author\":{\"name\":\"Flitto DataLab Admin\",\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/#\\\/schema\\\/person\\\/c09e946fb133658e0475d281e795362e\"},\"headline\":\"[Data Deep Dive #5] How Speech and Multimodal Data Power AI 
Training\",\"datePublished\":\"2026-03-12T13:00:00+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/ai-training-data-speech-multimodal-data\\\/\"},\"wordCount\":1220,\"publisher\":{\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/ai-training-data-speech-multimodal-data\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/wp-content\\\/uploads\\\/ChatGPT-Image-Mar-10-2026-10_50_12-AM.png\",\"keywords\":[\"AI\",\"AI data\",\"AI Language Data\",\"AI Training Data\",\"AI Translation\",\"Flitto\",\"Language Data\",\"Live Translation\",\"Speech Data\",\"Speech to Speech Translation\"],\"articleSection\":[\"Analysis\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/ai-training-data-speech-multimodal-data\\\/\",\"url\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/ai-training-data-speech-multimodal-data\\\/\",\"name\":\"How Speech and Multimodal Data Power AI Training\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/ai-training-data-speech-multimodal-data\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/ai-training-data-speech-multimodal-data\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/wp-content\\\/uploads\\\/ChatGPT-Image-Mar-10-2026-10_50_12-AM.png\",\"datePublished\":\"2026-03-12T13:00:00+00:00\",\"description\":\"Learn how speech data, parallel corpora, and multimodal datasets are collected and managed to train AI 
models.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/ai-training-data-speech-multimodal-data\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/ai-training-data-speech-multimodal-data\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/ai-training-data-speech-multimodal-data\\\/#primaryimage\",\"url\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/wp-content\\\/uploads\\\/ChatGPT-Image-Mar-10-2026-10_50_12-AM.png\",\"contentUrl\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/wp-content\\\/uploads\\\/ChatGPT-Image-Mar-10-2026-10_50_12-AM.png\",\"width\":1536,\"height\":1024,\"caption\":\"How Speech and Multimodal Data Power AI Training\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/ai-training-data-speech-multimodal-data\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"[Data Deep Dive #5] How Speech and Multimodal Data Power AI Training\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/#website\",\"url\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/\",\"name\":\"Flitto DataLab\",\"description\":\"Latest AI and Data 
Insights\",\"publisher\":{\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/#organization\",\"name\":\"Flitto DataLab\",\"url\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/wp-content\\\/uploads\\\/2023\\\/07\\\/datalab.svg\",\"contentUrl\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/wp-content\\\/uploads\\\/2023\\\/07\\\/datalab.svg\",\"width\":1,\"height\":1,\"caption\":\"Flitto DataLab\"},\"image\":{\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.linkedin.com\\\/showcase\\\/flitto-datalab\\\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/#\\\/schema\\\/person\\\/c09e946fb133658e0475d281e795362e\",\"name\":\"Flitto DataLab Admin\",\"url\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/author\\\/daeun-lee\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"How Speech and Multimodal Data Power AI Training","description":"Learn how speech data, parallel corpora, and multimodal datasets are collected and managed to train AI models.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/datalab.flitto.com\/en\/company\/blog\/ai-training-data-speech-multimodal-data\/","og_locale":"en_US","og_type":"article","og_title":"How Speech and Multimodal Data Power AI Training","og_description":"Learn how speech data, parallel corpora, and multimodal datasets are collected and managed to train AI models.","og_url":"https:\/\/datalab.flitto.com\/en\/company\/blog\/ai-training-data-speech-multimodal-data\/","og_site_name":"Flitto DataLab","article_published_time":"2026-03-12T13:00:00+00:00","og_image":[{"width":600,"height":400,"url":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/ChatGPT-Image-Mar-10-2026-10_50_12-AM-600x400.png","type":"image\/png"}],"author":"Flitto DataLab Admin","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Flitto DataLab Admin","Est. 
reading time":"6 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/ai-training-data-speech-multimodal-data\/#article","isPartOf":{"@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/ai-training-data-speech-multimodal-data\/"},"author":{"name":"Flitto DataLab Admin","@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/#\/schema\/person\/c09e946fb133658e0475d281e795362e"},"headline":"[Data Deep Dive #5] How Speech and Multimodal Data Power AI Training","datePublished":"2026-03-12T13:00:00+00:00","mainEntityOfPage":{"@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/ai-training-data-speech-multimodal-data\/"},"wordCount":1220,"publisher":{"@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/#organization"},"image":{"@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/ai-training-data-speech-multimodal-data\/#primaryimage"},"thumbnailUrl":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/ChatGPT-Image-Mar-10-2026-10_50_12-AM.png","keywords":["AI","AI data","AI Language Data","AI Training Data","AI Translation","Flitto","Language Data","Live Translation","Speech Data","Speech to Speech Translation"],"articleSection":["Analysis"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/ai-training-data-speech-multimodal-data\/","url":"https:\/\/datalab.flitto.com\/en\/company\/blog\/ai-training-data-speech-multimodal-data\/","name":"How Speech and Multimodal Data Power AI 
Training","isPartOf":{"@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/ai-training-data-speech-multimodal-data\/#primaryimage"},"image":{"@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/ai-training-data-speech-multimodal-data\/#primaryimage"},"thumbnailUrl":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/ChatGPT-Image-Mar-10-2026-10_50_12-AM.png","datePublished":"2026-03-12T13:00:00+00:00","description":"Learn how speech data, parallel corpora, and multimodal datasets are collected and managed to train AI models.","breadcrumb":{"@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/ai-training-data-speech-multimodal-data\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/datalab.flitto.com\/en\/company\/blog\/ai-training-data-speech-multimodal-data\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/ai-training-data-speech-multimodal-data\/#primaryimage","url":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/ChatGPT-Image-Mar-10-2026-10_50_12-AM.png","contentUrl":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/ChatGPT-Image-Mar-10-2026-10_50_12-AM.png","width":1536,"height":1024,"caption":"How Speech and Multimodal Data Power AI Training"},{"@type":"BreadcrumbList","@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/ai-training-data-speech-multimodal-data\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/datalab.flitto.com\/en\/company\/blog\/"},{"@type":"ListItem","position":2,"name":"[Data Deep Dive #5] How Speech and Multimodal Data Power AI Training"}]},{"@type":"WebSite","@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/#website","url":"https:\/\/datalab.flitto.com\/en\/company\/blog\/","name":"Flitto DataLab","description":"Latest AI 
and Data Insights","publisher":{"@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/datalab.flitto.com\/en\/company\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/#organization","name":"Flitto DataLab","url":"https:\/\/datalab.flitto.com\/en\/company\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/2023\/07\/datalab.svg","contentUrl":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/2023\/07\/datalab.svg","width":1,"height":1,"caption":"Flitto DataLab"},"image":{"@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.linkedin.com\/showcase\/flitto-datalab\/"]},{"@type":"Person","@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/#\/schema\/person\/c09e946fb133658e0475d281e795362e","name":"Flitto DataLab 
Admin","url":"https:\/\/datalab.flitto.com\/en\/company\/blog\/author\/daeun-lee\/"}]}},"_links":{"self":[{"href":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-json\/wp\/v2\/posts\/1679","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-json\/wp\/v2\/users\/8"}],"replies":[{"embeddable":true,"href":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-json\/wp\/v2\/comments?post=1679"}],"version-history":[{"count":5,"href":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-json\/wp\/v2\/posts\/1679\/revisions"}],"predecessor-version":[{"id":1693,"href":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-json\/wp\/v2\/posts\/1679\/revisions\/1693"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-json\/wp\/v2\/media\/1680"}],"wp:attachment":[{"href":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-json\/wp\/v2\/media?parent=1679"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-json\/wp\/v2\/categories?post=1679"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-json\/wp\/v2\/tags?post=1679"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}