{"id":543,"date":"2024-05-14T18:21:46","date_gmt":"2024-05-14T09:21:46","guid":{"rendered":"https:\/\/datalab.flitto.com\/en\/company\/blog\/?p=543"},"modified":"2024-05-14T18:21:47","modified_gmt":"2024-05-14T09:21:47","slug":"thai-corpus-for-ai-systems-challenges-and-opportunities","status":"publish","type":"post","link":"https:\/\/datalab.flitto.com\/en\/company\/blog\/thai-corpus-for-ai-systems-challenges-and-opportunities\/","title":{"rendered":"Thai Corpus for AI Systems: Challenges and Opportunities"},"content":{"rendered":"\n<div style=\"height:40px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>As multilingual AI systems are on the rise, Thai corpus data is also gaining traction. Corpus data of every kind, whether monolingual, parallel, spoken, or specialized, is essential for developing advanced and helpful AI systems.<\/p>\n\n\n\n<p>Thai in particular has distinctive features that make it harder for AI systems to learn, and each challenge is rooted in the language\u2019s linguistic characteristics.<\/p>\n\n\n\n<p>In this article, we look at the linguistic attributes that set Thai apart from other languages and that must be considered in AI corpus construction.<\/p>\n\n\n\n<div style=\"height:40px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Thai as a language<\/strong><\/h2>\n\n\n\n<div style=\"height:36px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>As of 2020, over 69 million people speak and read Thai. Also called Central Thai, it is the most widely spoken of the 60 languages spoken in Thailand. While each of these languages is technically distinct, most share many characteristics. For instance, native speakers can often understand the other languages much as they would recognize dialects.<\/p>\n\n\n\n<p>Thai also features its own writing system, comprising consonants, vowels, and tone marks. 
These components, particularly the vowels, can form complex combinations governed by strict rules. These intricacies, together with the relative scarcity of data in the language, make it highly challenging for AI systems to fully learn Thai.<\/p>\n\n\n\n<p>Let us now look at the specific characteristics of the language that Thai corpus datasets must account for.<\/p>\n\n\n\n<div style=\"height:40px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Tonality in spoken Thai<\/strong><\/h2>\n\n\n\n<div style=\"height:36px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Thai&#8217;s tonal nature poses a significant challenge for speech recognition systems.<\/p>\n\n\n\n<p>The Thai language has five distinct tones, and changing the tone changes the meaning of a word. Therefore, accurate tone recognition is critical for natural language processing (NLP) and speech recognition technologies.<\/p>\n\n\n\n<p>For example, the word &#8220;mai&#8221; can mean &#8220;wood&#8221; (\u0e44\u0e21\u0e49, m\u00e1i), &#8220;no&#8221; (\u0e44\u0e21\u0e48, m\u00e2i), &#8220;new&#8221; (\u0e43\u0e2b\u0e21\u0e48, m\u00e0i), &#8220;silk&#8221; (\u0e44\u0e2b\u0e21, m\u01cei), or be a question particle, depending on the tone used.<\/p>\n\n\n\n<p>Thai speech corpus data can provide a wealth of examples for training models to discern these subtle tonal variations. This is indispensable for achieving high accuracy in voice assistants, transcription services, and real-time translation tools.<\/p>\n\n\n\n<div style=\"height:40px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Context-reliant grammatical structure<\/strong><\/h2>\n\n\n\n<div style=\"height:36px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Thai&#8217;s grammatical structure relies heavily on word order and auxiliary words rather than inflections. 
This presents a unique challenge for AI systems learning the language.<\/p>\n\n\n\n<p>For instance, the sentence &#8220;\u0e09\u0e31\u0e19\u0e44\u0e1b\u0e42\u0e23\u0e07\u0e40\u0e23\u0e35\u0e22\u0e19&#8221; (<em>ch\u01cen bpai rong rian<\/em>, lit. I go school) means &#8220;I go to school,&#8221; with the meaning derived from word order rather than verb conjugation.<\/p>\n\n\n\n<p>As this example suggests, AI models trained on Thai corpus data can learn the subtleties of Thai syntax and semantics. This knowledge is vital for machine translation systems, which must preserve the meaning and context of sentences without the help of inflectional cues, and for conversational AI, which must understand and generate contextually appropriate responses.<\/p>\n\n\n\n<div style=\"height:40px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Classifiers and contextual nuances<\/strong><\/h2>\n\n\n\n<div style=\"height:36px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Classifiers are an integral part of Thai, used to count and categorize nouns.<\/p>\n\n\n\n<p>For example, &#8220;one person&#8221; is &#8220;\u0e2b\u0e19\u0e36\u0e48\u0e07\u0e04\u0e19&#8221; (n\u00f9eng khon), where &#8220;khon&#8221; is the classifier for people, and &#8220;one animal&#8221; is &#8220;\u0e2b\u0e19\u0e36\u0e48\u0e07\u0e15\u0e31\u0e27&#8221; (n\u00f9eng dtua), where &#8220;dtua&#8221; is the classifier for animals.<\/p>\n\n\n\n<p>Thai corpus data helps AI systems grasp the nuances of classifier usage, improving the accuracy of language generation and comprehension tasks. 
Moreover, this factor is crucial for developing AI that interacts respectfully and appropriately with users across various social contexts.<\/p>\n\n\n\n<div style=\"height:40px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Pronouns and politeness particles<\/strong><\/h2>\n\n\n\n<div style=\"height:36px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Thai pronouns and politeness particles vary widely based on gender, age, and social status. For example, a male speaker might use &#8220;\u0e1c\u0e21&#8221; (ph\u01d2m) for &#8220;I,&#8221; while a female speaker might use &#8220;\u0e09\u0e31\u0e19&#8221; (ch\u01cen) or &#8220;\u0e14\u0e34\u0e09\u0e31\u0e19&#8221; (d\u00ec-ch\u01cen). Politeness particles like &#8220;\u0e04\u0e23\u0e31\u0e1a&#8221; (khr\u00e1p) for males and &#8220;\u0e04\u0e48\u0e30&#8221; (kh\u00e2) for females are added to the end of sentences to communicate with respect.<\/p>\n\n\n\n<p>AI systems must be adept at recognizing and using these correctly to ensure respectful and contextually appropriate communication.<\/p>\n\n\n\n<div style=\"height:36px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1100\" height=\"733\" src=\"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/colleagues-discussing-upcoming-meeting-2023-11-27-05-18-10-utc.jpg\" alt=\"\" class=\"wp-image-560\" srcset=\"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/colleagues-discussing-upcoming-meeting-2023-11-27-05-18-10-utc.jpg 1100w, https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/colleagues-discussing-upcoming-meeting-2023-11-27-05-18-10-utc-300x200.jpg 300w, https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/colleagues-discussing-upcoming-meeting-2023-11-27-05-18-10-utc-600x400.jpg 600w, 
https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/colleagues-discussing-upcoming-meeting-2023-11-27-05-18-10-utc-768x512.jpg 768w, https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/colleagues-discussing-upcoming-meeting-2023-11-27-05-18-10-utc-1024x682.jpg 1024w\" sizes=\"auto, (max-width: 1100px) 100vw, 1100px\" \/><figcaption class=\"wp-element-caption\"><em>Thai is a high-context language, where the speakers&#8217; position and environment matter in communicating.<\/em><\/figcaption><\/figure>\n\n\n\n<div style=\"height:40px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Complex script forms<\/strong><\/h2>\n\n\n\n<div style=\"height:36px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>In addition to Thai corpus data, Thai optical character recognition (OCR) text datasets are important for functional AI systems.<\/p>\n\n\n\n<p>The Thai script comprises 44 consonants, 21 vowel symbols that can form 32 phonemes, and four tone marks. The intricate rules for combining consonants, vowels, and tone marks require sophisticated handling by AI systems.<\/p>\n\n\n\n<p>Incorporating Thai text image data helps in developing models capable of understanding and generating text with correct orthographic conventions. For instance, OCR applications must accurately interpret the difference between \u0e44\u0e21\u0e49 (m\u00e1i) and \u0e44\u0e21\u0e48 (m\u00e2i), as in the earlier example. 
This matters because OCR must accurately interpret both handwritten and printed text.<\/p>\n\n\n\n<div style=\"height:40px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Wrapping up\u2026<\/strong><\/h2>\n\n\n\n<div style=\"height:36px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Despite the rise of AI usage and application, there is still an imbalance in the amount of data available in each language. While high-resource languages like English are well served by advanced AI applications, speakers of lower-resource languages often miss out on the same benefits.<\/p>\n\n\n\n<p>This is part of the reason why constructing large datasets for low-resource languages, such as most Asian languages, is now a government-initiated project in many countries. Flitto is taking part in this year\u2019s \u201cKorean-Foreign Language Parallel Corpora Project\u201d led by the National Institute of Korean Language (NIKL) of South Korea, in tandem with academic institutions. This year, Flitto <a href=\"https:\/\/datalab.flitto.com\/en\/company\/blog\/flitto-selected-to-participate-in-the-2024-korean-foreign-language-parallel-corpora-project\/\">aims to build a total of <strong>1,260 million words of data<\/strong><\/a> in nine languages, including Thai.<\/p>\n\n\n\n<p>International language model developers also see Thai speakers as important potential users. For instance, Upstage, the AI company behind the state-of-the-art Solar LLM, aims to expand its language support for Thai users through its recent <a href=\"https:\/\/datalab.flitto.com\/en\/company\/blog\/flitto-and-upstage-partnering-up-for-powerful-multilingual-llms\/\">strategic partnership with Flitto<\/a> for AI data solutions.<\/p>\n\n\n\n<p>Incorporating the right corpus data into AI systems is not just a technical necessity but a cultural imperative. 
It ensures that AI technologies are inclusive, respectful, and effective across linguistic boundaries. The richness of Thai linguistic data will play a crucial role in shaping diverse AI systems that are not only accurate but also sophisticated and human-like.<\/p>\n\n\n\n<p>Gain access to a sample Thai corpus dataset by toggling the Thai language option <strong><em><a href=\"https:\/\/datalab.flitto.com\/en\/browse-sample\">here<\/a><\/em><\/strong>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>As multilingual AI systems are on the rise, Thai corpus data is also gaining traction. These corpus data, including monolingual, parallel, spoken, and specialized, are paramount in developing advanced and helpful AI systems. Particularly, the Thai language is rich with uniqueness that makes it harder for AIs to learn. Each challenge is rooted in the [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":558,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_uf_show_specific_survey":0,"_uf_disable_surveys":false,"footnotes":""},"categories":[8],"tags":[9,29,26,14],"class_list":["post-543","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-analysis","tag-artificial-intelligence","tag-data-centric-ai","tag-language-models","tag-nlp"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.6 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Thai Corpus for AI Systems: Challenges and Opportunities - Flitto DataLab<\/title>\n<meta name=\"description\" content=\"Let&#039;s look at the specific characteristics present in the Thai language that AI corpus datasets must consider.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, 
max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/datalab.flitto.com\/en\/company\/blog\/thai-corpus-for-ai-systems-challenges-and-opportunities\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Thai Corpus for AI Systems: Challenges and Opportunities - Flitto DataLab\" \/>\n<meta property=\"og:description\" content=\"Let&#039;s look at the specific characteristics present in the Thai language that AI corpus datasets must consider.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/datalab.flitto.com\/en\/company\/blog\/thai-corpus-for-ai-systems-challenges-and-opportunities\/\" \/>\n<meta property=\"og:site_name\" content=\"Flitto DataLab\" \/>\n<meta property=\"article:published_time\" content=\"2024-05-14T09:21:46+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-05-14T09:21:47+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/thai-corpus.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1080\" \/>\n\t<meta property=\"og:image:height\" content=\"800\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Flitto DataLab\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Flitto DataLab\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/thai-corpus-for-ai-systems-challenges-and-opportunities\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/thai-corpus-for-ai-systems-challenges-and-opportunities\\\/\"},\"author\":{\"name\":\"Flitto DataLab\",\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/#\\\/schema\\\/person\\\/1ca8babde9794ccfb0d3183ab0cf7837\"},\"headline\":\"Thai Corpus for AI Systems: Challenges and Opportunities\",\"datePublished\":\"2024-05-14T09:21:46+00:00\",\"dateModified\":\"2024-05-14T09:21:47+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/thai-corpus-for-ai-systems-challenges-and-opportunities\\\/\"},\"wordCount\":982,\"publisher\":{\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/thai-corpus-for-ai-systems-challenges-and-opportunities\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/wp-content\\\/uploads\\\/thai-corpus.png\",\"keywords\":[\"Artificial Intelligence\",\"Data-centric AI\",\"Language Models\",\"Natural Language Processing\"],\"articleSection\":[\"Analysis\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/thai-corpus-for-ai-systems-challenges-and-opportunities\\\/\",\"url\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/thai-corpus-for-ai-systems-challenges-and-opportunities\\\/\",\"name\":\"Thai Corpus for AI Systems: Challenges and Opportunities - Flitto 
DataLab\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/thai-corpus-for-ai-systems-challenges-and-opportunities\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/thai-corpus-for-ai-systems-challenges-and-opportunities\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/wp-content\\\/uploads\\\/thai-corpus.png\",\"datePublished\":\"2024-05-14T09:21:46+00:00\",\"dateModified\":\"2024-05-14T09:21:47+00:00\",\"description\":\"Let's look at the specific characteristics present in the Thai language that AI corpus datasets must consider.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/thai-corpus-for-ai-systems-challenges-and-opportunities\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/thai-corpus-for-ai-systems-challenges-and-opportunities\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/thai-corpus-for-ai-systems-challenges-and-opportunities\\\/#primaryimage\",\"url\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/wp-content\\\/uploads\\\/thai-corpus.png\",\"contentUrl\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/wp-content\\\/uploads\\\/thai-corpus.png\",\"width\":1080,\"height\":800},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/thai-corpus-for-ai-systems-challenges-and-opportunities\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\
"Thai Corpus for AI Systems: Challenges and Opportunities\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/#website\",\"url\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/\",\"name\":\"Flitto DataLab\",\"description\":\"Latest AI and Data Insights\",\"publisher\":{\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/#organization\",\"name\":\"Flitto DataLab\",\"url\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/wp-content\\\/uploads\\\/2023\\\/07\\\/datalab.svg\",\"contentUrl\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/wp-content\\\/uploads\\\/2023\\\/07\\\/datalab.svg\",\"width\":1,\"height\":1,\"caption\":\"Flitto DataLab\"},\"image\":{\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/www.linkedin.com\\\/showcase\\\/flitto-datalab\\\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/#\\\/schema\\\/person\\\/1ca8babde9794ccfb0d3183ab0cf7837\",\"name\":\"Flitto DataLab\",\"url\":\"https:\\\/\\\/datalab.flitto.com\\\/en\\\/company\\\/blog\\\/author\\\/flitto-datalab-editor\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Thai Corpus for AI Systems: Challenges and Opportunities - Flitto DataLab","description":"Let's look at the specific characteristics present in the Thai language that AI corpus datasets must consider.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/datalab.flitto.com\/en\/company\/blog\/thai-corpus-for-ai-systems-challenges-and-opportunities\/","og_locale":"en_US","og_type":"article","og_title":"Thai Corpus for AI Systems: Challenges and Opportunities - Flitto DataLab","og_description":"Let's look at the specific characteristics present in the Thai language that AI corpus datasets must consider.","og_url":"https:\/\/datalab.flitto.com\/en\/company\/blog\/thai-corpus-for-ai-systems-challenges-and-opportunities\/","og_site_name":"Flitto DataLab","article_published_time":"2024-05-14T09:21:46+00:00","article_modified_time":"2024-05-14T09:21:47+00:00","og_image":[{"width":1080,"height":800,"url":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/thai-corpus.png","type":"image\/png"}],"author":"Flitto DataLab","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Flitto DataLab","Est. 
reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/thai-corpus-for-ai-systems-challenges-and-opportunities\/#article","isPartOf":{"@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/thai-corpus-for-ai-systems-challenges-and-opportunities\/"},"author":{"name":"Flitto DataLab","@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/#\/schema\/person\/1ca8babde9794ccfb0d3183ab0cf7837"},"headline":"Thai Corpus for AI Systems: Challenges and Opportunities","datePublished":"2024-05-14T09:21:46+00:00","dateModified":"2024-05-14T09:21:47+00:00","mainEntityOfPage":{"@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/thai-corpus-for-ai-systems-challenges-and-opportunities\/"},"wordCount":982,"publisher":{"@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/#organization"},"image":{"@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/thai-corpus-for-ai-systems-challenges-and-opportunities\/#primaryimage"},"thumbnailUrl":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/thai-corpus.png","keywords":["Artificial Intelligence","Data-centric AI","Language Models","Natural Language Processing"],"articleSection":["Analysis"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/thai-corpus-for-ai-systems-challenges-and-opportunities\/","url":"https:\/\/datalab.flitto.com\/en\/company\/blog\/thai-corpus-for-ai-systems-challenges-and-opportunities\/","name":"Thai Corpus for AI Systems: Challenges and Opportunities - Flitto 
DataLab","isPartOf":{"@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/thai-corpus-for-ai-systems-challenges-and-opportunities\/#primaryimage"},"image":{"@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/thai-corpus-for-ai-systems-challenges-and-opportunities\/#primaryimage"},"thumbnailUrl":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/thai-corpus.png","datePublished":"2024-05-14T09:21:46+00:00","dateModified":"2024-05-14T09:21:47+00:00","description":"Let's look at the specific characteristics present in the Thai language that AI corpus datasets must consider.","breadcrumb":{"@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/thai-corpus-for-ai-systems-challenges-and-opportunities\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/datalab.flitto.com\/en\/company\/blog\/thai-corpus-for-ai-systems-challenges-and-opportunities\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/thai-corpus-for-ai-systems-challenges-and-opportunities\/#primaryimage","url":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/thai-corpus.png","contentUrl":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/thai-corpus.png","width":1080,"height":800},{"@type":"BreadcrumbList","@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/thai-corpus-for-ai-systems-challenges-and-opportunities\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/datalab.flitto.com\/en\/company\/blog\/"},{"@type":"ListItem","position":2,"name":"Thai Corpus for AI Systems: Challenges and Opportunities"}]},{"@type":"WebSite","@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/#website","url":"https:\/\/datalab.flitto.com\/en\/company\/blog\/","name":"Flitto DataLab","description":"Latest AI and Data 
Insights","publisher":{"@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/datalab.flitto.com\/en\/company\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/#organization","name":"Flitto DataLab","url":"https:\/\/datalab.flitto.com\/en\/company\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/2023\/07\/datalab.svg","contentUrl":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-content\/uploads\/2023\/07\/datalab.svg","width":1,"height":1,"caption":"Flitto DataLab"},"image":{"@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.linkedin.com\/showcase\/flitto-datalab\/"]},{"@type":"Person","@id":"https:\/\/datalab.flitto.com\/en\/company\/blog\/#\/schema\/person\/1ca8babde9794ccfb0d3183ab0cf7837","name":"Flitto 
DataLab","url":"https:\/\/datalab.flitto.com\/en\/company\/blog\/author\/flitto-datalab-editor\/"}]}},"_links":{"self":[{"href":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-json\/wp\/v2\/posts\/543","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-json\/wp\/v2\/comments?post=543"}],"version-history":[{"count":6,"href":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-json\/wp\/v2\/posts\/543\/revisions"}],"predecessor-version":[{"id":562,"href":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-json\/wp\/v2\/posts\/543\/revisions\/562"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-json\/wp\/v2\/media\/558"}],"wp:attachment":[{"href":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-json\/wp\/v2\/media?parent=543"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-json\/wp\/v2\/categories?post=543"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/datalab.flitto.com\/en\/company\/blog\/wp-json\/wp\/v2\/tags?post=543"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}