As multilingual AI systems become more common, Thai corpus data is gaining traction as well. This data, whether monolingual, parallel, spoken, or specialized, is essential for developing advanced and helpful AI systems.

Thai in particular has a number of distinctive features that make it harder for AI systems to learn, and each challenge is rooted in the language’s linguistic characteristics.

In this article, we look at the linguistic attributes that set Thai apart from other languages and that must be considered in AI corpus construction.

Thai as a language

As of 2020, over 69 million people speak and read Thai. Also called Central Thai, it is the most widely spoken of the 60 languages used in Thailand. While each of these languages is technically distinct, most share many characteristics. For instance, native speakers can often understand one another across these languages, much as they would across dialects.

Thai also has its own writing system, comprising consonants, vowels, and tone marks. These components, particularly the vowels, can form complex combinations, with strict rules governing which combinations are possible. These intricacies, together with the relative scarcity of data in the language, make it highly challenging for AI systems to fully learn Thai.

Let us now look at the specific characteristics present in the language that Thai corpus datasets must consider.

Tonality in spoken Thai

Thai’s tonal nature poses a significant challenge for speech recognition systems.

The Thai language has five distinct tones that change word meanings. Accurate tone recognition is therefore critical for natural language processing (NLP) and speech recognition technologies.

For example, the word “mai” can mean “wood” (ไม้, mái), “no” (ไม่, mâi), “new” (ใหม่, mài), “silk” (ไหม, mǎi), or be a question particle depending on the tone used.
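To make the ambiguity concrete, here is a minimal Python sketch of a lookup keyed on the toneless romanization. The entries come from the example above (the question particle is left out for brevity). The `MAI_FORMS` table and the `candidates` helper are hypothetical names used for illustration, not part of any Thai NLP library.

```python
from typing import List, Optional

# Illustrative table built from the "mai" example above.
# Each entry: (Thai spelling, toned romanization, tone name, English gloss).
MAI_FORMS = [
    ("ไม้", "mái", "high", "wood"),
    ("ไม่", "mâi", "falling", "no"),
    ("ใหม่", "mài", "low", "new"),
    ("ไหม", "mǎi", "rising", "silk"),
]


def candidates(toneless: str, tone: Optional[str] = None) -> List[str]:
    """Return the glosses compatible with a toneless romanization.

    Without a tone every entry stays a candidate; with a tone the
    list narrows to the entries carrying that tone.
    """
    if toneless != "mai":  # this sketch only covers one syllable
        return []
    return [gloss for _, _, t, gloss in MAI_FORMS if tone is None or t == tone]


print(candidates("mai"))             # ['wood', 'no', 'new', 'silk']
print(candidates("mai", "falling"))  # ['no']
```

A recognizer that drops the tone is left with the first result; one that keeps it gets the second.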

Thai speech corpus data can provide a wealth of examples for training models to discern these subtle tonal variations. This is indispensable for achieving high accuracy in voice assistants, transcription services, and real-time translation tools.

Context-reliant grammatical structure

Thai’s grammatical structure relies heavily on word order and auxiliary words rather than inflections. This presents a distinct challenge for AI systems learning the language.

For instance, the sentence “ฉันไปโรงเรียน” (chǎn bpai rong rian, lit. I go school) means “I go to school,” with the meaning derived from word order rather than verb conjugation.
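As a rough illustration of meaning carried by word order and auxiliary words, the sketch below glosses the sentence above token by token, then adds the future marker จะ ("will"), which is not part of the original example. The tokens are segmented by hand, and the `GLOSSES` table and `gloss` function are hypothetical.

```python
from typing import List

# Hand-segmented tokens; the glossary is a toy assumption for this sketch.
GLOSSES = {
    "ฉัน": "I",
    "ไป": "go",
    "โรงเรียน": "school",
    "จะ": "will",  # auxiliary word marking the future (added for illustration)
}


def gloss(tokens: List[str]) -> str:
    """Gloss tokens in order; the order itself carries the grammatical roles."""
    return " ".join(GLOSSES[token] for token in tokens)


present = ["ฉัน", "ไป", "โรงเรียน"]
future = ["ฉัน", "จะ", "ไป", "โรงเรียน"]

print(gloss(present))  # I go school
print(gloss(future))   # I will go school

# The verb token is the same string in both sentences: nothing is conjugated.
assert present[1] == future[2]
```

The tense shift comes entirely from the added auxiliary word, while the verb ไป keeps the same form.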

Trained on Thai corpus data containing examples like this, AI models can learn the subtleties of Thai syntax and semantics. This knowledge is vital for machine translation systems, which must preserve the meaning and context of sentences without the help of inflectional cues, and for conversational AI, which must understand and generate contextually appropriate responses.

Classifiers and contextual nuances

Classifiers are an integral part of Thai, used to count and categorize nouns.

For example, “one person” is “หนึ่งคน” (nùeng khon), where “khon” is the classifier for people, and “one animal” is “หนึ่งตัว” (nùeng dtua), where “dtua” is the classifier for animals.
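The pattern in these two examples can be sketched as a small lookup. This is a toy model under stated assumptions: the `CLASSIFIERS` table holds only the two classifiers mentioned above, and `one_of` is a hypothetical helper, not an API from any library.

```python
# Toy classifier table taken from the two examples above.
CLASSIFIERS = {
    "person": "คน",   # khon
    "animal": "ตัว",  # dtua
}

ONE = "หนึ่ง"  # nùeng, "one"


def one_of(category: str) -> str:
    """Build the "one + classifier" phrase for a noun category."""
    if category not in CLASSIFIERS:
        raise KeyError(f"no classifier known for category: {category!r}")
    return ONE + CLASSIFIERS[category]


print(one_of("person"))
print(one_of("animal"))
```

A real system would need a far larger table and would have to choose among classifiers by context, which is where corpus data comes in.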

Thai corpus data helps AI systems grasp the nuances of classifier usage, improving the accuracy of language generation and comprehension tasks. Correct classifier choice also matters for developing AI that interacts respectfully and appropriately with users across various social contexts.

Pronouns and politeness particles

Thai pronouns and politeness particles vary widely based on gender, age, and social status. For example, a male speaker might use “ผม” (phǒm) for “I,” while a female speaker might use “ฉัน” (chǎn) or “ดิฉัน” (dì-chǎn). Politeness particles like “ครับ” (khráp) for males and “ค่ะ” (khâ) for females are added to the end of sentences to convey respect.

AI systems must be adept at recognizing and using these correctly to ensure respectful and contextually appropriate communication.
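A minimal sketch of the particle rule described above follows. It is deliberately simplified: it keys only on the speaker's gender and ignores age, social status, and formality. `SPEAKER_FORMS` and `make_polite` are hypothetical names, and the phrase ขอบคุณ ("thank you") is an example chosen here, not taken from the article.

```python
# Simplified mapping from the examples above; real usage also depends on
# age, social status, and formality, which this sketch ignores.
SPEAKER_FORMS = {
    "male": {"pronoun": "ผม", "particle": "ครับ"},
    "female": {"pronoun": "ฉัน", "particle": "ค่ะ"},
}


def make_polite(sentence: str, speaker: str) -> str:
    """Append the speaker's politeness particle unless it is already there."""
    particle = SPEAKER_FORMS[speaker]["particle"]
    if sentence.endswith(particle):
        return sentence
    return sentence + particle


THANKS = "ขอบคุณ"  # "thank you"; an example phrase chosen for this sketch
print(make_polite(THANKS, "male"))
print(make_polite(THANKS, "female"))
```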

Thai is a high-context language, in which the speakers’ relative position and setting shape how things are communicated.

Complex script forms

In addition to Thai corpus data, Thai optical character recognition (OCR) text datasets are important for functional AI systems.

The Thai script comprises 44 consonants, 21 vowel symbols that can form 32 phonemes, and four tone marks. The intricate rules for combining consonants, vowels, and tone marks require sophisticated handling by AI systems.
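To show how these components appear in encoded text, the sketch below labels each code point of a Thai word using ranges from the Unicode Thai block. The grouping is a simplification made for this example: `VOWEL_SIGNS` covers only a subset of the basic vowel signs and does not model combined vowel forms, and `classify` is a hypothetical helper.

```python
from typing import List

# U+0E01..U+0E2E holds 46 letters; RU (U+0E24) and LU (U+0E26) are excluded
# here because they are not counted among the 44 consonants.
CONSONANTS = {chr(cp) for cp in range(0x0E01, 0x0E2F)} - {"\u0e24", "\u0e26"}

# The four tone marks: mai ek, mai tho, mai tri, mai chattawa.
TONE_MARKS = {chr(cp) for cp in range(0x0E48, 0x0E4C)}

# A subset of basic vowel signs (U+0E30..U+0E39 and U+0E40..U+0E44).
VOWEL_SIGNS = {chr(cp) for cp in range(0x0E30, 0x0E3A)} | {
    chr(cp) for cp in range(0x0E40, 0x0E45)
}


def classify(text: str) -> List[str]:
    """Label each code point as consonant, vowel, tone, or other."""
    labels = []
    for ch in text:
        if ch in CONSONANTS:
            labels.append("consonant")
        elif ch in VOWEL_SIGNS:
            labels.append("vowel")
        elif ch in TONE_MARKS:
            labels.append("tone")
        else:
            labels.append("other")
    return labels


print(len(CONSONANTS), len(TONE_MARKS))  # 44 4
print(classify("\u0e44\u0e21\u0e49"))    # ไม้ -> ['vowel', 'consonant', 'tone']
```

Note that the vowel ไ is stored before the consonant it is pronounced after, which is one reason code-point order alone does not give pronunciation order.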

Incorporating Thai text image data helps in developing models capable of understanding and generating text with correct orthographic conventions. For instance, OCR applications must accurately distinguish ไม้ (mái) from ไม่ (mâi), as in the earlier example, and must do so for both handwritten and printed text.
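One way to quantify how small this difference is in encoded text is a character-level edit distance, sketched below. Using character error rate as the OCR metric is an assumption made for this example, not something the article specifies, and both function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed over Unicode code points."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


WOOD = "\u0e44\u0e21\u0e49"  # ไม้
NO = "\u0e44\u0e21\u0e48"    # ไม่

print(edit_distance(WOOD, NO))    # 1
print(char_error_rate(WOOD, NO))
```

The two words differ by a single substituted code point, the tone mark, yet that one substitution changes the meaning from "wood" to "no".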

Wrapping up…

Despite the rise in AI usage and applications, the amount of data available still differs greatly from language to language. While high-resource languages like English benefit from advanced AI applications, speakers of lower-resource languages do not yet get to enjoy the same benefits.

This is one of the reasons why constructing large datasets for low-resource languages, such as most Asian languages, is now a government-initiated project in many countries. Flitto is part of this year’s “Korean-Foreign Language Parallel Corpora Project” led by the National Institute of Korean Language (NIKL) of South Korea, in tandem with academic institutions. Through the project, Flitto aims to build a total of 1,260 million words of data in nine languages this year, including Thai.

International language model developers are also keeping an eye on Thai speakers as important potential users. For instance, Upstage, the AI company behind the state-of-the-art Solar LLM, aims to expand its language support for Thai users through its recent strategic partnership with Flitto for AI data solutions.

Incorporating the right corpus data into AI systems is not just a technical necessity but a cultural imperative. It ensures that AI technologies are inclusive, respectful, and effective across linguistic boundaries. The richness of Thai linguistic data will play a crucial role in shaping diverse AI systems that are not only accurate but also sophisticated and human-like.

Gain access to a sample Thai corpus dataset by toggling the Thai language option here.

Flitto DataLab

© 2023 Flitto Inc. All rights reserved.