Automatic Identification of Close Languages - Case study: Malay and Indonesian


Bali Ranaivo-Malancon

Abstract

Identifying the language of an unknown text is not a new problem; what is new is the task of distinguishing between close languages. Malay and Indonesian, like many other language pairs, are very similar, which makes it genuinely difficult to search, retrieve, classify, and above all translate texts written in either of the two languages. We have built a language identifier that determines whether a text is written in Malay or Indonesian, and the same approach could be used in any similar situation. It uses the frequency and rank of character trigrams, lists of exclusive words, and the format of numbers. The trigrams are derived from the most frequent words in each language. The current program contains the following language models: Malay/Indonesian (661 trigrams), Dutch (826 trigrams), English (652 trigrams), French (579 trigrams), and German (482 trigrams). The trigrams of an unknown text are looked up in each language model. The language of the input text is the one with the highest ratios for “number of shared trigrams / total number of trigrams” and “number of winner trigrams / number of shared trigrams”. If the language found at the trigram search level is ‘Malay or Indonesian’, the text is then scanned for the format of its numbers and for some exclusive words.
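
The two-stage procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the author's program, and several details are assumptions made only for illustration: the tiny frequent-word lists, the "_" word padding, the exclusive-word sets, and the number patterns (dot as decimal mark for Malay, comma for Indonesian) do not come from the article. The abstract does not define a "winner trigram" or say how the two ratios are combined; here a winner is assumed to be a shared trigram that ranks strictly better in one model than in every other model containing it, and the winner ratio is used only to break ties on the shared ratio.

```python
import re
from collections import Counter

WORD = re.compile(r"[^\W\d_]+")  # runs of letters only

def char_trigrams(text):
    """Count character trigrams inside each word, padded with '_' (assumed padding)."""
    counts = Counter()
    for word in WORD.findall(text.lower()):
        padded = f"_{word}_"
        counts.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return counts

def build_model(frequent_words):
    """Map each trigram of the frequent words to its rank (0 = most frequent)."""
    counts = char_trigrams(" ".join(frequent_words))
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return {trigram: rank for rank, trigram in enumerate(ordered)}

# Tiny hypothetical word lists; the article's models hold hundreds of trigrams.
MODELS = {
    "Malay or Indonesian": build_model(
        ["yang", "dan", "di", "itu", "dengan", "untuk", "ini", "dari", "dalam", "tidak"]),
    "English": build_model(
        ["the", "of", "and", "to", "in", "is", "that", "for", "it", "with"]),
}

def score(text_trigrams, lang, models):
    """Return (shared / total, winners / shared) for one language model."""
    model = models[lang]
    shared = {t for t in text_trigrams if t in model}
    # Assumed definition: a winner ranks strictly better here than in any other model.
    winners = {t for t in shared
               if all(model[t] < other[t]
                      for name, other in models.items()
                      if name != lang and t in other)}
    shared_ratio = len(shared) / len(text_trigrams)
    winner_ratio = len(winners) / len(shared) if shared else 0.0
    return shared_ratio, winner_ratio

def trigram_stage(text, models=MODELS):
    """Pick the model with the best ratios (tuple comparison is an assumption)."""
    trigrams = set(char_trigrams(text))
    if not trigrams:
        return None
    return max(models, key=lambda lang: score(trigrams, lang, models))

# Illustrative exclusive words and number formats (assumed, not the article's lists).
MALAY_ONLY = {"kerana", "wang", "cuba"}
INDONESIAN_ONLY = {"karena", "uang", "coba"}
MALAY_NUMBER = re.compile(r"\b\d{1,3}(?:,\d{3})+(?:\.\d+)?\b|\b\d+\.\d{1,2}\b")
INDONESIAN_NUMBER = re.compile(r"\b\d{1,3}(?:\.\d{3})+(?:,\d+)?\b|\b\d+,\d{1,2}\b")

def malay_or_indonesian(text):
    """Second stage: count exclusive words and number formats for each language."""
    words = set(WORD.findall(text.lower()))
    malay = len(words & MALAY_ONLY) + len(MALAY_NUMBER.findall(text))
    indonesian = len(words & INDONESIAN_ONLY) + len(INDONESIAN_NUMBER.findall(text))
    if malay > indonesian:
        return "Malay"
    if indonesian > malay:
        return "Indonesian"
    return "Malay or Indonesian"  # no evidence either way

def identify(text):
    """Run the trigram stage, then refine only if it says 'Malay or Indonesian'."""
    language = trigram_stage(text)
    if language == "Malay or Indonesian":
        return malay_or_indonesian(text)
    return language
```

Calling identify on a short text runs the trigram lookup first and reaches the number-format and exclusive-word scan only when the first stage answers "Malay or Indonesian", mirroring the order given in the abstract. With realistic models the scoring rule would need tuning, since the abstract does not state how the two ratios are weighed against each other.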

Article Details

How to Cite
[1] B. Ranaivo-Malancon, “Automatic Identification of Close Languages - Case study: Malay and Indonesian”, ECTI-CIT Transactions, vol. 2, no. 2, pp. 126–134, Mar. 2016.
Section
Artificial Intelligence and Machine Learning (AI)