TEXT MINING AND SENTIMENT ANALYSIS FOR UZBEK: EVALUATING SVM AND NAIVE BAYES FOR UNDER-RESOURCED LANGUAGE PROCESSING

Authors

  • Muxamediyeva Dildora Kabulovna, Tashkent University of Information Technologies named after Muhammad al-Khwarizmi
  • Otaxonova Baxrixon Ibragimovna, Tashkent University of Information Technologies named after Muhammad al-Khwarizmi
  • Raxmonova Munisaxon Rashodovna, Tashkent University of Information Technologies named after Muhammad al-Khwarizmi
  • Mirzayeva Nilufar Sirojidovna, Tashkent University of Information Technologies named after Muhammad al-Khwarizmi

Keywords:

Uzbek text mining, Support Vector Machine, Naive Bayes, under-resourced languages, NLP, agglutinative morphology, script normalization, sentiment analysis

Abstract

This study explores the application of text mining techniques to classify and analyze Uzbek text, focusing on the performance of Support Vector Machine (SVM) and Naive Bayes algorithms. Uzbek, an under-resourced language with agglutinative morphology and dual-script usage (Cyrillic and Latin), poses several challenges for text mining. We collected a dataset from various Uzbek text sources, including news articles and social media posts, and applied customized preprocessing steps such as script normalization, tokenization, and stop word removal.
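As a concrete illustration of the script-normalization step, the sketch below maps Uzbek Cyrillic characters to their Latin counterparts. The mapping table is a simplified, illustrative subset, not the paper's actual normalization scheme; real Uzbek transliteration has context-dependent rules (e.g. word-initial е → ye), and the official orthography uses the modifier letter ʻ rather than an ASCII apostrophe:

```python
# Simplified Uzbek Cyrillic-to-Latin normalizer (illustrative subset only).
# ASCII apostrophes stand in for the ʻ of official Latin orthography.
CYR2LAT = {
    "ч": "ch", "ш": "sh", "ю": "yu", "я": "ya", "ё": "yo",
    "ғ": "g'", "қ": "q", "ҳ": "h", "ў": "o'",
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ж": "j", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l",
    "м": "m", "н": "n", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "у": "u", "ф": "f", "х": "x", "э": "e",
}

def normalize_script(text: str) -> str:
    """Map each Uzbek Cyrillic character to its Latin counterpart,
    leaving characters outside the table (already-Latin text) unchanged."""
    return "".join(CYR2LAT.get(ch, ch) for ch in text.lower())

print(normalize_script("китоб"))  # -> kitob
print(normalize_script("яхши"))   # -> yaxshi
```

Normalizing both scripts into one alphabet before tokenization keeps Cyrillic and Latin spellings of the same word from becoming separate features.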
The processed text was represented using Term Frequency-Inverse Document Frequency (TF-IDF) features with n-grams to capture contextual nuances. Both SVM and Naive Bayes classifiers were trained on the dataset and evaluated using accuracy, precision, recall, and F1-score metrics. SVM demonstrated higher accuracy and precision, making it well-suited for tasks requiring specificity, while Naive Bayes showed robustness in recall, effectively capturing diverse linguistic patterns.
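To make the classification setup concrete, here is a minimal, self-contained sketch of the Naive Bayes side of such a pipeline over word uni- and bigram counts. In practice the TF-IDF weighting and the SVM baseline would typically come from a library such as scikit-learn (`TfidfVectorizer` with `ngram_range=(1, 2)`, `LinearSVC`, `MultinomialNB`); the toy sentences and labels below are invented examples, not the paper's dataset:

```python
import math
from collections import Counter

def ngrams(text, n_max=2):
    """Word unigrams plus higher-order n-grams, approximating n-gram features."""
    toks = text.lower().split()
    feats = list(toks)
    for n in range(2, n_max + 1):
        feats += [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return feats

class ToyMultinomialNB:
    """Minimal multinomial Naive Bayes with Laplace (add-one) smoothing."""
    def fit(self, docs, labels):
        self.classes = set(labels)
        self.prior = {c: math.log(labels.count(c) / len(labels)) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for doc, y in zip(docs, labels):
            self.counts[y].update(ngrams(doc))
        self.vocab = set().union(*self.counts.values())
        self.total = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def predict(self, doc):
        V = len(self.vocab)
        def loglik(c):  # log P(c) + sum of log P(feature | c)
            s = self.prior[c]
            for f in ngrams(doc):
                s += math.log((self.counts[c][f] + 1) / (self.total[c] + V))
            return s
        return max(self.classes, key=loglik)

# Invented toy data ("yaxshi" = good, "yomon" = bad), not the study's corpus.
docs = ["bu kitob juda yaxshi", "ajoyib film", "bu film juda yomon", "yomon xizmat"]
labels = ["pos", "pos", "neg", "neg"]
clf = ToyMultinomialNB().fit(docs, labels)
print(clf.predict("juda yaxshi film"))  # -> pos
```

The bigram features let the model pick up short contextual cues (e.g. "juda yaxshi") that unigrams alone would miss, which is the motivation for using n-grams with TF-IDF in the study.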
Our findings indicate that both models, with tailored preprocessing, can effectively handle Uzbek’s morphological and syntactic features. However, each model has distinct strengths: SVM excels in handling high-dimensional, well-preprocessed data, while Naive Bayes is more resilient to informal and morphologically diverse language contexts. Future research directions include integrating ensemble models, exploring deep learning approaches, and expanding resources for Uzbek NLP to further improve model accuracy and adaptability. This work contributes to advancing natural language processing for under-resourced languages, providing insights into efficient text mining techniques for Uzbek.

References

Maalel, A., Belguith, L. H., & Ghazouani, W. (2018). "Sentiment Analysis for Arabic Dialects". IEEE International Conference on Big Data, 1–8. https://ieeexplore.ieee.org/document/8507579

Alaybeyoglu, A., & Toprak, S. (2019). "Sentiment Classification on Arabic Reviews Using Different Classifiers". AI and Deep Learning for Biomedical Applications, 126–139. https://link.springer.com/article/10.1007/s10462-018-9622-1

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". NAACL-HLT, 4171–4186. https://arxiv.org/abs/1810.04805

Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., & Stoyanov, V. (2020). "Unsupervised Cross-lingual Representation Learning at Scale". ACL, 8440–8451. https://arxiv.org/abs/1911.02116

Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). "Enriching Word Vectors with Subword Information". Transactions of the Association for Computational Linguistics, 5, 135–146. https://arxiv.org/abs/1607.04606

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). "Efficient Estimation of Word Representations in Vector Space". ICLR. https://arxiv.org/abs/1301.3781

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., & Stoyanov, V. (2019). "RoBERTa: A Robustly Optimized BERT Pretraining Approach". arXiv preprint. https://arxiv.org/abs/1907.11692

Artetxe, M., & Schwenk, H. (2019). "Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond". Transactions of the Association for Computational Linguistics, 7, 597–610. https://arxiv.org/abs/1812.10464

Sennrich, R., Haddow, B., & Birch, A. (2016). "Improving Neural Machine Translation Models with Monolingual Data". ACL, 86–96. https://arxiv.org/abs/1511.06709

Németh, L., Trón, V., & Halácsy, P. (2004). "Hunspell - An Open Source Spell Checker and Morphological Analyzer". LREC. https://www.lrec-conf.org/proceedings/lrec2004/pdf/552.pdf

Published

2024-12-23

How to Cite

Muxamediyeva, D., Otaxonova, B., Raxmonova, M., & Mirzayeva, N. (2024). TEXT MINING AND SENTIMENT ANALYSIS FOR UZBEK: EVALUATING SVM AND NAIVE BAYES FOR UNDER-RESOURCED LANGUAGE PROCESSING. DIGITAL TRANSFORMATION AND ARTIFICIAL INTELLIGENCE, 2(6), 46–54. Retrieved from https://dtai.tsue.uz/index.php/dtai/article/view/v2i610