TEXT MINING AND SENTIMENT ANALYSIS FOR UZBEK: EVALUATING SVM AND NAIVE BAYES FOR UNDER-RESOURCED LANGUAGE PROCESSING
Keywords:
Uzbek text mining, Support Vector Machine, Naive Bayes, under-resourced languages, NLP, agglutinative morphology, script normalization, sentiment analysis
Abstract
This study explores the application of text mining techniques to classify and analyze Uzbek text, focusing on the performance of the Support Vector Machine (SVM) and Naive Bayes algorithms. Uzbek, an under-resourced language with agglutinative morphology and dual-script usage (Cyrillic and Latin), poses several challenges for text mining. We collected a dataset from various Uzbek text sources, including news articles and social media posts, and applied customized preprocessing steps such as script normalization, tokenization, and stop word removal.
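As a rough illustration of these preprocessing steps, the Python sketch below normalizes Cyrillic-script input to Latin script, tokenizes it, and removes stop words. The transliteration table and stop word list are small hand-made placeholders introduced here for illustration; they are not the resources used in the study.

import re

# Partial Uzbek Cyrillic-to-Latin mapping; some letters map to digraphs.
CYR2LAT = {
    "ё": "yo", "ю": "yu", "я": "ya", "ч": "ch", "ш": "sh", "ц": "ts",
    "ў": "o‘", "ғ": "g‘", "қ": "q", "ҳ": "h",
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ж": "j",
    "з": "z", "и": "i", "й": "y", "к": "k", "л": "l", "м": "m", "н": "n",
    "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u", "ф": "f",
    "х": "x", "э": "e", "ъ": "’", "ь": "",
}

# Placeholder stop word list; a practical list would be far larger.
STOP_WORDS = {"va", "bilan", "uchun", "ham", "bu", "u", "lekin"}

def normalize_script(text: str) -> str:
    """Lowercase and map Cyrillic characters to their Latin counterparts."""
    return "".join(CYR2LAT.get(ch, ch) for ch in text.lower())

def tokenize(text: str) -> list[str]:
    """Split on anything that is not a Latin letter or an Uzbek apostrophe."""
    return [t for t in re.split(r"[^a-z’‘']+", text) if t]

def preprocess(text: str) -> list[str]:
    """Script normalization, tokenization, and stop word removal in sequence."""
    return [t for t in tokenize(normalize_script(text)) if t not in STOP_WORDS]

# Example: a Cyrillic-script phrase is transliterated and filtered.
print(preprocess("Ўзбек тили ва унинг морфологияси"))
# -> ['o‘zbek', 'tili', 'uning', 'morfologiyasi']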
The processed text was represented using Term Frequency-Inverse Document Frequency (TF-IDF) features with n-grams to capture contextual nuances. Both SVM and Naive Bayes classifiers were trained on the dataset and evaluated using accuracy, precision, recall, and F1-score metrics. SVM demonstrated higher accuracy and precision, making it well-suited for tasks requiring specificity, while Naive Bayes showed robustness in recall, effectively capturing diverse linguistic patterns.
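To make the modeling and evaluation stage concrete, the following scikit-learn sketch trains both classifiers on TF-IDF features with word unigrams and bigrams and reports the metrics named above. The toy corpus, n-gram range, and default hyperparameters are assumptions for illustration, not the study's actual corpus or tuned settings.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Tiny placeholder corpus; the real dataset combined news and social media posts.
texts = [
    "bu film juda yaxshi",           # "this film is very good"
    "xizmat juda yomon edi",         # "the service was very bad"
    "ajoyib kitob tavsiya qilaman",  # "a wonderful book, I recommend it"
    "mahsulot sifatsiz chiqdi",      # "the product turned out low quality"
]
labels = ["pos", "neg", "pos", "neg"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=42, stratify=labels)

# One pipeline per model: TF-IDF with unigrams + bigrams, then the classifier.
for name, clf in [("SVM", LinearSVC()), ("Naive Bayes", MultinomialNB())]:
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test), zero_division=0))

In practice the evaluation would rely on a larger held-out set or cross-validation; the per-class precision, recall, and F1 printed by classification_report correspond to the metrics reported in the study.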
Our findings indicate that both models, with tailored preprocessing, can effectively handle Uzbek’s morphological and syntactic features. However, each model has distinct strengths: SVM excels on high-dimensional, well-preprocessed data, while Naive Bayes is more resilient to informal and morphologically diverse language. Future research directions include integrating ensemble models, exploring deep learning approaches, and expanding resources for Uzbek NLP to further improve model accuracy and adaptability. This work contributes to advancing natural language processing for under-resourced languages, providing insights into efficient text mining techniques for Uzbek.
License
Copyright (c) 2024 Muxamediyeva Dildora Kabulovna, Otaxonova Baxrixon Ibragimovna, Raxmonova Munisaxon Rashodovna, Mirzayeva Nilufar Sirojidovna
This work is licensed under a Creative Commons Attribution 4.0 International License.