METHODS FOR PROCESSING UZBEK LANGUAGE CORPUS TEXTS

Botir Elov; Shahlo Hamroyeva; Ruhillo Alayev; Zilola Xusainova; Umidjon Yodgorov

Authors

Botir Elov, Tashkent State University of Uzbek Language and Literature named after Alisher Navo'i, https://orcid.org/0000-0001-5032-6648
Shahlo Hamroyeva, Tashkent State University of Uzbek Language and Literature named after Alisher Navo'i, https://orcid.org/0000-0002-5429-4708
Ruhillo Alayev, Tashkent State University of Uzbek Language and Literature named after Alisher Navo'i, https://orcid.org/0000-0003-3757-7711
Zilola Xusainova, Tashkent State University of Uzbek Language and Literature named after Alisher Navo'i, https://orcid.org/0000-0003-4357-7515
Umidjon Yodgorov, Tashkent State University of Uzbek Language and Literature named after Alisher Navo'i, https://orcid.org/0000-0002-7666-8395

Keywords:

Uzbek language corpus, text processing, Word2Vec, CBOW, Skip-Gram, GloVe, ELMo, BERT

Abstract

Computers are designed to process digital, that is, numerical data. However, data does not always come in numerical form. How should data be processed when it takes the form of characters, words, and text? How can computers be taught to process our natural language? How do Alexa, Google Home, and many other "smart" assistants understand our speech and respond to it today? This article presents methods for processing Uzbek language corpus texts using text processing techniques from natural language processing, a field of artificial intelligence: Bag-of-Words (BOW), CountVectorizer, TF-IDF, co-occurrence matrix, Word2Vec, CBOW, Skip-Gram, GloVe, ELMo, and BERT.
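To make the first two techniques named above more concrete, the following is a minimal sketch of Bag-of-Words counting and TF-IDF weighting in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the two sample sentences, the whitespace tokenizer, and the function names `bag_of_words` and `tf_idf` are all hypothetical, and the weighting shown (length-normalized term frequency times ln(N/df)) is only one common TF-IDF variant.

```python
import math
from collections import Counter

# Hypothetical two-sentence Uzbek sample (an assumption, not the article's corpus).
DOCS = [
    "men kitob o'qidim",
    "men maktabga bordim",
]


def bag_of_words(docs):
    """Return a sorted vocabulary and a document-term count matrix."""
    # Naive tokenization: lowercase and split on whitespace (assumption).
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    counts = [Counter(toks) for toks in tokenized]
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


def tf_idf(matrix):
    """Weight raw counts by tf * ln(N / df), one common TF-IDF variant."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    weighted = []
    for row in matrix:
        total = sum(row) or 1  # avoid division by zero for an empty document
        weighted.append(
            [(row[j] / total) * math.log(n_docs / df[j]) for j in range(n_terms)]
        )
    return weighted


vocab, matrix = bag_of_words(DOCS)
weights = tf_idf(matrix)
for term, w0, w1 in zip(vocab, weights[0], weights[1]):
    print(f"{term:10s} {w0:.3f} {w1:.3f}")
```

In this sketch the word "men" occurs in both sentences, so its weight is zero in each, while words unique to one sentence receive a positive weight. For an agglutinative language such as Uzbek, a real pipeline would likely need stemming or morphological analysis before counting, which this example leaves out.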
