METHODS FOR PROCESSING UZBEK LANGUAGE CORPUS TEXTS
Keywords:
Uzbek language corpus, text processing, Word2Vec, CBOW, Skip-Gram, GloVe, ELMo, BERT
Abstract
Computers are designed to process data in digital or numeric form. However, data is not always numeric. How should data be processed when it takes the form of characters, words, and text? How can computers be taught to process our natural language? How do Alexa, Google Home, and many other "smart" assistants understand and respond to our speech today? This article presents methods for processing Uzbek language corpus texts using text representation techniques from natural language processing, a field of artificial intelligence: Bag-of-Words (BOW), CountVectorizer, TF-IDF, co-occurrence matrix, Word2Vec, CBOW, Skip-Gram, GloVe, ELMo, and BERT.
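As a rough illustration of the two simplest techniques named in the abstract, the sketch below builds Bag-of-Words count vectors and TF-IDF weights in plain Python. The three Uzbek sentences, the whitespace tokenizer, and the unsmoothed formula idf = log(N / df) are assumptions made for this example and are not taken from the article; library implementations often use smoothed variants, and real Uzbek text would normally need stemming or morphological analysis first.

```python
import math
from collections import Counter

# Toy corpus: illustrative sentences, NOT from the article's corpus.
docs = [
    "men kitob o'qidim",
    "men maktabga bordim",
    "u kitob sotib oldi",
]

def tokenize(text):
    # Naive whitespace tokenizer (assumption); no stemming is applied.
    return text.lower().split()

tokenized = [tokenize(d) for d in docs]

# Vocabulary: every distinct token in the corpus, in sorted order.
vocab = sorted({tok for doc in tokenized for tok in doc})

def bow_vector(tokens):
    # Bag-of-Words: raw count of each vocabulary term in one document.
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

def idf(term):
    # Unsmoothed inverse document frequency: log(N / df).
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(len(tokenized) / df)

def tfidf_vector(tokens):
    # Term frequency (count / document length) times IDF.
    counts = Counter(tokens)
    return [(counts[term] / len(tokens)) * idf(term) for term in vocab]

bow = [bow_vector(doc) for doc in tokenized]
tfidf = [tfidf_vector(doc) for doc in tokenized]

for term, count, weight in zip(vocab, bow[0], tfidf[0]):
    print(f"{term:10s} count={count} tfidf={weight:.3f}")
```

In the first sentence, "o'qidim" occurs in only one document and therefore receives a higher TF-IDF weight than "men", which occurs in two; this is the down-weighting of common words that TF-IDF adds over raw counts.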
Copyright (c) 2023 Botir Elov, Shahlo Hamroyeva, Ruhillo Alayev, Zilola Xusainova, Umidjon Yodgorov
This work is licensed under a Creative Commons Attribution 4.0 International License.