PROCESSING TEXT DATA WITH THE WORD2VEC, GLOVE, AND FASTTEXT DEEP LEARNING METHODS
Keywords:
Text data processing, Word2Vec, GloVe, FastText, deep learning methods, numeric vectors, word similarity
Abstract
Computers do not understand words the way humans do; they work with numbers. To bridge this gap, we need a representation that preserves the semantic relationships between words, and numeric representations are the best way to capture not only the semantics of the words in a document but also their context. To help a computer understand words and their meanings, we use a technique called embeddings. Word embedding is a subfield of natural language processing in which a word's surrounding context is used to map it to a numeric vector. These embeddings represent words as mathematical vectors, and when they are constructed correctly and accurately, words with similar meanings receive similar numeric values. This allows computers to grasp the relationships and similarities between different words from their numeric representations alone. Today, deep learning (DL) methods within machine learning (ML), such as Word2Vec, GloVe, and FastText, are available for learning word embeddings. The choice of word embedding and deep learning model is crucial for achieving good results on an NLP task. These ML methods are now widely used in natural language text analysis, text classification, sentiment analysis, named entity recognition (NER), topic modeling, and other NLP tasks. This article presents methods for processing texts from the Uzbek language corpus with these techniques, describes their architectures, and provides Python implementations of the word embedding and deep learning models. It also gives an overview of recent research trends in NLP and a detailed account of how to use these models to achieve good results on text analysis tasks.
Based on a comparative analysis of the different methods, the article provides the information needed to choose a word embedding and deep learning approach for text analysis tasks. It can serve as a quick reference for studying the fundamentals, advantages, and challenges of the various approaches to representing words numerically and of deep learning models. The methods presented here can be applied to the analysis of Uzbek texts and will serve as a useful tool for future research in NLP.
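The core idea described above — that words with similar meanings receive similar numeric vectors — can be illustrated with a minimal sketch. The vectors and Uzbek words below are purely hypothetical hand-picked values for demonstration; a real Word2Vec, GloVe, or FastText model learns vectors of 100–300 dimensions from a large corpus, but the similarity computation (cosine similarity) is the same.

```python
import math

# Toy 4-dimensional "embeddings" -- hypothetical values for illustration only.
# Real models learn these vectors from corpus statistics.
embeddings = {
    "kitob":  [0.9, 0.1, 0.3, 0.0],   # "book"
    "daftar": [0.8, 0.2, 0.4, 0.1],   # "notebook" -- semantically close to "kitob"
    "osmon":  [0.0, 0.9, 0.1, 0.8],   # "sky" -- semantically distant
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; closer to 1.0 = more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Related words end up with a higher similarity than unrelated ones.
print(f"kitob ~ daftar: {cosine_similarity(embeddings['kitob'], embeddings['daftar']):.3f}")
print(f"kitob ~ osmon:  {cosine_similarity(embeddings['kitob'], embeddings['osmon']):.3f}")
```

In practice, a library such as Gensim would be used to train the embedding model on the Uzbek corpus; this sketch only shows the geometric intuition the article relies on.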
License
Copyright (c) 2024 Botir Elov
This work is licensed under a Creative Commons Attribution 4.0 International License.