TEXT DIGITIZATION AND THE ROLE OF THE WORD2VEC METHOD IN MACHINE LEARNING
Keywords:
text processing, Word2Vec, word embedding, tokenization, training data, one-hot encoding model, text digitization, machine learning

Abstract
Natural language processing (NLP) is a branch of linguistics, computer science, and artificial intelligence concerned with the interaction between computers and human language. It deals primarily with the design and development of methods, algorithms, and information systems for processing and evaluating natural language. Today, NLP techniques can analyze large language corpora and millions of web pages within a second. Statistical and neural-network methods are widely used to solve NLP tasks, and many NLP applications built on deep neural networks perform effectively thanks to technological progress, growing computing power, and the availability of large language corpora. Most textual data is unstructured, scattered across the Internet and other sources; it becomes meaningful and useful only when it is properly collected, aggregated, formatted, and analyzed. Well-executed text analysis can benefit companies and organizations in many ways. Methods for analyzing unstructured text cover tasks such as text classification, sentiment analysis, named entity recognition (NER), and topic modeling, and these NLP tasks are applied in a variety of contexts. To perform them, speech and text must first be converted into numerical form so that a machine can understand and process human language. Likewise, when developing intelligent systems that interpret and understand natural language, unstructured textual data must be converted into numerical form so that it can be processed with artificial intelligence methods. Word embeddings are fixed-length vector representations of words that capture the general semantics and linguistic patterns of lexical units in natural language. NLP researchers have proposed various methods for obtaining such representations.
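The conversion of text into numerical form described above typically starts with tokenization followed by a one-hot encoding, where each word becomes a vector with a single 1 at its vocabulary index. The following is a minimal sketch of that step in Python with NumPy; the sample sentence and the whitespace tokenizer are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np

def tokenize(text):
    # Illustrative tokenizer: lowercase and split on whitespace.
    # A real pipeline would also handle punctuation and
    # Uzbek-specific orthography.
    return text.lower().split()

def one_hot(tokens):
    # Build a sorted vocabulary and map each token to a one-hot row.
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    vectors = np.zeros((len(tokens), len(vocab)))
    for row, tok in enumerate(tokens):
        vectors[row, index[tok]] = 1.0
    return vocab, vectors

tokens = tokenize("til matn til model")   # hypothetical sample text
vocab, vecs = one_hot(tokens)
# Each row of `vecs` has exactly one 1.0 at the word's vocabulary index.
```

One-hot vectors grow with the vocabulary and carry no semantic similarity between words, which is precisely the limitation that dense embeddings such as Word2vec address.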
In particular, Word2vec, developed in 2013 by researchers at Google, is a method for processing and digitizing text whose main purpose is to represent words as vectors. The Word2vec method encodes the semantics of the words in a text. This article examines the practical application of the Word2vec method, implemented with the NumPy package in Python, to digitizing the words of Uzbek-language texts.
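To make the idea concrete, the skip-gram variant of Word2vec can be sketched in pure NumPy: a center word's embedding is used to predict its context words, and both embedding matrices are updated by gradient descent on a cross-entropy loss. This is a minimal illustrative sketch, not the article's implementation; the toy corpus, embedding dimension, learning rate, and epoch count are all assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Uzbek corpus (hypothetical, for illustration only).
corpus = "tabiiy tilni qayta ishlash tabiiy til modeli".split()
vocab = sorted(set(corpus))
w2i = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8          # vocabulary size, embedding dimension

# Input (center) and output (context) embedding matrices.
W_in = rng.normal(0, 0.1, (V, D))
W_out = rng.normal(0, 0.1, (V, D))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# (center, context) index pairs within a symmetric window of 1.
pairs = [(w2i[corpus[i]], w2i[corpus[j]])
         for i in range(len(corpus))
         for j in (i - 1, i + 1)
         if 0 <= j < len(corpus)]

lr = 0.05
for epoch in range(200):
    for c, o in pairs:
        h = W_in[c]                    # hidden layer = center embedding
        p = softmax(W_out @ h)         # predicted context distribution
        err = p.copy()
        err[o] -= 1.0                  # gradient of cross-entropy loss
        grad_in = W_out.T @ err        # gradient w.r.t. center embedding
        W_out -= lr * np.outer(err, h)
        W_in[c] -= lr * grad_in

vec = W_in[w2i["tabiiy"]]              # learned dense vector for a word
```

After training, each row of `W_in` is a dense, fixed-length vector in which words appearing in similar contexts end up close together, which is the property that makes Word2vec embeddings useful for downstream NLP tasks.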
Copyright (c) 2023 Botir Elov, Alayev Ruhillo, Narzullo Alayev
This work is licensed under a Creative Commons Attribution 4.0 International License.