DEVELOPING A UNIGRAM LANGUAGE MODEL FOR UZBEK TEXTS: PROBLEMS AND SOLUTIONS
Keywords:
language models, n-gram language model, unigram, model evaluation, Laplace smoothing, machine learning

Abstract
In natural language processing, the task of predicting which word comes next in a given context is called language modeling. This article first describes the unigram model, a language model based on individual words. It then discusses the difficulty of estimating probabilities for words the model was not trained on, so-called "unknown" words, and presents Laplace smoothing as a way to overcome this problem. It is then shown that this smoothing method can be viewed as model interpolation, that is, as a combination of several models. The effect of higher-order n-gram models, and of interpolating such n-gram models, is also demonstrated. As datasets, Alisher Navoiy's divan "Navodir un-nihoya" was chosen as the training data, while his divan "Badoyi ul-bidoya" and Pirimqul Qodirov's novel "Avlodlar dovoni" served as evaluation texts for the n-gram (n = 1) model. The article presents methods and algorithms for computing the average log-likelihood of a text as an evaluation metric and describes, step by step, how a unigram language model is applied in practice. The final results show that the unigram model assigns the evaluation texts average log probabilities of -11.98 and -14.86, respectively. The closeness of the average log-likelihoods of the training text and the first evaluation book is attributed to their being the first and second books of the same series. The unigram distribution of Pirimqul Qodirov's "Avlodlar dovoni", however, differs fundamentally from the training distribution, since the two books come from different periods, different genres, and different authors.
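The pipeline the abstract describes — train a unigram model with Laplace smoothing and score held-out text by its average log probability — can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function names, the add-one pseudocount, the base-2 logarithm, and treating all unknown words as a single extra vocabulary type are assumptions made here for concreteness.

```python
import math
from collections import Counter

def train_unigram(tokens, alpha=1.0):
    """Return a Laplace-smoothed unigram probability function P(w)."""
    counts = Counter(tokens)
    total = len(tokens)
    # Vocabulary size includes one extra pseudo-type for unknown words,
    # so every out-of-vocabulary word receives a small nonzero probability.
    vocab_size = len(counts) + 1

    def prob(word):
        # Laplace (add-alpha) smoothing:
        # P(w) = (count(w) + alpha) / (N + alpha * |V|)
        return (counts.get(word, 0) + alpha) / (total + alpha * vocab_size)

    return prob

def avg_log_prob(prob, tokens):
    """Average log2-probability of a token sequence under the model."""
    return sum(math.log2(prob(w)) for w in tokens) / len(tokens)

def interpolate(p_high, p_low, lam):
    """Linear interpolation of two models; Laplace smoothing itself can be
    read as interpolating the unsmoothed unigram with a uniform model."""
    return lambda w: lam * p_high(w) + (1.0 - lam) * p_low(w)

# Toy usage: train on one text, evaluate on another containing an unseen word.
train_tokens = "soz soz matn model matn soz".split()
eval_tokens = "matn model korpus".split()   # "korpus" never seen in training
p = train_unigram(train_tokens)
score = avg_log_prob(p, eval_tokens)        # negative; closer to 0 is a better fit
```

The average log probability computed this way is the per-token metric the abstract reports (-11.98 and -14.86 for the two evaluation texts): the closer it is to zero, the better the training distribution matches the evaluated text.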
Copyright (c) 2024 Botir Elov, Nizomaddin Xudayberganov, Mastura Primova
This work is licensed under a Creative Commons Attribution 4.0 International License.