STATISTICAL MODELS OF NATURAL LANGUAGE

Authors

  • Botir Elov, Alisher Navoi Tashkent State University of Uzbek Language and Literature
  • Ruhillo Alayev, National University of Uzbekistan named after Mirzo Ulugbek
  • Abdulla Abdullayev, Urgench Innovation University

Keywords:

Punctuation restoration, automatic speech recognition (ASR), spoken-language recognition, speech transcripts, UzbPunct dataset

Abstract

A statistical language model (SLM) is a modern tool used in natural language processing that aims to predict the probability of word sequences in a given language. Given a particular sequence of words in a sentence, an SLM predicts the word that follows it. To do so, it uses sequence probabilities estimated from how words occur in a corpus of natural-language data. By analyzing large volumes of text, the model can learn the patterns in which words are used in the language and, based on those patterns, predict the most probable next word. As the field of NLP continues to develop, statistical language models remain a fundamental tool for understanding and processing language. With SLMs, we can keep extending what is possible in natural language technology and build more innovative and powerful NLP applications. This article presents methods for developing an N-gram model, one of the statistical models of natural language, on the basis of an Uzbek-language corpus. It also covers the mathematical description and evaluation methods of N-gram models, as well as the problems of generalization, sensitivity, OOV (unknown words), and special contexts, together with ways of addressing them.
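
To make the idea of next-word prediction from corpus counts concrete, the following is a minimal sketch of a bigram (N = 2) model in Python. It is not the authors' implementation: the four-sentence Uzbek corpus is invented for illustration, the function names are hypothetical, and add-one (Laplace) smoothing with an `<unk>` token is assumed here as one simple way to handle unseen bigrams and OOV words. Perplexity is included as a commonly used evaluation measure for N-gram models.

```python
import math
from collections import Counter

UNK = "<unk>"  # placeholder token for out-of-vocabulary (OOV) words


def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences padded with <s>, </s>."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    vocab = set(unigrams) | {UNK}
    return unigrams, bigrams, vocab


def map_oov(tokens, vocab):
    """Replace every token that was not seen in training with UNK."""
    return [t if t in vocab else UNK for t in tokens]


def bigram_prob(prev, word, unigrams, bigrams, vocab):
    """Add-one smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))


def predict_next(prev, unigrams, bigrams, vocab):
    """Return the most probable word to follow `prev`."""
    candidates = sorted(vocab - {"<s>", UNK})
    return max(candidates,
               key=lambda w: bigram_prob(prev, w, unigrams, bigrams, vocab))


def perplexity(tokens, unigrams, bigrams, vocab):
    """Perplexity of one sentence under the smoothed bigram model."""
    padded = ["<s>"] + map_oov(tokens, vocab) + ["</s>"]
    pairs = list(zip(padded, padded[1:]))
    log_prob = sum(math.log(bigram_prob(p, w, unigrams, bigrams, vocab))
                   for p, w in pairs)
    return math.exp(-log_prob / len(pairs))


# Toy corpus (made up for this sketch, not taken from the article's data).
corpus = [
    ["men", "kitob", "o'qiyman"],
    ["men", "kitob", "yozaman"],
    ["sen", "kitob", "o'qiysan"],
    ["men", "maktabga", "boraman"],
]
unigrams, bigrams, vocab = train_bigram(corpus)

print(predict_next("men", unigrams, bigrams, vocab))
print(perplexity(["men", "kitob", "o'qiyman"], unigrams, bigrams, vocab))
print(perplexity(["men", "gazeta", "o'qiyman"], unigrams, bigrams, vocab))
```

The last two lines compare a sentence seen in training with one that contains an unseen word (`gazeta`), which is mapped to `<unk>` before scoring. A model built on a real corpus would typically use a higher N and a stronger smoothing scheme; this sketch only shows the counting and scoring steps.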

Published

2024-12-28

How to Cite

Elov, B., Alayev, R., & Abdullayev, A. (2024). TABIIY TILNING STATISTIK MODELLARI. DIGITAL TRANSFORMATION AND ARTIFICIAL INTELLIGENCE, 2(6), 178–189. Retrieved from https://dtai.tsue.uz/index.php/dtai/article/view/v2i620