SPEECH COMMANDS RECOGNITION FOR UZBEK LANGUAGE USING AUDIO SPECTROGRAM TRANSFORMER

Olimboy Shavkatov; Lochinbek Niyazmetov

Authors

Olimboy Shavkatov, Urganch branch of Tashkent University of Information Technologies
Lochinbek Niyazmetov, Single Integrator Uzinfocom, Software Engineer

Keywords:

Speech Recognition, Uzbek Language, Audio Spectrogram Transformer, AST, Dataset Collection, Telegram Bot

Abstract

This paper presents a novel approach to training speech command recognition models for the Uzbek language using the Audio Spectrogram Transformer (AST). The study collects a dataset of Uzbek speech commands recorded by 28 speakers and trains AST models on it. Two experiments are conducted: the first trains AST models exclusively on the Uzbek dataset, while the second combines the Uzbek dataset with the widely used Speech Commands v2 dataset. The results show that AST recognizes Uzbek speech commands accurately, and the combined-dataset experiment reached an evaluation accuracy of 97.96%. The trained models are integrated into a Telegram bot (@UzVoiceDataSetBot), which lets users issue speech commands in Uzbek and have them recognized quickly and accurately. This research contributes to the advancement of speech recognition technology for the Uzbek language and highlights the potential of AST in multilingual speech processing applications.
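
The difference between the two experiments comes down to the label space the classifier is trained on and how evaluation accuracy is counted. The following is a minimal sketch of that bookkeeping only, not the authors' training code. The Uzbek command words, the four-word subset of Speech Commands v2, and the helper names `build_label_map` and `accuracy` are all assumptions made for illustration; the abstract does not list the actual vocabulary.

```python
# Illustrative label lists. The real Uzbek command vocabulary is not given in
# the abstract, so these words are assumptions chosen only for the example.
UZBEK_LABELS = ["ha", "yo'q", "chap", "o'ng"]
# A small subset of the Speech Commands v2 vocabulary, also for illustration.
SPEECH_COMMANDS_LABELS = ["yes", "no", "left", "right"]


def build_label_map(*label_lists):
    """Assign one class index to every distinct label across all datasets."""
    label_to_id = {}
    for labels in label_lists:
        for label in labels:
            if label not in label_to_id:
                label_to_id[label] = len(label_to_id)
    return label_to_id


def accuracy(predicted_ids, true_ids):
    """Fraction of clips whose predicted class matches the reference class."""
    if len(predicted_ids) != len(true_ids):
        raise ValueError("prediction and reference lists differ in length")
    if not true_ids:
        raise ValueError("cannot compute accuracy on an empty evaluation set")
    correct = sum(p == t for p, t in zip(predicted_ids, true_ids))
    return correct / len(true_ids)


# Experiment 1 (hypothetical setup): Uzbek classes only.
uzbek_only = build_label_map(UZBEK_LABELS)
# Experiment 2 (hypothetical setup): Uzbek classes plus Speech Commands classes.
combined = build_label_map(UZBEK_LABELS, SPEECH_COMMANDS_LABELS)
```

Under these assumptions, the combined experiment simply trains one classifier over the union of both vocabularies, and the reported accuracy is the share of evaluation clips assigned the correct class index.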
