AUDIO SPECTROGRAM TRANSFORMER (AST): ADVANTAGES OVER TRADITIONAL ALGORITHMS IN SPEECH-TO-TEXT (STT)

Authors

  • Shavkatov Olimboy, Department of Computer Engineering, Urgench Branch of Tashkent University of Information Technologies
  • Niyazmetov Lochinbek, Software Engineer, Single Integrator Uzinfocom

Keywords:

Audio Spectrogram Transformer, Deep Learning, Natural Language Processing (NLP), Speech Processing, Speech Recognition, Transformer Models

Abstract

Automatic Speech Recognition (ASR) has advanced significantly in recent years, largely due to the development of deep learning models. One of the most notable advances is the Audio Spectrogram Transformer (AST), a variant of the Transformer architecture tailored to audio processing tasks. In this paper, we review the AST and compare it with traditional ASR algorithms. We discuss its benefits, such as improved performance on noisy audio and better modeling of long-range dependencies. We also explore its applications in various domains, including voice assistants, transcription services, and audio indexing. Through experiments on benchmark datasets such as LibriSpeech and the Speech Commands dataset, we demonstrate the effectiveness of the AST in achieving state-of-the-art performance. Our findings suggest that the AST offers a promising direction for future advances in ASR technology.
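To make the pipeline concrete, pretrained AST checkpoints released alongside Gong et al. (2021) can be run through the HuggingFace Transformers library (Wolf et al., 2019). The minimal Python sketch below is illustrative rather than the experimental setup of this paper; it assumes the publicly available MIT/ast-finetuned-speech-commands-v2 checkpoint and a one-second, 16 kHz mono clip:

    import torch
    from transformers import ASTFeatureExtractor, ASTForAudioClassification

    # Assumed checkpoint: AST fine-tuned on Speech Commands v2 (keyword spotting).
    ckpt = "MIT/ast-finetuned-speech-commands-v2"
    feature_extractor = ASTFeatureExtractor.from_pretrained(ckpt)
    model = ASTForAudioClassification.from_pretrained(ckpt)
    model.eval()

    # Placeholder input: one second of silence at 16 kHz; replace with a real clip.
    waveform = torch.zeros(16000).numpy()

    # The feature extractor converts the raw waveform into a log-mel spectrogram,
    # which AST splits into patches and feeds to a Transformer encoder.
    inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    print(model.config.id2label[int(logits.argmax(dim=-1))])

The waveform-to-spectrogram step happens entirely inside the feature extractor; this is the design choice that lets AST reuse Vision-Transformer-style patch embeddings for audio instead of recurrent acoustic modeling.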

References

Abadi, M., Barham, P., Chen, J., et al. (2016). TensorFlow: A system for large-scale machine learning. 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI) (pp. 265–283).

Bahdanau, D., et al. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. Proceedings of the International Conference on Learning Representations (ICLR).

Chiu, C.-C., et al. (2018). State-of-the-Art Speech Recognition with Sequence-to-Sequence Models. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Chollet, F. (2015). Keras. GitHub repository. https://github.com/fchollet/keras

Chorowski, J., et al. (2015). Attention-Based Models for Speech Recognition. Advances in Neural Information Processing Systems (NIPS).

Chorowski, J., Weiss, R. J., Bengio, S., & van den Oord, A. (2019). Unsupervised speech representation learning using WaveNet autoencoders. IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).

Dong, L., Xu, S., & Xu, B. (2018). Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Gales, M., & Young, S. (2008). The Application of Hidden Markov Models in Speech Recognition. Foundations and Trends in Signal Processing.

Gong, Y., Chung, Y.-A., & Glass, J. (2021). AST: Audio Spectrogram Transformer. arXiv preprint arXiv:2104.01778.

Graves, A., et al. (2013). Speech Recognition with Deep Recurrent Neural Networks. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Hannun, A., Case, C., Casper, J., et al. (2014). Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.

Hinton, G., et al. (2012). Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine.

Huang, X., et al. (2001). Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall.

Kahn, J., Lee, A., & Hannun, A. (2020). Self-Training for End-to-End Speech Recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Karita, S., Chen, N., et al. (2019). A Comparative Study on Transformer vs RNN in Speech Applications. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

Paszke, A., Gross, S., et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. Advances in Neural Information Processing Systems 32 (NeurIPS 2019).

Povey, D., Ghoshal, A., Boulianne, G., et al. (2011). The Kaldi speech recognition toolkit. IEEE 2011 Workshop on Automatic Speech Recognition and Understanding (ASRU).

Radford, A., Narasimhan, K., et al. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI Technical Report.

Ravanelli, M., Omologo, M., & Svaizer, P. (2019). Speaker Recognition with Deep Residual Networks. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Sercu, T., et al. (2016). Very Deep Multilingual Convolutional Neural Networks for LVCSR. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Sutskever, I., et al. (2014). Sequence to Sequence Learning with Neural Networks. Advances in Neural Information Processing Systems (NIPS).

Vasquez, F., et al. (2018). Mel Frequency Cepstral Coefficients (MFCC) and Deep Learning for Voice Disorder Detection. International Conference on Pattern Recognition (ICPR).

Vaswani, A., et al. (2017). Attention Is All You Need. Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS).

Wang, Y., et al. (2017). Transformer Based Speech Recognition with CTC Loss on LibriSpeech. arXiv preprint arXiv:1712.01769.

Warden, P. (2018). Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv preprint arXiv:1804.03209.

Wolf, T., et al. (2019). HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv preprint arXiv:1910.03771.

Xu, K., et al. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Proceedings of the 32nd International Conference on Machine Learning (ICML).
Published

2024-03-28

How to Cite

Shavkatov, O., & Niyazmetov, L. (2024). AUDIO SPECTROGRAM TRANSFORMER (AST): ADVANTAGES OVER TRADITIONAL ALGORITHMS IN SPEECH-TO-TEXT (STT). DIGITAL TRANSFORMATION AND ARTIFICIAL INTELLIGENCE, 2(1), 182–188. Retrieved from https://dtai.tsue.uz/index.php/dtai/article/view/v2i127