AUDIO SPECTROGRAM TRANSFORMER (AST): ADVANTAGES OVER TRADITIONAL ALGORITHMS IN SPEECH-TO-TEXT (STT)
Keywords:
Audio Spectrogram Transformer, Deep Learning, Natural Language Processing (NLP), Speech Processing, Speech Recognition, Transformer Models
Abstract
Automatic Speech Recognition (ASR) has advanced significantly in recent years, largely due to the development of deep learning models. One of the most notable advances is the Audio Spectrogram Transformer (AST), a variant of the Transformer architecture tailored to audio processing tasks. In this paper, we review the AST and compare it with traditional ASR algorithms. We discuss its benefits, such as improved robustness to noisy audio and better modeling of long-range dependencies. We also explore its applications in various domains, including voice assistants, transcription services, and audio indexing. Through experiments on benchmark datasets such as LibriSpeech and the Speech Commands Dataset, we demonstrate that the AST achieves state-of-the-art performance. Our findings suggest that the AST offers a promising direction for future advances in ASR technology.
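To make the setup above concrete, the sketch below runs keyword classification on a one-second clip with a pretrained AST through the Hugging Face transformers library (Wolf et al., 2019). It is a minimal illustration rather than the authors' experimental pipeline: the checkpoint name (an AST fine-tuned on Speech Commands v2) and the input file name are assumptions.

import torch
import torchaudio
from transformers import ASTFeatureExtractor, ASTForAudioClassification

# Assumed checkpoint: a public AST fine-tuned on Speech Commands v2.
CHECKPOINT = "MIT/ast-finetuned-speech-commands-v2"

feature_extractor = ASTFeatureExtractor.from_pretrained(CHECKPOINT)
model = ASTForAudioClassification.from_pretrained(CHECKPOINT)
model.eval()

# Load a one-second keyword clip (hypothetical file name) and resample
# it to the 16 kHz rate the feature extractor expects.
waveform, sr = torchaudio.load("yes_example.wav")
waveform = torchaudio.functional.resample(waveform, sr, feature_extractor.sampling_rate)

# The feature extractor turns the raw waveform into the log-mel
# spectrogram that AST splits into patches and attends over; attending
# across all patches is what models long-range dependencies in the clip.
inputs = feature_extractor(
    waveform.squeeze().numpy(),
    sampling_rate=feature_extractor.sampling_rate,
    return_tensors="pt",
)

with torch.no_grad():
    logits = model(**inputs).logits

print("Predicted keyword:", model.config.id2label[int(logits.argmax(dim=-1))])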
References
Abadi, M., Barham, P., Chen, J., et al. (2016). TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI) (pp. 265-283).
Bahdanau, D., et al. (2015). Neural Machine Translation by Jointly Learning to Align and
Translate. Proceedings of the International Conference on Learning Representations (ICLR).
Chiu, C. C., et al. (2018). State-of-the-art Speech Recognition with Sequence-to-Sequence
Models. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Chollet, F. (2015). Keras. GitHub repository. https://github.com/fchollet/keras
Chorowski, J., et al. (2015). Attention-based Models for Speech Recognition. Advances in
Neural Information Processing Systems (NIPS).
Chorowski, J., Weiss, R. J., Bengio, S., & van den Oord, A. (2019). Unsupervised speech representation learning using WaveNet autoencoders. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding. Proceedings of the 2019 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).
Dong, L., Xu, S., et al. (2018). Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Gales, M., & Young, S. (2008). The Application of Hidden Markov Models in Speech Recognition. Foundations and Trends in Signal Processing.
Gong, Y., Chung, Y.-A., & Glass, J. (2021). AST: Audio Spectrogram Transformer. arXiv preprint arXiv:2104.01778.
Graves, A., et al. (2013). Speech Recognition with Deep Recurrent Neural Networks. IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Hannun, A., Case, C., Casper, J., et al. (2014). Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.
Hinton, G., et al. (2012). Deep Neural Networks for Acoustic Modeling in Speech Recognition:
The Shared Views of Four Research Groups. IEEE Signal Processing Magazine.
Huang, X., et al. (2001). Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall.
Kahn, J., Lee, A., & Hannun, A. (2020). Self-Training for End-to-End Speech Recognition. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Karita, S., Chen, N., et al. (2019). A Comparative Study on Transformer vs RNN in Speech Applications. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).
Paszke, A., Gross, S., et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. Advances in Neural Information Processing Systems 32 (NeurIPS 2019).
Povey, D., Ghoshal, A., Boulianne, G., et al. (2011). The Kaldi speech recognition toolkit. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).
Radford, A., Narasimhan, K., et al. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI Technical Report.
Ravanelli, M., Omologo, M., & Svaizer, P. (2019). Speaker Recognition With Deep Residual
Networks. IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP).
Sercu, T., et al. (2016). Very Deep Multilingual Convolutional Neural Networks for LVCSR.
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Sutskever, I., et al. (2014). Sequence to Sequence Learning with Neural Networks. Advances in
Neural Information Processing Systems (NIPS).
Vasquez, F., et al. (2018). Mel Frequency Cepstral Coefficients (MFCC) and Deep Learning for
Voice Disorder Detection. International Conference on Pattern Recognition (ICPR).
Vaswani, A., et al. (2017). Attention is All You Need. Proceedings of the 31st International
Conference on Neural Information Processing Systems (NIPS).
Wang, Y., et al. (2017). Transformer-Based Speech Recognition with CTC Loss on LibriSpeech. arXiv preprint arXiv:1712.01769.
Warden, P. (2018). Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition.
arXiv preprint arXiv:1804.03209.
Wolf, T., et al. (2019). HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv preprint arXiv:1910.03771.
Xu, K., et al. (2015). Show, Attend, and Tell: Neural Image Caption Generation with Visual
Attention. Proceedings of the 32nd International Conference on Machine Learning (ICML).
License
Copyright (c) 2024 Shavkatov Olimboy, Niyazmetov Lochinbek
This work is licensed under a Creative Commons Attribution 4.0 International License.