VOICE FILTER: FEW-SHOT SPEAKER ADAPTATION FOR UZBEK TEXT-TO-SPEECH USING VOICE CONVERSION AS A POST-PROCESSING MODULE
Keywords:
Speaker Adaptation, Uzbek Text-to-Speech, Natural Language Processing, Few-Shot Learning, Voice Conversion
Abstract
Text-to-Speech (TTS) systems developed in recent years require hours of recorded speech data to generate high-fidelity, human-like synthetic speech. A small amount of available speech causes several problems in the development of TTS models, which makes it difficult to train TTS systems in low-resource settings. This paper proposes a new low-resource TTS method, called Voice Filter, that uses only one minute of the target speaker's speech. It applies Voice Conversion (VC) as a post-processing module added to an already existing high-quality TTS system, which marks a conceptual shift in the current TTS paradigm by recasting the few-shot TTS problem as a VC task. In addition, a duration-controlled TTS system is used to create a parallel speech corpus that facilitates the VC task. The results show that, using only one minute of speech from a diverse set of voices, Voice Filter outperforms state-of-the-art few-shot speech synthesis methods on both objective and subjective metrics, while remaining competitive with an Uzbek TTS model trained on 25 times more data.
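The abstract describes a two-stage pipeline: a pre-existing, duration-controlled TTS front-end synthesizes speech in a base voice, and a VC module adapted on roughly one minute of target-speaker audio converts that output to the target voice. The Python sketch below illustrates only this data flow; it is not the authors' implementation, and every name and constant in it (DurationControlledTTS, VoiceConversionFilter, the 80-bin mel format, the mean-statistics "embedding") is a hypothetical stand-in.

    import numpy as np
    from typing import Optional

    class DurationControlledTTS:
        """Stand-in for the pre-existing high-quality TTS front-end. Fixing
        phoneme durations externally keeps source and target renditions of
        the same text frame-aligned, yielding a parallel corpus for VC."""

        def synthesize(self, text: str, durations: Optional[np.ndarray] = None) -> np.ndarray:
            # Placeholder: a fake 80-bin mel-spectrogram, one row per frame.
            n_frames = int(durations.sum()) if durations is not None else 20 * max(len(text), 1)
            return np.random.randn(n_frames, 80)

    class VoiceConversionFilter:
        """Stand-in for the VC post-processing module, adapted from about
        one minute of target-speaker audio."""

        def adapt(self, target_mels: np.ndarray) -> np.ndarray:
            # Placeholder "speaker embedding": mean mel statistics of the target data.
            return target_mels.mean(axis=0)

        def convert(self, source_mel: np.ndarray, speaker_emb: np.ndarray) -> np.ndarray:
            # Placeholder conversion: re-centre source spectra on target statistics.
            return source_mel - source_mel.mean(axis=0) + speaker_emb

    # Usage: adapt the filter once on one minute of target speech, then
    # filter every utterance the base TTS produces.
    tts = DurationControlledTTS()
    vc = VoiceConversionFilter()

    one_minute_target = np.random.randn(60 * 80, 80)   # ~1 min at ~80 frames/s
    speaker_emb = vc.adapt(one_minute_target)

    source_mel = tts.synthesize("Salom, dunyo!")       # base voice, Uzbek text
    target_mel = vc.convert(source_mel, speaker_emb)   # converted to target voice

Because the base TTS controls phoneme durations, a converted utterance and a recording of the same text stay time-aligned, which is what makes the parallel-corpus VC formulation in the abstract feasible.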