HAS MULTIMODAL LEARNING SUCCEEDED ENOUGH TO CAPTURE CONTEXTUAL MEANING OF HUMAN-TO-HUMAN INTERACTION? A SURVEY

Authors

  • Abdulaziz Xo‘jamqulov Tashkent University of Information Technologies named after Muhammad al-Khwarizmi
  • Javlon Jumanazarov Tashkent State University of Economics

Keywords:

Human-to-human interaction, multimodal learning, fusion models, computer vision, natural language processing, audio processing

Abstract

Human communication is inherently multimodal, involving speech, facial expressions, gestures, body language, and even contextual cues. Variability and ambiguity make the contextual meaning of human-to-human interaction (HHI) difficult to understand: gestures and expressions can carry different meanings across cultures, and personal habits and styles influence how behavior is interpreted. To address these problems, this article systematically analyses past and current state-of-the-art research on multimodal learning techniques for contextual understanding of HHI using audio, text, and vision data.
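For concreteness, below is a minimal illustrative sketch of the kind of fusion model surveyed here: a late-fusion classifier that combines pre-extracted audio, text, and vision embeddings into a single HHI label (e.g., an emotion or dialogue act). It is not taken from the article; the PyTorch implementation, embedding dimensions, and class count are assumptions chosen purely for illustration.

# Illustrative sketch (not from the survey): late fusion of audio, text, and
# vision embeddings for a hypothetical HHI classification task.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, audio_dim=128, text_dim=768, vision_dim=512,
                 hidden_dim=256, num_classes=7):
        super().__init__()
        # One projection head per modality maps its features into a shared space.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.vision_proj = nn.Linear(vision_dim, hidden_dim)
        # The fused representation is the concatenation of the projected modalities.
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(3 * hidden_dim, num_classes),
        )

    def forward(self, audio, text, vision):
        fused = torch.cat(
            [self.audio_proj(audio), self.text_proj(text), self.vision_proj(vision)],
            dim=-1,
        )
        return self.classifier(fused)

# Example with random tensors standing in for MFCC, sentence, and frame embeddings.
model = LateFusionClassifier()
logits = model(torch.randn(4, 128), torch.randn(4, 768), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 7])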

Published

2025-06-16

How to Cite

Xo‘jamqulov, A., & Jumanazarov, J. (2025). HAS MULTIMODAL LEARNING SUCCEEDED ENOUGH TO CAPTURE CONTEXTUAL MEANING OF HUMAN-TO-HUMAN INTERACTION? A SURVEY. DIGITAL TRANSFORMATION AND ARTIFICIAL INTELLIGENCE, 3(3), 153–170. Retrieved from https://dtai.tsue.uz/index.php/dtai/article/view/v3i322