HAS MULTIMODAL LEARNING SUCCEEDED ENOUGH TO CAPTURE CONTEXTUAL MEANING OF HUMAN-TO-HUMAN INTERACTION? A SURVEY
Keywords:
Human-to-human interaction, multimodal learning, fusion models, computer vision, natural language processing, audio processing

Abstract
Human communication is inherently multimodal, involving speech, facial expressions, gestures, body language, and even contextual cues. Variability and ambiguity make the contextual meaning of human-to-human interaction (HHI) harder to understand: gestures and expressions may carry different meanings across cultures, and personal habits and styles influence how behavior is interpreted. To tackle these problems, this article systematically analyses past and current state-of-the-art research on multimodal learning techniques for contextual understanding of HHI using audio, text, and vision data.
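To make the notion of a fusion model concrete, the sketch below shows a minimal late-fusion classifier in Python (PyTorch) that combines audio, text, and vision feature vectors into a single utterance-level prediction (for example, an emotion class). It is an illustrative assumption rather than a method from any surveyed work; the class name, feature dimensions, and seven-class output are hypothetical placeholders.

import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Project each modality separately, concatenate, then classify."""
    def __init__(self, audio_dim=128, text_dim=768, vision_dim=512,
                 hidden_dim=256, num_classes=7):
        super().__init__()
        # Small linear projections stand in for real encoders
        # (e.g. speech features, a text transformer embedding, a video backbone).
        self.audio_proj = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        self.text_proj = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU())
        self.vision_proj = nn.Sequential(nn.Linear(vision_dim, hidden_dim), nn.ReLU())
        # Late fusion: concatenate the three modality embeddings before the head.
        self.classifier = nn.Linear(3 * hidden_dim, num_classes)

    def forward(self, audio, text, vision):
        fused = torch.cat([self.audio_proj(audio),
                           self.text_proj(text),
                           self.vision_proj(vision)], dim=-1)
        return self.classifier(fused)

# Random tensors stand in for extracted features of a 4-utterance batch.
model = LateFusionClassifier()
logits = model(torch.randn(4, 128), torch.randn(4, 768), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 7])

Approaches in the literature differ mainly in where and how fusion happens (feature-level, decision-level, or via cross-modal attention); the concatenate-and-classify pattern above is simply the most basic baseline against which such designs can be compared.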
License
Copyright (c) 2025 Abdulaziz Xo‘jamqulov, Javlon Jumanazarov

This work is licensed under a Creative Commons Attribution 4.0 International License.