INTEGRATING LARGE LANGUAGE MODELS WITH VISUAL DATA FOR ENHANCED HUMAN-OBJECT INTERACTION DETECTION
Keywords:
Human-object interaction, image analysis, interaction detection, interaction-language models, large language models, multimodal learning
Abstract
In recent years, the widespread adoption of vision-based intelligent systems has significantly advanced image and video analysis technologies. One key research area within this field is human activity recognition. Recent studies have concentrated primarily on specific tasks such as human action recognition and human-object interaction detection, employing depth data, 3D skeleton data, image data, and spatiotemporal interest point-based methods. Most of these approaches rely on bounding-box techniques to recognize human-object interactions, whereas comparatively little research has explored the use of language models for this purpose. In this paper, we propose a model that combines language and image data to detect human-object interactions, and we discuss the challenges and future directions in this domain.
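To make the core idea concrete, the following minimal sketch (an illustration, not the proposed model's implementation) shows one common way to combine language and image data for interaction detection: scoring candidate human-object interaction descriptions against an image with a pretrained vision-language model (CLIP, via the Hugging Face transformers library). The image path, candidate phrases, and checkpoint name are illustrative assumptions; in practice the candidate list would come from a detector's human-object pairs.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Pretrained image-text model used here purely for illustration.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical candidate interactions, e.g. formed from detected
# (human, verb, object) triplets rendered as natural-language prompts.
candidates = [
    "a person riding a bicycle",
    "a person holding a cup",
    "a person sitting on a chair",
]

image = Image.open("example.jpg")  # hypothetical input image
inputs = processor(text=candidates, images=image,
                   return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Higher logits indicate better image-text agreement; softmax yields a
# distribution over the candidate interaction descriptions.
probs = outputs.logits_per_image.softmax(dim=-1)
for text, p in zip(candidates, probs[0].tolist()):
    print(f"{p:.3f}  {text}")

A design note on this sketch: phrasing interactions as text lets the language side of the model supply prior knowledge about plausible verb-object combinations, which is precisely the signal that bounding-box-only pipelines lack.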