Semantic Enrichment for Video Question Answering with Gated Graph Neural Networks

Abstract

Video Question Answering (VideoQA) is a complex task that requires a deep understanding of a video to answer questions accurately. Existing methods often struggle to effectively integrate visual and language-based semantic information, leading to an incomplete understanding of video content and sub-optimal performance. To address this challenge, we introduce in this paper a novel approach to enrich the semantics of video frames, questions, and answer candidates. Specifically, we parse video frames and questions into two semantic graphs, a visual semantic graph and a question semantic graph, which capture information about objects, their attributes, and their relationships. These graphs are then encoded using a Gated Graph Neural Network (GGNN). For answer candidates, we propose to verbalize them using Large Language Models (LLMs) to inject additional semantic information from the visual and acoustic modalities. We evaluate our approach on two benchmark VideoQA datasets, AVQA and Music-AVQA. Experimental results show that our approach outperforms competitive baseline models, achieving state-of-the-art performance across various question types.
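
The paper is not accompanied by code here; as a minimal sketch of the GGNN encoding step described in the abstract, the following PyTorch snippet propagates node states over a semantic graph using GRU-gated updates in the style of Li et al. (2016). The class name, dimensions, and adjacency-matrix format are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GGNN(nn.Module):
    """Minimal Gated Graph Neural Network sketch (single edge type).

    At each propagation step, every node aggregates linearly transformed
    messages from its neighbors and updates its state with a GRU cell.
    """
    def __init__(self, hidden_dim: int, num_steps: int = 3):
        super().__init__()
        self.msg = nn.Linear(hidden_dim, hidden_dim)   # message transform
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)  # gated state update
        self.num_steps = num_steps

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h:   (num_nodes, hidden_dim) initial node embeddings
        # adj: (num_nodes, num_nodes) adjacency matrix of the semantic graph
        for _ in range(self.num_steps):
            m = adj @ self.msg(h)  # aggregate neighbor messages
            h = self.gru(m, h)     # GRU-gated update of node states
        return h                   # contextualized node embeddings

# Toy usage: a 4-node semantic graph with 128-dim node features.
h = torch.randn(4, 128)
adj = torch.tensor([[0, 1, 1, 0],
                    [1, 0, 0, 1],
                    [1, 0, 0, 1],
                    [0, 1, 1, 0]], dtype=torch.float)
out = GGNN(128)(h, adj)  # shape (4, 128)
```

In the paper's pipeline, such node states would be produced separately for the visual semantic graph and the question semantic graph before answer scoring.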

Publication
In Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024)