Video graph transformer for video question answering

J Xiao, P Zhou, TS Chua, S Yan - European Conference on Computer …, 2022 - Springer
This paper proposes a Video Graph Transformer (VGT) model for Video Question Answering
(VideoQA). VGT's uniqueness is two-fold: 1) it designs a dynamic graph transformer …

Video question answering: Datasets, algorithms and challenges

Y Zhong, J Xiao, W Ji, Y Li, W Deng… - arXiv preprint arXiv …, 2022 - arxiv.org
Video Question Answering (VideoQA) aims to answer natural language questions according
to the given videos. It has earned increasing attention with recent research trends in joint …

Discovering spatio-temporal rationales for video question answering

Y Li, J Xiao, C Feng, X Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
This paper strives to solve complex video question answering (VideoQA) which features
long videos containing multiple objects and events at different times. To tackle the challenge …

Contrastive video question answering via video graph transformer

J Xiao, P Zhou, A Yao, Y Li, R Hong… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org
We propose to perform video question answering (VideoQA) in a Contrastive manner via a
Video Graph Transformer model (CoVGT). CoVGT's uniqueness and superiority are three …

Your negative may not be true negative: Boosting image-text matching with false negative elimination

H Li, Y Bin, J Liao, Y Yang, HT Shen - Proceedings of the 31st ACM …, 2023 - dl.acm.org
Most existing image-text matching methods adopt triplet loss as the optimization objective,
and choosing a proper negative sample for the triplet of <anchor, positive, negative> is …

Equivariant and invariant grounding for video question answering

Y Li, X Wang, J Xiao, TS Chua - Proceedings of the 30th ACM …, 2022 - dl.acm.org
Video Question Answering (VideoQA) is the task of answering the natural language
questions about a video. Producing an answer requires understanding the interplay across …

Robust video question answering via contrastive cross-modality representation learning

X Yang, J Zeng, D Guo, S Wang, J Dong… - Science China Information …, 2024 - Springer
Video question answering (VideoQA) is a challenging yet important task that requires a joint
understanding of low-level video content and high-level textual semantics. Despite the …

Learning situation hyper-graphs for video question answering

A Urooj, H Kuehne, B Wu, K Chheu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Answering questions about complex situations in videos requires not only capturing of the
presence of actors, objects, and their relations, but also the evolution of these relationships …

Transformer-empowered invariant grounding for video question answering

Y Li, X Wang, J Xiao, W Ji… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Video Question Answering (VideoQA) is the task of answering questions about a video. At its
core is the understanding of the alignments between video scenes and question semantics …

Unifying two-stream encoders with transformers for cross-modal retrieval

Y Bin, H Li, Y Xu, X Xu, Y Yang, HT Shen - Proceedings of the 31st ACM …, 2023 - dl.acm.org
Most existing cross-modal retrieval methods employ two-stream encoders with different
architectures for images and texts, e.g., CNN for images and RNN/Transformer for texts. Such …