Video graph transformer for video question answering

J Xiao, P Zhou, TS Chua, S Yan - European Conference on Computer …, 2022 - Springer

This paper proposes a Video Graph Transformer (VGT) model for Video Question Answering
(VideoQA). VGT's uniqueness are two-fold: 1) it designs a dynamic graph transformer …

Cited by 81 · Related articles · All 6 versions

[PDF] arxiv.org

Video question answering: Datasets, algorithms and challenges

Y Zhong, J Xiao, W Ji, Y Li, W Deng… - arXiv preprint arXiv …, 2022 - arxiv.org

Video Question Answering (VideoQA) aims to answer natural language questions according
to the given videos. It has earned increasing attention with recent research trends in joint …

Cited by 76 · Related articles · All 3 versions

[PDF] thecvf.com

Discovering spatio-temporal rationales for video question answering

Y Li, J Xiao, C Feng, X Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com

This paper strives to solve complex video question answering (VideoQA) which features
long videos containing multiple objects and events at different time. To tackle the challenge …

Cited by 11 · Related articles · All 5 versions

[PDF] arxiv.org

Contrastive video question answering via video graph transformer

J Xiao, P Zhou, A Yao, Y Li, R Hong… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org

We propose to perform video question answering (VideoQA) in a Contrastive manner via a
Video Graph Transformer model (CoVGT). CoVGT's uniqueness and superiority are three …

Cited by 26 · Related articles · All 8 versions

[PDF] acm.org

Your negative may not be true negative: Boosting image-text matching with false negative elimination

H Li, Y Bin, J Liao, Y Yang, HT Shen - Proceedings of the 31st ACM …, 2023 - dl.acm.org

Most existing image-text matching methods adopt triplet loss as the optimization objective,
and choosing a proper negative sample for the triplet of <anchor, positive, negative> is …

Cited by 21 · Related articles · All 3 versions

[PDF] acm.org

Equivariant and invariant grounding for video question answering

Y Li, X Wang, J Xiao, TS Chua - Proceedings of the 30th ACM …, 2022 - dl.acm.org

Video Question Answering (VideoQA) is the task of answering the natural language
questions about a video. Producing an answer requires understanding the interplay across …

Cited by 28 · Related articles · All 3 versions

Robust video question answering via contrastive cross-modality representation learning

X Yang, J Zeng, D Guo, S Wang, J Dong… - Science China Information …, 2024 - Springer

Video question answering (VideoQA) is a challenging yet important task that requires a joint
understanding of low-level video content and high-level textual semantics. Despite the …

Cited by 4 · Related articles

[PDF] thecvf.com

Learning situation hyper-graphs for video question answering

A Urooj, H Kuehne, B Wu, K Chheu… - Proceedings of the …, 2023 - openaccess.thecvf.com

Answering questions about complex situations in videos requires not only capturing of the
presence of actors, objects, and their relations, but also the evolution of these relationships …

Cited by 12 · Related articles · All 7 versions

Transformer-empowered invariant grounding for video question answering

Y Li, X Wang, J Xiao, W Ji… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org

Video Question Answering (VideoQA) is the task of answering questions about a video. At its
core is the understanding of the alignments between video scenes and question semantics …

Cited by 6 · Related articles · All 4 versions

[PDF] arxiv.org

Unifying two-stream encoders with transformers for cross-modal retrieval

Y Bin, H Li, Y Xu, X Xu, Y Yang, HT Shen - Proceedings of the 31st ACM …, 2023 - dl.acm.org

Most existing cross-modal retrieval methods employ two-stream encoders with different
architectures for images and texts, e.g., CNN for images and RNN/Transformer for texts. Such …

Cited by 15 · Related articles · All 3 versions