Video graph transformer for video question answering
This paper proposes a Video Graph Transformer (VGT) model for Video Question Answering
(VideoQA). VGT's uniqueness is two-fold: 1) it designs a dynamic graph transformer …
Video question answering: Datasets, algorithms and challenges
Video Question Answering (VideoQA) aims to answer natural language questions according
to the given videos. It has earned increasing attention with recent research trends in joint …
Discovering spatio-temporal rationales for video question answering
This paper strives to solve complex video question answering (VideoQA) which features
long videos containing multiple objects and events at different times. To tackle the challenge …
Contrastive video question answering via video graph transformer
We propose to perform video question answering (VideoQA) in a Contrastive manner via a
Video Graph Transformer model (CoVGT). CoVGT's uniqueness and superiority are three …
Your negative may not be true negative: Boosting image-text matching with false negative elimination
Most existing image-text matching methods adopt triplet loss as the optimization objective,
and choosing a proper negative sample for the triplet of <anchor, positive, negative> is …
Equivariant and invariant grounding for video question answering
Video Question Answering (VideoQA) is the task of answering the natural language
questions about a video. Producing an answer requires understanding the interplay across …
Robust video question answering via contrastive cross-modality representation learning
Video question answering (VideoQA) is a challenging yet important task that requires a joint
understanding of low-level video content and high-level textual semantics. Despite the …
Learning situation hyper-graphs for video question answering
Answering questions about complex situations in videos requires not only capturing the
presence of actors, objects, and their relations, but also the evolution of these relationships …
Transformer-empowered invariant grounding for video question answering
Video Question Answering (VideoQA) is the task of answering questions about a video. At its
core is the understanding of the alignments between video scenes and question semantics …
Unifying two-stream encoders with transformers for cross-modal retrieval
Most existing cross-modal retrieval methods employ two-stream encoders with different
architectures for images and texts, e.g., CNN for images and RNN/Transformer for texts. Such …