VLP: A survey on vision-language pre-training
In the past few years, the emergence of pre-training models has brought uni-modal fields
such as computer vision (CV) and natural language processing (NLP) to a new era …
Shortcut learning of large language models in natural language understanding
Communications of the ACM, January 2024, Vol. 67, No. 1 …
Benchmarking spatial relationships in text-to-image generation
Spatial understanding is a fundamental aspect of computer vision and integral for human-
level reasoning about images, making it an important component for grounded language …
Multimodal fake news detection via CLIP-guided learning
Fake news detection (FND) has attracted much research interest in social forensics. Many
existing approaches introduce tailored attention mechanisms to fuse unimodal features …
COCA: Collaborative causal regularization for audio-visual question answering
Audio-Visual Question Answering (AVQA) is a sophisticated QA task, which aims at
answering textual questions over given video-audio pairs with comprehensive multimodal …
Bootstrapping multi-view representations for fake news detection
Previous research on multimedia fake news detection includes a series of complex feature
extraction and fusion networks to gather useful information from the news. However, how …
Multi-modal fake news detection on social media via multi-grained information fusion
The easy sharing of multimedia content on social media has caused a rapid dissemination
of fake news, which threatens society's stability and security. Therefore, fake news detection …
CLIP-TD: CLIP targeted distillation for vision-language tasks
Contrastive language-image pretraining (CLIP) links vision and language modalities into a
unified embedding space, yielding tremendous potential for vision-language (VL) tasks …
Multi-level counterfactual contrast for visual commonsense reasoning
Given a question about an image, a Visual Commonsense Reasoning (VCR) model needs
to provide not only a correct answer, but also a rationale to justify the answer. It is a …
Relation Inference Enhancement Network for Visual Commonsense Reasoning
M Yuan, G Jia, BK Bao - IEEE Transactions on Multimedia, 2024 - ieeexplore.ieee.org
When presented with a question regarding an image, Visual Commonsense Reasoning
(VCR) offers not only a correct answer but also a rationale to justify the answer. Existing …