VLP: A survey on vision-language pre-training

FL Chen, DZ Zhang, ML Han, XY Chen, J Shi… - Machine Intelligence …, 2023 - Springer
In the past few years, the emergence of pre-training models has brought uni-modal fields
such as computer vision (CV) and natural language processing (NLP) to a new era …

Shortcut learning of large language models in natural language understanding

M Du, F He, N Zou, D Tao, X Hu - Communications of the ACM, 2023 - dl.acm.org
Published in Communications of the ACM, January 2024, Vol. 67, No. 1 …

Benchmarking spatial relationships in text-to-image generation

T Gokhale, H Palangi, B Nushi, V Vineet… - arXiv preprint arXiv …, 2022 - arxiv.org
Spatial understanding is a fundamental aspect of computer vision and integral for human-
level reasoning about images, making it an important component for grounded language …

Multimodal fake news detection via CLIP-guided learning

Y Zhou, Y Yang, Q Ying, Z Qian… - 2023 IEEE International …, 2023 - ieeexplore.ieee.org
Fake news detection (FND) has attracted much research interest in social forensics. Many
existing approaches introduce tailored attention mechanisms to fuse unimodal features …

COCA: Collaborative causal regularization for audio-visual question answering

M Lao, N Pu, Y Liu, K He, EM Bakker… - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Audio-Visual Question Answering (AVQA) is a sophisticated QA task, which aims at
answering textual questions over given video-audio pairs with comprehensive multimodal …

Bootstrapping multi-view representations for fake news detection

Q Ying, X Hu, Y Zhou, Z Qian, D Zeng… - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Previous research on multimedia fake news detection includes a series of complex feature
extraction and fusion networks to gather useful information from the news. However, how …

Multi-modal fake news detection on social media via multi-grained information fusion

Y Zhou, Y Yang, Q Ying, Z Qian, X Zhang - Proceedings of the 2023 …, 2023 - dl.acm.org
The easy sharing of multimedia content on social media has caused a rapid dissemination
of fake news, which threatens society's stability and security. Therefore, fake news detection …

CLIP-TD: CLIP targeted distillation for vision-language tasks

Z Wang, N Codella, YC Chen, L Zhou, J Yang… - arXiv preprint arXiv …, 2022 - arxiv.org
Contrastive language-image pretraining (CLIP) links the vision and language modalities into a
unified embedding space, yielding tremendous potential for vision-language (VL) tasks …

Multi-level counterfactual contrast for visual commonsense reasoning

X Zhang, F Zhang, C Xu - Proceedings of the 29th ACM International …, 2021 - dl.acm.org
Given a question about an image, a Visual Commonsense Reasoning (VCR) model needs
to provide not only a correct answer, but also a rationale to justify the answer. It is a …

Relation Inference Enhancement Network for Visual Commonsense Reasoning

M Yuan, G Jia, BK Bao - IEEE Transactions on Multimedia, 2024 - ieeexplore.ieee.org
When presented with a question about an image, Visual Commonsense Reasoning
(VCR) requires not only a correct answer but also a rationale to justify the answer. Existing …