A case study of the shortcut effects in visual commonsense reasoning

FL Chen, DZ Zhang, ML Han, XY Chen, J Shi… - Machine Intelligence …, 2023 - Springer

In the past few years, the emergence of pre-training models has brought uni-modal fields
such as computer vision (CV) and natural language processing (NLP) to a new era …

Cited by: 211 | Related articles | All 10 versions

Shortcut learning of large language models in natural language understanding

M Du, F He, N Zou, D Tao, X Hu - Communications of the ACM, 2023 - dl.acm.org

Shortcut Learning of Large Language Models in Natural Language Understanding.
Communications of the ACM, January 2024, Vol. 67, No. 1 …

Cited by: 110 | Related articles | All 8 versions

Benchmarking spatial relationships in text-to-image generation

T Gokhale, H Palangi, B Nushi, V Vineet… - arXiv preprint arXiv …, 2022 - arxiv.org

Spatial understanding is a fundamental aspect of computer vision and integral for human-
level reasoning about images, making it an important component for grounded language …

Cited by: 66 | Related articles | All 2 versions

Multimodal fake news detection via CLIP-guided learning

Y Zhou, Y Yang, Q Ying, Z Qian… - 2023 IEEE International …, 2023 - ieeexplore.ieee.org

Fake news detection (FND) has attracted much research interest in social forensics. Many
existing approaches introduce tailored attention mechanisms to fuse unimodal features …

Cited by: 60 | Related articles | All 4 versions

COCA: Collaborative causal regularization for audio-visual question answering

M Lao, N Pu, Y Liu, K He, EM Bakker… - Proceedings of the AAAI …, 2023 - ojs.aaai.org

Audio-Visual Question Answering (AVQA) is a sophisticated QA task, which aims at
answering textual questions over given video-audio pairs with comprehensive multimodal …

Cited by: 14 | Related articles | All 2 versions

Bootstrapping multi-view representations for fake news detection

Q Ying, X Hu, Y Zhou, Z Qian, D Zeng… - Proceedings of the AAAI …, 2023 - ojs.aaai.org

Previous research on multimedia fake news detection includes a series of complex feature
extraction and fusion networks to gather useful information from the news. However, how …

Cited by: 47 | Related articles | All 4 versions

Multi-modal fake news detection on social media via multi-grained information fusion

Y Zhou, Y Yang, Q Ying, Z Qian, X Zhang - Proceedings of the 2023 …, 2023 - dl.acm.org

The easy sharing of multimedia content on social media has caused a rapid dissemination
of fake news, which threatens society's stability and security. Therefore, fake news detection …

Cited by: 30 | Related articles | All 5 versions

CLIP-TD: CLIP targeted distillation for vision-language tasks

Z Wang, N Codella, YC Chen, L Zhou, J Yang… - arXiv preprint arXiv …, 2022 - arxiv.org

Contrastive language-image pretraining (CLIP) links vision and language modalities into a
unified embedding space, yielding the tremendous potential for vision-language (VL) tasks …

Cited by: 27 | Related articles | All 3 versions

Multi-level counterfactual contrast for visual commonsense reasoning

X Zhang, F Zhang, C Xu - Proceedings of the 29th ACM International …, 2021 - dl.acm.org

Given a question about an image, a Visual Commonsense Reasoning (VCR) model needs
to provide not only a correct answer, but also a rationale to justify the answer. It is a …

Cited by: 24 | Related articles

Relation Inference Enhancement Network for Visual Commonsense Reasoning

M Yuan, G Jia, BK Bao - IEEE Transactions on Multimedia, 2024 - ieeexplore.ieee.org

When presented with a question regarding an image, Visual Commonsense Reasoning
(VCR) offers not only a correct answer but also a rationale to justify the answer. Existing …