One small step for generative AI, one giant leap for AGI: A complete survey on ChatGPT in AIGC era

C Zhang, C Zhang, C Li, Y Qiao, S Zheng… - arXiv preprint arXiv …, 2023 - arxiv.org
OpenAI has recently released GPT-4 (aka ChatGPT plus), which is demonstrated to be one
small step for generative AI (GAI), but one giant leap for artificial general intelligence (AGI) …

Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

PP Liang, A Zadeh, LP Morency - arXiv preprint arXiv:2209.03430, 2022 - arxiv.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

LLaMA-Adapter V2: Parameter-efficient visual instruction model

P Gao, J Han, R Zhang, Z Lin, S Geng, A Zhou… - arXiv preprint arXiv …, 2023 - arxiv.org
How to efficiently transform large language models (LLMs) into instruction followers has recently become a popular research direction, while training LLMs for multi-modal reasoning remains …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal foundation models with vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose assistants …

Frozen in time: A joint video and image encoder for end-to-end retrieval

M Bain, A Nagrani, G Varol… - Proceedings of the …, 2021 - openaccess.thecvf.com
Our objective in this work is video-text retrieval, in particular a joint embedding that enables efficient text-to-video retrieval. The challenges in this area include the design of the visual …

HallusionBench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

T Guan, F Liu, X Wu, R Xian, Z Li… - Proceedings of the …, 2024 - openaccess.thecvf.com
We introduce" HallusionBench" a comprehensive benchmark designed for the evaluation of
image-context reasoning. This benchmark presents significant challenges to advanced large …

Large language models are visual reasoning coordinators

L Chen, B Li, S Shen, J Yang, C Li… - Advances in …, 2024 - proceedings.neurips.cc
Visual reasoning requires multimodal perception and commonsense cognition of the world.
Recently, multiple vision-language models (VLMs) have been proposed with excellent …

Meshed-memory transformer for image captioning

M Cornia, M Stefanini, L Baraldi… - Proceedings of the …, 2020 - openaccess.thecvf.com
Transformer-based architectures represent the state of the art in sequence modeling tasks
like machine translation and language understanding. Their applicability to multi-modal …

RSTNet: Captioning with adaptive attention on visual and non-visual words

X Zhang, X Sun, Y Luo, J Ji, Y Zhou… - Proceedings of the …, 2021 - openaccess.thecvf.com
Recent progress on visual question answering has explored the merits of grid features for
vision language tasks. Meanwhile, transformer-based models have shown remarkable …

Eigen-CAM: Class activation map using principal components

MB Muhammad, M Yeasin - 2020 international joint conference …, 2020 - ieeexplore.ieee.org
Deep neural networks are ubiquitous due to the ease of developing models and their
influence on other domains. At the heart of this progress is convolutional neural networks …