| Title | Authors | Venue | Cited by | Year |
|---|---|---|---|---|
| UNITER: Learning UNiversal Image-TExt Representations | YC Chen, L Li, L Yu, AE Kholy, F Ahmed, Z Gan, Y Cheng, J Liu | ECCV 2020 | 2370* | 2020 |
| Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling | J Lei, L Li, L Zhou, Z Gan, TL Berg, M Bansal, J Liu | CVPR 2021 | 631 | 2021 |
| HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training | L Li, YC Chen, Y Cheng, Z Gan, L Yu, J Liu | EMNLP 2020 | 492 | 2020 |
| Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Z Gan, YC Chen, L Li, C Zhu, Y Cheng, J Liu | NeurIPS 2020 | 486 | 2020 |
| GIT: A Generative Image-to-text Transformer for Vision and Language | J Wang, Z Yang, X Hu, L Li, K Lin, Z Gan, Z Liu, C Liu, L Wang | TMLR | 401 | 2022 |
| Relation-Aware Graph Attention Network for Visual Question Answering | L Li, Z Gan, Y Cheng, J Liu | ICCV 2019 | 395 | 2019 |
| The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) | Z Yang, L Li, K Lin, J Wang, CC Lin, Z Liu, L Wang | arXiv preprint arXiv:2309.17421 | 320 | 2023 |
| Improving Image Generation with Better Captions | J Betker, G Goh, L Jing, T Brooks, J Wang, L Li, L Ouyang, J Zhuang, ... | https://cdn.openai.com/papers/dall-e-3.pdf | 313 | 2023 |
| Segment Everything Everywhere All at Once | X Zou, J Yang, H Zhang, F Li, L Li, J Gao, YJ Lee | NeurIPS 2023 | 302 | 2023 |
| MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | Z Yang, L Li, J Wang, K Lin, E Azarnasab, F Ahmed, Z Liu, C Liu, M Zeng, ... | arXiv preprint arXiv:2303.11381 | 235 | 2023 |
| SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning | K Lin, L Li, CC Lin, F Ahmed, Z Gan, Z Liu, Y Lu, L Wang | CVPR 2022 | 217 | 2021 |
| MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities | W Yu, Z Yang, L Li, J Wang, K Lin, Z Liu, X Wang, L Wang | ICML 2024 | 190 | 2023 |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | F Liu, K Lin, L Li, J Wang, Y Yacoob, L Wang | ICLR 2024 | 188* | 2023 |
| VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling | TJ Fu, L Li, Z Gan, K Lin, WY Wang, L Wang, Z Liu | arXiv preprint arXiv:2111.12681 | 187 | 2021 |
| Graph Optimal Transport for Cross-Domain Alignment | L Chen, Z Gan, Y Cheng, L Li, L Carin, J Liu | ICML 2020 | 164 | 2020 |
| Generalized Decoding for Pixel, Image, and Language | X Zou, ZY Dou, J Yang, Z Gan, L Li, C Li, X Dai, H Behl, J Wang, L Yuan, ... | CVPR 2023 | 161 | 2022 |
| Vision-Language Pre-training: Basics, Recent Advances, and Future Trends | Z Gan, L Li, C Li, L Wang, Z Liu, J Gao | Foundations and Trends® in Computer Graphics and Vision 14 (3–4), 163–352 | 142 | 2022 |
| Multi-Step Reasoning via Recurrent Dual Attention for Visual Dialog | Z Gan, Y Cheng, AE Kholy, L Li, J Liu, J Gao | ACL 2019 | 112 | 2019 |
| VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation | L Li, J Lei, Z Gan, L Yu, YC Chen, R Pillai, Y Cheng, L Zhou, XE Wang, ... | NeurIPS 2021 Datasets and Benchmarks Track | 103 | 2021 |
| Multimodal Foundation Models: From Specialists to General-Purpose Assistants | C Li, Z Gan, Z Yang, J Yang, L Li, L Wang, J Gao | Foundations and Trends® in Computer Graphics and Vision 16 (1–2), 1–214 | 98 | 2023 |