| Title | Authors | Venue | Cited by | Year |
| --- | --- | --- | --- | --- |
| A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering | Y Li, L Wang, B Hu, X Chen, W Zhong, C Lyu, M Zhang | arXiv preprint arXiv:2311.07536 | 26 | 2023 |
| LMEye: An Interactive Perception Network for Large Language Models | Y Li, B Hu, X Chen, L Ma, Y Xu, M Zhang | IEEE Transactions on Multimedia | 20 | 2024 |
| A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues | Y Li, B Hu, X Chen, Y Ding, L Ma, M Zhang | arXiv preprint arXiv:2305.04530 | 11 | 2023 |
| LLMs Meet Long Video: Advancing Long Video Comprehension with an Interactive Visual Adapter in LLMs | Y Li, X Chen, B Hu, M Zhang | arXiv preprint arXiv:2402.13546 | 3 | 2024 |
| Vision-Language Model for Generating Textual Descriptions From Clinical Images: Model Development and Validation Study | J Ji, Y Hou, X Chen, Y Pan, Y Xiang | JMIR Formative Research 8, e32690 | 1 | 2024 |
| VideoVista: A Versatile Benchmark for Video Understanding and Reasoning | Y Li, X Chen, B Hu, L Wang, H Shi, M Zhang | arXiv preprint arXiv:2406.11303 | | 2024 |
| Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment | Y Li, X Chen, B Hu, H Shi, M Zhang | arXiv preprint arXiv:2402.13561 | | 2024 |