Deep modular co-attention networks for visual question answering Z Yu, J Yu, Y Cui, D Tao, Q Tian IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6281-6290, 2019 | 950 | 2019 |
Multi-modal factorized bilinear pooling with co-attention learning for visual question answering Z Yu, J Yu, J Fan, D Tao IEEE International Conference on Computer Vision (ICCV), 1821-1830, 2017 | 794 | 2017 |
Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering Z Yu, J Yu, C Xiang, J Fan, D Tao IEEE Transactions on Neural Networks and Learning Systems 29 (12), 5947-5959, 2018 | 522 | 2018 |
Multimodal transformer with multi-view visual representation for image captioning J Yu, J Li, Z Yu, Q Huang IEEE Transactions on Circuits and Systems for Video Technology 30 (12), 4467 …, 2020 | 405 | 2020 |
ActivityNet-QA: A dataset for understanding complex web videos via question answering Z Yu, D Xu, J Yu, T Yu, Z Zhao, Y Zhuang, D Tao Proceedings of the AAAI Conference on Artificial Intelligence, 9127-9134, 2019 | 297 | 2019 |
Sparse multi-modal hashing F Wu, Z Yu, Y Yang, S Tang, Y Zhang, Y Zhuang IEEE Transactions on Multimedia 16 (2), 427-439, 2014 | 148 | 2014 |
Prompting large language models with answer heuristics for knowledge-based visual question answering Z Shao, Z Yu, M Wang, J Yu IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 14974-14983, 2023 | 141 | 2023 |
Rethinking diversified and discriminative proposal generation for visual grounding Z Yu, J Yu, C Xiang, Z Zhao, Q Tian, D Tao International Joint Conference on Artificial Intelligence (IJCAI), 1114-1120, 2018 | 134 | 2018 |
Discriminative coupled dictionary hashing for fast cross-media retrieval Z Yu, F Wu, Y Yang, Q Tian, J Luo, Y Zhuang Proceedings of the 37th international ACM SIGIR conference on Research …, 2014 | 131 | 2014 |
Deep multimodal neural architecture search Z Yu, Y Cui, J Yu, M Wang, D Tao, Q Tian Proceedings of the 28th ACM International Conference on Multimedia, 3743-3752, 2020 | 94 | 2020 |
SPRNet: Single pixel reconstruction for one-stage instance segmentation J Yu, J Yao, J Zhang, Z Yu, D Tao IEEE Transactions on Cybernetics 51 (4), 1731-1742, 2021 | 81 | 2021 |
Open-ended long-form video question answering via adaptive hierarchical reinforced networks Z Zhao, Z Zhang, S Xiao, Z Yu, J Yu, D Cai, F Wu, Y Zhuang International Joint Conference on Artificial Intelligence (IJCAI), 3683-3689, 2018 | 69 | 2018 |
MARN: Multi-level attentional reconstruction networks for weakly supervised video temporal grounding Y Song, J Wang, L Ma, J Yu, J Liang, L Yuan, Z Yu Neurocomputing 554, 126625, 2023 | 56 | 2023 |
ROSITA: Enhancing vision-and-language semantic alignments via cross- and intra-modal knowledge integration Y Cui, Z Yu, C Wang, Z Zhao, J Zhang, M Wang, J Yu Proceedings of the 29th ACM International Conference on Multimedia, 797-806, 2021 | 56 | 2021 |
Long-term video question answering via multimodal hierarchical memory attentive networks T Yu, J Yu, Z Yu, Q Huang, Q Tian IEEE Transactions on Circuits and Systems for Video Technology 31 (3), 931-944, 2020 | 52 | 2020 |
Compositional attention networks with two-stream fusion for video question answering T Yu, J Yu, Z Yu, D Tao IEEE Transactions on Image Processing 29, 1204-1218, 2019 | 43 | 2019 |
Multimodal unified attention networks for vision-and-language interactions Z Yu, Y Cui, J Yu, D Tao, Q Tian arXiv preprint arXiv:1908.04107, 2019 | 43 | 2019 |
Comprehensive distance-preserving autoencoders for cross-modal retrieval Y Zhan, J Yu, Z Yu, R Zhang, D Tao, Q Tian Proceedings of the 26th ACM international conference on Multimedia, 1137-1145, 2018 | 37 | 2018 |
Cross-media hashing with neural networks Y Zhuang, Z Yu, W Wang, F Wu, S Tang, J Shao Proceedings of the 22nd ACM international conference on Multimedia, 901-904, 2014 | 35 | 2014 |
Accelerated masked transformer for dense video captioning Z Yu, N Han Neurocomputing 445, 72-80, 2021 | 22 | 2021 |