TVQA: Localized, Compositional Video Question Answering. J Lei, L Yu, M Bansal, TL Berg. EMNLP 2018. Cited by 635.
Less Is More: ClipBERT for Video-and-Language Learning via Sparse Sampling. J Lei*, L Li*, L Zhou, Z Gan, TL Berg, M Bansal, J Liu. CVPR 2021 (Best Student Paper Honorable Mention). Cited by 634.
Unifying Vision-and-Language Tasks via Text Generation. J Cho, J Lei, H Tan, M Bansal. ICML 2021. Cited by 474.
TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval. J Lei, L Yu, TL Berg, M Bansal. ECCV 2020. Cited by 243.
TVQA+: Spatio-Temporal Grounding for Video Question Answering. J Lei, L Yu, TL Berg, M Bansal. ACL 2020. Cited by 230.
MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning. J Lei, L Wang, Y Shen, D Yu, TL Berg, M Bansal. ACL 2020. Cited by 186.
QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries. J Lei, TL Berg, M Bansal. NeurIPS 2021. Cited by 149*.
VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation. L Li*, J Lei*, Z Gan, L Yu, YC Chen, R Pillai, Y Cheng, L Zhou, XE Wang, ... NeurIPS 2021 Datasets and Benchmarks Track. Cited by 103.
Revealing Single Frame Bias for Video-and-Language Learning. J Lei, TL Berg, M Bansal. ACL 2023. Cited by 97.
Language Models with Image Descriptors Are Strong Few-Shot Video-Language Learners. Z Wang, M Li, R Xu, L Zhou, J Lei, X Lin, S Wang, Z Yang, C Zhu, ... NeurIPS 2022. Cited by 97.
VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning. H Tan*, J Lei*, T Wolf, M Bansal. CVPR 2022 Workshop on Transformers for Vision. Cited by 69.
Adversarial VQA: A New Benchmark for Evaluating the Robustness of VQA Models. L Li, J Lei, Z Gan, J Liu. ICCV 2021. Cited by 65.
What Is More Likely to Happen Next? Video-and-Language Future Event Prediction. J Lei, L Yu, TL Berg, M Bansal. EMNLP 2020. Cited by 60.
DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization. Z Tang*, J Lei*, M Bansal. NAACL 2021. Cited by 59.
VindLU: A Recipe for Effective Video-and-Language Pretraining. F Cheng, X Wang, J Lei, D Crandall, M Bansal, G Bertasius. CVPR 2023. Cited by 53.
Vision Transformers Are Parameter-Efficient Audio-Visual Learners. YB Lin, YL Sung, J Lei, M Bansal, G Bertasius. CVPR 2023. Cited by 50.
RESIN-11: Schema-Guided Event Prediction for 11 Newsworthy Scenarios. X Du, Z Zhang, S Li, P Yu, H Wang, T Lai, X Lin, Z Wang, I Liu, B Zhou, ... Proceedings of the 2022 Conference of the North American Chapter of the …, 2022. Cited by 30.
ECLIPSE: Efficient Long-Range Video Retrieval Using Sight and Sound. YB Lin, J Lei, M Bansal, G Bertasius. ECCV 2022 (Oral). Cited by 30.
Weakly Supervised Image Classification with Coarse and Fine Labels. J Lei, Z Guo, Y Wang. 14th Conference on Computer and Robot Vision (CRV), 2017, pp. 240-247. Cited by 24.
mTVR: Multilingual Moment Retrieval in Videos. J Lei, TL Berg, M Bansal. ACL 2021. Cited by 12.