How2: a large-scale dataset for multimodal language understanding

Z Xing, Q Feng, H Chen, Q Dai, H Hu, H Xu… - ACM Computing …, 2024 - dl.acm.org

The recent wave of AI-generated content (AIGC) has witnessed substantial success in
computer vision, with the diffusion model playing a crucial role in this achievement. Due to …

Cited by 76 Related articles All 3 versions

A review of machine learning-based human activity recognition for diverse applications

F Kulsoom, S Narejo, Z Mehmood… - Neural Computing and …, 2022 - Springer

Human activity recognition (HAR) is a very active yet challenging and demanding area of
computer science. Due to the articulated nature of human motion, it is not trivial to detect …

Cited by 85 Related articles All 5 versions

InternVid: A large-scale video-text dataset for multimodal understanding and generation

Y Wang, Y He, Y Li, K Li, J Yu, X Ma, X Li… - arXiv preprint arXiv …, 2023 - arxiv.org

This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables
learning powerful and transferable video-text representations for multimodal understanding …

Cited by 196 Related articles All 4 versions

Advancing high-resolution video-language representation with large-scale video transcriptions

H Xue, T Hang, Y Zeng, Y Sun, B Liu… - Proceedings of the …, 2022 - openaccess.thecvf.com

We study joint video and language (VL) pre-training to enable cross-modality learning and
benefit plentiful downstream VL tasks. Existing works either extract low-quality video …

Cited by 184 Related articles All 5 versions

HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips

A Miech, D Zhukov, JB Alayrac… - Proceedings of the …, 2019 - openaccess.thecvf.com

Learning text-video embeddings usually requires a dataset of video clips with manually
provided captions. However, such datasets are expensive and time consuming to create and …

Cited by 1272 Related articles All 10 versions

VATEX: A large-scale, high-quality multilingual dataset for video-and-language research

X Wang, J Wu, J Chen, L Li… - Proceedings of the …, 2019 - openaccess.thecvf.com

We present a new large-scale multilingual video description dataset, VATEX, which contains
over 41,250 videos and 825,000 captions in both English and Chinese. Among the captions …

Cited by 598 Related articles All 8 versions

Align and attend: Multimodal summarization with dual contrastive losses

B He, J Wang, J Qiu, T Bui… - Proceedings of the …, 2023 - openaccess.thecvf.com

The goal of multimodal summarization is to extract the most important information from
different modalities to form summaries. Unlike unimodal summarization, the multimodal …

Cited by 56 Related articles All 7 versions

End-to-end learning of visual representations from uncurated instructional videos

A Miech, JB Alayrac, L Smaira… - Proceedings of the …, 2020 - openaccess.thecvf.com

Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video
models still rely on manually annotated data. With the recent introduction of the HowTo100M …

Cited by 820 Related articles All 15 versions

How2Sign: a large-scale multimodal dataset for continuous American Sign Language

A Duarte, S Palaskar, L Ventura… - Proceedings of the …, 2021 - openaccess.thecvf.com

One of the factors that have hindered progress in the areas of sign language recognition,
translation, and production is the absence of large annotated datasets. Towards this end, we …

Cited by 213 Related articles All 13 versions

Findings of the IWSLT 2022 Evaluation Campaign.

A Anastasopoulos, L Barrault, L Bentivogli… - Proceedings of the 19th …, 2022 - cris.fbk.eu

The evaluation campaign of the 19th International Conference on Spoken Language
Translation featured eight shared tasks: (i) Simultaneous speech translation, (ii) Offline …

Cited by 109 Related articles All 17 versions