A survey on video diffusion models

Z Xing, Q Feng, H Chen, Q Dai, H Hu, H Xu… - ACM Computing …, 2024 - dl.acm.org
The recent wave of AI-generated content (AIGC) has witnessed substantial success in
computer vision, with the diffusion model playing a crucial role in this achievement. Due to …

A review of machine learning-based human activity recognition for diverse applications

F Kulsoom, S Narejo, Z Mehmood… - Neural Computing and …, 2022 - Springer
Human activity recognition (HAR) is a very active yet challenging and demanding area of
computer science. Due to the articulated nature of human motion, it is not trivial to detect …

InternVid: A large-scale video-text dataset for multimodal understanding and generation

Y Wang, Y He, Y Li, K Li, J Yu, X Ma, X Li… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables
learning powerful and transferable video-text representations for multimodal understanding …

Advancing high-resolution video-language representation with large-scale video transcriptions

H Xue, T Hang, Y Zeng, Y Sun, B Liu… - Proceedings of the …, 2022 - openaccess.thecvf.com
We study joint video and language (VL) pre-training to enable cross-modality learning and
benefit plentiful downstream VL tasks. Existing works either extract low-quality video …

HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips

A Miech, D Zhukov, JB Alayrac… - Proceedings of the …, 2019 - openaccess.thecvf.com
Learning text-video embeddings usually requires a dataset of video clips with manually
provided captions. However, such datasets are expensive and time consuming to create and …

VATEX: A large-scale, high-quality multilingual dataset for video-and-language research

X Wang, J Wu, J Chen, L Li… - Proceedings of the …, 2019 - openaccess.thecvf.com
We present a new large-scale multilingual video description dataset, VATEX, which contains
over 41,250 videos and 825,000 captions in both English and Chinese. Among the captions …

Align and attend: Multimodal summarization with dual contrastive losses

B He, J Wang, J Qiu, T Bui… - Proceedings of the …, 2023 - openaccess.thecvf.com
The goal of multimodal summarization is to extract the most important information from
different modalities to form summaries. Unlike unimodal summarization, the multimodal …

End-to-end learning of visual representations from uncurated instructional videos

A Miech, JB Alayrac, L Smaira… - Proceedings of the …, 2020 - openaccess.thecvf.com
Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video
models still rely on manually annotated data. With the recent introduction of the HowTo100M …

How2Sign: a large-scale multimodal dataset for continuous American Sign Language

A Duarte, S Palaskar, L Ventura… - Proceedings of the …, 2021 - openaccess.thecvf.com
One of the factors that have hindered progress in the areas of sign language recognition,
translation, and production is the absence of large annotated datasets. Towards this end, we …

Findings of the IWSLT 2022 Evaluation Campaign

A Anastasopoulos, L Barrault, L Bentivogli… - Proceedings of the 19th …, 2022 - cris.fbk.eu
The evaluation campaign of the 19th International Conference on Spoken Language
Translation featured eight shared tasks: (i) Simultaneous speech translation, (ii) Offline …