A survey on video diffusion models
The recent wave of AI-generated content (AIGC) has witnessed substantial success in
computer vision, with the diffusion model playing a crucial role in this achievement. Due to …
A review of machine learning-based human activity recognition for diverse applications
Human activity recognition (HAR) is a very active yet challenging and demanding area of
computer science. Due to the articulated nature of human motion, it is not trivial to detect …
InternVid: A large-scale video-text dataset for multimodal understanding and generation
This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables
learning powerful and transferable video-text representations for multimodal understanding …
Advancing high-resolution video-language representation with large-scale video transcriptions
We study joint video and language (VL) pre-training to enable cross-modality learning and
benefit plentiful downstream VL tasks. Existing works either extract low-quality video …
HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips
Learning text-video embeddings usually requires a dataset of video clips with manually
provided captions. However, such datasets are expensive and time consuming to create and …
VATEX: A large-scale, high-quality multilingual dataset for video-and-language research
We present a new large-scale multilingual video description dataset, VATEX, which contains
over 41,250 videos and 825,000 captions in both English and Chinese. Among the captions …
Align and attend: Multimodal summarization with dual contrastive losses
The goal of multimodal summarization is to extract the most important information from
different modalities to form summaries. Unlike unimodal summarization, the multimodal …
End-to-end learning of visual representations from uncurated instructional videos
Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video
models still rely on manually annotated data. With the recent introduction of the HowTo100M …
How2Sign: a large-scale multimodal dataset for continuous American Sign Language
One of the factors that have hindered progress in the areas of sign language recognition,
translation, and production is the absence of large annotated datasets. Towards this end, we …
Findings of the IWSLT 2022 Evaluation Campaign.
The evaluation campaign of the 19th International Conference on Spoken Language
Translation featured eight shared tasks: (i) Simultaneous speech translation, (ii) Offline …