Just ask: Learning to answer questions from millions of narrated videos

A Yang, A Miech, J Sivic, I Laptev… - Proceedings of the …, 2021 - openaccess.thecvf.com
Recent methods for visual question answering rely on large-scale annotated datasets.
Manual annotation of questions and answers for videos, however, is tedious, expensive and …

MAD: A scalable dataset for language grounding in videos from movie audio descriptions

M Soldan, A Pardo, JL Alcázar… - Proceedings of the …, 2022 - openaccess.thecvf.com
The recent and increasing interest in video-language research has driven the development
of large-scale datasets that enable data-intensive machine learning techniques. In …

StepFormer: Self-supervised step discovery and localization in instructional videos

N Dvornik, I Hadji, R Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Instructional videos are an important resource to learn procedural tasks from human
demonstrations. However, the instruction steps in such videos are typically short and sparse …

Punctuation restoration using transformer models for high- and low-resource languages

T Alam, A Khan, F Alam - Proceedings of the Sixth Workshop on …, 2020 - aclanthology.org
Punctuation restoration is a common post-processing problem for Automatic Speech
Recognition (ASR) systems. It is important to improve the readability of the transcribed text …

Attention-based parallel networks (APNet) for PM2.5 spatiotemporal prediction

J Zhu, F Deng, J Zhao, H Zheng - Science of The Total Environment, 2021 - Elsevier
Urban particulate matter forecast is an important part of air pollution early warning and
control management, especially the forecast of fine particulate matter (PM2.5). However, the …

Learning to answer visual questions from web videos

A Yang, A Miech, J Sivic, I Laptev, C Schmid - arXiv preprint arXiv …, 2022 - arxiv.org
Recent methods for visual question answering rely on large-scale annotated datasets.
Manual annotation of questions and answers for videos, however, is tedious, expensive and …

Learning to segment actions from visual and language instructions via differentiable weak sequence alignment

Y Shen, L Wang, E Elhamifar - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
We address the problem of unsupervised localization of key-steps and feature learning in
instructional videos using both visual and language instructions. Our key observation is that …

Advanced rich transcription system for Estonian speech

T Alumäe, O Tilk - Human language technologies–the Baltic …, 2018 - ebooks.iospress.nl
This paper describes the current TTÜ speech transcription system for Estonian speech. The
system is designed to handle semi-spontaneous speech, such as broadcast conversations …

Towards automatic detection of misinformation in online medical videos

R Hou, V Pérez-Rosas, S Loeb… - … International conference on …, 2019 - dl.acm.org
Recent years have witnessed a significant increase in the online sharing of medical
information, with videos representing a large fraction of such online sources. Previous …

Capitalization and punctuation restoration: a survey

V Păiş, D Tufiş - Artificial Intelligence Review, 2022 - Springer
Ensuring proper punctuation and letter casing is a key pre-processing step towards applying
complex natural language processing algorithms. This is especially significant for textual …