Just ask: Learning to answer questions from millions of narrated videos
Recent methods for visual question answering rely on large-scale annotated datasets.
Manual annotation of questions and answers for videos, however, is tedious, expensive and …
MAD: A scalable dataset for language grounding in videos from movie audio descriptions
The recent and increasing interest in video-language research has driven the development
of large-scale datasets that enable data-intensive machine learning techniques. In …
StepFormer: Self-supervised step discovery and localization in instructional videos
Instructional videos are an important resource to learn procedural tasks from human
demonstrations. However, the instruction steps in such videos are typically short and sparse …
Punctuation restoration using transformer models for high- and low-resource languages
Punctuation restoration is a common post-processing problem for Automatic Speech
Recognition (ASR) systems. It is important to improve the readability of the transcribed text …
Attention-based parallel networks (APNet) for PM2.5 spatiotemporal prediction
Urban particulate matter forecast is an important part of air pollution early warning and
control management, especially the forecast of fine particulate matter (PM 2.5). However, the …
Learning to answer visual questions from web videos
Recent methods for visual question answering rely on large-scale annotated datasets.
Manual annotation of questions and answers for videos, however, is tedious, expensive and …
Learning to segment actions from visual and language instructions via differentiable weak sequence alignment
We address the problem of unsupervised localization of key-steps and feature learning in
instructional videos using both visual and language instructions. Our key observation is that …
Advanced rich transcription system for Estonian speech
This paper describes the current TTÜ speech transcription system for Estonian speech. The
system is designed to handle semi-spontaneous speech, such as broadcast conversations …
Towards automatic detection of misinformation in online medical videos
Recent years have witnessed a significant increase in the online sharing of medical
information, with videos representing a large fraction of such online sources. Previous …
Capitalization and punctuation restoration: a survey
Ensuring proper punctuation and letter casing is a key pre-processing step towards applying
complex natural language processing algorithms. This is especially significant for textual …