AVLnet: Learning audio-visual language representations from instructional videos

A Rouditchenko, A Boggust, D Harwath, B Chen… - arXiv preprint arXiv …, 2020 - arxiv.org
Current methods for learning visually grounded language from videos often rely on text
annotation, such as human-generated captions or machine-generated automatic speech …

Learning hierarchical discrete linguistic units from visually-grounded speech

D Harwath, WN Hsu, J Glass - arXiv preprint arXiv:1911.09602, 2019 - arxiv.org
In this paper, we present a method for learning discrete linguistic units by incorporating
vector quantization layers into neural models of visually grounded speech. We show that our …
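The snippet names a concrete mechanism: a vector quantization layer inside a neural speech model. As an illustration only, here is a minimal VQ-VAE-style quantizer sketch in PyTorch with a straight-through estimator; the class name, codebook size, and loss weighting are assumptions for this example, not the architecture from Harwath et al.

import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal VQ layer: snaps continuous frames to their nearest codebook
    entries, with a straight-through estimator so gradients reach the encoder."""
    def __init__(self, num_codes=256, dim=128, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight (illustrative value)

    def forward(self, z):                      # z: (batch, time, dim)
        flat = z.reshape(-1, z.size(-1))       # (batch*time, dim)
        # squared L2 distance from every frame to every codebook vector
        d = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(1))
        idx = d.argmin(dim=1)                  # discrete unit IDs per frame
        q = self.codebook(idx).view_as(z)
        # codebook + commitment terms, VQ-VAE style
        loss = ((q - z.detach()).pow(2).mean()
                + self.beta * (z - q.detach()).pow(2).mean())
        q = z + (q - z).detach()               # straight-through gradient
        return q, idx.view(z.shape[:-1]), loss

Applied framewise to encoder outputs, the returned indices could serve as the discrete linguistic units the abstract refers to.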

Trilingual semantic embeddings of visually grounded speech with self-attention mechanisms

Y Ohishi, A Kimura, T Kawanishi… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
We propose a trilingual semantic embedding model that associates visual objects in images
with segments of speech signals corresponding to spoken words in an unsupervised …

The complementarity of a diverse range of deep learning features extracted from video content for video recommendation

A Almeida, JP de Villiers, A De Freitas… - Expert Systems with …, 2022 - Elsevier
Following the popularisation of media streaming, a number of video streaming services
continuously buy new video content to mine its potential profit. As such, the …

Pair Expansion for Learning Multilingual Semantic Embeddings Using Disjoint Visually-Grounded Speech Audio Datasets

Y Ohishi, A Kimura, T Kawanishi, K Kashino… - …, 2020 - isca-archive.org
We propose a data expansion method for learning a multilingual semantic embedding
model using disjoint datasets containing images and their multilingual audio captions. Here …

Cascaded multilingual audio-visual learning from videos

A Rouditchenko, A Boggust, D Harwath… - arXiv preprint arXiv …, 2021 - arxiv.org
In this paper, we explore self-supervised audio-visual models that learn from instructional
videos. Prior work has shown that these models can relate spoken words and sounds to …
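Several entries above (AVLnet, the trilingual and multilingual embedding models, and this cascaded model) share a common recipe: encode each modality separately, then train so that paired clips score higher against each other than against mismatched ones. Below is a minimal sketch of a symmetric InfoNCE-style contrastive objective over precomputed clip embeddings; the function name, temperature, and (batch, dim) layout are illustrative assumptions, not details taken from any of these papers.

import torch
import torch.nn.functional as F

def audio_video_nce(audio_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired clips: each audio embedding
    should score highest against its own video clip, and vice versa.
    audio_emb, video_emb: (batch, dim) tensors, paired by row index."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

In use, audio_emb and video_emb would come from modality-specific encoders applied to the same batch of clips, so that row i of each tensor describes the same clip.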

Multimodal Learning from Videos: Exploring Models and Task Complexities

S Palaskar - 2022 - kilthub.cmu.edu
Human learning is inherently multimodal. We watch, listen, read, and communicate to learn
from and understand our surroundings. There have been several advancements in machine …

Grounded sequence to sequence transduction

L Specia, L Barrault, O Caglayan… - IEEE Journal of …, 2020 - ieeexplore.ieee.org
Speech recognition and machine translation have made major progress over the past
decades, providing practical systems to map one language sequence to another. Although …

Unsupervised co-segmentation for athlete movements and live commentaries using crossmodal temporal proximity

Y Ohishi, Y Tanaka, K Kashino - 2020 25th International …, 2021 - ieeexplore.ieee.org
Audio-visual co-segmentation is a task to extract segments and regions corresponding to
specific events on unlabeled audio and video signals. It is particularly important to …

Leveraging the Multimodal Information from Video Content for Video Recommendation

ARL De Almeida - 2021 - search.proquest.com
Since the popularisation of media streaming, a number of video streaming services have been
continually buying new video content to mine its potential profit. As such, newly added …