AVLnet: Learning audio-visual language representations from instructional videos
Current methods for learning visually grounded language from videos often rely on text
annotation, such as human-generated captions or machine-generated automatic speech …
Learning hierarchical discrete linguistic units from visually-grounded speech
In this paper, we present a method for learning discrete linguistic units by incorporating
vector quantization layers into neural models of visually grounded speech. We show that our …
Trilingual semantic embeddings of visually grounded speech with self-attention mechanisms
Y Ohishi, A Kimura, T Kawanishi… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
We propose a trilingual semantic embedding model that associates visual objects in images
with segments of speech signals corresponding to spoken words in an unsupervised …
The complementarity of a diverse range of deep learning features extracted from video content for video recommendation
A Almeida, JP de Villiers, A De Freitas… - Expert Systems with …, 2022 - Elsevier
Following the popularisation of media streaming, a number of video streaming services are
continuously buying new video content to mine the potential profit from them. As such, the …
[PDF][PDF] Pair Expansion for Learning Multilingual Semantic Embeddings Using Disjoint Visually-Grounded Speech Audio Datasets.
We propose a data expansion method for learning a multilingual semantic embedding
model using disjoint datasets containing images and their multilingual audio captions. Here …
Cascaded multilingual audio-visual learning from videos
In this paper, we explore self-supervised audio-visual models that learn from instructional
videos. Prior work has shown that these models can relate spoken words and sounds to …
[PDF][PDF] Multimodal Learning from Videos: Exploring Models and Task Complexities
S Palaskar - 2022 - kilthub.cmu.edu
Human learning is inherently multimodal. We watch, listen, read, and communicate to learn
from and understand our surroundings. There have been several advancements in machine …
Grounded sequence to sequence transduction
Speech recognition and machine translation have made major progress over the past
decades, providing practical systems to map one language sequence to another. Although …
Unsupervised co-segmentation for athlete movements and live commentaries using crossmodal temporal proximity
Y Ohishi, Y Tanaka, K Kashino - 2020 25th International …, 2021 - ieeexplore.ieee.org
Audio-visual co-segmentation is a task to extract segments and regions corresponding to
specific events on unlabeled audio and video signals. It is particularly important to …
Leveraging the Multimodal Information from Video Content for Video Recommendation
ARL De Almeida - 2021 - search.proquest.com
Since the popularisation of media streaming, a number of video streaming services are
continually buying new video content to mine the potential profit. As such, newly added …