AVLnet: Learning audio-visual language representations from instructional videos

A Rouditchenko, A Boggust, D Harwath, B Chen… - arXiv preprint arXiv …, 2020 - arxiv.org
Current methods for learning visually grounded language from videos often rely on text
annotation, such as human-generated captions or machine-generated automatic speech …

Learning hierarchical discrete linguistic units from visually-grounded speech

D Harwath, WN Hsu, J Glass - arXiv preprint arXiv:1911.09602, 2019 - arxiv.org
In this paper, we present a method for learning discrete linguistic units by incorporating
vector quantization layers into neural models of visually grounded speech. We show that our …
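The snippet names a concrete mechanism: a vector quantization layer inside a neural speech model. As an illustration only, here is a minimal VQ-VAE-style quantizer sketch in PyTorch with a straight-through estimator; the class name, codebook size, and loss weighting are assumptions for this example, not the architecture from Harwath et al.

import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal VQ layer: snaps continuous frames to their nearest codebook
    entries, with a straight-through estimator so gradients reach the encoder."""
    def __init__(self, num_codes=256, dim=128, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight (illustrative value)

    def forward(self, z):                      # z: (batch, time, dim)
        flat = z.reshape(-1, z.size(-1))       # (batch*time, dim)
        # squared L2 distance from every frame to every codebook vector
        d = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(1))
        idx = d.argmin(dim=1)                  # discrete unit IDs per frame
        q = self.codebook(idx).view_as(z)
        # codebook + commitment terms, VQ-VAE style
        loss = ((q - z.detach()).pow(2).mean()
                + self.beta * (z - q.detach()).pow(2).mean())
        q = z + (q - z).detach()               # straight-through gradient
        return q, idx.view(z.shape[:-1]), loss

Applied framewise to encoder outputs, the returned indices could serve as the discrete linguistic units the abstract refers to.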

Trilingual semantic embeddings of visually grounded speech with self-attention mechanisms

Y Ohishi, A Kimura, T Kawanishi… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
We propose a trilingual semantic embedding model that associates visual objects in images
with segments of speech signals corresponding to spoken words in an unsupervised …

The complementarity of a diverse range of deep learning features extracted from video content for video recommendation

A Almeida, JP de Villiers, A De Freitas… - Expert Systems with …, 2022 - Elsevier
Following the popularisation of media streaming, a number of video streaming services
continuously buy new video content to mine its potential profit. As such, the …

Pair Expansion for Learning Multilingual Semantic Embeddings Using Disjoint Visually-Grounded Speech Audio Datasets

Y Ohishi, A Kimura, T Kawanishi, K Kashino… - …, 2020 - isca-archive.org
We propose a data expansion method for learning a multilingual semantic embedding
model using disjoint datasets containing images and their multilingual audio captions. Here …

Cascaded multilingual audio-visual learning from videos

A Rouditchenko, A Boggust, D Harwath… - arXiv preprint arXiv …, 2021 - arxiv.org
In this paper, we explore self-supervised audio-visual models that learn from instructional
videos. Prior work has shown that these models can relate spoken words and sounds to …
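Several entries above (AVLnet, the trilingual and multilingual embedding models, and this cascaded model) share a common recipe: encode each modality separately, then train so that paired clips score higher against each other than against mismatched ones. Below is a minimal sketch of a symmetric InfoNCE-style contrastive objective over precomputed clip embeddings; the function name, temperature, and (batch, dim) layout are illustrative assumptions, not details taken from any of these papers.

import torch
import torch.nn.functional as F

def audio_video_nce(audio_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired clips: each audio embedding
    should score highest against its own video clip, and vice versa.
    audio_emb, video_emb: (batch, dim) tensors, paired by row index."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

In use, audio_emb and video_emb would come from modality-specific encoders applied to the same batch of clips, so that row i of each tensor describes the same clip.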

Multimodal Learning from Videos: Exploring Models and Task Complexities

S Palaskar - 2022 - kilthub.cmu.edu
Human learning is inherently multimodal. We watch, listen, read, and communicate to learn
from and understand our surroundings. There have been several advancements in machine …

Grounded sequence to sequence transduction

L Specia, L Barrault, O Caglayan… - IEEE Journal of …, 2020 - ieeexplore.ieee.org
Speech recognition and machine translation have made major progress over the past
decades, providing practical systems to map one language sequence to another. Although …

Unsupervised co-segmentation for athlete movements and live commentaries using crossmodal temporal proximity

Y Ohishi, Y Tanaka, K Kashino - 2020 25th International …, 2021 - ieeexplore.ieee.org
Audio-visual co-segmentation is a task to extract segments and regions corresponding to
specific events on unlabeled audio and video signals. It is particularly important to …

Leveraging the Multimodal Information from Video Content for Video Recommendation

ARL De Almeida - 2021 - search.proquest.com
Since the popularisation of media streaming, a number of video streaming services have been
continually buying new video content to mine its potential profit. As such, newly added …