Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation

A Ephrat, I Mosseri, O Lang, T Dekel, K Wilson… - arXiv preprint arXiv …, 2018 - arxiv.org
We present a joint audio-visual model for isolating a single speech signal from a mixture of
sounds such as other speakers and background noise. Solving this task using only audio as …

Deep convolutional computation model for feature learning on big data in internet of things

P Li, Z Chen, LT Yang, Q Zhang… - IEEE Transactions on …, 2017 - ieeexplore.ieee.org
Currently, a large amount of industrial data, usually referred to as big data, is collected from
the Internet of Things (IoT). Big data are typically heterogeneous, i.e., each object in big datasets …

Ava active speaker: An audio-visual dataset for active speaker detection

J Roth, S Chaudhuri, O Klejch, R Marvin… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
Active speaker detection is an important component in video analysis algorithms for
applications such as speaker diarization, video re-targeting for meetings, speech …

Audio-visual biometric recognition and presentation attack detection: A comprehensive survey

H Mandalapu, AR PN, R Ramachandra, KS Rao… - IEEE …, 2021 - ieeexplore.ieee.org
Biometric recognition is a trending technology that uses unique characteristic data to
identify or verify/authenticate individuals in security applications. Amidst the classically used biometrics …

Look, listen and learn—A multimodal LSTM for speaker identification

J Ren, Y Hu, YW Tai, C Wang, L Xu, W Sun… - Proceedings of the AAAI …, 2016 - ojs.aaai.org
Speaker identification refers to the task of localizing the face of a person who has the same
identity as the ongoing voice in a video. This task not only requires collective perception …

Adaptive multimodal fusion for facial action units recognition

H Yang, T Wang, L Yin - Proceedings of the 28th ACM international …, 2020 - dl.acm.org
Multimodal facial action unit (AU) recognition aims to build models that are capable of
processing, correlating, and integrating information from multiple modalities (i.e., 2D images …

Online multi-modal person search in videos

J Xia, A Rao, Q Huang, L Xu, J Wen, D Lin - Computer Vision–ECCV 2020 …, 2020 - Springer
The task of searching for certain people in videos has seen increasing potential in real-world
applications, such as video organization and editing. Most existing approaches are devised …

A deep residual computation model for heterogeneous data learning in smart Internet of Things

H Yu, LT Yang, X Fan, Q Zhang - Applied Soft Computing, 2021 - Elsevier
The smart Internet of Things (smart IoT) has emerged as a transformative computing
paradigm recently. This new approach has made great contributions in the area of cyber …

Self-supervised learning for audio-visual speaker diarization

Y Ding, Y Xu, SX Zhang, Y Cong… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
Speaker diarization, which is to find the speech segments of specific speakers, has been
widely used in human-centered applications such as video conferences or human-computer …

Hybrid model-based emotion contextual recognition for cognitive assistance services

N Ayari, H Abdelkawy, A Chibani… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
Endowing ubiquitous robots with cognitive capabilities for recognizing emotions, sentiments,
affects, and moods of humans in their context is an important challenge, which requires …