Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation

A Ephrat, I Mosseri, O Lang, T Dekel, K Wilson… - arXiv preprint arXiv …, 2018 - arxiv.org
We present a joint audio-visual model for isolating a single speech signal from a mixture of
sounds such as other speakers and background noise. Solving this task using only audio as …

Deep convolutional computation model for feature learning on big data in internet of things

P Li, Z Chen, LT Yang, Q Zhang… - IEEE Transactions on …, 2017 - ieeexplore.ieee.org
Currently, a large amount of industrial data, usually referred to as big data, is collected from
the Internet of Things (IoT). Big data are typically heterogeneous, i.e., each object in big datasets …

Ava active speaker: An audio-visual dataset for active speaker detection

J Roth, S Chaudhuri, O Klejch, R Marvin… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
Active speaker detection is an important component in video analysis algorithms for
applications such as speaker diarization, video re-targeting for meetings, speech …

Audio-visual biometric recognition and presentation attack detection: A comprehensive survey

H Mandalapu, AR PN, R Ramachandra, KS Rao… - IEEE …, 2021 - ieeexplore.ieee.org
Biometric recognition is a trending technology that uses unique characteristic data to
identify or verify/authenticate individuals in security applications. Amidst the classically used biometrics …

Look, listen and learn—A multimodal LSTM for speaker identification

J Ren, Y Hu, YW Tai, C Wang, L Xu, W Sun… - Proceedings of the AAAI …, 2016 - ojs.aaai.org
Speaker identification refers to the task of localizing the face of a person who has the same
identity as the ongoing voice in a video. This task not only requires collective perception …

Adaptive multimodal fusion for facial action units recognition

H Yang, T Wang, L Yin - Proceedings of the 28th ACM international …, 2020 - dl.acm.org
Multimodal facial action unit (AU) recognition aims to build models that are capable of
processing, correlating, and integrating information from multiple modalities (i.e., 2D images …

Online multi-modal person search in videos

J Xia, A Rao, Q Huang, L Xu, J Wen, D Lin - Computer Vision–ECCV 2020 …, 2020 - Springer
The task of searching for certain people in videos has seen increasing potential in real-world
applications, such as video organization and editing. Most existing approaches are devised …

A deep residual computation model for heterogeneous data learning in smart Internet of Things

H Yu, LT Yang, X Fan, Q Zhang - Applied Soft Computing, 2021 - Elsevier
The smart Internet of Things (smart IoT) has emerged as a transformative computing
paradigm recently. This new approach has made great contributions in the area of cyber …

Self-supervised learning for audio-visual speaker diarization

Y Ding, Y Xu, SX Zhang, Y Cong… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
Speaker diarization, which is to find the speech segments of specific speakers, has been
widely used in human-centered applications such as video conferences or human-computer …

Hybrid model-based emotion contextual recognition for cognitive assistance services

N Ayari, H Abdelkawy, A Chibani… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
Endowing ubiquitous robots with cognitive capabilities for recognizing emotions, sentiments,
affects, and moods of humans in their context is an important challenge, which requires …