Speaker diarization: A review of recent research

X Anguera, S Bozonnet, N Evans… - … on audio, speech …, 2012 - ieeexplore.ieee.org
Speaker diarization is the task of determining “who spoke when?” in an audio or video
recording that contains an unknown amount of speech and also an unknown number of …

Ego4d: Around the world in 3,000 hours of egocentric video

K Grauman, A Westbury, E Byrne… - Proceedings of the …, 2022 - openaccess.thecvf.com
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It
offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household …

Voxceleb: Large-scale speaker verification in the wild

A Nagrani, JS Chung, W Xie, A Zisserman - Computer Speech & Language, 2020 - Elsevier
The objective of this work is speaker recognition under noisy and unconstrained conditions.
We make two key contributions. First, we introduce a very large-scale audio-visual dataset …

SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization

B Gliwa, I Mochol, M Biesek, A Wawer - arXiv preprint arXiv:1911.12237, 2019 - arxiv.org
This paper introduces the SAMSum Corpus, a new dataset with abstractive dialogue
summaries. We investigate the challenges it poses for automated summarization by testing …

FakeAVCeleb: A novel audio-video multimodal deepfake dataset

H Khalid, S Tariq, M Kim, SS Woo - arXiv preprint arXiv:2108.05080, 2021 - arxiv.org
While significant advancements have been made in the generation of deepfakes using deep
learning technologies, their misuse is now a well-known issue. Deepfakes can cause severe …

Voxceleb: a large-scale speaker identification dataset

A Nagrani, JS Chung, A Zisserman - arXiv preprint arXiv:1706.08612, 2017 - arxiv.org
Most existing datasets for speaker identification contain samples obtained under quite
constrained conditions, and are usually hand-annotated, hence limited in size. The goal of …

Wavesplit: End-to-end speech separation by speaker clustering

N Zeghidour, D Grangier - IEEE/ACM Transactions on Audio …, 2021 - ieeexplore.ieee.org
We introduce Wavesplit, an end-to-end source separation system. From a single mixture, the
model infers a representation for each source and then estimates each source signal given …

A survey on semantic processing techniques

R Mao, K He, X Zhang, G Chen, J Ni, Z Yang… - Information …, 2024 - Elsevier
Semantic processing is a fundamental research domain in computational linguistics. In the
era of powerful pre-trained language models and large language models, the advancement …

Gender and dialect bias in YouTube's automatic captions

R Tatman - Proceedings of the first ACL workshop on ethics in …, 2017 - aclanthology.org
This project evaluates the accuracy of YouTube's automatically-generated captions across
two genders and five dialect groups. Speakers' dialect and gender were controlled for by …

MediaSum: A large-scale media interview dataset for dialogue summarization

C Zhu, Y Liu, J Mei, M Zeng - arXiv preprint arXiv:2103.06410, 2021 - arxiv.org
MediaSum is a large-scale media interview dataset consisting of 463.6K transcripts with
abstractive summaries. To create this dataset, we collect interview transcripts from NPR and …