High fidelity neural audio compression

A Défossez, J Copet, G Synnaeve, Y Adi - arXiv preprint arXiv:2210.13438, 2022 - arxiv.org
We introduce a state-of-the-art real-time, high-fidelity, audio codec leveraging neural
networks. It consists in a streaming encoder-decoder architecture with quantized latent …

SoundStream: An end-to-end neural audio codec

N Zeghidour, A Luebs, A Omran… - … on Audio, Speech …, 2021 - ieeexplore.ieee.org
We present SoundStream, a novel neural audio codec that can efficiently compress speech,
music and general audio at bitrates normally targeted by speech-tailored codecs …

From audio to photoreal embodiment: Synthesizing humans in conversations

E Ng, J Romero, T Bagautdinov, S Bai… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present a framework for generating full-bodied photorealistic avatars that gesture
according to the conversational dynamics of a dyadic interaction. Given speech audio we …

HiFi-Codec: Group-residual vector quantization for high fidelity audio codec

D Yang, S Liu, R Huang, J Tian, C Weng… - arXiv preprint arXiv …, 2023 - arxiv.org
Audio codec models are widely used in audio communication as a crucial technique for
compressing audio into discrete representations. Nowadays, audio codec models are …

kNN Classification: a review

PK Syriopoulos, NG Kalampalikis, SB Kotsiantis… - Annals of Mathematics …, 2023 - Springer
The k-nearest neighbors (k/NN) algorithm is a simple yet powerful non-parametric classifier
that is robust to noisy data and easy to implement. However, with the growing literature on …

LauraGPT: Listen, attend, understand, and regenerate audio with GPT

Z Du, J Wang, Q Chen, Y Chu, Z Gao, Z Li, K Hu… - arXiv preprint arXiv …, 2023 - arxiv.org
Generative Pre-trained Transformer (GPT) models have achieved remarkable performance
on various natural language processing tasks, and have shown great potential as …

Audio-visual speech codecs: Rethinking audio-visual speech enhancement by re-synthesis

K Yang, D Marković, S Krenn… - Proceedings of the …, 2022 - openaccess.thecvf.com
Since facial actions such as lip movements contain significant information about speech
content, it is not surprising that audio-visual speech enhancement methods are more …

AV-RIR: Audio-visual room impulse response estimation

A Ratnarajah, S Ghosh, S Kumar… - Proceedings of the …, 2024 - openaccess.thecvf.com
Accurate estimation of Room Impulse Response (RIR) which captures an
environment's acoustic properties is important for speech processing and AR/VR …

Make-A-Voice: Unified voice synthesis with discrete representation

R Huang, C Zhang, Y Wang, D Yang, L Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Various applications of voice synthesis have been developed independently despite the fact
that they generate "voice" as output in common. In addition, the majority of voice synthesis …

Behavior generation with latent actions

S Lee, Y Wang, H Etukuru, HJ Kim… - arXiv preprint arXiv …, 2024 - arxiv.org
Generative modeling of complex behaviors from labeled datasets has been a longstanding
problem in decision making. Unlike language or image generation, decision making …