Recent developments on espnet toolkit boosted by conformer
In this study, we present recent developments on ESPnet: End-to-End Speech Processing
toolkit, which mainly involves a recently proposed architecture called Conformer …
toolkit, which mainly involves a recently proposed architecture called Conformer …
[HTML][HTML] Self-attentive speaker embeddings for text-independent speaker verification.
摘要This paper introduces a new method to extract speaker embed-dings from a deep
neural network (DNN) for text-independent speaker verification. Usually, speaker …
neural network (DNN) for text-independent speaker verification. Usually, speaker …
Split computing and early exiting for deep learning applications: Survey and research challenges
Mobile devices such as smartphones and autonomous vehicles increasingly rely on deep
neural networks (DNNs) to execute complex inference tasks such as image classification …
neural networks (DNNs) to execute complex inference tasks such as image classification …
Scaling speech technology to 1,000+ languages
Expanding the language coverage of speech technology has the potential to improve
access to information for many more people. However, current speech technology is …
access to information for many more people. However, current speech technology is …
Neural codec language models are zero-shot text to speech synthesizers
We introduce a language modeling approach for text to speech synthesis (TTS). Specifically,
we train a neural codec language model (called Vall-E) using discrete codes derived from …
we train a neural codec language model (called Vall-E) using discrete codes derived from …
A high-performance speech neuroprosthesis
Speech brain–computer interfaces (BCIs) have the potential to restore rapid communication
to people with paralysis by decoding neural activity evoked by attempted speech into text, or …
to people with paralysis by decoding neural activity evoked by attempted speech into text, or …
Voicebox: Text-guided multilingual universal speech generation at scale
Large-scale generative models such as GPT and DALL-E have revolutionized the research
community. These models not only generate high fidelity outputs, but are also generalists …
community. These models not only generate high fidelity outputs, but are also generalists …
Ego4d: Around the world in 3,000 hours of egocentric video
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It
offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household …
offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household …
Masked autoencoders that listen
This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-
supervised representation learning from audio spectrograms. Following the Transformer …
supervised representation learning from audio spectrograms. Following the Transformer …
SpeechBrain: A general-purpose speech toolkit
SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the
research and development of neural speech processing technologies by being simple …
research and development of neural speech processing technologies by being simple …