FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds
We study Neural Foley, the automatic generation of high-quality sound effects synchronizing
with videos, enabling an immersive audio-visual experience. Despite its wide range of …
Improving text-to-audio models with synthetic captions
It is an open challenge to obtain high quality training data, especially captions, for text-to-
audio models. Although prior methods have leveraged text-only language models to …
Images that Sound: Composing Images and Sounds on a Single Canvas
Spectrograms are 2D representations of sound that look very different from the images found
in our visual world. And natural images, when played as spectrograms, make unnatural …
Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition
We introduce Audio-Agent, a multimodal framework for audio generation, editing and
composition based on text or video inputs. Conventional approaches for text-to-audio (TTA) …
Video-to-Audio Generation with Fine-grained Temporal Semantics
With recent advances in AIGC, video generation has gained a surge of research interest in
both academia and industry (e.g., Sora). However, it remains a challenge to produce …
Challenge on Sound Scene Synthesis: Evaluating Text-to-Audio Generation
Despite significant advancements in neural text-to-audio generation, challenges persist in
controllability and evaluation. This paper addresses these issues through the Sound Scene …
SEE-2-SOUND: Zero-Shot Spatial Environment-to-Spatial Sound
Generating combined visual and auditory sensory experiences is critical for the consumption
of immersive content. Recent advances in neural generative models have enabled the …
Editing Music with Melody and Text: Using ControlNet for Diffusion Transformer
Despite the significant progress in controllable music generation and editing, challenges
remain in the quality and length of generated music due to the use of Mel-spectrogram …
Sound of Vision: Audio Generation from Visual Text Embedding through Training Domain Discriminator
Recent advancements in text-to-audio (TTA) models have demonstrated their ability to
generate sound that aligns with user intentions. Despite this advancement, a notable …
AudioEditor: A Training-Free Diffusion-Based Audio Editing Framework
Y Jia, Y Chen, J Zhao, S Zhao, W Zeng, Y Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Diffusion-based text-to-audio (TTA) generation has made substantial progress, leveraging
latent diffusion model (LDM) to produce high-quality, diverse and instruction-relevant …