FoleyCrafter: Bring silent videos to life with lifelike and synchronized sounds

Y Zhang, Y Gu, Y Zeng, Z Xing, Y Wang, Z Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
We study Neural Foley, the automatic generation of high-quality sound effects synchronizing
with videos, enabling an immersive audio-visual experience. Despite its wide range of …

Improving text-to-audio models with synthetic captions

Z Kong, S Lee, D Ghosal, N Majumder… - arXiv preprint arXiv …, 2024 - arxiv.org
It is an open challenge to obtain high-quality training data, especially captions, for text-to-
audio models. Although prior methods have leveraged text-only language models to …

Images that Sound: Composing Images and Sounds on a Single Canvas

Z Chen, D Geng, A Owens - arXiv preprint arXiv:2405.12221, 2024 - arxiv.org
Spectrograms are 2D representations of sound that look very different from the images found
in our visual world. And natural images, when played as spectrograms, make unnatural …
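
Aside: the core idea this entry describes, reading an image as a magnitude spectrogram and playing it back as audio, can be sketched in a few lines. This is an illustrative sketch only, not the paper's method (the paper composes the two modalities with a diffusion model); the file names and STFT settings below are assumptions.

```python
# Minimal sketch of the image-as-spectrogram idea (not the paper's method):
# interpret a grayscale image as an STFT magnitude and invert it to a
# waveform with Griffin-Lim phase estimation.
import numpy as np
import librosa
import soundfile as sf
from PIL import Image

n_fft = 1022                                  # gives n_fft // 2 + 1 = 512 frequency bins
frames = 256                                  # number of STFT frames (time axis)

img = Image.open("canvas.png").convert("L")   # hypothetical input image
img = img.resize((frames, n_fft // 2 + 1))    # PIL resize takes (width, height)
mag = np.asarray(img, dtype=np.float32) / 255.0   # shape (freq, time), values in [0, 1]
mag = mag ** 2.0                              # compress so bright pixels carry most energy

# Griffin-Lim iteratively recovers a plausible phase for the magnitude-only input.
wav = librosa.griffinlim(mag, n_iter=64, n_fft=n_fft, hop_length=n_fft // 4)
sf.write("canvas.wav", wav, samplerate=22050)
```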

Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition

Z Wang, YW Tai, CK Tang - arXiv preprint arXiv:2410.03335, 2024 - arxiv.org
We introduce Audio-Agent, a multimodal framework for audio generation, editing and
composition based on text or video inputs. Conventional approaches for text-to-audio (TTA) …

Video-to-Audio Generation with Fine-grained Temporal Semantics

Y Hu, Y Gu, C Li, R Chen, D Yu - arXiv preprint arXiv:2409.14709, 2024 - arxiv.org
With recent advances in AIGC, video generation has gained a surge of research interest in
both academia and industry (e.g., Sora). However, it remains a challenge to produce …

Challenge on Sound Scene Synthesis: Evaluating Text-to-Audio Generation

J Lee, M Tailleur, LM Heller, K Choi… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite significant advancements in neural text-to-audio generation, challenges persist in
controllability and evaluation. This paper addresses these issues through the Sound Scene …

SEE-2-SOUND: Zero-Shot Spatial Environment-to-Spatial Sound

R Dagli, S Prakash, R Wu, H Khosravani - arXiv preprint arXiv:2406.06612, 2024 - arxiv.org
Generating combined visual and auditory sensory experiences is critical for the consumption
of immersive content. Recent advances in neural generative models have enabled the …

Editing Music with Melody and Text: Using ControlNet for Diffusion Transformer

S Hou, S Liu, R Yuan, W Xue, Y Shan, M Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the significant progress in controllable music generation and editing, challenges
remain in the quality and length of generated music due to the use of Mel-spectrogram …

Sound of Vision: Audio Generation from Visual Text Embedding through Training Domain Discriminator

J Kim, WG Choi, S Ahn, JH Chang - Proc. Interspeech 2024, 2024 - isca-archive.org
Recent advancements in text-to-audio (TTA) models have demonstrated their ability to
generate sound that aligns with user intentions. Despite this advancement, a notable …

AudioEditor: A Training-Free Diffusion-Based Audio Editing Framework

Y Jia, Y Chen, J Zhao, S Zhao, W Zeng, Y Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Diffusion-based text-to-audio (TTA) generation has made substantial progress, leveraging
latent diffusion models (LDMs) to produce high-quality, diverse, and instruction-relevant …