MAE-AST: Masked Autoencoding Audio Spectrogram Transformer. A Baade, P Peng, D Harwath. Interspeech 2022. Cited by 93.
Word discovery in visually grounded, self-supervised speech models. P Peng, D Harwath. Interspeech 2022. Cited by 39.
Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization. P Peng, B Yan, S Watanabe, D Harwath. Interspeech 2023. Cited by 35.
Fast-slow transformer for visually grounding speech. P Peng, D Harwath. ICASSP 2022. Cited by 32.
Self-supervised representation learning for speech using visual grounding and masked language modeling. P Peng, D Harwath. AAAI 2022 SAS Workshop. Cited by 28.
A correspondence variational autoencoder for unsupervised acoustic word embeddings. P Peng, H Kamper, K Livescu. NeurIPS 2020 SAS Workshop. Cited by 17.
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild. P Peng, PY Huang, D Li, A Mohamed, D Harwath. arXiv preprint arXiv:2403.16973, 2024. Cited by 10.
AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models. Y Tseng, L Berry*, YT Chen*, I Chiu*, HH Lin*, M Liu*, P Peng*, YJ Shih*, ... Preprint, 2023. Cited by 6.
Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model. P Peng, SW Li, O Räsänen, A Mohamed, D Harwath. Interspeech 2023. Cited by 4.
BAT: Learning to Reason about Spatial Sounds with Large Language Models. Z Zheng, P Peng, Z Ma, X Chen, E Choi, D Harwath. arXiv preprint arXiv:2402.01591, 2024. Cited by 3.
Audio-Visual Neural Syntax Acquisition. CIJ Lai*, F Shi*, P Peng*, Y Kim, K Gimpel, S Chang, YS Chuang, S Bhati, ... ASRU 2023. Cited by 3.
SpeechCLIP+: Self-supervised multi-task representation learning for speech via CLIP and speech-image data. HF Wang, YJ Shih, HJ Chang, L Berry, P Peng, H Lee, HM Wang, ... arXiv preprint arXiv:2402.06959, 2024. Cited by 2.
Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos. C Hori, P Peng, D Harwath, X Liu, K Ota, S Jain, R Corcodel, D Jha, ... Interspeech 2023. Cited by 2.
Zero-shot Video Moment Retrieval With Off-the-Shelf Models. A Diwan*, P Peng*, RJ Mooney (* denotes equal contribution). NeurIPS 2022 TL4NLP. Cited by 2.
Textless phrase structure induction from visually-grounded speech. CI Lai, F Shi, P Peng, Y Kim, K Gimpel, S Chang, YS Chuang, S Bhati, ... 2023. Cited by 1.
Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos. C Chen, P Peng, A Baid, Z Xue, WN Hsu, D Harwath, K Grauman. arXiv preprint arXiv:2406.09272, 2024.
Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model. HC Fang, NX Ye, YJ Shih, P Peng, HF Wang, L Berry, H Lee, D Harwath. arXiv preprint arXiv:2402.05819, 2024.
Neural Codec Language Models for Disentangled and Textless Voice Conversion. A Baade, P Peng, D Harwath.