Unsupervised learning of spoken language with visual context D Harwath, A Torralba, J Glass Advances in neural information processing systems 29, 2016 | 291 | 2016 |
Jointly discovering visual objects and spoken words from raw sensory input D Harwath, A Recasens, D Surís, G Chuang, A Torralba, J Glass Proceedings of the European conference on computer vision (ECCV), 649-665, 2018 | 234 | 2018 |
Deep multimodal semantic embeddings for speech and images D Harwath, J Glass 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU …, 2015 | 180 | 2015 |
AVLnet: Learning audio-visual language representations from instructional videos A Rouditchenko, A Boggust, D Harwath, B Chen, D Joshi, S Thomas, ... arXiv preprint arXiv:2006.09199, 2020 | 141 | 2020 |
Everything at once - multi-modal fusion transformer for video retrieval N Shvetsova, B Chen, A Rouditchenko, S Thomas, B Kingsbury, RS Feris, ... Proceedings of the IEEE/CVF conference on computer vision and pattern …, 2022 | 136 | 2022 |
A summary of the 2012 JHU CLSP workshop on zero resource speech technologies and models of early language acquisition A Jansen, E Dupoux, S Goldwater, M Johnson, S Khudanpur, K Church, ... 2013 IEEE International Conference on Acoustics, Speech and Signal …, 2013 | 120 | 2013 |
Learning word-like units from joint audio-visual analysis D Harwath, JR Glass arXiv preprint arXiv:1701.07481, 2017 | 118 | 2017 |
Learning hierarchical discrete linguistic units from visually-grounded speech D Harwath, WN Hsu, J Glass arXiv preprint arXiv:1911.09602, 2019 | 99 | 2019 |
Contrastive audio-visual masked autoencoder Y Gong, A Rouditchenko, AH Liu, D Harwath, L Karlinsky, H Kuehne, ... arXiv preprint arXiv:2210.07839, 2022 | 95 | 2022 |
MAE-AST: Masked autoencoding audio spectrogram transformer A Baade, P Peng, D Harwath arXiv preprint arXiv:2203.16691, 2022 | 84 | 2022 |
Multimodal clustering networks for self-supervised learning from unlabeled videos B Chen, A Rouditchenko, K Duarte, H Kuehne, S Thomas, A Boggust, ... Proceedings of the IEEE/CVF International Conference on Computer Vision …, 2021 | 78 | 2021 |
Text-free image-to-speech synthesis using learned segmental units WN Hsu, D Harwath, C Song, J Glass arXiv preprint arXiv:2012.15454, 2020 | 68 | 2020 |
Vision as an interlingua: Learning multilingual semantic embeddings of untranscribed speech D Harwath, G Chuang, J Glass 2018 IEEE International Conference on Acoustics, Speech and Signal …, 2018 | 66 | 2018 |
Spoken moments: Learning joint audio-visual representations from video descriptions M Monfort, SY Jin, A Liu, D Harwath, R Feris, J Glass, A Oliva Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern …, 2021 | 64 | 2021 |
Towards visually grounded sub-word speech unit discovery D Harwath, J Glass ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and …, 2019 | 43 | 2019 |
Why is Winoground hard? Investigating failures in visuolinguistic compositionality A Diwan, L Berry, E Choi, D Harwath, K Mahowald arXiv preprint arXiv:2211.00768, 2022 | 39 | 2022 |
Word discovery in visually grounded, self-supervised speech models P Peng, D Harwath arXiv preprint arXiv:2203.15081, 2022 | 38 | 2022 |
Learning modality-invariant representations for speech and images K Leidal, D Harwath, J Glass 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU …, 2017 | 33 | 2017 |
Look, listen, and decode: Multimodal speech recognition with images F Sun, D Harwath, J Glass IEEE Workshop on Spoken Language Technology, 2016 | 32 | 2016 |
Prompting the hidden talent of web-scale speech models for zero-shot task generalization P Peng, B Yan, S Watanabe, D Harwath arXiv preprint arXiv:2305.11095, 2023 | 31 | 2023 |