Mining Training Data for Language Modeling Across the World's Languages.

V Carbune, P Gonnet, T Deselaers, HA Rowley… - International Journal on …, 2020 - Springer

We describe an online handwriting system that is able to support 102 languages using a
deep neural network architecture. This new system has completely replaced our previous …

Cited by 168 · Related articles · All 8 versions

Language ID in the wild: Unexpected challenges on the path to a thousand-language web text corpus

I Caswell, T Breiner, D Van Esch, A Bapna - arXiv preprint arXiv …, 2020 - arxiv.org

Large text corpora are increasingly important for a wide variety of Natural Language
Processing (NLP) tasks, and automatic language identification (LangID) is a core technology …

Cited by 80 · Related articles · All 7 versions

Writing system and speaker metadata for 2,800+ language varieties

D van Esch, T Lucassen, S Ruder… - Proceedings of the …, 2022 - aclanthology.org

We describe an open-source dataset providing metadata for about 2,800 language varieties
used in the world today. Specifically, the dataset provides the attested writing system(s) for …

Cited by 22 · Related articles · All 5 versions

No data to crawl? monolingual corpus creation from PDF files of truly low-resource languages in Peru

G Bustamante, A Oncevay… - Proceedings of the Twelfth …, 2020 - aclanthology.org

We introduce new monolingual corpora for four indigenous and endangered languages
from Peru: Shipibo-konibo, Ashaninka, Yanesha and Yine. Given the total absence of these …

Cited by 35 · Related articles · All 5 versions

Writing across the world's languages: Deep internationalization for Gboard, the Google keyboard

D van Esch, E Sarbar, T Lucassen, J O'Brien… - arXiv preprint arXiv …, 2019 - arxiv.org

This technical report describes our deep internationalization program for Gboard, the
Google Keyboard. Today, Gboard supports 900+ language varieties across 70+ writing …

Cited by 24 · Related articles · All 5 versions

IndyLSTMs: independently recurrent LSTMs

P Gonnet, T Deselaers - ICASSP 2020-2020 IEEE International …, 2020 - ieeexplore.ieee.org

We introduce Independently Recurrent Long Short-term Memory cells: IndyLSTMs. These
differ from regular LSTM cells in that the recurrent weights are not modeled as a full matrix …

Cited by 22 · Related articles · All 5 versions

Building Large-Vocabulary ASR Systems for Languages Without Any Audio Training Data.

M Prasad, D van Esch, S Ritchie, JF Mortensen - INTERSPEECH, 2019 - isca-archive.org

When building automatic speech recognition (ASR) systems, typically some amount of audio
and text data in the target language is needed. While text data can be obtained relatively …

Cited by 18 · Related articles · All 6 versions

Unified Verbalization for Speech Recognition & Synthesis Across Languages.

S Ritchie, R Sproat, K Gorman, D van Esch… - …, 2019 - academia.edu

We describe a new approach to converting written tokens to their spoken form, which can be
shared by automatic speech recognition (ASR) and text-to-speech synthesis (TTS) systems …

Cited by 10 · Related articles · All 8 versions

Developing Pronunciation Models in New Languages Faster by Exploiting Common Grapheme-to-Phoneme Correspondences Across Languages.

H Bleyan, S Ritchie, JF Mortensen, D van Esch - INTERSPEECH, 2019 - academia.edu

We discuss two methods that let us easily create grapheme-to-phoneme (G2P) conversion
systems for languages without any human-curated pronunciation lexicons, as long as we …

Cited by 7 · Related articles · All 5 versions

Now You See Me, Now You Don't: 'Poverty of the Stimulus' Problems and Arbitrary Correspondences in End-to-End Speech Models

D van Esch - Proceedings of the Second Workshop on …, 2024 - aclanthology.org

End-to-end models for speech recognition and speech synthesis have many benefits, but we
argue they also face a unique set of challenges not encountered in conventional multi-stage …