The Multilingual TEDx corpus for speech recognition and translation

E Salesky, M Wiesner, J Bremerman, R Cattoni… - arXiv preprint arXiv …, 2021 - arxiv.org
We present the Multilingual TEDx corpus, built to support speech recognition (ASR) and
speech translation (ST) research across many non-English source languages. The corpus is …

Language ID in the wild: Unexpected challenges on the path to a thousand-language web text corpus

I Caswell, T Breiner, D Van Esch, A Bapna - arXiv preprint arXiv …, 2020 - arxiv.org
Large text corpora are increasingly important for a wide variety of Natural Language
Processing (NLP) tasks, and automatic language identification (LangID) is a core technology …

The SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion

K Gorman, LFE Ashby, A Goyzueta… - Proceedings of the …, 2020 - aclanthology.org
We describe the design and findings of the SIGMORPHON 2020 shared task on multilingual
grapheme-to-phoneme conversion. Participants were asked to submit systems which take in …

A surprisal–duration trade-off across and within the world's languages

T Pimentel, C Meister, E Salesky, S Teufel… - arXiv preprint arXiv …, 2021 - arxiv.org
While there exist scores of natural languages, each with its unique features and
idiosyncrasies, they all share a unifying theme: enabling human communication. We may …

Zero-shot learning for grapheme to phoneme conversion with language ensemble

X Li, F Metze, DR Mortensen… - Findings of the …, 2022 - aclanthology.org
Grapheme-to-Phoneme (G2P) has many applications in NLP and speech fields.
Most existing work focuses heavily on languages with abundant training datasets, which …

A large and evolving cognate database

K Batsuren, G Bella, F Giunchiglia - Language Resources and Evaluation, 2022 - Springer
We present CogNet, a large-scale, automatically-built database of sense-tagged cognates—
words of common origin and meaning across languages. CogNet is continuously evolving …

A corpus for large-scale phonetic typology

E Salesky, E Chodroff, T Pimentel, M Wiesner… - arXiv preprint arXiv …, 2020 - arxiv.org
A major hurdle in data-driven research on typology is having sufficient data in many
languages to draw meaningful conclusions. We present VoxClamantis v1.0, the first large …

ByT5 model for massively multilingual grapheme-to-phoneme conversion

J Zhu, C Zhang, D Jurgens - arXiv preprint arXiv:2204.03067, 2022 - arxiv.org
In this study, we tackle massively multilingual grapheme-to-phoneme conversion through
implementing G2P models based on ByT5. We have curated a G2P dataset from various …

[BOOK] Finite-state text processing

K Gorman, R Sproat - 2022 - books.google.com
Weighted finite-state transducers (WFSTs) are commonly used by engineers and
computational linguists for processing and generating speech and text. This book first …

Wiktextract: Wiktionary as machine-readable structured data

T Ylonen - International Conference on Language …, 2022 - researchportal.helsinki.fi
We present a machine-readable structured data version of Wiktionary. Unlike previous
Wiktionary extractions, the new extractor, Wiktextract, fully interprets and expands templates …