Massively multilingual pronunciation modeling with WikiPron

The Multilingual TEDx corpus for speech recognition and translation

E Salesky, M Wiesner, J Bremerman, R Cattoni… - arXiv preprint arXiv …, 2021 - arxiv.org

We present the Multilingual TEDx corpus, built to support speech recognition (ASR) and
speech translation (ST) research across many non-English source languages. The corpus is …

Cited by 131 (all 12 versions)

Language ID in the wild: Unexpected challenges on the path to a thousand-language web text corpus

I Caswell, T Breiner, D Van Esch, A Bapna - arXiv preprint arXiv …, 2020 - arxiv.org

Large text corpora are increasingly important for a wide variety of Natural Language
Processing (NLP) tasks, and automatic language identification (LangID) is a core technology …

Cited by 82 (all 7 versions)

The SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion

K Gorman, LFE Ashby, A Goyzueta… - Proceedings of the …, 2020 - aclanthology.org

We describe the design and findings of the SIGMORPHON 2020 shared task on multilingual
grapheme-to-phoneme conversion. Participants were asked to submit systems which take in …

Cited by 61 (all 6 versions)

A surprisal–duration trade-off across and within the world's languages

T Pimentel, C Meister, E Salesky, S Teufel… - arXiv preprint arXiv …, 2021 - arxiv.org

While there exist scores of natural languages, each with its unique features and
idiosyncrasies, they all share a unifying theme: enabling human communication. We may …

Cited by 31 (all 9 versions)

Zero-shot learning for grapheme to phoneme conversion with language ensemble

X Li, F Metze, DR Mortensen… - Findings of the …, 2022 - aclanthology.org

Grapheme-to-Phoneme (G2P) has many applications in NLP and speech fields.
Most existing work focuses heavily on languages with abundant training datasets, which …

Cited by 19 (all 6 versions)

A large and evolving cognate database

K Batsuren, G Bella, F Giunchiglia - Language Resources and Evaluation, 2022 - Springer

We present CogNet, a large-scale, automatically-built database of sense-tagged cognates—
words of common origin and meaning across languages. CogNet is continuously evolving …

Cited by 25 (all 8 versions)

A corpus for large-scale phonetic typology

E Salesky, E Chodroff, T Pimentel, M Wiesner… - arXiv preprint arXiv …, 2020 - arxiv.org

A major hurdle in data-driven research on typology is having sufficient data in many
languages to draw meaningful conclusions. We present VoxClamantis v1.0, the first large …

Cited by 31 (all 14 versions)

ByT5 model for massively multilingual grapheme-to-phoneme conversion

J Zhu, C Zhang, D Jurgens - arXiv preprint arXiv:2204.03067, 2022 - arxiv.org

In this study, we tackle massively multilingual grapheme-to-phoneme conversion through
implementing G2P models based on ByT5. We have curated a G2P dataset from various …

Cited by 35 (all 6 versions)

[Book] Finite-state text processing

K Gorman, R Sproat - 2022 - books.google.com

Weighted finite-state transducers (WFSTs) are commonly used by engineers and
computational linguists for processing and generating speech and text. This book first …

Cited by 21 (all 4 versions)

Wiktextract: Wiktionary as machine-readable structured data

T Ylonen - International Conference on Language …, 2022 - researchportal.helsinki.fi

We present a machine-readable structured data version of Wiktionary. Unlike previous
Wiktionary extractions, the new extractor, Wiktextract, fully interprets and expands templates …

Cited by 14 (all 6 versions)