The Multilingual TEDx corpus for speech recognition and translation
We present the Multilingual TEDx corpus, built to support speech recognition (ASR) and
speech translation (ST) research across many non-English source languages. The corpus is …
Language ID in the wild: Unexpected challenges on the path to a thousand-language web text corpus
Large text corpora are increasingly important for a wide variety of Natural Language
Processing (NLP) tasks, and automatic language identification (LangID) is a core technology …
The SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion
K Gorman, LFE Ashby, A Goyzueta… - Proceedings of the …, 2020 - aclanthology.org
We describe the design and findings of the SIGMORPHON 2020 shared task on multilingual
grapheme-to-phoneme conversion. Participants were asked to submit systems which take in …
A surprisal–duration trade-off across and within the world's languages
While there exist scores of natural languages, each with its unique features and
idiosyncrasies, they all share a unifying theme: enabling human communication. We may …
Zero-shot learning for grapheme to phoneme conversion with language ensemble
Grapheme-to-Phoneme (G2P) has many applications in NLP and speech fields.
Most existing work focuses heavily on languages with abundant training datasets, which …
A large and evolving cognate database
We present CogNet, a large-scale, automatically-built database of sense-tagged cognates—
words of common origin and meaning across languages. CogNet is continuously evolving …
A corpus for large-scale phonetic typology
A major hurdle in data-driven research on typology is having sufficient data in many
languages to draw meaningful conclusions. We present VoxClamantis v1.0, the first large …
ByT5 model for massively multilingual grapheme-to-phoneme conversion
In this study, we tackle massively multilingual grapheme-to-phoneme conversion through
implementing G2P models based on ByT5. We have curated a G2P dataset from various …
[Book] Finite-state text processing
K Gorman, R Sproat - 2022 - books.google.com
Weighted finite-state transducers (WFSTs) are commonly used by engineers and
computational linguists for processing and generating speech and text. This book first …
Wiktextract: Wiktionary as machine-readable structured data
T Ylonen - International Conference on Language …, 2022 - researchportal.helsinki.fi
We present a machine-readable structured data version of Wiktionary. Unlike previous
Wiktionary extractions, the new extractor, Wiktextract, fully interprets and expands templates …