Vocabulary learning via optimal transport for neural machine translation
The choice of token vocabulary affects the performance of machine translation. This paper
aims to figure out what is a good vocabulary and whether one can find the optimal …
Protein design by directed evolution guided by large language models
Directed evolution, a strategy for protein engineering, optimizes protein properties (i.e.,
fitness) by a rigorous and resource-intensive process of screening or selecting among a vast …
A comparison between morphological complexity measures: typological data vs. language corpora
Language complexity is an intriguing phenomenon argued to play an important role
in both language learning and processing. The need to compare languages with regard to …
Molecule generation by principal subgraph mining and assembling
Molecule generation is central to a variety of applications. Current attention has been paid to
approaching the generation task as subgraph prediction and assembling. Nevertheless …
Morphological complexity of languages reflects the settlement history of the Americas
J Nichols, C Bentz - New Perspectives on the Peopling of …, 2019 - researchportal.helsinki.fi
Morphological complexity is widely believed to increase with sociolinguistic isolation, and to
decrease with language spreads and absorption of L2 adult learner populations. However …
TLSPG: Transfer learning-based semi-supervised pseudo-corpus generation approach for zero-shot translation
Machine Translation (MT) has come a long way in recent years, but it still suffers
from data scarcity issue due to lack of parallel corpora for low (or sometimes zero) resource …
Entropy-based syntactic tree analysis for text classification: a novel approach to distinguishing between original and translated Chinese texts
Z Wang, AKF Cheung, K Liu - Digital Scholarship in the …, 2024 - academic.oup.com
This research focuses on classifying translated and non-translated Chinese texts by
analyzing syntactic rule features, using an integrated approach of machine learning and …
Content Reduction, Surprisal and Information Density Estimation for Long Documents
Many computational linguistic methods have been proposed to study the information content
of languages. We consider two interesting research questions: 1) how is information …
Towards robust complexity indices in linguistic typology: A corpus-based assessment
YM Oh, F Pellegrino - Studies in Language, 2023 - jbe-platform.com
There is high hope that corpus-based approaches to language complexity will contribute to
explaining linguistic diversity. Several complexity indices have consequently been proposed …
Graphpiece: Efficiently generating high-quality molecular graph with substructures
XKZTY Liu - arXiv preprint arXiv:2106.15098, 2021 - academia.edu
Molecular graph generation is a fundamental but challenging task in various applications
such as drug discovery and material science, which requires generating valid molecules …