Vocabulary learning via optimal transport for neural machine translation

J Xu, H Zhou, C Gan, Z Zheng, L Li - arXiv preprint arXiv:2012.15671, 2020 - arxiv.org
The choice of token vocabulary affects the performance of machine translation. This paper
aims to figure out what is a good vocabulary and whether one can find the optimal …

Protein design by directed evolution guided by large language models

TVT Tran, TS Hy - IEEE Transactions on Evolutionary …, 2024 - ieeexplore.ieee.org
Directed evolution, a strategy for protein engineering, optimizes protein properties (i.e.,
fitness) by a rigorous and resource-intensive process of screening or selecting among a vast …

A comparison between morphological complexity measures: typological data vs. language corpora

C Bentz, T Ruzsics, A Koplenig… - Proceedings of the …, 2016 - aclanthology.org
Language complexity is an intriguing phenomenon argued to play an important role
in both language learning and processing. The need to compare languages with regard to …

Molecule generation by principal subgraph mining and assembling

X Kong, W Huang, Z Tan, Y Liu - Advances in Neural …, 2022 - proceedings.neurips.cc
Molecule generation is central to a variety of applications. Current attention has been paid to
approaching the generation task as subgraph prediction and assembling. Nevertheless …

Morphological complexity of languages reflects the settlement history of the Americas

J Nichols, C Bentz - New Perspectives on the Peopling of …, 2019 - researchportal.helsinki.fi
Morphological complexity is widely believed to increase with sociolinguistic isolation, and to
decrease with language spreads and absorption of L2 adult learner populations. However …

TLSPG: Transfer learning-based semi-supervised pseudo-corpus generation approach for zero-shot translation

A Kumar, RK Mundotiya, A Pratap, AK Singh - Journal of King Saud …, 2022 - Elsevier
Machine Translation (MT) has come a long way in recent years, but it still suffers
from data scarcity issue due to lack of parallel corpora for low (or sometimes zero) resource …

Entropy-based syntactic tree analysis for text classification: a novel approach to distinguishing between original and translated Chinese texts

Z Wang, AKF Cheung, K Liu - Digital Scholarship in the …, 2024 - academic.oup.com
This research focuses on classifying translated and non-translated Chinese texts by
analyzing syntactic rule features, using an integrated approach of machine learning and …

Content Reduction, Surprisal and Information Density Estimation for Long Documents

S Ji, W Sun, P Marttinen - arXiv preprint arXiv:2309.06009, 2023 - arxiv.org
Many computational linguistic methods have been proposed to study the information content
of languages. We consider two interesting research questions: 1) how is information …

Towards robust complexity indices in linguistic typology: A corpus-based assessment

YM Oh, F Pellegrino - Studies in Language, 2023 - jbe-platform.com
There is high hope that corpus-based approaches to language complexity will contribute to
explaining linguistic diversity. Several complexity indices have consequently been proposed …

Graphpiece: Efficiently generating high-quality molecular graph with substructures

X Kong, Z Tan, Y Liu - arXiv preprint arXiv:2106.15098, 2021 - academia.edu
Molecular graph generation is a fundamental but challenging task in various applications
such as drug discovery and material science, which requires generating valid molecules …