Linguistically inspired roadmap for building biologically reliable protein language models

MH Vu, R Akbar, PA Robert, B Swiatczak… - Nature Machine …, 2023 - nature.com
Deep neural-network-based language models (LMs) are increasingly applied to large-scale
protein sequence data to predict protein function. However, being largely black-box models …

Language model tokenizers introduce unfairness between languages

A Petrov, E La Malfa, P Torr… - Advances in Neural …, 2024 - proceedings.neurips.cc
Recent language models have shown impressive multilingual performance, even when not
explicitly trained for it. Despite this, there are concerns about the quality of their outputs …

Counting the Bugs in ChatGPT's Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language Model

L Weissweiler, V Hofmann, A Kantharuban… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have recently reached an impressive level of linguistic
capability, prompting comparisons with human language skills. However, there have been …

Analyzing cognitive plausibility of subword tokenization

L Beinborn, Y Pinter - arXiv preprint arXiv:2310.13348, 2023 - arxiv.org
Subword tokenization has become the de facto standard for tokenization, although
comparative evaluations of subword vocabulary quality across languages are scarce …

Optimal Transport Posterior Alignment for Cross-lingual Semantic Parsing

T Sherborne, T Hosking, M Lapata - Transactions of the Association …, 2023 - direct.mit.edu
Cross-lingual semantic parsing transfers parsing capability from a high-resource language
(e.g., English) to low-resource languages with scarce training data. Previous work has …

Testing large language models on compositionality and inference with phrase-level adjective-noun entailment

L Bertolini, J Weeds, D Weir - Proceedings of the 29th International …, 2022 - aclanthology.org
Previous work has demonstrated that pre-trained large language models (LLM) acquire
knowledge during pre-training which enables reasoning over relationships between words …

Learning sentiment-enhanced word representations by fusing external hybrid sentiment knowledge

Y Li, Z Lin, Y Lin, J Yin, L Chang - Cognitive Computation, 2023 - Springer
Word representation learning is a fundamental technique in cognitive computation that plays
a crucial role in enabling machines to understand and process human language. By …

The Impact of Word Splitting on the Semantic Content of Contextualized Word Representations

AG Soler, M Labeau, C Clavel - Transactions of the Association for …, 2024 - direct.mit.edu
When deriving contextualized word representations from language models, a decision
needs to be made on how to obtain one for out-of-vocabulary (OOV) words that are …

MorphPiece: Moving away from Statistical Language Representation

H Jabbar - arXiv preprint arXiv:2307.07262, 2023 - arxiv.org
Tokenization is a critical part of modern NLP pipelines. However, contemporary tokenizers
for Large Language Models are based on statistical analysis of text corpora, without much …

Advancing protein language models with linguistics: a roadmap for improved interpretability

MH Vu, R Akbar, PA Robert, B Swiatczak… - arXiv preprint arXiv …, 2022 - academia.edu
Deep neural-network-based language models (LMs) are increasingly applied to large-scale
protein sequence data to predict protein function. However, being largely black-box models …