Linguistically inspired roadmap for building biologically reliable protein language models
Deep neural-network-based language models (LMs) are increasingly applied to large-scale
protein sequence data to predict protein function. However, being largely black-box models …
Language model tokenizers introduce unfairness between languages
Recent language models have shown impressive multilingual performance, even when not
explicitly trained for it. Despite this, there are concerns about the quality of their outputs …
Counting the Bugs in ChatGPT's Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language Model
Large language models (LLMs) have recently reached an impressive level of linguistic
capability, prompting comparisons with human language skills. However, there have been …
Analyzing cognitive plausibility of subword tokenization
L Beinborn, Y Pinter - arXiv preprint arXiv:2310.13348, 2023 - arxiv.org
Subword tokenization has become the de facto standard for tokenization, although
comparative evaluations of subword vocabulary quality across languages are scarce …
Optimal Transport Posterior Alignment for Cross-lingual Semantic Parsing
Cross-lingual semantic parsing transfers parsing capability from a high-resource language
(e.g., English) to low-resource languages with scarce training data. Previous work has …
Testing large language models on compositionality and inference with phrase-level adjective-noun entailment
Previous work has demonstrated that pre-trained large language models (LLMs) acquire
knowledge during pre-training which enables reasoning over relationships between words …
Learning sentiment-enhanced word representations by fusing external hybrid sentiment knowledge
Y Li, Z Lin, Y Lin, J Yin, L Chang - Cognitive Computation, 2023 - Springer
Word representation learning is a fundamental technique in cognitive computation that plays
a crucial role in enabling machines to understand and process human language. By …
The Impact of Word Splitting on the Semantic Content of Contextualized Word Representations
When deriving contextualized word representations from language models, a decision
needs to be made on how to obtain one for out-of-vocabulary (OOV) words that are …
MorphPiece: Moving away from Statistical Language Representation
H Jabbar - arXiv preprint arXiv:2307.07262, 2023 - arxiv.org
Tokenization is a critical part of modern NLP pipelines. However, contemporary tokenizers
for Large Language Models are based on statistical analysis of text corpora, without much …
Advancing protein language models with linguistics: a roadmap for improved interpretability
Deep neural-network-based language models (LMs) are increasingly applied to large-scale
protein sequence data to predict protein function. However, being largely black-box models …