Automatic genre identification: a survey

T Kuzman, N Ljubešić - Language Resources and Evaluation, 2023 - Springer
Automatic genre identification (AGI) is a text classification task focused on genres, i.e., text
categories defined by the author's purpose, common function of the text, and the text's …

The responsible foundation model development cheatsheet: A review of tools & resources

S Longpre, S Biderman, A Albalak… - arXiv preprint arXiv …, 2024 - arxiv.org
Foundation model development attracts a rapidly expanding body of contributors, scientists,
and applications. To help shape responsible development practices, we introduce the …

Simple and scalable strategies to continually pre-train large language models

A Ibrahim, B Thérien, K Gupta, ML Richter… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start
the process over again once new data becomes available. A much more efficient solution is …

Untangling the unrestricted web: Automatic identification of multilingual registers

E Henriksson, A Myntti, A Eskelinen… - arXiv preprint arXiv …, 2024 - arxiv.org
This article explores deep learning models for the automatic identification of registers (text
varieties such as news reports and discussion forums) in web-based datasets across 16 …

CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation

N Ljubešić, T Kuzman - arXiv preprint arXiv:2403.12721, 2024 - arxiv.org
This paper presents a collection of highly comparable web corpora of Slovenian, Croatian,
Bosnian, Montenegrin, Serbian, Macedonian, and Bulgarian, covering thereby the whole …

A New Massive Multilingual Dataset for High-Performance Language Technologies

O De Gibert, G Nail, N Arefyev, M Bañón… - arXiv preprint arXiv …, 2024 - arxiv.org
We present the HPLT (High Performance Language Technologies) language resources, a
new massive multilingual dataset including both monolingual and bilingual corpora …

Intersecting Register and Genre: Understanding the Contents of Web-Crawled Corpora

A Myntti, L Repo, E Freyermuth, A Kanner… - Proceedings of the …, 2024 - aclanthology.org
Web-scale corpora present valuable research opportunities but often lack detailed
metadata, making them challenging to use in linguistics and social sciences. This study …

From Discrete to Continuous Classes: A Situational Analysis of Multilingual Web Registers with LLM Annotations

E Henriksson, A Myntti, S Hellström… - Proceedings of the …, 2024 - aclanthology.org
In corpus linguistics, registers (language varieties suited to different contexts) have
traditionally been defined by their situations of use, yet recent studies reveal significant …

The shifting landscape of data: learning to tame distributional shifts

A Ibrahim - 2024 - papyrus.bib.umontreal.ca
Machine learning (ML) models achieve remarkable performance on tasks they are trained
for. However, they often are sensitive to shifts in the data distribution, which may lead to …

Building Question-Answer Data Using Web Register Identification

A Eskelinen, A Myntti, E Henriksson… - Proceedings of the …, 2024 - aclanthology.org
This article introduces a resource-efficient method for developing question-answer (QA)
datasets by extracting QA pairs from web-scale data using machine learning (ML). Our …