Automatic genre identification: a survey
T Kuzman, N Ljubešić - Language Resources and Evaluation, 2023 - Springer
Automatic genre identification (AGI) is a text classification task focused on genres, i.e., text
categories defined by the author's purpose, common function of the text, and the text's …
The responsible foundation model development cheatsheet: A review of tools & resources
Foundation model development attracts a rapidly expanding body of contributors, scientists,
and applications. To help shape responsible development practices, we introduce the …
Simple and scalable strategies to continually pre-train large language models
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start
the process over again once new data becomes available. A much more efficient solution is …
Untangling the unrestricted web: Automatic identification of multilingual registers
E Henriksson, A Myntti, A Eskelinen… - arXiv preprint arXiv …, 2024 - arxiv.org
This article explores deep learning models for the automatic identification of registers (text
varieties such as news reports and discussion forums) in web-based datasets across 16 …
CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation
N Ljubešić, T Kuzman - arXiv preprint arXiv:2403.12721, 2024 - arxiv.org
This paper presents a collection of highly comparable web corpora of Slovenian, Croatian,
Bosnian, Montenegrin, Serbian, Macedonian, and Bulgarian, covering thereby the whole …
A New Massive Multilingual Dataset for High-Performance Language Technologies
We present the HPLT (High Performance Language Technologies) language resources, a
new massive multilingual dataset including both monolingual and bilingual corpora …
Intersecting Register and Genre: Understanding the Contents of Web-Crawled Corpora
A Myntti, L Repo, E Freyermuth, A Kanner… - Proceedings of the …, 2024 - aclanthology.org
Web-scale corpora present valuable research opportunities but often lack detailed
metadata, making them challenging to use in linguistics and social sciences. This study …
From Discrete to Continuous Classes: A Situational Analysis of Multilingual Web Registers with LLM Annotations
E Henriksson, A Myntti, S Hellström… - Proceedings of the …, 2024 - aclanthology.org
In corpus linguistics, registers–language varieties suited to different contexts–have
traditionally been defined by their situations of use, yet recent studies reveal significant …
The shifting landscape of data: learning to tame distributional shifts
A Ibrahim - 2024 - papyrus.bib.umontreal.ca
Machine learning (ML) models achieve remarkable performance on tasks they are trained
for. However, they often are sensitive to shifts in the data distribution, which may lead to …
Building Question-Answer Data Using Web Register Identification
A Eskelinen, A Myntti, E Henriksson… - Proceedings of the …, 2024 - aclanthology.org
This article introduces a resource-efficient method for developing question-answer (QA)
datasets by extracting QA pairs from web-scale data using machine learning (ML). Our …