Towards better structured and less noisy Web data: Oscar with Register annotations

T Kuzman, N Ljubešić - Language Resources and Evaluation, 2023 - Springer

Automatic genre identification (AGI) is a text classification task focused on genres, i.e., text
categories defined by the author's purpose, common function of the text, and the text's …

Cited by 107 · Related articles · All 7 versions

The responsible foundation model development cheatsheet: A review of tools & resources

S Longpre, S Biderman, A Albalak… - arXiv preprint arXiv …, 2024 - arxiv.org

Foundation model development attracts a rapidly expanding body of contributors, scientists,
and applications. To help shape responsible development practices, we introduce the …

Cited by 5 · Related articles · All 3 versions

Simple and scalable strategies to continually pre-train large language models

A Ibrahim, B Thérien, K Gupta, ML Richter… - arXiv preprint arXiv …, 2024 - arxiv.org

Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start
the process over again once new data becomes available. A much more efficient solution is …

Cited by 48 · Related articles · All 2 versions

Untangling the unrestricted web: Automatic identification of multilingual registers

E Henriksson, A Myntti, A Eskelinen… - arXiv preprint arXiv …, 2024 - arxiv.org

This article explores deep learning models for the automatic identification of registers (text
varieties such as news reports and discussion forums) in web-based datasets across 16 …

Cited by 2 · Related articles · All 2 versions

CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation

N Ljubešić, T Kuzman - arXiv preprint arXiv:2403.12721, 2024 - arxiv.org

This paper presents a collection of highly comparable web corpora of Slovenian, Croatian,
Bosnian, Montenegrin, Serbian, Macedonian, and Bulgarian, covering thereby the whole …

Cited by 6 · Related articles · All 3 versions

A New Massive Multilingual Dataset for High-Performance Language Technologies

O De Gibert, G Nail, N Arefyev, M Bañón… - arXiv preprint arXiv …, 2024 - arxiv.org

We present the HPLT (High Performance Language Technologies) language resources, a
new massive multilingual dataset including both monolingual and bilingual corpora …

Cited by 10 · Related articles · All 4 versions

Intersecting Register and Genre: Understanding the Contents of Web-Crawled Corpora

A Myntti, L Repo, E Freyermuth, A Kanner… - Proceedings of the …, 2024 - aclanthology.org

Web-scale corpora present valuable research opportunities but often lack detailed
metadata, making them challenging to use in linguistics and social sciences. This study …

From Discrete to Continuous Classes: A Situational Analysis of Multilingual Web Registers with LLM Annotations

E Henriksson, A Myntti, S Hellström… - Proceedings of the …, 2024 - aclanthology.org

In corpus linguistics, registers (language varieties suited to different contexts) have
traditionally been defined by their situations of use, yet recent studies reveal significant …

The shifting landscape of data: learning to tame distributional shifts

A Ibrahim - 2024 - papyrus.bib.umontreal.ca

Machine learning (ML) models achieve remarkable performance on tasks they are trained
for. However, they often are sensitive to shifts in the data distribution, which may lead to …

Building Question-Answer Data Using Web Register Identification

A Eskelinen, A Myntti, E Henriksson… - Proceedings of the …, 2024 - aclanthology.org

This article introduces a resource-efficient method for developing question-answer (QA)
datasets by extracting QA pairs from web-scale data using machine learning (ML). Our …

Cited by 1 · Related articles