Automatic genre identification: a survey

T Kuzman, N Ljubešić - Language Resources and Evaluation, 2023 - Springer
Automatic genre identification (AGI) is a text classification task focused on genres, i.e., text
categories defined by the author's purpose, common function of the text, and the text's …

Automatic genre identification for robust enrichment of massive text collections: Investigation of classification methods in the era of large language models

T Kuzman, I Mozetič, N Ljubešić - Machine Learning and Knowledge …, 2023 - mdpi.com
Massive text collections are the backbone of large language models, the main ingredient of
the current significant progress in artificial intelligence. However, as these collections are …

Exploring predictive uncertainty and calibration in NLP: A study on the impact of method & data scarcity

D Ulmer, J Frellsen, C Hardmeier - arXiv preprint arXiv:2210.15452, 2022 - arxiv.org
We investigate the problem of determining the predictive confidence (or, conversely,
uncertainty) of a neural classifier through the lens of low-resource languages. By training …

Camel Treebank: An open multi-genre Arabic dependency treebank

N Habash, M AbuOdeh, D Taji, R Faraj… - Proceedings of the …, 2022 - aclanthology.org
We present the Camel Treebank (CAMELTB), a 188K word open-source
dependency treebank of Modern Standard and Classical Arabic. CAMELTB 1.0 includes 13 …

The GINCO training dataset for web genre identification of documents out in the wild

T Kuzman, P Rupnik, N Ljubešić - arXiv preprint arXiv:2201.03857, 2022 - arxiv.org
This paper presents a new training dataset for automatic genre identification GINCO, which
is based on 1,125 crawled Slovenian web documents that consist of 650 thousand words …

Are UD treebanks getting more consistent? A report card for English UD

A Zeldes, N Schneider - arXiv preprint arXiv:2302.00636, 2023 - arxiv.org
Recent efforts to consolidate guidelines and treebanks in the Universal Dependencies
project raise the expectation that joint training and dataset comparison is increasingly …

A finite-state morphological analyser for Highland Puebla Nahuatl

R Pugh, F Tyers - Proceedings of the Workshop on Natural …, 2023 - aclanthology.org
This paper describes the development of a free/open-source finite-state
morphological transducer for Highland Puebla Nahuatl, a Uto-Aztecan language spoken in …

Training and evaluation of vector models for Galician

M Garcia - Language Resources and Evaluation, 2024 - Springer
This paper presents a large and systematic assessment of distributional models for Galician.
To this end, we have first trained and evaluated static word embeddings (e.g., word2vec …

On Uncertainty In Natural Language Processing

D Ulmer - arXiv preprint arXiv:2410.03446, 2024 - arxiv.org
The last decade in deep learning has brought on increasingly capable systems that are
deployed on a wide variety of applications. In natural language processing, the field has …

ParlaMint-It: an 18-karat UD treebank of Italian parliamentary speeches

C Alzetta, S Montemagni, M Sartor… - Language Resources and …, 2024 - Springer
The paper presents ParlaMint-It, a new treebank of Italian parliamentary debates,
linguistically annotated based on the Universal Dependencies (UD) framework. The …