Out of the BLEU: how should we assess quality of the code generation models?

M Evtikhiev, E Bogomolov, Y Sokolov… - Journal of Systems and …, 2023 - Elsevier
In recent years, researchers have created and introduced a significant number of code
generation models. As human evaluation of every new model version is unfeasible, the …

Annotation error detection: Analyzing the past and present for a more coherent future

JC Klie, B Webber, I Gurevych - Computational Linguistics, 2023 - direct.mit.edu
Annotated data is an essential ingredient in natural language processing for training and
evaluating machine learning models. It is therefore very desirable for the annotations to be …

If in a Crowdsourced Data Annotation Pipeline, a GPT-4

Z He, CY Huang, CKC Ding, S Rohatgi… - Proceedings of the CHI …, 2024 - dl.acm.org
Recent studies indicated that GPT-4 outperforms online crowd workers in data labeling
accuracy, notably workers from Amazon Mechanical Turk (MTurk). However, these studies …

IMDB-WIKI-SbS: An evaluation dataset for crowdsourced pairwise comparisons

N Pavlichenko, D Ustalov - arXiv preprint arXiv:2110.14990, 2021 - arxiv.org
Today, comprehensive evaluation of large-scale machine learning models is possible
thanks to the open datasets produced using crowdsourcing, such as SQuAD, MS COCO …

Improving recommender systems with human-in-the-loop

D Ustalov, N Fedorova, N Pavlichenko - … of the 16th ACM Conference on …, 2022 - dl.acm.org
Today, most recommender systems employ Machine Learning to recommend posts,
products, and other items, usually produced by the users. Although the impressive progress …

Real-time visual feedback to guide benchmark creation: A human-and-metric-in-the-loop workflow

A Arunkumar, S Mishra, B Sachdeva, C Baral… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent research has shown that language models exploit 'artifacts' in benchmarks to solve
tasks, rather than truly learning them, leading to inflated model performance. In pursuit of …

Challenges in Data Production for AI with Human-in-the-Loop

D Ustalov - Proceedings of the Fifteenth ACM International …, 2022 - dl.acm.org
Today, successful Artificial Intelligence applications rely on three pillars: machine learning
algorithms, hardware for running them, and data for training and evaluating models …

AmbiFC: Fact-Checking Ambiguous Claims with Evidence

M Glockner, I Staliūnaitė, J Thorne, G Vallejo… - Transactions of the …, 2024 - direct.mit.edu
Automated fact-checking systems verify claims against evidence to predict their veracity. In
real-world scenarios, the retrieved evidence may not unambiguously support or refute the …

Developing a tool for fair and reproducible use of paid crowdsourcing in the digital humanities

T Hiippala, H Hotti, R Suviranta - Proceedings of the 6th Joint …, 2022 - aclanthology.org
This system demonstration paper describes ongoing work on a tool for fair and reproducible
use of paid crowdsourcing in the digital humanities. Paid crowdsourcing is widely used in …

Algorithms to mimic human interpretation of turbidity events from drinking water distribution systems

K Gleeson, S Husband, J Gaffney… - Journal of …, 2024 - iwaponline.com
Deriving insight from the increasing volume of water quality time series data from drinking
water distribution systems is complex and is usually situation- and individual-specific. This …