Out of the BLEU: how should we assess quality of the code generation models?
In recent years, researchers have created and introduced a significant number of various
code generation models. As human evaluation of every new model version is unfeasible, the …
Annotation error detection: Analyzing the past and present for a more coherent future
Annotated data is an essential ingredient in natural language processing for training and
evaluating machine learning models. It is therefore very desirable for the annotations to be …
If in a Crowdsourced Data Annotation Pipeline, a GPT-4
Recent studies indicated that GPT-4 outperforms online crowd workers in data labeling
accuracy, notably workers from Amazon Mechanical Turk (MTurk). However, these studies …
IMDB-WIKI-SbS: An evaluation dataset for crowdsourced pairwise comparisons
N Pavlichenko, D Ustalov - arXiv preprint arXiv:2110.14990, 2021 - arxiv.org
Today, comprehensive evaluation of large-scale machine learning models is possible
thanks to the open datasets produced using crowdsourcing, such as SQuAD, MS COCO …
Improving recommender systems with human-in-the-loop
Today, most recommender systems employ Machine Learning to recommend posts,
products, and other items, usually produced by the users. Although the impressive progress …
Real-time visual feedback to guide benchmark creation: A human-and-metric-in-the-loop workflow
Recent research has shown that language models exploit 'artifacts' in benchmarks to solve
tasks, rather than truly learning them, leading to inflated model performance. In pursuit of …
Challenges in Data Production for AI with Human-in-the-Loop
D Ustalov - Proceedings of the Fifteenth ACM International …, 2022 - dl.acm.org
Today, successful Artificial Intelligence applications rely on three pillars: machine learning
algorithms, hardware for running them, and data for training and evaluating models …
AmbiFC: Fact-Checking Ambiguous Claims with Evidence
Automated fact-checking systems verify claims against evidence to predict their veracity. In
real-world scenarios, the retrieved evidence may not unambiguously support or refute the …
Developing a tool for fair and reproducible use of paid crowdsourcing in the digital humanities
T Hiippala, H Hotti, R Suviranta - Proceedings of the 6th Joint …, 2022 - aclanthology.org
This system demonstration paper describes ongoing work on a tool for fair and reproducible
use of paid crowdsourcing in the digital humanities. Paid crowdsourcing is widely used in …
Algorithms to mimic human interpretation of turbidity events from drinking water distribution systems
K Gleeson, S Husband, J Gaffney… - Journal of …, 2024 - iwaponline.com
Deriving insight from the increasing volume of water quality time series data from drinking
water distribution systems is complex and is usually situation- and individual-specific. This …