Out of the BLEU: how should we assess quality of the code generation models?
In recent years, researchers have created and introduced a significant number of various
code generation models. As human evaluation of every new model version is unfeasible, the …
Annotation error detection: Analyzing the past and present for a more coherent future
Annotated data is an essential ingredient in natural language processing for training and
evaluating machine learning models. It is therefore very desirable for the annotations to be …
If in a Crowdsourced Data Annotation Pipeline, a GPT-4
Recent studies indicated that GPT-4 outperforms online crowd workers in data labeling
accuracy, notably workers from Amazon Mechanical Turk (MTurk). However, these studies …
IMDB-WIKI-SbS: An evaluation dataset for crowdsourced pairwise comparisons
N Pavlichenko, D Ustalov - arXiv preprint arXiv:2110.14990, 2021 - arxiv.org
Today, comprehensive evaluation of large-scale machine learning models is possible
thanks to the open datasets produced using crowdsourcing, such as SQuAD, MS COCO …
Improving recommender systems with human-in-the-loop
Today, most recommender systems employ Machine Learning to recommend posts,
products, and other items, usually produced by the users. Although the impressive progress …
Real-time visual feedback to guide benchmark creation: A human-and-metric-in-the-loop workflow
Recent research has shown that language models exploit 'artifacts' in benchmarks to solve
tasks, rather than truly learning them, leading to inflated model performance. In pursuit of …
Challenges in Data Production for AI with Human-in-the-Loop
D Ustalov - Proceedings of the Fifteenth ACM International …, 2022 - dl.acm.org
Today, successful Artificial Intelligence applications rely on three pillars: machine learning
algorithms, hardware for running them, and data for training and evaluating models …
AmbiFC: Fact-Checking Ambiguous Claims with Evidence
Automated fact-checking systems verify claims against evidence to predict their veracity. In
real-world scenarios, the retrieved evidence may not unambiguously support or refute the …
Developing a tool for fair and reproducible use of paid crowdsourcing in the digital humanities
T Hiippala, H Hotti, R Suviranta - Proceedings of the 6th Joint …, 2022 - aclanthology.org
This system demonstration paper describes ongoing work on a tool for fair and reproducible
use of paid crowdsourcing in the digital humanities. Paid crowdsourcing is widely used in …
Algorithms to mimic human interpretation of turbidity events from drinking water distribution systems
K Gleeson, S Husband, J Gaffney… - Journal of …, 2024 - iwaponline.com
Deriving insight from the increasing volume of water quality time series data from drinking
water distribution systems is complex and is usually situation- and individual-specific. This …