How to evaluate machine translation: A review of automated and human metrics

E Chatzikoumi - Natural Language Engineering, 2020 - cambridge.org
This article presents the most up-to-date, influential automated, semiautomated and human
metrics used to evaluate the quality of machine translation (MT) output and provides the …

The Eval4NLP shared task on explainable quality estimation: Overview and results

M Fomicheva, P Lertvittayakumjorn, W Zhao… - arXiv preprint arXiv …, 2021 - arxiv.org
In this paper, we introduce the Eval4NLP-2021 shared task on explainable quality
estimation. Given a source-translation pair, this shared task requires not only to provide a …

Large language models effectively leverage document-level context for literary translation, but critical errors persist

M Karpinska, M Iyyer - arXiv preprint arXiv:2304.03245, 2023 - arxiv.org
Large language models (LLMs) are competitive with the state of the art on a wide range of
sentence-level translation datasets. However, their ability to translate paragraphs and …

Error classification and analysis for machine translation quality assessment

M Popović - Translation quality assessment: From principles to …, 2018 - Springer
This chapter presents an overview of different approaches and tasks related to classification
and analysis of errors in machine translation (MT) output. Manual error classification is a …

Fine-grained human evaluation of neural versus phrase-based machine translation

F Klubička, A Toral… - The Prague Bulletin of …, 2017 - archive.sciendo.com
We compare three approaches to statistical machine translation (pure phrase-based,
factored phrase-based and neural) by performing a fine-grained manual evaluation via error …

First tragedy, then parse: History repeats itself in the new era of large language models

N Saphra, E Fleisig, K Cho, A Lopez - arXiv preprint arXiv:2311.05020, 2023 - arxiv.org
Many NLP researchers are experiencing an existential crisis triggered by the astonishing
success of ChatGPT and other systems based on large language models (LLMs). After such …

How far are we from fully automatic high quality grammatical error correction?

C Bryant, HT Ng - Proceedings of the 53rd Annual Meeting of the …, 2015 - aclanthology.org
In this paper, we first explore the role of inter-annotator agreement statistics in grammatical
error correction and conclude that they are less informative in fields where there may be …

GPT-4 vs. human translators: A comprehensive evaluation of translation quality across languages, domains, and expertise levels

J Yan, P Yan, Y Chen, J Li, X Zhu, Y Zhang - arXiv preprint arXiv …, 2024 - arxiv.org
This study comprehensively evaluates the translation quality of Large Language Models
(LLMs), specifically GPT-4, against human translators of varying expertise levels across …

Quantitative fine-grained human evaluation of machine translation systems: a case study on English to Croatian

F Klubička, A Toral, VM Sánchez-Cartagena - Machine Translation, 2018 - Springer
This paper presents a quantitative fine-grained manual evaluation approach to comparing
the performance of different machine translation (MT) systems. We build upon the well …

Agreement is overrated: A plea for correlation to assess human evaluation reliability

J Amidei, P Piwek, A Willis - 2019 - oro.open.ac.uk
Inter-Annotator Agreement (IAA) is used as a means of assessing the quality of NLG
evaluation data, in particular, its reliability. According to existing scales of IAA interpretation …
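The contrast drawn in this last entry, between chance-corrected agreement and correlation as reliability measures for human evaluation, can be made concrete with a small sketch. The example below is my own illustration (not taken from Amidei et al.) and assumes scikit-learn and SciPy are available: two annotators who rate the same items on a 1-5 quality scale with a consistent one-point offset get a near-zero Cohen's kappa (they rarely choose the exact same label) but a very high rank correlation (they order the items almost identically).

```python
# Illustrative sketch only: exact-match agreement vs. rank correlation
# for two hypothetical annotators rating the same 8 segments on a 1-5 scale.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr

annotator_a = [1, 2, 2, 3, 4, 5, 3, 2]
annotator_b = [2, 3, 3, 4, 5, 5, 4, 3]  # systematically about one point higher

kappa = cohen_kappa_score(annotator_a, annotator_b)  # ~ -0.1: almost no exact label matches
rho, _ = spearmanr(annotator_a, annotator_b)         # ~ 0.99: near-identical ranking of segments

print(f"Cohen's kappa = {kappa:.2f}, Spearman rho = {rho:.2f}")
```

Which of the two numbers better reflects the reliability of the evaluation data is precisely the question this paper raises.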