How to evaluate machine translation: A review of automated and human metrics
E Chatzikoumi - Natural Language Engineering, 2020 - cambridge.org
This article presents the most up-to-date, influential automated, semiautomated and human
metrics used to evaluate the quality of machine translation (MT) output and provides the …
The Eval4NLP shared task on explainable quality estimation: Overview and results
In this paper, we introduce the Eval4NLP 2021 shared task on explainable quality
estimation. Given a source-translation pair, this shared task requires not only to provide a …
Large language models effectively leverage document-level context for literary translation, but critical errors persist
M Karpinska, M Iyyer - arXiv preprint arXiv:2304.03245, 2023 - arxiv.org
Large language models (LLMs) are competitive with the state of the art on a wide range of
sentence-level translation datasets. However, their ability to translate paragraphs and …
Error classification and analysis for machine translation quality assessment
M Popović - Translation quality assessment: From principles to …, 2018 - Springer
This chapter presents an overview of different approaches and tasks related to classification
and analysis of errors in machine translation (MT) output. Manual error classification is a …
Fine-grained human evaluation of neural versus phrase-based machine translation
F Klubička, A Toral… - The Prague Bulletin of …, 2017 - archive.sciendo.com
We compare three approaches to statistical machine translation (pure phrase-based,
factored phrase-based and neural) by performing a fine-grained manual evaluation via error …
First tragedy, then parse: History repeats itself in the new era of large language models
Many NLP researchers are experiencing an existential crisis triggered by the astonishing
success of ChatGPT and other systems based on large language models (LLMs). After such …
How far are we from fully automatic high quality grammatical error correction?
In this paper, we first explore the role of inter-annotator agreement statistics in grammatical
error correction and conclude that they are less informative in fields where there may be …
GPT-4 vs. human translators: A comprehensive evaluation of translation quality across languages, domains, and expertise levels
This study comprehensively evaluates the translation quality of Large Language Models
(LLMs), specifically GPT-4, against human translators of varying expertise levels across …
Quantitative fine-grained human evaluation of machine translation systems: a case study on English to Croatian
This paper presents a quantitative fine-grained manual evaluation approach to comparing
the performance of different machine translation (MT) systems. We build upon the well …
Agreement is overrated: A plea for correlation to assess human evaluation reliability
Inter-Annotator Agreement (IAA) is used as a means of assessing the quality of NLG
evaluation data, in particular, its reliability. According to existing scales of IAA interpretation …