News summarization and evaluation in the era of GPT-3

T Goyal, JJ Li, G Durrett - arXiv preprint arXiv:2209.12356, 2022 - arxiv.org
The recent success of zero- and few-shot prompting with models like GPT-3 has led to a
paradigm shift in NLP research. In this paper, we study its impact on text summarization …

TIFA: Accurate and interpretable text-to-image faithfulness evaluation with question answering

Y Hu, B Liu, J Kasai, Y Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Despite thousands of researchers, engineers, and artists actively working on improving text-
to-image generation models, systems often fail to produce images that accurately align with …

One embedder, any task: Instruction-finetuned text embeddings

H Su, W Shi, J Kasai, Y Wang, Y Hu… - arXiv preprint arXiv …, 2022 - arxiv.org
We introduce INSTRUCTOR, a new method for computing text embeddings given task
instructions: every text input is embedded together with instructions explaining the use case …

Evallm: Interactive evaluation of large language model prompts on user-defined criteria

TS Kim, Y Lee, J Shin, YH Kim, J Kim - … of the CHI Conference on Human …, 2024 - dl.acm.org
By simply composing prompts, developers can prototype novel generative applications with
Large Language Models (LLMs). To refine prototypes into products, however, developers …

Summarizing, simplifying, and synthesizing medical evidence using GPT-3 (with varying success)

C Shaib, ML Li, S Joseph, IJ Marshall, JJ Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models, particularly GPT-3, are able to produce high-quality summaries of
general domain news articles in few- and zero-shot settings. However, it is unclear if such …

Evaluating evaluation metrics: A framework for analyzing NLG evaluation metrics using measurement theory

Z Xiao, S Zhang, V Lai, QV Liao - arXiv preprint arXiv:2305.14889, 2023 - arxiv.org
We address a fundamental challenge in Natural Language Generation (NLG) model
evaluation: the design and evaluation of evaluation metrics. Recognizing the limitations of …

CheckEval: Robust Evaluation Framework using Large Language Model via Checklist

Y Lee, J Kim, J Kim, H Cho, P Kang - arXiv preprint arXiv:2403.18771, 2024 - arxiv.org
We introduce CheckEval, a novel evaluation framework using Large Language Models,
addressing the challenges of ambiguity and inconsistency in current evaluation methods …

UNcommonsense Reasoning: Abductive Reasoning about Uncommon Situations

W Zhao, JT Chiu, JD Hwang, F Brahman… - arXiv preprint arXiv …, 2023 - arxiv.org
Language technologies that accurately model the dynamics of events must perform
commonsense reasoning. Existing work evaluating commonsense reasoning focuses on …

Common law annotations: Investigating the stability of dialog system output annotations

S Lee, A DeLucia, N Nangia, P Ganedi… - Findings of the …, 2023 - aclanthology.org
Metrics for Inter-Annotator Agreement (IAA), like Cohen's Kappa, are crucial for
validating annotated datasets. Although high agreement is often used to show the reliability …

Neural language generation for content adaptation: Explainable, efficient low-resource text simplification and evaluation

GC Garbacea - 2023 - deepblue.lib.umich.edu
There are rich opportunities to reduce the language complexity of professional content
(either human-written or computer-generated) and make it accessible to a broad audience …