News summarization and evaluation in the era of GPT-3
The recent success of zero- and few-shot prompting with models like GPT-3 has led to a
paradigm shift in NLP research. In this paper, we study its impact on text summarization …
TIFA: Accurate and interpretable text-to-image faithfulness evaluation with question answering
Despite thousands of researchers, engineers, and artists actively working on improving text-
to-image generation models, systems often fail to produce images that accurately align with …
One embedder, any task: Instruction-finetuned text embeddings
We introduce INSTRUCTOR, a new method for computing text embeddings given task
instructions: every text input is embedded together with instructions explaining the use case …
EvalLM: Interactive evaluation of large language model prompts on user-defined criteria
By simply composing prompts, developers can prototype novel generative applications with
Large Language Models (LLMs). To refine prototypes into products, however, developers …
Summarizing, simplifying, and synthesizing medical evidence using GPT-3 (with varying success)
Large language models, particularly GPT-3, are able to produce high quality summaries of
general domain news articles in few- and zero-shot settings. However, it is unclear if such …
Evaluating evaluation metrics: A framework for analyzing NLG evaluation metrics using measurement theory
We address a fundamental challenge in Natural Language Generation (NLG) model
evaluation: the design and evaluation of evaluation metrics. Recognizing the limitations of …
CheckEval: Robust Evaluation Framework using Large Language Model via Checklist
We introduce CheckEval, a novel evaluation framework using Large Language Models,
addressing the challenges of ambiguity and inconsistency in current evaluation methods …
UNcommonsense Reasoning: Abductive Reasoning about Uncommon Situations
Language technologies that accurately model the dynamics of events must perform
commonsense reasoning. Existing work evaluating commonsense reasoning focuses on …
Common law annotations: Investigating the stability of dialog system output annotations
Metrics for Inter-Annotator Agreement (IAA), like Cohen's Kappa, are crucial for
validating annotated datasets. Although high agreement is often used to show the reliability …
Neural language generation for content adaptation: Explainable, efficient low-resource text simplification and evaluation
GC Garbacea - 2023 - deepblue.lib.umich.edu
There are rich opportunities to reduce the language complexity of professional content
(either human-written or computer-generated) and make it accessible to a broad audience …