Large language model as attributed training data generator: A tale of diversity and bias
Large language models (LLMs) have been recently leveraged as training data generators
for various natural language processing (NLP) tasks. While previous research has explored …
Dataset cartography: Mapping and diagnosing datasets with training dynamics
Large datasets have become commonplace in NLP research. However, the increased
emphasis on data quantity has made it challenging to assess the quality of data. We …
NumGLUE: A suite of fundamental yet challenging mathematical reasoning tasks
Given the ubiquitous nature of numbers in text, reasoning with numbers to perform simple
calculations is an important skill of AI systems. While many datasets and models have been …
Don't Blame the Annotator: Bias Already Starts in the Annotation Instructions
In recent years, progress in NLU has been driven by benchmarks. These benchmarks are
typically collected by crowdsourcing, where annotators write examples based on annotation …
Unleashing the power of data tsunami: A comprehensive survey on data assessment and selection for instruction tuning of language models
Instruction tuning plays a critical role in aligning large language models (LLMs) with human
preference. Despite the vast amount of open instruction datasets, naively training an LLM on …
ProGen: Progressive zero-shot dataset generation via in-context feedback
Recently, dataset-generation-based zero-shot learning has shown promising results by
training a task-specific model with a dataset synthesized from large pre-trained language …
Generalized but not robust? Comparing the effects of data modification methods on out-of-domain generalization and adversarial robustness
Data modification, whether via additional training datasets, data augmentation, debiasing, or
dataset filtering, has been proposed as an effective solution for generalizing to out-of …
Large-scale evaluation of topic models and dimensionality reduction methods for 2D text spatialization
Topic models are a class of unsupervised learning algorithms for detecting the semantic
structure within a text corpus. Together with a subsequent dimensionality reduction …
ILDAE: Instance-level difficulty analysis of evaluation data
Knowledge of questions' difficulty level helps a teacher in several ways, such as estimating
students' potential quickly by asking carefully selected questions and improving quality of …
Do we need to create big datasets to learn a task?
S Mishra, BS Sachdeva - … of SustaiNLP: Workshop on Simple and …, 2020 - aclanthology.org
Deep Learning research has been largely accelerated by the development of huge datasets
such as Imagenet. The general trend has been to create big datasets to make a deep neural …