Large language model as attributed training data generator: A tale of diversity and bias

Y Yu, Y Zhuang, J Zhang, Y Meng… - Advances in …, 2024 - proceedings.neurips.cc
Large language models (LLMs) have been recently leveraged as training data generators
for various natural language processing (NLP) tasks. While previous research has explored …

Dataset cartography: Mapping and diagnosing datasets with training dynamics

S Swayamdipta, R Schwartz, N Lourie, Y Wang… - arXiv preprint arXiv …, 2020 - arxiv.org
Large datasets have become commonplace in NLP research. However, the increased
emphasis on data quantity has made it challenging to assess the quality of data. We …

NumGLUE: A suite of fundamental yet challenging mathematical reasoning tasks

S Mishra, A Mitra, N Varshney, B Sachdeva… - arXiv preprint arXiv …, 2022 - arxiv.org
Given the ubiquitous nature of numbers in text, reasoning with numbers to perform simple
calculations is an important skill of AI systems. While many datasets and models have been …

Don't Blame the Annotator: Bias Already Starts in the Annotation Instructions

M Parmar, S Mishra, M Geva, C Baral - arXiv preprint arXiv:2205.00415, 2022 - arxiv.org
In recent years, progress in NLU has been driven by benchmarks. These benchmarks are
typically collected by crowdsourcing, where annotators write examples based on annotation …

Unleashing the power of data tsunami: A comprehensive survey on data assessment and selection for instruction tuning of language models

Y Qin, Y Yang, P Guo, G Li, H Shao, Y Shi, Z Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Instruction tuning plays a critical role in aligning large language models (LLMs) with human
preference. Despite the vast amount of open instruction datasets, naively training an LLM on …

ProGen: Progressive zero-shot dataset generation via in-context feedback

J Ye, J Gao, J Feng, Z Wu, T Yu, L Kong - arXiv preprint arXiv:2210.12329, 2022 - arxiv.org
Recently, dataset-generation-based zero-shot learning has shown promising results by
training a task-specific model with a dataset synthesized from large pre-trained language …

Generalized but not robust? comparing the effects of data modification methods on out-of-domain generalization and adversarial robustness

T Gokhale, S Mishra, M Luo, BS Sachdeva… - arXiv preprint arXiv …, 2022 - arxiv.org
Data modification, whether via additional training datasets, data augmentation, debiasing, or
dataset filtering, has been proposed as an effective solution for generalizing to out-of …

Large-scale evaluation of topic models and dimensionality reduction methods for 2D text spatialization

D Atzberger, T Cech, M Trapp, R Richter… - … on Visualization and …, 2023 - ieeexplore.ieee.org
Topic models are a class of unsupervised learning algorithms for detecting the semantic
structure within a text corpus. Together with a subsequent dimensionality reduction …

ILDAE: Instance-level difficulty analysis of evaluation data

N Varshney, S Mishra, C Baral - arXiv preprint arXiv:2203.03073, 2022 - arxiv.org
Knowledge of questions' difficulty level helps a teacher in several ways, such as estimating
students' potential quickly by asking carefully selected questions and improving quality of …

Do we need to create big datasets to learn a task?

S Mishra, BS Sachdeva - … of SustaiNLP: Workshop on Simple and …, 2020 - aclanthology.org
Deep Learning research has been largely accelerated by the development of huge datasets
such as ImageNet. The general trend has been to create big datasets to make a deep neural …