Large language model as attributed training data generator: A tale of diversity and bias
Large language models (LLMs) have been recently leveraged as training data generators
for various natural language processing (NLP) tasks. While previous research has explored …
Dataset cartography: Mapping and diagnosing datasets with training dynamics
Large datasets have become commonplace in NLP research. However, the increased
emphasis on data quantity has made it challenging to assess the quality of data. We …
NumGLUE: A suite of fundamental yet challenging mathematical reasoning tasks
Given the ubiquitous nature of numbers in text, reasoning with numbers to perform simple
calculations is an important skill of AI systems. While many datasets and models have been …
Don't Blame the Annotator: Bias Already Starts in the Annotation Instructions
In recent years, progress in NLU has been driven by benchmarks. These benchmarks are
typically collected by crowdsourcing, where annotators write examples based on annotation …
Unleashing the power of data tsunami: A comprehensive survey on data assessment and selection for instruction tuning of language models
Instruction tuning plays a critical role in aligning large language models (LLMs) with human
preference. Despite the vast amount of open instruction datasets, naively training an LLM on …
ProGen: Progressive zero-shot dataset generation via in-context feedback
Recently, dataset-generation-based zero-shot learning has shown promising results by
training a task-specific model with a dataset synthesized from large pre-trained language …
Generalized but not robust? Comparing the effects of data modification methods on out-of-domain generalization and adversarial robustness
Data modification, whether via additional training datasets, data augmentation, debiasing, or
dataset filtering, has been proposed as an effective solution for generalizing to out-of …
Large-scale evaluation of topic models and dimensionality reduction methods for 2D text spatialization
Topic models are a class of unsupervised learning algorithms for detecting the semantic
structure within a text corpus. Together with a subsequent dimensionality reduction …
ILDAE: Instance-level difficulty analysis of evaluation data
Knowledge of questions' difficulty level helps a teacher in several ways, such as estimating
students' potential quickly by asking carefully selected questions and improving quality of …
Do we need to create big datasets to learn a task?
S Mishra, BS Sachdeva - … of SustaiNLP: Workshop on Simple and …, 2020 - aclanthology.org
Deep Learning research has been largely accelerated by the development of huge datasets
such as Imagenet. The general trend has been to create big datasets to make a deep neural …