Large language models for data annotation: A survey

Z Tan, D Li, S Wang, A Beigi, B Jiang… - arXiv preprint arXiv …, 2024 - arxiv.org
Data annotation generally refers to the labeling or generating of raw data with relevant
information, which could be used for improving the efficacy of machine learning models. The …

Siren's song in the AI ocean: a survey on hallucination in large language models

Y Zhang, Y Li, L Cui, D Cai, L Liu, T Fu… - arXiv preprint arXiv …, 2023 - arxiv.org
While large language models (LLMs) have demonstrated remarkable capabilities across a
range of downstream tasks, a significant concern revolves around their propensity to exhibit …

AI models collapse when trained on recursively generated data

I Shumailov, Z Shumaylov, Y Zhao, N Papernot… - Nature, 2024 - nature.com
Stable diffusion revolutionized image creation from descriptive text. GPT-2 (ref.), GPT-3(.5)
(ref.) and GPT-4 (ref.) demonstrated high performance across a variety of language tasks …

The science of detecting LLM-generated text

R Tang, YN Chuang, X Hu - Communications of the ACM, 2024 - dl.acm.org
Communications of the ACM, Volume 67, Number 4 (2024), Pages 50-59. The Science of
Detecting LLM-Generated Text …

A survey on LLM-generated text detection: Necessity, methods, and future directions

J Wu, S Yang, R Zhan, Y Yuan, DF Wong… - arXiv preprint arXiv …, 2023 - arxiv.org
The powerful ability to understand, follow, and generate complex language emerging from
large language models (LLMs) makes LLM-generated text flood many areas of our daily …

TinyGSM: achieving >80% on GSM8k with small language models

B Liu, S Bubeck, R Eldan, J Kulkarni, Y Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Small-scale models offer various computational advantages, and yet to which extent size is
critical for problem-solving abilities remains an open question. Specifically for solving grade …

Are large language models a threat to digital public goods? Evidence from activity on Stack Overflow

M del Rio-Chanona, N Laurentsyeva… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models like ChatGPT efficiently provide users with information about
various topics, presenting a potential substitute for searching the web and asking people for …

Quantitative text analysis

KL Nielbo, F Karsdorp, M Wevers, A Lassche… - Nature Reviews …, 2024 - nature.com
Text analysis has undergone substantial evolution since its inception, moving from manual
qualitative assessments to sophisticated quantitative and computational methods. Beginning …

Large language models suffer from their own output: An analysis of the self-consuming training loop

M Briesch, D Sobania, F Rothlauf - arXiv preprint arXiv:2311.16822, 2023 - arxiv.org
Large language models (LLM) have become state of the art in many benchmarks and
conversational LLM applications like ChatGPT are now widely used by the public. Those …

Rephrasing the web: A recipe for compute and data-efficient language modeling

P Maini, S Seto, H Bai, D Grangier, Y Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models are trained on massive scrapes of the web, which are often
unstructured, noisy, and poorly phrased. Current scaling laws show that learning from such …