A comprehensive survey of AI-generated content (AIGC): A history of generative AI from GAN to ChatGPT
Recently, ChatGPT, along with DALL-E-2 and Codex, has been gaining significant attention
from society. As a result, many individuals have become interested in related resources and …
AI deception: A survey of examples, risks, and potential solutions
This paper argues that a range of current AI systems have learned how to deceive humans.
We define deception as the systematic inducement of false beliefs in the pursuit of some …
GPT-4 technical report
We report the development of GPT-4, a large-scale, multimodal model which can accept
image and text inputs and produce text outputs. While less capable than humans in many …
Taxonomy of risks posed by language models
Responsible innovation on large-scale Language Models (LMs) requires foresight into and
in-depth understanding of the risks these models may pose. This paper develops a …
Generative language models and automated influence operations: Emerging threats and potential mitigations
Generative language models have improved drastically, and can now produce realistic text
outputs that are difficult to distinguish from human-written content. For malicious actors …
WebGPT: Browser-assisted question-answering with human feedback
We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing
environment, which allows the model to search and navigate the web. By setting up the task …
TruthfulQA: Measuring how models mimic human falsehoods
We propose a benchmark to measure whether a language model is truthful in generating
answers to questions. The benchmark comprises 817 questions that span 38 categories …
CRITIC: Large language models can self-correct with tool-interactive critiquing
Recent developments in large language models (LLMs) have been impressive. However,
these models sometimes show inconsistencies and problematic behavior, such as …
Weak-to-strong generalization: Eliciting strong capabilities with weak supervision
Widely used alignment techniques, such as reinforcement learning from human feedback
(RLHF), rely on the ability of humans to supervise model behavior, for example, to evaluate …
Discovering latent knowledge in language models without supervision
Existing techniques for training language models can be misaligned with the truth: if we train
models with imitation learning, they may reproduce errors that humans make; if we train …