Detoxifying Large Language Models via Knowledge Editing

M Wang, N Zhang, Z Xu, Z Xi, S Deng, Y Yao… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper investigates using knowledge editing techniques to detoxify Large Language
Models (LLMs). We construct a benchmark, SafeEdit, which covers nine unsafe categories …

High-Dimension Human Value Representation in Large Language Models

S Cahyawijaya, D Chen, Y Bang, L Khalatbari… - arXiv preprint arXiv …, 2024 - arxiv.org
The widespread application of Large Language Models (LLMs) across various tasks and
fields has necessitated the alignment of these models with human values and preferences …

Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context

N Das, E Raff, M Gaur - arXiv preprint arXiv:2407.14644, 2024 - arxiv.org
Previous research on testing vulnerabilities in Large Language Models (LLMs) using
adversarial attacks has primarily focused on nonsensical prompt injections, which are easily …