TrustLLM: Trustworthiness in large language models

Y Huang, L Sun, H Wang, S Wu, Q Zhang, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs), exemplified by ChatGPT, have gained considerable
attention for their excellent natural language processing capabilities. Nonetheless, these …

Jailbreak and guard aligned language models with only few in-context demonstrations

Z Wei, Y Wang, A Li, Y Mo, Y Wang - arXiv preprint arXiv:2310.06387, 2023 - arxiv.org
Large Language Models (LLMs) have shown remarkable success in various tasks, yet their
safety and the risk of generating harmful content remain pressing concerns. In this paper, we …

Certifying LLM safety against adversarial prompting

A Kumar, C Agarwal, S Srinivas, AJ Li, S Feizi… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) are vulnerable to adversarial attacks that add malicious
tokens to an input prompt to bypass the safety guardrails of an LLM and cause it to produce …

Position: TrustLLM: Trustworthiness in large language models

Y Huang, L Sun, H Wang, S Wu… - International …, 2024 - proceedings.mlr.press
Large language models (LLMs) have gained considerable attention for their excellent
natural language processing capabilities. Nonetheless, these LLMs present many …

Red-Teaming for generative AI: Silver bullet or security theater?

M Feffer, A Sinha, WH Deng, ZC Lipton… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
In response to rising concerns surrounding the safety, security, and trustworthiness of
Generative AI (GenAI) models, practitioners and regulators alike have pointed to AI red …

Defending against alignment-breaking attacks via robustly aligned LLM

B Cao, Y Cao, L Lin, J Chen - arXiv preprint arXiv:2309.14348, 2023 - arxiv.org
Recently, Large Language Models (LLMs) have made significant advancements and are
now widely used across various domains. Unfortunately, there has been a rising concern …

Tree of attacks: Jailbreaking black-box LLMs automatically

A Mehrotra, M Zampetakis, P Kassianik… - arXiv preprint arXiv …, 2023 - ciso2ciso.com
While Large Language Models (LLMs) display versatile functionality, they continue
to generate harmful, biased, and toxic content, as demonstrated by the prevalence of …

Jailbreak attacks and defenses against large language models: A survey

S Yi, Y Liu, Z Sun, T Cong, X He, J Song, K Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have performed exceptionally well in various text-generation
tasks, including question answering, translation, code completion, etc. However, the over …

Optimization-based prompt injection attack to LLM-as-a-Judge

J Shi, Z Yuan, Y Liu, Y Huang, P Zhou, L Sun… - Proceedings of the …, 2024 - dl.acm.org
LLM-as-a-Judge uses a large language model (LLM) to select the best response from a set
of candidates for a given question. LLM-as-a-Judge has many applications such as LLM …

Assessing the brittleness of safety alignment via pruning and low-rank modifications

B Wei, K Huang, Y Huang, T Xie, X Qi, M Xia… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) show inherent brittleness in their safety mechanisms, as
evidenced by their susceptibility to jailbreaking and even non-malicious fine-tuning. This …