All versions - Academic resource search

Are aligned neural networks adversarially aligned?

N Carlini, M Nasr… - Advances in …, 2024 - proceedings.neurips.cc

Large language models are now tuned to align with the goals of their creators, namely to be
"helpful and harmless." These models should respond helpfully to user questions, but refuse …

Cited by 175 Related articles

Are aligned neural networks adversarially aligned?

N Carlini, M Nasr, CA Choquette-Choo… - … -seventh Conference on … - openreview.net

Large language models are now tuned to align with the goals of their creators, namely to be
"helpful and harmless." These models should respond helpfully to user questions, but refuse …

Are aligned neural networks adversarially aligned?

N Carlini, M Nasr, CA Choquette-Choo… - arXiv e …, 2023 - ui.adsabs.harvard.edu

Large language models are now tuned to align with the goals of their creators, namely to be
"helpful and harmless." These models should respond helpfully to user questions, but refuse …

Are aligned neural networks adversarially aligned?

N Carlini, M Nasr… - 37th Annual …, 2023 - research-collection.ethz.ch

Large language models are now tuned to align with the goals of their creators, namely to be
"helpful and harmless." These models should respond helpfully to user questions, but refuse …

Are aligned neural networks adversarially aligned?

N Carlini, M Nasr, CA Choquette-Choo… - Proceedings of the 37th …, 2023 - dl.acm.org

Large language models are now tuned to align with the goals of their creators, namely to be
"helpful and harmless." These models should respond helpfully to user questions, but refuse …

Are aligned neural networks adversarially aligned?

N Carlini, M Nasr, CA Choquette-Choo… - arXiv preprint arXiv …, 2023 - arxiv.org

Large language models are now tuned to align with the goals of their creators, namely to be
"helpful and harmless." These models should respond helpfully to user questions, but refuse …