Stealing part of a production language model

N Carlini, D Paleka, KD Dvijotham, T Steinke… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce the first model-stealing attack that extracts precise, nontrivial information from
black-box production language models like OpenAI's ChatGPT or Google's PaLM-2 …
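The snippet only states that the attack extracts nontrivial information; the paper's core observation is that a model's final logits are a linear image of an h-dimensional hidden state, so a matrix of logit vectors collected across many prompts has numerical rank h, revealing the hidden dimension. A minimal NumPy sketch of that dimension-recovery idea, with the API call simulated by a random projection (all sizes are illustrative, not the paper's setup):

```python
# Minimal sketch: recovering a hidden dimension from full logit vectors.
# The black-box "API" is simulated here; the real attack reconstructs
# full logit vectors from a restricted logprob / logit-bias API.
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 4096      # illustrative vocabulary size
hidden_dim = 256       # the secret we want to recover
n_queries = 512        # number of distinct prompts queried

# Simulated model head: logits = W @ hidden_state.
W = rng.normal(size=(vocab_size, hidden_dim))

def query_logits(prompt_seed: int) -> np.ndarray:
    """Stand-in for one API call returning the full logit vector."""
    h = rng.normal(size=hidden_dim)  # hidden state induced by this prompt
    return W @ h

# Stack one logit vector per query into an (n_queries, vocab_size) matrix.
Q = np.stack([query_logits(i) for i in range(n_queries)])

# Singular values drop sharply after index hidden_dim; counting the
# values above a tolerance estimates the hidden dimension.
s = np.linalg.svd(Q, compute_uv=False)
recovered_dim = int(np.sum(s > s[0] * 1e-8))
print("recovered hidden dimension:", recovered_dim)  # -> 256
```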

Privacy Backdoors: Stealing Data with Corrupted Pretrained Models

S Feng, F Tramèr - arXiv preprint arXiv:2404.00473, 2024 - arxiv.org
Practitioners commonly download pretrained machine learning models from open
repositories and finetune them to fit specific applications. We show that this practice …

A False Sense of Safety: Unsafe Information Leakage in 'Safe' AI Responses

D Glukhov, Z Han, I Shumailov, V Papyan… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) are vulnerable to jailbreaks – methods to elicit harmful or generally impermissible outputs. Safety measures are developed and …

Sequencing the Neurome: Towards Scalable Exact Parameter Reconstruction of Black-Box Neural Networks

J Goldfeder, Q Roets, G Guo, J Wright… - arXiv preprint arXiv …, 2024 - arxiv.org
Inferring the exact parameters of a neural network with only query access is an NP-Hard
problem, with few practical existing algorithms. Solutions would have major implications for …