The Good, the Bad and the Ugly: Watermarks, Transferable Attacks and Adversarial Defenses

G Głuch, B Turan, SG Nagarajan, S Pokutta - arXiv preprint arXiv …, 2024 - arxiv.org
We formalize and extend existing definitions of backdoor-based watermarks and adversarial
defenses as interactive protocols between two players. The existence of these schemes is …

Hardness of Deceptive Certificate Selection

S Wäldchen - World Conference on Explainable Artificial Intelligence, 2023 - Springer
Recent progress towards theoretical interpretability guarantees for AI has been made with
classifiers that are based on interactive proof systems. A prover selects a certificate from the …

Models That Prove Their Own Correctness

N Amit, S Goldwasser, O Paradise… - arXiv preprint arXiv …, 2024 - arxiv.org
How can we trust the correctness of a learned model on a particular input of interest? Model
accuracy is typically measured on average over a distribution of inputs, giving no …

Extending Merlin-Arthur Classifiers for Improved Interpretability

B Turan - xAI (Late-breaking Work, Demos, Doctoral Consortium), 2023 - ceur-ws.org
In my doctoral research, I aim to address the interpretability challenges associated with deep
learning by extending the Merlin-Arthur Classifier framework. This novel approach employs …

Unified Taxonomy in AI Safety: Watermarks, Adversarial Defenses, and Transferable Attacks

G Głuch, SG Nagarajan, B Turan - ICML 2024 Workshop on Theoretical … - openreview.net
As AI becomes omnipresent in today's world, it is crucial to study the safety aspects of
learning, such as guaranteed watermarking capabilities and defenses against adversarial …