Adversarial Robustness for Estimation and Alignment
Discipline
Statistics and Probability
Subject
adversarial robustness
distribution shift
jailbreaking
minimax estimation
red teaming
Abstract
As machine learning models are deployed in an ever-wider range of settings with increasing influence and capability, there is growing interest in ensuring that these models are robust and aligned with human intentions. To this end, we analyze robust models and adversarial inputs in a variety of settings. We first study statistical estimation under Wasserstein distribution shifts, an adversarial setting in which every data point may undergo a bounded perturbation, and analyze several statistical problems in this setting, including location estimation, linear regression, and non-parametric density estimation. We then evaluate alignment in modern foundation models and propose automated methods for constructing adversarial inputs, developing black-box algorithms that generate adversarial prompts for text-to-image models and jailbreaks for language models. Lastly, we introduce JailbreakBench, a benchmark for reproducible jailbreak evaluation.
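For concreteness, the per-point bounded-perturbation model described above admits a schematic minimax formulation; the notation below (the distance W_\infty, the loss \ell, and the target functional \theta) is illustrative rather than quoted from the thesis:

% An adversary replaces the clean distribution P with any Q within
% Wasserstein-infinity distance \varepsilon of P. Since
% W_\infty(P, Q) \le \varepsilon admits a coupling with displacement
% at most \varepsilon almost surely, each observed sample can be
% viewed as Y_i = X_i + \delta_i with \|\delta_i\| \le \varepsilon
% and X_i \sim P, i.e., every data point is perturbed by a bounded
% amount. The estimator sees only the shifted sample.
\[
  R_n(\varepsilon)
  \;=\; \inf_{\hat\theta}\;
  \sup_{Q \,:\, W_\infty(P,\,Q) \le \varepsilon}\;
  \mathbb{E}_{Y_1,\dots,Y_n \sim Q}
  \Bigl[\,\ell\bigl(\hat\theta(Y_1,\dots,Y_n),\,\theta(P)\bigr)\Bigr].
\]

The infimum ranges over estimators with access only to the shifted sample; minimax rates in this setting quantify how the achievable risk R_n(\varepsilon) degrades with the perturbation budget \varepsilon for each estimation problem.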