Adversarial Robustness for Estimation and Alignment

Degree type
Doctor of Philosophy (PhD)
Graduate group
Statistics and Data Science
Discipline
Data Science
Statistics and Probability
Subject
adversarial prompt
adversarial robustness
distribution shift
jailbreaking
minimax estimation
red teaming
Copyright date
01/01/2024
Author
Chao, Patrick
Abstract

As machine learning models are deployed in an ever-wider range of settings with increasing influence and capability, there is growing interest in ensuring that these models are robust and aligned with human intentions. To this end, we analyze robust models and adversarial inputs in a variety of settings. We first study statistical estimation under the adversarial setting of Wasserstein distribution shifts, in which every data point may undergo a bounded perturbation. We analyze several statistical problems, including location estimation, linear regression, and nonparametric density estimation. We then evaluate alignment in modern foundation models and propose automated methods for constructing adversarial inputs, developing black-box algorithms that generate adversarial prompts for text-to-image models and jailbreaks for language models. Lastly, we introduce JailbreakBench, a benchmark for reproducible jailbreak evaluation.
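
As a worked illustration of the threat model described above (the notation below is a sketch under assumed conventions, not necessarily those used in the dissertation), the Wasserstein-perturbation setting can be written as a minimax problem in which an adversary may move each observation, subject to an average perturbation budget:

% Clean sample Y_1, ..., Y_n drawn i.i.d. from an unknown P; the
% adversary picks perturbations delta_i subject to an average
% budget epsilon, and the estimator only sees X_i = Y_i + delta_i.
\[
  R_n(\epsilon)
  = \inf_{\hat\theta}\;
    \sup_{P \in \mathcal{P}}\;
    \sup_{\substack{\delta_1,\dots,\delta_n \\ \frac{1}{n}\sum_{i=1}^{n}\|\delta_i\| \le \epsilon}}
    \mathbb{E}\,\bigl\|\hat\theta(Y_1+\delta_1,\dots,Y_n+\delta_n)-\theta(P)\bigr\|^2 .
\]

Setting \(\epsilon = 0\) recovers the classical minimax risk; the inner supremum captures the abstract's statement that every data point may undergo a bounded perturbation.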

Advisor
Dobriban, Edgar
Date of degree
2024