Adversarial Robustness for Estimation and Alignment

Degree type

Doctor of Philosophy (PhD)

Graduate group

Statistics and Data Science

Discipline

Data Science
Statistics and Probability

Subject

adversarial prompt
adversarial robustness
distribution shift
jailbreaking
minimax estimation
red teaming

Copyright date

2024

Abstract

As machine learning models are deployed in a multitude of settings with increasing levels of influence and competence, there is growing interest in ensuring that these models are robust and aligned with human intentions. To this end, we analyze robust models and adversarial inputs in a variety of settings. We explore statistical estimation under the adversarial setting of Wasserstein distribution shifts, where every data point may undergo a bounded perturbation. We analyze several statistical problems, including location estimation, linear regression, and non-parametric density estimation. Furthermore, we evaluate alignment in modern foundation models and propose automated methods to construct adversarial inputs. We develop black-box automated algorithms to generate adversarial prompts for text-to-image models and jailbreaks for language models. Lastly, we introduce a benchmark, JailbreakBench, for reproducible jailbreak evaluation.
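As a minimal sketch (not drawn from the record itself), the Wasserstein distribution-shift setting described above is commonly formalized as a minimax estimation problem in which the adversary may move each sample by at most a fixed budget; the symbols P, Q, epsilon, the loss, and the target parameter below are illustrative assumptions rather than the thesis's exact notation.

\[
\inf_{\hat{\theta}} \; \sup_{Q \,:\, W_{\infty}(P, Q) \le \epsilon} \; \mathbb{E}_{X_1, \ldots, X_n \sim Q} \Big[ \ell\big( \hat{\theta}(X_1, \ldots, X_n), \, \theta(P) \big) \Big]
\]

Here \(W_{\infty}\) denotes the infinity-Wasserstein distance, so every sample drawn from the shifted distribution \(Q\) can be coupled with a sample from the clean distribution \(P\) at distance at most \(\epsilon\), matching the abstract's description of a bounded perturbation of every data point.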

Date of degree

2024
