Adversarial Robustness for Estimation and Alignment
Discipline
Statistics and Probability
Subject
adversarial robustness
distribution shift
jailbreaking
minimax estimation
red teaming
Abstract
As machine learning models are deployed in an ever-wider range of settings with increasing influence and capability, there is growing interest in ensuring that these models are robust and aligned with human intentions. To this end, we analyze robust models and adversarial inputs in a variety of settings. We first study statistical estimation under Wasserstein distribution shifts, an adversarial setting in which every data point may undergo a bounded perturbation, and analyze several statistical problems in this setting, including location estimation, linear regression, and non-parametric density estimation. We then evaluate alignment in modern foundation models and propose automated methods for constructing adversarial inputs, developing black-box algorithms that generate adversarial prompts for text-to-image models and jailbreaks for language models. Lastly, we introduce JailbreakBench, a benchmark for reproducible jailbreak evaluation.
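For concreteness, the per-point bounded-perturbation model described above admits a schematic minimax formulation; the notation below (the distance W_\infty, the loss \ell, and the target functional \theta) is illustrative rather than quoted from the thesis:

% An adversary replaces the clean distribution P with any Q within
% Wasserstein-infinity distance \varepsilon of P. Since
% W_\infty(P, Q) \le \varepsilon admits a coupling with displacement
% at most \varepsilon almost surely, each observed sample can be
% viewed as Y_i = X_i + \delta_i with \|\delta_i\| \le \varepsilon
% and X_i \sim P, i.e., every data point is perturbed by a bounded
% amount. The estimator sees only the shifted sample.
\[
  R_n(\varepsilon)
  \;=\; \inf_{\hat\theta}\;
  \sup_{Q \,:\, W_\infty(P,\,Q) \le \varepsilon}\;
  \mathbb{E}_{Y_1,\dots,Y_n \sim Q}
  \Bigl[\,\ell\bigl(\hat\theta(Y_1,\dots,Y_n),\,\theta(P)\bigr)\Bigr].
\]

The infimum ranges over estimators with access only to the shifted sample; minimax rates in this setting quantify how the achievable risk R_n(\varepsilon) degrades with the perturbation budget \varepsilon for each estimation problem.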