Trustworthy Machine Learning: Specification, Verification, and Explanation
Subject
Large Language Models
Machine Learning
Robustness
Safety
Verification
Abstract
In the rapidly changing landscape of machine learning (ML), advanced models and systems are increasingly deployed in safety-critical domains. Nonetheless, ensuring their reliability remains a persistent challenge. The central difficulty lies in their often complex, opaque, and black-box nature, which hinders the formal specification, rigorous assurance, and clear interpretation of their behavior. We directly address these challenges by tackling three critical problem areas in trustworthy machine learning: the specification, verification, and explanation of complex ML systems. We begin with the fundamental challenge of behavioral specification, focusing on precisely characterizing when a large language model (LLM) correctly follows user-specified instructions. To this end, we propose a logic-based framework for formalizing LLM rule adherence. We demonstrate that real instances of rule adherence and violation, e.g., jailbreak attacks, empirically mirror our theoretical predictions, yielding a descriptive framework for understanding these behaviors. Next, we develop scalable verification techniques to formally guarantee the safety of ML models. Specifically, we address computational bottlenecks in expressive semidefinite programming (SDP)-based frameworks for neural network verification. By identifying and exploiting chordal sparsity patterns in the large semidefinite constraints, we obtain speedups of several orders of magnitude, making SDP-based methods more practical for modern networks. Finally, we address the pressing need for reliable insights into complex model behavior through robust explanations. We develop Multiplicative Smoothing (MuS) to provide certified guarantees on the robustness of feature attribution methods. Building on this, we extend our approach with the Stability Certification Algorithm (SCA), which offers more flexible and practical guarantees. Together, these methods provide non-trivial, practical, and model-agnostic guarantees for common explanation techniques, enhancing their reliability. In summary, this work advances the specification, verification, and explanation of complex machine learning systems. These contributions are important steps towards designing and building ML systems that are reliable and trustworthy.
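To make the multiplicative-smoothing idea mentioned above concrete, the following is a minimal sketch, assuming a NumPy-style model that maps a feature vector to a prediction: it averages the model's output over random keep/drop feature masks, the core masking operation behind MuS-style smoothing. The function names, parameters, and toy model are illustrative assumptions, not the dissertation's implementation.

import numpy as np

def smoothed_prediction(model, x, keep_prob=0.75, n_samples=1000, seed=0):
    # Average model(x * mask) over random binary masks that keep each
    # feature independently with probability keep_prob (multiplicative
    # masking). Hypothetical sketch only, not the dissertation's code.
    rng = np.random.default_rng(seed)
    outputs = []
    for _ in range(n_samples):
        mask = (rng.random(x.shape) < keep_prob).astype(x.dtype)
        outputs.append(model(x * mask))
    return np.mean(outputs, axis=0)

if __name__ == "__main__":
    # Toy linear "model" on a 5-dimensional input; the smoothed output
    # should be close to keep_prob * (w @ x) = 0.75 * 3.0.
    w = np.array([0.5, -1.0, 2.0, 0.0, 1.5])
    model = lambda z: w @ z
    x = np.ones(5)
    print(smoothed_prediction(model, x))

Averaging over such masks dampens the effect of dropping any individual feature, which is the intuition behind the certified stability guarantees for feature attributions described in the abstract.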
Advisor
Wong, Eric