Towards Trustworthy Large Language Models
Discipline
Computer Sciences
Subject
Artificial Intelligence
Large Language Models
Abstract
Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, including open-domain question answering, conversational agents, and mathematical reasoning. As their adoption increases, so too does user reliance on their outputs. However, current LLMs remain insufficiently trustworthy: they can produce hallucinations, toxic content, or incorrect code, potentially leading to harmful consequences if such outputs go unchecked. This thesis aims to advance the trustworthiness of LLMs. We begin by formally defining trustworthy LLMs, particularly in contexts where responses must satisfy correctness or safety constraints. We introduce the notion of a desirable response set $\Sigma(x)$, which maps an input $x$ to a set of acceptable outputs. A trustworthy LLM is then one whose generation $y$ for input $x$ satisfies $y \in \Sigma(x)$. This formulation provides a unifying framework applicable to various tasks, each requiring tailored methods to enforce trustworthiness. For open-domain QA, where $\Sigma(x)$ is typically unobservable, we propose TRAQ, a method that provides correctness guarantees for retrieval-augmented generation (RAG) with LLMs. We also present Conformal Structured Prediction, an algorithm that combines statistical testing and integer programming to construct interpretable prediction sets for structured tasks. Furthermore, for questions where obtaining trustworthy responses requires involved reasoning, we investigate reinforcement learning (RL) techniques for LLM fine-tuning and identify sampling as the principal computational bottleneck. To address this, we introduce DASH, an accelerated RL algorithm that reduces training time by 83% relative to standard methods while maintaining performance. Finally, for safety alignment, where strictly enforcing $y \in \Sigma(x)$ may degrade helpfulness, we develop CAN, a one-shot constrained optimization framework that balances safety and helpfulness, achieving alignment without extensive tuning or retraining.
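
To make the abstract's criterion concrete: writing $f$ for the model, trustworthiness requires $f(x) \in \Sigma(x)$ for every input $x$. When $\Sigma(x)$ is unobservable, as in open-domain QA, conformal methods such as TRAQ instead return a prediction set $C(x)$. A coverage guarantee of the kind such methods typically target (the precise statement is fixed in the thesis body, so the form below is an assumption on our part) is
$$\Pr\big[\,\Sigma(x) \cap C(x) \neq \emptyset\,\big] \;\ge\; 1 - \alpha,$$
i.e., with probability at least $1 - \alpha$, the returned set contains at least one desirable response.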
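
A guarantee of this form can be obtained with the generic split conformal recipe: score held-out calibration examples, take a finite-sample-corrected quantile of the scores, and keep every candidate answer under the resulting threshold. The following is a minimal, self-contained sketch of that generic recipe, not TRAQ's actual pipeline; the nonconformity score, the choice of $\alpha$, and the example values are illustrative assumptions.

    import numpy as np

    # Minimal split-conformal sketch (illustrative; not TRAQ's pipeline).
    # cal_scores: nonconformity scores of calibration examples whose correct
    # answer is known (e.g., 1 minus the model's confidence in that answer).
    def conformal_threshold(cal_scores, alpha=0.1):
        n = len(cal_scores)
        # Finite-sample-corrected quantile level for coverage >= 1 - alpha.
        level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
        return np.quantile(cal_scores, level, method="higher")

    def prediction_set(candidates, scores, tau):
        # Keep every candidate answer whose score falls below the threshold.
        return [c for c, s in zip(candidates, scores) if s <= tau]

    # Hypothetical usage: all scores and candidates are made-up numbers.
    tau = conformal_threshold(np.array([0.12, 0.35, 0.08, 0.41, 0.22]), alpha=0.2)
    print(prediction_set(["Paris", "Lyon", "Marseille"], [0.05, 0.30, 0.80], tau))

Under exchangeability of the calibration and test examples, the corrected quantile yields the marginal coverage stated above; the interpretability and structured-output aspects (as in Conformal Structured Prediction) require additional machinery beyond this sketch.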
Advisor
Bastani, Osbert