Private Federated Analytics At Scale

Roth, Edo

Files

Roth_upenngdas_0175C_15412.pdf (7.49 MB)

Degree type

Doctor of Philosophy (PhD)

Graduate group

Computer and Information Science

Subject

cryptography
privacy
security
systems
Computer Sciences

Copyright date

2022-09-17T20:22:00-07:00

Permalink

https://repository.upenn.edu/handle/20.500.14332/31841

Author

Roth, Edo

Abstract

Collecting distributed data from millions of individuals for analytics is a common scenario, from Apple collecting typed words and emojis to improve its keyboard suggestions to Google collecting location data to see how busy restaurants and businesses are. This data is often sensitive and can reveal too much about the individuals and communities whose data is analyzed en masse. Differential privacy has become the gold-standard method for giving strong individual privacy guarantees while releasing aggregate statistics about sensitive data. However, the process of computing such statistics can itself be a privacy risk. For instance, a simple approach would be to collect all the raw data at a single central entity, which then computes and releases the statistics. This entity must then be trusted not to abuse the raw data; in practice, it can be difficult to find an entity with the requisite level of trust. In this thesis, we describe a new approach that uses cryptographic techniques to collect data privately and safely, without placing trust in any party. Although the natural candidates, such as secure multiparty computation (MPC) and fully homomorphic encryption (FHE), do not scale to millions of parties on their own, our key insight is that computations can be refactored so that they can be carried out with simpler techniques that do scale, such as additively homomorphic encryption. Our solution restructures centralized computations into distributed protocols that can be executed efficiently at scale. The systems we design based on this approach can support billions of participants and can handle a variety of real queries from the literature, including machine learning tasks, Pregel-style graph queries, and queries over large categorical data. We automate the distributed refactoring so that analysts can write a query as if the data were centralized, without understanding how the rewriting works, and we protect against malicious parties who aim to poison or bias the results.
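
To make the idea of additively homomorphic encryption concrete, the following is a minimal sketch of a toy Paillier cryptosystem used to sum encrypted values. It is an illustration only and is not taken from the thesis: the function names (`keygen`, `encrypt`, `aggregate`, `decrypt`), the choice of Paillier, the small primes, and the single key holder are all assumptions made for this example. It omits differential-privacy noise and any defense against malicious participants.

```python
import math
import secrets

# Toy Paillier cryptosystem (additively homomorphic), for illustration only.
# ASSUMPTION: the two Mersenne primes below are chosen for readability;
# a modulus this small is NOT secure.
P = 2**31 - 1
Q = 2**61 - 1


def keygen(p=P, q=Q):
    """Return (public modulus n, private key (phi, mu))."""
    n = p * q
    phi = (p - 1) * (q - 1)
    mu = pow(phi, -1, n)  # modular inverse of phi mod n (Python 3.8+)
    return n, (phi, mu)


def encrypt(n, m):
    """Encrypt an integer 0 <= m < n using generator g = n + 1."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random r in [1, n - 1]
        if math.gcd(r, n) == 1:
            break
    # g^m = (1 + n)^m = 1 + m*n (mod n^2)
    return ((1 + m * n) % n2) * pow(r, n, n2) % n2


def aggregate(n, ciphertexts):
    """Multiply ciphertexts; the product encrypts the sum of the plaintexts."""
    n2 = n * n
    total = 1
    for c in ciphertexts:
        total = total * c % n2
    return total


def decrypt(n, private_key, c):
    """Recover the plaintext (mod n) from a ciphertext."""
    phi, mu = private_key
    n2 = n * n
    l_value = (pow(c, phi, n2) - 1) // n
    return l_value * mu % n


if __name__ == "__main__":
    n, private_key = keygen()
    reports = [1, 0, 1, 1, 0]  # hypothetical per-user values
    ciphertexts = [encrypt(n, v) for v in reports]  # done on each device
    encrypted_sum = aggregate(n, ciphertexts)  # aggregator never sees raw values
    print(decrypt(n, private_key, encrypted_sum))  # 3, the sum of reports
```

In this sketch the aggregator handles only ciphertexts, and only the holder of the private key can learn the total. A single key holder is a simplification made here; it would itself need to be trusted, which is the kind of assumption the abstract aims to avoid.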

Advisor

Andreas Haeberlen

Date of degree

2022-01-01

Collection

Dissertations and Theses