STATISTICAL FOUNDATIONS OF SINGLE CELL OPEN CHROMATIN ASSAYS

Degree type
Doctor of Philosophy (PhD)
Graduate group
Genomics and Computational Biology
Discipline
Bioinformatics
Genetics and Genomics
Statistics and Probability
Subject
computational biology
genomics
single cell
statistical test
Copyright date
2024
Author
Miao, Zhen
Abstract

Single-cell genomic assays are emerging technologies that measure biomolecules in individual cells. Measurement at single-cell resolution provides unique opportunities to disentangle complex biological processes. Powered by technological advances and the ever-decreasing cost of next-generation sequencing, large volumes of data have been generated. However, even with these “big data,” efforts to build a comprehensive “atlas” of cells have proven challenging. The challenge arises from several characteristics of the data. First, single-cell data are very sparse: biomolecules are often incompletely profiled because of technical difficulties in recovering and amplifying molecules, and because sequencing is kept below saturation to control cost. Second, the data are sensitive to technical artifacts. Even subtle variations in sample preparation can lead to substantial variation in the recovered signal. If these confounding factors are not corrected, biological discoveries may be hampered by arbitrarily large type I error rates. Third, for many genomic assays there is no consensus “feature set,” and quantification approaches also diverge. For DNA-level assays in particular, there is no consensus definition of the genomic regions of interest, and even for a given set of regions, methods differ considerably in how they count reads in each region. In this dissertation, we first thoroughly review single-cell genomic assays and strategies for integrating datasets to glean biological knowledge beyond what individual assays provide. We then focus on the single-cell open chromatin assay and aim to provide statistical foundations for consistent, uniform, and versatile data analysis. We introduce Paired Insertion Counting (PIC) to address inconsistent quantification steps and to answer a long-standing question in the field: whether information from single-cell open chromatin assays is binary or quantitative. We also develop a Probability model of Accessible Chromatin in Single cells (PACS), which aims to conduct differential testing while addressing sparsity and the presence of multiple causal factors in complex datasets. By introducing a missing-data-corrected Cumulative Logistic Regression framework (mcCLR), we extend the conventional Generalized Linear Model (GLM) framework to incorporate individual-specific missing data in statistical tests. Taken together, this dissertation comprises two main computational tools that model the underlying biological signals and account for the unique properties of single-cell assays.

Advisor
Kim, Junhyong
Date of degree
2024