STATISTICAL FOUNDATIONS OF SINGLE CELL OPEN CHROMATIN ASSAYS

Files

Miao_upenngdas_0175C_16805.pdf (18.68 MB)

Degree type

Doctor of Philosophy (PhD)

Graduate group

Genomics and Computational Biology

Discipline

Bioinformatics
Genetics and Genomics
Statistics and Probability

Subject

computational biology
genomics
single cell
statistical test

Copyright date

2024

Abstract

Single-cell genomic assays are emerging technologies that measure biomolecules in individual cells. The ability to measure molecules at single-cell resolution provides unique opportunities to disentangle complex biological processes. Powered by technological advances and the ever-decreasing cost of next-generation sequencing, a large volume of data has been generated. However, even with these “big data,” efforts to build a comprehensive “atlas” of cells have proven challenging. The challenge arises from several characteristics of the data. First, single-cell data are very sparse: biomolecules are often incompletely profiled because of technological difficulties in recovering and amplifying molecules, and because sequencing is kept below saturation for cost reasons. Second, the data are sensitive to technical artifacts. Even subtle variations in sample preparation can lead to substantial variations in the recovered signal. Unless these confounding factors are corrected, biological discoveries may be hampered by arbitrarily large type I error rates. Third, for many genomic assays there is no consensus “feature set,” and quantification approaches also diverge. For DNA-level assays in particular, there is no consensus notion of genomic regions of interest, and even given a set of genomic regions, methods vary considerably in how they count reads in each region. In this dissertation, we first thoroughly review single-cell genomic assays and strategies for integrating datasets to glean biological knowledge beyond individual assays. We then focus on the single-cell open chromatin assay and aim to provide statistical foundations for consistent, uniform, and versatile data analysis. We introduce Paired Insertion Counting (PIC) to address inconsistent quantification steps and to answer a long-standing question in the field: whether information from single-cell open chromatin assays is binary or quantitative. We also develop a Probability model of Accessible Chromatin in Single cells (PACS), which aims to conduct differential testing while addressing sparsity and the presence of multiple causal factors in complex datasets. By introducing a missing-data-corrected Cumulative Logistic Regression framework (mcCLR), we extend the conventional Generalized Linear Model (GLM) framework to incorporate individual-specific missing data in statistical tests. Taken together, this dissertation comprises two main computational tools that model the underlying biological signals and account for the unique properties of single-cell assays.

Date of degree

2024
