Insights Into Functional Noncoding RNA Elements Through The Analysis Of Human Genetic Variation

David Sheng Ming Lee, University of Pennsylvania


Most of the human genome is noncoding, but determining how and when genetic variation in noncoding regions affects biology and disease susceptibility remains challenging. Here, we apply an integrated genomics approach to elucidate new patterns of functional genetic variation in untranslated regions of protein-coding messenger RNAs.

G-quadruplex (G4) sequences are abundant in untranslated regions (UTRs) of human messenger RNAs, but their functional importance remains unclear. In Part 1 of this dissertation, we integrate multiple sources of genetic and genomic data to show that putative G-quadruplex-forming sequences (pG4s) in 5’ and 3’ UTRs are selectively constrained and enriched for cis-eQTLs and RNA-binding protein (RBP) interactions. Using over 15,000 whole genome sequences, we find evidence of strong negative selection acting on the central guanines of UTR pG4s. At multiple GWAS-implicated SNPs within UTR pG4 sequences, we find robust allelic imbalance in gene expression across diverse tissue contexts in GTEx, suggesting that variants affecting G4 formation in UTRs may also contribute to phenotypic variation. Our results establish UTR G4s as important cis-regulatory elements and point to a link between disruption of UTR pG4s and disease.

In Part 2 of this dissertation, we examine patterns of selective pressure in non-canonical open reading frames (ncORFs) mapped throughout the human genome. Ribosome profiling has uncovered pervasive translation in ncORFs, but the biological significance of this phenomenon remains unclear. Using genetic variation from 71,702 human genomes, we assess patterns of selection in translated upstream open reading frames (uORFs) in 5’ UTRs. We show that uORF variants introducing new stop codons, or strengthening existing stop codons, are under strong negative selection comparable to that acting on protein-coding missense variants. Using these variants, we map and validate new gene-disease associations in two independent biobanks containing exome sequencing from 10,900 and 32,268 individuals, respectively, and elucidate their impact on protein expression in human cells. Our results suggest new mechanisms relating uORF variation to reduced protein expression and demonstrate that translation at uORFs is genetically constrained in 50% of human genes.

Together, these studies emphasize the importance of noncoding RNA regulatory elements in mediating post-transcriptional regulation of gene expression and illuminate new patterns of functional variation in UTRs that are relevant to human disease.