Reproducible, Generalizable, and Scalable Analytic Software for Large Neuroimaging Datasets

Zhao, Chenying

Files

Zhao_upenngdas_0175C_16081.pdf (6.18 MB)

Degree type

Doctor of Philosophy (PhD)

Graduate group

Bioengineering

Discipline

Engineering
Bioinformatics
Neuroscience and Neurobiology

Subject

Big data
MRI
Neuroimaging
Reproducibility
Software

Copyright date

2023

Permalink

https://repository.upenn.edu/handle/20.500.14332/59426

Author

Zhao, Chenying

Abstract

Neuroimaging using magnetic resonance imaging (MRI) has become one of the primary methods for understanding human brain structure and function. However, numerous recent reports together point to a crisis of reproducibility in human neuroimaging studies. This problem can be particularly acute in large and complex neuroimaging datasets. Although researchers have started to adopt standards such as the Brain Imaging Data Structure (BIDS) and BIDS Apps, tools that facilitate reproducible research with large-scale datasets remain nascent. The overall goal of this thesis was to develop reproducible, generalizable, and scalable analytic software for large neuroimaging data resources. This effort yielded two novel software packages: BIDS App Bootstrap (BABS) and ModelArray. BABS is a user-friendly Python package that provides a reproducible and generalizable workflow for large-scale image processing using BIDS Apps. BABS automatically records the full audit trail of the image processing by utilizing the data version control tool DataLad and adopting the FAIRly big framework. BABS scales to large datasets and supports job management on high performance computing (HPC) clusters. BABS also generalizes to different use cases, including different BIDS datasets and BIDS Apps. Its user-friendly interface facilitates its application by general users. The second software package in this thesis, ModelArray, is an R package for memory-efficient and generalizable statistical analysis of large-scale datasets. Its memory efficiency allows it to be applied to large datasets even on local computers with limited resources. ModelArray supports mass-univariate statistical analysis using linear models and nonlinear generalized additive models (GAMs). Because ModelArray is extensible, diverse statistical models available in R can be incorporated into it. Furthermore, ModelArray provides a consistent workflow for large datasets with different data types, including fixel-wise, voxel-wise, and surface data, and generalizes to other data types. In addition to open-source code, ModelArray is released as a Docker container, which facilitates portability and reproducible statistical analysis. Taken together, the generalizable tools developed in this thesis facilitate reproducible neuroimaging research at scale.

Advisor

Satterthwaite, Theodore D.

Date of degree

2023

Collection

Dissertations and Theses