Reproducible, Generalizable, and Scalable Analytic Software for Large Neuroimaging Datasets
Discipline
Bioinformatics
Neuroscience and Neurobiology
Subject
MRI
Neuroimaging
Reproducibility
Software
Abstract
Neuroimaging using magnetic resonance imaging (MRI) has become one of the primary methods for understanding human brain structure and function. However, numerous recent reports together constitute a crisis of reproducibility in human neuroimaging studies. This problem can be particularly acute in large and complex neuroimaging datasets. Although researchers have started to adopt standards such as the Brain Imaging Data Structure (BIDS) and BIDS Apps, tools that facilitate reproducible research with large-scale datasets remain nascent. The overall goal of this thesis was to develop reproducible, generalizable, and scalable analytic software for large neuroimaging data resources. This effort yielded two novel software packages: BIDS App Bootstrap (BABS) and ModelArray. BABS is a user-friendly Python package that provides a reproducible and generalizable workflow for large-scale image processing using BIDS Apps. BABS automatically records the full audit trail of the image processing by using the data version control tool DataLad and adopting the FAIRly big framework. BABS scales to large datasets and supports job management on high performance computing (HPC) clusters. BABS also generalizes across use cases, including different BIDS datasets and BIDS Apps, and its user-friendly interface makes it accessible to general users. The second software package in this thesis, ModelArray, is an R package for memory-efficient and generalizable statistical analysis of large-scale datasets. Its memory efficiency allows it to be applied to large datasets even on local computers with limited resources. ModelArray supports mass-univariate statistical analysis using linear models and nonlinear generalized additive models (GAMs). Because ModelArray is extensible, other statistical models available in R can also be incorporated. Furthermore, ModelArray provides a consistent workflow for large datasets of different data types, including fixel-wise, voxel-wise, and surface data, and can be generalized to other data types. In addition to its open-source code, ModelArray is released as a Docker container, which facilitates portability and reproducible statistical analysis. Taken together, the generalizable tools developed in this thesis facilitate reproducible neuroimaging research at scale.