2023 Highlights from Biostatistics and Health Data Science Research

In biostatistics and health data science, our faculty members are constantly pushing the boundaries of knowledge and application. This section proudly presents a curated collection of their groundbreaking work published in 2023. Each entry offers a glimpse into the innovative research undertaken by our faculty and its potential impact on understanding and solving complex health-related challenges. Let's celebrate their achievements and their contributions to medicine, public health, policy-making, and beyond.

Zhonghua Liu, ScD

Dr. Zhonghua Liu and his collaborators have developed ImmuneMirror, a machine learning-based integrative pipeline and web server for cancer neoantigen prediction using genetic and genomic data. ImmuneMirror can guide clinicians in tailoring treatment strategies to patients' genomic and transcriptomic profiles for precision medicine, and can facilitate clinical trial design and patient selection, with broad prospects for clinical application.

Chuwdhury, G., Guo, Y.*, Cheung, C., Lam, K., Kam, N., Liu, Z.#, Dai, W.# (2024). ImmuneMirror: a Machine Learning-based Integrative Pipeline and Web Server for Neoantigen Prediction. Briefings in Bioinformatics (In press). https://doi.org/10.1093/bib/bbae024

Jeff Goldsmith, PhD

Many health-related outcomes and exposures vary over the menstrual cycle. Accounting for this source of variation could improve model accuracy and statistical power. However, cycle day is difficult to measure through self-report or to estimate using available methods, and it is routinely overlooked in studies involving women's health. Dr. Jeff Goldsmith and his Ph.D. student Madison Stoms have developed and validated a novel analytic pipeline to estimate cycle day using hormone data derived from a single spot urine sample. This easy, affordable approach allows researchers to account for menstrual cycle day in any study that has collected urine samples. Their collaborative team, which includes Julie Herbstman (EHS) and Lauren Houghton (Epi), anticipates using this pipeline to evaluate the contribution of time in cycle to variation in affective disorder symptoms (e.g., anxiety and depression).

Stoms, M., Houghton, L., Terry, M.B., Ulanday, K., Herbstman, J., and Goldsmith, J. Estimation of Menstrual Cycle Day using Cross-Sectional Biomarker Measurements. Under Review.

Yuanjia Wang, PhD

Digital technologies (e.g., mobile phones) can be used to provide objective, frequent, and real-world monitoring of an individual’s health status. However, modeling digital phenotypes collected by passive or active sensing technologies poses substantial challenges due to confounding (e.g., time-of-day effects) and various sources of variability. Dr. Yuanjia Wang and her Ph.D. student, Tianchen Xu, have developed a mixed-response state-space (MRSS) model that jointly captures multi-dimensional, multi-modal digital phenotypes and their measurement processes through a finite number of latent state time series. These latent states reflect dynamic health status and personalized time-varying treatment effects while adjusting for informative measurements. Their innovative approach has been applied to a large-scale observational smartphone-based study to evaluate the feasibility of remotely collecting frequent information from patients with Parkinson’s disease about daily changes in their symptom severity and their sensitivity to medication.

Xu T, Chen Y, Zeng D, Wang Y (2023). Mixed-Response State-Space Model for Analyzing Multi-Dimensional Digital Phenotypes. Journal of the American Statistical Association, 118:544, 2288-2300. DOI: 10.1080/01621459.2023.2225742

Shuang Wang, PhD

To subtype diseases using multiple types of omics data, Dr. Shuang Wang and her research team developed PartIES: a Partition-level Integration framework that uses diffusion-Enhanced Similarities. PartIES partitions individual diffusion-enhanced similarity matrices to capture distinct data-type-specific cluster structures using a spectral-based method, and integrates low-rank partition-information-induced similarity matrices iteratively through a weighted average.

Yuqi Miao, Huang Xu, Shuang Wang (2024) “PartIES: a disease subtyping framework with Partition-level Integration using diffusion-Enhanced Similarities from Multi-omics Data”, submitted to Briefings in Bioinformatics

Ying Kuen Cheung, PhD

Dr. Cheung and his collaborators tackled the challenge of finding minimum effective doses in clinical trials of treatments with multiple dose dimensions, inspired by a study that aimed to reduce glucose levels by introducing breaks in sitting time. Their focus was on identifying the lowest break frequency and duration needed to lower glucose levels effectively. Because the effects of different frequency and duration combinations on glucose reduction are complex, they developed a framework that estimates the minimum effective combinations using an asymmetric posterior gain function, which values different kinds of correct decisions differently and is tuned adaptively through ϵ-tapering. Their tailored method is effective in reducing decision errors and improving the chances of correctly identifying beneficial treatments, achieving high success rates in various test scenarios.

Cheung, Y. K., Chandereng, T., & Diaz, K. M. (2022). A novel framework to estimate multidimensional minimum effective doses using asymmetric posterior gain and ϵ-tapering. The Annals of Applied Statistics, 16(3), 1445-1458.

Daniel Malinsky, PhD

Dr. Malinsky and his collaborators tackled two pervasive biases in observational research: confounding and outcome data that is missing not-at-random. Systematic missing data is often seen in scenarios such as substance use surveys and mental health assessments, where the data fails to capture information on participants who are the worst off. The team's pioneering approach utilized graphical models, specifically directed acyclic graphs (DAGs), to discern auxiliary variables capable of simultaneously correcting for both types of bias. This methodology not only clarifies the conditions under which certain variables can mitigate biases but also significantly advances the field by improving the integrity and validity of causal inferences from observational studies.

Chen, Malinsky, and Bhattacharya (2023). “Causal Inference with Outcome-Dependent Missingness and Self-Censoring” in Proceedings of the 39th Conference on Uncertainty in Artificial Intelligence (UAI).

Caleb Miles, PhD

When a causal relationship between a certain exposure and outcome is established, investigators will often be interested in understanding the mechanisms that explain why this causal effect occurs. An effect that operates through such a mechanism is known as a mediated effect, and the study of such effects is known as mediation analysis. Learning about mediated effects from data is known to be challenging in that it relies on very strong assumptions, stronger in fact than those required for most other causal effects. Dr. Miles critically evaluated the emerging concept of randomized interventional indirect effects, which try to sidestep these rigid assumptions. He demonstrated that such effects lack a true mediational interpretation (without making additional strong assumptions) and could falsely suggest mediation even when no mediated effect exists for any subject in the population. This critical insight not only refines the theoretical understanding of mediated effects but also guides researchers in the cautious application of this approach to causal inference.

Miles, C. H. (2023). On the causal interpretation of randomized interventional indirect effects. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85(4), 1154-1172.

Tian Gu, PhD

Despite the high-quality, data-rich samples collected by recent large-scale biobanks, the underrepresentation of participants from minority and disadvantaged groups has limited the use of biobank data for developing disease risk prediction models that generalize to diverse populations. Dr. Tian Gu and her collaborators address this critical challenge by proposing a transfer learning framework based on random forest models (TransRF). TransRF can incorporate risk prediction models trained in a source population to improve prediction performance in an underrepresented target population with a limited sample size. The feasibility of TransRF was evaluated by building separate breast cancer risk assessment models for African-ancestry women and South Asian women using UK Biobank data. This approach holds the potential to significantly improve health outcomes by enabling more accurate and inclusive disease risk assessments across diverse populations.

Gu, T., Han, Y., & Duan, R. (2023). A transfer learning approach based on random forest with application to breast cancer prediction in underrepresented populations. Pacific Symposium on Biocomputing (pp. 186-197).

Xiao Wu, PhD

Wildfire and wildfire smoke are recognized as important threats to health that reach far beyond the flames. Dr. Xiao Wu and his collaborators delve into not just the problems but also the solutions. Through the novel use of synthetic control methods on satellite-derived fire activity data across entire forestlands of California, the research team quantifies the role of managed, low-intensity fire as a driver of beneficial fuel treatment in fire-adapted ecosystems and provides evidence that low-intensity fires substantially reduce the risk of future high-intensity fires. These findings support a policy transition from fire suppression to the restoration of a presuppression and precolonial fire regime in California through increased use of prescribed fire, cultural burning, and managed wildfire.

Wu, X., Sverdrup, E., Mastrandrea, M. D., Wara, M. W., & Wager, S. (2023). Low-intensity fires mitigate the risk of high-intensity wildfires in California’s forests. Science Advances, 9(45), eadi4123.

Molei Liu, PhD

Dr. Molei Liu and his collaborator have developed the Maxway conditional randomization test (CRT), an innovative approach that significantly enhances conditional independence testing and causal inference in complex datasets. This method stands out for its ability to provide exact inferences while accommodating any machine learning-based test statistics. The test is designed for biomedical applications such as surrogate-assisted semi-supervised learning and transfer learning, and was validated both theoretically and empirically. Its potential applications include improving genetic association studies in minority populations and investigating the obesity paradox in electronic health records (EHR) studies, thereby promising significant advancements in biomedical research and personalized medicine.

Li, S., Liu, M. (2023). Maxway CRT: Improving the Robustness of Model-X Inference. Journal of the Royal Statistical Society Series B; forthcoming.
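
For readers unfamiliar with the conditional randomization test that Maxway CRT builds on, the generic procedure (not the Maxway variant itself) can be sketched as follows. This is a minimal illustration under stated assumptions: the distribution of X given Z is taken to be known exactly, and the function names, the toy data-generating model, and the simple test statistic are all hypothetical choices made for this example rather than anything taken from the paper.

```python
import numpy as np

def crt_pvalue(x, y, z, sample_x_given_z, statistic, n_resamples=199, rng=None):
    """Generic conditional randomization test of 'X independent of Y given Z'.

    Assumes sample_x_given_z(z, rng) draws from the true law of X given Z.
    """
    rng = np.random.default_rng(rng)
    t_obs = statistic(x, y, z)
    # Recompute the statistic on copies of X redrawn from X | Z,
    # which mimic the null because they ignore Y entirely.
    t_null = np.array([
        statistic(sample_x_given_z(z, rng), y, z) for _ in range(n_resamples)
    ])
    # Rank of the observed statistic among the resampled ones.
    return (1 + np.sum(t_null >= t_obs)) / (n_resamples + 1)

# Toy model (an assumption for illustration): X | Z ~ Normal(Z, 1).
def sample_x_given_z(z, rng):
    return z + rng.standard_normal(z.shape)

# Any statistic may be plugged in; here, a simple residual covariance.
def statistic(x, y, z):
    return abs(np.mean((x - z) * y))

rng = np.random.default_rng(0)
n = 500
z = rng.standard_normal(n)
x = sample_x_given_z(z, rng)
y_alt = 2.0 * x + z + rng.standard_normal(n)   # Y depends on X given Z
y_null = z + rng.standard_normal(n)            # Y depends on Z only

p_alt = crt_pvalue(x, y_alt, z, sample_x_given_z, statistic, rng=1)
p_null = crt_pvalue(x, y_null, z, sample_x_given_z, statistic, rng=2)
print(f"p-value with a true effect: {p_alt:.3f}")
print(f"p-value under the null:     {p_null:.3f}")
```

The statistic here is deliberately simple; in practice it could be replaced by a machine learning-based measure of association without changing the resampling logic.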

Linda Valeri, PhD

In mHealth studies, missing data are particularly prevalent, and in psychiatric populations, non-response to surveys due to underlying symptoms is a common challenge. Dr. Valeri and her PhD student Charlotte Fowler have introduced a new approach to testing for non-stationarity when data are missing not at random. Beyond its methodological contribution, this research holds substantive importance by guiding investigators in making appropriate modeling choices for the analysis of intensive longitudinal data. For instance, it sheds light on whether mixed-effect models can be appropriately employed or whether more flexible dynamic modeling that accommodates non-stationarity is warranted.

Fowler, C., Cai, X., Baker, J. T., Onnela, J. P., & Valeri, L. (2024). Testing unit root non-stationarity in the presence of missing data in univariate time series of mobile health studies. Journal of the Royal Statistical Society, Series C, in press.

Todd Ogden, PhD

Dr. Todd Ogden has a long-standing interest in precision medicine, i.e., choosing the treatment that is best for each patient based on their individual characteristics, particularly when those characteristics include imaging or other complex data. Together with collaborators, he generalized the functional additive regression model by incorporating treatment-specific effects into the additive effects and constraining the main effects and interaction effects to be orthogonal. This circumvents the need to estimate the main effects of the covariates, so their form does not have to be specified and the issue of model misspecification is avoided. They illustrated their methods with an application to data from a clinical trial of depression medication, using EEG functional data as the patients’ covariates.

Park H, Petkova E, Tarpey T, and Ogden RT (2023). Functional additive models for optimizing individualized treatment rules. Biometrics 79: 113–126.

Wenpin Hou, PhD

Dr. Wenpin Hou and her collaborator are pioneering research on integrating large language models into biomedical data and imaging analyses. They developed GPTCelltype, a GPT-4-based, reference-free, and cost-effective automated method for cell type annotation in single-cell RNA-seq analysis, which can be seamlessly integrated into existing analysis pipelines. They demonstrated that it can automatically and accurately annotate cell types by utilizing marker gene information generated from standard single-cell RNA-seq analysis pipelines. Evaluated across hundreds of tissue types and cell types, GPT-4 generates cell type annotations that exhibit strong concordance with manual annotations and has the potential to considerably reduce the effort and expertise needed for cell type annotation.

Hou, W.* and Ji, Z.* 2023. Reference-free and cost-effective automated cell type annotation with GPT-4 in single-cell RNA-seq analysis. Preprint in bioRxiv, 2023 April 21. Software package: GPTCelltype. Accepted in principle by Nature Methods.

Yifei Sun, PhD

With the availability of massive amounts of data from electronic health records and registry databases, incorporating time-varying patient information to improve risk prediction has attracted great attention. To exploit the growing amount of predictor information over time, Dr. Yifei Sun and collaborators developed a unified framework for landmark prediction using survival tree ensembles, in which an updated prediction can be made whenever new information becomes available. Compared to conventional landmark prediction with fixed landmark times, the methods allow the landmark times to be subject-specific and triggered by an intermediate clinical event. Moreover, the nonparametric approach circumvents the thorny issue of model incompatibility at different landmark times. The methods were applied to Cystic Fibrosis Foundation Patient Registry (CFFPR) data to perform dynamic prediction of lung disease in cystic fibrosis patients and to identify important prognostic factors.

Sun Y, Chiou SH, McGarry M, Huang C-Y (2023). Dynamic risk prediction triggered by intermediate clinical events using survival tree ensembles. Annals of Applied Statistics. 17(2):1375-1397.

Shing Lee, PhD

Traditional methods to summarize and report adverse event data from clinical trials are tabular, failing to adequately depict the complex and high-dimensional nature of adverse events. Our faculty members have developed novel dynamic data visualization methods, with accompanying R Shiny web applications, to enable a more comprehensive assessment of adverse events that reflects their high-dimensional nature without sacrificing the reporting of rare events. These novel data visualization approaches have already been implemented in practice to evaluate the adverse event profiles of cancer treatments.

Shing M. Lee et al. Novel Approaches for Dynamic Visualization of Adverse Event Data in Oncology Clinical Trials: A Case Study Using Immunotherapy Trial S1400-I (SWOG). JCO Clin Cancer Inform 7, e2200165 (2023).

Qixuan Chen, PhD

Growth mixture models (GMMs) have been widely used to identify distinct longitudinal trajectories of psychiatric and mental health outcomes over time. However, research on GMMs for analyzing data from a complex sample survey is sparse. Dr. Qixuan Chen and her student Rebecca Anthopolos developed a Bayesian GMM for complex survey data and showed that the Bayesian GMM can yield a reduction in bias in the estimation of regression coefficients when design features are associated with survey outcomes and can lead to more efficient estimates than weighted approaches when the design is noninformative. Using the Bayesian GMM, they characterized four clinically meaningful longitudinal trajectories of post-traumatic stress disorder (PTSD) and identified associated risk factors among residents of southeastern Texas in the aftermath of Hurricane Ike. They built the R package Bsvygmm for model fitting, selection, and checking.

Anthopolos, R., Chen, Q., Sedransk, J., Thompson, M., Meng, G., Galea, S. (2023). “A Bayesian growth mixture model for complex survey data: Clustering postdisaster PTSD trajectories.” The Annals of Applied Statistics, 17(3), 2494–2514.