2025 Biostatistics Practicum/APEx Symposium

The Biostatistics MS and MPH Programs require students to complete one semester of hands-on, real-world experience through a Practicum (for MS students) or Applied Practice Experience, or APEx (for MPH students).
This requirement gives students the chance to apply their classroom knowledge in a real-world public health setting, translating theory into practice. Each project is tailored to the student’s interests, skills, and career goals, and reflects a partnership with a public health organization or research initiative. It is a culmination of their academic journey at Columbia Mailman School of Public Health.
The Practicum/APEx Symposium is a showcase of this work. On Friday, May 2nd, students will present their projects to faculty, peers, and the broader Columbia community. It’s a celebration of their achievements and a day of learning that spans the breadth and depth of Biostatistics in action.
We invite all CUIMC and Mailman School students, faculty, staff, and affiliates to join us in supporting and learning from the graduating Class of 2025.
You can view the titles and abstracts for each session below, organized by topic and room number, or in the digital program booklet.
Welcome, Introduction, & Keynote Speaker (12:00 - 1:00)
Keynote Speaker: Hartaig Singh (Class of 2018)
"Paths & Perspectives"
Hartaig is currently a Staff Data Scientist at Zocdoc, where he leverages statistics and machine learning to drive strategic marketing insights. Since earning his MS in Biostatistics Theory & Methods from Columbia University’s Mailman School of Public Health in 2018, he has built a diverse career spanning consulting, healthcare, pharmaceuticals, and tech startups. His work has consistently focused on data science with a specialization in marketing analytics, and he enjoys combining rigorous statistical methods with real-world business impact.
Session 1 (1:00pm - 2:00pm)
Environmental Health Research - MPH (HSC 303)
Qianxuan Huang
Predicting General Health Outcomes Using Environmental and Socioeconomic Factors: A Longitudinal Analysis of BRFSS Data
This study aims to develop a regression model to predict general health levels based on environmental health predictors, including race, income level, education, sex, disability status, diseases, medication use, medical testing, location, household composition, and cohabitation status, among others. The analysis is based on data from the Behavioral Risk Factor Surveillance System (BRFSS), a continuous, state-based surveillance system that collects information on modifiable risk factors for chronic diseases and other leading causes of death. The dataset includes survey responses from 2000 to 2023, covering 7,411 survey questions.
In addition to building the predictive model, this study examines how these variables have changed over time and their impact on general health. R is used for data cleaning, modeling, and visualization. As the study is still in progress, final results and conclusions are not yet available.
Xiwen Wu
Water Quality in Public Health
During my internship with the NYC Department of Environmental Protection (DEP), I worked within the Marine Sciences Section (MSS) of the Bureau of Wastewater Treatment. My responsibilities included conducting fieldwork aboard DEP vessels and landside vehicles to collect and analyze water quality samples as part of the Harbor Survey sampling program. In the laboratory, I assisted with sample processing and quality assurance/quality control (QA/QC) procedures. Additionally, I performed data entry to maintain accurate records in the DEP database and contributed to research projects, staying current with environmental policies and marine science literature. I also participated in special sampling projects, supporting the team in expanding water monitoring efforts. Through this experience, I developed technical skills in environmental data collection and analysis, strengthened my scientific research abilities, and enhanced my team collaboration and problem-solving skills in a professional setting.
Shizhe Zhang
Greenspace and Health Equity in the South Bronx: Assessing Environmental Disparities and Public Health Impacts
The South Bronx, a historically underserved region of New York City, faces significant environmental and health disparities, exacerbated by limited access to greenspace. This APEx examines the relationship between greenspace availability and health outcomes in the South Bronx, comparing it with similar urban areas. Utilizing data from the New York State government and recent literature, this study highlights the impacts of greenspace on reducing heat-related mortality, improving air quality, and mitigating chronic diseases such as asthma, cardiovascular diseases, and diabetes. Findings indicate that the South Bronx experiences disproportionately high rates of asthma hospitalizations and cardiovascular conditions, with greenspace playing a critical role in public health interventions.
However, accessibility to existing green spaces remains inequitable, further reinforcing health disparities. By advocating for the expansion and equitable distribution of greenspace, this report underscores the necessity of integrating environmental justice into urban planning to promote community health and resilience.
Aakansha Bagepally
Associations of Hair Product Use, Women’s Health, and Reproductive Outcomes
Personal care product (PCP) use is markedly higher among women, indicating a higher likelihood of harmful chemical exposures. Research on PCP usage has indicated links to cancer, endocrine disruption, negative reproductive effects, and impacts on children’s development. This project presents a preliminary analysis of the impact of hair product use on birth outcomes. Data analyzed came from the Mothers and Newborns Cohort and Fairstart Cohort from the Columbia Center for Children’s Environmental Health. Maternal age, race, and socioeconomic status were among the covariates, with hair dye and chemical hair relaxers as exposures and gestational age and birthweight as continuous outcomes. 2x2 tables with dichotomous coding of the exposures and outcomes showed no significant differences. T-tests and regression analysis followed.
Additionally, this position entailed working on a new study focused on chemical hair relaxers in comparison with other straightening methods, and collecting and conducting biomarker testing on urine samples. Responsibilities with the study included sample collection, survey distribution, semi-structured interviewing, and participant recruitment.
Data Analysis - MPH (HSC 305)
Yu Huang
Choice Of Metric And The Effect Of Scan Length For Reliability In Resting-State fMRI
Resting-state fMRI (rs-fMRI) is widely used to investigate brain functional connectivity, but the reliability of these measurements remains a key concern for ensuring reproducibility. In this study, we used distance-based intraclass correlation coefficients (dbICC) to assess the reliability of rs-fMRI data from the Midnight Scanning Club (MSC) dataset, which consists of 10 subjects, each undergoing 10 sessions of 30-minute rs-fMRI scans. We employed two distance metrics—Euclidean and Affine Invariant Riemannian Metric (AIRM)—to evaluate how the choice of metric affects reliability. Additionally, we investigated the impact of scan length and time intervals between sessions on reliability. The two metrics yielded variability in reliability estimates, reflecting their distinct mathematical properties. We found that longer scan lengths significantly improve reliability, while the time interval between sessions has minimal impact. These insights contribute to ongoing efforts to improve the consistency and accuracy of brain connectivity assessments.
Tianqi Li
Analysis And Interpretation Of Population Differences in Structural Variation Of The Human Genome
This APEx took place at the Institute of Biophysics, Chinese Academy of Sciences. Under the guidance of tutors and research group members, students participated in various research activities, including literature review, data analysis, and results compilation. The literature review involved reading relevant papers to inform the development of computational methods. Python was used to calculate the average number of tandem repeats, while R was employed for data visualization. Data from the 1000 Genomes Project (1KGP) were extracted to assign samples to different sub-populations, and the average tandem repeat count was computed for each sub-population. The findings were documented in a written report.
Coco Ni
Enhancing Data-Driven Decision-Making for Long-Acting Injectable HIV Treatment: Development of an Interactive R Shiny Dashboard
This project contributes to the Accelerating Implementation of Multilevel Strategies to Advance Long-Acting Injectables for Underserved Populations (ALAI UP) initiative, which aims to assist clinics across the United States in developing long-acting injectable antiretroviral therapy (LA ART) programs for HIV treatment to address health inequities. The work focuses on data management, analysis, and visualization to facilitate evidence-based decision-making for program implementation.
Client-level data from seven clinical sites, varying in size and data quality, required extensive cleaning, validation, and merging to ensure consistency across key variables, including demographics, social determinants of health (SDoH), comorbidities, and HIV clinical markers. A major deliverable is an interactive R Shiny dashboard that dynamically visualizes trends in LA ART uptake, such as counseling, screening, prescriptions, insurance coverage, and injection adherence. The dashboard is designed to support real-time data exploration.
Xin Hui
Population Differences In Gene Expression And Resulting Differences In Disease Incidence
Human genetics is the scientific study of inherited human variation; it can be used to discover and explain the genetic contribution to human diseases. By studying between-group genetic variation, we may develop more effective disease prevention methods by providing more disease-related testing in high-risk populations. This study aims to find population differences in gene expression by filtering out short tandem repeat (STR) mutations with significant differences in length among different populations. The corresponding diseases caused by each STR mutation were then identified based on previous studies.
Clinical Trials - MPH (HSC LL 103)
Enju Zhang
Building SDTM & ADaM for Regulatory Compliance
My presentation outlines the process of building SDTM (Study Data Tabulation Model) and ADaM (Analysis Data Model) datasets in compliance with regulatory requirements, with a focus on FDA submission readiness. SDTM provides a standardized structure for organizing raw clinical trial data across domains such as Demographics (DM), Adverse Events (AE), and Laboratory Data (LB), while ADaM transforms these into analysis-ready datasets like ADSL, ADAE, and ADLB with an emphasis on traceability.
The presentation covers the full data flow—from CRF design and raw data collection to the creation of Define.XML files and TLFs for submission. Key SAS programming techniques are introduced, including the use of macros, PROC SQL, RETAIN, and PROC TRANSPOSE, which support efficient data handling and automation.
Additionally, strategies for ensuring traceability and quality are discussed, such as the use of annotated CRFs, mapping specifications, and independent programming reviews. The session offers practical guidance for clinical programmers and data managers aiming to deliver compliant, auditable datasets for regulatory review.
Yuanyu Lu
Effects Of Transcranial Alternating Current Stimulation On Measures Of Cognition And Symptom Scores In Chinese Patients With Schizophrenia
Transcranial alternating current stimulation (tACS) may have effects on cognition and symptoms in psychiatric illness, but there have been few randomized controlled studies in people with schizophrenia. We conducted a randomized, sham-controlled, double-blind study of the effects of 40 Hz tACS on measures of cognition and symptom scores in 50 patients diagnosed with DSM-5 schizophrenia. tACS was delivered in 10 sessions (20 min each) over a 2-week period. Evaluations were conducted with multiple cognitive and symptom batteries after 10 sessions and at 2 weeks and 4 weeks post-treatment, and also on-line during tACS stimulation session 1. The primary outcome was change in the MATRICS overall composite score. The results showed no statistically significant (P < 0.05) effects of active vs. sham tACS on improvement in any of the cognitive measures or PANSS-rated positive or negative symptoms. There was a trend (P < 0.06) for the MATRICS domain score of verbal learning to show greater improvement with active tACS than with sham within 1–2 days after the 10 tACS sessions. Additional trials are needed to determine the effective tACS parameters targeting cognition and symptoms of schizophrenia.
Siqi Wang
Site Selection Tool For Analysts
This project developed an R Shiny-based Site Selection Tool for the Division of Analytics & Informatics (DAI) at the FDA to improve the efficiency of selecting clinical trial sites for inspection. The tool integrates statistical approaches, including Mixed Models for Repeated Measures (MMRM) and Cochran-Mantel-Haenszel (CMH) analysis, to identify sites that may significantly impact overall study outcomes. By automating data wrangling, statistical testing, and visualization, this tool streamlines the site selection process, addressing limitations in existing resources and reducing manual effort for statistical analysts and reviewers.
Cancer Research - MS (HSC LL 106)
Angela Dauti
Exploring Real-World Data for Clinical and Pharmacological Use Cases: A Tumor-Specific Analysis in Oncology
Curation of oncology real-world data at Flatiron Health focuses on clinical patterns and endpoints for a specific cancer type. Across research use cases, producing data visualizations in R and SQL is critical, as is a strong understanding of survival analysis. Projects delivered to biopharmaceutical clients include comparisons to common clinical practices (e.g. treatment patterns, patient characteristics, and progression-free survival measures) to ensure that these retrospective datasets are sound. Internally, teams run validation studies to improve our understanding of how machine learning models perform. Knowledge regarding the availability of these data types (e.g. biomarker testing or diagnosis dates) within the patient’s electronic health record helps to fuel innovation in data curation and QA methods. As with any observational data, various factors can lead to findings that do not align with clinical expectations. However, it is important to understand this context to effectively apply real-world data sources to appropriate research questions. This practicum aims to describe some of my learnings as a Senior Data Analyst.
Shuyan Qiu
Real-world Cancer Treatment Patterns And Their Effects In Older Adults With Lung Cancer: A SEER-MHOS Analysis
Introduction: Lung cancer (LC) mostly affects older adults, with a median age of 71 at diagnosis. Due to competing mortality risks, real-world treatment use and outcomes may differ from clinical trial findings. This study examined treatment use and survival outcomes in older LC adults using large population-based data.
Methods: We used SEER-Medicare Health Outcomes Survey (SEER-MHOS) data to identify adults aged ≥65 with non-small cell LC (2007-2019). We assessed first course therapy by age and frailty. The primary outcome was LC-specific survival. Cox models with clone-censoring-weight methods addressed selection and immortal time biases.
Results: Among 11,443 patients (mean age, 77.3), surgery (P<10⁻³) and chemotherapy (P<10⁻³) use declined with age. Frailty was associated with lower likelihood of surgery (P<10⁻³) and chemotherapy (P<10⁻³), with no significant difference for radiotherapy. No survival benefit from first course therapy was seen in patients ≥85.
Conclusion: Age and frailty influence treatment use in older LC adults. Advanced methods reveal limited survival benefits of intensive treatments in those ≥85, underscoring the need for tailored, aging-sensitive care strategies.
Haochen Shi
Examining Allostatic Load Disparities Across Age and Demographic Factors in Breast and Digestive Cancer Patients
Allostatic load (AL) measures cumulative physiological stress, impacting long-term health and cancer outcomes. Examining AL variations in cancer patients reveals health disparities and potential interventions.
This study explores associations between AL, age, and demographic factors in breast and digestive cancer patients, identifying potential disparities in stress burden. We first selected relevant demographic variables and biomarker data to construct a comprehensive dataset for analysis. Key demographic factors are included to examine association and potential disparities. The allostatic score is then computed based on selected biomarkers, providing a quantitative measure of cumulative physiological stress. This score serves as the primary outcome variable. Statistical tests will then be adopted to assess associations between allostatic load, age, and demographic factors, identifying significant relationships and disparities across different population subgroups.
We expect that findings can enhance understanding of stress-related disparities in cancer populations, informing healthcare strategies to reduce chronic stress risks and improve patient outcomes.
Siqing Wang
Disparities In Treatment Delays Among Patients With Pancreatic Adenocarcinoma: A SEER-Medicare Study
Pancreatic cancer is the 10th most common cancer in the U.S. and the 3rd leading cause of cancer-related mortality, largely due to the lack of effective screening, early systemic spread, and aggressive growth. Timely treatment initiation is critical for patients with pancreatic adenocarcinoma, yet disparities may exist across demographic and clinical subgroups. Using SEER-Medicare data, we evaluated the time from diagnostic biopsy to first treatment (<1 month, 1–2 months, >2 months) among patients aged 65 and older. Multivariate logistic regression revealed that race, cancer stage, tumor location, comorbidity, and treatment modality significantly influenced treatment delays. Hispanic patients were more likely to experience delays regardless of stage, and Black patients with distant-stage disease had higher odds of delays exceeding two months compared to White patients. Patients treated with surgery alone had lower odds of delay. Sex, age, socioeconomic status, and geographic location were not associated with delay. These findings underscore racial disparities in care timeliness and suggest that treatment modality and comorbidity burden further influence access to prompt treatment.
Deep Learning - MS (HSC LL 107)
Yimeng Cai
Automated Paper Screening for Clinical Reviews Using Large Language Models: Data Analysis Study
Systematic reviews play a crucial role in evidence-based healthcare research, serving as a key step in meta-analysis. However, traditional manual screening methods for identifying relevant studies are often time-consuming and prone to human error. This project explores the application of large language models (LLMs) to automate the paper screening and data extraction process for Long COVID. By leveraging GPT’s capabilities, the project aims to enhance the efficiency and accuracy of synthesizing health-related evidence while providing deeper insights into the epidemiology and risk factors of Long COVID subtypes. In this project, I will compare GPT’s precision in identifying clinically relevant studies against that of human screeners, evaluate its potential to reduce overall screening time, and provide insights on AI-driven improvements in epidemiological research studies.
Siyan Wen
The Impact of Image Quality on Prediction Performance: A Study Using Alzheimer’s Disease Neuroimaging Initiative Data
Image quality plays a critical role in the performance of deep learning pipelines. In this study, we investigate the effect of image resolution on deep learning-based predictive analysis. We develop a convolutional neural network (CNN) to predict Mini-Mental State Examination (MMSE) scores using structural magnetic resonance imaging (MRI) data from Phase I of the Alzheimer’s Disease Neuroimaging Initiative (ADNI). To systematically assess the impact of image quality, high-resolution MRI scans are downsampled by varying factors to generate lower-resolution images. We evaluate prediction performance across different image resolutions using five-fold cross-validation. Our results demonstrate a continuous decline in model performance as image quality degrades, with a significant drop when using heavily downsampled images. This study underscores the importance of image quality in predictive modeling and highlights the need for more robust deep learning approaches that account for the challenges posed by low-quality neuroimaging data.
Ruiqi Xue
Deep Learning For Identifying Key Brain Features Associated With Apathy In Alzheimer's Disease
Apathy is a common neuropsychiatric symptom in Alzheimer’s disease (AD), linked to poorer outcomes and dysfunction in the brain's reward network. This study applies a deep learning model to 314 neuroimaging-derived brain features to classify apathy status using supervised learning. Feature importance was assessed via permutation-based analysis, identifying key regions, including the caudal anterior cingulate cortex (ACC), rostral middle frontal gyrus, and insula, all areas involved in reward processing. T-tests confirmed significant structural differences in these regions between apathy groups. Visualization using the 'ggseg' package in R highlighted the spatial distribution of top features, reinforcing the role of reward network impairments in apathy. These findings support deep learning as a valuable tool for identifying neuroimaging biomarkers of apathy in AD and contribute to a better understanding of how these biomarkers relate to apathy pathology.
Longitudinal Data Analysis - MS (HSC LL 108A)
Luxuan Zhang
Evaluating The Effectiveness Of Asthma Self-Management For Adolescents (ASMA) In Rural Areas: A GEE-Based Analysis
Background: Asthma remains a significant public health challenge among adolescents, particularly in rural areas. Asthma Self-Management for Adolescents (ASMA) is an evidence-based high school intervention designed to improve asthma control. However, its effectiveness in rural populations remains unexplored.
Objective: The effectiveness-implementation hybrid study aims to evaluate the effectiveness of ASMA among rural, ethnically diverse adolescents with uncontrolled asthma.
Methods: Using a Generalized Estimating Equations (GEE) model, we will assess the intervention’s impact on primary outcomes, including the Asthma Control Test (ACT) score, symptom-free days, and symptom-free nights.
Significance: By addressing a critical gap in asthma management among rural adolescents, this study will provide valuable insights into the clinical viability of ASMA, while advancing the field of behavioral epidemiology and implementation science.
Yuqi Cheng
Longitudinal Associations of Uric Acid with Metabolic Parameters - A Mixed-Effects Analysis in Chinese Patients with Metabolic Disorders
Background: While uric acid (UA) is increasingly recognized as a significant marker of metabolic health, longitudinal studies examining its relationship with multiple metabolic parameters remain limited, particularly in populations with high metabolic burden. This study investigated the temporal patterns and determinants of UA levels in Chinese adults with various metabolic abnormalities.
Methods: We analyzed 903 Chinese adults (66.7% male, mean age 49.4±12.6) with 1,341 observations over a median of 386 days. Linear mixed-effects models assessed associations between UA and body mass index (BMI), HbA1c, blood pressure (BP), and lipids, adjusting for demographics.
Results: The model (conditional R²=0.643) showed that UA was 53.84 μmol/L lower in females (p<0.001). BMI showed a positive association (6.34 μmol/L per unit, p<0.001), HbA1c was inversely associated (-2.89 μmol/L per unit, p=0.016), and systolic BP had a positive association (0.40 μmol/L per mmHg, p=0.036). UA levels changed minimally over time (0.13 μmol/L/month, 95% CI: -0.34 to 0.60).
Conclusions: UA is strongly linked to sex and BMI, inversely to HbA1c, and modestly to BP, highlighting the need for integrated, sex-specific UA management strategies.
Ruoying Deng
Analysis of Medication Effects on Cognitive Outcomes in ADNI Data
This study examines the impact of medication use on cognitive outcomes among participants in the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Multiple datasets including medication records, supplementary backmeds, and clinical data were integrated into a comprehensive longitudinal dataset. Rigorous data cleaning and standardization were applied to harmonize medication names and classify them into categories such as AD-specific drugs and antidepressants. Cross-sectional analyses at baseline assessed the associations between demographics (age, education, race, sex) and memory scores, while longitudinal mixed-effects models evaluated changes in memory over time and the influence of medication use. Visualizations, including LOESS-smoothed plots and subject-specific diagnostic trajectories, revealed that individuals with Alzheimer’s disease exhibit a steeper decline in memory scores compared to those with normal cognition or mild cognitive impairment. These findings underscore the importance of both baseline demographics and medication use in cognitive decline, providing valuable insights for future investigations into treatment effects and disease progression in Alzheimer’s disease.
Yuzhe Hu
Biomarker Trajectories in Alzheimer’s Disease: A Semiparametric Joint Modeling Approach and Its Application
Understanding the evolution of biomarkers prior to the onset of Alzheimer’s disease (AD) is essential for early disease characterization and intervention. This project employs a semiparametric joint modeling framework to characterize longitudinal biomarker trajectories in relation to AD onset. The proposed model conditions biomarker trajectories on event time, enabling a flexible trajectory structure that accounts for both chronological age and age at disease onset. To enhance computational efficiency, key components of the estimation process are implemented in Rcpp, improving computational speed while maintaining compatibility with R’s statistical modeling environment. A profile kernel estimating equation approach is used to estimate regression coefficients and nonparametric baseline mean functions, ensuring robust statistical inference. This methodology is applied to examine biomarker progression patterns, their temporal association with disease onset, and the influence of genetic risk factors.
Missing Data - MS (HSC LL 109A)
Xiaoting Tang
Quantification of Biological Aging for Prognosis in ICU Patients
This research aims to use biological aging to improve the precision of prognosis for ICU patients. While chronological age is a leading risk factor for most chronic diseases, it is a crude proxy for underlying biology. Using a large dataset of clinical measures from the electronic health records (EHRs) of ICU patients, we calculate a pace of biological aging, which might tell us how people end up in cardiothoracic surgery. Our primary focus is to predict 30-day post-surgery mortality by applying various modeling approaches, including multinomial logistic regression, random forest, and latent profile analysis. The results indicate that bioage, particularly phenoage advance, is a robust predictor of 30-day mortality. We also observed that combining bioage measurements from multiple time points improved model performance. This work can support more accurate prediction of surgical outcomes and better preparation for aftercare. However, a key limitation of this research is the presence of missing data, with only 150 observations available for analysis.
Eunice Wang
Sepsis Treatment Bundle Compliance
This practicum project investigates the timing and effectiveness of sepsis treatments initiated following a trigger alert, specifically focusing on why completing the recommended treatment steps within the crucial 3-hour window often fails. The primary aim is to identify specific delays in the sepsis treatment process by examining time stamps associated with each step and sub-step of care, allowing for a detailed analysis of where breakdowns occur. Secondary aims include identifying patient subgroups that are most susceptible to delays and estimating the typical duration of those delays. Initially, exploratory data analysis will be conducted to assess the data. Data cleaning, transformation, and wrangling will be crucial for data preprocessing. This will involve handling missing values, scaling continuous variables, and encoding categorical variables. The distribution of time stamps for each treatment step will be summarized using visualization techniques. The timing of each treatment step will be analyzed, and statistically significant delays will be identified. Subgroup analysis will involve stratifying the data by key patient demographics to explore differential risks of delay.
Derek Lamb
Leveraging trans-pQTLs to Improve Protein Prediction in Biobank-scale Proteome-wide Association Studies
Proteome-wide association studies (PWAS) allow researchers to identify predictive and protective associations between thousands of proteins and phenotypes of interest. As sample sizes for proteomic assays are often much smaller than equivalent genomic studies, PWAS gain power by imputing protein expression from the genome. Developing an appropriate imputation model requires identifying a set of protein quantitative trait loci (pQTLs), while accounting for linkage disequilibrium between nearby genetic loci. Several frequentist and Bayesian tools exist for imputation, but they are limited by only using cis-pQTLs – loci in close proximity to a gene of interest – disregarding the information in the rest of the genome, i.e. trans-pQTLs. This practicum extends existing PWAS methodologies into a pipeline incorporating both cis- and trans-pQTLs. The pipeline is then applied to the UK Biobank Pharma Proteomics Project, containing data on 3k proteins for 42k individuals. Protein expression is imputed for the greater UK Biobank cohort (N=400k individuals) to perform PWAS on ischemic stroke and blood triglycerides.
Chenshuo Pan
Sepsis Treatment Analysis
Our project focuses on whether sepsis treatment bundles can be completed within three hours. This study examined the time intervals between sepsis onset and key treatment steps using data from the NYP EPIC system from 2020 to date. We conducted a comprehensive analysis of the time distribution of key steps, including blood culture collection, lactate measurement, and antibiotic administration. We demonstrated the differences through data visualization techniques and statistical indicators. We also explored hourly trends and relationships between treatment times and healthcare providers to determine where potential delays may occur. Patient demographic characteristics were also analyzed to explore which population subgroups were more likely to experience treatment delays. Our findings provide insights into potential areas of improvement in sepsis clinical workflows.
Topics in Mental Health - MS (HSC LL 202)
Shuchen Dong
Resilience to Suicide Ideation in High-Risk Adolescents: The Role of Physical Activity
This study investigates whether physical activity (PA) reduces suicidality in U.S. adolescents affected by trauma or with a family history of mental health disorders. The primary research question examines the association between PA and suicidal ideation (SI) or attempts (SA), with a focus on high-risk adolescents. The hypothesis posits that regular PA reduces suicidal thoughts and behaviors in these adolescents. Using data from the Adolescent Brain Cognitive Development (ABCD) study, this research employs linear mixed-effects models to analyze the relationship between PA and suicidality. Key variables include PA levels, trauma exposure, and family history of mental health disorders (measured by ACEs). Models control for demographic factors such as age, sex, race/ethnicity, income, education, and familial relationships (twins). Interaction terms are included to assess whether the effects of PA vary based on trauma exposure. Longitudinal analyses track changes in PA, SI, or SA over time, identifying subgroups that may benefit most from PA interventions. The findings suggest that PA is associated with reduced suicidality in high-risk adolescents, supporting its potential as a preventive measure.
Yilei Yang
Factors Influencing Workforce Retention Among Foreign Trained Nursing Workforce: A National Sample Survey Analysis
Registered nurses (RNs), the largest health workforce, are jeopardized by burnout, high workload, and poor work environments. Almost 16% of RNs in the U.S. are foreign-trained, yet their reasons for job turnover remain unknown. We conducted a secondary analysis of the 2018 and 2022 National Sample Survey of Registered Nurses to examine differences between U.S.-trained and foreign-trained nurses in 1) demographic or professional characteristics and 2) job turnover versus retention factors. Descriptive statistics, chi-squared tests, and regression were used. After weighting, our sample represents 3,957,661 (2018) and 4,349,377 (2022) RNs. Intention-to-leave factors including burnout [43.4% (2018), 65.8% (2022)] and better pay [50.4% (2018), 61.1% (2022)] were most endorsed. There were significant differences between groups in years of experience (p<0.01) and job satisfaction (p<0.01). In adjusted models, as burnout increases by 1 unit, intention to leave one’s job increases by 1.25 units (p<0.01) among foreign-trained nurses. This study highlights job turnover reasons by initial nurse training, which should be considered when implementing interventions to increase nurse retention.
Wenyu Zhang
Machine Learning Identification of Thalamic Biomarkers for Early Cognitive Impairment in At-Risk MS
Cognitive deficits are a significant concern in multiple sclerosis (MS), yet early biomarkers remain elusive. This project leverages advanced MRI techniques and machine learning to identify distinct thalamic alterations in individuals at risk for MS. The primary aim is to characterize thalamic volumetric subtypes using the SuStaIn method and correlate these subtypes with cognitive performance on the SDMT and PASAT. Secondary objectives include classifying thalamic radiomic features and comparing their associations with cognitive outcomes. Using an existing dataset of MRI scans and cognitive test results from 99 at-risk subjects and 55 controls, the study employs algorithms like XGBoost with Recursive Feature Elimination and Spearman correlation analysis to uncover predictive imaging biomarkers. This work aspires to improve early detection and guide targeted interventions for cognitive impairment in MS.
Mengxiao Luan
Harmonizing Two Likert Variables with Identical Questions but Different Response Scales
Likert variables are common in questionnaires, but varying response scales even for the same question complicate data integration and analysis. Missing values add another challenge in such circumstances. To address these problems, we propose a Bayesian multilevel model to harmonize and impute missing Likert responses by assuming a common latent variable underlying different scales. Specifically, we consider a scenario where the same question is asked with 3- and 5-category Likert scales, treating the unobserved response scale as missing. Our simulation study includes four patterns: (1) both scales measured, (2) only the 3-category scale measured, (3) only the 5-category scale measured, and (4) both missing. Testing on simulated data, our model outperforms complete-case analysis and default Multiple Imputation by Chained Equations (MICE) in R. We applied it to the iHeart DepCare study, estimating the effect of a patient activation strategy on depression treatment optimization among depressed cardiac patients using the Decisional Conflict Scale (DCS). Future work will extend the model to handle multiple correlated variables, further advancing its applicability in real-world research.
Drug Development - MS (HSC LL 204)
Chen Liang
Efficacy of Batoclimab in gMG: Correlation of IgG and AChR-Abs with Clinical Outcomes
Generalized myasthenia gravis (gMG) is an autoimmune disorder caused by autoantibodies targeting acetylcholine receptors (AChR), leading to muscle weakness. This project aimed to evaluate the efficacy of batoclimab, a human monoclonal antibody targeting the neonatal Fc receptor (FcRn), by examining correlations between reductions in immunoglobulin G (IgG) and AChR-antibody titers and clinical improvements. Data from a Phase III clinical trial were analyzed using rank-based linear regression and logistic regression models. Batoclimab-treated patients demonstrated significant positive correlations between reductions in AChR-Ab titers and improvements in Quantitative Myasthenia Gravis (QMG) scores (p=0.0002). Reductions in IgG also correlated positively with improvements in both QMG (p<0.001) and Myasthenia Gravis Activities of Daily Living (MG-ADL) scores (p=0.0105). Additionally, greater IgG reductions strongly predicted AChR-Ab decreases (r²=0.3802, p<0.0001). Body weight and sex were predictors of rapid IgG reduction. These findings suggest FcRn blockade by batoclimab effectively reduces pathogenic antibodies and significantly improves clinical outcomes in gMG patients.
Shaolei Ma
R You Ready? Developing R-Based Tools for TFL Review in Clinical Trials
Reviewing Tables, Figures, and Listings (TFL) in clinical trial dry runs is often time-consuming and prone to inconsistencies. This project aims to streamline TFL review by developing two R-based tools: an R package for metadata comparison and an R Shiny application for document integration. The R package facilitates automated comparison between TFL and the List of Analysis (LOA), identifying discrepancies in titles, footnotes, and abbreviations. It generates a summary table highlighting mismatches and an Excel output that consolidates metadata from both sources with automated checks. The R Shiny app enhances efficiency by providing an interactive interface where users can seamlessly access and cross-reference key documentation including LOA, TFL specifications, ADaM specifications, and the Statistical Analysis Plan (SAP) for a selected TFL. By reducing manual effort and improving accuracy, these tools accelerate the TFL review process, contributing to a more efficient clinical trial workflow.
Yuki Low
Statistical Designs for Rare Disease Trial Proposals to improve Clinical Trial Readiness for Regulatory Approval: An Update
We report on the pilot stage of a project which seeks 1) to identify design features of clinical trials for rare diseases (RDs) that have gained FDA approval, and 2) to provide this information to investigators to help improve the likelihood of approval of future submissions.
Data on trial design features and approval histories for 20 RD approvals from 10 FDA divisions from 2021-23 were collected from Drugs@FDA for New Molecular Entities and stored in a secure SQL database. 14 (70%) are STAs (Single Trial Approvals), and 6 (30%) are MTAs (Multiple Trial Approvals), based on >1 trial.
Approval characteristics (Time from Orphan Designation to Approval, primary analysis designs, and allocation ratios) vary widely. Given the demands of the manual pilot coding procedures, we are now developing an LLM model to expedite data collection for all target Orphan Drug approvals from 2014 and seeking to iteratively improve its quality.
This pilot project suggests the final project is feasible. This will provide detailed information on successful RD approvals, allow development of models of pathways to approval, and provide a much-needed resource for the rare disease community.
Zihan Wu
A Comprehensive Meta-Analysis of Phase 3 Randomized Controlled Trials for Alzheimer's Disease Treatments
Background: Phase 3 RCT meta-analysis assessed efficacy of Alzheimer's Disease pharmacological treatments and demographic impacts on outcomes.
Methods: Analyzed Phase 3 RCTs from ClinicalTrials.gov with ≥3 trials per treatment. Outcomes categorized into six domains. Effect sizes calculated as standardized mean differences using random-effects models.
Results: Small significant overall effect (d = 0.10, 95% CI [0.03, 0.16], I² = 69.7%). Domain effects varied: cognitive (d = 0.08), functional (d = 0.04, non-significant), behavioral (d = 0.08), amyloid imaging (d = 0.66). Donanemab showed consistent benefits. Percentage of White participants significantly moderated outcomes.
Conclusions: Current AD treatments show modest clinical efficacy despite stronger biomarker effects. Demographic factors influence outcomes, highlighting need for diverse trial populations and personalized approaches.
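The random-effects pooling named in the Methods above can be sketched as follows. This is a minimal illustration that assumes the DerSimonian-Laird estimator, which the abstract does not specify; the function name `random_effects_pool` and the input numbers are hypothetical and are not the study's data.

```python
import math

def random_effects_pool(effects, variances):
    """Pool per-trial effect sizes (e.g., standardized mean differences).

    Assumption: DerSimonian-Laird estimate of between-trial variance (tau^2).
    """
    k = len(effects)
    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add tau^2 to each trial's variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    # I^2: share of total variation attributed to heterogeneity, in percent.
    i2 = max(0.0, (q - (k - 1)) / q) * 100.0 if q > 0 else 0.0
    return {
        "pooled": pooled,
        "ci": (pooled - 1.96 * se, pooled + 1.96 * se),
        "tau2": tau2,
        "i2": i2,
    }

if __name__ == "__main__":
    # Hypothetical effect sizes and variances for three trials.
    print(random_effects_pool([0.05, 0.12, 0.30], [0.004, 0.006, 0.010]))
```

When trials disagree more than their sampling variances would predict, tau² grows, the weights even out across trials, and the confidence interval widens.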
Data Visualizations - MS (HSC LL 205)
Sitian Zhou
Understanding TB Symptoms And Household Transmission In Uganda
Tuberculosis (TB) remains a major global health concern, with an estimated 10.8 million cases and 1.25 million deaths worldwide in 2023. Uganda, a high-risk country for TB, continues to face significant challenges in controlling its spread.
This study analyzed 211 TB index cases (IC) and 549 household contacts (HHC) in Uganda to examine possible factors that influence contact patterns and TB symptom severity.
A Tableau dashboard was developed to visualize summary statistics and for exploratory data analysis. Ordinal logistic regression was applied to assess risk factors for symptom severity in index cases, and generalized estimating equations were used to account for household clustering in analyzing symptom presence in contacts. The findings reveal that smoking status is a key risk factor for severe TB symptoms, and socioeconomic status and prior TB history of household contacts significantly influenced TB transmission.
Understanding these associations can help inform targeted interventions and public health strategies to reduce TB transmission and improve disease management.
Huanyu Chen
Spatial Heterogeneity In COVID-19 Response: Leveraging Regression Models To Explore Socioeconomic And Demographic Disparities In New York City
This study examines socioeconomic and demographic disparities in COVID-19 decision-making across New York City's United Hospital Fund (UHF) neighborhoods. Using Multiple Linear Regression (MLR) and Geographically Weighted Regression (GWR), the study analyzed neighborhood-level census data and decision-making indices (agency, temporal discounting, and loss aversion scores). Results revealed significant spatial heterogeneity, with wealthier areas showing lower temporal discounting scores, indicating a preference for long-term benefits. Employment status is strongly correlated with loss aversion. Agency scores showed minimal spatial variation, suggesting that decision-making is more influenced by personal traits. The GWR model outperformed MLR in capturing spatial patterns, highlighting the need for spatially adaptive public health strategies. These findings emphasize the importance of tailoring interventions to local socioeconomic contexts to improve equity in crisis responses.
Jiatong Li
Cognitive Trajectories Across Education Levels and Gender: A Longitudinal Analysis of the Health and Retirement Study
This study examines cognitive trajectories across education levels and gender using Health and Retirement Study data (1998-2020). Using GEE and Mixed Effects Models, we analyzed cognitive scores from approximately 35000 participants across multiple cohorts. Results showed higher education levels were associated with better cognitive scores across all ages. Gender differences appeared in both baseline scores and decline rates, with education-gender interactions suggesting varying protective effects. Cognitive decline patterns differed by education level, with higher education associated with delayed decline onset. Cohort effects independent of age suggested generational differences in cognitive performance. The study employed a comprehensive analytic approach that accounted for the complex longitudinal structure of the data while controlling for relevant demographic factors. By examining interactions between key variables, we identified specific risk and protective patterns that vary across subgroups. These findings enhance our understanding of cognitive aging factors and help identify at-risk groups.
Zixuan Qiu
Construction Of Clinical Trial Data Integration Platform and Review Of CSM Methodology Framework
At present, the workflow is to export data from the EDC system and complete the analysis by writing code. Because the analysis component has relatively high professional requirements, IT alone cannot digitize the process. For clinical data statistics, we divided the work into processes, centralized monitoring, and statistical analysis of results. The aim of this project is to develop a clinical data integration platform that serves all stages of clinical trials to the greatest extent possible.
Centralized monitoring, as a method to assist on-site monitoring, focuses on problems and indicators that are difficult to find by on-site inspection. However, the current guidelines lack implementation recommendations, and the existing fixed-threshold monitoring method has shortcomings. To address these challenges, my work focuses on organizing a CSM methodology framework and developing an R function for CSM QTL detection tailored for Chinese clinical data centers, integrating QTL-based anomaly detection with Mahalanobis distance, mixed-effects models, and Bayesian models.
Data Analysis - MS (HSC LL 207)
Ziqi Liao
BLOOM: Robust Maximin Optimization for Data Fusion with Blockwise Missing Covariates
In this presentation, I will introduce BLOOM—a novel framework designed to tackle the challenges of integrating heterogeneous biomedical data plagued by blockwise missingness. I will discuss how the rapid growth of biomedical research and electronic health records has led to complex data fusion problems, where missing covariates and varied data collection protocols across institutions hinder traditional methods. BLOOM leverages a maximin effect model to ensure robustness and fairness, particularly by optimizing worst-case performance across diverse subpopulations and rare centers. I will explain the methodology behind constructing surrogate predictors using best linear projections, how auxiliary data is incorporated to adjust for missingness, and the theoretical guarantees of convergence. Overall, this speech will highlight how BLOOM addresses critical statistical and methodological challenges, paving the way for more reliable and equitable predictive modeling in biomedical research.
Kairui Wang
Metabolomic and Proteomic Analyses of Plasma in Gulf War Illness before and after Exercise
Gulf War Illness (GWI) is a chronic multi-symptom condition affecting veterans of the 1991 Gulf War. It is characterized by unexplained symptoms, including post-exertional malaise (PEM), systemic pain, fatigue, flu-like symptoms, and cognitive difficulties. This study examines differences in plasma levels of metabolites and proteins between patients with GWI and healthy controls before and after exercise, aiming to better understand the pathophysiology of PEM and to identify potential biomarkers for diagnosis and treatment. Plasma samples were collected from GWI patients and control participants at three time points: before, immediately after, and 24 hours after exercise, in cohorts from New Jersey and Wisconsin. Linear mixed-effects models paired with Bayesian inference were employed to examine the effects of GWI status, exercise, and their interaction on metabolomic and proteomic changes. Chemical enrichment analysis (ChemRICH) and Ingenuity Pathway Analysis (IPA) were applied to identify significantly altered chemical groups and biological pathways. Our findings provide new insights into the pathogenesis of GWI and may advance diagnosis and clinical interventions.
Miao Fu
Understanding the Mismatch Between Self-Reported and Actual HIV Status: Predictors of Unawareness Among HIV-Positive Individuals
Self-reported HIV status is crucial for public health, yet discrepancies with actual test results persist. This study examines factors associated with unawareness of HIV-positive status using survey data. Among HIV-positive individuals, 72% were unaware of their status. A logistic regression model assessed predictors of unawareness. Individuals in Douala had significantly higher odds of unawareness (OR ≈ 12, p = 0.049), while having an HIV-positive household head was linked to lower unawareness (OR ≈ 0.41, p = 0.011). Higher CD4 count was associated with increased unawareness (p = 0.003), possibly due to a lack of perceived risk. Blood testing in Douala reduced unawareness, emphasizing testing accessibility. Other demographic and socioeconomic factors were not significant predictors. Targeted awareness campaigns, expanded testing in high-risk regions, and status disclosure discussions within households may improve early diagnosis and HIV care engagement.
Allison Xia
An Interactive Model Performance Assessment for Advanced NSCLC Diagnosis
In healthcare machine learning, accurately classifying disease progression is critical for patient outcomes. This project presents an interactive R Shiny dashboard that evaluates a model classifying advanced vs. non-advanced non-small cell lung cancer (NSCLC) using electronic health record (EHR) data.
Traditional assessments rely on a fixed 0.5 threshold, which may not be optimal. Our dashboard enables dynamic threshold selection, allowing users to visualize changes in precision, sensitivity, and classification accuracy.
Key features include:
Overall Performance Metrics – Evaluating classification at different thresholds.
Stratified Performance Analysis – Identifying potential biases.
Model vs. Ground-Truth Diagnoses – Comparing predictions to manually abstracted data.
Survival Analysis – Examining patient survival under different thresholds.
By enabling real-time performance assessment, this dashboard enhances decision-making for healthcare analysts. Future improvements will focus on broader model applicability and enhanced visualization.
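The dynamic threshold selection described above can be illustrated with a short sketch. The project itself is an R Shiny dashboard; the Python below is only an assumption-laden illustration of the underlying calculation, and the function name `metrics_at_threshold` and the labels and probabilities are hypothetical.

```python
def metrics_at_threshold(y_true, y_prob, threshold):
    """Return precision, sensitivity, and accuracy when predicted
    probabilities at or above `threshold` are classified as positive (1)."""
    if not y_true or len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must be non-empty and equal length")
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        pred = 1 if prob >= threshold else 0
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    # Precision and sensitivity are undefined when their denominators are zero.
    precision = tp / (tp + fp) if (tp + fp) else None
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    accuracy = (tp + tn) / len(y_true)
    return {"precision": precision, "sensitivity": sensitivity, "accuracy": accuracy}

if __name__ == "__main__":
    # Hypothetical ground-truth labels (1 = advanced) and model probabilities.
    labels = [1, 1, 0, 0, 1, 0]
    probs = [0.9, 0.6, 0.55, 0.2, 0.4, 0.1]
    for t in (0.3, 0.5, 0.7):
        print(t, metrics_at_threshold(labels, probs, t))
```

Sweeping the threshold this way shows the trade-off a fixed 0.5 cutoff hides: lowering it tends to raise sensitivity at the cost of precision.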
Epidemiology - MS (HSC LL 210)
Justin Vargas
Examining The Relationship Between Home And School Social And Environmental Exposures And Childhood Lung Function Outcomes
Physical and social environmental exposures at home, such as crowded living quarters, and at school, such as traffic exposure, can impact the lung function of children. Understanding the relationship between these exposures and lung function outcomes may support interventions to improve social and environmental conditions and childhood lung health. The primary objective of this study was to create three explanatory MLR models linking social and environmental exposures to three lung function outcomes, one of which was forced vital capacity (FVC), across 4 cohorts of children living in 5 U.S. cities. Five sets of MLR models were developed through backward selection for all cohorts together and each cohort individually using BIC as a model selection criterion in R. There was a positive association between the neighborhood-level percentage of white residents near children’s homes and FVC for all cohorts combined and the CCCEH cohort. Results varied by cohort due to differences in cohort demographics. This study examines the complex relationships between air pollutant, traffic, and NS exposures and childhood lung function metrics to inform risk factors for lung function impairment.
Xue Zhang
Epidemiologic Analysis of COVID-19 and Gastrointestinal Outbreaks in Pima County, Arizona (2023)
In 2023, Pima County, Arizona, reported 92 infectious disease outbreaks, including 67 due to COVID-19 and 20 due to GI illnesses. This practicum analyzed outbreak and community COVID-19 cases using surveillance data to assess trends and identify risk factors for hospitalization and mortality. Descriptive analysis focused on 2023, while regression and survival models used 2017-2023 data. Hospitalization and mortality were significantly associated with age, gender, and race/ethnicity. Each year of age increased odds of hospitalization by 4% and death by 12%. API, Black, and Hispanic individuals had higher odds of severe outcomes compared to White individuals. Kaplan-Meier and Cox models showed lower survival in older adults and Hispanic groups. These findings highlight disparities in COVID-19 outcomes and underscore the need for targeted public health interventions and improved outbreak preparedness in high-risk settings.
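The per-year odds figures above compound multiplicatively across an age gap. The sketch below assumes that age entered a logistic regression as a linear term, which the abstract does not state; the function name `odds_ratio_over_years` is hypothetical.

```python
def odds_ratio_over_years(per_year_or, years):
    """Odds ratio implied for an age difference of `years`, assuming a
    constant per-year odds ratio (log-odds linear in age)."""
    return per_year_or ** years

if __name__ == "__main__":
    # Per-year odds ratios taken from the abstract: 1.04 (hospitalization)
    # and 1.12 (death). A 10-year gap is an arbitrary illustration.
    print(round(odds_ratio_over_years(1.04, 10), 2))  # about 1.48
    print(round(odds_ratio_over_years(1.12, 10), 2))  # about 3.11
```

Under that assumption, a 10-year age difference corresponds to roughly 1.5 times the odds of hospitalization and about 3 times the odds of death, not 40% and 120%.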
Longyi Zhao
The Association Between Weight Changes Over Time and Pancreatic Cancer Risk in a Pooled Analysis of 8 Prospective Cohort Studies in the Pooling Project of Prospective Studies of Diet and Cancer (DCPP)
Pancreatic cancer is the sixth leading cause of cancer-related deaths worldwide, with its incidence and mortality rates continuing to rise. This study investigates the association between weight changes over time and the risk of developing pancreatic cancer using individual level data from the DCPP, a pooled analysis of eight prospective cohort studies. A total of 2,954 pancreatic cancer cases were identified among 277,685 participants, who were categorized into 19 weight groups based on their weight fluctuation patterns. Study-specific multivariable hazard ratios and 95% confidence intervals were estimated using Cox proportional hazards models and pooled using a random-effects model. The analysis was adjusted for key confounders, including smoking status, history of diabetes, alcohol consumption, body mass index, physical activity, and daily caloric intake. Additionally, stratification was performed based on gender, year, and age at the time of the second questionnaire collection. The findings indicate that none of the weight cycling groups showed a statistically significant association with pancreatic cancer risk.
Break (2:00pm - 2:30pm)
Session 2 (2:30pm - 3:30pm)
Data Analysis - MPH (HSC 303)
Li Jiang
Leveraging Data Analytics for Biotech Operations and Market Strategy
During my internship at Kyinno Biotechnology, I developed and implemented data-driven solutions to enhance laboratory operations, stock management, and business development strategies. My work focused on automating data collection, utilizing web scraping techniques to gather insights from 100+ targeted websites and streamline information management in Innopedia, Kyinno’s pharmacological database. Additionally, I designed regression models to analyze cell-line sales trends, optimizing marketing strategies and business outreach.
Beyond internal analytics, I contributed to hypothesis testing and statistical modeling for mycoplasma contamination reports, ensuring data accuracy for preclinical research. By integrating data science techniques with biotechnology applications, this project underscores the impact of quantitative analysis in biotech operations and highlights the value of predictive modeling in driving strategic decision-making.
Huizhong Peng
Lifestyle Risk Factors in Middle-aged and Older Adults: Insights from the UK Biobank Study
With the rising burden of cognitive decline and chronic diseases in aging populations, understanding the impact of modifiable lifestyle factors is crucial for developing targeted prevention. The purpose of this study is to investigate the associations between lifestyle factors and major health outcomes, including mild cognitive impairment, dementia, heart disease, and stroke, using data from the UK Biobank. Using multivariate logistic regression, we found that sufficient physical activity and non-smoking were significantly associated with a lower risk of these health outcomes. My role in this study involved data cleaning, variable selection, applying multivariate logistic regression models to assess associations, and generating tabular representations of key findings to support result interpretation.
Yiru Xiang
Analyzing Gut Microbiome Dynamics in Intestinal Transplant Recipients: Investigating Its Role in Graft Rejection
Graft rejection remains a significant challenge in intestinal transplantation, with emerging evidence suggesting that gut microbiome composition may play a crucial role. This project analyzes 16S rRNA sequencing data from intestinal transplant recipients to investigate microbiome dynamics and their association with graft rejection. Key analyses include alpha and beta diversity calculations, differential abundance testing, and statistical modeling using generalized linear models (GLMs) and generalized linear mixed models (GLMMs). By integrating bioinformatics tools and R programming, we assess whether microbiome divergence from the donor or dysbiosis correlates with rejection risk. Findings from this study aim to enhance our understanding of microbiome-mediated immune responses and inform potential strategies for improving transplant outcomes.
Chengyuan Zhang
Data Processing And Data Analysis Regarding Medical Funds: Find Indexes That Correlate With Medical Funds' Performance in China in the Post-Covid Era
The medical industry has been playing an important role in the area of public health. Pharmaceutical companies, medical device companies, and health service providers are the basis for conducting intervention programs to improve public health. Medical funds connect the medical industry with the capital market, and the performance of medical funds serves as a barometer of the medical industry. In this project, I cleaned and processed datasets containing information on Chinese medical funds and the Chinese stock market. Then I used the processed data to conduct linear regression analysis in order to determine which indexes are correlated with the performance of medical funds since 2021. Medical funds that hold more large-cap stocks and high-dividend stocks have shown better performance in the post-COVID era. However, the results could be biased due to COVID's impact on the medical industry.
Topics in Mental Health - MPH (HSC 305)
Jiayi Li
Associations Between Maternal Allostatic Load and Placental DNA Methylation: Targeted Gene and Epigenome-Wide Analyses
Maternal stress during pregnancy may influence fetal development through epigenetic modifications. This study examines the link between maternal allostatic load (AL) and placental DNA methylation in 132 participants using targeted gene analysis and an epigenome-wide association study (EWAS). Placental DNA methylation was assessed via Illumina arrays, with preprocessing and statistical analyses identifying significant associations. A co-methylated region (CMR) in NR3C1 was linked to AL, and EWAS found hundreds of associated CMRs after multiple testing correction. Sex-stratified analyses confirmed the NR3C1 association in males but not females. Additional CMRs were identified in both sexes, with gene set enrichment highlighting pathways in macromolecule biosynthesis (ADAR, CAMTA1, DLL). These findings suggest maternal AL influences placental DNA methylation, with NR3C1 as a key gene. Sex differences indicate distinct epigenetic effects, emphasizing the need for further research on long-term developmental outcomes.
Ruiyang Wu
Prenatal Social Determinants of Health and Mental Health and Newborn Birth Outcomes and Feeding
This APEx project assessed the influence of prenatal maternal social determinants of health (SDH) and mental health on newborn birth and feeding outcomes in the Well Baby Nursery at NewYork-Presbyterian Morgan Stanley Children’s Hospital from 2018 to 2023. Through an extensive literature review and robust statistical analysis, the project examined the associations between prenatal exposures and newborn outcomes, with particular attention to demographic factors, including the impact of COVID-19. Key activities involved identifying research gaps during the literature review and performing comprehensive data analyses. The findings provide valuable insights into how maternal well-being shapes early-life health outcomes, thereby informing future healthcare strategies and interventions.
Jiayi Zhang
Anxiety Disorder Analysis based on the Generalized Anxiety Disorder-7 survey
Generalized Anxiety Disorder (GAD) is a prevalent mental health condition impacting well-being. The Generalized Anxiety Disorder-7 (GAD-7) survey is a validated tool for assessing anxiety severity. This study processes, analyzes, and models GAD-7 survey data to identify anxiety patterns, demographic associations, and risk factors using statistical and machine learning techniques.
Data were cleaned through outlier detection, missing data handling, and categorical transformation. Descriptive analysis examined demographics and symptom prevalence, while correlation analysis identified associations among anxiety symptoms. A multiple linear regression model, enhanced with Principal Component Analysis (PCA), predicted GAD-7 risk scores. KMeans clustering revealed distinct anxiety-related demographic groups.
Results indicate higher anxiety levels in females, low-income individuals, and high-stress occupations. Identifying risk factors aids in targeted mental health interventions. Findings support public health strategies for anxiety screening, prevention, and treatment.
Xuezheng Wang
Demographic Disparities in Youth Psychiatric Inpatient Discharges: Evidence from New York State SPARCS Data (2019–2022)
Introduction: This study aims to examine how demographic factors influence youth hospital inpatient discharge outcomes, including length of stay, total charges, and diagnosis types. Two diagnostic categories are of interest: Substance-Related Disorders and Suicide Attempts.
Methods: This study utilizes data from the SPARCS dataset, focusing on youth aged 0 to 17 who were discharged from hospitals in 10 New York State counties between 2019 and 2022. Statistical analyses were conducted using t-tests, chi-squared tests, and ANOVA to assess the relationships between demographic factors and hospital inpatient discharge outcomes.
Results: Males are significantly more likely to have a substance-related disorder compared to females, and females are significantly more likely to have suicide-related diagnoses compared to males. Additionally, there are significant differences in total charges, severity of illness, and risk of mortality among ethnic groups.
Epidemiology - MPH (HSC LL 103)
Hongyi Chen
Identification of Preeclampsia Subtypes Via Machine Learning and Associations With Risk of Perinatal and Postpartum Outcomes
There has been increasing recognition of pre-eclampsia (PE) heterogeneity. We aimed to identify PE subtypes via UMAP and DBSCAN and assess associations with risk of perinatal and postpartum outcomes via modified Poisson regression and Cox regression. In a cohort study of 14,132 racially and ethnically diverse individuals with PE in Northern California in 2011-2021, we split the sample into one training set (80% of 2011-2020) and two validation sets (1: 20% of 2011-2020; 2: 2021). In the training set, 4 clusters (C) were identified: C1 (68.3%), C2 (20.4%), C3 (8.7%), and C4 (2.6%). C1 and C2 had nearly no GDM, whereas all individuals in C3 and C4 had GDM. Compared to no GH in C1 and C3, all individuals in C2 and C4 had GH. C3 and C4 were more likely to have pre-existing conditions (prediabetes, PCOS, dyslipidemia, and obesity), and to be older and multiparous. Compared to C1 as the mildest subtype, C2 had higher risk of SGA; C3 had higher risk of C-section and LGA; and C2, C3, and C4 had higher risk of preterm birth, NICU, and postpartum hypertension. Findings were similar in the validation sets. Subtyping PE may inform personalized risk assessment and management to improve associated outcomes.
Katherine Gong
Association Between Pre-Pregnancy Magnesium Intake and Preterm Birth: Evidence from the NuMoM2b Prospective Cohort Study
Magnesium (Mg) is an intracellular molecule that is important for fetal growth and development. The study included 10,038 nulliparous women enrolled between October 1, 2010, and September 30, 2013. The final analytic sample included 7,105 participants. The exposure of interest is total magnesium intake. Preterm birth, defined as delivery <37 weeks of gestation, was prospectively ascertained. Logistic regression models were used to evaluate the association between Mg intake and preterm birth. After adjusting for potential confounders, participants in the highest quartile (Q4) of total Mg intake had 27% lower odds of preterm birth compared to those in the lowest quartile (Q1) [odds ratio (OR)=0.73, 95% confidence interval (CI): 0.51-1.04], and those in the highest quartile of dietary magnesium intake had 29% lower odds compared to the lowest quartile (OR=0.71, 95% CI: 0.50-0.999). Among participants with higher calcium intake, those in the highest quartile of magnesium intake had 49% lower odds of preterm birth compared to those in the lowest quartile (OR=0.51, 95% CI: 0.25-1.02).
Sining Leng
Evaluating the Impact of Elder Justice Services Through Data Analysis and Public Health Strategies
During APEx at the Weinberg Center for Elder Justice, I utilized biostatistical methods and data analysis to assess the effectiveness of the elder justice shelter model. My primary focus was conducting quantitative analysis for grant reports, summarizing service trends, and evaluating program impact. Additionally, I attended conferences and workshops and integrated emerging insights into the Center’s initiatives.
A key component of my work involved drafting grant proposals and ensuring compliance with funding requirements. I collaborated with a multidisciplinary team of social workers, attorneys, and public health professionals to develop strategies for improving elder justice interventions.
In this presentation, I will share key findings from my work, the impact of elder justice programs, and the value of interprofessional collaboration in public health practice.
Machine Learning - MS (HSC LL 106)
Tianhui Huang
Novel Pipelines to Extract Differences in Proteome Dynamics Based on Health Status
Understanding proteome dynamics and co-regulatory patterns is essential for uncovering the molecular basis of health and disease. However, extracting meaningful insights from high-throughput proteomic data remains challenging. This study explores three novel approaches to distinguish proteome dynamics based on health status using longitudinal proteomics data from the MiSBIE study, comparing six healthy controls and six subjects with severe mitochondrial disease (MitoD). First, we designed a permutation test to detect global differences in proteomic co-regulation. Second, we applied nonlinear embedding and clustering to capture complex relationships between health and proteome elasticity. Third, we developed a machine learning algorithm to extract low-dimensional representations of proteome dynamics and cluster subjects without prior knowledge of their health status. All three methods revealed clear differences between MitoD individuals and healthy controls, with high-dimensional proteomic data outperforming a low-dimensional subset of diagnostic proteins (GDF15 and IL6). These approaches take a crucial step toward interpretable and accurate health prediction based on proteome dynamics.
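To illustrate the general idea behind a permutation test with six subjects per group, the sketch below enumerates every way of relabelling 12 subjects into two groups of six (924 splits) and asks how often the relabelled difference is at least as large as the observed one. This is not the study's test: it assumes a single hypothetical per-subject summary score and a difference-in-means statistic, and all values are made up.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_pvalue(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    total = 0
    at_least_as_extreme = 0
    # Enumerate every relabelling that keeps the group sizes fixed.
    for indices in combinations(range(len(pooled)), n_a):
        chosen = set(indices)
        relabelled_a = [pooled[i] for i in chosen]
        relabelled_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(mean(relabelled_a) - mean(relabelled_b)) >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / total


# Hypothetical summary scores for six controls and six cases (illustration only).
controls = [1, 2, 3, 4, 5, 6]
cases = [11, 12, 13, 14, 15, 16]
print(exact_permutation_pvalue(controls, cases))
```

With groups this small, the smallest attainable two-sided p-value is 2/924, which bounds how strong the evidence from any such test can be.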
Zhi Heng Shi
Anti-Sepsis for Infants and Moms and Its Application
This project presents the initial implementation of "Anti-Sepsis for Infants and Moms" (AIM), an AI-driven digital tool aimed at preventing sepsis in mothers and infants. It builds on the success of the original Anti-Sepsis platform, which won the grand prize at Columbia University’s 2024 Biomedical Engineering Society Data Science Hackathon. AIM can integrate wearable device data, daily self-assessments, and machine learning algorithms to generate personalized risk scores of sepsis. Principal component analysis identified blood sugar level and blood pressure (systolic and diastolic) as the most critical predictors, followed by maternal age, body temperature, and heart rate. AIM's predictive accuracy reaches 95.63% using an XGBoost classifier. AIM provides real-time monitoring and automated alerts. Although AIM is still in its early stages, its strong machine learning performance and ability to integrate into existing healthcare systems underscore its potential. Future work will focus on refining its algorithms for improved sensitivity and exploring additional applications in maternal and infant health.
Yuntian Xu
Machine Learning Module in EasyR for Clinical Data Analysis
This practicum project focuses on developing and implementing a machine learning module in EasyR, a statistical software for biomedical data analysis. The module integrates multiple algorithms—including logistic regression, k-nearest neighbors (KNN), random forest, support vector machines (SVM), XGBoost, and Bayesian models—to predict disease outcomes such as hypertension and cardiovascular risk. The project evaluates model performance using accuracy, AUC-ROC, sensitivity, and specificity, ensuring robust validation. Additionally, a user-friendly interface will be designed to enhance accessibility for researchers with minimal coding expertise. This project aims to improve the application of machine learning in biomedical research by providing an intuitive analytical tool.
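Three of the evaluation metrics named above (accuracy, sensitivity, and specificity) can be computed directly from a binary confusion matrix. The following Python sketch shows the definitions on made-up labels; it is not EasyR code, and AUC-ROC is omitted because it needs predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of true cases detected
        "specificity": tn / (tn + fp),  # share of non-cases correctly ruled out
    }


# Hypothetical labels and predictions (illustration only).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
print(classification_metrics(y_true, y_pred))
```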
Qiran Chen
Supervised Machine Learning Analysis for Prostate Cancer MRI Imaging
Accurate classification of prostate cancer and its severity is critical for treatment planning and patient prognosis. This study explores machine learning approaches for predicting Gleason grade groups, a measurement of cancer stages, using the Pi-CAI dataset (1500 cases), which includes MRI images and expert annotations. The preprocessing pipeline involved lesion separation, feature extraction using PyRadiomics, and data merging with clinical information. Severe data imbalance was addressed by reassigning Gleason levels into binary severity categories (severe vs. non-severe).
Several machine learning models, including SVM, random forest, and XGBoost, were trained, with a highest accuracy of 0.70.
Key limitations included suboptimal feature extraction from MRI channels, data imbalance affecting predictivity, and vague decision boundaries in classification models. Future improvements will explore direct image-based deep learning approaches, refined feature selection, and additional training data to enhance staging accuracy and robustness in prostate cancer prediction.
Statistical Genetics - MS (HSC LL 107)
Shihang Zeng
Exploring Quantile-Specific Genetic Associations with Alzheimer’s Disease Using Quantile TWAS
Traditional QTL and TWAS analyses primarily focus on identifying genetic variants that influence the mean levels of gene expression or trait phenotypes. While these approaches have led to significant discoveries, they fail to capture genetic effects that vary across the full distribution of gene expression. Some variants may have disproportionate impacts on individuals with extreme expression levels, which cannot be detected using mean-based methods.
To address this limitation, we introduce a quantile-based QTL/TWAS framework that leverages quantile regression to model genetic associations across different points of the expression distribution. By applying this approach to Alzheimer's disease (AD), we identify genetic variants that exert differential effects at specific expression levels, revealing previously unrecognized genetic mechanisms underlying AD.
Our findings highlight the importance of considering the full spectrum of gene expression variation when studying complex diseases. This framework provides a more comprehensive understanding of genetic contributions to AD, offering new insights into disease risk, progression, and potential therapeutic targets.
Aiying Huang
MS-COEX: Multi-sample Cell-type-specific Coexpression Detection in Single-cell RNA-seq Data
Gene co-expression analysis in multi-sample single-cell RNA sequencing (scRNA-seq) data is crucial for understanding cellular interactions, yet existing methods often fail to account for sample heterogeneity, leading to inflated false positives. MSCOEX, a newly developed method, addresses this issue by mitigating multi-sample variability. The method was rigorously evaluated using null and spike-in simulation as well as real datasets from non-small-cell lung cancer (NSCLC), colon cancer, and COVID-19. Compared to a state-of-the-art (SOTA) method, MSCOEX identifies more functionally distinct and biologically meaningful GO terms, despite detecting fewer significant gene pairs. Clustering analysis further reveals that MSCOEX produces functionally coherent gene modules within cell types. These results highlight its potential for improving interpretability and controlling false positives in multi-sample scRNA-seq studies.
Xinyi Shang
Identifying Cell-Type-Specific Spatially Variable Genes with ctSVG
Spatially variable genes (SVGs) provide insights into functional and molecular differences among cells across distinct regions of a tissue. Existing methods for identifying SVGs primarily focus on sample-wide variation and often overlap with known cell-type marker genes. In this study, we introduce ctSVG, a computational method specifically designed for Visium HD spatial transcriptomics data. ctSVG integrates cell segmentation, cell clustering, and statistical modeling to enhance spatial gene expression analysis. By applying ctSVG to multiple Visium HD datasets across different tissues and species, we demonstrate its ability to identify new genes associated with spatially heterogeneous functions. Furthermore, its application to mouse embryo and human colon cancer datasets highlights its potential for uncovering biologically meaningful patterns of cellular spatial organization within tissues.
Joshua Carpenter
Multi-Omics Insights into Maternal Obesity and Fetal Metabolism
Maternal obesity and gestational diabetes mellitus (GDM) are linked to metabolic and transcriptomic alterations that may impact fetal development. In this pilot study, we performed RNA sequencing on placental tissue and metabolomic profiling of cord blood to investigate molecular differences among mothers with obesity, GDM, and those who had recently undergone bariatric surgery. Preliminary pathway enrichment analysis suggests potential disruptions in lipid metabolism, glucose regulation, and inflammatory pathways in the GDM and obesity groups. Samples from post-bariatric surgery mothers were sequenced separately, making it difficult to distinguish biological differences from batch effects. These findings lay the groundwork for future studies with larger cohorts to better understand the transmission of obesity from mother to child through the placenta.
Bayesian Statistics - MS (HSC LL 108A)
Iris Yang
Bayesian Hierarchical Modeling of U.S. Death Rates During 1973-2022
This practicum project utilizes Bayesian hierarchical modeling on the NCHS Multiple Cause of Death dataset to analyze changes in death rates across the United States during 1973 to 2022. To conduct this project, Bayesian hierarchical modelling and Markov Chain Monte Carlo (MCMC) methods, implemented in the NIMBLE package in R, are used to generate robust parameter estimates after adjusting for state, race, sex, and age effects. Additionally, spatial modeling techniques will identify geographic clustering of death rates, accounting for the inherent correlation between neighboring states. These methods aim to reveal regional disparities, environmental influences, and localized risk factors affecting death rates. The results of this analysis will provide a nuanced understanding of how death rates have evolved over the past five decades, highlighting the complex relationship between demographic and environmental factors shaping death rate trends in the United States.
Alexander McCreight
SuSiE.ash: A Sum of Single Effects Regression and Adaptive Shrinkage Model as a Unified Framework for Fine-mapping and TWAS
Many Bayesian variable selection methods, such as Sum of Single Effects Regression (SuSiE), are widely used in genetic association studies to identify causal variants influencing molecular phenotypes like gene expression. These methods oftentimes assume a sparse genetic architecture, implying that only a few variants have large effects on the outcome while the remaining have none. This assumption can lead to a high false discovery rate (FDR) in settings where multiple variants contribute to the phenotype, including oligogenic and infinitesimal effects. We introduce “SuSiE.ash”, a novel method that integrates adaptive shrinkage priors into the SuSiE framework. SuSiE.ash offers a flexible alternative to SuSiE by modeling both strong sparse-effect variants and a variety of oligogenic and infinitesimal effects. We conducted extensive simulations reflecting various expression quantitative trait loci (eQTL) settings and found that SuSiE.ash provides a more reliable tool for identifying causal variants in complex genetic architectures by reducing FDR by an absolute difference of 37.7% compared to the standard SuSiE model, only at the cost of 18.2% of power.
Kindle Zhang
Comparing Longitudinal Data Analysis Methods with Simulation Data
In this study, I employ Monte Carlo simulations to generate longitudinal datasets for comparing various statistical methods used in analyzing such data. The generated datasets contain both independent and dependent structures, allowing for a comprehensive evaluation of methodological differences. The models compared include Generalized Linear Mixed Models (GLMM), Generalized Linear Models (GLM), and Generalized Estimating Equations (GEE) for marginal models, as well as traditional approaches like Ordinary Least Squares (OLS) and Weighted Least Squares (WLS). To bridge theoretical understanding with practical applications, I subsequently apply these methods to real-world data, assessing their effectiveness in different scenarios. Finally, to facilitate usability and visualization, I implement these methods in an R package, providing researchers with an accessible tool for longitudinal data analysis.
Causal Inference - MS (HSC LL 109A)
Sixuan Chen
Improving Indirect Treatment Comparisons Using Matching-Adjusted Indirect Comparisons (MAIC)
Indirect treatment comparisons (ITCs) are essential in comparative effectiveness research, particularly when direct head-to-head trials are unavailable. However, traditional ITC methods using aggregate data often suffer from bias due to population differences and model assumptions. This project explores the use of Matching-Adjusted Indirect Comparisons (MAIC) to mitigate these biases by leveraging individual patient data (IPD). We present a mathematical framework for MAIC, including the derivation of weighted treatment effect estimators and the asymptotic variance formula. A simulation study will evaluate MAIC’s performance relative to traditional methods, assessing its effectiveness in bias reduction and variance estimation. Additionally, if feasible, we will apply MAIC to real-world datasets to validate its practical utility. Our findings aim to contribute to more accurate and reliable indirect comparisons in evidence-based medicine.
Ziqiu Liu
The Role Of Mitochondrial Energy Expenditure In Brain And Cognitive Aging
Aging has been a major focus of neuroscience research, with many studies investigating the mechanisms behind the decline in brain and cognitive function. One of the indexes associated with aging is the level of energy expenditure, in which mitochondria play an important role. Growth differentiation factor 15 (GDF15), a circulating protein that is elevated with age, is an established biomarker of mitochondrial energy expenditure. Our sample included 352 participants aged 20-80 with GDF15 and MRI data. Using correlation and regression analysis, we assessed the associations of GDF15 with variables including age, brain structure (mean cortical thickness), brain function (backbone strength, backbone dispersion, and cycle strength), and cognitive ability in four domains (fluid reasoning, episodic memory, vocabulary, and processing speed). To further examine the effect of GDF15 in the pathways of brain and cognitive aging, we conducted mediation analysis with bootstrapping. While no direct evidence of GDF15 mediation was found, the results suggest a potential effect of GDF15 on processing speed partially mediated by backbone dispersion, which requires further investigation.
Yumeng Qi
Longitudinal Relationships Between REM Sleep Behavior Disorder, Dopaminergic Dysfunction, and Cognitive Decline in Parkinson’s Disease
Background: RBD is a common non-motor symptom of PD linked to cognitive impairment. While both RBD and dopaminergic dysfunction affect cognition, their combined role remains unclear.
Objective: This study examines cross-sectional and longitudinal associations between RBD, DaT loss, and cognitive performance in PD.
Methods: Data from 1,139 PPMI participants were analyzed. Baseline associations were tested via linear regression, while CLPA assessed bidirectional relationships over four time points.
Results: Cross-sectionally, RBD and DaT loss were associated with lower cognitive performance, but their interaction was non-significant. Longitudinally, RBD and DaT loss had a bidirectional relationship in the putamen, where only DaT predicted cognition. In the caudate, RBD had a unidirectional effect on DaT, and both contributed to cognitive decline.
Conclusions: RBD-related cognitive impairment involves both dopaminergic and non-dopaminergic mechanisms, with region-specific effects. The interplay between RBD, DaT loss, and cognitive decline underscores the need for a multifaceted approach to understanding and mitigating PD-related cognitive impairment.
Yixiao Sun
Enhancing Comparative Effectiveness Research with Matching-Adjusted Indirect Comparison: A Robust Approach for Treatment Comparisons
Comparative effectiveness research (CER) in healthcare faces challenges when direct head-to-head randomized trials are unavailable. This project utilizes Matching-Adjusted Indirect Comparison (MAIC) to enhance treatment comparisons across different patient populations by integrating individual patient data (IPD) from one trial with aggregate data from another. By applying propensity score weighting, MAIC aligns baseline characteristics, reducing cross-trial heterogeneity and bias. This study evaluates MAIC’s effectiveness in adjusting for clinically relevant differences in treatment populations using curated breast and ovarian cancer datasets. Statistical techniques such as weighted t-tests and chi-square tests will assess treatment effects, while sensitivity analyses will explore variations in outcome definitions and dosages. The findings aim to improve the reliability of treatment comparisons, offering robust evidence for healthcare decision-making and policy applications.
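One common formulation of MAIC estimates weights of the form w_i = exp(alpha * (x_i - target mean)), choosing alpha so that the weighted mean of a baseline covariate in the IPD trial equals the mean published for the aggregate-data trial. The sketch below assumes that formulation with a single covariate and made-up ages; it does not reproduce the project's datasets or code, and the function names are hypothetical.

```python
import math


def maic_weights(ipd_values, target_mean, tol=1e-10, max_iter=100):
    """Method-of-moments MAIC weights for one covariate via Newton's method."""
    centered = [x - target_mean for x in ipd_values]
    alpha = 0.0
    for _ in range(max_iter):
        exp_terms = [math.exp(alpha * c) for c in centered]
        # First and second derivatives of the convex objective sum(exp(alpha * c)).
        gradient = sum(c * e for c, e in zip(centered, exp_terms))
        hessian = sum(c * c * e for c, e in zip(centered, exp_terms))
        step = gradient / hessian
        alpha -= step
        if abs(step) < tol:
            break
    return [math.exp(alpha * c) for c in centered]


def effective_sample_size(weights):
    """Approximate number of independent patients the weighted sample represents."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


# Hypothetical IPD ages and a hypothetical aggregate-trial mean age of 62.
ipd_ages = [50, 55, 60, 65, 70]
weights = maic_weights(ipd_ages, target_mean=62)
weighted_mean = sum(w * x for w, x in zip(weights, ipd_ages)) / sum(weights)
print(round(weighted_mean, 6), round(effective_sample_size(weights), 2))
```

The target mean must lie inside the range of the IPD covariate for a solution to exist, and the effective sample size shrinks as the two populations differ more.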
Machine Learning - MS (HSC LL 202)
Ru Jin Lim
HRSA Health Center Workforce Wellbeing Survey Analysis
The Community Health Center (CHC) workforce is crucial in delivering care to underserved populations, yet workforce wellbeing and retention remain significant challenges. This study analyzes HRSA’s Health Center Workforce Wellbeing Survey to identify factors associated with higher wellbeing among older adults in the CHC workforce. Using three modeling strategies (a proportional odds model, CART, and random forest models), predictive characteristics of wellbeing were assessed, with model performance evaluated via the log-loss metric. The random forest model demonstrated the highest predictive accuracy, highlighting key factors influencing four distinct wellbeing outcomes. Analysis revealed that older workers exhibited higher wellbeing than younger counterparts. Further examination of survey questions on workplace culture, leadership, organizational support, and other key factors identified actionable areas for improvement. These findings provide insights for CHCs to implement targeted interventions, reduce provider burnout, and improve workforce retention, strengthening long-term workforce sustainability.
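The log-loss metric mentioned above penalizes a model according to the probability it assigned to the category that actually occurred; lower is better. A minimal Python sketch of the standard multiclass definition, using made-up predictions for a three-level outcome (the clipping constant `eps` is an implementation choice, not something stated in the abstract):

```python
import math


def log_loss(true_labels, predicted_probs, eps=1e-15):
    """Mean negative log-probability assigned to the observed class."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        # Clip to avoid log(0) when a model assigns zero probability.
        p = min(max(probs[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(true_labels)


# Hypothetical outcomes coded 0, 1, 2 and two hypothetical models.
observed = [0, 2, 1, 2]
confident_model = [[0.8, 0.1, 0.1], [0.1, 0.2, 0.7], [0.2, 0.6, 0.2], [0.1, 0.1, 0.8]]
uniform_model = [[1 / 3, 1 / 3, 1 / 3]] * 4

print(log_loss(observed, confident_model))  # lower: more probability on the truth
print(log_loss(observed, uniform_model))    # equals log(3) for a uniform guess
```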
Jiarui Yu
Actuarial Mortality Modeling: Enhancing Accuracy with Optimized Rates
The project's goal is to revolutionize how actuaries assess mortality rates and compute net present values (NPV) for actuarial products, shifting from traditional, intuition-based variable selection methods to an objective, data-driven approach. This research implements a recursive decision tree algorithm integrated with weighted mortality calculations to predict more accurately the financial liabilities associated with various demographic groups.
Wenxin Tian
Differential Enrichment Analysis of Antibody Responses Among Monkeypox Survivors Using Phage Display
Phage display technology enables the high-throughput profiling of antibody responses across different immune backgrounds. In this study, we used a library of 96,175 oligonucleotides from the poxvirus proteome to screen samples from 42 individuals across six groups: smallpox vaccination, smallpox survivors, monkeypox survivors, Jynneos vaccination, healthy controls, and Ebola-MVA boosters. The screen generated read count data that characterize the immunogenicity of each oligonucleotide. Our primary objective was to identify differentially enriched antibodies, particularly in monkeypox survivors, to characterize their distinct immune signatures. We utilized dimension reduction and machine learning techniques to identify significant oligonucleotides. Key antibodies exhibiting significant enrichment in monkeypox survivors were identified, providing insights into the immune response specificity following infection. These findings enhance our understanding of antibody dynamics across vaccination and infection histories, with implications for vaccine development and immunodiagnostics.
Yuxuan Du
Trajectory Analysis For Multiple Sclerosis Patients
Our study explores the longitudinal trajectories of digital cognitive biomarkers in individuals with multiple sclerosis using multiple clustering approaches, including K-means, LCMM, and MULTLCMM. We aimed to identify distinct cognitive trajectories over time by leveraging different modeling techniques. For the K-means approach, multiple features were constructed and used to define individual trajectories, while for LCMM-based methods, we compared various link functions—including linear, beta cumulative distribution function, and quadratic I-splines—selecting the optimal model based on the Bayesian Information Criterion (BIC). The number of clusters was determined using posterior probabilities for LCMM models and silhouette scores for K-means. Among the methods tested, MULTLCMM did not yield optimal clustering results. Finally, we examined the relationship between identified clusters and key biomarkers, including gray matter fraction, white matter fraction, and PDDS scores.
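Selecting among candidate models by BIC, as described above, comes down to computing BIC = k ln(n) - 2 log L for each fit and keeping the smallest value. The Python sketch below shows that comparison; the log-likelihoods, parameter counts, and sample size are invented for illustration and are not results from the study.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: smaller values indicate a preferred model."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


# Hypothetical fits for three link functions: (log-likelihood, number of parameters).
candidate_fits = {
    "linear": (-1250.0, 8),
    "beta CDF": (-1232.0, 10),
    "quadratic I-splines": (-1228.0, 13),
}
n_observations = 400  # hypothetical

scores = {name: bic(ll, k, n_observations) for name, (ll, k) in candidate_fits.items()}
best = min(scores, key=scores.get)
print(scores)
print("lowest BIC:", best)
```

Because the penalty grows with the number of parameters, a more flexible link function is preferred only when its gain in log-likelihood outweighs that penalty.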
Cancer Research - MS (HSC LL 204)
Lu Qiu
Global, Regional, and National Burden of Male Breast Cancer, 1990–2021, with Projections to 2050: A Systematic Analysis of the Global Burden of Disease Study 2021
Male breast cancer is a rare cancer that begins as an abnormal growth of cells in the breast tissue of men. Here, we present the most up-to-date global, regional, and national estimates for prevalence and years lived with disability (YLDs) due to male breast cancer by age and location from the Global Burden of Diseases, Injuries, and Risk Factors Study (GBD) 2021, as well as forecasted prevalence to 2050. Male breast cancer prevalence and YLDs from 1990 to 2021 were estimated by drawing on population-based data from 204 countries. Nested Bayesian meta-regression models were used to estimate prevalence and YLDs due to male breast cancer by age and location. Prevalence was forecast to 2050 with a mixed-effects model. These findings provide insights into the changing burden of male breast cancer worldwide.
Yujia Zhang
Survival Model: Ovarian Cancer Patient Survival Measured with Treatment Data
Survival models are particularly valuable in cancer and epidemiological research, as they enable researchers to assess the impact of different covariates and gene target sites on time-to-event outcomes. Ovarian cancer is the fifth leading cause of cancer-related deaths among women worldwide, with a high mortality rate due to its typically late diagnosis. Key therapeutic targets include BRCA1, BRCA2, HER2, and PARP, with ongoing research seeking to discover additional innovative target sites for patients with limited treatment options. My practicum project aims to investigate the association between multiple factors and survival outcomes using a model selection process, as well as explore correlations between specific genetic mutations and increased hazard.
Yang Zhao
Assessing The Efficacy Of Dendritic Cell-Based Vaccine Therapy
Dendritic cell (DC)-based immunotherapy has gained attention as a potential cancer treatment, yet its clinical efficacy needs to be validated. This study aims to assess the immunological impact of DC vaccination by comparing clinical blood samples collected from patients before and after treatment. The dataset includes 37 paired blood samples obtained from patients undergoing DC therapy, with key immune biomarkers such as NK cell activity, T cell subpopulations, immunization scores, and cytotoxicity markers recorded.
To evaluate changes induced by the vaccine, we applied descriptive statistical analysis, paired hypothesis testing, and generalized estimating equations (GEE) to examine pre- and post-treatment immune responses. Preliminary findings suggest variations in immune markers following vaccination, yet statistical analysis does not provide conclusive evidence supporting the vaccine’s efficacy based on the current sample size. Further studies with longitudinal data and larger cohorts are needed to better understand the long-term immunological effects of DC-based immunotherapy.
Mingzhi Chen
Differential Expression Analysis of Glioblastoma Cells
Glioblastoma multiforme (GBM) is a highly aggressive brain tumor characterized by complex genetic and metabolic alterations. In this study, I conduct a comprehensive differential expression analysis (DEA) to compare neoplastic GBM cells with their normal counterparts using single-cell RNA sequencing data (GSE84465). To address the zero-inflation issue, I evaluate and compare various statistical models, including hurdle models (MAST and ZIAQ) and the Zero-Inflated Negative Binomial model (DEsingle). Additionally, Wilcoxon test results serve as a reference, and multiple testing correction is performed using the False Discovery Rate (FDR) method. The findings highlight the superior performance of the ZIAQ model in ranking biologically relevant genes and pathways compared to other approaches.
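The abstract above does not say which FDR procedure was used; the Benjamini-Hochberg step-up adjustment is the most common choice and is assumed in this sketch. Each p-value is scaled by m / rank, and monotonicity is then enforced from the largest p-value downward. The p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, keeping adjusted values monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical per-gene p-values (illustration only).
raw_p = [0.01, 0.04, 0.03, 0.20]
adjusted_p = benjamini_hochberg(raw_p)
significant = [p <= 0.05 for p in adjusted_p]
print(adjusted_p, significant)
```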
Statistical Genetics - MS (HSC LL 205)
Lehan Zou
LamianOmni: Extending Multi-Sample Pseudotime Analysis to DNA Methylation Data
Pseudotime analysis of single-cell sequencing data is a powerful approach for investigating dynamic gene regulation along continuous biological processes. While many methods can infer pseudotemporal trajectories within a single sample, comparing these trajectories across multiple samples or conditions remains a significant challenge. Lamian addresses this limitation with a robust computational framework for differential multi-sample pseudotime analysis. By modeling temporal patterns as functional data and leveraging a Bayesian hierarchical model, Lamian identifies changes in biological processes linked to experimental conditions while accounting for batch effects. However, its evaluation has so far been limited to single-cell RNA-seq data.
In this study, we propose LamianOmni by applying Lamian to single-cell multiomics data, particularly focusing on DNA methylation (DNAm) profiles. We assess its feasibility and potential for studying epigenetic dynamics in a pseudotemporal framework using real and simulated multi-sample single-cell multiomics data.
Jiying Wang
Lineage Tracing of Thymic Epithelial Cells Using Single-Cell Multi-omics
Background:
Thymic epithelial cells (TECs) guide T cell development. Cortical TECs (cTECs) support early T cells, while medullary TECs (mTECs) remove self-reactive ones. Single-cell RNA sequencing (scRNA-seq) has revealed TEC subtypes, but their developmental pathways remain unclear. We combined scRNA-seq and single-cell whole genome sequencing (scWGS) to trace TEC lineage relationships.
Methods:
Pediatric thymus samples, including CD45-negative and EPCAM+ cells, were analyzed. scRNA-seq data were processed using Seurat v5, including normalization, batch correction, and clustering. scWGS identified somatic mutations to reconstruct TEC lineages.
Results:
Among 969 cells, we identified cTECs, mTECs, endothelial cells, pericytes, fibroblasts, and immune cells. Some epithelial cells showed low marker expression, suggesting progenitors. Phylogenetic analysis revealed hierarchical TEC differentiation with distinct mutation patterns.
Conclusion:
This study uncovers TEC development using lineage tracing via scRNA-seq and scWGS. Findings enhance our understanding of immune system development and TEC differentiation.
Yunshen Bai
Regulatory Networks and Gene Expression: A Comprehensive Simulation Study
This study uses computer simulations to explore how genetic differences, specifically single nucleotide polymorphisms (SNPs), affect the activity of genes in a network. We created simulated genetic data to analyze how these genetic variations influence gene expression both directly and through complex networks involving multiple genes. Our approach aims to build and analyze patterns of gene expression. This research helps in understanding how genetic variations contribute to different traits by affecting gene activity, providing insights useful for genetic research.
Jingyi Xu
Evaluating Control Group Composition In Case-Control Studies: The Role Of Family History In Genetic Risk Assessment
Effective control group selection is critical in case-control studies, yet the impact of excluding individuals with a family history (FH) remains unclear. In this study, we evaluate how different control selection strategies influence the performance of genome-wide association studies (GWAS). Using the UK Biobank, we first construct a family pool dataset to simulate various control compositions. We then perform GWAS with logistic regression to compare three control selection methods: FH-excluded, mixed, and FH-only controls. We assess each approach by computing type I error rates and statistical power to determine their impact on effect size estimation and bias. Our results provide empirical evidence on optimizing control selection to enhance the validity of genetic association studies and reduce false discoveries in case-control designs.
Causal Inference - MS (HSC LL 207)
Yifei Liu
Exploring the Causal Pathway from Blood Pressure and Stroke to Alzheimer’s Disease: A Mendelian Randomization Study
Cerebrovascular health has received growing attention as a potential contributor to Alzheimer’s disease (AD) risk. While many studies have highlighted an overlap between AD and cerebrovascular diseases or vascular risk factors, evidence for a causal relationship remains limited. In this study, we performed a two-sample Mendelian randomization analysis using genome-wide association summary statistics from eight datasets, including two blood pressure traits (n = 757,601), four stroke types (n ranging from 456,348 to 1,296,908), and AD (n = 487,511), from the UK Biobank. Genetically predicted blood pressure was significantly negatively associated with AD (diastolic blood pressure: OR, 0.89 [95% CI, 0.89-0.90]; systolic blood pressure: OR, 0.93 [95% CI, 0.93-0.94]). In contrast, no evidence supported a causal effect of any stroke type on AD. These results suggest that lower blood pressure may causally contribute to AD pathogenesis, but the observed associations between stroke and AD in epidemiological studies may reflect shared risk factors or selection bias rather than a true causal link.
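The abstract above does not name its estimator; a widely used one in two-sample Mendelian randomization is the inverse-variance weighted (IVW) estimate, which combines per-variant SNP-exposure and SNP-outcome effects into a single slope. The Python sketch below assumes the fixed-effect IVW form and uses invented summary statistics, not the study's data.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted MR estimate and its standard error."""
    numerator = sum(
        bx * by / (se * se)
        for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    denominator = sum(
        bx * bx / (se * se) for bx, se in zip(beta_exposure, se_outcome)
    )
    return numerator / denominator, math.sqrt(1.0 / denominator)


# Hypothetical per-variant summary statistics (illustration only).
snp_exposure_effects = [0.10, 0.20, 0.30]
snp_outcome_effects = [-0.012, -0.018, -0.033]  # log-odds scale
snp_outcome_ses = [0.010, 0.010, 0.012]

estimate, std_error = ivw_estimate(
    snp_exposure_effects, snp_outcome_effects, snp_outcome_ses
)
odds_ratio = math.exp(estimate)
ci_low = math.exp(estimate - 1.96 * std_error)
ci_high = math.exp(estimate + 1.96 * std_error)
print(f"OR = {odds_ratio:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```

An odds ratio below 1 from such an estimate means that higher genetically predicted exposure is associated with lower odds of the outcome.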
Ixtaccihuatl Obregon
Adaptation Of An Omnibus Test To Detect Subgroup Treatment Effects For Observational Studies
Observational studies may be used to identify subgroup-specific treatment effects for personalized medicine and optimizing patient outcomes. Observational studies differ from randomized clinical trials (RCTs) since they look at the “natural” state of outcomes and are subject to confounders and bias. This study aims to determine if subgroup-specific treatment effects can be detected and whether patient characteristics predict the most effective treatment in observational data. The omnibus test from Sun et al. (2022), designed for RCTs, was adapted and applied to observational data. A secondary aim will investigate patient subgroups that may exhibit differential treatment responses.
The original omnibus test was applied to simulated observational data to quantify bias. Propensity scoring and inverse probability weighting derived from the simulated data were then applied to the omnibus test to assess bias reduction. Bootstrapping was used to resample and generate robust statistical inference. This research aims to enhance the understanding of subgroup-specific treatment responses in real-world scenarios and to improve statistical methodologies for observational studies.
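To make the weighting step concrete, here is a minimal sketch of a normalized inverse-probability-weighted estimate of an average treatment effect. It is a generic textbook estimator, not the adapted omnibus test itself. The function name and all numbers are hypothetical, and the propensity scores are taken as given rather than estimated.

```python
def ipw_ate(treated, outcome, propensity):
    """Normalized (Hajek) IPW estimate of the average treatment effect.

    treated    -- 1 if the unit received treatment, else 0
    outcome    -- observed outcome for each unit
    propensity -- probability of treatment given covariates, strictly in (0, 1)
    """
    w1 = [t / e for t, e in zip(treated, propensity)]              # treated weights
    w0 = [(1 - t) / (1 - e) for t, e in zip(treated, propensity)]  # control weights
    mean1 = sum(w * y for w, y in zip(w1, outcome)) / sum(w1)
    mean0 = sum(w * y for w, y in zip(w0, outcome)) / sum(w0)
    return mean1 - mean0

# Hypothetical data: six units with unequal propensity scores.
treated = [1, 1, 0, 0, 1, 0]
outcome = [3.0, 5.0, 1.0, 2.0, 4.0, 1.5]
propensity = [0.8, 0.6, 0.3, 0.4, 0.7, 0.2]
print(ipw_ate(treated, outcome, propensity))
```

Units that were unlikely to receive the treatment they actually got are weighted up, which is how the weighting reduces confounding by the covariates in the propensity model.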
Hongzhu Ren
Exploratory Analysis of Causal Relationships Among 19 Inflammaging Biomarkers
Inflammaging—a chronic, low-grade inflammation associated with aging—is believed to play a critical role in the development of age-related diseases. Cytokine biomarkers are key to understanding this process. In this study, we applied the PC algorithm to explore causal relationships among 19 cytokine biomarkers using observational data. The analysis involved both the standard Gaussian conditional independence test and the Generalized Covariance Measure (GCM) test, which captures nonlinear associations. To ensure robustness, bootstrap resampling was conducted: 500 replications for the Gaussian test and 100 for GCM due to computational constraints. The Gaussian-based CPDAGs showed denser connections, while GCM-derived graphs were sparser but potentially more reliable. This highlights the impact of test choice on causal inference results. While the PC algorithm assumes full observability, future work will explore the FCI algorithm to address possible hidden confounders. These findings offer statistical insight into the structure of inflammaging pathways.
Zhiyi Zhu
Bias Correction in PET Kinetic Modeling: Refining the Simex Algorithm for Noisy Reference Inputs
Positron Emission Tomography (PET) is a crucial tool for quantifying metabolic processes and protein interactions in vivo. However, measurement noise in reference inputs, such as blood data or reference tissue signals, introduces bias in kinetic modeling, particularly when using the Simplified Reference Tissue Model (SRTM) to model time-activity curves (TACs). This project refines the Simulation-Extrapolation (Simex) algorithm to correct bias introduced by noisy inputs while maintaining computational efficiency. Using simulated PET datasets, we evaluate the impact of noise on parameter estimates and systematically compare our modified Simex approach with traditional methods. The improved algorithm enhances robustness and yields more accurate kinetic parameter estimates across varying noise levels. This research provides a framework for mitigating errors-in-variables problems in PET analysis, improving the reliability of radiotracer kinetic modeling for both biological and clinical applications.
Observational Study - MS (HSC LL 210)
Dylan Koproski
Analyzing Pre-Pandemic Age-Based Visitation Trends Using Advan Cell Phone Tracking Data
Understanding how different age groups visit various locations is crucial for public health, particularly in assessing exposure risks and access to essential services during crises. This study examines age-based mobility patterns using Advan cell phone tracking data combined with demographic census data to estimate visitation rates to different categories of points of interest (POIs). Analyses focus on three age groups: under 18, 19-65, and 65+. There is an emphasis on older adults due to their vulnerability in public health crises. Negative binomial regression models with an offset for total visitor volume were used to compare visitation behaviors across POI categories, and estimated marginal means analysis facilitated direct comparisons between each pair of POI categories. Additionally, predictive models built with ridge regression were developed for use on new data. The findings provide insights into baseline mobility behaviors that can inform public health strategies, ensuring at-risk populations maintain access to critical services while minimizing exposure risks.
Shaoyu Chen
Identifying Predictors of Survival and the Impact of Pretransplant Comorbidities on Pediatric Allogeneic Hematopoietic Cell Transplantation Outcomes
Allogeneic hematopoietic cell transplantation (HCT) is a critical therapeutic option for children with severe hematological and immunological disorders. However, transplant outcomes vary significantly, influenced by patient-specific factors, which include pretransplant comorbidities. This practicum investigates pediatric patients who underwent allogeneic HCT at Columbia University from 2008 to 2016. The primary objectives are to identify key factors associated with overall patient survival and to assess whether pretransplant comorbid conditions predict adverse post-HCT outcomes. Using retrospective data analysis, the study evaluates clinical characteristics, demographics, and comorbid conditions documented before transplantation. Results will enhance understanding of survival determinants and enable clinicians to refine pretransplant risk stratification, thereby optimizing outcomes and informing patient and family counseling in pediatric transplantation practices.
Jake Coldiron
Building and Validating a Federated Learning Algorithm with Virtual Pooling for Rare Genetic Diseases
Introduction: This practicum validated a federated learning algorithm that uses virtual pooling technology by comparing it against traditional methods. These techniques maintain patient privacy through decentralized data analysis. The research question is whether federated learning and virtual pooling technologies can replicate the results of traditional methods.
Methods: The algorithm was validated on patients with codes for genetic diseases and related medicines. Descriptive statistics were generated, and traditional analytical techniques, including chi-squared tests, were applied.
Results: Virtual pooling’s results were similar to those of traditional methods and were consistent across domains, validating federated learning’s consistency and accuracy.
Discussion: These technologies offer comparable accuracy to traditional methods while maintaining patient privacy, which supports further exploration of their use in decentralized research and clinical practice.
Acknowledgments: These analyses used data from the Ronald Reagan UCLA Medical Center and UCSF Medical Center for ZebraMD for Alnylam Pharmaceuticals to research their drug Onpattro. Due to proprietary restrictions, the technology utilized in this project cannot be shared.
Ruoxi Li
Enhancing Predictive Accuracy in Small Subgroups Through Fairness-Constrained and Transfer Learning Methods
Background: Predictive models often exhibit reduced accuracy in smaller subgroups due to limited sample sizes. Recent fairness-constrained models have shown success in improving performance in small subgroups. In addition, transfer learning models that use similarity constraints have improved accuracy within subgroups.
Methods: We propose two novel frameworks that build on methods in fairness-constrained and transfer learning. First, we introduce a fairness-regularized transfer learning model, where source coefficients are estimated in a larger subgroup and then adapted to improve modelling in the smaller subgroup. Second, we integrate an angle-based similarity penalty into a joint fairness model.
Results: We evaluate the proposed models in simulated scenarios, assessing improvements in predictive accuracy and fairness. Our models effectively improve performance while reducing disparities compared to existing models.
Conclusion: This study introduces two innovations: a fairness-regularized transfer learning model and an angle-based similarity constraint within a joint fairness model. With these techniques, we enhance predictive accuracy in small subgroups while reducing disparities.
Break (3:30pm - 3:45pm)
Session 3 (3:45pm - 4:45pm)
COVID-19 Research - MPH (HSC 303)
Chenyu Jin
Network and Regression Analysis of U.S. College Student Mental Health and Coping Strategies During COVID-19
Introduction: The COVID-19 pandemic had a significant impact on US college students’ mental health, evidenced by elevated rates of depression and anxiety. Many have reported sociodemographic correlates of their mental health. Yet, research on US college students’ coping strategies concerning their mental health and decision-making styles is limited and has rarely considered this age-group’s loneliness during the pandemic. A network analytic approach to understanding the associations of these constructs is absent. Deeper understanding of these associations is necessary to inform administrative and clinical intervention targets to support students’ resilience through future (prolonged) public health and life events.
Methods: US college students completed an online survey between September 2020 and May 2021 (N=1800) to assess sociodemographic, mental health, coping behaviors, COVID-related financial strain, and decision-making styles during the pandemic. Following a factor analysis of self-reported coping behaviors, data were analyzed using network analysis and regression analysis to identify significant correlates of depression and anxiety.
Results: Five main coping factors were identified: Habit Changes, Avoidance, Externalizing, Distraction, and Active coping. Network analysis highlighted the influence of Avoidance coping, which served as a crucial bridge with a stronger influence on its immediate neighbor nodes. Adjusted regression analysis supported the network analyses and showed that greater self-reported loneliness, COVID-related financial strain, habit change, and avoidance coping behaviors were significantly associated with increased odds of depression and anxiety.
Conclusion: By clarifying the structure and broader network of coping behaviors and how they relate to mental health and decision-making style, these analyses identify coping behaviors that academic administrators, health service centers, and behavioral therapists can target with interventions to reach college students and address their mental health and well-being.
Yanlin Cui
The Impact of Social Vulnerability and Childhood Opportunity on Child Traffic Mortality in New York State: A Pre- and Post-COVID Comparison
This project examines how child traffic mortality in New York State changed from before to after the onset of COVID-19 and whether these trends are associated with social vulnerability (SVI) and childhood opportunity (COI). The goal is to determine whether higher social vulnerability or lower childhood opportunity correlates with increased mortality rates and whether these relationships have shifted due to the pandemic.
To conduct this analysis, I obtained mortality data from my professor and sourced SVI and COI data independently. A major focus of my APEx work was data cleaning and integration. Using R, I converted longitude and latitude coordinates into census tracts and merged the three datasets, creating a structured database for analysis.
Although results are pending, this study aims to uncover pre- vs. post-COVID disparities in child traffic mortality, providing insights for equity-focused traffic safety policies.
Zhangfan Xia
Infant Feeding in the COVID-19 Mother-Baby Outcomes Study
This study investigates the association between maternal COVID-19 infection during pregnancy and infant feeding type at two months postpartum, as well as the impact of early feeding type on complementary food introduction and non-milk beverage intake. Using data from the COMBO20 study, I conducted a cross-sectional analysis with 324 mother-infant dyads. Chi-square tests revealed a significant association between maternal COVID-19 status and infant feeding type (p = 0.0109). Logistic regression adjusted for baby’s sex, gestational age, language, and delivery mode confirmed that COVID-positive mothers had 47% lower odds of breastfeeding or combination feeding (OR = 0.53, p = 0.0036). However, feeding type at four months was not significantly associated with the timing of complementary food introduction (p > 0.05) or non-milk beverage intake (p > 0.05). Findings highlight the need for targeted breastfeeding support for COVID-positive mothers and further research on long-term nutritional outcomes.
Data Visualizations - MPH (HSC 305)
Xinghao Qiao
Potential of DTP Pharmacies
Introduction: With increasing policy regulations and hospital budget constraints, DTP pharmacies have emerged as a critical access channel for specialty drugs, enabling pharmaceutical manufacturers to expand their market presence while ensuring patient accessibility.
Materials and Methods: The analysis is based on market data from IQVIA and policy interpretations. Key evaluation tools include city segmentation matrices, pharmacy performance assessments, and hierarchical management frameworks to identify growth opportunities, optimize partnerships, and enhance patient-centric pharmaceutical services.
Results: Since 2021, DTP pharmacies have experienced rapid expansion, particularly in the specialty drug sector. The transition from in-hospital to out-of-hospital medication pathways is also accelerating, emphasizing the need for integrated patient management systems beyond hospitals.
Discussion: The DTP retail pharmacy model is becoming a cornerstone of specialty drug distribution, requiring a data-driven, policy-aligned, and performance-based approach. By leveraging pharmacy segmentation, stakeholders can enhance patient access and strengthen brand influence in the healthcare ecosystem.
Huachen Shan
China Pharmaceutical Market Overview and Trends
I was an intern on the management consulting team at IQVIA in Shanghai. Together with my team lead, I completed a PowerPoint presentation on the China pharmaceutical market overview and trends. China’s pharmaceutical market, the second-largest globally, is forecast to grow at a CAGR of 3.5% through 2028, driven by innovation, improved reimbursement policies, and hospital infrastructure development. Post-COVID-19, economic recovery and policy shifts, such as healthcare anti-corruption measures and volume-based procurement (VBP), continue to shape the market landscape. The retail pharmaceutical sector is expanding, while multinational corporations (MNCs) leverage digital partnerships for patient engagement. Regulatory processes are accelerating, enhancing market access for innovative drugs. Despite challenges like price controls, China’s healthcare sector holds strong growth potential due to increasing demand and evolving policies.
Tiancheng Shi
Visualizing Hurricane Trends With An Interactive R Shiny Map
Hurricanes are among the most devastating natural disasters, causing significant loss of life and economic damage. Tracking long-term trends in hurricane activity is essential for improving disaster preparedness and reducing future risks. The goal of this project is to provide an effective tool for examining patterns, trends, and intensities of hurricanes over time, enabling public health professionals and policymakers to better understand the frequency, paths, and potential impact of high-intensity storms. This project visualizes historical hurricane data from the North Atlantic region, utilizing track records from 703 storms recorded since 1950. An interactive map was developed using R Shiny to explore different types of hurricanes, their paths, and associated wind speeds over time. Users can compare multiple storms directly on the map, making it a valuable tool for both educational and analytical purposes.
Sanne Glastra
Climate Resilience Dashboard for Mathematica Policy Research
I conducted my APEx at Mathematica Policy Research, where I was hired as a Climate Change Intern. As part of my APEx, I worked as lead R Shiny Developer to create a monitoring and evaluation dashboard for our client AGRA, an organization dedicated to agricultural resilience and food security in Africa. During this project, I served as a liaison among the UX design subcontractor, AGRA client representatives, and AWS hosting solution experts to create a modular app with 30+ pages. Beyond using R code within R Shiny, I self-taught CSS and HTML to implement unique features in the R Shiny application, including dynamically expanding cards and customized styling to enhance user experience. The app was ultimately delivered as a monitoring and evaluation tool to our client, to be used to track progress towards specific KPIs over the next 5 years.
Clinical Trials - MS (HSC LL 103)
Yizhuo Chang
The Efficacy of Computerized Cognitive Training in Patients with Coronary Heart Disease and Mild Cognitive Impairment
Background: Coronary heart disease (CHD)-related cerebral hypoperfusion increases the risk of cognitive decline, making computerized cognitive training (CCT) a vital approach where pharmacological treatments are limited.
Objective: This study aims to determine the efficacy of CCT on global cognitive function in patients with CHD and mild cognitive impairment (MCI) over a 12-week training period.
Methods: A total of 200 patients with CHD and MCI were enrolled from 8 centers and interviewed at 12 weeks. Exploratory data analysis (EDA) was conducted to examine within-group and between-group changes, followed by Causal Forest to estimate heterogeneous treatment effects.
Results: EDA revealed subgroup variations and showed significant cognitive improvements in both groups. The treatment group demonstrated more stable and significant cognitive improvement than the positive control group, with causal forest analysis confirming a positive treatment effect.
Conclusion: CCT effectively improves cognitive function in CHD patients with MCI, with subgroup variations in treatment response, highlighting the potential of personalized interventions.
Yangyang Chen
Revolutionizing Clinical Decision-Making: A Natural Language Chatbot for Clinical Trials Data Retrieval
This practicum project aims to develop an intelligent chatbot that empowers clinicians to efficiently access and analyze clinical trials data. By integrating with proprietary databases like Citeline as well as public repositories such as ClinicalTrials.gov, the chatbot transforms natural language queries into targeted data searches. Advanced natural language processing techniques—including text extraction, vector embedding, retrieval augmented generation, prompt engineering, and SQL queries—are leveraged to ensure concise and accurate responses regarding trial phases, eligibility criteria, and patient outcomes. The system is designed with robust security measures to ensure HIPAA compliance and safeguard sensitive information. Through this project, students will gain hands-on experience in applying biostatistical and data science methodologies to address real-world health applications, ultimately enhancing clinical decision-making by reducing data retrieval time and improving access to critical trial insights.
Zheshu Jiang
Enhancing MOUD Treatment Insights: Identifying Underrepresented Subgroups and Estimating Conditional Treatment Effects Using Superlearner
Medications for Opioid Use Disorder (MOUD) play a critical role in substance use treatment programs, yet certain subgroups remain underrepresented in clinical trial datasets. This study aims to identify multidimensional subgroups prescribed MOUD in population-level treatment settings but underrepresented in clinical trial networks. By leveraging Superlearner, a machine learning ensemble method, we estimate the conditional average treatment effect (CATE) to study heterogeneous treatment effects within clinical trial network (CTN) datasets. Our methodology involves using three CTN datasets to detect missing subgroups through multiple imputation and decision tree modeling. Superlearner, which integrates multiple predictive models via cross-validation, is employed to enhance the robustness of CATE estimates. The primary outcomes analyzed include relapse at week 24 and quality of relations with family. The next step in this project involves generating visualizations to improve the interpretability of machine learning models. By addressing trial representation disparities, this research enhances personalized MOUD treatment and informs policy for more inclusive, effective interventions.
Lauren Lazaro
Optimizing Clinical Trial Design for Polypill Research
Polypills combine multiple medications into a single pill to improve adherence, simplify treatment regimens, and enhance clinical outcomes. They have been widely studied in cardiovascular disease and are now expanding to other areas such as diabetes and infectious diseases. However, clinical trial methodologies for evaluating polypills vary widely, leading to inconsistencies in efficacy assessments and regulatory challenges. This study systematically reviews existing polypill trials across multiple therapeutic areas, comparing statistical approaches, outcome measures, and Type I error control to identify inconsistencies. Variability in trial design, population selection, and analytical strategies affects comparability across studies. While some methods enhance reliability and statistical power, others introduce bias, limiting the generalizability of findings. Addressing these challenges through standardized methodologies will improve comparability, optimize trial designs, and enhance the credibility of polypill research, ultimately facilitating broader adoption and regulatory approval.
Cancer Research - MS (HSC LL 106)
Cynthia Cui
Evaluating the Impact of Medicaid Expansion on Access to NGS Testing for Advanced Cancer Patients: A Difference-in-Differences Analysis
Medicaid expansion, implemented in select states beginning January 1, 2014, allowed states to extend Medicaid coverage to adults with incomes up to 138% of the federal poverty level. Evaluating the impact of this policy on access to high-cost medical services remains a key area of interest. This study examines the effect of Medicaid expansion on access to next-generation sequencing (NGS) testing among patients with advanced cancers using a difference-in-differences (DiD) model. Following cohort selection, the DiD analysis revealed no statistically significant effect of Medicaid expansion on NGS test access across various types of metastatic cancers. These findings suggest that Medicaid expansion may not have substantially improved access to NGS testing for patients with advanced cancer.
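For readers less familiar with the design, the core of a two-group, two-period difference-in-differences comparison can be sketched as below. This is the simplest unadjusted version, not the study's model. The function name and testing rates are hypothetical placeholders.

```python
def did_estimate(pre_treated, post_treated, pre_control, post_control):
    """Unadjusted 2x2 difference-in-differences estimate.

    Each argument is a list of outcome values (for example, testing rates)
    for one group in one period. The estimate is the change in the treated
    group minus the change in the control group.
    """
    def mean(xs):
        return sum(xs) / len(xs)

    change_treated = mean(post_treated) - mean(pre_treated)
    change_control = mean(post_control) - mean(pre_control)
    return change_treated - change_control

# Hypothetical testing rates (not study data).
expansion_pre, expansion_post = [0.20, 0.22], [0.30, 0.32]
control_pre, control_post = [0.18, 0.20], [0.27, 0.29]

effect = did_estimate(expansion_pre, expansion_post, control_pre, control_post)
print(f"DiD estimate: {effect:.3f}")
```

Subtracting the control group's change removes trends shared by both groups, which is valid only under the parallel-trends assumption. A regression version with covariates yields the same quantity as the coefficient on the group-by-period interaction.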
Mirah Koota
Distribution of Early-Stage NSCLC Cases and Surgical Resections in the USA using SEER Database for Merck KN-671
This is a geographical study of surgical treatment for early-stage non-small cell lung cancer (NSCLC) in the USA, using the SEER database in support of Merck KN-671, a Keytruda regimen for preoperative therapy. A second goal was to investigate the SEER*Stat database as a novel resource for Merck. The method involved using SEER*Stat software to identify a cohort of NSCLC patients and merging these data with U.S. census data in RStudio. The analysis, also done in RStudio, assessed the distribution of NSCLC cases by county and the use of surgery as a treatment across stages: localized, regional, and distant. My study successfully identified the counties with the highest number of NSCLC cases, by stage, and calculated the percentage of patients who underwent surgery. Several limitations of the SEER database were noted, including the availability of data only up to 2021, coverage of about 50% of the U.S. population, and incomplete treatment records and staging data. The results of this analysis are proprietary to Merck, but this work highlights the potential of SEER data for geographical insights into NSCLC treatments and identifies the database's strengths and limits for future research.
Yueyi Xu
Survival Outcomes Of Neoadjuvant, Adjuvant, And Perioperative Immunotherapy In Resectable Non-small Cell Lung Cancer: An Analysis Of The National Database
Background: Immunotherapy has significantly reshaped the therapeutic approach to non-small cell lung cancer (NSCLC), yet the optimal timing of its administration in the curative setting remains unclear. The study aimed to investigate the association between immunotherapy timing and overall survival using a population-based cancer registry.
Methods: Using the National Cancer Database, we identified patients with stage I-III NSCLC who underwent surgical resection and received immunotherapy between 2016 and 2020 (n=2,285). The primary outcome was overall survival depending on the sequence of immunotherapy relative to surgery using Cox proportional hazards models adjusted for key prognostic factors and confounders.
Results: Among 2,285 surgically treated patients with stage I-III NSCLC who received immunotherapy, 84% were white and 40% had squamous cell carcinoma. In adjusted survival analyses, neoadjuvant immunotherapy was non-inferior to perioperative immunotherapy (aHR = 1.04, p = 0.79). However, receipt of adjuvant immunotherapy alone was associated with a significantly higher risk of mortality compared to perioperative therapy (aHR = 1.61, p < 0.01).
Haotian Tang
Integrative Multi-Omics Approach Using Augmented Similarity Network Fusion For Cancer Subtype Identification And Clinical Validation
Cancer heterogeneity challenges disease understanding and personalized treatment. This practicum utilizes an integrative multi-omics approach, employing Similarity Network Fusion (SNF) and augmented SNF (ab-SNF), to combine genomic, transcriptomic, and proteomic data from The Cancer Genome Atlas (TCGA). The aims include data preprocessing, network integration, and identification of biologically and clinically significant cancer subtypes. Validation involves cross-validation and external datasets, while statistical analyses assess treatment responses and survival implications. Results promise to enhance precision medicine through improved patient stratification and personalized therapies.
Functional Data Analysis - MS (HSC LL 107)
Peng Su
Ecological Scaling Of Temporal Fluctuations With Bacterial Abundance In Gut Microbiota Depends On Functional Properties Of Individual Microbial Species
Macroecological relationships that describe statistical associations between species’ abundances and their spatial and temporal variability are among the most general laws in ecology and biology. One of the most frequently observed relationships is a power-law scaling between the means and variances of temporal species abundances, known in ecology as Taylor’s law. However, what determines its scaling exponents across species and ecosystems is not understood. Here, we use temporal trajectories of human gut microbiota to analyze the relationship between functional properties of individual bacterial species and microbial communities with the scaling of species-specific Taylor’s law. We find that species Taylor’s law depends on the individual species’ functional properties. Specifically, we observe lower Taylor’s law scaling for species with larger metabolic networks, for species that can grow on a larger number of carbon sources, and for species with particular metabolic functions. Overall, our study reveals that Taylor’s law scaling is strongly associated with the functional capabilities and biosynthetic properties of individual microbial species.
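Taylor’s law states that variance scales as a power of the mean, variance = a · mean^b, so the exponent b is the slope of a straight line on log-log axes. The sketch below estimates it with ordinary least squares. How the study computed species-specific exponents is not described in the abstract, so this is only a generic illustration. The function name and the input values are synthetic.

```python
import math

def taylor_exponent(means, variances):
    """Fit log(variance) = log(a) + b * log(mean) by least squares.

    Returns (b, a): the Taylor's law scaling exponent and the prefactor.
    """
    xs = [math.log(m) for m in means]
    ys = [math.log(v) for v in variances]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, math.exp(intercept)

# Synthetic abundance means and variances generated with a = 2 and b = 1.5.
means = [1.0, 10.0, 100.0, 1000.0]
variances = [2.0 * m ** 1.5 for m in means]

b, a = taylor_exponent(means, variances)
print(f"exponent b = {b:.2f}, prefactor a = {a:.2f}")
```

A lower fitted exponent, as reported for species with larger metabolic networks, means that variance grows more slowly as mean abundance increases.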
Li Tian
Echocardiographic Changes in Sickle Cell Disease: Longitudinal Trends and Sex-Specific Differences
Introduction:
Children with sickle cell disease (SCD) are at risk of cardiac complications, yet the progression of left ventricular function remains unclear. This study examines longitudinal changes in left ventricular mass (LVM), end-diastolic posterior wall thickness (LVPWD), end-systolic (LVES) and end-diastolic (LVED) diameters, and fractional shortening (LVFS), with a focus on sex differences.
Materials and Methods:
A longitudinal observational study was conducted using repeated echocardiographic measurements from pediatric SCD patients. Linear mixed-effects models will assess changes in Z-scores of LVM, LVPWD, LVES, LVED, and LVFS, adjusting for age, body surface area (BSA), and clinical factors. Interaction terms will evaluate sex differences.
Results:
We anticipate sex differences in cardiac changes over time. Males may show greater increases in LVM and LVED Z-scores, while females may maintain higher LVFS values. Statistical analysis will determine if these trends persist after adjustment.
Discussion and Further Directions:
Findings may inform sex-specific monitoring in SCD. Future work will examine clinical interventions, genetics, and disease severity on these trends.
Yuxin Yin
Implementation of a Transcriptome-Wide Association Study (TWAS) Pipeline Using Summary Statistics
Transcriptome-Wide Association Studies (TWAS) provide a powerful approach for linking genetic variants to gene expression and protein levels by leveraging summary statistics. This practicum project focuses on implementing a TWAS pipeline using proteogenomic summary statistics, with an emphasis on preprocessing steps such as allele quality control (QC), harmonization, and the application of weight models. The goal is to evaluate how effectively TWAS can expand individual-level genetic findings to broader datasets using summary-level data. This study contributes to the growing field of proteogenomics by demonstrating how TWAS can be efficiently implemented with summary statistics, enhancing the scalability and statistical power of protein-level genetic association analyses. The results provide insights into best practices for summary-based TWAS implementation and its potential for uncovering novel genetic associations in large-scale datasets.
Machine Learning - MS (HSC LL 108A)
Manye Dong
Predicting Investigator Performance for Efficient Clinical Trial Enrollment
Patient enrollment is a critical challenge in clinical trials, with 85% of trials experiencing delays. Optimizing site and investigator selection is key to improving trial efficiency. This study predicts the future performance of clinical trial investigators based on historical data to enhance enrollment timelines.
We utilized historical performance data from a Central Lab Services (CLS) database, which included 12,775 unique investigators. We categorized their performance into high, medium, and low levels using rankings within each protocol. Machine learning models, including XGBoost, AdaBoost, Random Forest, and Logistic Regression, were trained to predict future performance.
XGBoost outperformed the other models, achieving a 21% accuracy improvement over the baseline. It also demonstrated better precision and recall, especially in identifying high- and low-performing investigators. The analysis also found that startup time and kit return rates were significant predictors of future success.
This predictive model offers a cost-effective method for selecting investigators likely to meet enrollment targets, improving trial efficiency and saving time for pharmaceutical companies.
Haitian Huang
Predicting Sleep Disorders Using Biostatistical and Machine Learning Methods
Insomnia is a prevalent sleep disorder linked to various adverse health outcomes, yet its underlying risk factors remain complex and multifaceted. This study aims to identify key predictors of insomnia using data from the National Health and Nutrition Examination Survey (NHANES). We employed a combination of traditional statistical methods, including logistic regression and generalized estimating equations (GEE), along with machine learning techniques such as random forest, XGBoost, and LASSO logistic regression to enhance predictive accuracy and variable selection. Key factors examined included demographic characteristics, lifestyle behaviors, mental health status, and socioeconomic conditions.
Qin Huang
Identifying Key Nonverbal Communication Features to Classify Autism in Minimally Verbal Children
This study examines nonverbal communication in minimally verbal children with Autism Spectrum Disorder (ASD) using the Short Observation for Social Communication (SOSC). By analyzing gaze, gestures, facial expressions, and posture, the research aims to identify ASD-specific patterns in children with autism, those with developmental disabilities, and typically developing children to improve early autism diagnosis. To identify distinguishing nonverbal features predictive of autism, the study employs a two-step approach: Random Forest (RF) for initial feature selection and LASSO regression for refinement, enhancing interpretability and predictive accuracy. These findings provide valuable insights for more targeted interventions. Ultimately, this research seeks to advance early identification and intervention strategies for minimally verbal autistic children by integrating behavioral analysis with computational modeling.
Zhuodiao Kuang
Leveraging Multi-Source Summary-Level Data: Enhanced Risk Prediction in Underrepresented Population Via Transfer Learning
We propose fastCOMMUTE, a novel transfer learning method aimed at improving risk prediction in a target site by efficiently integrating multiple source datasets while protecting cross-source data privacy. By using the trained model from each source to generate synthetic data, fastCOMMUTE offers a communication-efficient and privacy-preserving framework. The synthetic data are then integrated into a unified optimization process with the target data and calibrated against target-site data to prevent negative transfer. Through simulations and real-world applications, we showed that fastCOMMUTE can improve predictive performance under different practical settings. By leveraging data from multiple sources, fastCOMMUTE has the potential to bridge health disparities, making advanced predictive models accessible to underrepresented populations with limited data.
Environmental Health Research - MS (HSC LL 109A)
Tianyuan Deng
Effects of Prenatal Metal Exposure on Birth Outcomes: Evidence from HHEAR Project Data
Being exposed to a variety of metal mixtures in the prenatal period may have a negative impact on fetal development; however, their exact dose-response relations and joint effects remain unknown. This study intends to assess the influences of prenatal metal exposure on prematurity using statistical methods as well as data-driven techniques.
This study uses HHEAR data that deals with metal exposure and birth outcomes. We will employ multivariable regression analysis and Bayesian Kernel Machine Regression (BKMR) to delve deeper into the effects of individual metals and their interactions. Machine learning methods, such as random forests and clustering, are useful for pattern recognition. Additionally, interactive data visualization presents exposure levels, temporal variations, and geographical trends, providing a clearer representation of complex relationships.
Preliminary analyses indicated that arsenic, lead (Pb), manganese (Mn), and zinc (Zn) were correlated with birth outcomes. Arsenic exposure was associated with decreased newborn weight, while zinc may counteract lead’s negative effect on head circumference. Geographical analysis identified hotspot regions, and exposure levels were found to change during pregnancy.
Ruijie He
Interactive Dashboard Development for the TEDDY Study: Enhancing Analysis of Environmental Determinants of Diabetes in Young Individuals
The TEDDY study—The Environmental Determinants of Diabetes in the Young—explores how environmental factors influence diabetes onset in youth. This practicum project enhances the TEDDY dataset's utility by developing an interactive dashboard within the HHEAR Data Repository. Aimed at providing a user-friendly tool for visualizing epidemiological and biomarker data, this dashboard improves research interactivity and accessibility, facilitating insights into public health interventions. The project includes creating a dynamic dashboard for clear data presentation and refining data cleaning scripts for high-quality, consistent data analysis. Additionally, a disease-specific dashboard integrates statistical analyses, enabling detailed examination of environmental impacts on pediatric diabetes, supporting targeted public health policies.
Nisha Lingam
Evaluating the Joint Effects of O3, PM2.5, and NO2 as an Exposure Mixture on Mortality
Long-term exposure to O3, PM2.5, and NO2 has been consistently associated with all-cause mortality; however, the mixture effects of these pollutants are less well understood. This project investigates the overall effect of long-term exposure to PM2.5, NO2, and O3 as an exposure mixture on all-cause mortality at the county level in New York State from 2000 to 2016.
Mortality was measured as the total count of deaths in each county per month. Daily exposure concentrations for 1km x 1km grid cells were obtained from high-resolution predictive models, then aggregated by county and averaged by month. Covariates included demographic, environmental, and socioeconomic variables. To estimate the exposure mixture effects, two approaches were used: weighted quantile sum (WQS) regression and quantile g-computation. WQS regression transforms each exposure into quantile categories and builds an exposure index as the weighted average of the transformed exposures; the index is then used in a generalized linear model. Quantile g-computation is an extension of WQS regression that reduces bias and can apply causal assumptions to the effect estimates.
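The index construction described above can be illustrated with a minimal Python sketch. It is not the project's code: the function names, the toy exposure values, and the fixed weights are all hypothetical, and in an actual WQS analysis the weights are estimated from the data rather than supplied by hand.

```python
from bisect import bisect_right
from statistics import quantiles


def quantile_scores(values, q=4):
    """Map each value to its quantile category 0..q-1 (quartiles by default)."""
    # Interior cut points, e.g. the 25th, 50th, and 75th percentiles for q=4.
    cuts = quantiles(values, n=q, method="inclusive")
    return [bisect_right(cuts, v) for v in values]


def wqs_index(exposures, weights, q=4):
    """Weighted quantile sum index for each observation.

    exposures: dict mapping pollutant name -> list of concentrations
    weights:   dict mapping pollutant name -> non-negative weight (sum to 1);
               assumed fixed here purely for illustration
    """
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    scored = {name: quantile_scores(vals, q) for name, vals in exposures.items()}
    n_obs = len(next(iter(scored.values())))
    return [
        sum(weights[name] * scored[name][i] for name in scored)
        for i in range(n_obs)
    ]


# Toy monthly county averages (hypothetical values, arbitrary units).
toy_exposures = {
    "pm25": [1, 2, 3, 4, 5, 6, 7, 8],
    "no2": [8, 7, 6, 5, 4, 3, 2, 1],
    "o3": [1, 2, 3, 4, 5, 6, 7, 8],
}
toy_weights = {"pm25": 0.5, "no2": 0.25, "o3": 0.25}
index = wqs_index(toy_exposures, toy_weights)
```

The resulting index would then enter a generalized linear model as a single covariate alongside the adjustment variables.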
Yuhan Wang
The Impact of Wildfire-Specific PM2.5 on Mortality in New York Counties 2006–2016
Wildfires are an increasing environmental and public health concern, contributing to elevated levels of fine particulate matter (PM2.5), which has been linked to adverse health outcomes, including increased mortality. While the effects of total PM2.5 exposure on mortality are well-documented, the specific influence of wildfire-derived PM2.5 remains less understood. This study examines the relationship between wildfire-specific PM2.5 exposure and mortality across New York counties from 2006 to 2016, adjusting for key meteorological, environmental, and demographic confounders. We use time-series analysis with a Generalized Additive Model (GAM) to examine the temporal relationship between monthly wildfire PM2.5 exposure and mortality rates. The analysis controls for key factors, including non-smoke PM2.5, temperature, precipitation, rural-urban commuting patterns, primary health care access, and vegetation index (NDVI). Our findings highlight a significant positive association between wildfire PM2.5 and mortality rates, emphasizing the need for targeted public health interventions and policies to mitigate the impact of wildfire smoke exposure.
Longitudinal Health Research - MS (HSC LL 202)
Jennifer Li
Exploring The Impact of Periodontitis on Neurological Function
Introduction:
Periodontitis, a chronic inflammatory disease, may contribute to cognitive decline. This study examines its impact on neurological function, categorizing periodontitis using pocket depth, tooth count, and other dental measures in an aging, multi-ethnic population.
Materials and Methods:
Using Washington Heights and Inwood Community Aging Project (WHICAP) data, we assess cognitive function across memory, language, processing speed, and visuospatial ability (continuous variables). MCI and dementia status are binary outcomes. Longitudinal mixed-effects models identify key dental factors linked to cognitive decline, adjusting for demographic and health-related confounders.
Expected Contributions:
This study explores whether periodontitis accelerates cognitive aging. Findings may highlight oral health as a modifiable risk factor for neurodegeneration. Future work will examine potential mechanisms, including systemic inflammation and vascular pathways.
Jixin Li
Modeling Parent Fraction In PET Imaging
This study aimed to model PET, a method used for quantifying and measuring blood radioactivity over time. The primary objective was to identify the best model for predicting parent fraction. Several models were fitted and evaluated based on their performance. First, nonlinear least squares (NLS) and nonlinear mixed-effects (NLME) models were compared by modeling the relationship between parent fraction and time using a sigmoid function with three parameters. A comparison of the parameter density plots for NLS and NLME showed that NLME provides narrower estimates and lower variance. This results in a sharper density for NLME, reflecting the shrinkage effect. Next, the same procedure was applied using an inverse-gamma function with four parameters. Again, NLME outperformed NLS. Additionally, hierarchical generalized additive models (HGAMs) were used to identify the best model. The Global Smoother with individual effects model was fitted using both Gaussian and beta-transformed approaches. A comparison of the Gaussian and beta-transformed GS models was conducted using 5-fold cross-validation and mean squared error (MSE) evaluation. The beta-transformed model performed better.
Arthur Starodynov
NYC's Salary Transparency Law Bridging The Gender Divide
In November 2022, New York City enacted Local Law 2022/59 requiring employers to post salary ranges in job advertisements to address gender pay inequities. We analyzed salary expectations, negotiations, and outcomes among Wave 2 NYC job seekers (n=702) and recent hires (n=481) at 18 months post-law implementation. Despite salary transparency requirements, men reported significantly higher mean annual incomes ($90,100) compared to women ($66,200; p<0.001), with this disparity persisting in final negotiated offers (men: $116,000; women: $89,900; p=0.006) among new hires. Similar patterns emerged in initial salary expectations (men: $97,600; women: $87,000; p=0.093) and hourly wage negotiations (men: $44.20/hour; women: $33.80/hour; p=0.035). The gender gap remained consistent across educational levels, with no significant differences in educational attainment between men and women (p=0.495). These findings suggest that salary range disclosure alone may be insufficient to achieve pay equity, indicating the need for additional policy measures.
Longyu Zhang
Age-Related Changes In Brain Dynamic Functional Connectivity Based On Persistent Homology
Methodologically, this project extends persistent homology methods previously applied to static time slices to the analysis of dynamic brain fMRI data by introducing the temporal dimension. We integrated time series data using approaches such as sliding time windows and explored how the underlying topological structure of functional connectivity changes over time.
The health-related question addressed by the practicum project is the relationship between the stability of topological structures and the age of participants. With persistent homology-based measures for brain functional connectivity, we obtained time-varying curves of topological structures for participants of different ages and analyzed how these measures change with the age of the subjects.
Survey Statistics - MS (HSC LL 204)
Ou Sha
The Impact Of Updated Federal Guidelines For Collecting Race And Ethnicity Data
This study explores how question format affects race and ethnicity self-reporting using the NYC Health Panel, a survey of approximately 15,000 randomly assigned panelists. The experiment compares two versions of the question: the old two-question version asked respondents about their Hispanic/Latino ethnicity first, followed by their race, while the new single-question version asked about race without first asking about Hispanic/Latino ethnicity. Two analytical approaches are used: (1) comparing overall distributions between the two versions and (2) assessing individual-level longitudinal changes among panelists who initially answered the old two-question version and later responded to the new single-question version. By analyzing these variations, the study provides insights into how survey design influences self-reported racial and ethnic identities, helping to improve questionnaire design and data collection methods for surveys conducted by the NYC Health Department.
Tongxi Yu
Multi-dimensional Proteomic Analysis for Age-related Biomarker Discovery
Age-related changes in protein expression profiles represent a critical but incompletely understood aspect of human biology with significant implications for health and disease. This study employs high-throughput proteomics to identify and characterize age-associated protein biomarkers across diverse tissue samples. Using Olink proximity extension assay technology, we analyze normalized protein expression (NPX) values from multiple sample types, integrating detailed sample attributes including age at collection, primary biosample type, collection timeline, body site, histological classification, and tumor status.
Our analytical framework implements a systematic approach to control for potential confounding variables while identifying proteins that demonstrate significant age-related expression patterns. The comprehensive dataset encompasses samples from various collection years and visits, enabling both cross-sectional and longitudinal perspectives on age-associated protein dynamics. Quality control measures account for technical variations, while statistical models address biological heterogeneity within and between sample groups.
Mengyuan Yu
Association Between Time-Restricted Eating And Dental Caries
Dental caries remains a prevalent chronic disease influenced by diet and oral hygiene. While time-restricted eating (TRE) is recognized for metabolic benefits, its impact on oral health is unclear. This study examines the association between TRE and dental caries in a nationally representative U.S. population. We analyzed 8,724 adults from NHANES 2017–March 2020. TRE adherence was defined as ≤8-hour eating windows across two non-consecutive 24-hour dietary recalls, with 6-hour and 10-hour stratifications. Dental outcomes included untreated caries and restored caries assessed through clinical examinations. Survey-weighted quasi-binomial models adjusted for demographic confounders, oral hygiene, and dental access. Propensity score matching was performed for validation. TRE was significantly associated with higher restored caries prevalence in both adjusted models (OR = 1.79, 95% CI: 1.15–2.77, p = 0.012; OR = 2.59, 95% CI: 1.03–6.51, p = 0.044), but not with untreated caries. Age interaction was non-significant (p = 0.082). TRE may influence dental care utilization rather than direct caries risk, warranting further study of behavioral and biological mechanisms.
Junyi Ren
Measuring Tick Risk Along an Urbanization Gradient
Tick-borne diseases are a growing public health concern, particularly in urban environments where human behaviors influence exposure risks. This practicum analyzes spatial and temporal patterns of tick exposure in New York and Boston using KAPP survey data, The Tick App, and SafeGraph data collected since 2019. The study evaluates how mobility patterns, outdoor activities, and environmental factors contribute to tick encounters across urban gradients.
A key focus is assessing individual- and population-level exposure patterns in relation to tick hazards. The analysis integrates GIS with advanced statistical modeling in ArcGIS and R. Generalized linear models (GLMs) and Bayesian hierarchical models (INLA) will quantify exposure risk, while spatial clustering and trajectory analysis will characterize movement patterns contributing to tick exposure. A risk mapping framework will visualize exposure hotspots.
This practicum enhances understanding of human-tick interactions in urban settings, informing public health strategies for mitigating tick-borne disease risks and improving vector surveillance.
Epidemiology - MS (HSC LL 205)
Wenwen Li
Chikungunya Vaccine Allocation
Introduction:
Chikungunya vaccine allocation in low- and middle-income countries faces financial and logistical constraints, requiring a balanced approach to cost and health impact.
Materials and Methods:
This study develops a dual-objective model integrating the Susceptible-Infected-Recovered (SIR) model and Pareto Optimality simulations to optimize vaccine distribution while minimizing costs.
Results:
Simulations show that balancing health benefits and economic constraints leads to more efficient vaccine allocation than single-objective approaches, reducing disease incidence and optimizing resource use.
Discussion and Further Directions:
Policymakers should adopt multi-criteria frameworks to enhance vaccine distribution. Future research will refine model parameters and explore real-world applications with public health organizations.
Integrating epidemiological and economic models provides a scalable, cost-effective strategy for improving immunization efforts in LMICs.
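As a rough illustration of the dual-objective idea in this abstract, the sketch below pairs a simplified discrete-time SIR simulation with a Pareto filter over cost and infections. It is not the study's model: the function names, the transmission and recovery rates, the per-unit cost, and the candidate coverage levels are all hypothetical, and vaccination is crudely modeled by moving a fraction of the population out of the susceptible pool at the start.

```python
def sir_cumulative_infections(beta, gamma, vaccinated_fraction, days=200, i0=0.001):
    """Discrete-time SIR on population fractions.

    Vaccinated individuals are assumed fully protected and start as removed.
    Returns the cumulative fraction of the population ever infected.
    """
    s = 1.0 - vaccinated_fraction - i0
    i = i0
    cumulative = i0
    for _ in range(days):
        new_infections = beta * s * i
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        cumulative += new_infections
    return cumulative


def pareto_front(options):
    """Keep the (cost, infections, label) options not dominated on both objectives."""
    front = []
    for a in options:
        dominated = any(
            b[0] <= a[0] and b[1] <= a[1] and (b[0] < a[0] or b[1] < a[1])
            for b in options
        )
        if not dominated:
            front.append(a)
    return front


# Hypothetical candidate strategies: coverage level and an assumed unit cost.
COST_PER_UNIT_COVERAGE = 100.0  # illustrative cost of vaccinating the whole population
candidates = []
for coverage in (0.0, 0.25, 0.5, 0.75):
    infections = sir_cumulative_infections(beta=0.3, gamma=0.1, vaccinated_fraction=coverage)
    candidates.append((coverage * COST_PER_UNIT_COVERAGE, infections, f"coverage={coverage}"))

efficient_strategies = pareto_front(candidates)
```

In this toy setup every coverage level is Pareto-efficient, because more coverage always costs more and prevents more infections; dominated options only appear once strategies differ in how doses are targeted, which is the kind of trade-off a dual-objective model is meant to expose.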
Qianying Wu
The Impact of Adverse Childhood Experiences on Cognitive Function in Older Adults
This study examines the relationship between Adverse Childhood Experiences (ACEs) and cognitive function in older adults, using data from the Adult Changes in Thought (ACT) Study. We will assess the impact of cumulative ACE exposure on key cognitive domains, including executive function (Trail Making Test A and B), memory (verbal fluency), and visuospatial abilities (clock drawing test). We hypothesize that higher cumulative ACE scores will be associated with poorer cognitive performance across these domains. Using linear and logistic regression models, we will explore sociodemographic, health, and social factors. Interaction terms will assess sex-, race-, and age-specific effects, with sensitivity analyses conducted on complete cases. Findings from this study will provide insights into how early-life adversity influences cognitive aging and may inform policy and prevention strategies to mitigate cognitive decline in at-risk populations.
Linshen Cai
Impact of Low-Molecular-Weight Heparin on Post-Hepatectomy Liver Failure: A Retrospective Cohort Study
Post-hepatectomy liver failure (PHLF) is a severe complication, and the role of low-molecular-weight heparin (LMWH) in reducing its risk remains unclear. This research identifies important risk variables and assesses the effect of LMWH on PHLF. After controlling for covariates, the relationship between LMWH usage and PHLF incidence was investigated using logistic regression on a retrospective dataset of hepatectomy patients (2019–2023). The findings indicated that LMWH was linked to a lower incidence of PHLF (OR = 0.73, 95% CI: 0.54–0.99, p = 0.048). Risk was considerably reduced by preoperative prothrombin time (OR = 0.47, 95% CI: 0.31–0.72, p < 0.001) and increased by preoperative international normalized ratio (OR = 3.02, 95% CI: 1.95–4.69, p < 0.001). Hypertension, vascular stump, and smoking history were not significant predictors. These results imply that preoperative coagulation indicators are important predictors and that LMWH may have a protective effect against PHLF. To confirm these results and improve anticoagulant treatments for individuals who have had liver resections, further research is required.
Observational Study - MS (HSC LL 207)
Yaduo Wang
Observational Study To Assess The Overall Treatment Effect Of AD Medication On Essential Cognitive Outcomes
Introduction:
The discrepancy in treatment effect between patients eligible for the study and those seen in practice is unknown. This study refines the patient sample by applying exclusion criteria and examines the effect of Alzheimer’s disease (AD) medication on cognitive outcomes.
Methods:
Exclusion criteria are applied to the cross-sectional ROSMAP dataset. The exclusion criteria include neurological disorders other than Alzheimer’s disease, significant systemic diseases, and major psychiatric disorders. A linear regression model and a causal random forest are used to assess the effect of AD medication while adjusting for covariates in excluded and remaining patients. Further analyses compare never using AD medication with always using medication and with starting to use medication. Logistic regression models for severe CERAD scores and Braak stage are also fitted.
Results:
All results indicate that taking AD medication is associated with a steeper decline in cognitive outcomes over time.
Discussion:
The results might be confounded by disease severity. Further analysis shows that taking AD medication is associated with worsening disease severity.
Ekaterina Hofrenning
Blood-based Biomarkers for Predicting Vascular Contributions to Cognitive Impairment and Dementia Brain Changes
This study investigates the utility of plasma neurofilament light chain (NfL) and glial fibrillary acidic protein (GFAP) as predictors of vascular contributions to cognitive impairment and dementia (VCID). Using data (n=1709) from the Mayo Clinic Study of Aging, a population-based cohort, we examined relationships between these blood-based biomarkers and VCID states, as defined by white matter hyperintensities and clinical dementia scores. Participants were categorized into four states: cognitively unimpaired without small vessel disease (SVD) presence (CU), cognitively unimpaired with SVD (Vas-CU), cognitively impaired with SVD (VCI), and cognitively impaired without SVD (Non-vas-CI). Logistic regression and ROC analyses revealed that both NfL and GFAP effectively discriminated Vas-CU and VCI from CU (AUC=0.83 and 0.88, respectively), with stronger performance in amyloid-negative patients. The markers showed a weaker ability to distinguish Non-vas-CI from CU (both AUC=0.69), suggesting vascular damage may be a driving factor. These results speak to the potential utility of blood-based biomarkers for VCID identification as a more accessible alternative to traditional approaches.
Yuki Joyama
The Role of Menopausal Status in Exercise-Driven Cognitive Benefits
Introduction: Physical activity can slow cognitive decline in men, but women may experience a reduced effect, potentially due to hormonal changes associated with menopause. This study aims to determine whether menopausal status moderates the cognitive benefits of exercise in women.
Material and methods: We analyzed data from the Reference Ability Neural Network (RANN) study and the Cognitive Reserve (CR) study (N = 1154) with five-year follow-up cognitive assessments (memory, speed/attention, reasoning, vocabulary). Linear regressions were performed on standardized cognitive scores with interactions between exercise and menopausal status (or age categories: <45, 45-55, >55), adjusting for relevant covariates.
Results: Postmenopausal women had a better 5-year memory change compared with premenopausal women, but higher baseline exercise levels attenuated this advantage. There was no consistent interaction between menopausal status and exercise for other domains and no significant interactions were observed in men.
Discussion: These findings suggest that menopausal status could influence the relationship between exercise and cognitive outcomes in specific domains in women.
Tianyou He
Investigating the Association Between BMI and HbA1c Using Observational Data and Statistical Modeling
Understanding the effect of metabolic risk factors on diabetes progression is crucial for effective disease control. This research uses a dataset from a biomedical research company to examine the relationship between Body Mass Index (BMI) and HbA1c. Because the dataset is non-experimental, several statistical modeling techniques are used, including Ordinary Least Squares (OLS) regression, Generalized Additive Models (GAM), and Bayesian regression, to estimate and visualize the association between BMI and glycemic control. The results indicate that higher BMI is associated with higher HbA1c levels, and that the form of the relationship varies across BMI ranges. The research also considers the role of causal inference in healthcare and may offer evidence-based statistical grounds for diabetes prevention.
Multi-Interest Session - MS (HSC LL 210)
Zilin Huang
Automating Survival Analysis for Clinical Trial Studies
Assessing drug safety and efficacy is the key objective of clinical trial studies, and researchers use multiple statistical methods to derive these trends from trial datasets. In this project, I will introduce an integrated data analysis methodology developed by Roche through their self-developed R package. The R Shiny dashboard generated from this package accepts clinical trial datasets and automatically generates various types of outputs, called TLGs (Table, Listing and Graph) by Roche's definition. A key component of this package is survival analysis, in which customized Kaplan-Meier and Cox regression plots are generated after modifying specific parameters, such as trial arms and patient demographic variables. These methods help measure important time-to-event outcomes, including Overall Survival (OS) and Progression-free Survival (PFS), in clinical trial settings, which accelerates the study of patient safety for Roche in its research areas related to oncology and hematology.
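For readers unfamiliar with the Kaplan-Meier estimate behind such plots, the sketch below computes the product-limit survival curve in plain Python. It is a generic illustration only: it does not use or reproduce the Roche R package, and the function name and toy data are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times:  follow-up time for each patient
    events: 1 if the event (e.g. death or progression) was observed, 0 if censored
    Returns a list of (time, survival_probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all patients whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            # Patients censored at t still count as at risk at t.
            survival *= 1 - n_events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= n_leaving
    return curve


# Toy data: five patients, two of them censored (hypothetical values, in months).
toy_curve = kaplan_meier(times=[1, 2, 2, 3, 4], events=[1, 1, 0, 1, 0])
```

Running the function separately on each trial arm gives the step curves that a Kaplan-Meier plot overlays for comparison.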
Ze Li
Investigating Gene-Environment Interactions in Age-Related Hearing Impairment Using a Case-Only Approach
Age-Related Hearing Impairment (ARHI) is influenced by genetic and environmental factors. This study employs a case-only design to investigate gene-environment interactions (GxE) associated with hearing aid usage, leveraging UK Biobank data (n=15,285). We analyze the interaction effects of sex, smoking behavior, and noise exposure (music and workplace) with genetic variants using logistic regression models. Our results reveal significant genetic loci associated with environmental risk factors, supporting the hypothesis that ARHI risk is modified by both genetic susceptibility and external exposures. Chromosome-wise analysis identifies key regions that may contribute to these interactions, underscoring the importance of incorporating environmental factors in genetic studies of ARHI. Future work will focus on validating these findings through gene mapping, pathway analysis, and cross-referencing with known disease associations to better understand the biological mechanisms underlying ARHI development.