2024 Biostatistics Practicum/APEX Symposium

You can view the titles and abstracts for each session below, organized by topic and room number.

Welcome & Keynote Speaker (12:00 - 1:00)

Keynote Speaker: Nadiya Pavlishyn (Class of 2019)

Nadiya Pavlishyn is a 2019 graduate of the Mailman School of Public Health, Biostatistics MS, Theory & Methods track. She applied to Mailman from Stony Brook University as an Applied Math & Statistics and Economics double major. Before she joined our MS program, she was also a proud alum of our department's BEST program in 2015. She has worked as a Research Data Associate at the NYU Department of Internal Medicine, as the Research Coordinator & Statistician at the CUMC Department of Radiology, and as a Senior Healthcare Data Scientist at Aetion. Currently, she is the Associate Director of Data Science at Aetion. Come hear about Nadiya's time at Mailman, her professional journey, and her words of advice for our graduating class as they embark on their own professional pursuits!

Session 1 (1:00pm - 2:00pm)

Bayesian Statistics (HSC LL207)

Zirui Zhang
Comprehensive metabolomic and proteomic data analysis in Gulf War Illness with exercise tolerance tests

Gulf War Illness (GWI) is an unexplained illness occurring in veterans of the 1991 Gulf War, with symptoms including fatigue and musculoskeletal pain. Investigating altered proteomic and metabolomic profiles with respect to exercise tolerance has the potential to yield insights into the exercise-induced symptom exacerbation and post-exertional malaise of GWI and to provide promising biomarkers. We characterized proteomic and metabolomic profiles in GWI using plasma samples collected in New Jersey and Wisconsin from participants before exercise, immediately after exercise, and 24 hours after exercise (n = 68: 28 controls, 40 cases). We analyzed associations using adjusted linear mixed effects models with Bayesian analyses. Chemical enrichment analyses (ChemRICH) and Ingenuity pathway analyses were performed to determine altered chemical clusters and biological pathways. We also employed machine learning algorithms to assess the utility of proteomics and metabolomics as GWI biomarkers at baseline (before exercise). We identified GWI associations with a variety of chemical compounds, including isocitric acid, alpha-ketoglutarate, and lactic acid, and with metabolite clusters, including triglycerides, ether-linked lipids, and ceramides, that are consistent with inflammation and energy supply. Machine learning classifiers further distinguished GWI cases from controls using baseline profiles. These findings may provide new insights into the mechanism of exercise-induced symptom exacerbation and post-exertional malaise in GWI and have implications for the discovery of biomarkers that may enable early detection and intervention.
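
A sketch of what one such adjusted Bayesian mixed model might look like for a single analyte, using the brms package in R; the formula and variable names are illustrative assumptions, not the authors' code:

library(brms)
# One metabolite/protein at a time: case status x exercise timepoint,
# adjusting for covariates, with a random intercept per subject
fit <- brm(log_abundance ~ case_status * timepoint + age + sex + site +
             (1 | subject_id),
           data = omics_long, family = gaussian(), chains = 4, iter = 4000)
summary(fit)  # posterior summaries for the case-by-exercise effects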

Baoyi Feng
Impact of collaborative care models on depression screening and treatment in primary care settings across New York State

Depression is a common mental health issue. While collaborative care (CC) programs have been shown to be effective in improving depression screening and treatment in primary care settings, there is still much to learn about what it takes to maintain the effectiveness of these programs in real-world settings, as well as about the determinants of variation in long-term CC fidelity. In this study, we examined changes in depression screening and improvement rates among 81 healthcare clinics in New York State (NYS) before and after strategies were put in place to sustain CC programs following COVID-19, from 2021 to 2022. Data were collected from the Collaborative Care Medicaid Program administered by the NYS Office of Mental Health. We used geographic heat maps to show the spatial distribution of CC fidelity (depression screening, treatment optimization, and engagement rates), reach, and depression improvement rates in participating CC clinics across counties and cities in NYS. To assess factors associated with depression treatment improvement rates before and after the sustainability strategies, we fit a Bayesian hierarchical spatial model accounting for both correlations between clinics in the same county and spatial correlations between adjacent counties. Further, we examined the mediation effects of improved CC fidelity (e.g., depression screening, engagement, and optimization rates) on depression improvement rates. This study is important because it evaluated the impact of CC programs on depression screening and treatment in primary care settings and provided insights into what made CC programs more effective in some clinics than others following the COVID-19 mental health crisis.

Charles Chen
Comparison of Different Setups for a Novel Bayesian Framework for Model Contamination in Serial Dilution Immunoassays

This study presents a novel Bayesian framework for addressing model contamination in serial dilution immunoassays, a tool for quantifying indoor allergens. Leveraging data from the New York City Neighborhood Asthma and Allergy Study, we compare the performance of four-parameter logistic (4pl) and five-parameter logistic (5pl) models in the context of Enzyme-Linked Immunosorbent Assay (ELISA) and Multiplex Array for Indoor Allergens (MARIA) plates. Our Bayesian approach demonstrates robustness against outliers and measurements below the limit of detection (LOD), challenges that traditional calibration methods often struggle to manage effectively. Key findings indicate that the choice between 4pl and 5pl models yields comparable calibration curves and concentration estimates, underscoring the framework's versatility. Additionally, we explore the impact of outlier removal and assess the probability of sample contamination, revealing the framework's capacity to maintain reliability across varying assay platforms. Our results affirm the Bayesian method's potential to enhance the accuracy and efficiency of allergen quantification, offering a substantial improvement over classical calibration methods.
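
A minimal sketch of the 4pl calibration curve at the core of such models, with the analytic inverse used to read concentrations off observed responses (parameter names are illustrative):

# 4PL curve: c = lower asymptote, d = upper asymptote,
# a = inflection concentration (EC50), b = slope
fpl <- function(x, c, d, a, b) c + (d - c) / (1 + (x / a)^b)

# Inverse recovers concentration from an observed response y
fpl_inverse <- function(y, c, d, a, b) a * ((d - c) / (y - c) - 1)^(1 / b)

curve(fpl(x, c = 0.05, d = 2, a = 10, b = 1.2), from = 0.1, to = 1000, log = "x")

The 5pl variant adds an asymmetry parameter to this curve.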

Zekai Jin
Mediation Analysis of Brain Connectivity and Apathy under Hierarchical Drift Diffusion Modeling

Apathy is a major contributor to decreased quality of life for Parkinson's Disease (PD) patients. To understand the neuropsychological mechanism of apathy, our study hypothesized that apathy is caused by abnormal changes in the brain, and that this effect is mediated by decreased reward and effort sensitivities. As the main data analyst in this study, I proposed a hierarchical Bayesian model, implemented in R with Stan, incorporating the traditional drift diffusion framework and a multiple-mediation network. Using the dataset collected in the study, my model identified a mediation pathway from STR-ACC connectivity through effort sensitivity to the DAS executive sub-score in PD patients. This finding, along with other methods in the study, will be published soon.


Cancer Research (HSC LL108A)

Jiazhen Liu
APEX Internship Report: Full Process of RNA-Seq Data Analysis

The project focuses on the critical role of RNA sequencing (RNA-seq) data analysis within the broader context of Project 1604, a rigorous investigation into the tumor microenvironment with a particular emphasis on the impact of potassium ion concentrations on tumor progression and immune response modulation.

Jiong Ma
Challenges and Strategies in Evaluating CAR-T Therapy: Navigating the Discrepancy Between Progression-Free and Overall Survival Outcomes

CAR T-cell therapy, an innovative immunotherapy approach, leverages a patient’s immune system to fight cancer by genetically modifying their T cells to target and destroy cancer cells. This method has shown significant efficacy in treating hematologic malignancies, yet it introduces complex challenges in clinical trial design and data interpretation, particularly regarding outcome measures. Traditionally, Overall Survival (OS) is the gold standard for regulatory approval, offering a comprehensive assessment of a therapy's safety and efficacy. However, CAR-T studies often prioritize Progression-Free Survival (PFS) as the primary endpoint due to its earlier assessability, which is critical for hastening approvals in severe hematologic conditions. Despite PFS's utility, recent evidence suggests a troubling lack of correlation with OS, with some cases indicating PFS improvements may not translate into OS benefits, and could potentially harm long-term survival. This discrepancy raises concerns about the reliability of PFS as a surrogate endpoint and underscores the need for our investigation into its causes. Our project aims to explore the factors contributing to this divergence and to develop a nuanced strategy for accurately estimating and testing the effects of CAR-T treatments. By addressing these challenges, we hope to refine the assessment of CAR-T therapies, ensuring that trial designs and endpoints more accurately reflect their true clinical value.

Yifei Zhao
DNAGmPT: a Transformer Model Predicting DNA Methylation from Sequence and Gene Expression

Extensive associations have been identified between DNA methylation (DNAm) and gene expression alterations, which may harbor valuable information for targeting various diseases, particularly cancer. Current methods for profiling DNAm are limited, as they either fail to measure gene expression or suffer from low coverage and data quality. Capitalizing on the connection between gene expression and DNAm could surmount these issues. Nonetheless, previous models for DNAm prediction are not widely applicable due to several constraints: (a) reliance on partially measured methylation levels as inputs, which defeats their purpose since it requires obtaining specific sample measurements prior to model application; (b) inadequate integration of CpG sequences with gene expression and other pertinent features, despite their collective influence on methylation; and (c) a tendency to base the prediction of methylation levels solely on measured CpG sites and specific gene expression data. Addressing these challenges, we introduce DNAGmPT, a novel transformer-based model designed to predict DNAm levels using gene expression and CpG sequence. Inspired by advances in transformers and the inherent similarities between natural language and biological sequences, this innovative approach explores the 'grammatical' rules of biological sequences. DNAGmPT uniquely forecasts DNAm based solely on the gene expression profile of the sample and the adjacent CpG sequence, eliminating the need for prior methylation measurements. This sequence-based feature extends the model's predictive capability to previously unanalyzed CpG sites. By applying cross-validation to the ENCODE dataset spanning 37 tissues and TCGA data encompassing 33 cancer types, we demonstrate that DNAGmPT can reliably predict the DNAm landscape.

Yimin Chen
Assessing the Predictive Value of Pathologic Complete Response for Survival in Triple-Negative Breast Cancer Cases

Triple-negative breast cancer (TNBC) presents a unique challenge in oncology, characterized by its aggressive nature and lack of hormone receptors, making it resistant to standard hormone therapy treatments. This study delves into the prognostic value of pathologic complete response (pCR) following neoadjuvant therapy as a surrogate marker for long-term outcomes in TNBC, such as overall survival (OS) and event-free survival (EFS). Utilizing data from twelve distinct studies spanning the period before and after 2020, we employed a linear regression model that integrates median age, lymph node stage, and tumor size stage to enhance the accuracy of outcome predictions. Neoadjuvant therapy's effectiveness, traditionally assessed through pCR—defined as the absence of invasive cancer cells in the breast and lymph nodes at surgery—serves as a pivotal measure of treatment success. This study underscores the FDA and European Medicines Agency's recognition of pCR as a potential surrogate endpoint that could expedite the approval of new therapies for early-stage, high-risk breast cancer cases. Our findings reveal a statistically significant association between achieving pCR and improved OS and EFS, suggesting that pCR can serve as a reliable predictor of long-term survival in TNBC patients. The analysis further indicates the potential for leveraging pCR in developing predictive models and tailored treatment strategies aimed at enhancing survival rates and treatment efficacy in TNBC. By highlighting the critical role of surrogate endpoints like pCR in the evaluation of neoadjuvant therapy's effectiveness, this study contributes valuable insights into the ongoing efforts to improve the management and prognosis of TNBC. The implications of these findings extend beyond clinical practice, informing regulatory, healthcare decision-making, and funding frameworks regarding the potential benefits of new therapeutic interventions in TNBC.

Causal Inference and Environmental Health (HSC LL 210)

Ziyue Yang
Impact of heat alerts on different populations in the US

The US National Weather Service (NWS) issues heat alerts to communicate the risk of high temperatures to the public and to local government officials. The US government has issued heat alerts throughout the past 10 years, yet there has been little previous research on whether heat alerts reduce mortality in a given area. As a result, there is rising attention among researchers to whether government heat alerts produce a statistically significant reduction in mortality, as well as to the reasons the effectiveness of heat alerts differs across demographic areas. The study focuses on three specific aims: first, to identify the best heat metric for predicting heat-related death by collecting weather data from government websites; second, to examine the relationship between heat alerts (a binary variable, 0 or 1) and the mortality rate; and third, to assess how the causal effect of NWS heat alerts varies across community-level demographics. For aims 2 and 3, the statistical machine learning method causal forest will be used to test the causal effect of the exposure on the outcome, and logistic regression may be used for the relationship between heat alerts and the reduction in mortality rate. Data for this study will be collected from official government websites and public questionnaires. This study will give the government a better understanding of the weather index for issuing heat alerts, as well as advice on which demographic areas to target for maximum effect.
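
As a sketch of the planned causal forest analysis, using the grf package with toy data (all variable names and the data-generating process here are illustrative assumptions):

library(grf)
set.seed(1)
n <- 2000
X <- matrix(rnorm(n * 5), n, 5)                # community-level covariates (toy)
W <- rbinom(n, 1, 0.5)                         # heat alert issued (0/1)
Y <- rnorm(n, mean = -0.2 * W * (X[, 1] > 0))  # mortality outcome with a heterogeneous effect
cf <- causal_forest(X, Y, W)
average_treatment_effect(cf)                   # overall alert effect
head(predict(cf)$predictions)                  # community-specific effect estimates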

Lingke (Lincole) Jiang
Sample Size Calculations for High-Dimensional Mediation Analysis

High-dimensional mediation analysis (HIMA) is an extension of unidimensional mediation analysis that includes multiple mediators. It has been used in various health-related research settings, including evaluating indirect omics-layer effects of environmental exposures on health outcomes, evaluating the role of DNA methylation in the pathway from smoking exposure to postnatal birth weight, and assessing the role of brain locations in mediating the relationship between applying a thermal stimulus and self-reported pain, among many others. Despite a plethora of recent research in high-dimensional mediation analysis, however, the literature on evaluating sample sizes to achieve sufficient power in HIMA remains limited. We provide a general guideline for evaluating sample sizes in the context of HIMA, with straightforward software for implementation in the form of an R package.

Jasmine Niu
Simulation study of estimating causal effects of continuous exposures

Continuous treatments or exposures often arise in observational studies, leading to effects commonly described by dose-response curves. A simulation study compared different methods for estimating the causal effects of continuous exposures. Data were generated using a known process for single time points and longitudinal scenarios. Nonparametric and semiparametric estimators were applied with the “npcausal” package in R to estimate continuous treatment effects (i.e., dose-response functions). The performance of these estimators was evaluated and compared in terms of their integrated bias and mean squared errors. 
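
A minimal sketch of such an analysis, assuming the ctseff() interface of the npcausal package (github.com/ehkennedy/npcausal); y, a, and x stand for the outcome, continuous exposure, and covariate matrix:

library(npcausal)
# Doubly robust estimation of the dose-response curve E[Y(a)]
ce <- ctseff(y = y, a = a, x = x, bw.seq = seq(0.2, 2, length.out = 100))
plot.ctseff(ce)  # estimated curve with pointwise confidence bands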

Hongpu Min
How the causal effects of NWS heat alerts vary across characteristics of communities

The US National Weather Service (NWS) issues heat alerts in advance of forecast extreme heat events to communicate these risks to the public and local government to reduce heat-related mortality and morbidity. This study aims to assess how the causal effects of NWS heat alerts vary across characteristics of communities. The study combines eight datasets: variables are grouped, means are calculated, variable types are converted, tables are pivoted to wide format and joined, and variables are dropped if they are similar measures of other variables or cover locations and times different from the majority. With the cleaned data, I run a regression simulation, repeated 50 times, using regression tree, bagging, random forest, boosting, and BART methods. I compare average MSEs and check the variable importance in each model. The results show all models have an MSE lower than 0.1, and random forest has the lowest MSE, which suggests that it fits best. The variable importance analysis shows population size is the most important variable, while Black percentage, population density, land area, and Asian percentage are the next most important across all models. Further research is needed to find the specific influence of the important variables and to explore methods other than tree models.
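
A toy version of the repeated train/test comparison for one of the models (random forest), with the analysis dataset dat and outcome alert_effect as assumed placeholders:

library(randomForest)
set.seed(1)
mse <- replicate(50, {
  idx  <- sample(nrow(dat), 0.7 * nrow(dat))       # 70/30 split each repetition
  fit  <- randomForest(alert_effect ~ ., data = dat[idx, ])
  pred <- predict(fit, dat[-idx, ])
  mean((pred - dat$alert_effect[-idx])^2)          # test MSE
})
mean(mse)            # average MSE over 50 repetitions
# varImpPlot(fit)    # variable importance for the last fit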

Clinical Trials (HSC LL 108B)

Yi Huang
Enhancing Clinical Site Selection: Developing an internal tool for Statistical Analysis in the Drug Review Process

During the New Drug Application (NDA)/Biologics License Application (BLA) review process, statistical analysis is required in preparation for Site Selection meetings. The purpose of these meetings is to confirm if there is a need for clinical site inspections and to choose potential clinical investigator sites for inspection before the application filing meeting. However, the existing tools could be time-consuming and require additional resources. To address this challenge, the project proposes the development of a new internal application to enhance the efficiency of clinical site selection for the drug review process within the Office of Biostatistics, Center for Drug Evaluation and Research (CDER), FDA. The project enhances the site selection process by integrating statistical analyses, including resampling and other statistical tests, to enable detailed analysis by site and region. It starts with continuous and binary endpoints, employing analyses such as the Mixed Model for Repeated Measures (MMRM) and Cochran-Mantel-Haenszel (CMH) tests. This will benefit multiple divisions within the Office of Biostatistics as the need for statistical analysts during NDA/BLA reviews continues to grow.

Emily Potts
Are Safety Lead-In Phase II Clinical Trials Really Safe?

Safety lead-in (SLI) phase II trials, in which one or two dose de-escalations are permitted based on toxicities from the first six patients at a dose, are common in oncology. However, despite the small number of dose levels considered, the operating characteristics of these designs have not yet been carefully evaluated. This simulation study aims to examine an SLI of six-patient cohorts followed by a Simon Optimal Two-Stage, in comparison to alternative approaches, for two and three candidate doses. Alternatives include conducting the Continual Reassessment Method (CRM), Bayesian Optimal Interval (BOIN), or 3+3 designs to first estimate the maximum tolerated dose (MTD) and subsequently conduct a Simon. Across all designs, Bayesian toxicity monitoring was added to phase II to limit the probability of toxicity to 25% and all patients treated at the estimated MTD were included in the efficacy evaluation. Scenarios were varied on combinations of efficacy hypotheses; data generation strategies; and probability of toxicity at each dose level. Operating characteristics, based on 5000 trial simulations for each scenario, included the probability of correctly selecting the MTD, stopping for toxicity, rejecting or failing to reject the null hypothesis, and total sample size. The present study suggests that SLI phase II trials can often be unsafe. However, using the CRM or BOIN first to select a dose and including toxicity monitoring in phase II can increase the probability of selecting the correct dose and ensure safe implementation.
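
For reference, the phase II component can be designed in R with the clinfun package; the rates below are illustrative, not necessarily those used in the simulation study:

library(clinfun)
# Simon optimal two-stage design: null response rate 20%,
# alternative 40%, alpha = 0.05, power = 80%
ph2simon(pu = 0.20, pa = 0.40, ep1 = 0.05, ep2 = 0.20)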

Xiyuan (Angel) Ji
A Multi-center, Single-Arm, Phase II Clinical Trial of Combination Therapy in Patients with Malignant Peripheral Nerve Sheath Tumors

Malignant peripheral nerve sheath tumors (MPNSTs) are highly aggressive soft-tissue tumors with a poor prognosis (1). MPNSTs arise in connective tissue surrounding peripheral nerves. There exists an unmet medical need to develop new therapeutic agents for MPNSTs. Pre-clinical studies of therapy X have provided evidence of target inhibition and tumor growth suppression. This project is a Phase II clinical trial data analysis that aims to investigate the preliminary efficacy of the therapy (X) in patients with unresectable or metastatic MPNSTs, as well as to evaluate its safety profile. The primary endpoint is median progression-free survival (mPFS). The secondary endpoints are objective response (OR) and median overall survival (mOS). Response endpoints were assessed using Response Evaluation Criteria in Solid Tumors (RECIST) 1.1. Data analysis and visualization were conducted in R. Progression-free survival and overall survival were estimated using the Kaplan-Meier method. The waterfall plot and swimmer plot were generated using Biostatistics, Epidemiology, and Research Design (BERD) web applications.
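
A minimal sketch of the Kaplan-Meier estimation described above, using the survival package (variable names are assumptions):

library(survival)
fit <- survfit(Surv(pfs_months, progressed) ~ 1, data = trial)
summary(fit)$table["median"]                       # median PFS
plot(fit, xlab = "Months", ylab = "Progression-free survival")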

Shengzhi Luo
The impact of Metabolic Fatty Liver Disease (MFLD) in the physical examination population

This research investigates the impact of metabolic fatty liver disease (MFLD) in the physical examination population. Over the last decades, with economic development in China, living standards have risen markedly. However, the occurrence of metabolic syndrome has also increased year by year, imposing a huge economic burden on the country. The danger of fatty liver disease is that it can easily lead to cardiovascular disease, hepatic fibrosis, and liver complications. Therefore, early detection and prevention in groups at high risk of MFLD are necessary. A total of 14,664 people who underwent physical examination at the Physical Examination Centre, West China Hospital, Sichuan University between January 2018 and December 2021 were selected as research subjects. The subjects were divided into an MFLD group and a non-MFLD group according to whether they had MFLD. The differences between groups were compared, and logistic regression was conducted to analyse the risk factors for MFLD, thereby establishing a nomogram prediction model. The predictive performance of the model was validated and evaluated with the concordance index and the calibration curve. Among the 14,664 subjects who underwent physical examination, 4,013 were MFLD patients, giving an overall prevalence of 27.37%, with significantly higher prevalence in men than in women. Compared with the non-MFLD group, the MFLD group had increased levels of glucose (GLU), total cholesterol (TC), triglyceride (TG), low-density lipoprotein cholesterol (LDL-C), aspartate transaminase (AST), alanine transaminase (ALT), gamma-glutamyl transpeptidase (GGT), and uric acid (UA), while the high-density lipoprotein cholesterol (HDL-C) level was decreased. The results of logistic regression analysis showed that male sex, age, body mass index, GLU, TG, and hypertension were all independent risk factors for MFLD, while HDL-C was a protective factor. The risk factors were used to establish a nomogram risk prediction model, and the C-index and calibration curve showed that the nomogram model produced good predictive performance. The receiver operating characteristic (ROC) curve showed that the nomogram model had good predictive value for the risk of MFLD. We found a relatively high prevalence of MFLD in the physical examination population, and the nomogram model established from routine physical examination screening can provide indications for the clinical screening and analysis of high-risk patients, serving as an early warning for the high-risk population.
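
A hedged sketch of the logistic regression and nomogram workflow, using the rms package (variable names are illustrative, not the study's code):

library(rms)
dd <- datadist(exam); options(datadist = "dd")
fit <- lrm(mfld ~ sex + age + bmi + glu + tg + hypertension,
           data = exam, x = TRUE, y = TRUE)
plot(nomogram(fit, fun = plogis, funlabel = "Risk of MFLD"))
fit$stats["C"]   # concordance index (C-index)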

Data Analysis (HSC LL 109A)

Emmanuel Latio
Merging of Departmental Datasets to Analyze Student Trajectories

Across the Office of Field Practice, Office of Career Services, and Office of Educational Initiatives is an effort to summarize data collected on Mailman students pertaining to summer APEX roles, job prospects, and basic data collected upon admission. The aim of this project was to launch efforts to put data from these different offices together and see what insights could be learned. I utilized SAS to clean data and rename variables, and attended meetings with various faculty members at Mailman to resolve questions on the data, missing observations, inconsistencies, etc. There are no results or findings to report, other than the difficulty of coordinating and undertaking a huge cross-collaborative project while working with differing entities that store data differently and do not communicate in a way that would make the data synthesis process easier. Due to the complexity of this task, it will need another student, or students, to continue. What this role emphasized to me overall is the importance of test-running surveys, formatting survey questions to reduce the chance that a response must be omitted in cleaning, and having a statistics/data-minded person on huge organization-wide programs like these, along with a clear codebook, as the biggest hurdle faced in the APEX role was the inability to put datasets together from varied data storage methods.

Pei Liu
Moderation effect of persistent homology-based functional connectivity on cognitive decline in Alzheimer’s Disease

Network analysis has been widely used to understand the complex interactions and organization of the human brain. Resting-state functional networks, often studied for cognition and aging, are typically analyzed through network analysis under a modularity assumption. However, it is more plausible to assume that a core-periphery or rich-club structure accounts for brain functions, where the hubs are tightly interconnected to allow for integrated processing [1]. To address this, we introduced persistent homology-based functional connectivity (PHFC) indices, including backbone strength (BS), backbone dispersion (BD), and cycle strength (CS), to quantify integrated processing patterns. BS reflects overall functional integration, BD indicates differences in critical information flow, and low CS suggests strong information flow through the backbone rather than through additional cycles, in the study of cognitive aging [2]. Using large public data, our study investigates the role of PHFC indices in Alzheimer's disease pathology, revealing their potential advantages beyond traditional measures.

Tiffany Zhou
Exploring the Impact of Enhanced Skin-to-Skin Contact Through a Novel Medical Device on Maternal and Infant Health Outcomes

Postpartum depression (PPD) affects approximately 20% of new mothers, significantly impacting maternal and infant health. Skin-to-skin contact (SSC), or kangaroo mother care (KMC), has been shown to improve a range of neonatal outcomes and maternal psychological well-being. Despite its proven benefits, global implementation of SSC remains low. This study introduces a novel medical device designed to facilitate daily SSC, aiming to evaluate its impact on reducing maternal depressive symptoms and improving infant health and development. A randomized controlled trial (RCT) followed by a quasi-experimental phase will assess the efficacy of the medical device for SSC. Participants will be first-time mothers with full-term infants, randomly assigned to an intervention group using the device or a control group receiving standard care. Primary outcomes include maternal depression, anxiety, and stress levels, while secondary outcomes focus on infant weight gain, infection rates, and developmental milestones. Data will be analyzed using chi-square tests, t-tests, intention-to-treat analysis, logistic regression, and ANOVA. The study anticipates that mothers in the intervention group will exhibit significantly lower levels of depression, anxiety, and stress compared to the control group. Additionally, infants in the SSC group are expected to demonstrate superior health outcomes, including increased weight gain, reduced infection rates, and enhanced developmental scores. By facilitating easier and more effective SSC, the novel medical device has the potential to significantly reduce maternal postpartum depression and improve infant health outcomes. This research could provide a strong evidence base for integrating advanced SSC practices into standard postnatal care, offering a scalable, cost-effective intervention to address key challenges in maternal and infant health.

Feng Yan
g.bkmr R command for automated implementation of a BKMR approach for time-varying environmental mixtures

Exposure to environmental chemicals can be harmful, and currently there is no published R package that allows for multiple time-varying confounders when interest lies in estimating the health effects of mixtures at different time points. One currently available method is applicable to a single time-varying confounder in a two-time-point scenario, or to two time-varying confounders in a three-time-point scenario. However, the code is limited when more complex real-life situations involve more confounders or more exposure time points. To tackle these limitations, the R command g.bkmr was developed based on the Bayesian Kernel Machine Regression approach; it asks users to specify several inputs as instructed, and the output gives the average treatment effect comparing the intervention and reference levels. Simulations were run to ensure the validity of the command. The command assumes that all the confounders are independent of each other, with no interaction among them, but we may develop other methods to accommodate situations where confounders are dependent on each other in the future.
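
The g.bkmr command itself is not reproduced here; below is a minimal sketch of the single-time-point BKMR fit it builds on, via the bkmr package (data objects are assumed):

library(bkmr)
set.seed(1)
fit <- kmbayes(y = y, Z = as.matrix(exposures), X = as.matrix(confounders),
               iter = 10000, varsel = TRUE)  # kernel machine fit with variable selection
ExtractPIPs(fit)  # posterior inclusion probabilities for mixture components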

Health Policy (HSC LL 107)

Yiting Zhang
Validity and Reliability of an Evidence Metric for Public Policy Standards

There are many standards and methods for evaluating effectiveness and reliability. THEARI is an evidence metric for public policy standards. The main aim is to conduct research and discussion on the THEARI method and to evaluate the theory through literature reading, literature data extraction, organizing contact information, and collecting and analyzing data. Data come from 252 surveys of policy professionals in different fields, indicating their views on which types of evidence are most crucial and which are least useful in their policy making. We used Tableau to analyze these data. The overall mean rating of each evidence type and the clustering of means were calculated, and the mean distribution was split by scientific discipline. Preference for the old versus the new version of THEARI was analyzed, along with the evidence types rated highest impact and no impact, and an interactive table of percentages for each type of evidence was produced. This research provides powerful evidence for public health policy standards, which will guide public health policy-makers in the future, especially in emergencies where there is little practical evidence to draw on. With such guidance, limited resources can be used more efficiently, mistakes and risks can be avoided as much as possible, and benefits can be maximized.

Haoyang Wang
Predicting ICU Patients’ Mortality Outcome with Biomarkers’ Data by Applying Biological Aging Algorithms

Biological age has become an emerging and significant indicator for predicting and assessing individuals' overall health, chronic diseases, morbidity, and mortality. However, there is limited research on how biological aging algorithms can predict health-related outcomes in Intensive Care Unit (ICU) patients. To address this knowledge gap, we applied the BioAge package that we developed to build customized biological aging algorithms, using Levine PhenoAge with ICU patients' blood draw biomarker data. This retrospective study showed that patients who died in the hospital or within 30 days after discharge had a higher biological age. The results of multinomial logistic regression analyses show that PhenoAge is significant in predicting ICU patient outcomes such as discharged alive or died in the hospital. Receiver operating characteristic (ROC) curves were used to compare the performance of PhenoAge and existing ICU risk-prediction algorithms in predicting ICU patient outcomes.

Ruyi Chen
Analyzing Prescription Trends Across Demographics and States: A Biostatistical Approach to Pharmaceutical Marketing Strategies

In this practicum project, we investigate the multifaceted influences on prescription drug distribution across different demographics and professional practices. Utilizing insurance data encompassing variables such as patient ID, claim ID, health care professional's ID, state, professional categories, tumor type, and prescription, this study aims to elucidate the patterns and factors impacting prescription numbers. By employing linear regression analysis and visual data representation through bar charts and histograms, the project seeks to identify state-specific and professional category-specific trends in prescription practices. The objective is to provide actionable insights that could guide the modification of marketing strategies by highlighting popular prescriptions in various states. This endeavor not only leverages biostatistical and data science methods to address health-related research questions but also contributes to the optimization of healthcare delivery and pharmaceutical marketing approaches. The dataset, provided by clients and comprising comprehensive insurance information, facilitates a targeted analysis while ensuring the privacy and confidentiality of proprietary data. This project embodies the practical application of statistical analysis and data science within the health sector, striving for significant scientific findings that support the development of informed health policies and marketing strategies.

Isabella Patino
Analyzing Salary and Job Requirement Disparities between Government and Private Sector Public Health Occupations: Utilization of Lightcast Job Postings

Even though the public health workforce is critical for protecting community health in the US, recent evidence underscores disparities in our workforce, specifically regarding the government sector. These disparities range from funding to recruitment challenges to scarce federal data and documentation. This study uses job postings of public health occupations from Lightcast, a large-scale database that collects US job postings, to compare salary and job requirements (education and experience) between the government and private sectors. Using this comprehensive dataset, 44 public health occupations were collected for each of the two sectors, with 16,284 job postings from the government sector and 12,609,441 from the non-government sector. We conducted an interval regression to highlight significant salary disparities between the sectors for each occupation. For education and experience requirements, we fit a partial proportional odds model. Results revealed that 24 occupations paid significantly less in government than in the private sector, mainly encompassing management and computer-related occupations. Six roles, such as counselors and health education roles, are paid significantly more in the government sector than in the private sector. As for education disparities, 37 jobs in government were more likely than those in the private sector to require only a Bachelor's degree or lower, while only 3 occupations – nurse practitioner, physician, and "physicians, all other" – were more likely to require a Master's degree or higher in government. Our research underscores the importance of utilizing real-time job posting data to inform strategic workforce planning and advocate for fair compensation in the public health sector. By shedding light on the lower salaries and job requirements in the government public health workforce, we aim to use this research to help create a workforce more prepared for public health crises.

Machine Learning (HSC LL 109B)

Shun Xie
User Churn Model and the Exploration of the Effect of Quarantine Status on Churn Likelihood

Following the pandemic, Yum China experienced a notable decrease in its user base, emphasizing the importance of retaining users to maintain profitability. In response, the team initiated a churn prediction project aimed at forecasting the likelihood of users ceasing to buy products from KFC. This paper summarizes the project and discusses the potential effect of quarantine on churn likelihood. The paper also considers the effects of other variables, which could inform the development of future strategies to reduce churn. Instead of using traditional business churn models such as RFM, which capture only three dimensions (recency, frequency, and monetary value), I consider a machine learning approach that uses a feature matrix as its input and aims to predict user churn probability. As a showcase in the report, I retrieved data on 500 users with 117 features to conduct a classification analysis. Different classification methods were compared, and gradient boosting using the lightGBM package in Python was preferred. Moreover, based on model explanation techniques, I discovered that the difference in KFC purchase amount before the last purchase is the most important factor in predicting churn. Quarantined users also show an increase in churn rate compared to other users. The outcome, however, might exhibit bias as a consequence of the holiday effect and a limited sample size. In the future, we may also improve performance by extracting coupon features using natural language processing (NLP).
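
An illustrative version of the gradient boosting step, shown here with the R interface to lightGBM (the project used the Python package; the feature matrix X and 0/1 label churned are assumed placeholders):

library(lightgbm)
dtrain <- lgb.Dataset(data = as.matrix(X), label = churned)
params <- list(objective = "binary", metric = "auc", learning_rate = 0.05)
fit <- lgb.train(params = params, data = dtrain, nrounds = 200)
pred <- predict(fit, as.matrix(X_new))  # predicted churn probabilities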

Yuze Yuan
Using Real-World Data to Predict High-Demand Clothing Materials: A Strategy for Healthier and More Efficient Inventory Management

Today, there is a growing awareness of the impact of clothing on human health, such as sun protection and the potential adverse effects of certain materials on the skin. In this study, I built a model to predict how combinations of healthy materials and other features can lead to high-demand products, based on real vendor data. My study utilizes data from Aloyoga, comprising 70,000 product records including the color, materials, and patterns of products, obtained through web scraping and stored in JSON format. The top 25% of products by sales volume are identified as "hot designs." Various machine learning models, including XGBoost and SVM, are employed to analyze the data. Model precision is about 0.60, recall about 0.72, and F1 score about 0.66. The models still need optimization, including further feature engineering to add more features and communication with designers to learn about the real market.

Yuanhao Zhang
A Data-Driven Approach to Predict Hot Clothing Design in the Post-COVID Era

The garment industry has been greatly affected by COVID-19. In the post-COVID-19 era, how best to recover to pre-epidemic levels is an urgent problem to be solved. Our project starts from the effects of clothing materials on human health together with other basic clothing features such as size, style, and color, and our goal is to use models to predict which clothes are more in line with buyers' consumption concepts and personal preferences. The project involves conducting data exploration and developing machine learning models. Specifically, we begin by fitting random forest, gradient boosting, and linear regression models to characterize clothing material features and other basic clothing features, and to predict the probability that new clothes will sell well, using a high-quality dataset with comprehensive product information and annotations. The accuracy of these models is approximately 70% after optimization. However, clothing styles change quickly, and past data cannot fully predict future products, so we should try to understand product update trends and incorporate the error this factor may produce into the model.

Shaohan Chen
Study and Implementation of Kernel Ridge Regression

This practicum project studied kernel ridge regression and its implementation in both Python and R. Kernel ridge regression (KRR) is widely used in public health data analytics and clinical research scenarios. However, there are no straightforward, accessible tools in R to help implement and solve problems that require KRR. This project investigated how KRR is implemented in Python and how it could be implemented similarly in R. The performance of the KRR method in Python and in R was compared on simulated datasets. The results provide potential insights to help statisticians use the KRR method in R. Future improvements include extending our analysis to weighted regression problems.
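
A from-scratch sketch of KRR with a Gaussian kernel in R, illustrating the method studied (not the project's actual implementation); X is a numeric matrix, y a numeric vector:

# Gaussian (RBF) kernel matrix between the rows of A and B
rbf_kernel <- function(A, B, sigma = 1) {
  d2 <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
  exp(-d2 / (2 * sigma^2))
}
krr_fit <- function(X, y, lambda = 0.01, sigma = 1) {
  K <- rbf_kernel(X, X, sigma)
  alpha <- solve(K + lambda * diag(nrow(X)), y)  # (K + lambda*I)^{-1} y
  list(alpha = alpha, X = X, sigma = sigma)
}
krr_predict <- function(fit, Xnew) {
  rbf_kernel(Xnew, fit$X, fit$sigma) %*% fit$alpha
}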

Statistical Genetics (HSC LL 208B)

Anjing Liu
Molecular Quantitative Trait Loci Discovery Using Quantile Regression

Alzheimer's Disease (AD) continues to pose significant challenges in the medical field, with intricate genetic factors contributing to its complexity and variability among patients. Despite advances in genome-wide analyses, traditional linear regression (LR) often falls short in capturing the full spectrum of genetic influences, particularly those affecting trait distributions beyond the average. This study introduces quantile regression (QR) as a novel approach that explores the entire distribution of trait values to uncover molecular quantitative trait loci (QTL), genetic regions influencing phenotypic variation of a complex trait, that might be missed by LR. By focusing on AD-related genes, our research seeks to broaden the understanding of AD's genetic complexities. Utilizing datasets from the ROSMAP study (N > 400 individuals), we performed quantile QTL analyses to detect significant loci across various quantile levels. QR successfully identified novel loci in AD-related genes, revealing genetic variations that exhibit quantile-specific effects on trait distributions. These findings underscore the utility of QR in detecting genetic signals that are not apparent with LR, offering deeper insights into the genetic architecture of AD. However, the longer computational time compared with LR poses a limitation, potentially impacting the scalability and speed of broader applications. Despite this, the identification of quantile-specific loci opens new avenues for targeted therapeutic interventions, contributing significantly to the field of personalized medicine for AD.
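
A minimal sketch of a quantile-QTL scan at a single locus, using the quantreg package (variable names are assumptions):

library(quantreg)
taus <- c(0.1, 0.25, 0.5, 0.75, 0.9)
fit <- rq(expression_level ~ genotype + age + sex, tau = taus, data = qtl_dat)
summary(fit, se = "boot")  # quantile-specific genotype effects with bootstrap SEs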

Haochen Sun
Detecting and correcting discrepancies in summary statistics for fine-mapping studies

Inconsistencies in summary statistics, particularly between z scores and their correlations R, often estimated from different data sources, can significantly impact the results of statistical fine-mapping. Several methods have been proposed to address these discrepancies, based on widely accepted multivariate Gaussian models for z and R. Among them, DENTIST and its simplified version, SLALOM, are commonly used in practice. However, these methods have multiple shortcomings. Simulations have shown that DENTIST fails to improve the power for detecting true causal variants; rather, it improves the power of detecting the true number of causal variants. SLALOM exhibits low precision and still shows miscalibration in non-suspicious loci. Additionally, under the assumption of a single causal variant, SLALOM may exclude genuine signals as outliers in regions with multiple causal signals. In this paper, we propose augmenting the SuSiE (sum of single effects) RSS model by incorporating an outlier test during each IBSS iteration when a single effect is detected. The key advantage of this method is its ability to identify and address mismatch between the LD matrix and z scores near putative causal variants, and simulation results suggest that this quality control method is more effective than other methods. This method can be applied in various areas where summary statistics serve as input, enhancing the robustness and reliability of downstream analysis after quality control.

Zining Qi
Refined Approaches for Missing Data Imputation Enhance Quantitative Trait Loci Discovery in Multi-Omics Analysis

Handling missing values in multi-omics datasets is essential for a broad range of analyses. While several benchmarks for multi-omics data imputation methods have recommended certain approaches for practical applications, these recommendations are not widely adopted in real-world data analyses. Consequently, the practical reliability of these methods remains unclear. Furthermore, no existing benchmark has assessed the impact of missing data and imputation on molecular quantitative trait loci (xQTL) discoveries. To establish the best practice for xQTL analysis amidst missing values in multi-omics data, we have thoroughly benchmarked 16 imputation methods. This includes methods previously recommended and in use in the field, as well as two new approaches we developed by extending existing methods. Our analysis shows that no established method consistently excels across all benchmarks; some can even result in significant false positives in xQTL analysis. However, our extension to a recent Bayesian matrix factorization method, gEBMF, exhibits superior performance in multi-omics data imputation across various scenarios. Notably, it is both powerful and well calibrated for xQTL discovery compared to all the other methods. To support researchers in practically implementing our approach, we have integrated our extension to gEBMF into the R package flashier, accessible at https://github.com/willwerscheid/flashier. Additionally, we provide a bioinformatics pipeline that implements gEBMF and other methods compatible with xQTL discovery workflows based on tensorQTL, available at https://cumc.github.io/xqtl-pipeline/code/data_preprocessing/phenotype/p....

Yvonne Chen
Analysis of Spatial Architecture in Actinic Keratoses

Squamous cell carcinoma (SCC) is the second most common skin cancer in the US, while Actinic Keratosis (AK) is an abnormal growth of cells two stages before SCC. This project explored gene expression differences between normal cells and dysplastic cells at the AK stage to find significant genes that may slow down or prevent the development of SCC. Four human samples with Actinic Keratoses, prepared as formalin-fixed paraffin-embedded (FFPE) slides, are included in the project. Spatial Transcriptomics technology, including clustering analysis based on gene expression and differential gene expression analysis, was used, along with gene ontology and cell communication analysis, to filter significant genes from a comprehensive perspective. After analyzing the first batch of samples, two of them had low read fractions, which made them unsuitable for further analysis. Based on the two usable samples, a differentially expressed gene (DEG) list was constructed based on p-value and log2 fold change. We also performed Gene Ontology and cell communication analyses to gain comprehensive perspectives on the research question. Analysis of more samples is crucial to narrow down the list of significant genes that affect the development of SCC at the AK stage. To obtain a high read fraction on all samples, we should choose samples with large numbers of viable cells and construct a common gene list for further exploration.

Study Design (HSC LL209A)

Yichen Lyu
Macro for Win Ratio

This practicum is mainly about designing a macro program for the win ratio using SAS. The win ratio is the total number of winners divided by the total number of losers. A 95% confidence interval and P-value for the win ratio are readily obtained. If formation of matched pairs is impractical, an alternative win ratio can be obtained by comparing all possible unmatched pairs. The win ratio accounts for the relative priorities of the components and allows the components to be different types of outcomes. It can provide greater statistical power to detect and quantify a treatment difference by using all available information contained in the component outcomes, and it can also incorporate quantitative outcomes such as exercise tests or quality-of-life scores.
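
The practicum implements this as a SAS macro; for illustration, here is an R sketch of the all-unmatched-pairs win ratio for a single ordered outcome:

win_ratio <- function(treat, control) {
  wins   <- sum(outer(treat, control, ">"))  # pairs where the treated patient wins
  losses <- sum(outer(treat, control, "<"))  # pairs where the control wins
  wins / losses
}
set.seed(1)
win_ratio(rnorm(30, mean = 0.5), rnorm(30))  # toy example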

Jiaoyang Li
What Statistical Designs for Rare Disease Trial Proposals Improve Clinical Trial Readiness for Regulatory Approval? A Pilot and Feasibility Study

Approval of therapies for rare diseases (RDs) is extremely challenging, given inherently small RD patient numbers and clinical diversity within individual RDs. We design and generate a new dataset and use it to assess the feasibility of a major project to enhance the clinical trial readiness (CTR) of RD trial proposals to gain regulatory approval. Data are design features, statistical review details, and approval histories from US FDA non-cancer Orphan Drug approvals from Drugs@FDA for New Molecular Entities from 2014 to the present. We have created a new dataset that codes these data in analyzable form. We coded and analyzed 20 recent pilot cases from the final target population of approximately 200 RD approvals, and achieved 5 prespecified aims: 1) define CTR; 2) specify the needed variables and collect the specified data; 3) refine variable specifications and collection procedures to adequate levels; 4) store the data in a secure SQL database with a comprehensive Data Dictionary; and 5) evaluate these pilot data and identify hypotheses for testing in the final project. We present the pilot data in multiple tables, figures, and visualizations, showing high data quality, and identify hypotheses on pathways that may enhance CTR. We conclude that the final project is feasible. By identifying statistical design features that have enhanced CTR for RDs, and suggesting refinements that may further improve these designs in the future, it will provide a much-needed resource for the rare disease community.

Yuchen Hua
Prospective analysis of pancreatic cancer and meat intake based on 15 cohorts

Pancreatic cancer is now the seventh leading cause of cancer death around the world, with increasing incidence and mortality rates. The World Cancer Research Fund has suggested reducing red meat intake, with white meat as an alternative. Previous research studying various meat products and pancreatic cancer has reported positive, inverse, and null associations between pancreatic cancer risk and meat intake; the results of these studies were not consistent. In this study of 15 prospective cohorts, 6,685 pancreatic cancer cases were identified among 1,225,926 individuals. Multivariable study-specific hazard ratios (MVHR) and 95% confidence intervals were calculated using Cox proportional hazards models and pooled using a random-effects model, after adjusting for factors such as smoking habits, personal history of diabetes, alcohol intake, body mass index (BMI), and energy intake. Null associations were found for processed meat, poultry, and seafood intake and pancreatic cancer risk; though not significant, the associations for these three categories were positive in direction. Inverse associations were found for unprocessed red meat intake (MVHR = 0.97; 95% CI = 0.89-1.06; P-value = 0.007) and for beef, pork, and lamb intake (MVHR = 0.97; 95% CI = 0.87-1.08; P-value = 0.01). These results suggest that processed meat, poultry, and seafood intake are not associated with pancreatic cancer risk, while inverse associations are suggested for unprocessed red meat and for beef, pork, and lamb intake.

Junyan Zhu
DeepMed: whether gender disparities exist in income when education levels are the same

A recently published study found that gender disparities exist in the prevalence and treatment of heart disease, in that a smaller proportion of women underwent percutaneous coronary angiography compared to men. The result suggests that such gender inequality may also exist in other fields. This study examines whether a gender disparity exists in income when the education level is the same between the two genders. To explore the effect of gender on income after adjusting for education level, we used a novel method called DeepMed, which is based on deep neural networks (DNNs). DeepMed is similar to mediation analysis in that it decomposes the total effect into direct and indirect effects to unpack the underlying black-box causal mechanism, rather than focusing on the total effect. Compared to traditional mediation analysis, DeepMed reduces bias in estimating natural direct and indirect effects. The results show that women have lower salaries than men even when they have the same education level, indicating that potential gender discrimination exists in working conditions.

Coffee & Tea Break (2:00pm - 2:30pm)

Session 2 (2:30pm - 3:30pm)

Causal Inference (HSC LL207)

Ziqing Wang
R implementation of a multistate approach for stochastic interventions on a time-to-event mediator in the presence of competing risks

A new R command that estimates causal effects of stochastic interventions for a non-terminal time-to-event mediator on a terminal time-to-event outcome has been developed. This command can be applied to a health disparities research setting where the following two causal estimands are of interest: RD (Residual Disparity), the change in survival probability between the exposed group and the unexposed group, had the mediator distribution for all individuals been fixed to that of the unexposed group, and SD (Shifting Distribution Effect), the change in survival probability within the exposed group, had the mediator distribution been changed to that of the unexposed group. Large-scale simulations were performed on the C2B2 high-performance computing cluster to check the validity of the implementation.

Yuchen Zhang
Simulation Study in Mediation analysis with time-varying mediators and time-to-event outcomes accounting for competing risks

Simulation studies, computer experiments that involve creating data through pseudo-random sampling, are valuable for understanding the behavior of statistical methods. This simulation study examines Dr. Arce Domingo-Relloso's algorithm, developed in Dr. Valeri's lab, for estimating causal effects with longitudinal/time-to-event mediators and time-to-event outcomes in the presence of competing risks, while verifying the method's validity in diverse scenarios. Current simulation approaches struggle to handle this complex situation, and the true effects underlying simulated data are often unclear. In this simulation study, I generated exposures, baseline confounders, longitudinal mediators, time-to-event outcomes, and competing risks similar to the data structure used in Dr. Arce Domingo-Relloso's algorithm, according to preset parameters and distributions. The true total effect, true indirect effect through the mediator, and true direct effect in the simulated data can be obtained from the corresponding g-formula we derived and its accompanying algorithm. This simulation study allows us to assess properties of our methods, such as bias. Therefore, the study could enhance the robustness of mediation analysis techniques, advancing our ability to understand complex causal relationships in health-related research.

Riya Bhilegaonkar
Causal Inference Program Evaluation of Pediatric Transitions of Care (TOC)

To measure the impact of a pediatric inpatient readmissions care management program (Pediatric TOC) at a New York health insurance organization in reducing 30-day and 10-day inpatient hospital readmissions, we estimate the causal effect of, and association between, the intervention and hospital readmissions for same-cause and all-cause readmissions. The program evaluation for the study population is done through a causal inference modeling framework. Upon compiling data on the study population, propensity score matching is used to create a matched dataset with an artificial control group. Using the matched dataset, survival analysis is performed by fitting Cox proportional hazards models to determine hazard ratios for our outcome and additional factors, and survival curves were fit to display the proportion of patients without a readmission. After performing propensity score matching with the optimal full matching method, the total study population of 4,349 consists of members under the age of 21 who met organization-specific inclusion criteria, of whom 1,373 belonged to the treatment group and 2,976 to the control group. Using the chosen Cox proportional hazards models, for the current study population no effect was found on the readmission rate for same-cause 30-day readmissions; this was similarly the case for both same-cause and all-cause 10-day readmissions. For the 30-day all-cause readmission rate, we found an increased probability of readmission among the treated versus the control group, highlighting the distinctive features of the pediatric population. The findings highlight the importance of considering the differential characteristics of the study population when implementing causal inference in program evaluations. Further analysis is to be conducted focusing on emergent readmissions, filtering out elective or automatic readmissions, which often occur in pediatric patients such as children on ventilators.
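
A sketch of the matching-plus-Cox workflow described above, using the MatchIt and survival packages (variable names are assumptions, not the organization's data):

library(MatchIt)
library(survival)
m  <- matchit(toc_program ~ age + sex + comorbidity_count,
              data = peds, method = "full")   # optimal full matching
md <- match.data(m)                           # matched data with weights and subclass
coxph(Surv(days_to_readmit, readmitted) ~ toc_program,
      data = md, weights = weights, cluster = subclass)  # readmission hazard ratio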

Binyue Hu
Addressing Bias in Mendelian Randomization: Introducing the Penalized Inverse-Variance Weighted Estimator with Application to Obesity-Related Exposures and Type 2 Diabetes Risk

Mendelian randomization (MR) is a method that uses genetic variants as instrumental variables (IVs) to estimate causal effects of exposures on outcomes, even when unmeasured factors are present. However, the commonly used inverse-variance weighted (IVW) estimator may be biased when IVs are weak, a common issue in MR studies. In this study, we examine a novel approach, the penalized inverse-variance weighted (pIVW) estimator, which addresses this bias by penalizing the IVW estimator and adjusting its variance estimation. The method shows reduced bias and variance compared to the debiased IVW (dIVW) estimator under certain conditions. Extensive simulation studies support the performance of the proposed pIVW estimator. Furthermore, we apply the pIVW estimator to estimate the causal effects of four obesity-related exposures on type 2 diabetes outcomes. Notably, we find that hypertensive disease and higher body mass index are associated with an increased risk of type 2 diabetes.
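
For reference, the standard IVW estimator that pIVW penalizes can be written in a few lines of R; the summary statistics below are simulated purely for illustration:

    set.seed(1)
    p    <- 50
    bx   <- rnorm(p, 0.05, 0.02)              # SNP-exposure estimates
    byse <- runif(p, 0.01, 0.03)              # SNP-outcome standard errors
    by   <- 0.2 * bx + rnorm(p, 0, byse)      # true causal effect = 0.2
    beta_ivw <- sum(bx * by / byse^2) / sum(bx^2 / byse^2)
    se_ivw   <- sqrt(1 / sum(bx^2 / byse^2))
    c(estimate = beta_ivw, se = se_ivw)

The pIVW approach modifies this estimator so that weak instruments (small bx relative to their standard errors) do not destabilize it; the sketch above shows only the unpenalized baseline.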

COVID-19 Research (HSC LL209A)

Zozo Chunyu
Mendelian Randomization Analysis to Investigate the Correlation of COVID-19 and Alzheimer’s Disease in European Populations

This project aims to investigate whether COVID-19 increases the risk of Alzheimer's Disease (AD) in the European population by replicating and extending the methodologies outlined in the paper "Robust Mendelian Randomization Analysis." It intends to apply several novel analytical methods to datasets pertaining to Alzheimer’s disease and COVID-19 hospitalization, leveraging the MendelianRandomization package in R. The research will utilize exposure data from the COVID19-hg GWAS meta-analyses (hospitalized vs. population, round 7) and outcome data on AD from Dr. Liu, merging these datasets by rsID. The study will incorporate five to six Mendelian randomization (MR) methods, including MR-SPI, IVW, MR-Egger, and MR-PRESSO, among others, to infer causal relationships. The validity of suggested causal links will be strengthened if multiple methods yield concordant ratio estimates. By fitting OLS models to the ratio estimates from the different methods, this research aims to provide comprehensive insights into the causal relationship between COVID-19 and AD, offering potential implications for public health strategies in the European context.
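
A minimal sketch of how such an analysis might be run with the MendelianRandomization package; the harmonized summary vectors here are simulated stand-ins for the merged COVID-19/AD data:

    library(MendelianRandomization)
    set.seed(2)
    bx   <- rnorm(30, 0.04, 0.01)             # SNP-exposure estimates
    bxse <- rep(0.01, 30)
    byse <- rep(0.02, 30)
    by   <- 0.1 * bx + rnorm(30, 0, byse)     # SNP-outcome estimates
    mr_obj <- mr_input(bx = bx, bxse = bxse, by = by, byse = byse)
    mr_ivw(mr_obj)                            # inverse-variance weighted
    mr_egger(mr_obj)                          # MR-Egger
    mr_allmethods(mr_obj, method = "main")    # several estimators at once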

John Cheng
Estimating the Effects of Long COVID on Employment Loss

The labor market effects of post-acute COVID-19 symptoms in the United States constitute a salient policy concern amid the country's recovery from the pandemic. This practicum project examines those effects using logistic regression models. The analysis finds that post-acute COVID symptoms are associated with an elevated risk of job loss, accounting for vaccination status, sociodemographic characteristics, and other covariates. This outcome underscores the need for renewed attention to the health and economic experience of individuals affected by long COVID.

Wenjun Mo
Healthcare Access and COVID-19 Outcomes in New York City

This study investigates the critical relationship between healthcare access and COVID-19 outcomes across New York City's diverse boroughs. For a densely populated city that became an epicenter of the pandemic in the United States, understanding the dynamics of healthcare accessibility and its effect on disease transmission and mortality rates is paramount for informed public health responses and policy-making. Utilizing a comprehensive dataset aggregated from public health records, mobility data from points-of-interest (POIs) related to healthcare facilities, and socioeconomic indicators, we conducted a statistical analysis to uncover patterns and disparities in healthcare access and its association with COVID-19 case outcomes.

Yuan Yao
Analysis of the Relationship between COVID-19 Prevention and Control and Depression in College Students: A Case Study of Undergraduates at Shanghai Fudan University

The research on adolescent mental health has increasingly focused on depression among university students. Furthermore, the COVID-19 outbreak in 2019 and the enforcement of subsequent preventative and control measures have escalated the prevalence of depression within the general population. This study was conducted to examine the relationship between COVID-19 preventative and control measures and the occurrence of depression in university students. An exhaustive review of literature and systematic inquiry facilitated the analysis of various factors contributing to depression in this population. The investigative approach combined questionnaire surveys and semi-structured interviews targeting a particular cohort of undergraduate students at a specified university. Results from the study demonstrated that students experiencing extensive exposure to COVID-19 preventative and control strategies, along with lower professional recognition, decreased satisfaction with family relationships, heightened feelings of loneliness and helplessness, and insufficient social support, were significantly more likely to suffer from higher levels of depression. Drawing from the consequential insights of this research, the paper articulates strategic recommendations for the prevention and intervention of depression among university students, contextualized within the prevailing COVID-19 preventative and control measures.

Data Analysis (HSC LL210)

Ruihan Zhang
Analyzing Pharmaceutical Production Yield Losses: A Biostatistical Approach to Enhancing Efficiency in Pharmaceutical Manufacturing

In the realm of pharmaceutical manufacturing, optimizing production efficiency while maintaining product quality is paramount. This practicum project embarks on a detailed analysis of yield losses within the manufacturing processes of Lidocaine, Scopolamine, and Xulane at Teva Pharmaceuticals USA, Inc. Yield loss, defined as the discrepancy between the expected and actual output, directly impacts production efficiency, cost-effectiveness, and ultimately, drug availability and affordability. Utilizing a synthesized dataset reflective of real production scenarios, this study employs biostatistical methods and JMP software to systematically analyze yield data across various stages of the production process. The primary objective is to identify significant loss points, understand underlying causes, and propose actionable improvements. Secondary aims include the evaluation of in-process testing data to enhance quality control measures and the development of a statistical model to elucidate the relationship between production batch uniformity and final product quality. By pinpointing inefficiencies and recommending strategic interventions, this project aims to advance pharmaceutical manufacturing practices, thereby contributing to better health outcomes through improved drug accessibility. Through this investigation, we seek not only to enhance the operational efficiency of specific drug productions but also to set a precedent for biostatistical applications in pharmaceutical process optimization. 

Meng Fang
Analyzing Evacuation Dynamics in Louisiana During Hurricane Ida: A Comprehensive Zipcode-Level Study

In this practicum, we examine the evacuation patterns in Louisiana during Hurricane Ida, focusing on a comprehensive zipcode-level analysis across the state. The study aims to map evacuation behaviors, employing extensive datasets to track daily movements and resident dislocations. This is achieved through the innovative use of time series data, including the total number of visits and movement distances before and after the hurricane, with a baseline established from data collected in the preceding two weeks. Statistical models, notably the ARIMA model, are utilized to contrast normal visitation trends against those observed during Hurricane Ida. Furthermore, the study explores the average moving distance during the hurricane period for each zipcode region, linking these patterns to socio-economic factors and proximity to the hurricane's landfall. Additionally, we analyze the inflow and outflow of visits for each zipcode, correlating these numbers with socio-economic variables. A significant aspect of the research involves logistic regression to examine the proportion of residents moving out in relation to various factors. This analysis will also include clustering zip codes to identify groups with similar visitation patterns. This comprehensive study aims to provide deeper insights into evacuation behaviors and their determinants, offering valuable information for future emergency planning and response strategies.
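
As a minimal illustration of the baseline-versus-event comparison, one can fit an ARIMA model to two pre-storm weeks of daily visit counts and measure the post-landfall departure from forecast; the series below is simulated, not the study's data:

    set.seed(8)
    visits <- c(rpois(14, 100), rpois(7, 40))    # counts drop after landfall
    fit <- arima(visits[1:14], order = c(1, 0, 0))
    fc  <- predict(fit, n.ahead = 7)
    visits[15:21] - fc$pred                      # shortfall suggests evacuation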

Shuting Kang
COVID-19 Vaccine Claims Cleaning Program and Missing Dose Investigation

This study addresses the challenge of discrepancies in COVID-19 vaccination claims among Medicare beneficiaries from December 2020 to June 2023. Conducted by a statistical programmer intern on Acumen LLC's safety team, the research aims to identify and rectify recording issues and explore the reasons for missing doses in the primary series, thereby enhancing the reliability of the vaccine claim dataset used for public health policy development. Leveraging the ADAPT SSD Claims Data and statistical analysis in SAS, the study developed a data Cleaning Program to correct and impute vaccine claim inconsistencies. The Cleaning Program focused on claims recording issues, including double billings, empty dose types, duplicated entries, incorrect ordering, and conflicts between multiple brands, applying predictive algorithms to infer true administration codes and actual brands for problematic claims. Post-cleaning, reductions in missing dose incidents were observed. Discrepancies were categorized into five types, with around 600,000 administration codes updated. Pfizer and Moderna were the dominant vaccine brands administered. The study identified age below 65, discontinuous Medicare enrollment, and coding errors as key reasons for missing primary series doses, with the Cleaning Program successfully recovering approximately 330,000 missing dose records. The Cleaning Program reduced missing dose incidents and enhanced dataset integrity, aiding more informed healthcare policy decisions. The findings underscore the importance of meticulous data management and the potential for advanced programming to improve public health data quality. Further research should continue to monitor and refine data cleaning protocols for ongoing accuracy and reliability in healthcare data.

Sarah Tsang
Compiling Data for Government Grants

The Region 2 Public Health Training Center in New York City is a program that facilitates public health education through online courses. For a public health program to qualify for a government grant, it must submit an annual report that demonstrates its progress to the Health Resources and Services Administration (HRSA). The annual funding for an HRSA program ends on the last day of June; therefore, report creation begins in July. Data on the program’s courses were downloaded from the TRAIN Learning Network. Using Microsoft Excel, the data were cleaned to display the number of people who used the 187 available courses the training center provides. The annual report required information about how many people attended an online webinar or accessed an online course for continuing education credits and what career field they were in. This information was obtained by using basic formulas in Microsoft Excel to analyze the data. After the data were compiled, the information was entered formally on the HRSA website. This was one of the required reports. Another required report compiled the information for every webinar hosted by the training center, including the name of the webinar, the topic discussed, how many people attended, and opinions of the content from a Likert-scale rating. After completion, the report is sent to HRSA for approval.

Environmental Health Research (HSC LL109A)

Jennifer Osei
Heat Exposure on Health Outcomes in Ghana: Redefining & Comparing Climate Health Stress Index Exposure Measures for Atmospheric Heat through Comprehensive Model Optimization

As the world grows warmer, the health implications of heat need to be investigated, and models that explain these implications are important for a better understanding of heat's impact on human health. For this study, heat stress indices (HSIs) commonly used in heat-health studies were examined. For example, wet bulb globe temperature (WBGT) is a measure of heat stress in direct sunlight that considers temperature, humidity, wind speed, sun angle, and solar radiation. However, WBGT is not the sole measure of climate and heat. Other exposure measures, such as (1) simplified WBGT (sWBGT), (2) apparent temperature, (3) NOAA's heat index (HI), (4) humidex, and (5) the Universal Thermal Climate Index (UTCI), are also viable heat index measurements. Utilizing these newer measurements, given their potential to better capture heat's impact on humans, seems promising. Ghana, in West Africa, lies near the equator and holds some of the highest temperatures on the globe; it is an area of interest because it is among the first in line to be impacted by climate change and, in turn, the health of its people. Comparing these HSIs against one another in health models has yet to be fully explored and is the aim of this research investigation.
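
For concreteness, the standard outdoor WBGT is a fixed weighting of the natural wet-bulb, globe, and dry-bulb temperatures; a minimal R sketch (inputs in degrees Celsius, chosen only for illustration):

    # ISO 7243 outdoor weighting: 0.7 wet-bulb + 0.2 globe + 0.1 dry-bulb
    wbgt <- function(t_nwb, t_globe, t_air) {
      0.7 * t_nwb + 0.2 * t_globe + 0.1 * t_air
    }
    wbgt(25, 40, 32)   # returns 28.7, above common heat-stress thresholds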

Tharina Messeroux
Exploring Housing Satisfaction Levels in Public Housing: Preliminary Analysis for Smoke-Free Policy Compliance

Public housing serves as a critical resource for low-income families and vulnerable populations. However, it is essential to recognize the disparities in demographics and living conditions among residents when implementing community programs. This paper presents a preliminary analysis for a broader study aimed at identifying effective strategies to enhance compliance with federally mandated smoke-free policies in public housing, focusing specifically on tobacco use and secondhand smoke exposure. In the study, a survey is administered to collect data on residents' satisfaction rates as a baseline for the research, emphasizing the importance of understanding the factors influencing satisfaction levels in public housing communities to inform interventions and interpret study results effectively. Demographic characteristics, including age, sex, race, education, income, smoking status, secondhand smoke exposure, and household size, were obtained from participants. Satisfaction levels were assessed across various aspects of housing conditions using questions with answers ranging from "Very Satisfied" to "Very Dissatisfied." Bivariate analyses, including chi-square tests and t-tests, explored the relationship between demographic factors and satisfaction levels, with a specific focus on smoking status. The chi-square tests revealed no significant difference (p > 0.05) in satisfaction levels across all aspects of housing conditions assessed, regardless of smoking status, at the 5% significance level. While demographic and living condition disparities exist within these communities, our findings suggest that smoking status may not significantly influence overall satisfaction levels. These insights are crucial for informing targeted interventions aimed at improving the well-being of public housing residents and addressing their diverse needs effectively.

Yining Chen
Identifying Changes in Human Mobility and Movement Patterns Linked to California Wildfires

A considerable amount of research has examined human mobility. However, significant large-scale events such as hurricanes, wildfires, and outbreaks can lead to substantial variations in human activity patterns across different regions in the aftermath, lasting weeks to months. These changes often incur significant financial, medical, and quality-of-life costs. Recognizing these shifts in movement can greatly assist societies in crafting more efficient responses. This project aims to investigate available datasets to assess their suitability for representing movement patterns in wildfire-affected areas. Additionally, it focuses on analyzing the natural tracking and anonymized mobility behavior of individuals in California, with the goal of quantifying migration patterns during various weather conditions, specifically examining human mobility during wildfires. It is important to note that this project is still in progress, and we are currently in the preliminary analysis stage, with no final results generated yet.

Eileen Ramirez del Rio
Neurobehavioral Developmental Outcomes in Young Children Born to Ecuadorian Women with Exposures to Ethylenethiourea (ETU)

Ethylenebisdithiocarbamates (also known as EBDCs) are commonly used fungicides in agriculture, floriculture, and horticulture. EBDCs can be metabolized into a more toxic, carcinogenic, and teratogenic compound, ethylenethiourea (ETU). This compound has been associated with decreased serum thyroxine levels, increased thyroid-stimulating hormone levels, and thyroid gland disorders, the latter of which, during pregnancy and early infancy, can impact fetal and infant neurobehavioral development. The main purpose of our research was to determine the possible factors (sociodemographic characteristics, general home environment, maternal labor history, partner’s labor history, reproductive and pregnancy history, and maternal health history) during pregnancy and the early developmental years that could influence potential ETU exposure and, therefore, possibly affect newborn thyroid function and early childhood growth and neurobehavioral development. Pregnant women living near or working at flower farms located in Cayambe and Pedro-Moncayo in Ecuador were recruited into a longitudinal birth cohort study. Eligible women were seeking prenatal care at local clinics/hospitals, had lived in the region for at least one year, and were between 10 and 20 weeks of gestation. Self-reported pesticide exposure data were collected through a baseline questionnaire, and ETU measurements were recorded from urine and blood samples. Datasets pertaining to the obtained results were cleaned and prepared for subsequent model building. A total of 409 subjects were recruited for the study: 111 agricultural workers (including floriculture and other agriculture), 149 non-agricultural workers, and 149 non-workers. Discrepancies found between the baseline questionnaire and codebook were resolved, and variables were recoded where deemed necessary for the purpose of our analysis. Summary tables were created for the entirety of the baseline questionnaire, giving results overall and by maternal work sector. Most of the data analysis performed during this period was descriptive in nature. After its completion, we aim to examine the results and determine which variables might serve best to predict ETU measurements, both overall and by maternal work sector.

Health Policy (HSC LL108B)

Jingchen Chai
Medicare Data Insights: Strategic Approaches to Enhancing Drug and Vaccine Safety

This report outlines the analysis of over 60,000 Medicare claims to assess pharmaceutical and vaccine safety and efficacy for the Food and Drug Administration (FDA). Through logistic regression, we explored the association between flu vaccination and demographic factors, revealing key insights into vaccination strategies. Additionally, we developed a standardized adverse events indicator program using SAS, enhancing the efficiency of summary table generation for patient data analysis. Poisson regression was utilized to calculate vaccination rate ratios and confidence intervals, with forest plots effectively translating statistical data into accessible visual summaries. Overall, the study offers significant advancements in drug and vaccine safety evaluation, contributing to informed public health decisions.

Shodai Inose
Health System Strengthening in South Sudan: Understanding Post-Independence National Trends in Utilization of Maternal and Child Health Services

Upon gaining independence in 2011, after over 30 years of civil conflict, South Sudan’s health system was rebuilt with the hope of improving access to and utilization of maternal and child health services. This practicum aims to understand trends in year-on-year growth in utilization of several key health indicators from facility-reported data across the nation, accounting for factors such as seasonality and population growth. This analysis relies on monthly routine health facility data collected in the district health information system database (DHIS) in two distinct time periods: January 2014 – July 2017 (DHIS 1.4) and April 2020 – November 2022 (DHIS 2). The analysis focuses on the following maternal health indicators: deliveries and antenatal care visits (first and fourth visits); the following child health indicators: malaria, diarrhea, and pneumonia cases; and the following immunizations: the first and third doses of the oral polio vaccine (OPV) and the first and third doses of the diphtheria, tetanus, and pertussis vaccine (DTP/DPT). However, not all indicators were recorded in both databases. Trends were identified for each indicator using linear regression with (seasonal) autoregressive-moving average error terms. The analysis identified no statistically significant (α = 0.05) changes in utilization of health services for any indicator throughout the study period. Significant seasonality was detected in the malaria indicator, with peaks during the summer months.
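
A minimal sketch of a regression with autocorrelated errors of the kind described, using the nlme package on a simulated monthly series (the data, trend, and seasonal effects are illustrative only):

    library(nlme)
    set.seed(7)
    dhis <- data.frame(month_index = 1:36, month = factor(rep(1:12, 3)))
    dhis$count <- 200 + 2 * dhis$month_index +
      30 * sin(2 * pi * rep(1:12, 3) / 12) + rnorm(36, 0, 10)
    fit <- gls(count ~ month_index + month, data = dhis,
               correlation = corAR1(form = ~ month_index))
    summary(fit)   # trend term tests year-on-year growth, net of seasonality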

Nirali Patel
Racial and Ethnic Disparities in Postpartum Readmissions at New York State Hospitals

Racial and ethnic disparities persist in maternal health outcomes across the United States, demanding urgent attention from public health authorities. Research indicates that minoritized birthing people face up to three times higher risks of pregnancy-related death or severe maternal morbidity compared to non-Hispanic White individuals. Despite the critical nature of the postpartum period, limited research exists on racial and ethnic differences in postpartum health, particularly regarding readmission rates in New York State. This study investigates such disparities, focusing on postpartum readmission rates, diagnoses, and associated risk factors, to inform targeted interventions and mitigate maternal health disparities. The study utilizes hospital discharge records from the New York State Inpatient Database, provided by the Healthcare Cost and Utilization Project (HCUP). These records encompass all inpatient discharges from New York acute care hospitals, detailing patient demographics, hospital identifiers, and diagnoses/procedures coded in ICD-10-CM. The study sample comprises delivery hospitalizations in New York State between 2016 and 2017, focusing on individuals aged 10-44. Readmission rates, risk factors, and diagnoses, categorized by race/ethnicity and timing, were analyzed using SAS 9.4 software, acknowledging the socioeconomic context and healthcare disparities. The analysis reveals that, within this study population, Black and Hispanic postpartum individuals exhibit 1.71-fold and 1.24-fold increased risks, respectively, of readmission within one year compared to their White counterparts. A comprehensive assessment of 34 variables encompassing sociodemographic, comorbidity, labor and delivery, and hospital-related factors was conducted. Eleven variables with absolute standardized differences exceeding 10% were identified as candidate variables. Significant risk factors included age, health insurance status, obesity, abnormal fetal heart rhythm, and severe maternal morbidity. Pregnancy-associated hypertensive disorders, calculus of the gallbladder, and diseases of the digestive system emerged as the most prevalent diagnoses leading to readmissions. Limitations include a small coefficient of determination (R² = 0.0109), suggesting unaccounted-for risk factors, possibly related to structural racism. The ASD method for predictor selection may have limited the inclusion of important variables. Future steps entail comparing overall readmissions with early and late readmission analyses and computing the intraclass correlation coefficient to assess variance distribution across hospitals and patients. Despite the insights gained, further research is imperative to identify and address underlying factors contributing to maternal health disparities effectively.
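
For the screening step, the absolute standardized difference (ASD) for a binary covariate compares its prevalence between readmitted and non-readmitted groups; a minimal R sketch with hypothetical prevalences:

    asd_binary <- function(p1, p2) {
      abs(p1 - p2) / sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2)
    }
    asd_binary(0.30, 0.22)   # ~0.18, i.e., 18% > 10%: keep as a candidate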

Angsi Shi
“LIGHTSPEED” - How Project “Lightspeed” Changed Our Approach to Imminent Public Health Crises

The Lightspeed Project is an internal initiative within Pfizer focused on accelerating the development and manufacturing of key medicines and medical products in the face of an imminent public health crisis. The idea of “Lightspeed” originated in 2017 with a gene therapy team within Pfizer; it was then adopted during the COVID-19 pandemic as The Lightspeed Project and has been carried on ever since. Issues marked with the label “Lightspeed” are meant to be crucial to key product development for impending public health problems, and several projects that I worked on during my internship at Pfizer were dedicated to making “Lightspeed” achievable.

Longitudinal Data Analysis (HSC LL108A)

Mingyi Du
Predicting CHD from Metabolic Components Using a Dynamic Bayesian Network in Longitudinal Cohorts

This study analyzes the relationship between Metabolic Syndrome (MetS) components and Coronary Heart Disease (CHD) occurrence, leveraging a dynamic Bayesian network in the face of rising global CHD prevalence. Utilizing data from 9,189 participants over a 3-year span from the Jining Municipal Hospital health examination cohort in Shandong, China, the study employed Dynamic Bayesian Networks (DBN) to forecast CHD risk based on eight metabolic indicators. Predictive variables included blood test results and physical examination parameters, with the data divided into training and validation subsets for model construction and evaluation. Our DBN model captured the temporal dynamics of metabolic risk factors and showed strong calibration, with an intercept (A) and slope (B) closely aligned with ideal values in both the derivation and validation cohorts. Discriminative analysis yielded AUC values exceeding 0.7, indicative of the model’s robustness. Decision curve analysis showcased the model’s clinical usefulness by offering higher clinical net benefit compared to traditional strategies. By integrating the temporal complexity of MetS factors, the DBN model provides an advanced predictive tool for CHD, enhancing risk stratification and potentially guiding preventive strategies in clinical practice. The model demonstrates the efficacy of temporal modeling for understanding chronic disease trajectories and posits a shift toward dynamic risk assessment in cardiology. The research validates the use of DBN in the prognostic modeling of CHD from MetS factors, underscoring its potential for enhancing clinical decision-making. It paves the way for future studies to further refine dynamic predictive models, aiming for precision medicine applications in cardiovascular health.

Xicheng Xie
Analyzing trends in humid heat stress in New York State: A Bayesian spatiotemporal analysis

Heat stress has devastating effects on society. This study investigates variations in humid heat stress across New York State over the last 40 years (1980-2020) and explores its associations with socioeconomic factors, urbanicity, and demographics. We aimed to identify how environmental stressors interact with the social and economic conditions of census tracts. Bayesian spatiotemporal models were formulated to examine census-tract-level spatial and yearly variations in Wet Bulb Globe Temperature (WBGT) and extreme heat events, defined as periods of one day or longer during which the daily maximum WBGT (WBGTmax) exceeds 28 °C. These variations were analyzed in relation to various socioeconomic and demographic covariates. Our findings indicate that wealthier areas, characterized by higher incomes and property values, may experience greater temperature variability. Moreover, higher educational levels correlate with increased susceptibility to temperature changes. These results contribute to understanding environmental inequalities and underscore the importance of integrating socioeconomic factors into climate change models to devise effective adaptation strategies.

Jiawen Zhao
Examining the Impact of a Sleep Intervention on Nurses' Mental Health Status (Stress and Depression Level): A Longitudinal Analysis using GLMM and GEE

Chenyao Ni
Longitudinal Profiles of Suicidal Ideation in Older Adults with Depression: Associations with Depressive Symptoms, Cognitive Deficits, and Clinical Measures

In a recent study, we extended previous work by exploring four identified suicidal ideation profiles among mid-life and older adults, analyzing their connections with depressive symptoms and cognitive function over time, and their impact on ideation progression. We followed 337 depressed adults aged 50-93 for up to 14 years, categorizing them into Low/non-ideators (22.8%), Chronic ideators (27.6%), Variable ideators (18.7%), and Fast-remitting ideators (30.9%). Using Kruskal-Wallis tests and proportional odds models, we compared baseline depressive symptoms across groups and employed linear mixed effects and cumulative link mixed models to examine changes over time, including assessments of the Mini-Mental State Examination and the Cumulative Illness Rating Scale-Geriatric. Initial analyses identified significant differences in depressive symptoms across profiles, with chronic ideators showing more severe symptoms than low ideators at baseline, and variable ideators displaying higher anxiety. Over time, chronic ideators consistently experienced more severe depressive symptoms, motivation loss, and sleep issues compared to fast-remitting ideators, who demonstrated a stronger linkage between depression and ideation scores, suggesting a higher risk of recurrence. Cognitive performance was notably lower in chronic and remitting ideators, without significant differences in cognitive decline or illness severity over time. This study highlights the complex nature of suicidal ideation in depressed older adults, suggesting the need for targeted interventions based on ideation profiles and emphasizing the importance of continuous mental health support, especially for remitting ideators, to prevent ideation recurrence. Further research into the influence of various factors on these profiles could inform more personalized prevention and treatment approaches.

Machine Learning (HSC LL109B)

Hongru Tang
Inference in Linear Regression with Informative Sampling: A Comparison of Methods

In survey sampling, informative sampling occurs when the probability of including a particular unit in the sample is related to the value of the response variable of interest, leading to potential selection bias. If not properly addressed, this phenomenon presents significant challenges, as it can distort inferences about population parameters. This study compares several methods for inference about the parameters of a linear regression when the survey data are subject to a selection effect. These methods use different likelihoods, i.e., a pseudo-likelihood, a likelihood without correction for a selection effect, and a likelihood based on the sample distribution. The latter is implemented using the Leon-Novelo and Savitsky (LN&S) Bayesian method, which jointly models the survey outcome variable and the sample inclusion probability. We identified a scenario in which none of the existing methods works well and proposed a novel modification to the LN&S method to handle informative sampling in this scenario. Our simulations demonstrated that the modified LN&S method outperforms the alternative methods, with the lowest bias and mean squared error and coverage close to the nominal 95% level for the probability intervals of both intercept and slope. In contrast, the pseudo-likelihood and the likelihood without correction for a selection effect may experience significant bias, and their 95% credible intervals often show poor coverage. Moreover, we explored the intrinsic correlations within the survey data, assessing their impact on model efficacy. We applied the methods to data from the 1988 Mental Health Organization survey.
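
As a minimal illustration of the pseudo-likelihood idea, a survey-weighted regression can be fit with the survey package; the simulation below makes sample inclusion depend on the outcome, the defining feature of informative sampling (all values are illustrative):

    library(survey)
    set.seed(3)
    n <- 500
    x <- rnorm(n)
    y <- 1 + 0.5 * x + rnorm(n)
    p_incl <- plogis(-1 + 0.8 * y)        # inclusion depends on the outcome
    s <- rbinom(n, 1, p_incl) == 1
    samp <- data.frame(x = x[s], y = y[s], w = 1 / p_incl[s])
    des <- svydesign(ids = ~1, weights = ~w, data = samp)
    summary(svyglm(y ~ x, design = des))  # pseudo-likelihood (weighted) fit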

Noah Zhou
Predictive Modeling of Diabetes Risk Using Biostatistical and Data Science Methods

Diabetes continues to pose a significant global health challenge, with millions affected worldwide. Timely identification of individuals at high risk of developing diabetes is crucial for implementing preventive measures and reducing the associated morbidity and mortality. This practicum project aims to develop a predictive model based on biostatistical and data science methodologies for identifying such individuals. Preliminary analysis of the data reveals promising outcomes, with the developed predictive model achieving an AUC exceeding 0.85, indicating excellent discrimination ability. Significant predictor variables identified through the analysis include age, BMI, family history of diabetes, and fasting blood glucose levels. Moreover, the developed model demonstrates superior performance compared to traditional risk assessment tools, exhibiting a sensitivity of 0.80 and a specificity of 0.85. Subgroup analysis further validates the model's robustness, showing consistent performance across different demographic groups. Additionally, the incorporation of lifestyle factors, such as diet and physical activity levels, enhances the model's predictive power, providing a comprehensive assessment of diabetes risk for individuals.
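
A minimal R sketch of the kind of discrimination assessment reported above, using the pROC package on simulated held-out labels and predicted risks (not the project's data):

    library(pROC)
    set.seed(4)
    y <- rbinom(200, 1, 0.3)                  # simulated outcomes
    p <- plogis(-1 + 2 * y + rnorm(200))      # simulated predicted risks
    r <- roc(y, p)
    auc(r)                                    # area under the ROC curve
    coords(r, "best", ret = c("threshold", "sensitivity", "specificity"))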

Jiahe Deng
Discriminating Malignant Breast Cancer from Benign: Insights from FNA Image Analysis

The study examines the distinguishing characteristics of cell nuclei in digitized fine needle aspirate (FNA) images from patients with malignant and benign breast tumors. Our objective is to effectively analyze and identify key differences between malignant and benign cases using various modeling techniques. Given the binary nature of the final diagnosis outcome, malignant or benign, we employ linear regression, random forest classification, k-nearest neighbors (KNN) classification, and decision tree classification. The prompt identification of malignant tumors is crucial, as they tend to rapidly invade nearby tissue and metastasize. Our analysis reveals that the linear regression and KNN models exhibit superior performance. Furthermore, these models highlight the significance of the mean radius, mean concavity, and mean fractal dimension values of the cell, with p-values approaching 0. In conclusion, our findings underscore the efficacy of employing computational models to discern crucial features in FNA images, facilitating accurate diagnosis and potentially enhancing patient outcomes in the management of breast cancer.

Yang Yi
Identifying multiple sclerosis subtypes using unsupervised machine learning and clinical data

Multiple sclerosis (MS) is a chronic disease involving demyelination and neurodegeneration of the central nervous system. With 2.8 million individuals affected worldwide, it is the most common neurologic cause of disability in young adults due to symptoms that may include impaired motor, cognitive, and psychosocial function, limited occupational attainment, and debilitating fatigue. The disconnect between clinical presentation and quantifiable disease burden based on magnetic resonance imaging (MRI) variables (atrophy, T2 lesion volume) has been referred to as the clinico-radiological paradox, and represents a major obstacle to progress. Whereas one patient may have high disease burden but relatively few symptoms, another may have low disease burden but present with debilitating symptoms with unclear pathophysiological substrates, such as fatigue or cognitive impairment. This disconnect has stymied efforts toward precision treatment; it also presents a challenge for predictive modeling, as our best biomarkers of disease correlate only weakly with clinical symptom burden. Current disease subtypes (clinically isolated syndrome, relapsing-remitting, secondary progressive, primary progressive) were developed to capture MS disease activity and progression. Within each subtype group, however, there can be vast heterogeneity of symptom and radiologic profiles, leading to calls for precision phenotyping based on incorporation of pathological processes. One recent study by Eshaghi and colleagues took a data-driven approach to disease classification based on brain changes on MRI. Applying an unsupervised machine learning algorithm (Subtype and Staging Inference, SuStaIn) to data from 6,322 patients, three MS subtypes were identified. The subtypes were characterized by distinct temporal patterns of change on MRI and identified based on the earliest abnormalities observed: lesion-led, cortex-led, and normal-appearing white matter-led. Subtypes differed in disability progression, relapse rate, and treatment response, suggesting that MRI-based subtypes are predictive of relevant clinical outcomes.

Topics in Mental Health (HSC LL107)

Qingyue Zhuo
The Impact of Functional Connectivity Measures on ADHD Comorbidity Profiles in Children

Attention-deficit/hyperactivity disorder (ADHD) is one of the most prevalent neurodevelopmental conditions in childhood, characterized by traits such as inattention, hyperactivity, and impulsiveness. Additionally, a substantial majority, ranging from 60% to 100%, of children with ADHD also experience concurrent conditions, known as ADHD comorbid disorders, such as anxiety and obsessive-compulsive disorder (OCD). Notably, children with comorbid disorders typically demonstrate worse outcomes, reflected in observable social, emotional, and psychological challenges in clinical settings. Our study aims to investigate the relationship between types of ADHD comorbid disorders and functional connectivity measures among children aged 9-10 years old. Participants were selected from the ABCD study and separated into either the typically developing (TD) group or one of three psychiatric groups: ADHD-ODD (AO), ADHD-OCD-Specific (AOS), and sparse comorbid profile (SPA). Classification models were used to fit the relationship, with both accuracy and AUC used to measure the models' performance. For both boys and girls, the classification models indicated no significant difference in functional connectivity measures across the types of comorbid disorders. Functional connectivity measures thus did not distinguish ADHD comorbidity profiles in children aged 9-10 years old; other methods may provide further insights.

Haotian (Matthew) Ma
Correlation between cognition, functional impairment, and neuropsychiatric symptoms

The Neuropsychiatric Inventory–Questionnaire (NPI-Q) was developed and cross-validated with the standard NPI to provide a brief assessment of neuropsychiatric symptomatology in routine clinical practice settings (Kaufer et al, J Neuropsychiatry Clin Neurosci 2000, 12:233-239). The NPI-Q is adapted from the NPI (Cummings et al, Neurology 1994; 44:2308-2314), a validated informant-based interview that assesses neuropsychiatric symptoms over the previous month. The original NPI included 10 neuropsychiatric domains; two others, Nighttime Behavioral Disturbances and Appetite/Eating Changes, have subsequently been added. Another recent modification of the original NPI is the addition of a Caregiver Distress Scale for evaluating the psychological impact of neuropsychiatric symptoms reported to be present (Kaufer et al, JAGS, 1998;46:210-215). The NPI-Q includes both of these additions. The Functional Activities Questionnaire (FAQ) measures instrumental activities of daily living (IADLs), such as preparing balanced meals and managing personal finances. Since functional changes are noted earlier in the dementia process with IADLs, which require a higher cognitive ability, than with basic activities of daily living (ADLs) (Hall, 2011; Peres et al., 2008), this tool is useful for monitoring these functional changes over time. The FAQ may be used to differentiate those with mild cognitive impairment from those with mild Alzheimer’s disease. To further exemplify the importance and utilization of the FAQ, thousands of research participants across the United States are administered the FAQ annually as part of the National Alzheimer’s Coordinating Center (NACC) longitudinal research study taking place in 29 National Institute on Aging-funded Alzheimer’s Disease Centers (Weintraub et al., 2009). For this project, we aim to check the data collection status and evaluate the feasibility of the OASIS database for neuropsychiatric symptom (NPS) neuroimaging studies by running regression analyses including the variables contained in the NPI-Q and FAQ. We are particularly interested in the association of apathy with the other variables.

Shangsi Lin
Exploring the Impact of Oversampling Techniques on the Performance of Postpartum Depression Prediction Models and the Mediating Role of Substance Usage: An Investigation Based on New York State Patient Data

Postpartum depression (PPD) poses a significant risk to maternal health, with potential implications for maternal mortality. Despite its rarity in the overall population, PPD prevalence has increased over the years, posing challenges for predictive modeling due to imbalanced datasets. This study explores the effectiveness of oversampling techniques to address this imbalance and investigates the mediating role of substance usage, specifically tobacco and opioid consumption, between previous depression history and PPD. Using hospital discharge records from the New York State Inpatient Database, logistic regression and random forest models were employed to predict PPD status. Random oversampling, SMOTE, Borderline SMOTE, and Density SMOTE were utilized to balance the dataset. While oversampling improved logistic regression performance, it did not significantly impact random forest models due to dataset dimensionality. Mediation analysis revealed no substantial mediation effect of substance usage on the relationship between previous depression history and PPD. Overall, oversampling techniques enhanced model performance in logistic regression, with random oversampling and SMOTE yielding the best results. However, the efficacy of oversampling varied across modeling methods, emphasizing the importance of understanding dataset structure when selecting oversampling techniques. Furthermore, the absence of mediation effects underscores the need for further research into the complex relationship between substance usage and depression.
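
A minimal base-R sketch of random oversampling, the simplest of the techniques compared above; the data frame and outcome name are hypothetical stand-ins:

    set.seed(9)
    dat <- data.frame(ppd = rbinom(300, 1, 0.1), x = rnorm(300))
    oversample <- function(d, outcome = "ppd") {
      tab      <- table(d[[outcome]])
      minority <- names(tab)[which.min(tab)]
      idx      <- which(d[[outcome]] == minority)
      extra    <- d[sample(idx, max(tab) - min(tab), replace = TRUE), ]
      rbind(d, extra)                   # classes now equally represented
    }
    table(oversample(dat)$ppd)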

Jiayi Shi
Predictive Modeling of Childhood Anxiety Disorders Using Task-Based fMRI: Insights from the ABCD Study

Anxiety disorders are prevalent psychiatric conditions among children and adolescents, yet comprehensive neuroimaging research in this population remains limited. This study aims to address this gap by utilizing data from the Adolescent Brain and Cognitive Development (ABCD) Study to construct predictive models for pediatric anxiety disorders. Specifically, we employ lasso regression and task-based functional magnetic resonance imaging (fMRI) measures to identify key brain regions associated with anxiety disorders. Our analysis focuses on baseline demographic, clinical, and neuroimaging data from the ABCD Study, employing linear mixed effects models to identify significant contrasts and weighted lasso for variable selection in logistic regression models. Notably, using the Monetary Incentive Delay (MID) task fMRI, we identify three significant contrasts and highlight specific brain regions, including the transverse temporal pole, posterior cingulate, cuneus, pallidum, insula, postcentral gyrus, and caudal anterior cingulate, as consistently contributing to anxiety disorder diagnosis in children. Our predictive model demonstrates promising performance with an area under the curve (AUC) of 0.713, representing a 5.32% improvement over the base model. While we extend our analysis to other fMRI modalities such as resting-state MRI and nBack task fMRI, no significant brain measures are selected, and the AUC does not surpass that of the base model. Future studies might need to explore additional MRI modalities, novel statistical approaches, and alternative diagnostic classifications to refine anxiety disorder diagnostic biomarkers effectively.
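
A minimal glmnet sketch of lasso-based variable selection in a logistic model, with a simulated design matrix standing in for the fMRI contrast measures:

    library(glmnet)
    set.seed(10)
    X <- matrix(rnorm(200 * 50), 200, 50)         # stand-in imaging features
    y <- rbinom(200, 1, plogis(X[, 1] - 0.5 * X[, 2]))
    cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1)  # alpha = 1: lasso
    coef(cvfit, s = "lambda.min")    # nonzero rows = selected brain measures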

Observational Study (HSC LL209B)

Sophie Chen
The role of neural flexibility in cognitive aging

Research has found that various aspects of brain structure and function significantly impact cognitive performance in adults. Previous studies mainly used static methods to study how brain networks affect cognitive aging. However, newer research suggests that dynamic measures, which track connectivity changes during brain scans, might better explain cognitive performance in younger adults. Moreover, studies indicate differences in how the brain functions depending on whether the person is doing a task during the scan. The current study aimed to: (1) investigate whether changes in brain network connections during rest and tasks relate to age and cognitive ability in adults of various ages, and (2) compare brain network changes during rest and tasks. The study involved 133 healthy adults aged 20–80, who underwent resting-state and task-based brain scans. Conclusions still require further exploration, but for now, examination of the resting-state data shows no specific trend in the trajectories across ROIs for any parameter.

Landi Guo
Analysis of Persistent Homology-Based Functional Connectivity and Cognitive Function Across the Lifespan

Functional connectivity networks obtained from resting-state functional magnetic resonance imaging (fMRI) reflect the degree of interconnectivity among different brain regions during integrated brain processing. Recently, persistent homology-based functional connectivity (PHFC) measures offer an alternative to standard graph-theory-based metrics by quantifying the pattern of information. Specifically, PHFC comprises backbone strength (BS), backbone dispersion (BD), and cycle strength (CS). Previous studies have shown that PHFC is associated with cognition across the lifespan; however, little is known about the formation and changes of PHFC over an individual's lifetime. Our primary objective is to estimate the lifespan growth curve of PHFC, with secondary aims to identify brain thickness measures predictive of cognition and explore the association between PHFC and cognitive scores in cognitively normal elders. We employed generalized additive models (GAM) on human connectome project (HCP) data. Findings indicate sex-specific differences in lifespan PHFC patterns, suggesting potential variations in brain function development. Additionally, the identification of brain thickness measures predictive of cognition scores underscores the relationship between brain structure and cognitive function. Exploring correlations facilitates understanding of the potential association between brain function and cognition as people age. Overall, this analysis provides insights into the complex interplay among brain function, structural features, and cognitive abilities.
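
A minimal mgcv sketch of estimating sex-specific lifespan curves with a generalized additive model; the data here are simulated stand-ins for the HCP measures:

    library(mgcv)
    set.seed(11)
    hcp <- data.frame(age = runif(400, 8, 85),
                      sex = factor(sample(c("F", "M"), 400, replace = TRUE)))
    hcp$phfc_bs <- 1 + 0.02 * hcp$age - 2e-4 * hcp$age^2 +
      0.1 * (hcp$sex == "M") + rnorm(400, 0, 0.1)
    fit <- gam(phfc_bs ~ sex + s(age, by = sex), data = hcp)
    summary(fit)
    plot(fit, pages = 1)    # sex-specific smooth age trajectories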

Yuhuan Lin
Modeling Piecewise Relationships in Laboratory Data Using Linear Mixed Effects Model

This report provides a comprehensive overview and statistical analysis of laboratory data pertaining to the assessment of a skin calcium sensor's functionality in a rat model. The study aims to validate the sensor's effectiveness in detecting induced hypercalcemia and hypocalcemia, correlating intradermal calcium variations with serum ionized calcium fluctuations. Student participation involved analyzing laboratory data to determine whether sensor readings and real-time serum ionized calcium movements are concordant. In the laboratory-controlled rats, ionized serum calcium levels rose from 0 to 60 minutes and declined from 60 to 120 minutes. The study involved 15 rats, contributing sensor data at multiple time points over a 120-minute period. To analyze the sensor reading data, linear mixed effects models with a time dummy variable and quadratic regression were employed. The fitted models indicate that sensor readings increase from 0 to 60 minutes and decrease from 60 to 120 minutes, validating the sensor's ability to detect calcium fluctuations accurately. This advancement in calcium sensing technology shows promise for further applications of convenient calcium detection in patients with hypoparathyroidism, offering potential benefits for their medical management.
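
A minimal lme4 sketch of the piecewise trend: a hinge term at 60 minutes lets the slope change sign, with a random intercept per rat (the values are simulated for illustration):

    library(lme4)
    set.seed(5)
    lab <- expand.grid(rat_id = factor(1:15), time_min = seq(0, 120, 15))
    lab$hinge60 <- pmax(lab$time_min - 60, 0)        # slope change at 60 min
    lab$reading <- 10 + 0.05 * lab$time_min - 0.10 * lab$hinge60 +
      rep(rnorm(15, 0, 0.3), times = 9) + rnorm(nrow(lab), 0, 0.5)
    fit <- lmer(reading ~ time_min + hinge60 + (1 | rat_id), data = lab)
    summary(fit)   # hinge coefficient captures the post-60-minute decline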

William Anderson
Uranium exposure and kidney tubule biomarker levels in the Multi-Ethnic Study of Atherosclerosis (MESA)

Mechanistic studies suggest that uranium exposure is toxic for the kidney tubules. We evaluated the association of chronic low-level uranium exposure, as measured in urine, with kidney tubule biomarkers for tubule cell dysfunction (alpha-1-microglobulin [A1M], uromodulin [UMOD], epidermal growth factor [EGF]), tubule cell injury (kidney injury molecule-1 [KIM-1], monocyte chemoattractant protein [MCP-1], human cartilage glycoprotein-40 [YKL-40]), and a biomarker for glomerular function and injury (albuminuria), among participants in the Multi-Ethnic Study of Atherosclerosis (MESA). In the MESA population, 461 individuals with all kidney tubule biomarker measurements, and 4,726 individuals with only albuminuria measurements, were included. Two progressively adjusted linear models were used to calculate the geometric mean ratio (GMR) for each log-transformed kidney tubule biomarker to quantify the effect of uranium exposure on renal function. Statistically significant GMRs, from comparing the 75th and 25th percentiles of urinary uranium, were observed for KIM-1 (1.12 [1.02, 1.24]), for MCP-1 (1.12 [1.03, 1.22]), and for albuminuria (1.35 [1.04, 1.76]), in a model adjusted for sociodemographics. In the sample of 4,726 participants, a statistically significant GMR was observed for albuminuria (1.13 [1.05, 1.23]). We did not find any statistically significant associations for the kidney tubule biomarkers YKL-40, A1M, UMOD, and EGF. A statistically significant association was observed between low exposure levels of uranium and the kidney tubule biomarkers KIM-1, MCP-1, and albuminuria. Our findings provide evidence that uranium exposure may adversely affect kidney tubule health and function at low concentrations, and stricter uranium contaminant monitoring may be necessary to address this issue.
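
A minimal sketch of how a geometric mean ratio arises from a linear model on the log scale; the data are simulated and the covariate set is far smaller than the study's adjustment models:

    set.seed(13)
    mesa <- data.frame(log2_u = rnorm(461), age = rnorm(461, 60, 9),
                       sex = rbinom(461, 1, 0.5))
    mesa$kim1 <- exp(0.1 * mesa$log2_u + rnorm(461, 0, 0.5))
    fit <- lm(log(kim1) ~ log2_u + age + sex, data = mesa)
    exp(cbind(GMR = coef(fit), confint(fit)))["log2_u", ]
    # exponentiated coefficient = GMR per doubling of urinary uranium;
    # scaling by the IQR on the log2 scale gives the 75th-vs-25th contrast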

Session 3 (3:45pm - 4:45pm)

Cardiovascular Disease (HSC LL 108B)

Yucheng Li
Early Onset and Progressive Cardiac Remodeling in Pediatric Sickle Cell Disease: A Longitudinal Echocardiographic Study

Cardiac abnormalities are common in sickle cell disease (SCD), but their onset and progression during childhood remain poorly characterized. This retrospective longitudinal study examined serial outpatient echocardiograms from a cohort of pediatric SCD patients treated at a single center over many years. Left ventricular dimensions, mass, systolic function, and estimated pulmonary pressures were assessed. The analysis revealed that echocardiographic abnormalities began manifesting at very young ages in these patients, with the proportion of children affected progressively increasing through adolescence. By late adolescence/young adulthood, a majority had evidence of abnormal left ventricular geometry, hypertrophy, and elevated pulmonary artery pressures, while systolic dysfunction was less prevalent. Older age, male sex, more severe genotypes, markers of hemolytic anemia, and a history of acute complications emerged as risk factors associated with cardiac abnormalities. The findings indicate that children with SCD develop cardiac remodeling and dysfunction early in the disease course that worsen over time, likely driven by chronic anemia, hemolysis, and overall disease severity. Implementing routine echocardiographic screening in high-risk pediatric SCD patients may allow timely monitoring and interventions to prevent or mitigate cardiovascular complications that increase mortality risk.

Zizhao Lin
Association of public water arsenic exposure with incident fatal and non-fatal CVD

Scientific evidence has shown that chronic high-level arsenic exposure is associated with cardiovascular disease, and drinking water is a suspected source of chronic arsenic exposure. In this study, the association between arsenic in federally regulated community water systems (CWS) and cardiovascular disease incidence is examined within the Multi-Ethnic Study of Atherosclerosis (MESA). Participants were followed for incident CVD from baseline through 2019, with a mean follow-up of 14.9 years. Records with missing values were excluded so that analyses used complete data. A baseline characteristics table and exploratory analyses were produced to check the overall distributions of the variables of interest. To estimate the hazard ratio for water arsenic exposure, Cox proportional hazards mixed-effects models were used to account for potential aggregation by zip code, with adjustment for sex, baseline age, body mass index, smoking status, and education. Of the N = 6,666 participants with complete records, there were 445 (6.7%) CVD death events, and the prevalence was 6.7%, 6.5%, and 7.2% in low, medium, and high water arsenic level regions, respectively. According to the Cox proportional hazards mixed-effects models fitted to the MESA participants, the hazard ratio per doubling in CWS arsenic was 1.04 (0.93, 1.17). The hazard ratio dropped from 1.18 (0.98, 1.41) to 0.91 (0.74, 1.13) as BMI increased from below 25 to above 30, although these hazard ratios were not significant at the α = 0.05 level.

Qilin Zheng
Heart Failure Prediction

Cardiovascular diseases are the primary cause of death worldwide, leading to approximately 17.9 million deaths annually. This research employs a dataset with 11 key features to predict heart disease, using machine learning for early detection and management in high-risk individuals. Firstly, I undertook data cleaning and visualization, transforming the data into a format suitable for analysis. Subsequently, I employed machine learning techniques such as logistic regression, decision trees, and random forests to develop models for predicting the risk of heart disease. Finally, I compared the accuracy of these models to identify the most effective one and also determine the factors closely associated with accurate heart disease prediction. This approach holds significant potential for the prevention and control of heart disease.
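
A minimal randomForest sketch of the model-comparison step, on simulated stand-ins for the dataset's clinical features (names and effect sizes are illustrative only):

    library(randomForest)
    set.seed(17)
    n <- 400
    dat <- data.frame(age = rnorm(n, 55, 10), chol = rnorm(n, 200, 30),
                      max_hr = rnorm(n, 140, 20))
    dat$disease <- factor(rbinom(n, 1,
                          plogis(-8 + 0.1 * dat$age + 0.01 * dat$chol)))
    fit <- randomForest(disease ~ ., data = dat, ntree = 500)
    fit                   # out-of-bag error gives a quick accuracy estimate
    importance(fit)       # which features drive the predictions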

Xuesen Zhao
Investigating the genomic interactions between Alzheimer’s Disease and cardiovascular/cerebrovascular risk factors in the ROSMAP cohort

Alzheimer’s disease (AD) is intricately associated with cardiovascular and cerebrovascular risk factors (CVRFs) observed in middle age and beyond, often culminating in cerebrovascular pathology at death. While the link between CVRFs such as hypertension, obesity, diabetes, and coronary heart disease and AD is established, the mechanistic pathways connecting vascular risk factors to ischemic microvascular pathology remain underexplored. Investigating the nexus between CVRFs and genetic variants may help unravel the complex pathogenesis of AD. In this study, 1092 participants from the Religious Orders Study and Rush Memory and Aging Project (ROSMAP) were selected and categorized into vascular and neurodegenerative subgroups based on their pathology profiles. Logistic regression using normalized RNA-seq data from the dorsolateral prefrontal cortex (DLPFC) region identified 3127 and 367 gene expressions significantly associated with vascular and neurodegenerative pathologies, respectively (p-value < 0.05), followed by pathway analysis to elucidate the underlying biological pathways. Integrating genotype information from a Genome-Wide Association Study (GWAS) enabled expression quantitative trait loci (eQTL) analysis, identifying single nucleotide polymorphisms (SNPs) associated with significant gene expression for each subgroup. One cis-eQTL and 53 trans-eQTLs were identified for the vascular subgroup, whereas no cis-eQTLs and 3 trans-eQTLs were identified for the neurodegenerative subgroup (FDR-corrected p-value < 0.05). Our findings highlight the distinct genetic underpinnings of AD's vascular and neurodegenerative pathologies, underscoring the value of multi-omic integration for identifying novel biomarkers in AD.
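
The gene-wise screening step described here can be sketched as a loop of logistic regressions with Benjamini-Hochberg FDR correction. The expression matrix and labels below are simulated placeholders, not the ROSMAP pipeline itself.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.multitest import multipletests

def screen_genes(expr, labels):
    """Gene-wise logistic regression of subgroup label on expression,
    with Benjamini-Hochberg FDR correction across genes."""
    pvals = []
    for g in range(expr.shape[1]):
        fit = sm.Logit(labels, sm.add_constant(expr[:, g])).fit(disp=0)
        pvals.append(fit.pvalues[1])              # p-value for the gene term
    reject, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
    return np.flatnonzero(reject), qvals

# Toy demo with simulated data (the real analysis would use RNA-seq samples)
rng = np.random.default_rng(0)
expr = rng.normal(size=(100, 30))                 # samples x genes
labels = rng.integers(0, 2, size=100)             # vascular subgroup vs. not
hits, qvals = screen_genes(expr, labels)
```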

COVID-19 Research (HSC LL109B)

Wenjia Zhu
Cytokine and Chemokines Dysregulation in Long COVID Patients

Long COVID (LC), characterized by symptoms persisting beyond 12 weeks post-acute SARS-CoV-2 infection, presents significant diagnostic and prognostic challenges. This study aimed to elucidate the predictive potential of cytokines for LC, offering insights into its complex pathophysiology and identifying biomarkers for early detection and therapeutic targeting. A dataset of 44 COVID-19 patients stratified by recovery status formed the basis for a biostatistical analysis of cytokine dynamics. Employing random forest analysis for biomarker identification and receiver operating characteristic (ROC) analysis for predictive power assessment, supported by non-parametric tests, we identified 10 key cytokines as significant predictors of LC, including PDGF-AA, EGF, TARC, IP-10, sCD40L, IL-8, IL-7, PDGF-AB/BB, and IL-3. Notably, cytokines such as PDGF-AA and EGF, associated with tissue repair, and sCD40L, involved in thrombosis and inflammation, point to their roles in the disease's progression and post-acute sequelae. This research underlines the need for further validation of these biomarkers in larger cohorts and the exploration of targeted interventions to mitigate the long-term effects of COVID-19, thus contributing to ongoing efforts to enhance clinical practice and epidemiological research.

Yahui Zhou
The Paxlovid Rebound Study

This study aimed to investigate the epidemiology of Paxlovid rebound in individuals with acute COVID-19 infection, as concerns over this phenomenon have considerably restricted the adoption of Paxlovid. By prospectively comparing treated and untreated participants, we sought to gain insights into the occurrence and characteristics of Paxlovid rebound, filling the gap in evidence surrounding this phenomenon. A decentralized, digital, prospective observational study was conducted, enrolling COVID-19-positive participants eligible for Paxlovid. Participants were divided into Paxlovid and control groups based on treatment choice. Both groups underwent regular rapid antigen testing and symptom surveys for 16 days to assess viral and symptom rebound. Viral rebound incidence was 14.2% in the Paxlovid group (n=127) and 9.3% in the control group (n=43), while COVID-19 symptom rebound rates were 18.9% and 7.0% respectively. No significant differences in rebound rates were observed based on demographics or major symptom categories. This preliminary analysis indicates higher than previously reported rebound rates post-clearance of positivity or symptoms in both Paxlovid and control groups. Further extensive studies with diverse cohorts and longer follow-ups are necessary for a comprehensive understanding of rebound phenomena.

Zuoqiao Cui
Estimating the Causal Effect of Cigarette Smoking on COVID-19 Severity: A Comprehensive Mendelian Randomization Analysis

COVID-19 emerged in late 2019 and quickly escalated into a global pandemic. Understanding the risk factors associated with severe manifestations of the disease has therefore become imperative. Cigarette smoking, a well-known detriment to respiratory health, has been speculated to exacerbate COVID-19 outcomes, yet the causal pathways remain unclear. This project aims to estimate the causal effect of cigarette smoking on hospitalized and severe COVID-19 using different Mendelian randomization (MR) methods and to compare the results. The project employs LD clumping to eliminate correlated genetic variants, with the remaining variants selected as instrumental variables (IVs). MRCIP, MR-Egger, the median-based method, and the contamination mixture method are utilized to estimate the causal effect of cigarette smoking on hospitalized and severe COVID-19 using the selected IVs. All MR methods indicate a positive effect of cigarette smoking on both hospitalized and severe COVID-19 cases. Among these methods, MRCIP and MR-Egger report significantly different estimation results compared to the median-based and contamination mixture methods. By evaluating the significance of the Correlated Pleiotropy Index (CPI), we confirm that the Instrument Strength Independent of Direct Effect (InSIDE) assumption is valid. Therefore, MR-Egger is more likely to provide an accurate estimation than the median-based and contamination mixture methods, as the latter two depend on a portion of valid IVs, which may not be present in the data. Meanwhile, MRCIP does not rely on either assumption and thus offers a reliable estimation.
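
For reference, MR-Egger reduces to a weighted regression of variant-outcome effects on variant-exposure effects with a free intercept (the directional pleiotropy term). A minimal sketch, assuming hypothetical per-SNP summary statistics:

```python
import numpy as np
import statsmodels.api as sm

def mr_egger(beta_exp, beta_out, se_out):
    """MR-Egger: WLS of outcome effects on exposure effects with intercept."""
    sign = np.sign(beta_exp)            # orient SNPs so exposure effects are positive
    bx, by = beta_exp * sign, beta_out * sign
    fit = sm.WLS(by, sm.add_constant(bx), weights=1.0 / se_out**2).fit()
    return fit.params[1], fit.params[0]  # (causal slope, pleiotropy intercept)

# Toy demo: 50 hypothetical SNPs with a true causal effect of 0.3
rng = np.random.default_rng(0)
bx = rng.normal(0.1, 0.05, 50)
se = np.full(50, 0.02)
by = 0.3 * bx + rng.normal(0, se)
print(mr_egger(bx, by, se))
```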

Xin Ren
Predicting Mutations in SARS-CoV-2 with a Deep Learning Model

The emergence and rapid evolution of SARS-CoV-2 pose significant challenges for vaccine development and public health responses. In this study, we employ a deep learning model to predict mutations in the SARS-CoV-2 virus, with a specific focus on the influence of sequence length on prediction accuracy. Utilizing a pre-trained model adapted for genomic sequences, we investigate a range of sequence lengths to identify the optimal input size for accurate mutation forecasting. Data were collected from the GISAID database and analyzed using PyTorch and PyMOL for model training and visualization, respectively. This research not only provides valuable insights into the application of cyber-physical systems and deep learning in biological data analysis but also highlights the critical importance of sequence length in genomic deep learning models. The findings have implications for enhancing computational methods in virology and improving strategies for monitoring viral mutations.

Topics in Data Visualizations (HSC LL 109A)

Chao Gao
Enhancing Clinical Trial Data Review with Interactive Visualization Through R/R Shiny

The biotechnology and pharmaceutical industries are increasingly demanding fast results and comprehensive analysis of clinical data. Traditionally, the robust programming language SAS has been used extensively for statistical analysis and reporting throughout the entire lifespan of drug development. However, scenarios such as safety or data quality reviews of complex data during ongoing trial phases require a more customized approach. This is where R Shiny comes into the picture, offering interactive data exploration that can be deployed either locally or in the cloud and rapidly accessed by cross-functional teams for efficient data filtering and visualization. This initiative aims to provide an elucidative guide to constructing dashboards with the R/R Shiny framework, coupled with a brief demonstration of its integral features for visualizing clinical trial safety data using dummy datasets. Key elements highlighted include summary statistics and prevalent visual representations such as interactive line and box plots; an example of dummy complex data will also undergo smoothing through a generalized additive model for enhanced effect-monitoring visualization.

Fengdi Zhang
Clinical Data Management and Visualization in Medical Device Industry

Medical device clinical studies generate vast amounts of data that continuously evolve throughout the study duration. The primary challenge lies in transforming this data into insights to enable data-driven decision-making. By utilizing SAS to extract, manipulate, and preprocess information from relational databases in electronic data capture systems, followed by Power BI to develop interactive dashboards, we aim to improve data accessibility, review processes, and interactivity. As a result, we have developed sophisticated data visualizations that facilitate real-time monitoring, reporting, and analysis of study data as the study progresses. These dynamic and interactive visualizations, covering a wide scope of data, are centralized on one platform, improving data accessibility and overall productivity. In summary, the combination of SAS and Power BI offers an innovative solution for clinical data management and visualization, empowering clinical research teams to interpret extensive datasets and drive positive patient outcomes.

Jingya Yu
Exploring the Pharmacokinetics of ERAS-007: A Comprehensive Analysis and Visualization

Pharmacokinetic (PK) studies play a crucial role in understanding the behavior of drugs within the body, guiding dosage regimens, and optimizing therapeutic outcomes. In this study, we investigate the PK profile of ERAS-007, a novel therapeutic compound, through a detailed analysis and visualization approach. Utilizing data visualization tools, we present interactive plots that showcase how ERAS-007 concentrations vary over time across different subjects and dosing regimens. Our analysis encompasses data from both the original and updated datasets, enabling automatic report generation and providing insights into any changes in drug behavior. The visualizations offer a comprehensive overview of ERAS-007's pharmacokinetic properties, facilitating a deeper understanding of its absorption, distribution, metabolism, and excretion dynamics. This study sheds light on the intricate interplay between drug administration, the body’s response, and patient-specific factors, contributing to the optimization of ERAS-007 therapy and paving the way for future pharmacological research endeavors.

Yiying Wu
Establishing a Public Health Infrastructure in Vernon, CA: A Comparative Demographic Analysis and GIS Evaluation of Healthcare Accessibility

This project aims to establish a comprehensive public health infrastructure in Vernon, CA, a small city positioned southeast of Downtown Los Angeles. Despite its modest residential population of 210, Vernon is a hub for approximately 35,000 low-wage workers who commute there daily. It borders neighborhoods characterized by high poverty rates and significant homeless populations. Our objective is to develop public health services tailored to the city's needs. To this end, I conducted a thorough analysis of Vernon's demographic factors, such as population density, age distribution, gender ratio, race, and ethnicity, as well as economic indicators like household income and poverty rates, comparing these aspects with adjacent communities like Bell and Huntington Park. Furthermore, I employed Geographic Information System (GIS) technology to examine the spatial distribution of healthcare facilities in Vernon and its vicinity. The findings underscored a pronounced gender disparity and a critical shortfall in public health amenities within the city.

Topics in Epidemiology (HSC LL107)

Tingyi Li
Risk factors of diabetes

Diabetes is a chronic disease in which elevated blood sugar gradually causes severe damage to the heart, blood vessels, and kidneys. Around 422 million people worldwide have diabetes, the majority living in low- and middle-income countries. In this analysis, we use the Pima Indians Diabetes dataset to investigate the risk factors of diabetes. Logistic regression analysis was first used to evaluate the significance of all potential predictors. Given the unbalanced nature of this dataset, in which non-diabetic individuals outnumber those with diabetes, we use machine learning algorithms such as random forest and KNN to further improve the prediction of diabetes. From this comprehensive overview of risk factors, we conclude that glucose and BMI are the most significant predictors contributing to the development of diabetes. Future research is needed to further explore potential risk factors of diabetes and develop more advanced prevention and treatment strategies.

Tvisha Devavarapu
Meta-Analysis of Resistance to First-Line Anti-Tuberculosis Drugs within Central Nervous System Tuberculosis Cases

Central Nervous System Tuberculosis (CNS-TB), a particularly challenging form of TB, is characterized by diagnostic complexities such as inefficient tests and drug resistance issues that contribute to its significant burden on effective TB management. Countries with a high burden of CNS-TB annually account for ~87% of multi-drug resistance cases, which are complex and cost-intensive to address. To assess the burden of resistance to first-line anti-TB drugs in CNS-TB, a team of collaborators performed a systematic review to collect information on demographics, forms of CNS-TB, prevalence of resistance to the first-line anti-TB drugs Isoniazid, Rifampin, Ethambutol, Streptomycin, and Pyrazinamide, and the type of drug resistance observed (mono-resistance, multi-drug resistance, and hetero-resistance). Using this information from a mix of 34 studies (cross-sectional, case-control, and retrospective designs) across 9 countries, this project aimed to identify the prevalence of resistance to first-line anti-TB drugs in CNS-TB cases and to discern potential differences in drug resistance patterns among relevant sub-groups. For this assessment, a meta-analysis of a single proportion (prevalence) using a Generalized Linear Mixed Model was performed with random-effects considerations and the inverse variance method. This approach estimates the prevalence by taking both within-study and between-study variability (heterogeneity) into consideration. Results from these findings can potentially lead to the generation of effective and targeted strategies for better management of CNS-TB cases, particularly in regions with a high burden of multi-drug resistance to first-line anti-TB drugs.
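
As a simplified stand-in for the GLMM approach, a random-effects meta-analysis of a single proportion can be sketched on the logit scale with the inverse-variance (DerSimonian-Laird) method; the event counts and sample sizes below are hypothetical inputs, not the study data.

```python
import numpy as np

def pooled_prevalence(events, n):
    """Random-effects (DerSimonian-Laird) pooling of logit-prevalences."""
    events, n = np.asarray(events, float), np.asarray(n, float)
    p = (events + 0.5) / (n + 1.0)                    # continuity correction
    y = np.log(p / (1 - p))                           # logit prevalence
    v = 1.0 / (events + 0.5) + 1.0 / (n - events + 0.5)
    w = 1.0 / v                                       # fixed-effect weights
    q = np.sum(w * (y - np.sum(w * y) / w.sum()) ** 2)
    tau2 = max(0.0, (q - (len(y) - 1)) / (w.sum() - np.sum(w**2) / w.sum()))
    w_star = 1.0 / (v + tau2)                         # random-effects weights
    mu = np.sum(w_star * y) / w_star.sum()
    return 1.0 / (1.0 + np.exp(-mu))                  # back to proportion scale

# Toy demo: resistant cases / tested isolates in four hypothetical studies
print(pooled_prevalence([5, 12, 3, 30], [60, 150, 45, 210]))
```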

Anzhuo Xie
Stroke Data and Underlying Factors Analysis

According to the WHO, stroke is the second leading cause of death worldwide. More attention and funding should be devoted to raising people’s awareness and helping them cultivate healthy habits to prevent stroke. This study used a hospital dataset and aimed to identify the underlying risk factors of stroke. In the data analysis and modeling section, a logistic model and a random forest model were used. The logistic model had better performance, with a higher AUC-ROC. The results showed that the leading risk factors for stroke are age, hypertension level, glucose level, and work type, indicating that we need to focus on these four factors to decrease stroke risk; the study makes recommendations on how to prevent strokes from these four perspectives.

Olivia Schulist
Migration and Nutrition Epidemiology in Mexico and the United States

The International Standard Classification of Occupations (ISCO) occupational code index, created by the United Nations International Labor Organization (ILO), was developed to provide “a basis for the international reporting, comparison, and exchange of statistical and administrative data about occupations; a model for the development of national and regional classifications of occupations; and a system that can be used directly in countries that have not developed their own national classifications.” The index was developed in a European context, first published in 1958, and last revised in 2008. The 1988 version of the ISCO index has been translated into Spanish. During her APEx, Olivia assessed the limitations of directly applying these codes within the context of Mexico. She additionally conducted data analysis for Columbia University’s food pantry. She performed data visualization and tested the following hypotheses: (1) the distribution of food security status scores did not differ significantly from the previous analysis, with about 53% of food pantry visits occurring during experiences of very high food insecurity (H0: P = 52.9%, α = 0.05); and (2) the observed proportion of individuals experiencing food insecurity exceeded the national proportion of individuals experiencing food insecurity of 3.8% (0.6513 > 0.038).

Akbobek Amangeldi
Preparedness for Future Disease Outbreaks in the United States

The recent COVID-19 pandemic changed the landscape of public health and response to infectious disease outbreaks. Even though this public health emergency officially ended, we continue to experience periodic increases in a wide range of respiratory infectious diseases. Thus, adhering to recommended actions to prevent infectious disease spread and future outbreaks remains important. In May 2023, Heluna Health conducted a national online panel survey of 4,498 adults in the United States. The survey assessed public opinions regarding risk for future outbreaks, preferred sources of outbreak information, and preparedness for future outbreaks. Survey results indicate moderate-to-high levels of outbreak preparedness among the majority of U.S. households, as indicated by self-reported levels of preparedness and ability to follow federal guidelines. In addition, the majority of U.S. adults said they believe that vaccinations are important to prepare for disease outbreaks, and that they would be willing to be vaccinated if recommended by health authorities. Despite these results, we identified areas in which there is room for improvement in outbreak preparedness. Even though most U.S. households would be moderately or highly prepared, a small percentage of adults reported that their households would be unprepared. One of the areas in which adults were least prepared was in the ability to isolate sick family members if needed, as one in five adults reported being unable to isolate a sick household member in their own household. Also, we identified a large knowledge gap (among ~20% of adults) regarding how to use or install high efficiency air filters. We identified disparities in overall levels of household preparedness according to age, race, ethnicity, income, and education. Specifically, adults who were younger, of Black or African American race, Hispanic ethnicity, with less than a high school education, or with annual household incomes <$35,000 reported being the least prepared. The findings suggest that while a significant portion of U.S. households reported moderate to high levels of preparedness for future disease outbreaks and willingness to get vaccinated, there are areas for improvement. Identified disparities can be used by public health agencies and community leaders to strengthen preparedness for future outbreaks. 

Machine Learning and Dimension Reduction (HSC LL108A)

Qinzhen Sun
Comparative Analysis of Penalized Regression Models with High-dimensional Imaging Biomarkers in a Large-scale Alzheimer’s Disease Study

This project investigates the predictive and variable selection capabilities of various penalized regression models that can handle high-dimensional features using the Alzheimer's Disease Neuroimaging Initiative Phase I (ADNI 1) MRI imaging data, aiming to identify important biomarkers and enhance diagnosis of Alzheimer’s disease (AD). Both a continuous outcome (mini-mental state examination, MMSE) and a binary outcome (AD vs. normal control) were analyzed with regression models using lasso, SCAD, and MCP penalties, and with the iterative sure independence screening (ISIS) process combined with the lasso, SCAD, and MCP penalties. In terms of prediction performance, the regression model with the MCP penalty achieved the lowest RMSE, at 5.06, for the continuous outcome, while the model with the lasso penalty achieved the highest accuracy, at 0.872, for the binary outcome. Four imaging biomarkers were identified as the most important for the continuous outcome of MMSE score: the volumes of the left hippocampus and right fusiform, and the average cortical thickness of the left inferior temporal and right entorhinal regions. For classifying AD versus normal control individuals, five imaging biomarkers were identified as the most important: the volume of the left hippocampus, and the average cortical thickness of the left and right entorhinal, left middle temporal, and right inferior parietal regions. Regarding variable selection, the model with the lasso penalty tends to select an inflated number of features, while the model with the SCAD penalty demonstrates robust selection for both binary and continuous outcomes; the model with the MCP penalty selects a notably smaller number of features than the other models. Incorporating the ISIS process with the different penalties affects prediction performance only marginally but achieves much higher consistency in variable selection across the three penalties. These findings underscore the importance of selecting appropriate modeling techniques in AD research to advance our understanding of the disease and facilitate effective clinical interventions.
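
The lasso component of such a comparison can be sketched in scikit-learn; SCAD and MCP have no scikit-learn implementation (in practice they come from R packages such as ncvreg), so only the lasso models are shown, and the arrays below are simulated placeholders for the imaging features and outcomes.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# Hypothetical placeholders: imaging feature matrix and two outcomes
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))        # n subjects x p imaging biomarkers
mmse = rng.normal(27, 2, size=200)    # continuous outcome (MMSE)
dx = rng.integers(0, 2, size=200)     # binary outcome (AD vs. control)

Xs = StandardScaler().fit_transform(X)

lasso = LassoCV(cv=5).fit(Xs, mmse)               # continuous outcome
selected = np.flatnonzero(lasso.coef_)            # indices of retained features

logit = LogisticRegressionCV(cv=5, penalty="l1",
                             solver="liblinear").fit(Xs, dx)  # binary outcome
```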

Runze Cui
Regularized Generalized Method of Moments

This paper explores the Generalized Method of Moments (GMM), an esteemed statistical technique celebrated for its broad applicability across diverse fields, including biostatistics. Introduced by Hansen, a Nobel laureate in economic sciences, GMM excels in parameter estimation by leveraging moment conditions derived directly from sample data, encompassing means, variances, skewness, and kurtosis. Distinguished by its versatility, GMM is adept at addressing complex models marked by nonlinearity, heteroskedasticity, or autocorrelation, offering a significant advance over the traditional Method of Moments by accommodating scenarios where estimating equations outnumber the parameters. Within biostatistics, GMM's robustness against model misspecification and minimal reliance on stringent assumptions render it invaluable for analyzing complex biological data. Our project enhances GMM's methodology by integrating a regularization component, specifically an L1 penalty, to improve parameter estimation efficiency, validated through simulation results showcasing the augmented estimator's performance.
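
A toy sketch of the penalized objective: GMM minimizes a quadratic form in the sample moment conditions, here with three overidentified moments of a normal distribution and an added L1 term. This is purely illustrative, not the project's estimator.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=500)

def moments(theta):
    mu, sigma2 = theta
    return np.array([
        x.mean() - mu,                              # first moment
        ((x - mu) ** 2).mean() - sigma2,            # second central moment
        ((x - mu) ** 4).mean() - 3 * sigma2 ** 2,   # normal fourth moment
    ])

def objective(theta, lam=0.1):
    g = moments(theta)
    return g @ g + lam * np.abs(theta).sum()        # identity weight + L1 penalty

fit = minimize(objective, x0=np.array([0.0, 1.0]), method="Nelder-Mead")
print(fit.x)   # penalized GMM estimates of (mu, sigma2)
```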

Ruiji Pan
Using Machine Learning Methods to Predict Customer Segmentation

In this IQVIA internship, I was responsible for data cleaning and analysis of customer behavior data and for predicting customer segmentation. KNN, support vector machine, random forest, logistic regression, naïve Bayes, and XGBoost methods were used to predict the segmentation. The potential features used include gender, marital status, graduation status, profession, work experience, spending score, family size, and variable1. The study applied principal component analysis to further improve accuracy. The results show a classification accuracy of around 60%.

Zhaoqiany Xiong
Investigating the Model Performance in Multi-Pollutant Mixtures with Complex Confounders

A novel method, Bayesian kernel machine regression (BKMR), has emerged as a promising approach for estimating the complex relationships between exposure and health outcomes. BKMR offers a flexible and parsimonious means of estimating multivariable exposure-response functions. However, questions persist regarding its performance in scenarios with multiple confounders, particularly when these confounders exhibit moderate to high correlation and intricate relationships with the outcome of interest. This report aims to explore the performance of BKMR and compare it with alternative methods, such as BART in the presence of multiple confounders, focusing on its robustness and applicability in diverse epidemiological contexts. We set the causal estimand as the change in the mean of outcome Y when the exposure transitions from one value to another. To enhance the performance of the BKMR model, we employ different propensity score adjustment methods, including Multivariate Generalized Propensity Score (mvGPS), Covariate Balancing Propensity Score (CBPS), and Generalized Boosted Models (GBM). Our simulation study reveals that the BKMR model adjusted by mvGPS minimizes Relative Bias. However, results from data application show minimal differences among the methods. Our findings underscore the absence of a universally superior method; rather, tradeoffs between bias and variance exist, potentially leading to under- or over-fitting. This study contributes to a deeper understanding of BKMR's utility. Further investigation into techniques such as hierarchical variable selection may enhance model performance, particularly in settings with high multicollinearity.

Social Behavior Research (HSC LL209A)

Melike Aksoy
Marital Status and Sedentary Behaviors in Older Adults

Although there is an association between marital status and health outcomes, the relationship between sedentary behaviors and marital status is unclear. Data for the current study come from the Adult Changes in Thought (ACT) study at Kaiser Permanente Washington. Participants in the ACT study, aged 65 and older and without dementia, reported marital status and sedentary behaviors between 2016 and 2020. We investigated the cross-sectional relationship between marital status and sedentary behaviors. Sedentary behaviors included time spent sitting and time spent sitting or lying down while using a computer, watching television, or reading. Additionally, we included covariates such as educational attainment, social support, depressive symptoms, age, sex, and number of comorbidities. Of 2253 participants, 56% (n = 1254) were married or living with a partner, 18% (n = 397) were divorced or separated, 6% (n = 137) were never married, and 21% (n = 465) were widowed. Across all marital groups, sedentary sitting time was similar at 8-9 hours (interquartile range (IQR) = 6-11 hours). Widowed participants spent more sedentary time watching TV (median 3-4 hours, IQR = 2-5 hours, versus median 2-3 hours, IQR = 1-4 hours for married participants). Sedentary time using a computer was similar for the married or living with a partner, divorced or separated, and never married groups, while the widowed group spent less time than the other groups. We found that while marital status and overall sitting time were not related, the marital status groups distributed their sedentary time differently; in particular, the widowed group spent their sedentary time differently than the other groups.

Jingyi Tian
Analysis of Influencing Factors of Online Public Opinion Dissemination on Public Health Emergencies: Based on Weibo Platform

With the rise of online social media, users can participate more in opinion formation and dissemination processes. Social events such as public health emergencies are more likely to trigger online public opinion, and improper handling will lead to social disorder. It is of great significance to understand the characteristics of public opinion dissemination and analyze its influencing factors to guide the trend of public opinion. This study investigates the impact of various factors on the dissemination effects of Weibo blogs under the topic of COVID-19, using a zero-inflated model to analyze blogger and blog factors. Results indicate that media credibility indices like verification and follower count play crucial roles in expanding dissemination scope and enhancing user interaction. Additionally, active users are more likely to garner visibility and engagement. However, posting frequency negatively impacts interaction, with users preferring higher-quality content. Moreover, the timing of post publication significantly affects interaction counts, with peak user activity observed between 18:00 and 24:00. Overall, this study provides valuable insights into the mechanisms of blog dissemination during public health emergencies on social media platforms, aiding in guiding relevant policies and practices.
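
A zero-inflated count model of this kind can be sketched with statsmodels; the simulated interaction counts and blogger features below are hypothetical stand-ins for the Weibo data described above.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedPoisson

# Hypothetical per-post data: interaction counts and blogger features
rng = np.random.default_rng(0)
n = 800
verified = rng.integers(0, 2, n)
log_followers = rng.normal(8, 2, n)
lam = np.exp(-2 + 0.5 * verified + 0.3 * log_followers)
counts = np.where(rng.random(n) < 0.4, 0, rng.poisson(lam))  # excess zeros

X = sm.add_constant(np.column_stack([verified, log_followers]))
model = ZeroInflatedPoisson(counts, X, exog_infl=np.ones((n, 1)),
                            inflation="logit")   # constant zero-inflation part
result = model.fit(method="bfgs", maxiter=500, disp=0)
print(result.params)   # count-model and zero-inflation coefficients
```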

Xiangyi He
Optimizing School-Based Caries Prevention: A Comparative Study of Simple vs. Complex Regimens in Underserved Communities

This study assesses the effectiveness of two caries prevention programs in high-risk school-age children from low-income Hispanic/Latino communities with limited oral health care access. We compare a "simple" regimen (silver diamine fluoride, SDF + fluoride varnish) against a "complex" regimen (traditional sealants, therapeutic sealants + fluoride varnish), hypothesizing that the simple approach is equally effective in caries arrest and prevention. Utilizing a school-based cluster randomized controlled trial design, the project aims to enhance access to oral health care and address care barriers. The secondary analysis in this project investigates the clinical efficacy of SDF, particularly the frequency of application required for successful caries arrest. Given the variability in SDF application protocols among dentists and the existence of non-responders, this aspect of the study seeks to determine the optimal frequency of SDF application to overcome treatment failures. Using data from the CariedAway trial, we aim to refine SDF application recommendations for school-based care, employing mixed-effects multilevel models for analysis. This research not only explores effective caries prevention strategies but also aims to improve oral health access and outcomes in underserved communities, potentially informing best practices in school-based oral health programs.

Farizah Rob
Social relationships age the immune system in early midlife: Evidence from the National Longitudinal Study of Adolescent to Adult Health

Aging of the immune system is characterized by changes in the T-cell compartment, including a decrease in naive T-cells and an increase in memory T-cells. Previous research in older adults (age 55+) found stress exposures to be predictive of accelerated immune aging. However, social relationship characteristics, established to be linked to stress mechanisms, have not been widely studied in relation to these adaptive immune biomarkers. Moreover, most population-level research on immune aging has been conducted in older adults. We therefore examined associations between social relationships, in terms of quantity and quality, and immune aging in a U.S.-representative early midlife population (ages 33-44). CD4+ memory:naive and CD8+ memory:naive ratios were obtained from DNA methylation of venous blood samples collected during Add Health Wave V (n = 4,453). Constructed social relationship variables include the social network index (integration within the community), the close contacts index (integration with friends and family), and the quality of relationships with spouse/partner, friends, and family. Survey-weighted least squares regression was conducted to quantify the association between social relationships and each log-transformed immune ratio, adjusted for age, sex, race/ethnicity, and education. Higher social integration, indicated by the close contacts index, was associated with a 0.16 (95% CI: -0.26, -0.06) log-unit decrease in covariate-adjusted CD4+ memory:naive ratio, compared to social isolation. Higher-quality family relationships were associated with a 0.10 (95% CI: -0.18, -0.02) log-unit decrease in covariate-adjusted CD4+ memory:naive ratio, compared to low-quality family relationships. We hypothesized that higher social integration and higher relationship quality would be associated with a less aged immune system, and this was supported for close contacts and family relationships, respectively. Our results also suggest that, among early midlife adults, CD4+ memory:naive ratios might be more informative than CD8+ memory:naive ratios.
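
The survey-weighted regression can be sketched with weighted least squares in statsmodels; the simulated variables below are hypothetical, covariates are assumed already numerically coded, and robust standard errors stand in only loosely for full design-based variance estimation.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical stand-in for the analysis file (names are illustrative)
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "cd4_ratio": rng.lognormal(0.2, 0.5, n),       # CD4+ memory:naive ratio
    "close_contacts": rng.integers(0, 5, n),
    "family_quality": rng.integers(0, 3, n),
    "age": rng.normal(38, 3, n),
    "sex": rng.integers(0, 2, n),
    "education": rng.integers(0, 4, n),
    "survey_weight": rng.uniform(0.5, 2.0, n),     # Wave V survey weights
})

y = np.log(df["cd4_ratio"])                        # log-transformed immune ratio
X = sm.add_constant(df[["close_contacts", "family_quality",
                        "age", "sex", "education"]])
fit = sm.WLS(y, X, weights=df["survey_weight"]).fit(cov_type="HC1")
print(fit.summary())   # robust SEs approximate design-based variance loosely
```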

Statistical Genetics (HSC LL209B)

Tianshu Liu
Evaluation of Eigen Higher Criticism, Eigen Berk-Jones, and Omnibus Tests on Multi-Trait GWAS Summary Statistics

In genome-wide association studies (GWAS), one common research goal is to test the association between a single nucleotide polymorphism (SNP) and multiple correlated traits based on summary statistics, to reveal the genetic architecture of human traits and diseases. Traditional GWAS analyses are generally focused on the univariate level, testing the association between one SNP and each trait separately, an approach considered less powerful. It is therefore desirable to bring multi-trait methods and dimension reduction techniques into GWAS analysis. The eigen Higher Criticism (eHC) and eigen Berk-Jones (eBJ) tests, together with an omnibus (OMNI) test that combines a range of tests by assigning weights, have been shown to maintain correct type I error and achieve high power as multi-trait methods in both simulation studies and real GWAS summary data. This practicum project tests the performance of these proposed tests in settings with 4, 8, and 12 correlated traits. Simulation studies are performed to estimate and compare type I error and power across the different tests. These methods are also applied to publicly available multi-trait GWAS summary data to identify additional SNPs.

Yixuan Jiao
Assumption Validation and Comparison across Different Spatially Variable Genes Identification Methods

Spatially resolved transcriptomics (SRT) is a method that maps gene expression in tissue while preserving spatial information. A common analysis task utilizing SRT data is to identify genes whose expression varies across a tissue or specific cell domain, defined as spatially variable genes (SVGs). Current tools for identifying SVGs are based on different statistical methodologies, such as standard spatial autocorrelation, different variations of the Gaussian process, and nonparametric covariance tests. This project selects representative SVG detection methods, including SpatialDE, nnSVG, SPARK, SPARK-X, and Moran's I, and compares across them. The objective of the project is to validate the model assumptions of each method by conducting hypothesis tests and visualizing the input distributions. The performance and detection results of each method are also evaluated and compared using both real and simulated spatial omics data. The evaluation includes visualizing the overlap of SVGs across detection methods, the consistency of detection results within the same method (real vs. simulated data) and across methods (on the same spatial omics dataset), and the computational performance of each method. Even though disparities exist among these methods, the advantages and tradeoffs of each SVG detection method are summarized and recommendations are made on that basis. This project gives insight into the characteristics of each method and can serve as a guide for practitioners engaged in spatial transcriptomics data analysis.
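
The simplest method in this comparison, Moran's I, can be sketched directly; the symmetrized k-nearest-neighbor binary weight matrix below is one common choice of spatial weights, not necessarily the one each package uses.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def morans_i(expr, coords, k=6):
    """Moran's I for one gene over spot coordinates, using a symmetrized
    k-nearest-neighbor binary weight matrix."""
    W = kneighbors_graph(coords, n_neighbors=k, mode="connectivity").toarray()
    W = (W + W.T) / 2.0                      # symmetrize the neighbor graph
    z = expr - expr.mean()
    return (len(expr) / W.sum()) * (z @ W @ z) / (z @ z)

# Toy demo: a gene whose expression follows a spatial gradient
rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(400, 2))
expr = coords[:, 0] + rng.normal(0, 0.5, 400)   # spatially variable gene
print(morans_i(expr, coords))                   # well above 0 for an SVG
```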

Yunxi Yang
Robust multi-source data integration for enhanced ADRD risk prediction for underrepresented populations

This study investigates the role of single nucleotide polymorphisms (SNPs) in Alzheimer's Disease and Related Dementias (ADRD) across African (AFR) and European (EUR) populations to identify genetic variations influencing disease risk. The aim is to understand ethnic disparities in ADRD predisposition to guide more effective interventions. Utilizing data from the 1000 Genomes Project and GWAS, we analyzed 374 and 352 SNPs associated with ADRD in the two populations, respectively. Methods included imputation, principal component analysis for population structure adjustment, and ridge and lasso regression for data analysis. Significant SNP clusters associated with ADRD risk were identified, revealing pronounced differences between the AFR and EUR populations. These findings highlight the genetic diversity in ADRD predisposition, suggesting these SNP clusters as potential genetic markers for ADRD risk assessment. The analysis supports the potential of personalized medicine in ADRD prevention and emphasizes the importance of considering genetic diversity in ADRD risk assessment. Future research should explore the functional impacts of these SNP clusters and their role in ADRD pathogenesis, aiming to improve strategies for ADRD management across diverse populations.

Jingyi Yao
A computational method for the detection of cell type-specific gene-gene correlation in multi-sample single-cell RNA-seq data

Gene-gene correlations are pivotal for uncovering gene collaborations among cells. However, accurately measuring these correlations in multi-sample single-cell RNA-seq data remains unresolved. Current methods overlook sample variability, leading to bias in the results. To address this, we propose a novel gene-gene correlation measurement method that considers inter-sample variations, and benchmark it against existing methods using simulated and real datasets. The simulation results show that our method can effectively mitigate false positives in calculating cell type-specific gene-gene correlations. Its performance on real data also outperforms the existing methods.

Survival Analysis (HSC LL207)

Kyle Schichl
Characterizing Apathy in ADNI using Joint Latent Class Linear Mixed Models

Apathy is one of the most common symptoms in mild cognitive impairment (MCI) and Alzheimer’s disease (AD) populations, serving as a clinical hallmark and affecting the quality of life of both patients and caregivers. Understanding the longitudinal trajectory of apathy and its association with other psychiatric symptoms could facilitate treatments tailored to individual patient needs. We examined a subset of 673 patients who were diagnosed with MCI at baseline and were subsequently assessed at least twice during the course of the study. We employed a joint latent class linear mixed model to evaluate the longitudinal trajectory of apathy in combination with a survival model estimating the time until onset of Alzheimer’s disease. We identified four latent classes based on the joint trajectory of apathy and Alzheimer’s disease progression. These four groups exhibited distinct survival trajectories, with those demonstrating an increasing incidence of apathy showing the poorest survival outcomes according to Kaplan-Meier estimates. Additionally, these latent groups exhibited significantly higher AV45-SUVR ratios at baseline, reinforcing findings that apathy, as a hallmark symptom of Alzheimer’s disease, is closely associated with Alzheimer’s pathology. Further research is needed to incorporate multiple Neuropsychiatric Inventory (NPI) outcomes alongside Alzheimer’s disease progression to comprehensively elucidate the complex interplay between psychiatric symptoms and the progression of Alzheimer’s disease.

Youlan Shen
A Comparative Analysis of Cox Hazards and Generalized Linear Models in Evaluating the Impact of PM2.5 Exposure on Mortality Rates

This project addresses the comparative effectiveness of the Cox hazards model and generalized linear models in evaluating the influence of an environmental exposure parameter—PM2.5—on mortality rates using an air pollution and health dataset. The study aims to determine whether these models yield divergent results and what assumptions inherent to each model might drive these differences. Environmental exposure data were simulated and applied to various epidemiological models, including the Andersen and Gill model, the Prentice, Williams, and Peterson model, and generalized linear models (generalized nonlinear model, generalized linear model, generalized additive model). The comparison focused on the coefficient results and variances for PM2.5 exposure. Findings from the simulation study reveal minimal differences in parameter estimation between the log-linear and Cox’s survival models; however, it was observed that log-linear models typically offer greater computational efficiency than the Cox model.
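
The core of this comparison can be sketched by fitting a Cox model and a log-linear Poisson model (with log follow-up time as an offset) to the same data and comparing the PM2.5 coefficients; the simulated data below are illustrative, not the project's simulation design.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from lifelines import CoxPHFitter

# Simulated PM2.5 exposure with exponential survival times (illustrative)
rng = np.random.default_rng(0)
n = 2000
pm25 = rng.normal(10, 3, n)
time = rng.exponential(1.0 / (0.01 * np.exp(0.05 * (pm25 - 10))))
event = (time < 15).astype(int)
time = np.minimum(time, 15)                       # administrative censoring
df = pd.DataFrame({"time": time, "event": event, "pm25": pm25})

cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")

glm = sm.GLM(df["event"], sm.add_constant(df[["pm25"]]),
             family=sm.families.Poisson(),
             offset=np.log(df["time"])).fit()     # log-linear with time offset

print(cph.params_["pm25"], glm.params["pm25"])    # log-HR vs. log-rate ratio
```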

Wenjing Yang
Circadian polygenic risk scores determining hyperglycemia in the Multi-Ethnic Study of Atherosclerosis

The study analyzes the association between hyperglycemia and circadian polygenic risk scores (PRS) and lifestyle factors in the Multi-Ethnic Study of Atherosclerosis (MESA). The outcomes in this research included measurements of glucose, insulin, the HOMA index, BMI, and waist circumference, along with the incidence of obesity and the incidence of diabetes. To study the associations between these outcomes and three chosen polygenic risk scores (chronotype, sleep duration, and short sleep duration), linear regression and logistic regression models were used, adjusting for sociodemographic factors, health factors, and medication. Cox proportional hazards regression was also applied to analyze the incidence of diabetes and obesity. To account for plausible interactions, models included interaction terms such as race, sex, and lifestyle, and their significance was assessed. Overall, the scientific question is to use survival analysis to study global and ethnic-specific circadian polygenic risk scores determining hyperglycemia in MESA.

Haoyu Tian
Study on the Early Prediction of Myopia in Children and Adolescents by the Interaction between Environment and Genes

This comprehensive research study, based on data collected from Beijing Tongren Hospital, aims to establish predictive models for childhood myopia using artificial intelligence (AI) and traditional statistical methods. The study collected data from over three thousand primary school students of varying genders, spanning six years of eyesight records. The primary objectives were to develop a myopia prediction model incorporating genetic and environmental factors and to observe the impact of these gene-environment interactions on model performance.

Transfer Learning (HSC LL210)

Zhengwei Song
Enhancing Breast Cancer Risk Prediction in Hispanic Women Through Transfer Learning

Breast cancer remains a significant public health challenge, particularly among Hispanic women who experience disparities in risk prediction and outcomes. This practicum project explored the potential of transfer learning models to enhance breast cancer risk prediction in this underrepresented group. Utilizing high-dimensional genotype data from the NIH's dbGap database, the project employed advanced high-dimensional regression and transfer learning algorithms, focusing on single nucleotide polymorphism (SNP) selection methods to identify relevant genetic markers. Through the innovative application of SNP clumping and employing high-performance computing, the study aimed to construct and evaluate improved risk prediction models specifically for the Hispanic population. The methodologies included two novel transfer learning strategies that leverage genetic data across different ancestral populations to improve predictive accuracy, as well as two benchmark methods for comparison. The project demonstrated a significant improvement in risk prediction metrics after applying transfer learning methods. These findings highlighted the potential of transfer learning to improve breast cancer risk prediction among Hispanic women. 

Hongjie Liu
Transfer Learning of Conditional Average Treatment Effect

Predicting the treatment effect of a medication given a patient’s covariates is a common task in personalized medicine. However, labeled data often come from specific clinical trials or observational studies that may not accurately represent the target population, posing a challenge for generalization. We aim to learn conditional average treatment effects (CATE) with minimal mean squared error over a target distribution, leveraging both unlabeled data from the target population and labeled data with potentially different feature distributions. We introduce new algorithms for estimating CATE within the kernel ridge regression (KRR) framework, inspired by established CATE estimators such as RA-Learner and independent learner, but uniquely addressing the covariate shift problem through a pseudo-labeling technique. Our algorithms involve dividing the labeled data into subsets and conducting KRR on them separately to obtain a collection of candidate models and an imputation model, the latter used to impute missing labels and select the optimal candidate model. Our excess risk bounds and simulation studies demonstrate that our algorithms can adapt to the target distribution as well as the covariate shift, offering improved reliability in general scenarios.
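
As a point of reference, the independent-learner baseline in the KRR framework amounts to fitting separate kernel ridge regressions on treated and control units and differencing their predictions; the transfer and pseudo-labeling machinery of the proposed algorithms is omitted in this sketch, and the data are simulated.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def independent_learner_cate(X, y, treated, X_new, alpha=1.0):
    """Fit a kernel ridge regression per arm; CATE = difference of fits."""
    m1 = KernelRidge(kernel="rbf", alpha=alpha).fit(X[treated], y[treated])
    m0 = KernelRidge(kernel="rbf", alpha=alpha).fit(X[~treated], y[~treated])
    return m1.predict(X_new) - m0.predict(X_new)

# Toy demo: the true CATE is 2 * x1
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
treated = rng.random(500) < 0.5
y = X[:, 0] + treated * (2 * X[:, 0]) + rng.normal(0, 0.1, 500)
print(independent_learner_cate(X, y, treated, X[:5]))
```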

Zijian Xu
Transfer Causal Learning with Instrumental Variables

We explore the integration of transfer learning with instrumental variable methods to address causal inference challenges. Traditional instrumental variable regression techniques often fall short when faced with complex, high-dimensional relationships in nonlinear treatment models. Our approach leverages transfer learning and machine learning to improve prediction accuracy in the first stage of instrumental variable regression, thereby enhancing the overall estimation quality of the treatment effect. Using transfer learning, we harness auxiliary data sources to enrich the information set available for the target estimation task, addressing scenarios where direct estimation faces limitations due to data scarcity or distribution biases. Through a series of simulations, we demonstrate the advantages of our method over traditional approaches, highlighting its potential to yield more accurate and reliable causal estimates.
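
A hedged sketch of the two-stage idea on simulated data: a flexible machine-learning first stage predicts the treatment from instruments and covariates, and the fitted values enter an ordinary least squares second stage. Valid inference requires additional care (e.g., sample splitting), which is omitted here.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor

# Toy data: instrument Z, covariate X, endogenous treatment D, outcome Y
rng = np.random.default_rng(0)
n = 2000
Z = rng.normal(size=n)
X = rng.normal(size=n)
U = rng.normal(size=n)                        # unobserved confounder
D = np.sin(Z) + 0.5 * X + 0.5 * U + rng.normal(0, 0.3, n)  # nonlinear in Z
Y = 1.5 * D + X + U + rng.normal(0, 0.3, n)   # true effect of D is 1.5

# Stage 1: flexible ML model predicts D from (Z, X)
stage1 = RandomForestRegressor(random_state=0).fit(np.column_stack([Z, X]), D)
D_hat = stage1.predict(np.column_stack([Z, X]))

# Stage 2: OLS of Y on predicted treatment and covariates
stage2 = sm.OLS(Y, sm.add_constant(np.column_stack([D_hat, X]))).fit()
print(stage2.params[1])                       # treatment-effect estimate
```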

Jiajun Tao
Transfer learning in Cox model

Survival analysis plays a pivotal role in fields including medical research, and the Cox proportional hazards model is one of its key tools. We aim to leverage external source data through a transfer learning approach within the Cox model framework to overcome the limitations of small sample sizes. Our novel approach, named CoxTL, simultaneously accounts for heterogeneity in covariate distributions and regression coefficients, enhancing the model's predictive accuracy and robustness. The approach uses density ratio weighting to adjust for covariate shift and a power prior to leverage source data while controlling for potential bias. Performance is evaluated in simulation studies by C-index, in comparison with target-only, source-only, pooled, and partial-likelihood transfer learning models. CoxTL shows potential for improving prediction and remains to be tested on real-world data.
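
The density-ratio weighting step can be sketched as a logistic model separating target from source observations, whose fitted odds serve as importance weights in a weighted Cox fit on the source data; this is a toy stand-in for CoxTL itself, with simulated source and target data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
covs = ["x1", "x2"]

def make_data(n, shift):
    x = rng.normal(shift, 1.0, size=(n, 2))          # covariate shift via mean
    t = rng.exponential(np.exp(-0.5 * x[:, 0]))
    return pd.DataFrame({"x1": x[:, 0], "x2": x[:, 1],
                         "time": np.minimum(t, 3.0),
                         "event": (t < 3.0).astype(int)})

source, target = make_data(1500, 0.0), make_data(300, 0.5)

# Logistic model separating target from source -> density-ratio weights
both = pd.concat([source[covs], target[covs]])
lbl = np.r_[np.zeros(len(source)), np.ones(len(target))]
p = LogisticRegression(max_iter=1000).fit(both, lbl).predict_proba(source[covs])[:, 1]
source["w"] = p / (1 - p)                            # estimated density ratio

cph = CoxPHFitter()
cph.fit(source[covs + ["time", "event", "w"]], duration_col="time",
        event_col="event", weights_col="w", robust=True)
cph.print_summary()
```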