2023 Biostatistics Practicum/APEX Symposium

You can view the titles and abstracts for each session below, organized by topic and room number.

Session 1 (10:00am - 11:00am)

Health Intersecting with Policy, Economics, & Government (HSC 312)

Suning Zhao
Staffing Ratios in U.S. Home Health Care Agencies: A Comparison Before and During the COVID-19 Pandemic
In the U.S., a home health care (HHC) agency is an organization that delivers skilled medical care and support services to individuals in the comfort of their homes. These agencies are typically staffed by trained professionals, such as registered nurses (RNs), licensed practical/vocational nurses (LPN/LVNs), and home health aides (HHAs). Despite the well-known importance of staffing ratios in hospitals and nursing homes [1,2], few studies have been conducted on HHC agency staffing ratios. HHC agencies have faced a significant staff shortage for several years [3], which has been exacerbated by the COVID-19 pandemic [4]. In anticipation of future pandemics, it is crucial to examine the impact of COVID-19 case and death burden on HHC agency staffing ratios.

Erlai Xu
Analyzing COVID-19 infection data and UVC disinfection product sales in different channels
This report presents an analysis of the relationship between COVID-19 infection data and sales of Ultraviolet-C (UVC) disinfection products across different distribution channels. Using Microsoft Excel, the data structure was reorganized so that sales of UVC products could be examined in more depth through pivot tables for each channel. A significant correlation was found between monthly sales of UVC products in the Original Equipment Manufacturer (OEM) channel and monthly COVID-19 cases, with a two-month lag. The findings of this study provide valuable insights into the impact of the pandemic on the demand for UVC disinfection products, and highlight the importance of OEM channel sales in understanding the market dynamics. The results may also serve as a guide for manufacturers and distributors to better predict and respond to fluctuations in demand for these essential products, thus enabling more efficient allocation of resources and improved public health outcomes.
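The lagged correlation described above can be illustrated with a short Python sketch. It is not the Excel workflow used in the report: the function name `lagged_correlation` and the sample numbers are hypothetical, and the series are assumed to be aligned month by month.

```python
from math import sqrt


def lagged_correlation(cases, sales, lag):
    """Pearson correlation between cases in month t and sales in month t + lag."""
    if len(cases) != len(sales):
        raise ValueError("series must have the same length")
    if lag < 0 or len(cases) - lag < 2:
        raise ValueError("lag must leave at least two paired months")
    x = cases[: len(cases) - lag]  # cases in months 0 .. n - lag - 1
    y = sales[lag:]                # sales shifted forward by `lag` months
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Made-up illustration: sales follow cases exactly two months later.
example_cases = [1, 3, 2, 5, 4, 6, 8, 7]
example_sales = [0, 0, 10, 30, 20, 50, 40, 60]
r_lag2 = lagged_correlation(example_cases, example_sales, lag=2)
```

In practice the lag would be chosen by comparing the correlation across several candidate lags rather than fixed in advance.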

Xinyu Dai
Analysis of Non-communicable Diseases and the Specific Medical Donations Demands in 12 Developing Countries
Non-communicable diseases (NCDs), the leading causes of death worldwide, are an increasingly prioritized public health problem, especially in developing countries. Each year, 41 million people die from NCDs, equivalent to 74% of all deaths globally, and 77% of all NCD deaths occur in low- and middle-income countries (WHO, 2022). To improve the accessibility of quality medicines worldwide, particularly the medical products used to treat NCDs, the CMMB Medical Donations Program (MDP) team has been conducting a project to understand the prevalence and risk factors of cardiovascular diseases, diabetes, and cancer, and to investigate the demand for specific drug products in each of its partner countries. Those countries include the Dominican Republic, El Salvador, Ghana, Guatemala, Haiti, Honduras, Jamaica, Kenya, Lebanon, the Philippines, Sierra Leone, and Zambia. The latest country-specific NCD profiles were retrieved from the WHO Noncommunicable Diseases Data Portal, which also reports multiple risk factors such as alcohol consumption, tobacco use, and physical inactivity. The medicine needs lists were obtained from collaborating organizations, along with detailed information on every facility with which those organizations work. Data cleaning, processing, and visualization were carried out in RStudio, and Poisson and linear regression models were fit in SAS for the analysis of both NCDs and medical product needs.

Lijia Zheng
Estimating the financial risks associated with vaccine-preventable diseases in sub-Saharan Africa
Vaccine-preventable diseases (VPDs) can cause financial hardship to households in low- and middle-income countries, especially in settings where local health systems are weak, as in Ethiopia, the second most populous country in Africa. We compute the risk of catastrophic health expenditures (CHE) – out-of-pocket (OOP) medical expenditures surpassing a certain threshold of household consumption expenditures – due to selected VPDs in Ethiopia. We estimated the risk of CHE due to VPDs and pathogens that can be prevented by seven vaccines: Hepatitis B, Haemophilus influenzae type b (Hib), human papillomavirus (HPV), measles, meningococcal bacteria serogroup A (MenA), pneumococcal disease (PC), and rotavirus (RV). We derived the risk of CHE associated with VPDs across wealth quintiles in Ethiopia, drawing on a combination of secondary data sources for OOP expenditures and healthcare utilization associated with the treatment of VPDs. Conditional on being affected by one VPD, the CHE risk for a household was computed based on the likelihood of healthcare utilization, OOP expenditures for VPD treatment, and consumption expenditures.
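One way to read the CHE calculation described above is sketched below in Python. This is only an illustration under stated assumptions: the 10% threshold, the function name `che_risk`, and all numbers are hypothetical, and the study's actual data sources and model are not reproduced here.

```python
def che_risk(p_utilization, oop_costs, consumption, threshold=0.10):
    """CHE risk for households affected by one VPD case.

    p_utilization: assumed probability that an affected household seeks care.
    oop_costs:     OOP treatment cost per household if care is sought.
    consumption:   consumption expenditure per household (same order).
    threshold:     assumed share of consumption that defines CHE.
    """
    if len(oop_costs) != len(consumption) or not oop_costs:
        raise ValueError("need one OOP cost per household")
    # A household faces CHE when its OOP cost exceeds threshold * consumption.
    exceed = sum(
        1 for oop, cons in zip(oop_costs, consumption) if oop > threshold * cons
    )
    return p_utilization * exceed / len(oop_costs)


# Made-up households in a single wealth quintile.
risk = che_risk(
    p_utilization=0.5,
    oop_costs=[20, 50, 5, 80],
    consumption=[100, 1000, 100, 400],
)
```

With these made-up inputs, two of the four households exceed the threshold, so the conditional risk is 0.5 × 0.5 = 0.25. Repeating the calculation per wealth quintile would give the kind of distribution across quintiles that the abstract describes.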

Ruiyang Xu
Master Data and Canadian Health Systems Analytics
As the practicum student at the Canadian Institute for Health Information (CIHI), my main purpose was to support the Data Governance and Standards Office (DGSO) department with master data work and health systems analytics more broadly. Master data is data that is common across different data holdings at CIHI. I assisted the DGSO department with operational activities and projects and had the opportunity to learn more about CIHI and its role in the Canadian health care system. I was also able to refine my data analysis skills in multiple programming languages, such as SAS and R, and my code translation skills between SAS and Python as CIHI worked to transition away from exclusively using SAS. Specifically, I worked with the organization and geography domains of master data at CIHI. For the organization domain, I managed databases by standardizing, updating, and verifying the data quality of over 1500 organizations. I liaised with internal and external collaborators, including but not limited to coworkers across different CIHI teams, correspondents in various provincial governments, staff at public hospitals, and administrative staff at private long-term care facilities. For the geography-domain master data, I prepared graphical and mapping products to visualize the data while updating the geography domain data at the postal code and smaller levels. I taught myself GIS, which complemented my quantitative programming skills well. This APEx position supplemented my current studies well.

Missing Data (HSC 203)

Yiming Li
Semi-supervised Checkerboard Learning
An electronic health record (EHR) is an electronic version of a patient's medical history, including all of the key administrative and clinical data related to that individual. Two common types of missingness in EHR data are missing responses (unlabelled data) and partially missing predictors (incomplete data). We propose a method with guaranteed efficiency that incorporates both types of missing information.
We calculate the conditional mean of the score functions for the naive generalized linear model (GLM) based on the incomplete data and decompose the score functions into two orthogonal parts, one of which is related only to the incomplete data. Thus, the variance of the target parameter estimate is reduced by including more incomplete data. We further apply a similar procedure to the unlabelled data and obtain a more efficient estimate. We have also theoretically verified that semi-parametric methods and many deep learning methods can be used when estimating the distribution of nuisance parameters.
We find that introducing incomplete data reduces the variance by up to 50%, and the introduction of unlabelled data could further reduce the variance by up to 18%. The magnitude of the reduction depends on the distribution of the data. We present a more efficient unbiased estimate relative to GLM while considering a more complicated missing setting compared to the previous works. We also hope to further include the federated learning framework in the study.

Minxiu Shi
How different proportions of missingness will affect performance
This project aims to analyze how different percentages of missingness affect random forest model performance. It is based on real data from NESARC, with which we predict which patients who have depression but no mania at Wave 1 are highly likely to develop mania at Wave 2.

Dantong Zhu
Database Management and User Interface Development: Enhancing Data Organization and Server Maintenance
This project involves the management and maintenance of a database platform, with a focus on cleaning and reorganizing user data, polishing the user interface, and maintaining the back-end database server. The goal is to enhance the user experience by improving the organization of data and streamlining the platform's functionality. Tasks include data cleaning, restructuring and normalization, and designing and implementing user-friendly interfaces. Additionally, ongoing server maintenance ensures smooth operation and minimal downtime. This project requires attention to detail, proficiency in database management, and a strong understanding of user interface design principles. Successful execution of these tasks lets users interact with the database platform more efficiently and effectively.

Hao Zheng
Association of pregnancy and childhood c-peptide levels using HAPO-FUS data
During the 2nd and 3rd trimesters of pregnancy, the body develops a state of physiological insulin resistance which prioritizes glucose delivery to the growing fetus. The HAPO study has shown that pregnancy hyperglycemia is linked to adiposity, hyperglycemia, and impaired glucose tolerance in the child. This project aimed to investigate the connection between pregnancy c-peptide levels and offspring c-peptide levels, measured during the oral glucose tolerance test (OGTT).
We first performed multiple imputation, assuming data were missing at random, using HAPO-FUS data from the NIH-NIDDK repository. Multiple linear regression was then used to assess the association of scaled pregnancy fasting and 1-hour c-peptide with cord c-peptide and child OGTT c-peptide at the follow-up visit (ages 10-14 years).
After imputation, a total of 4,802 mother-child pairs were included. Pregnancy c-peptide was associated with cord c-peptide. Pregnancy fasting c-peptide and pregnancy 1-hour c-peptide were also associated with child OGTT c-peptide (fasting, 1-hour and 2-hour). Pregnancy c-peptide levels were positively associated with HOMA-IR and negatively associated with child insulin sensitivity and Matsuda index after adjusting for child BMI z-score and pubertal stage.
Insulin level during pregnancy is closely linked with the child insulin response during the OGTT. This indicates that maternal insulin levels may play a part in programming the development of the offspring's pancreas.

Seonghun (Hun) Lee
Using Data Fusion and Multiple Imputation to Correct for Misclassification in Self-reported Substance Use: A Case-Control Study of Cannabis Use and Homicide Victimization
Although cannabis use has been causally linked to violence in case studies, the association between cannabis use and homicide victimization has not been rigorously assessed. A case-control analysis can be performed using two national data systems: cases were homicide victims from the National Violent Death Reporting System (NVDRS), and controls were participants from the National Survey on Drug Use and Health (NSDUH). While the NVDRS detected cannabis use in the blood, the NSDUH only collected self-reported data, and thus the potential misclassification in the self-reported data needs to be corrected. We considered a data fusion approach by concatenating the NSDUH with a third data system, the National Roadside Survey (NRS), which collected both blood test and self-reported results from drivers. This data fusion approach provided multiple imputations (MIs) of blood test results for the participants in the NSDUH, which were then used in the case-control analysis. Bootstrap was used to obtain valid statistical inference. The analyses revealed that cannabis use was associated with 3.61-fold (95% CI: 2.75 - 4.47) increased odds of homicide victimization. Alcohol use, Black race, male sex, 21-34 years of age, and less than high school education were also associated with significantly increased odds of homicide victimization.

Topics in Data Visualizations (HSC 202)

Yan Wang
Quantifying Intrinsic Health Using Stress-evoked Information Flow between Physiological Systems
The patterns of human physiological parameters, including heart rate, blood pressure, and respiration, can be affected by both health status and stress condition. Transfer entropy (TE) can be used to measure the directed information flow between two time series processes. Our study highlights the potential of transfer entropy analysis as a tool for understanding physiological interactions under stress and for identifying distinct patterns of information transfer among them. We aimed to quantify information transfer between these physiological parameters for patients with different health statuses and stress conditions. The transfer entropy we calculated identified differences in the direction and strength of information flow for patients with different health statuses under different stress conditions. We propose to conduct clustering analysis and statistical tests on the transfer entropy within patient health status in order to reveal patterns of information transfer between the physiological variables, with implications for diagnosis. Certain patterns of information transfer between these physiological variables can indicate health problems, allowing for earlier diagnosis and treatment.

Xiao Ma
Marketing Data Analytics in Personal Care Industry
This practicum provided experience in using various data analytics tools and collaborating with different teams to support healthcare businesses. I used R and Excel to perform exploratory data analysis and to showcase bi-weekly, monthly, seasonal, and annual sales records for a deep understanding of companies within the sector. I also worked with Operations, Sales, and Product Development staff to facilitate the implementation of new clients' healthcare data through the Impact Product Suite in Q3 2022. Additionally, I collaborated with the product development team and customers to launch new products and product enhancements by arranging beta test environments, developing go-to-market strategies, and executing the launch to market.

Yuxuan Wang
Enhancing Efficiency and Decision-Making in Healthcare Data Analysis: A Practicum Report on My Data Analyst Internship at IQVIA
During my internship as a Data Analyst at IQVIA in Plymouth Meeting, PA, from May to August 2022, I worked on a series of projects aimed at improving efficiency, data accuracy, and decision-making processes. I primarily focused on monitoring major KPIs and producing weekly summary statistics reports through the implementation of automated quality check SQL queries.
One of my significant achievements was utilizing scalable SQL scripts to update a 27GB+ database, which reduced processing time by 25%. Additionally, I applied statistical analysis in R and SQL to identify the optimal screening methods for various diseases. This led to a 30% decrease in manual processing, and by conducting A/B testing to compare different screening methods, we were able to increase the accuracy of data-driven decisions by 67%.
Lastly, I developed a Python automation program for comprehensive quality control on large Healthcare Market monthly data reports. By generating user-friendly reports using Tableau, I helped increase overall efficiency by 72%. My time at IQVIA provided me with valuable hands-on experience in applying data analysis techniques and tools to optimize processes, enhancing my skills as a data analyst and contributing significantly to the organization's operations.

Hengxuan Ma
Pre-marketing Data Analysis of SaaS Recruiting Software
This report presents a data-based research project for a software-as-a-service (SaaS) recruiting product from the company HireBeat. The project was designed to support a pre-marketing strategy for the company’s SaaS recruiting product by determining the target customer population and optimizing the product’s competitiveness.
The research had three parts: an analysis of job opening levels by industry, an industry sub-sector selection analysis, and a SWOT analysis of competing products. The main methods were categorical data analysis and data visualization in SAS and Tableau, used to generate descriptive statistics and reasonable interpretations.
The results identified the potential target industries that provided the largest number of job openings and helped us determine the optimal sub-industry and population. Furthermore, the SWOT research identified the strengths and weaknesses of two major SaaS recruiting software products on the market, which helped us improve the competitiveness of our company’s product.
We successfully completed the initial pre-marketing research for the new SaaS recruiting product, covering customer population and strength improvement. The information drawn from our data analysis provided sufficient suggestions for the product’s potential development and laid the groundwork for a go-to-market strategy.

Malvika Venkataraman
Assessment of Actual vs. Projected Payer Agnostic Performance
The payer agnostic approach is a model in which health care services are provided to serve multiple payers. Being payer agnostic, Humana’s CenterWell pharmacy accepts members and patients from a variety of Medicare, Medicaid and commercial plans, not just those offered by Humana. The average wholesale price (AWP) is a term that describes the average price paid by a retailer to buy a drug from a wholesaler. AWP is used to determine pricing and reimbursement of prescription drugs to third parties. CenterWell signs contracts with payer agnostic clients that include AWP discount ranges. However, these values may differ from the actual discount applied. The goal of this report is to compare the Actual vs. Projected AWP Discount for all of CenterWell Pharmacy’s payer agnostic clients. The student’s role was to analyze the data and build an interactive dashboard to manage CenterWell’s client contracts and track the pharmacy’s performance. This included building a derived dataset by wrangling sales, contract, claims and drug table data, as well as performing AWP discount calculations at the script and quantity level. The use of SQL/SAS and PowerBI allows for automation and minimal maintenance. The dashboard’s interactive platform allows users to select, slice and interact with different visualizations and tables. Analysis of the dataset revealed quantitative differences between actual and projected discounts, providing insights for future contracts with clients.

Bayesian Statistics (HSC 107)

Haolin Zhong
Comparing the Performance of Multi-Armed Bandit Algorithm and Hypothesis Testing in Clinical Trials: A Simulation Study
This practicum project aims to compare the performance of two methods in clinical trials, the multi-armed bandit (MAB) algorithm and hypothesis testing, through a simulation study. The study focuses on evaluating the ability of both methods to identify the best treatment arm in a clinical trial as well as to assign patients to the optimal arm as much as possible, considering factors such as sample size, number of treatment arms, and effect size. The results of the study will provide insights into the strengths and weaknesses of each method and their applicability to different clinical trial scenarios, with the potential to inform future trial designs and decision-making processes.
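As a rough illustration of the bandit side of such a comparison, the Python sketch below simulates one trial with Beta-Bernoulli Thompson sampling. The abstract does not say which MAB algorithm was studied, so Thompson sampling, the function name `thompson_trial`, the response rates, and the sample size are all assumptions made for this example.

```python
import random


def thompson_trial(true_rates, n_patients, rng):
    """Simulate one trial and return the number of patients assigned to each arm."""
    k = len(true_rates)
    successes = [0] * k
    failures = [0] * k
    for _ in range(n_patients):
        # Draw a plausible response rate per arm from its Beta posterior
        # (uniform Beta(1, 1) prior), then treat the patient with the best draw.
        draws = [rng.betavariate(successes[a] + 1, failures[a] + 1) for a in range(k)]
        arm = max(range(k), key=draws.__getitem__)
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


# Hypothetical two-arm trial: arm 1 is truly better.
rng = random.Random(0)
allocations = thompson_trial([0.2, 0.8], n_patients=500, rng=rng)
share_on_best_arm = allocations[1] / sum(allocations)
```

A conventional fixed design would assign half of the patients to each arm and then apply a hypothesis test. Repeating both procedures over many simulated trials, while varying sample size, number of arms, and effect size, is the kind of comparison the abstract describes.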

Yifei Xu
Metabolomic Analysis of Myalgic Encephalomyelitis/Chronic Fatigue Syndrome (ME/CFS) with Exercise Tolerance Test
Myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) is a disabling and complex illness, associated with neurological, immunological, autonomic, and energy metabolism dysfunction. People with ME/CFS often experience unexplained physical fatigue, sleep impairment, cognitive issues, orthostatic intolerance, sensory intolerance, and gastrointestinal problems. The onset of ME/CFS is often characterized by flu-like symptoms, leading to the suspicion that the condition may be triggered by an infection. Evidence has also demonstrated that exercise and even increased activity may reduce patients' physical and mental capacity over time. To identify which metabolites and which clusters of metabolites are significantly associated with ME/CFS and exercise status, we performed lognormal regression, Bayesian, and enrichment analyses on data from 112 ME/CFS cases and 104 matched healthy controls, together with stratified analyses by sex, age, sr-IBS status, and duration. Subjects with ME/CFS had significantly decreased levels of PC-vlc, Phosphatidylcholines and Unsaturated triglycerides compared to controls (p < 0.001). Levels of Carboxyibuprofen were significantly increased in ME/CFS cases compared to controls (p < 0.001), as well as in ME/CFS cases after exercise. Our findings provide metabolomic evidence for both ME/CFS status and exercise intolerance, and insights into the pathogenesis of ME/CFS.

Qingdou (Paulina) Han
Racial and Economic Disparity in Child-Injured Crashes in California Cities
This study assessed the disparities in child railroad crash injuries with respect to the socioeconomic and demographic composition of urban populations in California in 2017. This retrospective observational study used city-level crash data from the Transportation Injury Mapping System for 206 cities in California. We compared the crash injury rate between cities with different median household income levels among White, Black, Hispanic, and Asian children using negative binomial regression. We considered both the household income composition in 2020 and the income trajectories from 1980 to 2020, estimated using latent class growth analysis. Black children have the highest crash injury rate and Asian children have the lowest. Given the median household income level in a city, a higher percentage of a racial population is significantly associated with lower child crash rates in that racial group. A higher percentage of the White population was associated with a higher risk of railroad crashes among Black children. White children living in a city with a higher percentage of Black and Asian populations have an increased crash risk. White and Asian children residing in higher-income cities had a lower risk of railroad crashes when the racial composition of the city remained the same. White and Asian children benefit from living in a wealthier city with respect to road safety. All children are safer when they live in a city with more people of their own race.

Anyu Zhu
Using Bayesian Machine Learning Methods to Improve Cellphone Survey Inferences in Low and Middle-Income Countries
During the COVID-19 pandemic, data collection using in-person surveys became challenging. To estimate the COVID-19 vaccination rate in Uganda, we conducted a cellphone survey in 2022 among a subset of the cellphone owners who participated in the Uganda Population-based HIV Impact Assessment (UPHIA) survey in 2019-2021 and consented to follow-up. Given that 60% of people in Uganda have a cellphone, the cellphone survey suffers from coverage error and bias. Statistical analysis of key health outcomes using only the cellphone survey participants thus produces biased inference. In this paper, we applied Bayesian Additive Regression Trees (BART) for statistical inference from non-random samples to correct for the coverage bias in the cellphone survey and to improve the validity of the survey estimates by employing the more than 200 variables collected about each UPHIA participant. We compared the Bayesian machine learning methods to the conventional weighting approach for estimating the COVID-19 vaccination rate, and the method was validated on a variable shared with the cellphone survey data: the percentage of participants who visited a health worker in a health facility in the past 12 months. The BART models can effectively reduce bias in survey estimates and provide valid statistical inference; cellphone surveys analyzed with BART models thus provide a cost-effective and time-efficient alternative for generating statistical inference about a target population when a representative sample is impossible.

Yimiao Pang
Simulation Study of Adaptive Randomization in Multi-arm Clinical Trials
In recent years, adaptive design has emerged as a promising approach to clinical trial design, allowing for increased flexibility in the evaluation of new treatments. While previous studies have explored the use of adaptive design in multi-arm clinical trials, the statistical performance of adaptive design remains to be fully understood. In particular, the power of adaptive design to identify the most effective treatment arm in multi-arm trials has not been thoroughly investigated.
Our study focuses on evaluating the statistical performance of different adaptive designs and comparing them to the traditional clinical trial design in terms of power. We define power in our study as the probability of identifying the optimal treatment arm in a multi-arm trial setting. The adaptive approaches are determined by different combinations of futility rule, selection rule and early termination rule.
Adaptive designs have higher power to identify the optimal treatment arm than the traditional design in the multi-arm setting. Furthermore, adaptive designs in the multi-arm setting can significantly reduce the sample size required for a trial by using early termination while maintaining high statistical power.
Adaptive designs not only save time and resources in multi-arm trials, but also maintain high statistical power, enabling more efficient and effective clinical trial design compared with traditional clinical trials.

Genetics Research (HSC 201)

Jo Hsuan (Brian) Lee
Improved Disease Protein and Metabolite Prioritization Performance With MultiNEP
Omics networks that integrate cross-omics interaction information are by design more informative than isolated homogenous networks. For trait-associated feature prioritization studies, leveraging various omics profiles and their respective interactivity data can improve performance by traversing more comprehensive biological pathways. For example, joining gene expression, metabolite abundance, and a gene-metabolite interaction network can elucidate signals otherwise undetectable via separate omics analyses, as shown by the MultiNEP framework described by Xu et al. (2022). MultiNEP uses a modified random walk algorithm on an enhanced feature interaction matrix to prioritize signals. Its robustness stems from the weights introduced to balance the various interaction networks. However, the omics adaptability of the said algorithm has not been explored. This work employs the MultiNEP framework on a protein-metabolite network with proteomic and metabolomic profiles. Similar to the imbalance issue identified between genes and metabolites networks, when the contribution of metabolites is increased relative to that of proteins, MultiNEP identifies disease-associated features with higher accuracy compared to methods that ignore the disparity. Application on protein expression and metabolite abundance data of endometrial cancer patients further demonstrates that MultiNEP prioritizes more cancer-related features by effectively utilizing within- and between-omics interactions.

Yuxuan Chen
scRNA-seq Analysis of Lung Adenocarcinoma Mouse Models
Single-cell RNA sequencing is a new technology that extracts detailed information from the genome and provides unprecedented opportunities to study disease heterogeneity at the cellular level. In this practicum, I wrote and ran a rigorous pre-processing workflow involving quality control, normalization, demultiplexing, mapping, and transcript quantification steps for the RZiMM-scRNA method, the unified Regularized Zero-inflated Mixture Model framework designed for scRNA-seq data. This new method simultaneously detects cell subgroups and identifies differential gene expression based on a purpose-built importance score, accounting for both dropouts and batch effects. My empirical investigations focus on lung adenocarcinoma mouse models, and my goal is to delineate cell heterogeneity and identify driving biomarkers associated with this lung tumor. Notably, I successfully clustered the cells and identified the top highly variable genes for the cells overall and for each cell cluster. I also identified the genes that differentiate two biological conditions (disease vs. control).

Hening Cui
A joint modeling on single RNA sequencing spatial data
Congenital heart defects (CHDs) are a common type of structural birth defect that can affect the heart's walls, valves, or blood vessels. The severity of CHDs varies and finding effective treatments can be challenging. Recent advances in sequencing technology and data analysis have enabled researchers to gain a deeper understanding of the gene expression patterns during the cardiogenesis process, which may help to identify potential treatments for CHDs.
In this study, we aim to use joint modeling of single-cell RNA sequencing data to identify gene expression patterns during cardiogenesis. Specifically, we will use linear mixed-effects models (LMMs) and Cox spatial models to analyze the chicken heart dataset, which contains both single-cell RNA sequencing data and spatial data. This approach will allow us to incorporate both within-cell and between-cell correlations, providing a more accurate representation of the biological processes involved in cardiogenesis.
Our findings may help to identify new therapeutic targets for CHDs and improve our understanding of the genetic mechanisms underlying heart development. The analysis will be performed using the R programming language and the Seurat, spatstat, and nlme packages.

Xuanhe Chen
Investigating the Relationship Between Splicing Quantitative trait loci and Protein Quantitative trait loci
Quantitative trait loci (QTL) are genetic regions in the genome associated with variation in quantitative traits. Different traits are related to different types of QTL, including sQTL, which capture the genetic basis of alternative splicing, and pQTL, which capture the genetic basis of protein expression. Although sQTL and pQTL have been studied independently, few studies have examined their relationship. In this study, we investigated the relationship between sQTL and pQTL using splicing and proteomics data from the Religious Orders Study/Memory and Aging Project (ROSMAP), analyzed via the ADSP FunGen-xQTL computational protocol. Our analysis revealed a partial overlap between the two types of QTL, indicating that genetic variation that affects splicing may also affect protein expression downstream. As proteomics data are more difficult to collect than splicing data, these results may provide new insights into pQTL identification using sQTL as evidence and may have important implications for understanding the genetic basis of complex traits and diseases.

Matthew Neky
Incorporating Shapley Values into the Hierarchical Statistical Mechanical Model
Predicting protein sequences using machine learning algorithms is of great interest to the scientific community, as such prediction tools help further both basic research and applied fields, such as biomedical engineering and drug development. However, no perfect model has yet been created, and so myriad approaches exist at present. The Hierarchical Statistical Mechanical (HSM) model is one such biophysical prediction model; it predicts protein-peptide interactions and signaling networks using machine learning. The focus of this study was to investigate whether Shapley values, a concept originating in the game theory sub-field of economics, can be incorporated into the HSM model to achieve more robust prediction results. Shapley values, in this context, treat each amino acid (or other biophysical phenomenon) as a ‘game,’ and assign that game a value based on how vital it is to the overall structure of the protein. My role, and the scope of this practicum, involved writing code that could produce the Shapley values for an entire protein sequence and be properly incorporated into the existing HSM model. From these results, we can see that Shapley values have use outside of the narrow context of economics and can be applied to protein prediction tools.

COVID-19 Research (HSC 110)

Yingshuo Liu
Health Equity Exploration during COVID-19 pandemic
The goal of this project is to evaluate the health equity landscape in the US during the COVID-19 pandemic. The WHO defines health equity as the chance for everyone to attain their full potential for health and well-being; achieving it depends vitally on individuals challenging and changing the unfair and steeply graded distribution of social resources to which everyone has equal claims and rights.
During the pandemic, access to healthcare in the US still differed, and health disparities existed at several levels of granularity, from differences between counties to differences by gender and race. The project uses COVID-19 vaccination rate and death rate as dependent variables to quantify access to health resources, and uses covariates including the Social Vulnerability Index (SVI), gender, and race to explore correlations with the dependent variables and describe the health equity landscape during the pandemic. ANOVA tests, generalized regression models, and other parametric methods are employed to assess these correlations.
We found that many demographic and social factors are significantly correlated with health access in the US during the pandemic, which indicates that health disparities still exist at different levels. In conclusion, pursuing health equity is an ongoing process that requires collaboration among all stakeholders, including policy-makers, health systems, payers, and individuals.

Yuanyuan Zeng
Understanding of COVID Vaccine Hesitancy
COVID vaccine hesitancy is defined by the World Health Organization as a delay in acceptance or refusal of vaccines despite the availability of vaccination services. Delay in receiving the vaccine is one of the obstacles to vaccine uptake and to mitigating the COVID-19 pandemic. Understanding the factors associated with vaccine uptake will help care providers address vaccine hesitancy.

The purpose of this study is to investigate factors associated with vaccine hesitancy. We used an online cross-sectional survey to collect data from 320 participants. After categorizing participants into vaccination groups, we conducted descriptive analyses and hypothesis tests on the candidate variables. Latent class analysis (LCA) was conducted to identify unmeasured class membership among the participants. Variables including chronic condition, health views, mental health condition, morality, spirituality, uptake of flu vaccine, and experience of racism differed significantly between groups. Using those variables, we fit 3-class and 4-class LCA models and examined the profile of each class to predict what makes a participant fall into a particular class.

Vaccine hesitancy entails a complex mix of factors. Knowing the reasons behind vaccine hesitancy would provide practical guidance for vaccination-related consultation.

Yitian Zhang
The Prevalence of Osteoporosis in China, a Nationwide, Multicenter DXA Survey
Until now, a BMD reference database based on uniform measurements in a large‐scale Chinese population has been lacking. A total of 75,321 Chinese adults aged 20 years and older were recruited from seven centers between 2008 and 2018. BMD values at the lumbar spine (L1–L4), femoral neck, and total femur were measured by GE Lunar dual‐energy X‐ray absorptiometry systems. BMD values measured in each center were cross‐calibrated by regression equations that were generated by scanning the same European spine phantom 10 times at every center. Cubic and multivariate linear regression were performed to assess associations between BMD values and demographic variables.

Saryu Patel
COVID-19 Spread Through NYC Hospitals
COVID-19, an infectious disease caused by the SARS-CoV-2 virus, was discovered in December 2019 and has since spread worldwide. The purpose of this study was to assess the spread of the disease through NYC hospitals using network analysis.
Networks were created by connecting patients who stayed in the same hospital room or ward on the same day(s). Various network statistics were calculated and compared between COVID negative and positive patients to determine if the NYC hospitals have taken measures to prevent the spread of the disease.
The network statistics calculated included degree, betweenness, and eigenvector centrality. Degree, which is the number of connections each patient has, was found to be greater in COVID negative patients than in COVID positive patients. Similarly, betweenness, which measures the extent to which a patient lies on the shortest path between other patients in the network, was found to be greater in COVID negative patients than in COVID positive patients. Eigenvector centrality, which describes the amount of influence a patient has in the network, was similar between COVID negative and positive patients.
The differences in the degree and betweenness measures show that COVID positive patients are not as exposed to other patients. COVID positive patients share a hospital room or ward with other patients much less than COVID negative patients, so the NYC hospitals have indeed taken measures to prevent the spread of COVID-19.
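
The network construction described above (linking patients who shared a room or ward on the same day, then comparing degree between groups) can be sketched as follows. This is a minimal illustration under assumed inputs: the record layout `(patient_id, ward, day)`, the patient identifiers, and the COVID status labels are all hypothetical and are not drawn from the study data.

```python
from collections import defaultdict
from itertools import combinations

def build_contact_network(stays):
    """Link patients who shared the same ward on the same day.

    `stays` is a list of (patient_id, ward, day) tuples.
    Returns a dict mapping each patient to the set of patients they contacted.
    """
    occupants = defaultdict(set)
    for patient, ward, day in stays:
        occupants[(ward, day)].add(patient)
    neighbors = defaultdict(set)
    for patient, _, _ in stays:
        neighbors[patient]  # ensure patients with no contacts still appear
    for group in occupants.values():
        for a, b in combinations(sorted(group), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return dict(neighbors)

def mean_degree(neighbors, patients):
    """Average number of distinct contacts among the given patients."""
    return sum(len(neighbors[p]) for p in patients) / len(patients)

# Hypothetical stay records, for illustration only.
stays = [
    ("p1", "W1", 1), ("p2", "W1", 1), ("p3", "W1", 1),
    ("p4", "W2", 1),                      # alone in the ward that day
    ("p2", "W3", 2), ("p5", "W3", 2),
]
covid_positive = {"p4"}

network = build_contact_network(stays)
negative = [p for p in network if p not in covid_positive]
print(mean_degree(network, negative), mean_degree(network, sorted(covid_positive)))
```

Betweenness and eigenvector centrality need shortest-path and eigenvector computations, which a graph library would normally provide; only degree is shown here.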

Yunyi Jiang
Environmental Monitoring for Quality Control Using a Linear Regression Model
The quality of experimental environments, in terms of air, water, surface, and temperature quality, plays a key role in assessing the Quality Control experimental environment against the Quality Assurance standards. In this paper, a linear regression model is designed to predict environmental experiment quality based on parameters such as air particulates, surface particulates, room temperature, room operating status, water quality, and number of personnel present. The environmental monitoring dataset was collected over 12 months, with 7 parameters recorded for 5 different operation sites. Linear regression was used to analyze the relationship between the parameters and experiment quality. The results showed that there was a significant correlation between the environmental variables and experiment quality. The model was able to predict experiment quality with high accuracy based on the environmental parameters. This study demonstrates the importance of environmental monitoring for quality control and the effectiveness of linear regression models in analyzing the relationship between environmental variables and experiment quality.

Functional Data Analysis (HSC 207)

Jianting Shi
Distinguishing Microglia and Macrophages within Glioma Tumor Microenvironment by scRNA-seq Lineage Tracing of Mitochondrial Somatic Mutations via LINEAGE algorithm
Microglia and macrophages are two morphologically and transcriptionally similar immune cell types in the glioma tumor microenvironment. The poor survival rate of glioma patients is associated with an abundance of highly immunosuppressive macrophages in the tumor, while microglia can be more pro-inflammatory. To understand the disease mechanism of gliomas and develop novel targets for treatment, being able to distinguish these two cell types in the tumor microenvironment is of the highest importance. Advances in single-cell RNA sequencing help dissect the composition of complex tumors, covering both malignantly transformed cells and cells in the tumor microenvironment, yet they are not sufficient to distinguish microglia from macrophages. Because somatic mutations occur at a high rate in the mitochondrial genome, a recently published algorithm, LINEAGE, uses label-free identification of endogenous informative single-cell mitochondrial RNA mutations for lineage analysis. Lineage tracing with LINEAGE to distinguish microglia and macrophages in glioma tumors showed distinct cell clusters in 4 patient samples. However, the clustering does not seem to correlate with the myeloid marker expression pattern from conventional scRNA-seq analysis. Further modification and optimization of the LINEAGE method are needed to better assess its applications.

Pei Hsin Lin
A Preliminary Investigation of the Impact of Pre-Term Birth on Mental Health, Substance Use, Brain Connectivity, and Cognitive Performance in Children
This initial analysis examines the differences between full-term and pre-term babies in terms of mental health conditions, substance use, pregnancy and birth information, functional connectivity within and between brain networks, and cognitive performance. The results show that while the difference in mental health condition between full-term and pre-term babies is not significant, substance use before knowing of the pregnancy differs significantly, with parents of pre-term babies more likely to report use of at least one substance. Pre-term babies have specific mental health conditions and increased functional connectivity within and between brain networks, and perform worse on cognitive tests in the NIH Toolbox compared to full-term babies. These findings suggest that pre-term birth can have significant impacts on various aspects of a child's development and highlight the need for further research and support for pre-term babies and their families.

Xueqing Huang
Case Study of Simulation-extrapolation Method on Scalar-on-function Regression with Measurement Error
In this paper, we conduct a case study applying the simulation-extrapolation (SIMEX) algorithm to scalar-on-function regression with measurement error. The SIMEX method first estimates the error variance, then establishes the relationship between a sequence of added error variances and the corresponding estimates of the coefficient functions. Finally, this relationship is extrapolated to zero measurement error using linear, nonlinear, or local polynomial extrapolation. Data come from the 2005-2006 NHANES program, which was established by the CDC to describe the health of Americans. Participants wore wearable devices to track their daily physical activity during waking hours, excluding water activities and sleep. True physical activity intensity is treated as the true function-valued covariate, the device-based measure of physical activity intensity as the observed covariate, and BMI as the response variable. Results show that the linear extrapolation method performed better than the nonlinear and local polynomial methods.

Sze Pui (Sallie) Tsang
Quantifying Intrinsic Health Using Stress-evoked Information Flow between Physiological Systems
As a canonical complex dynamical system, the human body requires interactions between constituent physiological systems to perform functions and maintain health. Quantifying the dynamic interactions between organ systems using biomarkers collected in the context of psychobiological challenges may provide a novel means of measuring the energy-dependent communication processes that underlie intrinsic health. In this study, we use transfer entropy (TE), an information-theoretic metric, to quantify information flow between three biomarkers (heart rate, blood pressure, and respiration) in 71 participants of the Mitochondrial Stress, Brain Imaging, and Epigenetics (MiSBIE) project. Our findings highlight the potential use of TE as a computational tool for detecting mitochondrial disorders and, more broadly, for quantifying intrinsic health using information transfer between physiological biomarkers.
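
As a rough illustration of what transfer entropy measures, the sketch below implements a plug-in estimator for two discrete series with a history length of one step. The estimator choice, the history length, and the toy binary series are assumptions made for illustration; the abstract does not state how TE was estimated for the physiological signals.

```python
from collections import Counter
from math import log2

def transfer_entropy(source, target):
    """Plug-in transfer entropy (in bits) from `source` to `target`, history length 1.

    TE = sum p(y_next, y_now, x_now) * log2( p(y_next | y_now, x_now) / p(y_next | y_now) )
    """
    triples = list(zip(target[1:], target[:-1], source[:-1]))
    n = len(triples)
    c_full = Counter(triples)                              # (y_next, y_now, x_now)
    c_now = Counter((y, x) for _, y, x in triples)         # (y_now, x_now)
    c_self = Counter((y1, y) for y1, y, _ in triples)      # (y_next, y_now)
    c_y = Counter(y for _, y, _ in triples)                # y_now
    te = 0.0
    for (y1, y, x), count in c_full.items():
        cond_full = count / c_now[(y, x)]
        cond_self = c_self[(y1, y)] / c_y[y]
        te += (count / n) * log2(cond_full / cond_self)
    return te

# Toy binary series: y is x delayed by one step, so x should inform y's next value.
x = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0]
y = [0] + x[:-1]

te_xy = transfer_entropy(x, y)
te_yx = transfer_entropy(y, x)
print(te_xy, te_yx)
```

Continuous signals such as heart rate would first need discretization or a different estimator, and short series bias the plug-in estimate upward, so values from a toy series like this one are only indicative of direction.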

Ziwei (Zoe) Zhao
Assessing Treatment Efficacy through Functional Data Analysis from Accelerometers
Mitochondrial diseases are rare genetic disorders caused by oxidative phosphorylation defects. This study aimed to evaluate the treatment efficacy of a "molecular bypass" therapy for patients with TK2 deficiency-induced mitochondrial diseases. To achieve this goal, we applied a non-parametric inference framework to analyze accelerometer data from an ongoing Phase III clinical trial. 42 sets of raw accelerometer data from 19 patients were cleaned and processed to generate occupation time curves. Patients underwent one to five visits after receiving the therapy, and we hypothesized that their activity levels would increase over time. Treatment effects were calculated from the occupation time data at patients’ first and last visits. An empirical likelihood (EL) confidence band for the mean treatment effect was then constructed from those data. Current analysis results from 12 patients with multiple visits indicated a promising treatment effect of the therapy, particularly at activity levels around 0.2 and 2. However, further data collection is necessary to obtain more conclusive results.

Observational Study in Human Behavior (HSC 210)

Weiheng Zhang
Clinical Gait Analysis on Osteoarthritis Patients for Diagnostic Models and Arthroplasty Treatment’s Evaluation
Osteoarthritis (OA) is a degenerative joint disease that seriously impairs patients’ motor ability. In occupational therapy, clinical gait measurement is widely used to quantify gait kinematic impairments from diseases. Dr. Bertaux’s team from Univ. of Bourgogne Franche-Comté conducted such measurements by recruiting 80 healthy volunteers and 106 unilateral hip OA patients. Each participant walked along a straight line while force-sensing plates and motion capture sensors recorded their body motion and step forces. The same procedure was conducted on the patients 6 months after total hip arthroplasty (THA). Based on Bertaux et al.’s dataset, stratified by the subjects’ BMI, we built a computational pipeline to extract seven kinematic features that can significantly distinguish OA patients from healthy subjects. These features, along with demographic variables, were used in regression models for OA predictor interpretation and in classification models for OA diagnosis. Cross-validation (CV) results showed that the final diagnostic model could achieve 92% sensitivity at 90% specificity, with an AUC above 0.95. We also analyzed the patients 6 months after THA: most of the kinematic features shifted towards those of the healthy volunteers, while some features still differed between the treated and healthy groups. Clinical gait analysis, combined with data science strategies, has substantial potential for OA diagnosis and for the evaluation of arthroplasty treatment.

Shiyin (Fiona) Li
Activity pattern of granule and mossy cells in the dentate gyrus during sleep and wakefulness
Sleep is crucial for memory consolidation, and hippocampal neuronal activity plays a vital role in this process. We recorded calcium activity from granule and mossy cells in the dentate gyrus of a head-restrained mouse during sleep-wake cycles. Using a bootstrap-based significance testing approach, we found that 33.3% of cells had higher activity during NREM, while 44.9% showed decreased activity. Our study contributes to the understanding of hippocampal activity during sleep and its role in memory consolidation.

Lin Qin (Lynn) Chen
Associations between individual and structural level racism and gestational age in the Nulliparous Pregnancy Outcomes Study: Monitoring mothers-to-be (nuMoM2b)
Preterm birth (PTB) is a major cause of neonatal mortality and birth defects, with significant emotional, economic, and financial effects on families. Individual experiences of discrimination and structural racism have been linked to PTB among Black women. This study aims to investigate the interaction between individual experiences of racial discrimination and structural racism and their association with PTB among nulliparous women. The nuMoM2b study was a multicenter prospective cohort study of nulliparous pregnant women conducted from 2010 to 2015. Individual experiences of racial discrimination were assessed using Krieger's 9-item scale. The Index of Concentration at the Extremes (ICE) was the structural racism measure used for the study. Descriptive statistics and bivariate analyses were conducted for individual predictor variables and the ICE structural racism measure to describe maternal sociodemographic characteristics. Linear mixed-effects models were used to test the associations between individual and structural racism measures and gestational age. All models were adjusted for maternal age, education, and smoking. The results show that both individual experiences of racial discrimination and structural-level racism were associated with gestational age. Future research using longitudinal datasets and more diverse samples is needed to better understand the relationship between the multidimensional nature of racism and gestational age.

Waveley Qiu
Approximating Population Distributions of an Ubiquitous Binary Feature through Modeling and Raking
While the development of web surveys has provided researchers with a relatively simple and cost-effective way to collect data, it can be difficult to ensure that a sample collected this way resembles a probability sample. YouGov, a London-based market research and data analytics firm, provides a methodology in which, under certain assumptions, some aspects of a probability sample can be achieved by matching survey respondents to individuals who were randomly selected from a known population, called a "frame", and applying weights to the survey responses according to each individual's likelihood of being selected from the frame. The frame is typically derived from a census-type data source; however, it is often the case that not every characteristic of interest in the matching procedure is available in that data source. Over the course of my internship at YouGov, along with learning about the sample matching methodology used at the company, I proposed a method combining modeling and iterative proportional fitting, by which additional population-level characteristics could be appended to an existing sampling frame. While there are certainly limitations to using simulated data in lieu of collected data, if the required distributional assumptions can be reasonably made, this procedure seems to provide a viable way to overcome the dimensional limitations of available population-level data within the context of sample matching.
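
The iterative proportional fitting (raking) step mentioned above can be sketched as follows. This is a minimal illustration on a hypothetical 2x2 table with hypothetical target margins; it is not YouGov's procedure, and the modeling step that would produce the starting table is not shown.

```python
def rake(table, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Iterative proportional fitting of a 2-D table to target row and column totals."""
    cells = [row[:] for row in table]
    for _ in range(max_iter):
        # Scale each row so its total matches the row target.
        for i, target in enumerate(row_targets):
            total = sum(cells[i])
            cells[i] = [v * target / total for v in cells[i]]
        # Scale each column so its total matches the column target.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in cells)
            for row in cells:
                row[j] *= target / total
        # Columns now match exactly; stop once rows also match within tolerance.
        if max(abs(sum(r) - t) for r, t in zip(cells, row_targets)) < tol:
            break
    return cells

# Hypothetical starting table (e.g. model-based counts) and known population margins.
seed = [[40.0, 10.0],
        [20.0, 30.0]]
fitted = rake(seed, row_targets=[60.0, 40.0], col_targets=[55.0, 45.0])
for row in fitted:
    print([round(v, 3) for v in row])
```

Raking rescales rows and columns only, so the fitted table keeps the association structure (the odds ratio) of the starting table while matching the known margins. That is why the quality of the starting table, here assumed to come from a model, matters.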

Lin Yang
The role of patient activation and social support in health management among US adults with type 2 diabetes
Patient activation (PA) refers to a patient’s willingness to monitor, regulate, and self-manage their health, and has been associated with better outcomes among adults with type 2 diabetes (T2DM). Here, we explored the possibility that the magnitude of these PA-related health benefits may depend on patients’ access to adequate social support.
Adults who self-reported a physician diagnosis of T2DM on the 2022 National Health and Wellness Survey were stratified by their level of PA using the Patient Activation Measure (PAM: Level 1, 2, 3, 4) and their level of social support using the Modified Medical Outcomes Study Social Support Survey (median split: >31 indicated high social support). Health outcomes included preventive behaviors, health complications, hospitalizations, and emergency room (ER) visits. The Likelihood Ratio Test from nested generalized linear models was used to evaluate whether the covariate-adjusted contributions of patient activation, social support, and their interaction significantly improved the prediction of patient-reported outcomes (α = 0.05).
Most respondents were highly engaged in managing their health based on their PA level (PAM Level 3-4 = 76%). While PA was associated with greater management of lifestyle behaviors (p<0.001), neither PA nor social support was related to diabetes-related complications, hospitalizations, or ER visits (all NS). Further, none of the effects of PA were significantly moderated by social support (all PAM*Social Support interactions, NS). Rather, health outcomes were better predicted by individual differences in comorbidity burden (p<0.001), mental health (p<0.001), and sociodemographic characteristics (e.g., income, age, health insurance; all p<0.05).
Patient activation is an effective predictor of positive health behaviors among type 2 diabetics irrespective of social support. However, other predictive factors (income, insurance, comorbidities) may exhibit greater influence over less-proximal health outcomes (healthcare utilization) in this patient population.

Clinical Trial Cancer Research (HSC 109B)

Jie Liu
A Multicountry, Multicentre, Noninterventional, Prospective Study to Determine the Prevalence of EGFR Mutations in Patients With Early-stage, Surgically Resected, Nonsquamous, Non-small Cell Lung Cancer
Non-small cell lung cancer (NSCLC), which can be subclassified histologically as nonsquamous and squamous, accounts for the majority of all lung cancer cases [1]. A certain percentage of patients with NSCLC present with early-stage disease, for which surgical resection is usually the most appropriate therapy to consider. Mutation of the epidermal growth factor receptor (EGFR) is the most well-established driver mutation in NSCLC [2]. Targeted therapy with EGFR tyrosine kinase inhibitors (TKIs) has proven clinical benefits, with significantly higher response rates and prolonged progression-free survival (PFS) in patients with EGFR mutation-positive (EGFRm-positive) advanced NSCLC [3]. However, only limited data support the use of EGFR TKIs in earlier stages of NSCLC [4-5]. The role of molecular testing and targeted therapies for earlier stages of NSCLC remains unclear. This study aims to determine the prevalence and nature of EGFRm in patients with surgically resected early-stage (IA to IIIB on the basis of pathologic criteria) nonsquamous NSCLC, at the country level and at an aggregated multi-country level. We will also describe the surgical management, surgical outcomes, and neoadjuvant and adjuvant therapies of the selected early-stage nonsquamous NSCLC population.
The main outcome of our study will be the overall proportion of patients with different EGFR mutations.

Allison Randy-Cofie
Analysis of a phase 2 trial of Sitravatinib in patients with advanced liposarcoma
The goal of this study is to use a Simon two stage design to evaluate an improvement in progression free survival at 12 weeks.
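
For readers unfamiliar with the Simon two-stage design, the sketch below computes its operating characteristics for a binary endpoint (for example, progression-free at 12 weeks: yes or no). The design parameters (10 patients in stage 1, stop if at most 1 response, 29 patients in total, declare the treatment promising if more than 5 responses) and the response rates 0.10 and 0.30 are illustrative assumptions only and are not taken from this trial.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k responses among n patients with response rate p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def simon_operating_characteristics(n1, r1, n, r, p):
    """Return (P(stop after stage 1), P(declare promising)) at true response rate p.

    Stage 1: enroll n1 patients; stop for futility if responses <= r1.
    Stage 2: enroll up to n patients in total; declare promising if total responses > r.
    """
    prob_early_stop = sum(binom_pmf(k, n1, p) for k in range(r1 + 1))
    n2 = n - n1
    prob_promising = 0.0
    for k1 in range(r1 + 1, n1 + 1):
        needed = max(r + 1 - k1, 0)  # stage-2 responses still required
        tail = sum(binom_pmf(k2, n2, p) for k2 in range(needed, n2 + 1))
        prob_promising += binom_pmf(k1, n1, p) * tail
    return prob_early_stop, prob_promising

# Illustrative design and rates (assumed, not the trial's actual parameters).
design = dict(n1=10, r1=1, n=29, r=5)
pet_null, type_one_error = simon_operating_characteristics(p=0.10, **design)
_, power = simon_operating_characteristics(p=0.30, **design)
print(round(pet_null, 3), round(type_one_error, 3), round(power, 3))
```

The first output is the chance of stopping early when the treatment is ineffective, which is what lets the design limit the number of patients exposed to an inactive therapy.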

Gonghao Liu
Visualization Apps for Clinical Trial Data
In clinical trials, data visualization allows researchers to present information in a more accessible and meaningful way, making it easier to analyze, interpret, and communicate. A series of apps from the Biostatistics, Epidemiology, and Research Design (BERD) program, including a dose-finding plot, waterfall plot, spaghetti plot, spider plot, and swimmer plot, can help people without professional statistical training, such as clinicians and physicians, visualize clinical data. These apps are powerful and convenient tools that can be accessed through the CUMC website for free.
All the data visualization apps were developed with R Shiny. Each app introduces itself to the user with a description that combines text and pictures, and includes sample data that can be downloaded and tried in the app. After the user uploads data and selects variables, a plot is generated in the following tabs. The plot can be customized using the options provided on the side, and all customizations are preserved when the user downloads the plot. The package ggplot2 and its affiliated packages were used to draw all the plots.
Data visualization is a very important part of descriptive analysis. The visualization apps can help clinicians and physicians identify patterns and trends, communicate findings, monitor safety, and plan future trials. People can use the high-quality plots generated by the apps for publication and real-world clinical research.

Ragyie Rawal
Bulk RNA-Sequencing Analysis of Renal Cell Carcinoma Phase III Clinical Trial Transcriptomic Data
Recent advancements have improved our understanding of the immune system’s response to cancer, but providing effective cancer immunotherapy to patients still proves challenging. Predicting response to immunotherapy is thus critical for determining patients who are most likely to benefit from treatment. Transcriptomic signals could exist that allow prediction of patient response to treatment. Utilizing transcriptomic data from IMmotion 151, a phase III clinical trial comparing atezolizumab plus bevacizumab versus sunitinib in patients with untreated metastatic renal cell carcinoma, this analysis aims to predict patient response from the sunitinib arm. Gene and sample-wide clustering suggests a response related transcriptomic signal is present. A series of analyses are used to interrogate this signal. Dimensionality-reduction via principal component analysis suggests separation between patients with response to treatment versus no response. Differential gene expression analysis identifies possible genes associated with treatment response. Gene set enrichment analysis suggests non-responder enriched gene sets are related to cell cycle progression. Weighted gene co-expression network analysis identifies possible gene sets associated with treatment response. Ultimately, a transcriptomic signal seems to exist for treatment response, and a set of genes are identified describing different pathways which may assist in future patient stratification for the sunitinib treatment arm.

Shritama Ray
Changes in Nutritional Status, Diet, Fatigue, QOL, and Neuropathy In Pediatric Brain Tumor Patients receiving Chemotherapy alone or Chemotherapy and Radiation therapy
In this longitudinal observational study, 61 pediatric brain tumor patients aged 7 to 18 were recruited and followed over 6 timepoints. Patients’ treatment exposures were recorded as chemotherapy only or chemotherapy with radiation. At each visit, data were collected on the patients’ BMI/MUAC, quality-of-life indexes, sensory and motor neuropathy, nutritional factors, and fatigue. The parents’ quality-of-life scores were also recorded at each timepoint. Analysis of these data shows that patients’ BMI generally decreased from timepoints 1 to 3, then steadily increased back to baseline levels by timepoint 6. A less clear trend was observed for MUAC, with fluctuations but no significant net change over the duration of treatment. Quality-of-life metrics improved over the course of treatment for both patients and their parents, with significant increases at timepoints 5 and 6. There were significant decreases in constipation and jaw pain over time, but no clear trends in other forms of neuropathy. Adolescent and parent fatigue scores also decreased over the course of treatment. The only significant association observed was between BMI and quality of life, specifically for overweight patients. Loss to follow-up was a major issue, with fewer than half of the recruited patients having valid data at the last timepoint. For these reasons, further studies should be conducted to better understand these associations.

Session 2 (11:15am - 12:15pm)

Environmental & Mental Health (HSC 312)

Yisha Zhang
Semantic segmentation of Google Street View images to derive environmental characteristics in NJ communities
The characteristics of our environment have an impact on our well-being, and natural elements in particular can help reduce stress and lower the risk of stress-related illnesses. To study how specific environmental characteristics can provide therapeutic benefits, we used Google Street View (GSV) images to represent different urban environments that people encounter in New Jersey. Participants were immersed in these virtual environments using the Visualization and Immersive Studio for Education and Research (VISER) at Kean University and answered questions about their perceptions of the environment and their psychological state. We analyzed the GSV images using semantic segmentation, which involved using a pre-trained neural network in PyTorch to identify and classify different objects in the images, such as trees, buildings, roads, and cars. This allowed us to calculate various environmental attributes, such as visual complexity and green index. To validate our machine learning approach, we also manually segmented the same images using MIT's LabelMe Open annotation tool. In this project, we describe these experiments and present our results.

Liyi Dai
Mental Health Analysis Among Migrants and Asylum-seekers
The purpose of this research is to investigate migration dynamics towards the US and identify associations between environmental and sociodemographic factors and the mental health status of the migrants. The study uses survey data collected from migrants on the move from NCA and Mexico to the US and performs t-tests and ANOVA tests to analyze the data using SAS. The mental health status is measured by a mental health score, with a higher score indicating more serious mental health issues. The results reveal that the mental health score is significantly higher among female migrants and those who have experienced deportation. The score also varies significantly among participants with different educational levels and countries of birth. The findings suggest that public health interventions are necessary to raise awareness about mental health among migrants and to address the specific needs of vulnerable subpopulations. This study sheds light on the challenges faced by migrants on the move and provides insights for policymakers and public health professionals working to improve the mental health of this population.

Yishan Chen
Working Overtime and Mental Health Problems among New York Healthcare Workers
Responding to the COVID-19 pandemic put healthcare workers (HCWs) under extraordinary working conditions, resulting in a global mental health crisis. This study assessed the association between working overtime and mental health problems among New York HCWs during the COVID-19 pandemic.
Data for this study came from the COVID-19 Healthcare Personnel Study (CHPS) project, a longitudinal study of HCWs licensed in New York State with baseline data collected in April-May 2020 and follow-up data in February 2021. Logistic regression and multivariate analyses were used to estimate odds ratios (ORs) and 95% confidence intervals (95% CIs) of self-reported mental health problems (anxiety, depression, anger, and stress) associated with working overtime, defined as working more than 80 hours in the past two weeks.
Of the 4394 HCWs who participated in the baseline survey, 2405 (54.7%) completed the follow-up questionnaire. Working overtime was reported by 23.2% of the respondents at baseline and 33.4% at follow-up. Working overtime was associated with significantly increased odds of mental health problems at baseline (aOR=1.48, 95% CI 1.26, 1.74) and at follow-up (aOR=1.31, 95% CI 1.12, 1.53). Working overtime appeared to pose a particularly heightened risk of stress (at baseline, aOR=1.86, 95% CI 1.53, 2.27; at follow-up, aOR=1.51, 95% CI 1.25, 1.84).
Working overtime was an independent risk factor for mental health problems for HCWs, which should be taken into future consideration.

Qiongyu Shi
The Association Between Marijuana Use and Suicide: A Case-Control Analysis
Research indicates that substance abuse is a risk factor for suicide. However, the association between marijuana use and suicide has not been adequately assessed due to data and methodological challenges. To examine the association between marijuana use and suicide, we performed a case-control analysis using data fusion and machine learning techniques. We selected suicide cases aged 16 years and older from the National Violent Death Reporting System (NVDRS) and controls from respondents in the National Roadside Survey of Alcohol and Drug Use by Drivers (NRS) and the National Survey on Drug Use and Health (NSDUH). Data fusion of the control datasets (NRS and NSDUH) and multiple imputation based on machine learning were used to address misclassification and missing data concerns. Our weighted multivariable logistic models revealed a significant positive association between marijuana use and suicide [using NRS controls, Adjusted Odds Ratio (aOR) = 1.98; 95% Confidence Interval (CI): 1.52, 2.58; using NSDUH controls, OR = 1.79; 95% CI: 1.33, 2.25]. Moreover, our results confirmed that alcohol use, being male, White, or aged 35-49 years, and having less than a high school education were associated with significantly increased risks of suicide.

Yi Fang
Project Opioid Court REACH: Implementation and Evaluation of Opioid Intervention Courts
The opioid epidemic remains a significant public health concern in the United States. To address this issue, Project Opioid Court REACH, funded by the National Institute on Drug Abuse, aims to facilitate the implementation and evaluation of Opioid Intervention Courts (OIC). This project uses quarterly reports to assess the implementation and impact of the OIC model for 10 counties in New York, utilizing various data sources such as ODMAP, UCMS, and REDCap. The reports will highlight opioid court performance over time, including progress updates, participant and county outcomes, recommendations for improvement, and an overview of project activities and resources. By providing regular feedback to each site, Project Opioid Court REACH aims to improve participant enrollment, completion, and data collection to increase the number of individuals who successfully receive the care they need.

Topics in Observational and Survey Data (HSC 203)

Xinran Sun
Analysis of Treatment of Deep Vein Thrombosis in Hospitalized Patients Using the National Inpatient Sample Dataset
Deep vein thrombosis (DVT) is a major health problem worldwide, and several treatment methods have been proposed to treat it. This study aimed to explore the patient health conditions and socioeconomic factors that influence treatment decisions. The analysis was performed retrospectively on the National Inpatient Sample (NIS) dataset, using R. DVT cases were selected based on ICD-10-CM/PCS (International Classification of Diseases) diagnosis codes for DVT. The treatment methods were grouped into mechanical thrombectomy, catheter-directed thrombolysis (CDT), surgery, multiple methods, or other, based on the ICD-10-PCS procedure code.
The data analysis involved descriptive statistics, such as mean, median, and mode, to summarize the sample characteristics. Additionally, inferential statistics, such as regression analysis and correlation analysis, were used to identify relationships between variables.
The results indicated that the primary reason for hospitalization and the medical level of the hospital were the most significant factors influencing the treatment method patients received. There is also a geographic difference in the treatment used.
Overall, this study provides valuable insights into DVT treatment methods and suggests that researchers should separate primary DVT patients from secondary DVT patients to avoid confounding in the data analysis.

Mengfan Luo
Weight Change in Mid to Late Adulthood and Pancreatic Cancer Risk: A Pooled Analysis of 12 Cohort Studies
Pancreatic cancer is one of the top causes of cancer-related deaths in the United States, with a low 5-year survival rate of only 9%. Due to the link between cancer and obesity, weight maintenance is recommended for healthy individuals, and weight loss is recommended for those who are overweight or obese to prevent cancer, including pancreatic cancer. However, despite a high percentage of people attempting to lose weight, few achieve long-term weight loss, and many experience weight cycling. Previous studies have focused on the association between weight cycling and mortality or other types of cancer, but no study has examined the relationship between weight cycling or sustained weight loss and pancreatic cancer risk. In this study, we aim to use multiple measures of weight collected over mid to late adulthood to investigate the link between sustained weight loss and weight cycling with pancreatic cancer risk. The results of this study will provide new evidence on the relationship between weight loss and weight cycling and cancer risk, informing the development or modification of preventive guidelines.

Jiacheng Wu
Using Statistical Analysis to Improve NICU Call Efficiency
NICU volunteers working at the New York Presbyterian Morgan Stanley Children's Hospital make daily phone calls to newborns' families to introduce the hospital's parent support resources. This paper examines the relationships between the families' interest in the hospital's resources and various attributes, including the parents’ language, the assigned nursing section, the average number of calls made, and the call times, among others. Data were collected from the families of babies born at the NICU from January to July 2022. The results show a significant correlation between the newborn’s prematurity status and the family’s interest in the NICU parent support resources. In addition, calling the newborns' families at noon and after dinner yields the highest chance of an affirmative answer as well as the lowest average number of calls needed.

Hongrui (Renee) Wang
Descriptive statistics and Likert plot application construction
The BRED web applications are a collection of web-based applications for visualizing various types of data, including categorical, numeric, survival, and longitudinal data. These applications are developed using the R Shiny package. They facilitate the uploading of data and the creation of publication- and presentation-grade graphics, giving users flexible means of manipulating the data to meet their needs. The descriptive data visualization app allows users to perform exploratory analysis in many ways, including summary statistics tables, boxplots, histograms, density plots, and bar plots. The Likert plot app visualizes survey data that capture the degree of respondents' feelings. The BRED web applications prioritize user experience and graphic quality. We are endeavoring to develop a graphic design tool that is both user-friendly and error-tolerant, while providing users with a diverse range of options to modify their graphics.

Weize Sun
Impact of Newbody to Fat Loss
More Healthier Technology Ltd. is a national high-tech enterprise, founded in 2015 in China. Newbody, its flagship product, contains a variety of active nutrients to repair the body's metabolism and achieve two-way regulation of fat loss and muscle gain, thereby improving the symptoms of various chronic diseases caused by obesity.
This project focuses on understanding the impact of the product on fat loss. To study the potential relationship, a controlled product trial was designed to collect fat loss data from participants using the product.
Hypothesis: adding the Newbody supplement to scientifically designed dietary meals will have a significant impact on a person’s fat loss results.
The trial was set for about one month (11 March to 14 April, five weeks in total) with 100 participants. The 100 people, all of whom want to lose fat, are randomly assigned to 5 fat-loss camps, with 20 in each camp. The 20 people in each camp are then randomly divided into 2 groups: a control group that only follows scientific dietary guidance (no Newbody applied) and a treatment group that takes Newbody with their meals.
Every camp is supervised by a coach who guides the participants in both the control group and the treatment group to follow the designed 3 meals a day.
Each participant is equipped with a body fat scale and uploads their body data collected from their body fat scale into a central database following instructions given by the coach every day.

Internship Projects in Machine Learning (HSC 202)

Huili Zheng
Development of Non-Invasive Glucose Monitoring System and Physiological Indicators Analysis for Diabetes Management
Background technique: Effective exercise means that the user can achieve a certain exercise intensity and amount of exercise when doing exercise. After a short period of exercise, the user can relax physically and mentally, the heart rate can return to normal within 40 minutes, and the frequency of skin conductivity response can be greatly reduced. How to enable users to relax physically and mentally after effective exercise in a short period of time, increase energy, control exercise intensity, and reduce possible risks caused by exercise has become one of the difficulties in research.
The purpose of this project is to develop a model for discriminating effective exercise based on multiple physiological indicators and sensors. The model will use data collected from various sensors including electrocardiogram, respiratory rate, skin conductance, and three-axis gravity acceleration to improve the accuracy of exercise state classification.
Outcome variable: user's physiological background state information, which would be classified into 4 categories.
Covariates: Age, sex, physical fitness level, exercise intensity, exercise time, exercise type.
This project aims to develop a model for discriminating effective exercise based on multiple physiological indicators and sensors. The model will use data collected from various sensors to improve the accuracy of exercise state classification and provide users with feedback on their exercise performance.

Ziqian (Hester) He
Robust estimation of tumor mutation burden from minimized gene features
Immune checkpoint blockade (ICB) is becoming a useful treatment for many types of cancers. However, deciding which patients should receive this treatment remains a challenge. Tumor mutational burden (TMB), the total number of somatic coding mutations in a tumor, is a promising biomarker for predicting immunotherapy response. However, TMB assessment using whole genome sequencing or whole exome sequencing (WES) is expensive. Targeted gene panels are therefore commonly used for TMB assessment in clinical settings, and recent studies have shown that TMB can be estimated accurately from targeted sequencing of gene panels.
We developed a method for predicting TMB in lung cancer patients using a tumor-specific gene panel. We employed linear regression, random forest, and recursive feature elimination models to select the gene panel for predicting TMB and to find an optimal cutoff for this TMB. We then validated our results using survival analysis.
Our lung cancer gene panel accurately predicted TMB and provided an optimal cutoff for TMB assessment in these patients. Our panel can serve as a reference for designing TMB-oriented panels to identify lung cancer patients who will benefit from immunotherapy.

Ruilian (Roxy) Zhang
Dealing with Imbalanced Data Using Model-level Methods for LOI Status Prediction
An imbalanced dataset is one in which one or more labels make up the majority of the data, leaving far fewer examples for the other labels. Dealing with imbalanced data is an important topic in both statistical analysis and machine learning because 1) many real-world datasets are inherently imbalanced, and 2) class imbalance prevents a model from learning to predict the minority class well. There are three general approaches to imbalanced data: data-level methods, model-level methods, and metric-level methods.
Here we acquired the company’s quote and receipt data from 09/03/2015 to 07/28/2022 for the use of this study, merging them into one dataset. Among the features of the merged dataset, letter of intent (LOI) status is the key metric for business decisions. For different stages in LOI status, Opportunity Won (LOI/COO Agreed) is the one that indicates a successful turnover. In the merged and cleaned dataset, only 2.756% of the rows account for Opportunity Won (LOI/COO Agreed). Therefore, we are dealing with a binary classification problem where over 97.2% of the data is in the non-LOI agreed class. In this study, we combined the model-level methods for dealing with imbalanced data and machine learning theories to make a prediction of LOI status.
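As an illustration of what a model-level method can look like, the sketch below computes "balanced" class weights, n / (k * n_c), which many classifiers accept to make errors on the minority class cost more. The abstract does not state which model-level methods were used, so this is an assumption for illustration only; the row counts are hypothetical and chosen just to reproduce the 2.756% positive share.

```python
def balanced_class_weights(class_counts):
    """Return weights n / (k * n_c), so every class has the same total weight."""
    n_total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_total / (n_classes * count)
        for label, count in class_counts.items()
    }


# Hypothetical counts (not from the study) matching a 2.756% positive share.
counts = {"won": 2756, "not_won": 97244}
weights = balanced_class_weights(counts)

for label, weight in weights.items():
    print(f"{label}: weight = {weight:.3f}")
```

With these weights, each "won" row counts roughly 35 times as much as a "not_won" row in a weighted loss, so both classes contribute equally in total.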

Yujia Li
Development of a Self-Assessment Diabetes Risk Test
The company planned to offer a quick online diabetes risk test to promote awareness of the disease and to pave the way for the promotion of a new drug in China.
Data were collected from questionnaires distributed by the company, which recorded age, gender, and additional ordinal variables along with the outcome of whether participants were high risk (diagnosed with Type I or Type II diabetes) or low risk (mild/healthy status). Six models (logistic regression, decision trees, KNN, back propagation, SVM, and naive Bayes) were developed to identify the one that best predicts the outcome.
The naive Bayes method gave the highest coefficient of determination among all models, 0.7. For further promotion and generalization, we encapsulated the modeling and estimation algorithms in a single function that trains all of the mentioned models in turn and outputs their renderings and parameters. Because no documentation was provided for the scoring of the predictors or the outcome levels, our group cannot offer an explicit interpretation of the results. Nevertheless, since we simply intended to develop a preliminary algorithm, the integrated modeling functions work on the current data.
This self-assessment risk test could help individuals reflect on whether their lifestyle puts them at risk for diabetes. Additional data with clearly defined variables and other measurements are needed to calibrate the validity and accuracy of the tool for the population of interest.

Yuchen Zheng
Diabetes Status Prediction Using a Low-code Machine Learning Approach in SAS Viya
SAS Viya is an AI-based data analytics and management platform. It provides an easy drag-and-drop feature to empower people with limited programming experience to quickly build machine learning models. This project explores the capabilities of two modeling environments in SAS Viya, Model Studio and Visual Analytics, through an example of diabetes status prediction using four machine learning models. A diabetes data set was obtained from Kaggle. The data set has 253,680 survey responses and 21 feature variables. The response variable has three classes: no diabetes, prediabetes, and diabetes. Because the dataset is imbalanced, oversampling was applied before modeling in Model Studio. Basic data wrangling was done in SAS Studio, and machine learning models were applied to make predictions in both Model Studio and Visual Analytics. The models were trained using the autotuning feature in SAS Viya. The modeling results were visualized and scored in Visual Analytics. Compared with Visual Analytics, the results in Model Studio were more robust. The selected champion model was a gradient boosting model fitted with all the variables. The champion model had an accuracy of 0.58, an AUC of 0.83, a sensitivity of 17.48%, and a specificity of 98.23%, all computed on the validation data set. Model Studio is better when one is hoping to get robust results from modeling, while Visual Analytics is more useful for preliminary analysis, model building, and visualization.

Cancer Research (HSC 107)

Peilin Zhou
Neighborhood-level drivers of cancer incidence and mortality rates in NYC
Cancer incidence and mortality rates are known to vary across neighborhoods, but the factors that contribute to these variations are not fully understood. This study aims to identify the neighborhood-level drivers of cancer incidence and mortality rates in New York City (NYC), with a focus on lung, breast, and colorectal cancers. We analyzed data on cancer incidence and mortality rates, as well as neighborhood-level socio-demographic and environmental variables for all NYC neighborhoods. Linear regression models were run to estimate the associations of each risk factor with age-adjusted incidence rates, including individual-level behavioral variables such as smoking status and physical activity. Additionally, we will assess the spatial autocorrelation of incidence and mortality rates using measures such as Moran’s I statistic. We will also run spatial linear regression models, accounting for neighborhood clustering, to estimate the associations of each risk factor with age-adjusted incidence rates. The analysis will adjust for potential confounders, such as age, sex, race/ethnicity, and socioeconomic status. The results will provide potential insights into the neighborhood-level drivers of cancer incidence and mortality rates in NYC, and inform the development of targeted interventions to reduce cancer burden in high-risk neighborhoods.
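For readers unfamiliar with Moran’s I, the following is a minimal sketch of the global statistic, I = (n / S0) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2, where S0 is the sum of all spatial weights. The four "neighborhoods" on a line and their rates are made-up values for illustration, not data from the study, and the study's own weighting scheme is not stated in the abstract.

```python
def morans_i(values, weights):
    """Global Moran's I for values observed at n areas.

    weights[i][j] is the spatial weight between areas i and j
    (here 1 for neighbors, 0 otherwise; the diagonal is 0).
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # total weight
    cross = sum(
        weights[i][j] * dev[i] * dev[j]
        for i in range(n)
        for j in range(n)
    )
    sum_sq = sum(d * d for d in dev)
    return (n / s0) * (cross / sum_sq)


# Hypothetical layout: four areas in a row, each adjacent to its neighbors.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

smooth_rates = [1.0, 2.0, 3.0, 4.0]        # similar values sit side by side
alternating_rates = [1.0, -1.0, 1.0, -1.0]  # neighbors are always opposite

print(morans_i(smooth_rates, adjacency))
print(morans_i(alternating_rates, adjacency))
```

Positive values indicate that neighboring areas tend to have similar rates, and negative values indicate that neighbors tend to differ.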

Louis Sharp
An Overview of Factors Associated with Cancer Outcomes in the Sister Study
Factors associated with various outcomes in the Sister Study have been previously identified by several groups, but more research is needed to elucidate prognostic factors that may explain cancer outcomes. Based on previous research, environmental exposures may be associated with cancer outcomes in the Sister Study cohort. The potential impact of individual-level air pollutant exposures on cancer outcomes has not been systematically studied in this cohort, and we aim to robustly evaluate these potential associations. We have developed a cohort for analysis from the larger Sister Study cohort, which included 50,884 women aged 35-76 who enrolled in the study between 2003 and 2009 and had a sister with breast cancer. Information on NO2 and PM2.5/PM10 exposure predictions based on coded residence location data was included. Other data, including demographics, treatments, comorbidities, and medications, were also available. Logistic regression and Cox proportional hazards modeling are being used to test the a priori associations between exposures to air pollution constituents and cancer outcomes, adjusting for potential confounders. Models stratified on cancer type/smoking status will also be explored. Other exploratory analyses include machine learning techniques to identify factors with large influences on cancer outcomes. Preliminary analysis has confirmed associations with pollution sources such as energy sources in homes, among other exposures with robust data that are currently being investigated further.

Jing Lyu
An R Shiny Application in Oncology: Exploration of Interactive Visualization Methods in Pharmaceutical Data
The aim of this project is to enhance the R Shiny knowledge within the Oncology Programming Team at Bristol Myers Squibb, in order to explore Oncology endpoints through visualization. The goal was to create simple, interactive, and informative visualizations. There were two steps involved in achieving this objective. First, the application structure was built, followed by the design of the statistical visualization functionality. The initial step focused on developing an organized and user-friendly application structure based on R Shiny. The second and most critical stage of the project was the visualization process. The project considered both the efficacy and safety data of clinical Oncology research and produced interactive versions of various plots, including waterfall plots, swimmer plots, adverse event listing plots, and adverse event summary plots, among others. Moreover, the project also explored individual-level visualization, which is accessible through the R Shiny application. The end result of the project was a well-performing application that has already been launched within the BMS Oncology team. Going forward, the application will be standardized in terms of data input and made available as a service to clinicians and statisticians.

Yanling Xue
Pediatric oncology multidisciplinary provider perspectives on chemotherapy-induced nausea and vomiting (CINV) management and preferred improvement approaches
Chemotherapy-induced nausea and vomiting (CINV) commonly occurs in children undergoing cancer treatment and dramatically influences quality of life (Sommariva et al. 2016). Although guidelines exist, not all providers use them. Importantly, studies in adult oncology patients have shown that following guideline-recommended antiemetic regimens improves CINV control (Aapro et al. 2012, Dupuis et al. 2017, Mellin et al. 2018). Few studies have examined patient perception of CINV control. This study assesses pediatric oncology provider perceptions of CINV management at NYP-MSCH under current protocols and pediatric oncology provider awareness of CINV guidelines. It also investigates the approach of pediatric oncology RNs to administering antiemetics in patients with CINV.

Yangruijue (Anna) Ma
Two-Stage Design Sample Size Determination for Two Doses in Oncology Phase II Trials
Phase II clinical trials in oncology evaluate new drugs or regimens for efficacy and safety after dose escalation to the maximum tolerable dose. Simon’s single-arm two-stage design is commonly used for studies with only one dose but may not be sufficient for identifying doses with a balanced benefit-risk ratio for compounds with difficult-to-tolerate side effects. Including multiple doses in phase II studies is crucial for exploring the dose-response relationship but can be challenging as ineffective doses or compounds must be terminated quickly to minimize patient exposure. This paper presents an extended version of Simon's two-stage design that includes multiple doses. It covers decision probability calculations, considerations in dose-response evaluation, and an enumeration algorithm for sample size calculation. This extension can facilitate the exploration of dose-response relationships while minimizing exposure to ineffective drugs and unacceptable safety risks, in line with FDA guidance for expansion cohorts. The guidance recommends establishing an infrastructure to streamline trial logistics, considering multiple doses/regimens, randomization, and sample size justification to detect clinically important differences in safety and activity.
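To make the "decision probability calculations" concrete, here is a minimal sketch for the classic single-dose Simon two-stage design that the abstract extends; it is not the author's multi-dose extension. Stage 1 enrolls n1 patients and stops for futility if at most r1 respond; otherwise n - n1 more are enrolled and the drug is declared inactive if total responses are at most r. The design parameters (r1/n1 = 1/10, r/n = 5/29, response rates 0.10 and 0.30) are illustrative values commonly cited for Simon's optimal design and are assumptions here.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); zero when k is negative."""
    if k < 0:
        return 0.0
    return sum(binom_pmf(i, n, p) for i in range(min(k, n) + 1))


def simon_two_stage(r1, n1, r, n, p):
    """Operating characteristics at true response rate p."""
    n2 = n - n1
    # Probability of early termination: at most r1 responses in stage 1.
    pet = binom_cdf(r1, n1, p)
    # Continue past stage 1 but end with at most r total responses.
    fail_after_stage2 = sum(
        binom_pmf(x, n1, p) * binom_cdf(r - x, n2, p)
        for x in range(r1 + 1, min(n1, r) + 1)
    )
    return {
        "pet": pet,
        "p_declare_active": 1.0 - (pet + fail_after_stage2),
        "expected_n": n1 + (1.0 - pet) * n2,
    }


# Illustrative design: p0 = 0.10 (uninteresting), p1 = 0.30 (desirable).
under_null = simon_two_stage(r1=1, n1=10, r=5, n=29, p=0.10)
under_alt = simon_two_stage(r1=1, n1=10, r=5, n=29, p=0.30)

print(f"PET at p0: {under_null['pet']:.3f}")
print(f"Type I error: {under_null['p_declare_active']:.3f}")
print(f"Power: {under_alt['p_declare_active']:.3f}")
print(f"Expected sample size at p0: {under_null['expected_n']:.1f}")
```

A sample size search of the kind the abstract calls an enumeration algorithm would loop over candidate (r1, n1, r, n) values, keep those meeting the error constraints, and pick the one with the smallest expected sample size under p0.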

Dimension Reduction (HSC 201)

Keming Zhang
Application of Model-X Knockoffs algorithm in autism spectrum disorders metabolomics data
The 2018 Autism and Developmental Disabilities Monitoring (ADDM) Network report found that approximately one in 44 children in the United States have autism spectrum disorder (ASD), underscoring the urgency of understanding its pathogenesis. We used ASD metabolomics datasets to assess their potential as a biomarker tool for ASD. Because metabolomics data are high-dimensional, we applied a novel feature selection method called Model-X knockoffs, which separates important variables from noise in high-dimensional datasets while controlling the false discovery rate (FDR), and employed four machine learning algorithms with different variable sets. The results show that the algorithm using variables selected by Model-X knockoffs achieves the highest AUC across the different datasets.

Trisha Dwivedi
Machine learning models of 6-lead ECGs for the interpretation of left ventricular hypertrophy (LVH)
Left Ventricular Hypertrophy (LVH) is closely linked to the cardiovascular disease prognosis, and thus, timely diagnosis improves outcomes. Diagnosis is challenging due to dependency on doctor's visits and a 12‑lead ECG. The aims of this study are to evaluate different big data-driven machine learning models for ECG LVH interpretation based on limb leads only.
The first two models are binary class Random Forest models. One is trained using the following features: lead aVL R-wave amplitude, lead I, II, aVL ST segment amplitude, and QRS duration. The second RF model uses 54 features across all limb leads, including those in the smaller model. The second type of model is a multi-class deep neural network (DNN) which takes median beats of 6 limb leads as input. The DNN consists of 1 lead-formation convolutional layer, 5 downsampling convolutional resnet blocks with skip connections, and 3 fully connected layers. 1.25 million 10s 12‑lead ECGs from Mayo Clinic were used.
The five-parameter RF model has AUC of 0.78, and the larger RF model 0.83. The DNN for ECG LVH detection achieves AUC 0.92 using only the limb leads, compared to an AUC of 0.98 for the full 12‑lead DNN.
We observe that the RF model splits parameters by thresholds known to be characteristic of LVH, and that the DNN model can automatically detect morphology differences from 6 limb lead ECGs. This will be meaningful for expanding the capabilities of potential electrical LVH detection in mobile 6‑lead ECG devices.

Jesse Ames
Evaluating the performance of clustering algorithms for microbiome data
Modern sequencing technologies allow microbiologists to obtain taxonomic count data, in which each feature is the count of a species or strain of bacteria, archaea, or fungus found in the sample. Taxonomic count data is zero-inflated, high-dimensional data with complex correlation structures. To manage the high dimensionality of taxonomic count data and better understand differences between healthy and diseased states of the human microbiome and environmental microbiota, microbiologists use clustering algorithms.
Clustering algorithms partition samples into subgroups so that samples in the same cluster are similar to one another, and samples in different clusters are different from one another. Clustering is unsupervised because there is no a priori response variable used to train the model. Currently, there is no universally accepted method for performing cross-validation on clustering methods, though metrics such as the adjusted Rand Index (ARI) are used to evaluate their performance.
In this project, we address two questions. First, we investigate which combinations of data transformations and clustering algorithms produce the best clustering results on microbiome data as assessed by ARI. Second, we investigate which method for determining the optimal number of clusters is most accurate to the true number of metadata categories in microbiome datasets. To answer these questions, we use several different microbiome datasets publicly available through the MGnify database.
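As a reference for the evaluation metric, the sketch below computes the adjusted Rand Index from two label vectors using the standard pair-counting formula. It is a plain-Python illustration with made-up labels, not the project's code; returning 1.0 when the denominator is zero is a convention assumed here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two partitions of the same samples."""
    n = len(labels_true)
    # Contingency counts n_ij plus row and column totals.
    pair_counts = Counter(zip(labels_true, labels_pred))
    row_counts = Counter(labels_true)
    col_counts = Counter(labels_pred)

    index = sum(comb(c, 2) for c in pair_counts.values())
    row_pairs = sum(comb(c, 2) for c in row_counts.values())
    col_pairs = sum(comb(c, 2) for c in col_counts.values())

    expected = row_pairs * col_pairs / comb(n, 2)
    maximum = 0.5 * (row_pairs + col_pairs)
    if maximum == expected:
        # Degenerate case (e.g. one cluster in both partitions); assumed convention.
        return 1.0
    return (index - expected) / (maximum - expected)


# Hypothetical metadata categories and two candidate clusterings.
metadata = ["healthy", "healthy", "diseased", "diseased"]
same_up_to_relabeling = [1, 1, 0, 0]
crossed = [0, 1, 0, 1]

print(adjusted_rand_index(metadata, same_up_to_relabeling))
print(adjusted_rand_index(metadata, crossed))
```

The index does not depend on how clusters are numbered, which is why the relabeled clustering still scores a perfect match, and it can be negative when agreement is worse than expected by chance.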

Jialiang Hua
Identifying Important Risk Factors for Future Mania in Individuals with Major Depressive Disorder Using Weighted Random Forest Models and the NESARC Dataset
This study aimed to identify significant risk factors for future mania in individuals with major depressive disorder using weighted random forest models trained on the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) dataset. The dataset was cleaned and some explorative data analyses were conducted. Cross-validation techniques were employed to train the random forest models, and parallel computing was applied to expedite the modeling process. Furthermore, this research explored the performance of random forest models on raw versus preprocessed variables. Our findings highlight the importance of identifying key risk factors to improve prediction and inform preventive interventions for individuals at risk of developing mania in the context of major depressive disorder.

Yushan Wang
R Shiny Application for Clinical Trial Result Outputting
This project aims to create an R Shiny application that can produce TLFs (tables, listings, figures) and outlier analyses in a user-friendly and efficient way using laboratory data generated in a clinical trial. In total, two figures and three tables are produced in this app, and they can be customized according to user inputs. The application is coded in separate modules that are bundled into a single R file, which is a newly developed R coding approach. In addition, this app has a user-friendly and intuitive interface with clear instructions. It provides a range of customization options for users, allowing them to navigate using tabs and menus, select different filters for the data, adjust visualization parameters, and download the tables and figures to their local machine.
The project will be particularly useful for investigators who are not proficient in R programming but need to generate clinical trial key components. The application will also provide an efficient way to communicate clinical trial results to stakeholders, such as regulatory agencies, investors, and patients.

COVID-19 (HSC 110)

Xuehan Yang
Biomarkers Of Inflammation And Aging Are Not Associated With Fibrotic-like Pulmonary Radiographic Patterns In Severe Covid-19 Survivors
Biomarkers of inflammation and aging have established associations with interstitial lung abnormalities in community-dwelling adults and with worse physical function in critical illness survivors. We sought to determine whether biomarkers of inflammation and aging are cross-sectionally and longitudinally associated with fibrotic-like pulmonary radiographic patterns in severe COVID-19 survivors.
We conducted a single-center, New York City-based prospective longitudinal cohort study of 102 adults hospitalized with severe COVID-19 in 2020, with sampling weighted to include 50% survivors of invasive mechanical ventilation. Serum samples were obtained at hospital discharge, 4 months, and 1 year, and were compared with chest CT scans at 4 months and 1 year. Radiographic patterns were categorized and quantitated using a severity scoring system developed by ARDSnet and used in ARDS survivors. Fibrotic-like patterns included abnormalities consisting of reticulations, traction bronchiectasis, or honeycombing. We used generalized additive logistic models with covariate balanced propensity scores to test adjusted associations between biomarkers and the risk of fibrotic-like patterns while controlling for demographics, comorbidities, and COVID therapies. We corrected for false discovery via the Benjamini-Hochberg method.
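For reference, the Benjamini-Hochberg correction mentioned above can be sketched as follows: sort the p-values, scale the i-th smallest by m / i, and enforce monotonicity from the largest rank down. The p-values in the example are made up for illustration and are not results from the study.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for four biomarkers (illustrative only).
raw_p = [0.01, 0.04, 0.03, 0.005]
adjusted_p = benjamini_hochberg(raw_p)
print(adjusted_p)

# Biomarkers whose adjusted p-value is below a 5% false discovery rate.
significant = [i for i, p in enumerate(adjusted_p) if p < 0.05]
print(significant)
```

Comparing the adjusted values with the chosen false discovery rate is equivalent to the usual step-up rule of rejecting all hypotheses up to the largest rank i with p_(i) <= (i / m) * q.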

Jia Ji
COVID-19 data project
Hundreds of researchers have engaged with the downloadable data, which have been embedded and shared 45 thousand times. The data have inspired policy recommendations, grants, and journal articles, as well as some poster presentations at conferences. Each state has a unique reporting system, so to have the highest quality data we needed to collect and quality-assure it ourselves. The project covers COVID-19 cases and deaths, state policies, and vaccine distribution. The primary data goal was to create and maintain a research-ready dataset of COVID-19 cases, deaths, disparities, and policies for every county in the nation. After learning about public health infectious disease surveillance systems in the United States, the work involved building datasets, identifying trend breaks, and completing data entry.

Jiahao Fan
COVID Fake News Detection
Social media has become inseparable from daily life. There were 4.59 billion social media users in 2022 according to Statista. Social media spreads news, opinions, and sometimes misinformation. Misinformation refers to the spreading of false information regardless of intent. It tends to attract more attention through made-up novelty, exaggeration, and inflammatory content. Because social media conveys content in fragments, individuals often do not take the few minutes needed to deliberate over the accuracy of the information. As a result, misinformation spreads more rapidly and frequently. Meanwhile, real news posts may be buried under comments attempting to disprove the truth, and faceless pages may share intentionally misleading news content. Misinformation is considered one of the greatest threats worldwide, leading to huge commercial damage, intensifying social conflict, and discrediting belief in scientific findings.
In recent years, the global COVID-19 pandemic has continuously triggered heated discussions. However, some rapidly spreading information that aroused panic and contention turned out to be fake. Therefore, the purpose of this research is to analyze the differences between fake and real tweets regarding COVID-19, extract keywords in both categories, and finally distinguish fake tweets from real ones.

Kaiyu He
Evaluating completion rates of COVID-19 contact tracing surveys in New York City
Contact tracing is an important means for reducing the spread of infectious diseases. Understanding factors associated with completion rates of contact tracing surveys can help design improved interview protocols for ongoing and future programs.
The primary outcomes were the completion rates of case investigation calls in NYC ZIP code areas.
The overall completion rate in NYC was 0.79, with substantial variation across ZIP code areas. Using a generalized linear mixed model that controls for demographic and socioeconomic factors at the ZIP code level, we found that residents over 65 years old had a 0.033 (95% CI: 0.026 – 0.040) lower completion rate than adults aged between 24 and 64 years old. In addition, phone calls made between 6 pm and 9 pm had a 0.020 (95% CI: 0.012 – 0.028) higher completion rate than calls attempted between 12 pm and 3 pm. We further used a machine learning algorithm to assess the potential utility of predictive models for selecting phone call time. The overall completion rate in NYC improved marginally, by 0.012; however, certain ZIP code areas had improvements of up to 0.1.
These findings suggest that age and phone call time were associated with completion rates of phone surveys. It is possible to develop predictive models that estimate the best phone call time to improve completion rates in certain communities.

Machine Learning (HSC 207)

Hao Mei Zheng
Overall Survival Prediction on Non-small Cell Lung Cancer Patients
This project predicts overall survival of patients first diagnosed with non-small cell lung cancer in 2015-2020 using the machine learning methods XGBoost and Random Survival Forest, based on features such as demographics and longitudinal biomarkers. The performance measures include area under the curve (AUC) and Brier score. For missing values, this project imputed continuous baseline variables with medians, created a separate category for missing values of categorical baseline variables, and used Last Observation Carried Forward (LOCF) to impute longitudinal markers. The best model achieved an accuracy of 0.7358 and produced a feature importance analysis to guide further experimental design.
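The three imputation rules named above can be sketched in plain Python as follows. This is an illustrative sketch only, not the project's code: the function names and the toy values are hypothetical, and the actual analysis may have used a different toolchain.

```python
from statistics import median

def impute_median(values):
    """Replace None in a continuous baseline variable with the observed median."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def impute_category(values, label="Missing"):
    """Treat missingness in a categorical baseline variable as its own level."""
    return [label if v is None else v for v in values]

def locf(series):
    """Last Observation Carried Forward for one patient's time-ordered marker.

    Leading gaps stay None because there is no earlier value to carry.
    """
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
        filled.append(last)
    return filled

# Hypothetical toy data, not taken from the study.
age = [61, None, 70, 55]
stage = ["I", None, "III", "II"]
marker = [None, 4.2, None, None, 3.9, None]

age_imputed = impute_median(age)
stage_imputed = impute_category(stage)
marker_imputed = locf(marker)
```

Note that LOCF cannot fill a gap that precedes a patient's first observation; how such leading gaps were handled is not stated in the abstract.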

Qixiang Chen
A Longitudinal Study to the Performance of Sales Representatives and Data Analysis for Historical Sales
During the summer internship, I worked with my mentors to redesign an internal relational database and rewrite a series of SQL queries for commonly used data, improving the efficiency of data retrieval. For the remainder of the internship, I conducted several analyses of historical data, including the performance of salesmen and the sales of several products.
For the practicum project, a brief introduction to the internship work is presented first. The project then describes the exploratory data analysis of the two datasets, which provides a comprehensive overview of the data. Several hypothesis tests are conducted on each dataset. Next, the historical sales dataset is split into a training dataset and a test dataset. Several regression models are fitted on the training data and evaluated on the test data. The models fitted to the historical sales dataset include LASSO, ridge, elastic net, PLS, PCR, GAM, and MARS. Finally, the best model is selected by jointly considering the bias-variance trade-off, RMSE, Mallows’ Cp, AIC, and BIC. For the salesmen performance dataset, the fitted models include a GEE model to study the population-average effect and a GLMM to learn the subject-specific effects among salesmen. Since the data are confidential, I am only allowed to use a small part (one product set) of the data for the practicum.

Wenshan Qu
Reducing the bias of random survival forests by averaging martingale estimating equations
The Cox proportional hazards model is a commonly used method for studying time-to-event data, but as a semi-parametric model, it forces the outcome and the covariates to have a specific relationship. Hence, the survival tree and the corresponding random survival forest (RSF) are attractive non-parametric methods for studying survival data. Currently, the predict() function provided by the ranger package in R is commonly used with RSF, but its accuracy fell short of our expectations in our simulation studies. In this article, a novel ensemble procedure based on averaging martingale estimating equations is applied when training the RSF to predict survival probability from baseline covariates. Simulation studies are conducted to compare the performance of the proposed function with that of the predict() function in the ranger package. The simulation results indicate that the proposed function performs better than the Cox model.

Fei Sun
Evaluating and Predicting the Factor of Ocular Disease
The eyes are the most active and sensitive organs of the body, and eye diseases may be encountered at all ages. Serious harm from these diseases can largely be prevented if they are detected early and treated in time. This project uses penalized logistic regression, linear discriminant analysis (LDA), a generalized additive model (GAM), multivariate adaptive regression splines (MARS), random forest, and support vector machines (SVM) for prediction. Once the appropriate model is selected, we will evaluate factors that affect the probability of different ocular diseases, such as age, gender, occupation, and race. The eye is an important sensory organ of the human body, and more than 70% of the external information obtained by the human brain comes from vision. Eye diseases not only bring different degrees of visual impairment or loss to patients but are also a major public health problem. Eye diseases include glaucoma, cataract, age-related macular degeneration, refractive eye disease, loss of near vision, and other vision loss. These eye diseases will significantly increase the burden on public health and the economy. Machine learning can be applied to predict and assess the risk of ophthalmic diseases. By training on a large amount of clinical and genomic data, machine learning algorithms can predict an individual's risk of developing ophthalmic diseases and provide personalized prevention and treatment recommendations.

Shihui Zhu
Interpretable machine learning models improve clinical understanding of factors with the greatest impact on perioperative mortality risk
Perioperative mortality risk stratification is a critically important aspect of surgical and anesthetic care. Prior methods for risk stratification simplify risk prediction models by limiting the number of included variables and providing inference on the average effects of each predictor on the outcome. Machine learning (ML) can fit models with complex, non-linear relationships between variables but has limitations in interpretability. We aimed to use interpretable ML methods to develop personalized risk prediction models for individual patients.

Mental Health (HSC 210)

Pinyi (Paula) Wu
Understanding the Relationship between Childhood Trauma and Change in Adulthood Cognition
Accumulating evidence suggests that childhood trauma exposure is associated with adulthood cognitive functioning and could lead to various mental health issues. On the other hand, various eudemonic factors claim to be able to attenuate the deterioration. However, the nature and extent of the relations has yet to be fully explored. In this study, we used multilevel modeling to examine trauma exposure, measured in self-administered questionnaire scores, as the predictor of the change in cognitive functioning over a 9-year period. Data were from the Midlife in the United States study, a national survey that began in 1995. Data regarding childhood trauma scores were obtained from the wave 2 (2004) study, while cognitive functioning data were collected from both wave 2 (2004) and wave 3 (2013) using the same test battery. The analyses were conducted using data from 867 participants [age: 34-81] who had complete and valid data on all variables from the 2004 wave. Higher childhood trauma scores are related to greater decline in episodic memory (EM), B = -0.006, SE = 0.002, p =0.007, but not in executive functioning (EF), B = 0.0003, SE = 0.001, p =0.73. None of the eudemonic factors has been found to be significantly related to change in cognition. We also found that the higher the age, the greater the decline in both EM and EF. These findings identify childhood trauma exposure as a risk factor for cognitive decline in adulthood and highlight the elevated risk associated with aging.

Yijing Tao
Relationship between Childhood Experience of 911 and Grown-up Trauma Mediated by Blood Cytokines
Experiencing 911 might lead to trauma for children as they grow up. To learn more about the impact of experiencing 911 as a child, we conducted this study to examine the relationship between 911 experience and trauma, considering different kinds of cytokines as mediators. We surveyed people who did or did not experience 911 as children, collected information on the traumas they have, and tested 60 kinds of cytokines in their blood. We then used machine learning models such as linear regression and the SAM model to test the relationships. The current results show no strong relationship between 911 experience and grown-up cytokines, and no strong relationship between grown-up trauma and grown-up cytokines. The results might help mental health hospitals, the insurance industry, and social welfare policymakers make better plans for clients or patients who experienced 911 as children. A larger sample size may be needed to provide stronger evidence.

Yuxi (Kaitlyn) Wang
Understanding interaction effects of parental encouragement to diet and parent/child sex in child unhealthy weight control behaviors and emotional well-being
Prior research has shown an association of parent encouragement to diet with unhealthy weight control behaviors, dieting, lower self-esteem, and binge eating, but the influence of parent or child sex on these outcomes remains unclear. To better inform family-level interventions, it is important to understand whether parent/child sex modifies the association of parent encouragement to diet with child unhealthy weight control behaviors and emotional well-being outcomes. In Project EAT, survey data from 4361 children and 3855 parents were collected in 2010 and 2018. To understand the association of parent/child sex and parent encouragement to diet with child outcomes, this study focused on 927 parent/child pairs with child and parent data at both time points. Cross-sectional descriptive statistics showed that the level of parent encouragement to diet is associated with unhealthy weight control behaviors and body satisfaction scores. Linear and logistic regression models were fitted to examine the associations between child outcomes and interactions of parent/child sex with encouragement to diet. Using the emmeans package in R, plots and two-way/three-way interaction contrasts between parent sex, child sex, and encouragement to diet were generated and showed the interaction effects for different types of child outcomes.

Yiming Zhao
Resting-State Functional Connectivity Changes in Older Adults with Sleep Disturbance and the Role of Amyloid Burden
The impact of sleep disturbance on resting-state functional connectivity (rsFC) in older adults and the role of amyloid β (Aβ) burden in this relationship are not yet fully understood. To investigate this relationship, a cross-sectional study was conducted using a large-scale dataset from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). The study included 489 participants, consisting of 53.6% cognitively normal, 32.5% mild cognitive impairment, and 13.9% AD individuals, who had completed sleep measures, PET Aβ data, and resting-state fMRI scans at baseline. The study compared within and between rsFC of the Salience (SN), the Default Mode (DMN), and the Frontal Parietal network (FPN) between participants with sleep disturbance versus those without. Linear regressions were conducted to evaluate the interaction between Aβ positivity and sleep disturbance, controlling for age, diagnosis status, gender, sedatives and hypnotics use, and hypertension. While the study did not find any significant main effect of sleep disturbance on resting-state functional connectivity (rsFC), a significant interaction term between sleep disturbance and Aβ burden on rsFC of the Salience network (SN) emerged (β=0.11, P=0.006). Specifically, the presence of Aβ burden was found to be a critical factor in the association between sleep disturbance and SN hyperconnectivity. The results suggest that sleep disturbance may lead to altered connectivity in the SN only when Aβ accumulates in the brain.

AnMei Chen
Association between Schizophrenia and Olfactory Deficit
Schizophrenia, which is identified by psychotic symptoms, continues to pose challenges in terms of its causes and treatment. Patients who develop schizophrenia have been reported to experience olfactory dysfunction and impaired cognitive ability. The main objective of this study is to understand the association between schizophrenia, olfactory deficit, cognition, and mRNAs in lymphocytes. This study included a total of 58 patients with chronic schizophrenia and 48 controls. Olfactory function was assessed using the Sniffin' Sticks test battery, which measures odor discrimination and odor identification scores. Cognitive function was measured using the MATRICS Consensus Cognitive Battery, and mRNAs in lymphocytes were assessed by qPCR using TaqMan probes. Logistic regression models were constructed to identify variables that have significant effects in discriminating the presence and absence of schizophrenia. Variable importance ranking was determined based on standard coefficients, c-statistics, and logistic pseudo partial correlation methods. The results showed that only the MATRICS social cognition and speed of processing measures and the odor discrimination score were significant variables. Several limitations of the study should be noted, including significant gender imbalance within both the disease and control groups, the lack of evaluation of thresholds or other measures of perception, and the absence of information regarding the educational levels of the subjects.

Survival Data Analysis (HSC 109B)

Jiayao Sun
Survival Analysis on RCT for Catheter-Associated Urinary Tract Infections
Catheter-associated urinary tract infections (CAUTIs) are a common complication among hospitalized patients who require urinary catheterization. This research used a randomized controlled trial (RCT) to evaluate an intervention aimed at preventing CAUTIs. The purpose of this study is to examine the effectiveness of a new intervention, in which a doctor monitors urinary tract status every two days, compared with the standard practice of monitoring without a fixed schedule. The research question is whether the new intervention can reduce the incidence of CAUTIs and prolong the time to their occurrence among hospitalized patients.
The research used R to manipulate, clean, and analyze the clinical trial data, drawn from the NYC medical record system over two years. The research built a framework for both intent-to-treat and per-protocol samples and developed user-defined functions to improve descriptive statistics commands. Additionally, survival analysis techniques such as Kaplan-Meier estimation and Cox PH models were used to evaluate the effectiveness of the intervention.
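As a rough illustration of the Kaplan-Meier estimator mentioned above, the sketch below computes the product-limit survival curve in plain Python. The study itself used R; the function name and the toy follow-up data here are hypothetical and serve only to show the calculation.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    times  : follow-up time for each patient
    events : 1 if the event (e.g. a CAUTI) was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0   # events occurring exactly at time t
        leaving = 0    # everyone leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed > 0:
            # Product-limit step: multiply by (1 - d_t / n_t).
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Hypothetical follow-up times (days) and event indicators for one arm.
times = [2, 3, 3, 5, 8, 8, 9]
events = [1, 1, 0, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
```

Patients censored at a tied time are counted as still at risk for events at that time, which is the usual convention.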

Ying Jin
Correcting Bias Caused by Differential Assessment Intervals
This research project in survival analysis (specifically, interval censoring) aims to develop a new statistical methodology to address practical problems that occur when the spacing of visits differs between the treatment and control groups. The developed method, the realigned method, assumes a constant hazard rate and virtually realigns the assessment intervals of the groups to make them comparable. A simulation was then conducted to compare the performance of the new methodology, including type I error, power, and hazard ratio, with that of right-censored and interval-censored methods.

Jibei Zheng
Lessons and Implications from Three CAR T-cell Therapy Studies with Conflicting Results
The rapid advancement of CAR T-cell therapy heralds a promising new era in cancer treatment. Nevertheless, its distinct features pose new challenges in trial design and data analysis. When not conducted appropriately, these trials can produce highly divergent findings.
As an illustration, three randomized controlled trials investigating comparable CAR T-cell therapies versus standard-of-care therapy in large B-cell lymphoma (the ZUMA-7, BELINDA, and TRANSFORM studies) yielded conflicting findings despite numerous similarities. Specifically, while ZUMA-7 and TRANSFORM demonstrated a statistically significant differential effect of CAR T, BELINDA displayed no such effect. From a study design perspective, how can we explain these disparate results, and what lessons can we learn to guide the appropriate design of future trials?
In this project, we aim to investigate the methodological factors that primarily contribute to the discrepant results observed in the ZUMA-7, TRANSFORM, and BELINDA trials. Specifically, we examine how critical design features impact study outcomes and trial efficiency by first developing a simulation algorithm that mimics the published survival patterns of the three studies based on reported study settings. We then vary critical design parameters to quantify their impact. Finally, we will provide recommendations to inform the proper design and analysis of future cell therapy studies.

Yijia Jiang
Simulation for Rerandomization in Oncology Maintenance Therapies
Re-randomization is often used in oncology clinical studies to evaluate changes in treatment strategies. For example, patients with certain blood cancers may be initially treated with an autologous cell therapy. After patients recover from adverse reactions to the cell therapy, a maintenance therapy may be added to strengthen the effect of the cell therapy. The primary objective of this clinical study is to evaluate the treatment effect of the combination of the cell therapy plus the maintenance therapy in comparison with a standard of care (SOC). In addition to assessing the treatment's effectiveness, the FDA also mandates an evaluation of the risk/benefit of the added maintenance therapy in comparison with the cell therapy alone (monotherapy). Therefore, re-randomization may appear to be a natural choice for the comparison between the combo and mono treatment groups. However, when a progression-free survival endpoint is used for survival analysis, re-randomization may bias the comparison between combo and SOC. Simulations are conducted to identify and account for such biases and to ensure accurate study results.

Wen Cheng
A propensity score matching approach to comparing the efficacy of arthritis management programs
In our research, we analyzed a dataset that was comprised of patients diagnosed with arthritis who participated in either exercise programs or self-management programs. Our primary research question was focused on examining whether there were differences in outcomes based on the specific program that the patients participated in.
To answer this question, we employed various statistical methods and measures. Firstly, we conducted pre- and post-program assessments for a range of outcomes. We then explored different matching algorithms to ensure that our results were robust and unbiased. The methodology we ultimately chose was propensity score matching, which helped us balance the baseline characteristics of the two groups and minimize confounding effects.
In addition to propensity score matching, we also employed survival analysis methods, specifically Cox's proportional hazards regression model. This allowed us to examine the time-to-event data, such as the time it took for patients to experience a specific outcome, and to determine if there were significant differences between the two groups.
Overall, our study aimed to shed light on the effectiveness of exercise programs versus self-management programs for arthritis patients. By utilizing a range of statistical methods and measures, we were able to draw more reliable conclusions and contribute to the growing body of knowledge in this field.
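To make the matching step concrete, here is a minimal sketch of one common variant: greedy 1:1 nearest-neighbor matching on the propensity score, without replacement and with a caliper. It is not necessarily the algorithm the study selected. The propensity scores are assumed to have been estimated already (for example by logistic regression), and all identifiers, scores, and the caliper value are hypothetical.

```python
def greedy_match(treated, control, caliper=0.1):
    """Greedy 1:1 nearest-neighbor propensity score matching without replacement.

    treated, control : dict mapping patient id -> estimated propensity score
    caliper          : maximum allowed score difference for a valid match
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(control)
    pairs = []
    # Process treated patients in order of increasing score.
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # each control is used at most once
    return pairs

# Hypothetical propensity scores for the two program groups.
exercise = {"E1": 0.30, "E2": 0.55, "E3": 0.90}
self_management = {"S1": 0.28, "S2": 0.50, "S3": 0.62, "S4": 0.35}
pairs = greedy_match(exercise, self_management)
```

In this toy example E3 stays unmatched because no remaining control falls within the caliper. Balance of baseline characteristics would still need to be checked after matching.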

Session 3 (1:45pm - 2:45pm)

Network Analysis, Clustering, and Functional Data Analysis (HSC 312)

Xiangyuan Cao
Assessing Publication and Author Impact in Autism Research: A Network Analysis Approach
As the rate and number of publications continue to grow, metrics that capture the impact of scholarly work become increasingly important in order to recognize the contribution of an author or research project to its research space. Bibliometrics, which often present author and journal-level metrics, can be overwhelming to present for more specific use cases like grantmaking or research collaboration. Visualizations resulting from such analyses, such as network graphs, also suffer from similar pitfalls. This work seeks to address some of the readability issues of bibliometrics by providing categorized information in summary graphs relevant to the autism research space. It is inspired by the efforts of Altmetrics and PlumX but seeks to retain the richness of information that network analyses provide and to give that information context by comparing publications with others in the same research space. PageRank centrality was calculated in a network analysis using citations from autism publications queried in PubMed and Semantic Scholar from 2000 to 2020. Additionally, PageRank centrality was calculated for weighted author collaborations and a parallel decomposed version of the citation network. Percentiles for the respective centrality values were calculated to compare publication impact, degree of author collaboration, and the impact of an author’s body of work. Visualizations for individual publications and authors were subsequently generated, showing summary information.
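The PageRank-plus-percentile idea can be sketched as below. This is a simplified, unweighted power-iteration version in plain Python, assuming a tiny hypothetical citation graph. The damping factor of 0.85 is a conventional default rather than a value reported in the abstract, and the weighted collaboration network is not covered here.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Unweighted PageRank by power iteration.

    links : dict mapping each paper to the list of papers it cites
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in nodes}
        for p in nodes:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A paper citing nothing spreads its rank over all nodes.
                share = damping * rank[p] / n
                for q in nodes:
                    new[q] += share
        rank = new
    return rank

def percentile_ranks(scores):
    """Percent of items whose score is less than or equal to each item's score."""
    values = list(scores.values())
    return {k: 100.0 * sum(v <= s for v in values) / len(values)
            for k, s in scores.items()}

# Hypothetical citation graph: A and B cite C, and C cites D.
citations = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
rank = pagerank(citations)
pct = percentile_ranks(rank)
```

Reporting the percentile rather than the raw centrality is what lets a single publication be read against the rest of its research space.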

Yingchen Xu
Functional Data Modeling of Melatonin Cycles
Background and objectives: Melatonin is a hormone that regulates sleep, and sleep quality has been implicated in mental health. In this study, we focused on modeling melatonin cycles across subjects, and we hypothesized that sleep deprivation could influence the melatonin cycles of people with Major Depressive Disorder (MDD).
Methods: Subjects were asked to measure their baseline melatonin concentration and their melatonin concentration after 24 hours of sleep deprivation. We used function-on-scalar regression to model subjects' melatonin cycles. The significance of the covariates was tested using permutation tests.
Results: The predicted functions suggest that melatonin concentration is higher in males than in females after 1 am, and that older subjects have higher melatonin concentration overall. Melatonin concentration is slightly higher for subjects with MDD before 2 am and slightly higher for the control group between 2 am and 8 am.
Discussion: Based on the predicted functional models, we found that gender and age play a significant role in melatonin cycles, while MDD status seems to be unrelated to melatonin cycles. Further analysis is needed to test statistical significance.

Han Bao
Data Analysis of PQMD NGO member survey
Completed the basic data visualization of the results of survey data with various charts using Excel and Python.
Used a correlation heat map in SAS to explore relationships among the survey results and any potential patterns in the choices organizations make.
Used K-means clustering with Python on Databricks to put organizations into groups in order for potential further cooperation between organizations.
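For readers unfamiliar with K-means, the sketch below shows the basic Lloyd iteration in plain Python. It is only an illustration: the project presumably relied on a library implementation, the two numeric features per organization are hypothetical, and taking the first k points as initial centroids is a simplification (libraries such as scikit-learn default to k-means++ initialization).

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (labels, centroids)."""
    # Simplified initialization: the first k points serve as starting centroids.
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its members (keep it if empty).
        updated = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return labels, centroids

# Hypothetical survey-derived features for six organizations.
organizations = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = kmeans(organizations, k=2)
```

Because the result depends on initialization and on feature scaling, survey features would normally be standardized first and the algorithm run from several starting points.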

Fei Xiao
Prenatal drinking, smoking and maternal mood disorder on birth outcome
This study focuses on the adverse impact of prenatal drinking, smoking, and maternal mood disorders on birth and child development outcomes. It is well-known that these exposures often co-occur, and women with multiple comorbidities and substance abuse are at a higher risk of adverse outcomes. Therefore, identifying women with multiple exposures is important to target high-risk women and children. However, combining exposures measured using varied scales and time points presents significant methodological challenges. Using data from PASS, this paper describes the clustering methods used to identify groups based on trajectories of maternal prenatal drinking, smoking, and mood symptom severity. Nonparametric K-means cluster analysis was conducted to identify groups in the combined site of the United States and South Africa, as well as each site separately. The paper shows how these clusters perform in predicting and associating with adverse birth outcomes. The findings have important programmatic implications for identifying high-risk women and children and developing targeted interventions.

Yongzheng Li
Applied practice experience as consultant
This practicum project aims to share my work and experience as a healthcare consultant at Phalanx Analysis group. I will share 3 projects I completed during my internship. The first involves using logistic regression to analyze heart failure patient data. The second involves applying data visualization to injury death data. The third involves data visualization and prediction of the infant mortality rate.

Population Health (HSC 203)

Jingnan Yuan
CALERIE allostatic load trial project
The overall goal of this APEx is to report the results of a post hoc analysis of the influence of CR on DNAm measures of aging in blood samples from the Comprehensive Assessment of Long-term Effects of Reducing Intake of Energy (CALERIE) trial.
My role is to conduct preliminary screening and sorting of the research datasets.
During the practicum, I met with Dr. Belsky and finished the assigned work: I learned the background of the study by reading related papers and the code books, filtered the database for the biomarkers needed to construct the Allostatic Load measure, and generated demographic tables of the observations.

Qi (Chloe) Jian
Association between Polypharmacy Use and Hard Braking Events in Older Drivers
Older drivers may take various medications due to age-related medical conditions. Polypharmacy use poses a serious concern for driving safety. This study assesses the association between polypharmacy use and hard-braking events (an indicator of unsafe driving commonly called near misses) in older drivers. Data for this study came from the AAA Longitudinal Research on Aging Drivers project, a multi-site, prospective cohort study of 2990 older drivers aged 65-79 years at baseline. The primary vehicles of the study participants were instrumented with in-vehicle data recording devices for up to 44 months. A multivariate negative binomial model was used to estimate the adjusted incidence rate ratios and 95% confidence intervals (CIs) of hard-braking events. Of the 2990 participants, 2872 (96.1%) were eligible for this study. At the time of enrollment, 157 (5.5%) drivers were on fewer than two medications, 904 (31.5%) on 2-5 medications, 895 (31.2%) on 6-9 medications, 571 (19.9%) on 10-13 medications, and 345 (12.0%) on 14 or more medications. Overall, the incidence rate of hard-braking events was 1.2 per 1000 miles driven. Compared to non-polypharmacy users, the risk of hard-braking events increased 9% for users of 2-5 medications, 12% for users of 6-9 medications, 19% for users of 10-13 medications, and 34% for users of 14 or more medications. Polypharmacy use by older drivers is associated with a significantly increased incidence of hard-braking events in a dose-response fashion.

Alexander Furuya
Quantifying and Validating Pace of Aging Using Framingham Heart Study
The Pace of Aging measure is a methodology that uses longitudinal biomarker data to quantify a person’s speed of biological decline. In this project, the Pace of Aging methodology was adapted to the data accumulated over four decades of longitudinal follow-up within the Offspring Cohort of the Framingham Study. To develop the measure, Furuya first screened over three dozen biomarkers collected from the cohort and identified a set of 20 that were measured at multiple waves and showed approximately linear age patterning. Second, he undertook data cleaning to ensure standardized biomarker measurement across the nine different measurement waves. Third, Furuya fitted mixed-effect models for each of the biomarkers to identify each participant’s personal slope of change over time. Finally, the slopes were aggregated together across the biomarker panel to form the measure. There were three key findings from this work. First, the Pace of Aging methodology was successfully adapted to an older adult population. Second, the resulting FHPoA proved predictive of individual differences in cohort members’ risks of developing chronic diseases, including heart disease and stroke, and of dying. Third, FHPoA was moderately correlated with the DNA methylation version of the Pace of Aging. Ultimately, this program of work aims to develop novel biomarkers to monitor the effectiveness of therapies and other interventions designed to increase healthy lifespan.

Yujia Wang
Manhattan Vision Screening and Follow-up Study in Vulnerable Populations: Assessment of Falls Risks
The Manhattan Vision Screening and Follow-up study is a 5-year prospective, cluster-randomized controlled trial conducted in affordable housing buildings and senior centers in NYC funded by CDC. The Assessment of Falls Risk in RCT paper aims to assess falls risk among an underserved population aged 40+ in Manhattan using the Stopping Elderly Accidents, Death, and Injury (STEADI) Falls Risk Tool Kit developed by CDC. This community-based eye health screening study detected a significant amount of falls risk in an underserved population. The STEADI Falls Risk screening questions are highly predictive of falls risk and may be adequate for referral to occupational health and/or physical therapy.
Pre-screening questions determined whether participants were at risk of falling. STEADI tests classified participants as at low, moderate, or high risk of falling. Multivariate logistic regression determined odds of falls risk. 708 participants completed the eye health screening; 351 (49.6%) performed STEADI tests. Mean age of STEADI participants: 71.0 years (SD ±11.3); 72.1% female; 53.6% African American, 37.6% Hispanic. Level of falls risk: 32 (9.1%) low, 188 (53.6%) moderate, and 131 (37.3%) high. Individuals who were older than 80 years or who had blurry vision, high blood pressure, arthritis, or foot problems had significantly higher odds of falling and of emergency room visits or hospitalization due to falling.

Jiayi Luo
Data Analyst at Cello Therapeutics
I worked as a data analyst intern at Cello Therapeutics, a biotech company dedicated to finding cures for cancers.
The company purchases platelet cells from legal entities for laboratory work in biology (a field I am not very familiar with) and runs labs daily, analyzing the lab data in search of cancer cures.
The majority of my work was compiling donor data and lab results. I found that SAS would be much more efficient than manually entering data cell by cell in Excel, so I decided to use SAS instead. Beyond the company's requirements, I also performed regression analysis and survival tests on the clinical trial data out of my own curiosity.
I also supported HR-related work: I contacted and negotiated prices with providers, created a large directory for the company, and eventually helped the company find 4 more providers of the apheresis platelet cells needed for lab purposes.

Exploring Data Analysis (HSC 202)

ShaoCong Zhang
Data Analysis of multi-reader multi-case clinical trial
The assessment of medical imaging devices frequently includes clinical studies in which multiple readers (MR) interpret images from multiple cases (MC) for a specific clinical task, commonly referred to as MRMC studies. In addition to determining the sample size of patient cases as required in most clinical trials, MRMC studies also necessitate determining the number of readers, as both readers and cases contribute to the uncertainty of the estimated diagnostic performance, often quantified by the area under the ROC curve (AUC). Since there is limited prior information, determining the appropriate study size can be unreliable. In this study, I aim to analyze an MRMC clinical trial dataset using the traditional method. An adaptive design method, which simultaneously resizes the study and adjusts the critical value for hypothesis testing after an interim analysis to achieve a target power and maintain the type I error rate when comparing AUCs of two modalities, will also be applied during the analysis. Furthermore, I will briefly illustrate the comparison between the traditional and adaptive designs.

Yoo Oh
The Association Between Federally Qualified Health Center Characteristics and Quality of Care Performance
Federally Qualified Health Centers, also referred to as community health centers, provide primary care services in medically underserved areas. By focusing on comprehensive health care delivery to vulnerable populations regardless of their ability to pay, health centers strive to reduce health disparities. Because health centers are non-profit organizations receiving federal grant funding, their clinical quality is measured using compliance rates for a predetermined set of fundamental health care services. However, the highly varying patient demographics that each health center serves, the role of grant revenues, and the availability of health professionals provide different contextual factors for a health center's performance in terms of quality of medical care. The main goal of this study is to investigate the heterogeneity of the diverse Federally Qualified Health Centers and the association between various factors that characterize these health centers and their quality-of-care performance. The annual submissions of health center level data to the Uniform Data System (UDS) from 2017 to 2021 were used to examine the potential latent typologies of health centers and the comparable effects on clinical performance.

Wentong Liu
Telehealth Implementation Strategies and Adoption Analysis during the COVID-19 Pandemic
The Telehealth Project focuses on developing effective telehealth implementation strategies during the COVID-19 pandemic while investigating factors influencing adoption. Using data from the Medicare Current Beneficiary Survey (MCBS), the project explores predictors of telehealth usage and reasons for non-usage, emphasizing demographic, lifestyle, and accessibility factors.
Data cleaning, manipulation, exploratory analysis, and visualization techniques are utilized to ensure data quality and inform research questions. Logistic regression examines the relationship between telehealth usage and key factors. The project also investigates non-usage reasons, assesses individuals' access to in-person healthcare, and identifies high-risk patients for further analysis. Sampling weights are employed to account for potential biases.
Findings inform recommendations for improving telehealth adoption rates, compiled into a comprehensive report featuring data visualizations. The Telehealth Project aims to enhance understanding of telehealth adoption during the pandemic and contribute to targeted strategies for improving usage across populations.

Yu Si
Prostate cancer progression molecular
The androgen receptor (AR) is a paramount player in both the initiation and progression of prostate cancer (PCa). Cancer cells initially respond to treatment with anti-androgen drugs, such as enzalutamide, but often become resistant. For patients whose PCa no longer responds to anti-androgen drugs, understanding the mechanisms of resistance that result in progression, as well as identifying new targetable pathways and therapies, is greatly needed.

Yujin Zhang
Frequency and prognostic impact of prior stroke in patients undergoing percutaneous coronary intervention
This study examines whether prior stroke is a predictor of adverse outcomes after percutaneous coronary intervention (PCI). The analysis evaluates the frequency and prognostic impact of prior stroke on 1-year outcomes in an ethnically diverse population undergoing PCI with drug-eluting stents (DES). Patients with acute or chronic clinical presentation who underwent PCI with DES at a large tertiary-care center in the United States between 2012 and 2019 were considered for inclusion. Patients were stratified according to history of prior stroke. The primary outcome was major adverse clinical events (MACE), a composite of all-cause death, myocardial infarction (MI), and stroke; secondary outcomes included the individual components of the primary outcome, target vessel revascularization, and major bleeding. Events were assessed at 1 year after PCI. Among 17,302 patients, 9,610 (56%) were non-Caucasian. A total of 1,765 patients (10%) had a prior stroke, and this proportion was stable over time. After Cox model adjustment, prior stroke history was associated with a higher risk of recurrent stroke (adjusted HR 2.74, 95% CI 1.45-5.17; p-value = 0.002) and major bleeding (adjusted HR 1.36, 95% CI 1.11-1.67; p-value = 0.003). In conclusion, prior stroke was a predictor of recurrent stroke and major bleeding only. Further studies are needed to understand the mechanism of these associations and evaluate strategies to decrease adverse outcomes in patients with prior stroke.

Causal Inference in Observational Data (HSC 107)

Yunlin Zhou
Assessing the Impact of Amazon Fulfillment Centers on PM2.5 Levels in Surrounding Areas
Air pollution is a serious environmental and public health concern worldwide, and fine particulate matter (PM2.5) is one of the most significant contributors to air pollution. In recent years, there have been growing concerns about the potential contribution to air pollution of e-commerce companies. This study aims to address this issue by investigating the impact of Amazon Fulfillment Centers (FCs) on PM2.5 levels in the surrounding areas.
We collected PM2.5 data from the Atmospheric Composition Analysis Group at Washington University in St. Louis and obtained the addresses of Amazon FCs from open-source data. Using causal inference statistical methods, we estimated the treatment effect of Amazon FCs on PM2.5 levels in the surrounding areas. The treatment group was defined as areas with Amazon FCs, and the control group was defined as areas without Amazon FCs but with the same Rural-Urban Continuum Codes (RUCC) score as the treatment group.
This study can contribute to the development of causal inference methods in environmental health research by providing an example of how to apply causal inference methods to observational data. In summary, this study has significant implications for public health, environmental policy, and the e-commerce industry. It can inform interventions and policies to reduce air pollution and protect public health, and it can provide valuable insights into the environmental impacts of e-commerce.

Yongzi Yu
Relationship between prenatal education class and labor epidural
Previous studies show that using a labor epidural has some benefits for maternal health, such as a decreased risk of severe maternal morbidity (SMM). In recent years, more women have chosen to attend prenatal education classes. Antenatal education can reduce maternal stress, improve self-efficacy, lower the cesarean birth rate, and decrease the use of epidural anesthesia (Jafari, 2020). Our hypothesis is that women who take prenatal education classes are more likely to use labor epidurals. The exposure groups are women who did not attend antenatal classes and women who attended (or will attend) antenatal classes. The outcome variable is the use of a labor epidural.
This observational study included a total of 2423 women in Australia. Propensity score matching and inverse probability of treatment weighting (IPTW) are applied to address imbalances in baseline characteristics between the exposed and non-exposed groups. The confounding variables are income, number of previous pregnancies, and socioeconomic status. Multinomial logistic regression is then used to investigate the relationship between the exposure and outcome variables.
The results imply that there is no significant relationship between prenatal education class and increased use of labor epidurals. Attending prenatal education classes may not influence mothers’ choices of having a labor epidural.
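The IPTW step named in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the propensity scores are taken as already estimated, the function names are hypothetical, and the toy values are made up rather than drawn from the study.

```python
def iptw_weights(exposed, propensity):
    """Return unstabilized inverse probability of treatment weights.

    exposed: 0/1 indicators (1 = attended a prenatal class).
    propensity: estimated P(exposed = 1 | confounders), assumed given.
    """
    weights = []
    for a, e in zip(exposed, propensity):
        if not 0.0 < e < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        # Exposed subjects get 1/e; unexposed subjects get 1/(1 - e).
        weights.append(1.0 / e if a == 1 else 1.0 / (1.0 - e))
    return weights


def weighted_proportion(outcome, weights, exposed, group):
    """Weighted proportion of outcome == 1 within one exposure group."""
    num = sum(w * y for y, w, a in zip(outcome, weights, exposed) if a == group)
    den = sum(w for w, a in zip(weights, exposed) if a == group)
    return num / den


# Made-up toy data, for illustration only.
attended = [1, 1, 0, 0]
propensity = [0.8, 0.5, 0.5, 0.2]
epidural = [1, 0, 1, 0]

weights = iptw_weights(attended, propensity)
risk_attended = weighted_proportion(epidural, weights, attended, 1)
risk_not_attended = weighted_proportion(epidural, weights, attended, 0)
print(weights, risk_attended - risk_not_attended)
```

Subjects who received an exposure that was unlikely given their confounders get larger weights, which is how the weighted groups become more comparable on baseline characteristics.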

Zachary Katz
A Simulation Study for Evaluating Model Performance in the Estimation of Health Effects From Complex Environmental Mixtures
Bayesian Kernel Machine Regression (BKMR) and other machine learning (ML) methods are commonly deployed to identify individual causal effects on health outcomes for elements within high-dimensional environmental mixtures. However, there remains a need for direct head-to-head comparison in such settings with complex structure, including non-linearities and interactions, as well as moderate multicollinearity, between exposures and confounders in the functional form of the outcome. Here, simulation studies are conducted on various ML methods — particularly those commonly used in Environmental Health Sciences (EHS) research and that allow for inference and flexible functional form specification — to better understand comparative performance. Data is generated for 10 exposures (metals) and five confounders. Evaluation occurs initially in two scenarios: one with a linear relationship between exposures and outcome, the other with a more complex relationship between exposures and outcome, but both with complex confounding and similar correlation structure. Model performance explores (relative) bias of estimates vs. true effects, estimate variance, computational run-time, and other measures that will enable scientists to make better decisions about which model(s) to use in their mixture studies. Ultimately, no single method consistently and accurately controls for confounding in these settings, indicating a continued need for model and ensemble optimization in EHS applications.

Fengjia Chen
Mediational Effects Between Parental Encouragement to Diet and Adolescents’ Weight-related Behaviors and Mental Health
Parental encouragement to diet is a crucial factor that affects the mental well-being and dieting behaviors of young people. This report aims to investigate whether adolescents’ perception of parental encouragement acts as a mediator for mental health problems. Additionally, it explores the mediating roles of positive and negative factors in the influence of parental encouragement on adolescent mental health. The sample for this study included 2374 adolescents (46.2% male, mean age = 14.44 ± 3.88 years), all of whom had at least one parent pairing in 2010 and completed the EAT survey. The results indicate that children’s perception plays a mediating role in the relationship between parental encouragement and adolescent mental health. Specifically, parental encouragement can directly reduce the development of binge eating behaviors, but it can also become a risk factor through the chain mediating effects of children’s perception. Moreover, the increased frequency of parental encouragement had adverse effects on adolescent behaviors and mental health (e.g., dieting behaviors, unhealthy weight control behaviors, body satisfaction, depression, and self-esteem) when they recognized the impact of their parents. These results allow us to better understand the association between the frequency of parental encouragement and adolescent mental health problems, which can be further developed for prevention and intervention.

Deep Learning (HSC 201)

Pooja Mukund
Resident Fixation Prediction on Glaucoma TopCon Report using CNN-Based Saliency Prediction Methods
Artificial intelligence has been introduced to the field of medical education as a way to improve reliability and overall patient outcomes. Eye tracking studies in mammography have aimed to predict fixations in order to overcome limitations of human perception and increase diagnostic accuracy. Deep learning-based visual saliency models have achieved remarkable success, mainly due to the availability of well-established deep CNNs. One study found that adding transformers to a traditional CNN-based saliency architecture significantly improved performance on public benchmarks. The goal of our study is to predict resident ophthalmologists' fixations on Optical Coherence Tomography (OCT) glaucoma reports from eye tracking data using CNN-based saliency prediction methods, in order to aid in the education of future ophthalmologists and ophthalmologists in training. Fifteen resident ophthalmologists were recruited to examine 20 randomly selected OCT reports and evaluate the likelihood of glaucoma for each report on a scale of 1-100. Eye movements were collected using an eye tracker. Fixation heat maps were generated from the fixation data and used as input for the TranSalNet saliency prediction model. The TranSalNet model was able to predict fixations within certain regions of the OCT report with reasonable accuracy, but more data is needed to improve model accuracy. Future steps include data augmentation, increased data collection, and model architecture changes to improve the accuracy of the model.

Zheyan Liu
Verint call ingestion and processing
This project is about speaker diarization, which involves assigning segments of Aetna call center audio to different speakers (agent/customer) and transcribing them. Deep learning and statistical methods such as voice activity detection (VAD), spectral clustering, and embeddings are used to accomplish this.

Brandon Rojas
Application, Benchmarking and Optimization of Machine Learning on EEG
Real-time applications of EEG (clinical, virtual reality/entertainment, etc.) require low-latency, high-performance data recording and processing for signal detection. Progress in machine learning and data processing libraries, along with novel algorithms for event classification and the proliferation of open-access EEG data, has lowered the barriers to EEG-BCI development.
We benchmark several classification models across multiple freely available datasets. We also attempt to speed up reading data from an EEG device by modifying various open-source EEG libraries to enable parallel CPU processing where they are currently single-threaded. The impact of EEG preprocessing on model selection is also discussed.

Boqian Li
DNA Methylation State Prediction Using Machine Learning Classification Models
DNA methylation, an essential epigenetic modification, plays a crucial role in various biological processes. Accurately predicting DNA methylation states can provide valuable insights into gene regulation and help advance our understanding of complex diseases. We developed a machine learning-based approach to predict DNA methylation states using classification models and determined the optimal DNA sequence context surrounding CpG sites for improved prediction accuracy.
We used DNA methylation data from the Gene Expression Omnibus (GEO). We then employed various classification models, such as logistic regression, random forests, and support vector machines, to predict the methylation state of CpG sites.
To optimize the prediction accuracy, we performed a systematic evaluation of different DNA sequence contexts surrounding the CpG sites. By comparing the performance of each classification model in various sequence contexts, we identified the optimal DNA sequence length that maximized the prediction accuracy.
Our results demonstrated the effectiveness of machine learning-based classification models in predicting DNA methylation states, with the optimal DNA sequence context significantly improving prediction performance. This study offers a novel computational approach for DNA methylation state prediction, with potential applications in epigenetic research, biomarker discovery, and personalized medicine.

Juyoung Hahm
Comparative Validation of AI and non-AI Methods in MRI Volumetry to Diagnose Parkinsonian Syndromes
Automated segmentation and volumetry of brain magnetic resonance imaging (MRI) scans are essential for the diagnosis of Parkinson’s disease (PD) and Parkinson-plus (P-plus) syndromes. To enhance diagnostic performance, we adopted deep learning (DL) models for brain MRI segmentation and compared their performance with the gold-standard non-DL method. We collected brain MRI scans of healthy controls (n=105) and patients with PD (n=105), MSA (n=132), and PSP (n=69) at Samsung Medical Center from January 2017 to December 2020. Using the gold-standard non-DL model, FreeSurfer (FS), we segmented six brain structures: midbrain, pons, caudate, putamen, pallidum, and third ventricle, and used the results as annotated data for the DL models, a representative CNN-based model and a representative ViT-based model. Dice scores and the AUC for differentiating normal, PD, and P-plus cases were calculated to determine the extent to which the DL approaches can reproduce FS performance as-is while increasing speed. The segmentation times of CNN and ViT for the six brain structures per patient were 51.26 and 1101.82 s, respectively, making them about 300 and 14 times faster than FS (15735 s). Dice scores of both DL models were high, so their AUCs for disease classification were not inferior to that of FS. DL significantly reduces the analysis time without compromising the performance of brain segmentation and differential diagnosis. Our findings may contribute to the adoption of DL brain MRI segmentation in clinical settings and advance brain research.
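For readers unfamiliar with the Dice score used to compare the DL segmentations with FreeSurfer, a minimal sketch follows. It assumes each segmentation is represented as a set of voxel coordinates; the function name and the tiny masks are hypothetical and are not taken from the study.

```python
def dice_score(pred_voxels, ref_voxels):
    """Dice coefficient between two segmentations given as voxel collections.

    Dice = 2 * |A intersect B| / (|A| + |B|), ranging from 0 (no overlap)
    to 1 (identical masks).
    """
    pred, ref = set(pred_voxels), set(ref_voxels)
    if not pred and not ref:
        # Convention assumed here: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * len(pred & ref) / (len(pred) + len(ref))


# Made-up masks: a DL prediction and a FreeSurfer-style reference.
dl_mask = {(0, 0, 0), (0, 0, 1), (0, 1, 1)}
fs_mask = {(0, 0, 1), (0, 1, 1), (1, 1, 1)}
print(dice_score(dl_mask, fs_mask))
```

Here two of the three voxels in each mask overlap, so the score is 2 * 2 / (3 + 3).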

Survival Analysis in Observational Studies (HSC 110)

Baode Gao
Subgroup analysis under the cure rate model
In survival analysis, the cure rate model is widely used when a cure fraction exists. It is common to assume a single common factor for all individuals in the same group. For this reason, subgroup detection has recently become popular in clinical trials. In this project, I propose a cure rate model that assumes there are two subgroups within the same group. The expectations of the latent variables, the approximation of the marginal likelihood, and the estimation of the baseline survival function of the subgroup cure rate model are derived. I also implement the EM algorithm in R. Based on Eastern Cooperative Oncology Group data, the log likelihood ratio test (p-value: 0.0045) and AIC (1083.046) show that a more efficient model is achieved compared with Peng's cure rate model (AIC: 1090.103). The Kaplan-Meier survival curves indicate a large difference between the two subgroups.

Ruiqi Yan
Comparison of long-term outcomes between RA+CBA and RA+PBA for patients with severely calcified lesions
PCI for patients with severely calcified lesions remains challenging, and rotational atherectomy (RA) and cutting balloon angioplasty (CBA) are the two most popular strategies for this issue. The long-term performance of their combination as a strategy is rarely investigated. The study aims to compare the long-term adverse outcomes of RA+CBA versus RA with conventional balloon angioplasty (PBA). The primary endpoint is major adverse cardiovascular events (defined as the composite of all-cause mortality, myocardial infarction, or target vessel revascularization) within one year after the procedure. The 4 secondary endpoints are all-cause mortality, myocardial infarction, target vessel revascularization, and stent thrombosis one year after the procedure. Survival analysis is used, including Kaplan-Meier estimates, the log-rank test, and Cox proportional hazard ratios adjusted for prognostic variables. The analysis showed that RA+CBA had a significantly higher primary event rate and a higher hazard rate than RA+PBA. For the secondary outcomes, the KM rates and hazard rates were comparable. We applied model selection to a regression model to find predictors of RA+CBA, and the final model included some variables not in the adjusted Cox proportional hazards model, implying an imbalance of those prognostic variables in our analysis. This imbalance could introduce bias. Future studies could apply more advanced statistical methods, such as propensity scores, or adjust for those variables to avoid bias.

Xiaoluo (Lorraine) Jiao
Prognostic impact of elevated glycemia in patients undergoing percutaneous coronary intervention
Hyperglycemia at the time of percutaneous coronary intervention (PCI) may be due to poorly controlled diabetes mellitus (DM) or other acute or chronic medical conditions. In this project, we aimed to assess the frequency and prognostic impact of elevated glycemia in patients undergoing PCI.
Patients undergoing PCI at a large-volume tertiary-care center (Mount Sinai Hospital, New York) between 2012 and 2019 were stratified based on their glycemia at the time of PCI into 3 groups: normal (70-126 mg/dL), high (127-199 mg/dL), and very-high (≥200 mg/dL) glycemia. Outcomes included all-cause death, myocardial infarction (MI), stroke, and major bleeding 1 year after PCI.
Among 16,193 included patients, 61% had normal, 26% high, and 13% very-high glycemia. In the three groups, 27%, 76%, and 93% of patients had DM, respectively. All-cause death, MI, or stroke occurred more frequently in the high (5.4%, HR 1.76, 95% CI 1.47 – 2.12) and very high glycemia groups (8.7%, HR 2.89, 95% CI 2.37 - 3.52) as compared to patients with normal glycemia (3.1%, reference group).
Among patients undergoing PCI, high and very high glycemia were frequent, not always associated with known DM, and were related to significantly higher rates of complications.

Benjamin Goebel
Predicting patient risk of chronic obstructive pulmonary disease
Chronic obstructive pulmonary disease (COPD) is a leading cause of death. Understanding patients at high-risk of disease can enable early interventions.
Using simulated time-to-event data derived from an observational study, data was split into training (80%) and test (20%). Cox proportional-hazards, Cox proportional-hazards with lasso penalty and random survival forests were applied to predict disease. Each model utilized inverse probability of censoring weights. The test area under the receiver operating characteristic curve (AUC) was estimated at time 5 years using 5-fold cross-validation on the training data set. The model with the highest estimate was re-fit on the training data set and assessed on the test data set.
546 (13.7%) diagnoses occurred among 4,000 subjects. The Cox proportional-hazards model had the highest estimated test AUC via cross-validation, and its test AUC was 0.852. Variables percent predicted forced expiratory volume and some college education were statistically significant (p < 0.05) and inversely associated with disease. Variables age, current or former smoker, coronary heart disease, hypertension, male sex, and white race were statistically significant and directly associated with disease.
This analysis suggests that interventions to curb smoking and promote education, aerobic exercise and healthy diet could reduce patient risk of COPD.
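The 5-fold cross-validation on the training set described in this abstract can be sketched as an index-splitting routine. This is a hedged illustration only: the function name and seed are hypothetical, the 3,200 figure simply takes 80% of the 4,000 subjects reported, and the model fitting and AUC estimation inside each fold are omitted.

```python
import random


def kfold_indices(n_subjects, n_folds=5, seed=0):
    """Split subject indices into (training, validation) pairs for k-fold CV."""
    indices = list(range(n_subjects))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds nearly equal folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, validation in enumerate(folds):
        # Training indices are all folds except the held-out one.
        training = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((training, validation))
    return splits


# 80% of the 4,000 subjects form the training set that is cross-validated.
splits = kfold_indices(3200, n_folds=5)
for training, validation in splits:
    print(len(training), len(validation))
```

Each subject is held out exactly once, so every candidate model gets five validation estimates before the best one is re-fit on the full training set.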

Topics in Deep Learning and Functional Data (HSC 207)

Xinyi Zhou
Metabolomic Evidence in Gulf War Illness
Gulf War Illness (GWI) is an unexplained illness occurring in veterans of the 1991 Gulf War. People with GWI have unexplained physical fatigue, cognitive and sensory dysfunction, sleep disturbances, orthostatic intolerance, and gastrointestinal problems. We aim to compare the metabolomic profiles of GWI patients and healthy controls using regression. My role is to clean the GWI metabolomics data in R, covering 4243 metabolic analytes comprising primary metabolites, biogenic amines, and complex lipids in the plasma of GWI cases and controls, and to build linear, lognormal, and gamma generalized linear models (GLMs) in R to identify altered metabolomic profiles between GWI patients and healthy controls.

Ke Xu
Estimating spatial expression patterns: to normalize or not to normalize?
Biologists often use log(x+1) (natural logarithm of given value plus one) to deal with sparse counts and excessive zeros in outcomes, such as spatial transcriptomics data. The literature provides a number of alternative normalization approaches for log models, including ordinary least-squares on ln(y) and generalized linear models. This study examines how well the alternative estimators behave biometrically in terms of bias and precision when the data are skewed or have other common data problems (heteroscedasticity, heavy tails, etc.). No single alternative is best under all conditions examined. The paper provides a straightforward algorithm for choosing among the alternative estimators. Even if the estimators considered are consistent, there can be major losses in precision from selecting a less appropriate estimator.
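One reason the choice of estimator matters can be shown with a small sketch: averaging on the log(x+1) scale and transforming back does not recover the arithmetic mean of skewed counts. The function name and the toy counts below are hypothetical and are not data from the study.

```python
import math


def backtransformed_log1p_mean(values):
    """Average on the log(x+1) scale, then invert the transform."""
    mean_log = sum(math.log1p(v) for v in values) / len(values)
    return math.expm1(mean_log)


# Made-up sparse, right-skewed counts with many zeros.
counts = [0, 0, 0, 1, 2, 10, 50]
arithmetic_mean = sum(counts) / len(counts)
print(arithmetic_mean, backtransformed_log1p_mean(counts))
```

Because the logarithm is concave, the back-transformed value falls below the arithmetic mean whenever the counts vary, so models fit to log(x+1) target a different quantity than models of the mean on the original scale.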

Tucker Morgan
Utilizing ecological momentary assessment to record changes in combustible cigarette use
Studies of smoking cessation often rely on global retrospective self-reports. Ecological momentary assessment (EMA) allows for more frequent data collection in real-world settings and in much shorter spans of time than traditional study visits. As part of an open-label, two-arm randomized controlled trial pilot study (n=121) comparing e-cigarettes (EC) and nicotine replacement therapy (NRT) as smoking cessation treatment, participants were prompted every four hours each day via text to provide their smoking activity, smoking satisfaction, and craving. Participants were also queried on cigarettes per day (CPD) in study visits at 0, 3, and 6 months. The comparability of EMA and traditional study visit CPD measurements was assessed using Pearson's correlation coefficient and a paired t-test. A mixed effects model was used to examine the treatment effect of EC vs. NRT on cigarettes per day over time. Measurements of CPD from EMA and study visits were found to be highly correlated and to have a statistically significant paired difference. There was no statistically significant treatment effect of EC vs. NRT on CPD over time. These results demonstrate that EMA is a suitable measurement for CPD over time and possibly other related measures to provide insights into how these vary over time within subjects. Further research may provide more insight into the benefits and risks of EC as smoking cessation treatment.
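The two comparability checks named in this abstract, Pearson's correlation and a paired t-test, can be sketched in plain Python. The function names and the five pairs of cigarettes-per-day values are hypothetical; the p-value lookup against a t distribution with n - 1 degrees of freedom is left out.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def paired_t_statistic(x, y):
    """t statistic for the mean of the paired differences x - y."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))
    return mean_d / (sd_d / math.sqrt(n))


# Made-up CPD values for five participants.
visit_cpd = [20, 15, 10, 12, 8]
ema_cpd = [17, 13, 9, 10, 6]
print(pearson_r(visit_cpd, ema_cpd), paired_t_statistic(visit_cpd, ema_cpd))
```

The toy data show how both findings can hold at once: the two measures track each other closely, yet one is consistently lower than the other, which produces a large paired t statistic.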

Yiru (Barbara) Gong
Analysis of Gender Academia Trends in Latin America with Natural Language Processing Models
Multiple recent attacks on the “gender academy”, perpetrated in contexts of rising illiberalism, have taken a variety of forms, from the de-legitimation of gender programs to their outright closure, and from the marginalization of scholars and researchers to their physical and psychological endangerment. As a crisis mitigation strategy, the project aimed to develop an Early-Warning System to provide information, resources, and support to gender scholars who may face illiberal attacks. As a first step, trends in public opinion on “gender ideology” in new media (Twitter) were analyzed against the timeline of gender-attacking events in Brazil and Colombia. Natural Language Processing (NLP) based sentiment analysis models were applied to identify public attitudes on the related topics. Results showed high correlations between the gender-attacking events and the sentiment trends of the gender-related topics. The future direction thus lies in applying the past trends in an AI model to predict possible early signs of attacks on gender studies.

Topics in Clinical Trials (HSC 210)

Yuan Meng
CBD pharmacokinetic study
This study aimed to investigate the pharmacokinetics (PK) of cannabidiol (CBD) in a group of 24 patients who received multiple oral doses of CBD. Plasma and urine samples were collected at predetermined time points. To facilitate data management and analysis, a data pipeline was developed to combine the predetermined time point data with the lab report data and to automate data cleaning and visualization. The mean plasma concentrations of CBD were plotted across time points to visualize the PK profiles. The mean Tmax was 3.5 hours. No serious adverse events were reported; some patients experienced dizziness after taking a dose of CBD, but CBD was generally well tolerated. These findings provide important information on this type of CBD and inform its possible clinical use.

Tianwei Zhao
GWI effects on immune system within the proteomics perspective
This research concerns Gulf War Illness (GWI), a condition prevalent among veterans of the 1991 Gulf War. It is characterized by unexplained fatigue not relieved by rest, myalgias, impaired memory and cognitive dysfunction, and orthostatic intolerance.
My goal is to analyze the effects GWI has on the immune system from a proteomics perspective.
The data are industrial data on patients’ proteomic changes after a bicycle exercise tolerance test (ETT). The analysis tests the 1,500 immune system markers out of 7,000 proteomic measurements.
Methodology: three time points were set during the ETT, patients’ proteomics data were collected at each, and the changes in proteomic levels after bicycle exercise are analyzed.
A GEE model will be used to estimate the parameters of a generalized linear model with a possible unmeasured correlation between observations from the three time points.
A linear mixed model will be used for conditional predictions about proteomics with Bayesian rules.

Zhuolun Huang
Data-Driven Approach to Clinical Trial Recruitment (DART) Recruitment Reports
The objective of this project was to develop a data-driven approach to clinical trial recruitment using R programming language. The project aimed to identify potential participants for Alzheimer's Disease clinical trials by analyzing data from multiple sources.
We used R programming language to clean and merge data from electronic health records, social media, and online forums to create a comprehensive database of potential trial participants. We then used machine learning algorithms to identify patients who were likely to be eligible for clinical trials based on their medical history and demographics.
The data-driven approach to clinical trial recruitment yielded promising results. The algorithms identified a large pool of potential participants for Alzheimer's Disease clinical trials, which could help accelerate the recruitment process and improve the chances of successful trial outcomes.
The DART Recruitment Reports project demonstrated the potential of data-driven approaches to clinical trial recruitment. By leveraging big data and machine learning, researchers can identify potential trial participants more efficiently and effectively. The project also provided opportunities to develop practical R programming skills, collaborate with others, and learn about Alzheimer's Disease and clinical trials.

Yucong Gao
KPI Benchmarks Data Outlier Identification and Performance Assessment
This project applies statistical methods such as Cook's distance to KPI benchmarks to assess the integrity and accuracy of the underlying data. KPI outliers are excluded, and an ETL process is created to automate this, providing much more accurate benchmarks for business operations.

Tanvir Khan
Cholesterol and Alzheimer’s Disease: A Mendelian Randomization Study
The goal of the research project is to use Mendelian randomization to determine whether LDL (low-density lipoprotein) cholesterol causes an increase in the risk of Alzheimer's Disease and whether HDL (high-density lipoprotein) cholesterol causes a decrease in the risk of Alzheimer's Disease. We will perform a two-sample Mendelian randomization, in which variant-exposure associations are estimated in one dataset and variant-outcome associations are estimated in a second dataset. Two-sample investigations often occur when genetic associations with the exposure are assessed in a cross-sectional sample of healthy individuals, to reflect genetic associations with normal levels of exposure in the population, and genetic associations with a binary disease outcome are estimated in a case-control study. Two-sample Mendelian randomization is a strategy in which evidence on the associations of genetic variants with the risk factor and with the outcome comes from non-overlapping data sources. The limiting factor for the power of a Mendelian randomization analysis using a given set of genetic variants is the precision in estimating the genetic association with the outcome. This association is typically much weaker than the genetic association with the risk factor. Therefore, published data on genetic associations with the outcome can be combined with individual-level data from a cross-sectional study on genetic variants and the risk factor to obtain precise Mendelian randomization estimates.
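A common way to combine the two sets of summary associations is the Wald ratio per variant and a fixed-effect inverse-variance weighted (IVW) average across variants. The abstract does not state which estimator the project uses, so the sketch below is an assumption, with hypothetical function names and made-up summary statistics.

```python
import math


def wald_ratio(beta_exposure, beta_outcome):
    """Causal estimate from one variant: outcome association / exposure association."""
    return beta_outcome / beta_exposure


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate and its standard error (first-order weights)."""
    num = sum(bx * by / se ** 2
              for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    den = sum(bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome))
    return num / den, math.sqrt(1.0 / den)


# Made-up summary statistics for three variants.
bx = [0.10, 0.20, 0.05]   # variant-exposure associations (first dataset)
by = [0.02, 0.04, 0.01]   # variant-outcome associations (second dataset)
se = [0.01, 0.02, 0.01]   # standard errors of the outcome associations

estimate, standard_error = ivw_estimate(bx, by, se)
print([wald_ratio(x, y) for x, y in zip(bx, by)], estimate, standard_error)
```

The standard error depends only on the outcome-side precision and the strength of the exposure associations, which matches the abstract's point that precision of the outcome associations limits power.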

Clinical Trial Study Design (HSC 109B)

Yihan Qiu
Adaptive Designs in Multi-Reader Multi-Case Clinical Trials
Multi-reader multi-case clinical trials, typically called MRMC studies, refer to clinical trials where multiple readers (e.g., radiologists) read images of multiple cases (e.g., patients) for a clinical task. As with other types of clinical trials, MRMC studies require sizing of the patients, but they also require sizing of the readers. The initial total number of readers is denoted by NR, which equals NR1 + NR2, with NR1 representing first NR1 readers after the interim analysis and NR2 representing last NR2 readers. The initial number of cases is denoted by NC, which equals N10 + N11, with N10 representing the non-diseased case size and N11 representing the diseased case size. Two types of adaptive designs are proposed in this paper to resize the study toward target power after an interim analysis. In adaptive method I, only the reader size is adjusted, and the total number of readers can be resized to NR* = NR1 + NR2*. In adaptive method II, however, both reader and case samples are resized. In addition to resizing the total number of readers as in adaptive method I, the number of cases is resized to NC* = N20* + N21*, with N20* representing the new non-diseased case size and N21* representing the new diseased case size. Results showed that our methods can effectively resize the study toward the target power without inflating the type I error rate. Implications are provided as well.

Xiaoying (Nicole) Chen
Application of Statistical K-Means Algorithm for Cognitive Training Study to Improve Cognitive Abilities
A large percentage of older adults have mild cognitive impairment (MCI), and they are at high risk of progressing to Alzheimer’s disease (AD). With information from failed medication trials in MCI and new discoveries of brain plasticity in aging, researchers started to study cognitive training as a potential treatment for MCI to improve cognitive abilities. The study result shows that our cognitive training intervention was not effective. However, we realized that our intervention might be confounded by some features of our subjects that we had not yet identified. Thus, we did this secondary analysis, using the K-means clustering algorithm to find collections of observations that share similar characteristics and to determine whether the cluster is a predictor. The result shows that the cluster produced by the K-means algorithm is a statistically significant predictor, but there is no cluster-by-treatment interaction. To conclude, cognitive training using our intervention for MCI is not effective, and the K-means clustering algorithm found no subject features that confound our study.

Shiwei (Jessica) Chen
Impact of Coronary Artery Disease on patients undergoing TAVR: Subgroup analysis from the BRAVO-3 randomized trial
The study aimed to investigate the impact of coronary artery disease (CAD) on clinical outcomes after transcatheter aortic valve replacement (TAVR) and to determine whether the choice of peri-procedural anticoagulant (bivalirudin or UFH) had any impact in those with CAD undergoing TAVR. CAD is a common risk factor in patients undergoing aortic valve replacement (AVR), and its presence is associated with poor prognosis in patients undergoing surgical AVR. The first aim of this study was to determine the impact of CAD on post-procedural and 30-day clinical outcomes after TAVR in a large randomized clinical trial population. A second aim was to determine if CAD has an impact on the use of bivalirudin versus UFH as a peri-procedural anticoagulant for TAVR.

Renjie (Ryan) Wei
General Study Design for 3-Stage SMARTs
This project extended existing methods for estimating and identifying the optimal adaptive interventions (AIs) embedded in sequential multiple assignment randomized trial (SMART) designs from 2-stage designs to 3-stage designs and beyond. Our proposed method, the gate-keeping method, accounts for multiplicity by making an AI selection only after the null hypothesis of no difference among the AIs is rejected; it has been well developed and implemented in two-stage cases. However, in three-stage and more general multi-stage designs, the structure of SMART designs varies and is complicated, and the number of possible embedded AIs becomes large. This project starts with a balanced 3-stage SMART design structure. Based on this design, we derived the asymptotic normality of the AI values and then established the degrees of freedom of the global chi-square test for equality. To verify that our conclusions and methods are correct and valid, we used simulations, in which we also compared the performance of other existing estimation methods under more general study designs. We conclude with recommendations for general SMART designs based on the problems identified in the simulations.

Xinyuan Liu
Meta-regression of adjunctive treatment trials for negative symptoms in schizophrenia
Negative symptoms of schizophrenia typically persist despite treatment with the best available medications for schizophrenia. Interpretation of adjunctive studies may be affected by placebo responses. Here, we performed a meta-analysis and meta-regression of adjunctive treatment studies for negative symptoms of schizophrenia across all mechanisms with a sufficient number of clinical trials, operationalized as 5 or more studies, including at least one multi-center study.
We conducted a literature search of adjunctive treatment studies for which both single- and multi-center studies were available. We identified 156 trials across 7 mechanisms of action (MOA). Meta-analysis and meta-regression analyses were conducted with sample size as a covariate.
Significant effects were observed for 5-HT3R antagonists; estrogen modulators; anti-inflammatories; NMDAR modulators; anti-depressants; and alpha-7 nicotinic agonists. Across MOA, the magnitude of the placebo response scaled with sample size to a greater extent than treatment response, leading to a significant reduction in trial effect size with sample size (p<.001). Significant results were obtained preferentially with sample sizes in the range of 30 to 150 individuals.
These results highlight the importance of considering the differential sample size effects on the placebo vs. treatment response when designing adjunctive clinical trials in schizophrenia.

Session 4 (3:00pm - 3:50pm)

Genetics and Molecular Epidemiology (HSC 312)

Yuchen Xu
Integration of scRNA-seq and MERFISH data reveals spatial regulation of alternative splicing in the brain
In this practicum, I preprocess the scRNA-seq dataset and transfer the alternative splicing data from the single-cell RNA-seq data to the MERFISH data. The single-cell RNA-seq dataset has information on alternative splicing, and the MERFISH dataset has information on spatial expression patterns. By integrating the two, I can place the alternative splicing information at a spatial level. This can help us understand the importance of specific alternatively spliced genes.

Tianyou Wang
Associations of Target Biomarkers’ Gene and Protein Expressions with Breast Cancer Survival Rate
There are limited epidemiologic studies focused on the gene and protein expressions of adiponectin (ADIPOQ), adipokine receptor 1 (ADIPOR1), and adipokine receptor 2 (ADIPOR2). Associations of protein and gene expressions of these biomarkers with breast cancer survival rates are not well understood.
Methods: Sociodemographic information, clinical characteristics, and breast cancer survival time were collected in the Women’s Circle of Health Study (WCHS) cohort. There are 99 women in the gene expression group and 453 women in the protein expression group. Kaplan-Meier curves were used to explore and locate potential critical cut-off points for both gene and protein expressions. Cox proportional hazards regression models were adopted to estimate associations of biomarkers’ protein and gene expressions with breast cancer survival.
The 3rd and 4th quartiles of ADIPOQ protein expression differed with borderline significance (log-rank p-value = 0.0739 with Bonferroni correction). A true difference may be masked by the small sample size. No other cut-off points were found for other gene and protein expressions. No significant associations of biomarkers’ protein and gene expressions with breast cancer survival rate were found in Cox models.
The 75th percentile, 102.7314, can be a critical cut-off for the protein expression of ADIPOQ. Patients with relatively high protein expression of ADIPOQ are more likely to survive longer.

Tongxin (Joy) Cao
The Bioequivalence Test of Rivaroxaban tablets
The goal of this practicum is to compare the relative bioavailability of the tested preparation (Rivaroxaban tablets) and the reference preparation, as well as to evaluate the bioequivalence of two different formulations of Rivaroxaban tablets. Bioequivalence refers to the similarity in the rate and extent of absorption of two formulations of the same drug. In this study, the pharmacokinetic profiles of the test and reference products will be compared to determine if they are bioequivalent. The study will involve administering the test and reference products to healthy volunteers and monitoring various pharmacokinetic parameters, such as maximum concentration, time to maximum concentration, and area under the curve. The results of this study will aid in the assessment of the test product's suitability as a substitute for the reference product in clinical use. This research has significant implications for patients who require Rivaroxaban therapy, as well as for the pharmaceutical industry in terms of the development of new formulations and the manufacturing of generic products.

Sneha Mehta
Blood Concentrations of Environmental Pollutants and COVID Disease in the City of Barcelona
There is wide, largely unexplained heterogeneity in immunological and clinical responses to SARS-CoV-2 infection. Numerous environmental chemicals, such as persistent organic pollutants (POPs) and chemical elements (including some metals, essential trace elements, rare earth elements, and minority elements), are immunomodulatory and cause a range of adverse clinical events. We conducted a prospective cohort study in 154 individuals from the general population of Barcelona. POPs and elements were measured in blood samples collected in 2016-2017. SARS-CoV-2 infection was detected by rRT-PCR in nasopharyngeal swabs and/or by antibody serology using eighteen isotype-antigen combinations measured in blood samples collected in 2020-2021. We analyzed the associations between concentrations of the contaminants and SARS-CoV-2 infection and development of COVID-19, taking into account personal habits and living conditions during the pandemic. We identified mixtures of up to five substances from several chemical groups, with all substances independently associated with the outcomes. Our results provide the first prospective and population-based evidence of an association between individual concentrations of some contaminants and COVID-19 and SARS-CoV-2 infection. POPs and elements may help explain the heterogeneity in the development of SARS-CoV-2 infection and COVID-19 in the general population.

Global Pediatric Health (HSC 203)

Fouad Habib
Treatment Effect of Insulin Pump Therapy Compared to Multiple Daily Injections on Glycemic Control in Pediatric Diabetes Patients in Kuwait
This was a retrospective study of data from DDI’s pediatric clinic visits between 2018 and 2021 in Kuwait. The aim of the study was to understand the effect of treatment modality on Glycemic Control (GC) in pediatric patients with diabetes. Method: We had complete data for 160 patients. The data were analyzed in SAS using PROC SQL and PROC LOGISTIC. The average HbA1c across all visits was computed for each patient. Regression analysis was conducted by modeling GC (defined as HbA1c < 7.5%) against the following variables: gender; age category; BMI; diabetes duration; history of comorbidity (HOCM); history of complications; and treatment modality (TM), categorized into continuous subcutaneous insulin infusion (CSII) and switching from multiple daily injections (MDI) to CSII, with MDI as the reference in the model. Results: GC was 18.1%. The analysis concluded that TM (overall P-value: 0.0011, CSII: 0.0002, switch: 0.0094) and HOCM (P-value: 0.0113) were statistically significant in the model. Based on this model, we were able to conclude that patients who used CSII as their TM and patients who switched from MDI to CSII had 9.4 and 7.3 times the odds of achieving GC, respectively, compared to patients who only used MDI. Patients with a HOCM had 0.24 times the odds of achieving GC compared to patients who did not have a HOCM. Conclusion: Based on the results of the model, we can conclude that use of CSII is positively associated with GC and having a HOCM is negatively associated with GC.

Yujie (Jessie) Wang
Data management of 1200 adolescents and their primary caregivers living in KwaZulu-Natal
The Asenze project is a longitudinal study based in KwaZulu-Natal, South Africa, a region heavily impacted by HIV/AIDS and socio-economic inequality. This study focuses on children and their primary caregivers, assessing household and caregiver functioning, child health, social well-being, and neuro-development from childhood through adolescence, as well as the potential impact of COVID-19. The aim of the study is to deepen the understanding of childhood physical, cognitive, and social abilities, including risk-taking behaviors, and to identify the biological, environmental, and social determinants of health. The anticipated findings will contribute to the development of community-informed interventions to promote well-being in this South African population and beyond. Overall, the Asenze project offers important insights into the lives of children and their families in a high-risk setting and has the potential to inform policies and programs aimed at improving the health and well-being of children in similar contexts.

Matthew Untalan
Assessing the Association of TSH Levels and NAFLD Among Youth
Non-alcoholic fatty liver disease (NAFLD) is characterized by lipid accumulation in the liver and is one of the most common forms of chronic liver disease in children and adolescents. Recent studies have indicated an association between subclinical hypothyroidism and NAFLD. An underactive thyroid gland produces insufficient levels of T3 and T4, which reduces hepatic fat metabolism and leads to NAFLD. Increased TSH production functions as a feedback mechanism to amplify T3 and T4 production. However, TSH may also have a role independent of T3 and T4 in NAFLD disease severity. This study involved a retrospective analysis of NASH-CRN data from the TONIC and CyNCh trials. The association between TSH and NAFLD at baseline was analyzed using multinomial logistic regression adjusting for potential confounders. A longitudinal analysis was conducted utilizing a mixed-effects model with additional potential confounders related to experimental treatment during the clinical trials. At baseline, TSH was not significantly associated with NAFLD outcome measures when adjusting for sex, age, race, BMI-z, HOMA-IR, Tanner stage, and non-HDL levels. Longitudinal analyses revealed that steatosis grade as it relates to NAFLD is sensitive to changes in TSH in children. The presence of subclinical hypothyroidism is associated with increased fibrosis stage. Furthermore, the results demonstrate a strong relationship between T3 and T4 levels and measures of NAFLD.

Zirui (Troy) Zhou
Epitope generation based on samples of acute encephalitis syndrome patients in Gorakhpur region, India
Acute encephalitis syndrome (AES) is a persistent public health issue in India, and the Gorakhpur district has been observed to be among the most impacted regions. Children under the age of 15 with weaker immune systems suffer from higher vulnerability to this syndrome. Some but not all of the causal agents for AES have been identified, which weakens any present prevention policies.
This project aims to gain a better understanding of potential causal agents for AES and its etiologic pathways. Once the agents are identified, potential epitopes for accurate AES screening and early diagnosis are generated.
Serum and cerebrospinal fluid samples were collected from 2015 to 2017. Genome sequencing and serochips were used for viral presence screening. Signals from serochips were used to determine peptide importance as well as epitope generation. Multidimensional scaling analyses were conducted on the signal files for sample group pairwise comparison.
A total of 20529 epitopes for IgG samples and 5843 epitopes for IgM samples were generated based on signal significance. MDS plots were output for group differentiation visualization.
Clear distinctions between control and recovered/fatal groups were observed but the clinical translation of the epitopes is still needed as future steps.

Topics in Survival Analysis (HSC 202)

Qing Zhou
Transcription Factor-targeting Decoy Peptides: A Novel Strategy for Cancer Treatment
Transcription factors CEBPB and CEBPD were identified as targets for the treatment of brain, skin, and other cancers. Here we report the development of transcription-factor-targeting dominant-negative peptides as a new strategy for cancer treatment. By using a series of statistical analysis methods, we investigated their efficacy and mechanism in a wide variety of solid tumors.
Data were compared using two-tailed Student’s t-tests or ANOVA followed by Dunnett’s method for the post-hoc analysis. In mouse xenograft experiments, the Mann-Whitney U test was used for tumor size comparison. To compare two survival curves, the Mantel-Cox log-rank test was employed. To compare read counts of individual genes in RNA-seq datasets of two groups, the Wald test was used with a Benjamini-Hochberg correction at a false discovery rate (Q value) of 5% to obtain adjusted P values.
Peptides selectively suppressed the proliferation of tumor cells from various origins. In vivo, peptides were active in mouse tumor xenograft models, in which they inhibited tumor growth and significantly prolonged animal survival. Moreover, multiple drug combinations were identified that work synergistically with these peptides in treating cancers or reverse resistance to them. Mechanistically, RNA-seq results suggested that these peptides exerted anti-cancer efficacy through the dysregulation of cell pro- and anti-apoptotic proteins and interference with glycolysis in cancer cells.
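The Benjamini-Hochberg adjustment mentioned in the methods above can be sketched as follows. This is a generic illustration of the procedure with a hypothetical function name, not the authors' RNA-seq pipeline, which would typically rely on an established differential expression package.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by n / i.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted
```

Genes whose adjusted value falls at or below 0.05 would be called significant at a 5% false discovery rate.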

Yiqun Jin
Urinary cadmium concentration and the risk of diabetes
Several studies have linked exposure to cadmium with an elevated risk of developing diabetes. However, there is ongoing discussion regarding the degree and pattern of this association. The objective is to investigate the potential association between levels of cadmium in urine and the incidence of diabetes, as well as to explore any potential factors that may modify this effect.
A cohort study was designed and nested in the Reasons for Geographic and Racial Differences in Stroke (REGARDS) study, including 2666 participants from a randomly selected sub-cohort and 111 adjudicated incident cases of diabetes. Urinary creatinine-corrected cadmium concentration was measured at baseline. We built models including demographic, lifestyle, and medical features of participants and performed multivariate logistic regression to validate the models. Using the Barlow weighting method for the Cox proportional hazards regression model, hazard ratios (HRs) with corresponding 95% confidence intervals (CIs) were estimated while adjusting for multiple variables.
The median urinary cadmium concentration was 0.38 (Interquartile range 0.25-0.61) μg/g creatinine. Following the consideration of potential confounding factors, urinary cadmium was associated with increased incidences of diabetes.
The results of this study indicate that cadmium exposure could represent an independent risk factor for diabetes mellitus among the general population in the United States.

Haotian Wu
Alternative Methods for Comparing Survival Curves under Non-Proportional Hazard
For time-to-event data, the log-rank test is the most powerful and commonly used method to compare survival curves under the proportional hazards assumption. However, non-proportional hazards occur often in real-world analyses, and the log-rank test then yields low power. This paper aims to find an alternative method for comparing survival curves under non-proportional hazards in simulated data under certain conditions. More specifically, we analyze one particular form of non-proportional hazards, delayed treatment effects, using the Cox piecewise model. The log-rank test, the Fleming and Harrington test, MaxCombo, and restricted mean survival time (RMST) are examined on our simulated dataset. This study has the potential to test differences between survival curves informed by blinded data, where non-proportional hazards often threaten the validity of commonly used methods.
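Of the methods listed above, RMST is the simplest to illustrate: it is the area under the Kaplan-Meier curve up to a truncation time tau. The sketch below is a generic implementation under that definition; the function names and the way tau is passed are assumptions, and it does not reproduce the paper's simulation design.

```python
import numpy as np


def km_curve(times, events):
    """Kaplan-Meier estimate: distinct event times and survival after each."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)  # 1 = event, 0 = censored
    event_times = np.unique(times[events == 1])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)
        deaths = np.sum((times == t) & (events == 1))
        s *= 1.0 - deaths / at_risk
        survival.append(s)
    return event_times, np.array(survival)


def rmst(times, events, tau):
    """Restricted mean survival time: area under the KM step curve on [0, tau]."""
    t, s = km_curve(times, events)
    keep = t < tau
    grid = np.concatenate([[0.0], t[keep], [tau]])
    heights = np.concatenate([[1.0], s[keep]])  # curve height on each interval
    return float(np.sum(np.diff(grid) * heights))
```

Comparing two arms then amounts to the difference rmst(arm A) - rmst(arm B) at a common tau, which remains interpretable when the hazards cross or the effect is delayed.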

Chaoqi Wu
Derivation of a COPD Risk Score
Chronic Obstructive Pulmonary Disease (COPD) is a common and progressive disease characterized by airflow obstruction, which worsens over time. COPD affects more than 300 million people worldwide and is the third leading cause of death globally. Identifying the risk factors associated with COPD can aid in prevention and treatment, particularly in reducing mortality risk. In this study, we propose several models to predict the probability of COPD mortality at year 10 using NHLBI Pooled data. Our models include the Cox model, the Cox model with lasso penalty, and the random survival forest. We evaluated the models using the cross-validated Brier score and concordance index.
Our results indicate that the Cox model performs best, with a cross-validated Brier score of 0.05809 and a concordance index of 0.86487. In the Cox model, smoking (HR = 4.3638, p-value<0.01), Coronary Heart Disease (HR = 1.4536, p-value<0.01), gender (HR = 1.3327, p-value<0.01), hypertension (HR = 1.2957, p-value<0.01), and diabetes (HR = 1.2942, p-value<0.01) are the five risk factors with the highest hazard ratios.
Our findings suggest that smoking, Coronary Heart Disease, Hypertension, and Diabetes are potential risk factors for COPD mortality. Our models can help in early identification and management of COPD patients at high risk of mortality. By predicting COPD mortality at year 10, our models can facilitate preventive interventions, including lifestyle modifications and medication adjustments.

Causal Inference (HSC 107)

Mingkuan Xu
A Mendelian Randomization Analysis of the Causal Effects of COVID-19 on the Risk of Alzheimer’s Disease
According to recent neuroradiological studies, individuals who have recovered from COVID-19 displayed changes in the functional integrity of their brains, particularly in the hippocampus. The shrinking of the hippocampus is linked to a decline in cognitive function and is a typical feature seen in patients with Alzheimer's disease. In this study, we intend to determine if there are direct causal relations between COVID-19 infection and the risk of Alzheimer's disease using a novel Penalized Inverse-Variance Weighted Estimator.

Junzhe Shao
Generalized Synthetic Control Method with State-Space Model
The synthetic control method (SCM) is a widely used approach to assess the treatment effect of a point-wise intervention for cross-sectional time-series data. The goal of SCM is to approximate the counterfactual outcomes of the treated unit as a combination of the control units' observed outcomes. Many studies propose a linear factor model as a parametric justification for the SCM that assumes the synthetic control weights are invariant across time. However, such an assumption does not always hold in practice. We propose a generalized SCM with time-varying weights based on a state-space model (GSC-SSM), allowing for a more flexible and accurate construction of counterfactual series. GSC-SSM recovers the classic SCM when the hidden weights are specified as constant. It applies Bayesian shrinkage for a two-way sparsity of the estimated weights across both the donor pool and time. On the basis of our method, we shed light on the role of auxiliary covariates, on nonlinear and non-Gaussian state-space models, and on the prediction interval based on time-series forecasting. We apply GSC-SSM to investigate the impact of German reunification and of a mandatory certificate on COVID-19 vaccine compliance.

Youyuan Kong
A Comparison of Two Sample and Bi-Directional Mendelian Randomisation Methods in Exploring the Causal Effect of Risk Factors on Complex Human Traits
Mendelian Randomisation (MR) is a popular approach used to estimate the causal effect of risk factors on complex human traits. However, the basic assumptions of MR can limit its applicability. Two limitations of MR are the under-exploitation of genome-wide markers and sensitivity to the presence of a heritable confounder of the exposure-outcome relationship. In this study, we compare two classical MR methods: Two Sample MR and Bi-Directional MR, to explore the causal relationship between high blood pressure, alcohol intake, smoking etc. We first apply the Two Sample MR method and then compare the results to the new Bi-Directional MR method. The Bi-Directional MR method is an extension of the traditional MR method, which allows for better use of genome-wide markers and addresses the sensitivity to heritable confounders. Our study findings provide a comparison of these two MR methods and their suitability for exploring causal relationships between risk factors and complex human traits.

Jimmy Kelliher
Inverting Hypothesis Tests to Generate Confidence Intervals for Indirect Effects in Causal Mediation Analysis
In mediation analysis with multiple continuous mediators and continuous outcomes, natural indirect effects are characterized by an inner product of two vectors of parameters. Many practitioners conduct hypothesis testing of indirect effects via the delta method or the bootstrap. However, the delta method is too conservative near the null and the bootstrap is computationally intensive, provided that consistency can be established at all. In this paper, we propose a novel procedure for generating confidence intervals around indirect effects via hypothesis test inversion. We show that the procedure yields tests that are uniformly more powerful than the delta method, improving test power by over 12 percentage points in some regions of the parameter space. We also show that the procedure is computationally efficient, regardless of the dimensionality of the mediator set. That is, unlike the bootstrap, our method is not affected by the curse of dimensionality.

Genetics Research (HSC 201)

Yida Wang
Apply Hidden Markov Random Field Model for Detecting Domain Organizations on Melanoma Brain Metastases (MBMs) Spatial Transcriptomic Data
Different cell types in complex tissues have different gene expression patterns. We believe cellular heterogeneity is driven by both cell type and environmental factors. The hidden Markov random field (HMRF) model is a statistical tool used to identify underlying patterns or domains in spatial data, such as the expression levels of genes in different cells within a tissue. It takes into account both the neighboring cells and the spatial distribution of cells to detect spatial domains or clusters of cells with similar expression patterns. To dissect how a cell’s identity is influenced by gene-regulatory networks and by spatial environmental factors, a hidden Markov random field model was developed by Qian Zhu. By clustering cells into spatial domains, this method provides a more accurate map of cell types and combines the strengths of sequencing and imaging-based single-cell transcriptomic profiling strategies. This project applied visualization, dimension reduction approaches, and the HMRF-based model to melanoma brain metastases (MBMs) spatial transcriptomic data to find potential statistical relationships and detect spatial domains. The Giotto toolbox for spatial data analysis and visualization in R is utilized in this project.

Tianchuan Gao
Method for identifying transcriptomics metaprograms in multi-sample single-cell RNA-seq data using Non-negative Matrix Factorization
Single-cell RNA sequencing (scRNA-seq) has become a powerful method to characterize cellular states in healthy and diseased tissues. Transcriptional programs were determined by applying Non-negative Matrix Factorization (NMF) to the centered expression data. The aim of this practicum project is to build an NMF-based method for identifying transcriptomics metaprograms in multi-sample single-cell RNA-seq data.
In NMF, a matrix V is factorized into two matrices W and H, with all three matrices having no negative values. Non-negativity makes the resulting matrices easier to interpret. The most distinctive difference between NMF and other dimensionality reduction methods such as PCA is that NMF learns a parts-based representation, while the other methods are holistic.
The goal of this project is to build a tool to perform a whole NMF analysis pipeline:
• Apply NMF to the data.
• Obtain the list of top genes which form the transcriptomics metaprogram.
• Identify the GO terms and name each metaprogram.
The first step involves rank selection for the NMF analysis. I wrote a function to automatically select the optimal or sub-optimal rank using the elbow method.
Since NMF avoids the exact normalized values of undetected genes, it is beneficial in the analysis of single-cell RNA-seq data.
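The factorization of V into non-negative W and H described above can be sketched with the classic multiplicative update rules for squared reconstruction error. This is a minimal illustration only: the function name, the random initialization, and the fixed iteration count are assumptions, and the project's actual tool (including its elbow-based rank selection) is not reproduced here.

```python
import numpy as np


def nmf(V, rank, n_iter=200, seed=0, eps=1e-10):
    """Approximate a non-negative matrix V (n x m) as W (n x rank) @ H (rank x m)."""
    V = np.asarray(V, dtype=float)
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry of W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def reconstruction_error(V, W, H):
    """Frobenius norm of the residual V - W @ H."""
    return float(np.linalg.norm(np.asarray(V, dtype=float) - W @ H))
```

Running nmf over a range of ranks and plotting reconstruction_error against rank gives the curve on which an elbow would be sought; the top-weighted rows of W (or columns of H, depending on orientation) then suggest the genes in each program.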

Wenhan Bao
A CCA-based pipeline for ultra-high dimensional DNA methylation data
DNA methylation plays an important role in controlling gene expression, which may eventually have an impact on individuals. Generally, DNA methylation is stably inherited by offspring, but potentially strong correlations between specific CpG sites from pregnancy and the placenta have rarely been explored.
Canonical correlation analysis (CCA) is widely used to measure the association between two sets of variables. Several modified CCA approaches have been proposed. The penalized matrix decomposition CCA (PMDCCA) proposed by Witten introduces a regularization term and produces sparse coefficients. However, the performance of PMDCCA degrades as the dimension of the data increases.
We try a new pipeline that combines nonnegative matrix factorization (NMF) and PMDCCA to study the relationship between two ultra-high dimensional datasets. By simulation, we find the breaking point at which PMDCCA becomes inefficient in identifying the signal. The new pipeline first screens the data (i.e., reduces the dimension) without losing any informative signal, and then the downstream PMDCCA is applied. This pipeline outperforms direct PMDCCA in both accuracy and efficiency.
This pipeline is still being tested under different conditional settings and different types of high dimensional data (e.g., brain imaging study, genomics).

Lesi He
AlphaCluster: A method that quantifies 3D clustering of missense variants to improve statistical power in risk gene discovery
A missense mutation is a common type of mutation that can cause proteins to lose function, thereby increasing the genetic risk for a range of common and rare diseases. Distinguishing pathogenic missense variants is essential yet challenging, as missense variants have a large effect size and complex functional impact. Here, we describe AlphaCluster, a method that measures the statistical significance of clustering of de novo variants by quantifying the 3D physical distances or functional correlation among missense variants. Compared with other genomic clustering tools, this tool increases the statistical power of identifying new risk genes. AlphaCluster successfully identified new risk genes for autism spectrum disorder and neurodevelopmental disorder (NDD). We are also testing whether implementing coevolutionary strength may further increase the statistical power of AlphaCluster.

Longitudinal Data Analysis (HSC 110)

Hao Xu
Efficacy and safety of tacrolimus-based treatment on non-rapidly progressive IgA nephropathy: a retrospective study
This study aimed to evaluate the efficacy and safety of tacrolimus-based treatment on immunoglobulin A nephropathy (IgAN).
A total of 127 IgAN patients were retrospectively reviewed. The included patients were divided into tacrolimus (TAC) and control (non-TAC) groups according to their treatment strategy. Proteinuria remission, remission rate, and adverse events were compared between the two groups.
Among the 127 patients, 61 received TAC-based treatment and 66 received non-TAC treatment. The TAC group showed a more rapid decline in proteinuria than the non-TAC group at 9 months (P = 0.001) and 12 months (P = 0.018). The remission rate at 1, 3, 6, 9 and 12 months was 40.98%, 68.85%, 80.33%, 90.16% and 88.52%, respectively, in the TAC group. The rate was higher than in the control group at 3, 9 and 12 months (P = 0.030, 0.008 and 0.026). The complete remission rate at 1, 3, 6, 9 and 12 months was 6.56%, 19.67%, 37.70%, 54.10% and 62.30%, respectively, in the TAC group. Compared with the control group, it was higher at 9 and 12 months (P = 0.013 and 0.008). The estimated mean time to complete remission was significantly shorter in the TAC group than in the control group (P = 0.028). TAC did not increase the incidence of adverse events.
For patients with IgAN (24UTP ≥ 1 g), tacrolimus-based treatment may be a better choice. Further prospective, randomized, controlled trials are warranted to obtain robust evidence.
Keywords: IgA nephropathy; tacrolimus; proteinuria

Yunxi Zhang
Cluster Analysis on Longitudinal Data of Patients with Mild Cognitive Impairment
A longitudinal 78-week study was designed to explore the efficacy of computerized games versus crosswords training in mild cognitive impairment (MCI). The results showed that crosswords demonstrated superior efficacy to games on the change in ADAS_Cog total score and FAQ total score, adjusted for baseline. Our study aims to use cluster analysis to further explore the efficacy of cognitive training in different clusters of patients and to study the association between cluster, treatment, and baseline characteristics.
K-means cluster analysis was performed using variables from baseline and follow-up visits on 107 patients. Patients were clustered by baseline-adjusted change in ADAS_Cog score and FAQ score. Trends in the change in outcomes were plotted. Multiple logistic regression was applied to study the association between cluster, treatment, and baseline characteristics.
Among 107 patients (n=51 [games]; n=56 [crosswords]), a two-cluster solution was applied. There was no association between cluster and treatment. Games showed superior efficacy to crosswords training on FAQ total score in patients younger than 70 and in patients with early MCI. Crosswords showed superior efficacy on FAQ total score in patients with negative APOE.
Web-based cognitive games demonstrated superior efficacy to crosswords for the secondary outcome of baseline-adjusted change in FAQ score over 78 weeks.
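
As an illustration of the clustering step described in this abstract, the following is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. It is not the authors' code: the function name `kmeans`, the seed, and the six (change in ADAS_Cog, change in FAQ) pairs are hypothetical values chosen only to show the mechanics.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain Lloyd's algorithm for k-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the observations
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids

# Hypothetical baseline-adjusted changes: (ADAS_Cog change, FAQ change).
changes = [(-5.0, -3.0), (-4.5, -2.5), (-5.5, -3.5),
           (2.0, 1.0), (2.5, 1.5), (1.5, 0.5)]
labels, centroids = kmeans(changes, k=2)
print(labels, centroids)
```

In practice the cluster labels would then enter the logistic regression as the outcome, with treatment and baseline characteristics as predictors.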

Jinghan Liu
Longitudinal Analysis of Sedentary Activity in Patients with Mitochondrial Disease using Wearable Accelerometer
Mitochondrial disease is a serious condition affecting over 5,000 children in the U.S. annually. Gene therapy is a promising treatment for this disease, and wearable accelerometers can provide continuous measurements of physical activity before and after treatment. This study aimed to assess the effectiveness of gene therapy in treating mitochondrial disease by comparing accelerometer data before and after treatment in 13 patients. Sedentary activity was measured at baseline and following the visit using the ActivPal4 and analyzed using PALanalysis software. The study used a linear mixed-effects model to describe the relationship between the response variable and other variables. The model found that ambulatory_status was statistically significant, while the other predictor variables were not statistically significant. A log transformation of the response variable was used to improve normality. The results indicated that ambulatory status is a crucial factor affecting the log-response, with ambulatory individuals having a much higher response than non-ambulatory individuals. Age, leg used, and visit were also significant predictors of the log-response, albeit with smaller effects. However, the interpretation of the coefficients depends on the scale and distribution of the response variable, as well as the choice of reference categories for the categorical predictors. Additionally, the correlations among the fixed effects should be taken into account when interpreting the results.

Pengchen Wang
The predictors of weight change during menopausal transitions
Some research indicates the risk of cardiovascular disease is correlated with the menopausal status of women. Women during the menopausal transition are at a higher risk of cardiovascular disease. The aim of the study is to evaluate the predictors of weight change during menopausal transitions among women.
We conducted a prospective observational 1-year study to collect 8 cardiovascular-related components from 300 women. The 300 women include 100 in the pre-menopausal group, 100 in the menopausal transition, and 100 in post-menopausal status. We will fit linear and logistic regression models to identify variables that may affect the outcome.
[The analysis is still in progress and no results are available yet. In addition, for confidentiality reasons, I still need to confirm with my manager whether the results can be disclosed.]
Further studies will be needed to investigate in depth the correlations between the variables and the outcome. Identifying an effective measure for women during the menopausal transition is a promising way to reduce cardiovascular risk.

Data Visualization and Environmental Health (HSC 207)

Jyoti Lalitha Kumar
A Quantitative Analysis of Discontinuous Datapoints in a Radiation Cataract Study Done in Mice
I am working on an ongoing project involving radiation and its effects on cataracts in mice. I am currently working on how best to handle the discontinuous radiation cataract datapoints and how best to approach the quantitative analysis of these data. We decided that the best way forward would be to clean and wrangle the data and then compute odds ratios for the data points and test their significance. It was also decided that, in order to visualize the data, it would be best to use spaghetti plots and heat maps, which we are currently working on.

Wanxin Qi
Association and Prediction: Climate Variability and Highly Pathogenic Avian Influenza in Domestic Poultry
Highly pathogenic avian influenza (HPAI) is a category of avian influenza, a contagious disease caused by type A influenza virus of the family Orthomyxoviridae. It is spread by migratory wild birds, and infected birds can shed the virus without showing clinical signs, posing a significant threat to animal and human health. Other severe consequences of poultry infection include economic losses, animal losses leading to price increases of goods, and trade restrictions. Climate variability has the potential to alter the ecological risk of HPAI, such as temperature and drought affecting wild bird migration. Therefore, this study aims to develop a multimodal inference system to identify the environmental factors that influence the likelihood of HPAI outbreaks in domestic poultry and generate environmentally informed real-time forecasts. A literature review of over 20 papers will assess the association from three aspects: wild bird migration, farm conditions, and transmission between wild birds and farm birds. Combining exploratory data analysis of HPAI confirmed in commercial and backyard poultry flocks and wild birds, early research aims to provide potential directions for subsequent research and modeling.

Eric Wang
Built Environment and COVID-19 Transmission in New York City
Understanding the association between the urban built environment and the spread of COVID-19 in New York City can support informed policy decisions on disease control. The geographical distribution of points-of-interest (POIs) may shape the transmission of SARS-CoV-2 in local communities and create health disparities in disadvantaged populations.
Using high-resolution POI information and foot-traffic data collected via mobile devices, we analyze the effect of the placement of six POI categories (namely, Grocery and Pharmacy, Other Retails, Art and Entertainment, Restaurant and Bar, Education, and Healthcare) on COVID-19 outcomes in 42 New York City neighborhoods in March 2020, controlling for demographic and socioeconomic variables. We visualize the geographical distribution of POIs using QGIS and perform a Poisson regression in R. We further analyze the associations between the number of visitors to each POI category and COVID-19 cases and deaths during the early phase of the COVID-19 pandemic in New York City.
We find considerable variations in the geographical distribution of POIs across New York City neighborhoods. Effects of the distribution of the POIs and their visitation patterns on COVID-19 outcomes are evaluated.

Jiaqi Chen
Demographic Analysis of Arsenic Maximum Contaminant Level (MCL) in US Drinking Water System
This study aims to analyze how population demographics are associated with arsenic concentrations in public drinking water systems. Long-term consumption of drinking water with arsenic triggers cancers and is also associated with skin lesions, cardiovascular disease, and diabetes. This study is conducted nationwide and focuses on western states where water arsenic is the highest. The dataset for this project contains a population-weighted average of arsenic concentrations in public drinking water systems for 2,900 ZCTAs, and 11 additional sociodemographic variables derived from the US Census collected in 2010. This project uses R to generate raincloud plots and geographic maps to compare ZCTA public drinking water arsenic concentrations across states. A multiple linear regression model and random forest model are applied during the data analysis process, in order to estimate how the outcome changes as population demographics change. The ultimate goal of this project is to find out what population demographics are associated with public drinking water arsenic concentrations.

Mental Health Research (HSC 210)

Congyang Xie
Suicidal Ideation Predictive Analysis of Ecological Momentary Assessment Data
The primary objective of this practicum project is to analyze data from 83 depressed participants who answered Ecological Momentary Assessment (EMA) prompts six times a day, with the aim of identifying important predictors of suicidal ideation using machine learning (ML) models. The EMA data captured information on suicidal ideation, affect, stress, and coping strategies. This study employed Random Forest (RF) and Lasso regression as the ML models for analysis.
The performance of these models in predicting suicidal ideation was assessed using the Root Mean Square Error (RMSE) as the evaluation metric. The results from this study will provide valuable insights into the most influential factors contributing to suicidal ideation in individuals with depression, which can inform the development of targeted interventions and prevention strategies. Additionally, the findings will help determine the efficacy of ML models in analyzing and predicting mental health-related outcomes using EMA data, potentially leading to the improvement of predictive tools in the mental health care field.
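
For readers unfamiliar with the evaluation metric named above, here is a minimal sketch of how RMSE is computed. The function name `rmse` and the ratings below are hypothetical and are not taken from the study data.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between two equal-length numeric sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical observed ideation ratings and model predictions.
observed = [0.0, 2.0, 1.0, 3.0]
predicted = [0.5, 1.5, 1.0, 2.0]
print(rmse(observed, predicted))
```

A lower RMSE on held-out prompts indicates predictions closer to the reported ratings, which is how the Random Forest and Lasso models could be compared on a common scale.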

Yiwen Zhao
Prognosticating Post-Traumatic Stress Disorder Symptoms Using Digital Biomarkers
Post-traumatic stress disorder (PTSD) is a serious mental health condition that can develop following exposure to a traumatic event. Early identification and intervention are crucial for managing PTSD symptoms and preventing long-term complications. The purpose of this study is to explore the efficacy of digital biomarkers for prognosticating PTSD symptoms for emergency department (ED) clinicians and to assess whether digital biomarkers can provide equal or better prognostic value for PTSD screening compared with well-established psychometric assessments. The main research question is how well digital biomarkers can prognosticate PTSD in the ED and how well they align with traditional psychometric assessments. The primary aim of the study is to test whether digital biomarkers predict PTSD symptoms in ED patients. The secondary aim is to evaluate whether digital biomarkers provide equal or better prognostic value for PTSD detection by ED clinicians compared with patients' PCL-5 scores.

Qihang Wu
Dyadic Cluster Analysis for Comorbidity in Psychiatric Disorders in Children: ABCD Study
Comorbid psychiatric disorders are the co-occurrence of two or more disorders in the same individual. Studies in adults showed that patients with psychiatric comorbidities usually have higher suicide rates and poorer prognoses, requiring greater demand for professional help and instructions. While it is also widely recognized that children and adolescents will frequently develop more than one psychiatric disorder, the investigations of pediatric comorbidity remain underdeveloped. Moreover, the structural and functional connections between those comorbidities are still unclear and thus need further exploration.
The Adolescent Brain Cognitive Development (ABCD) Study is a longitudinal study with a diverse sample of nearly 12,000 youth aged 9-11 recruited from twenty-one research sites across the United States. Based on this public database, this project developed a complete data-cleaning pipeline, performed latent class analysis (LCA) and network analysis (NA) to identify and visualize the groups of comorbidity patterns, and finally connected those patterns with magnetic resonance imaging (MRI) data.
The results reveal significant differences in race/ethnicity, parental education level, and marital status across the clusters discovered by LCA. Furthermore, the structural imaging analysis shows a difference in the thickness of the bilateral superior frontal gyri for LCA clusters compared to typically developing girls.

Wenyu Zhang
Predicting Mania Onset in NESARC: A Comparative Assessment of Logistic Regression, Random Forest, and AdaBoost Algorithms
This study investigates the prediction of mania onset among Wave 1 non-manic individuals in the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) utilizing logistic regression, random forest, and AdaBoost algorithms. The primary objective is to identify individuals at risk for developing mania at Wave 2 by analyzing 43 categorical variables extracted from Wave 1 NESARC responses.
A comparative assessment of the three algorithms reveals that the area under the ROC curve (AUC) for each method is approximately 0.6, indicating room for improving predictive performance. To optimize prediction accuracy, a model stacking approach, integrating random forest and logistic regression algorithms, is employed. Additionally, response plots for the top five influential variables are generated for each algorithm, with data points stratified by model-derived risk classifications.
These findings contribute to the biostatistical understanding of machine learning techniques' applicability in psychiatric research and may guide the development of predictive models and early intervention strategies targeting individuals at risk for developing mania. Implementing model stacking and examining response plots provide insights into refining prediction accuracy and bolstering clinical relevance.
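
To make the AUC figure in this abstract concrete, the sketch below computes the area under the ROC curve as the probability that a randomly chosen case receives a higher score than a randomly chosen non-case (the Mann-Whitney formulation). The function name `auc` and the four labelled scores are hypothetical and are not NESARC data.

```python
def auc(labels, scores):
    """AUC as the share of (case, non-case) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = sum(
        1.0 if p > n else (0.5 if p == n else 0.0)
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical Wave 2 mania indicators and model-predicted risks.
mania = [0, 0, 1, 1]
risk = [0.1, 0.4, 0.35, 0.8]
print(auc(mania, risk))  # 3 of 4 case/non-case pairs are ordered correctly: 0.75
```

An AUC of 0.5 corresponds to random ranking, so a value near 0.6 means the model orders only slightly more than half of such pairs correctly.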

Topics in Clinical Trials (HSC 109B)

Ziyan Xu
A multicenter open-label parallel-arm study to evaluate the safety and efficacy of a PD-1 inhibitor for patients with advanced melanoma
Melanoma is a disease in which cancer cells form in melanocytes (cells that color the skin). The novel drug in this study is a PD-1 inhibitor for patients with advanced melanoma. We assessed the safety and efficacy of the drug in a multi-center open-label parallel-arm trial.
Patients with advanced melanoma at 2 sites were randomly assigned (1:1) to receive the drug in one of two regimens: 1 mg/kg every 2 weeks (arm 1) or 3 mg/kg every 3 weeks (arm 2). The primary outcome was the objective response rate assessed by a blinded independent central review, and the secondary efficacy parameter was the survival rate. The hypothesis that each dosing regimen of the drug has an overall response rate of more than 28% was tested independently for each study with α = 0.05.
A total of 46 patients with advanced melanoma were enrolled. The median follow-up was 12.1 and 13.2 months in arms 1 and 2, respectively, and 22.3 and 22.6 months in arms 1 and 2 for the exploratory efficacy analysis. An objective response was observed in 37.8% of patients (arm 1) and in 25.6% (arm 2). The 2-year overall survival was 56.3% and 45.1%, respectively.
The study met its primary end-point in both arms. Therefore, the drug showed significant efficacy for both dosing regimens in patients with advanced melanoma.
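
One common way to test a single-arm response rate against a fixed threshold such as 28% is a one-sided exact binomial test. The sketch below shows that calculation under stated assumptions: the abstract does not say which test was used, and the function name `binom_upper_tail` and the counts (9 responders among 23 patients) are hypothetical, chosen only for illustration.

```python
from math import comb

def binom_upper_tail(x, n, p0):
    """P(X >= x) for X ~ Binomial(n, p0): the one-sided exact p-value
    for H0: response rate <= p0 versus H1: response rate > p0."""
    return sum(comb(n, k) * p0 ** k * (1 - p0) ** (n - k) for k in range(x, n + 1))

# Hypothetical arm: 9 objective responses among 23 patients, threshold 28%.
n_patients = 23
responders = 9
p_value = binom_upper_tail(responders, n_patients, 0.28)
print(f"one-sided exact p-value: {p_value:.4f}")
print("reject H0 at alpha = 0.05" if p_value < 0.05 else "do not reject H0 at alpha = 0.05")
```

With small arms, the observed rate must exceed the threshold by a wide margin before such a test rejects, which is worth keeping in mind when reading single-arm results.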

Keming Zhan
How compounding pharmacies fill critical gaps in pediatric drug development processes
Critical gaps exist in the compounded preparation of off-label pediatric drugs. In this study, we evaluate gaps in the approval pathways for both manufactured and compounded drugs, with the goal of reducing inefficiencies in the availability of pediatric drugs. Administrative suggestions for pediatric drug development are also included to improve the safety of pediatric medication for both manufactured and compounded drugs.

Philip Kim
Effects of KB109 and Supportive Self-care in Outpatients with COVID-19
The gut microbiome is known to influence disease trajectory. We therefore examine how gut microbiome modulation affects the trajectory of COVID-19. As part of my internship, this randomized, open-label, prospective, parallel-group controlled clinical study aims to explore the natural history of COVID-19 illness and the safety of KB109, a novel glycan, plus supportive self-care versus supportive self-care alone in outpatients with mild-to-moderate COVID-19. Adults who tested positive for COVID-19 were randomized to one of the two groups. They were then followed for 35 days and assessed on COVID-19-related symptoms and comorbidities. The primary endpoint is the number of patients experiencing study-product-related treatment-emergent adverse events. The administration of KB109 plus supportive self-care reduced medically-attended visits by 50% in the overall population and by 61.7% in patients with more than one comorbidity. Within this group, median time to resolution of symptoms was shorter in the treatment group compared to the control group (30 vs 21 days). Results from this study show that KB109 is well tolerated in outpatients with mild-to-moderate COVID-19. Medically-attended visits were fewer and time to resolution was shorter in the KB109 plus supportive self-care group. Therefore, it may be crucial to continue studying the impact of KB109.

Hanfu Shi
Implementation Science Approach to Enhancing Depression Treatment in Collaborative Depression Care Settings (DepCare)
Collaborative care (CC) is a team-based approach that integrates primary and behavioral health and has widely been shown to improve clinical outcomes. Key patient-level barriers to sustaining CC in real-world settings include stigma and lack of patient-provider communication. We sought to determine whether a patient activation tool impacts key barriers to treatment.
From June 2021 to December 2022, we conducted a provider-level cluster randomized controlled trial of a patient activation tool in 4 primary care settings seeking to sustain CC. We employed descriptive statistics, chi-square tests, and t-tests to examine differences between intervention and control arms.
Overall, 52 patients were enrolled in the trial. 33% were women, 15.38% were Black, 84% were Hispanic, and 19.2% were partnered; 65.4% had ever received counseling and 67.3% had ever received medication. Control arm patients were more likely to be female (77.8% vs. 58.8%, p = 0.48) with a marginal effect but did not differ in key demographics. In comparison to the control arm participants, intervention arm patients were more likely to trust their clinicians (9.1 vs. 7.6, p = 0.004) and to plan to receive professional help (77.78% vs. 67.65%, p = 0.05).
In this study, we demonstrate the feasibility of a psychoeducation and activation tool on key barriers to treatment optimization. Future analyses will determine whether our tool improved treatment optimization rates in a larger sample.
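
As a sketch of the chi-square comparison between arms mentioned in this abstract, the code below computes Pearson's chi-square statistic for a 2x2 table, without continuity correction, and its p-value on 1 degree of freedom. The function name `chi_square_2x2` and the cell counts are hypothetical and are not the trial's data; with samples as small as those in this trial, an exact test may be preferable.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value). On 1 degree of freedom the statistic is a
    squared standard normal, so the p-value equals erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts: rows = intervention / control,
# columns = plans to seek professional help (yes / no).
stat, p_value = chi_square_2x2(21, 6, 15, 10)
print(f"chi-square = {stat:.3f}, p = {p_value:.3f}")
```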