2024 Summer Institute for Training in Biostatistics and Data Science at Columbia (SIBDS) Program

The Summer Institute for Training in Biostatistics and Data Science at Columbia (SIBDS@Columbia) is an innovative seven-week summer research training program in which participants acquire and hone quantitative skills anchored in data immersion, working on research challenges from studies of heart and lung diseases as well as infectious diseases. The seven weeks include a one-week asynchronous component during which participants receive an introduction to biostatistics and data science as well as training in RStudio and other software. The program is co-directed by Dr. Kiros Berhane, Cynthia and Robert Citrone-Roslyn and Leslie Goldstein Professor and Chair of the Biostatistics Department, and Dr. Christine Mauro, Associate Professor of Biostatistics and Director of the MS program.
This year's SIBDS Program brought together 15 students from around the country for the opportunity to learn and engage in research right in the heart of New York City. We are honored to have worked with such bright and enthusiastic students this past summer.

Full List of Schools:
- Pomona College
- University of Illinois Urbana-Champaign
- Ohio State University
- CUNY Brooklyn
- University of Florida
- University of Hawaii
- Duke University
- North Carolina State University
- Macalester College
- University of Connecticut
- New York University
Research Projects:
Extracting signatures of health from multidimensional time series data
Mentor: Alan Cohen, PhD – Associate Professor, Environmental Health Sciences
Mentees: Shirley Toribio; Hunter Farnham
We hypothesize that bodies in good health have unique signatures for communication between various systems. We have a dataset of 40 individuals with severe mitochondrial disease (unhealthy) and 70 controls (healthy), each subjected to a series of stressors, with continuous recordings of blood pressure, heart rate, and respiration. Informed by complex systems theory, we will extract signatures of what healthy communication among these time series looks like across the stress time course, and how it may differ in unhealthy individuals. Techniques used may include transfer entropy and multivariate extensions of heart rate variability. Students will learn programming in R and statistics.
Are newer antipsychotic medications more beneficial than older medications for patients with schizophrenia?
Mentor: Caleb Miles, PhD – Assistant Professor, Biostatistics
Mentees: Gabriella Nieves; Sarah Wu; Andrew Ghastine
Using data from the Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE), we will compare the effectiveness of newer antipsychotic medications relative to older medications in their effect on health outcomes in patients with schizophrenia. We will consider different approaches to adjusting for noncompliance, as the study design allowed patients to change their medication over the course of the study. We will also look at treatment effect heterogeneity to understand whether some patients would do better on one medication and other patients on another, or whether the strength of the effect varies depending on certain patient characteristics.
Discovering exposomic profiles and their relation to gestational diabetes within pregnant people in the US
Mentor: Jeanette Stingone, PhD – Assistant Professor, Epidemiology
Mentees: Lise Augustin; Sophia Kop; Emily Zhang
Scientific researchers have acknowledged that studies addressing the combined impacts of multiple environmental exposures are needed to more closely replicate human experience. However, the lack of known patterns in exposure to various chemical classes in representative and diverse populations is a fundamental barrier to the progress of this research. This project will use existing data from NHANES, the representative biomonitoring program within the US population, to investigate the patterns in biomarkers of multiple classes of chemicals seen in pregnant people and determine whether these patterns vary by individual-level socioeconomic characteristics and are associated with gestational diabetes. Students will learn clustering techniques; create visualizations, such as correlation globes, to compare patterns across populations; and conduct regression-based statistical analyses.
Electronic health record phenotyping and genetic association study for age-related diseases
Mentor: Molei Liu, PhD – Assistant Professor, Biostatistics
Mentees: Lucy Liu; Kejin Dong
The first goal of this project is to build accurate and time-specific risk prediction (phenotyping) models for a broad set of age-related diseases based on longitudinal and structural features in electronic health records (EHR). Based on the derived phenotypes, our next aim is to perform a genetic association study leveraging the EHR-linked biobank data, to detect biomarkers useful for characterizing the biological aging process. Note: the mentor will emphasize training in statistical intuition and in fundamental skills in statistics and data science, through a combination of doing research and studying related materials.
Development of a Publicly Available Database for Predicted DNA Methylation
Mentor: Wenpin Hou, PhD – Assistant Professor, Biostatistics
Mentees: Alexandra Duta; Matthew Eichner
Building on our prediction model, we will apply it to bulk, single-cell, and spatial RNA-seq data from public resources such as ENCODE, GTEx, Human Cell Atlas, and Recount2 to reconstruct the DNA methylation (DNAm) landscape. This effort will result in the creation of a comprehensive DNAm database encompassing various tissues, cell types, disease states, treatment conditions, and spatial locations.
The predicted database will be made publicly available, accompanied by visualization tools to facilitate data exploration. We will implement a user-friendly online interface to host the DNAm data, enabling researchers to gain deeper insights into gene regulation. This resource aims to enhance our understanding of epigenomic mechanisms and improve strategies for epigenomic therapy and precision treatment.
Latent trajectories of cancer cachexia and its relationship to survival in patients on Osimertinib for metastatic EGFR-mutant non-small cell lung cancer
Mentor: Xin Ma, PhD – Assistant Professor, Biostatistics
Mentees: Alisha Bhatia; Zoe Curtis; Vivian Ferrigni
Cancer cachexia, characterized by weight loss and decline in muscle mass, is a poor prognostic factor among patients with lung cancer. Osimertinib is the standard of care for patients with metastatic EGFR-mutant non-small cell lung cancer. In previous work, we found that patients with weight loss had significantly worse overall survival. The focus of this project is to further identify subgroups of lung cancer patients with different trajectories of weight loss and compare overall survival among these subgroups. We will provide visualizations of weight loss trajectories. Students will learn about latent trajectory analysis and hypothesis testing procedures in R.
SIBDS@Columbia is funded by NIH grant R25HL161786 and is part of a national network of NIH-funded SIBS programs: Summer Institute in Biostatistics and Data Science | NHLBI, NIH.