Levin Lecture Series Colloquium Seminars
Lectures are in-person only unless marked otherwise.
To request a Zoom link for a seminar that offers a virtual option, please email Erin Elliott, Programs Coordinator (ee2548@cumc.columbia.edu).
During the Fall and Spring semesters, the Department of Biostatistics holds seminars, called the Levin Lecture Series, on a wide variety of topics of interest to both students and faculty. The speakers are occasionally departmental faculty members but are more often invited guests who spend the day of their seminar discussing their research with Biostatistics faculty and students.
Fall 2024 Levin Lectures
September 5th, ARB 8th Floor Auditorium, 11:45am
Kevin Josey, PhD
Assistant Professor, Department of Biostatistics & Informatics
Colorado School of Public Health
Causal Inference using Variables Measured with Error
Abstract:
In both the scientific application and development of causal inference methods, it is often implicitly assumed that all relevant variables are measured without error. However, in many contexts obtaining error-free measurements of an outcome, exposure, or confounding variable may be unreasonable or even impossible. In these scenarios, the presence of measurement error can subsequently invalidate fundamental assumptions necessary for causal inference. Despite the extensive literature studying the impact of measurement error in associational studies, the development of methods at the intersection of measurement error and causal inference is in a relatively early stage. This presentation will first examine a variety of methods for addressing measurement error in causal analyses. Subsequently, we propose implementing a class of estimators applicable to general causal quantities that is conventionally used for unmeasured confounding to instead address bias induced by measurement error. Under standard double sampling schemes, the proposed estimator is shown to be competitive with existing approaches in a simulation study. We illustrate our method with observational electronic health record data on HIV outcomes from the Vanderbilt Comprehensive Care Clinic.
September 12th, ARB Room 532A/B, 11:45am
Shuangning Li, PhD
Assistant Professor of Econometrics and Statistics
University of Chicago Booth School of Business
Causal Inference in the Presence of Interference: Estimation and Testing Problems
Abstract:
In causal inference, "interference" refers to a scenario where the treatment assigned to one unit affects the observed outcomes of other units. In a wide variety of applied settings, such interference effects not only exist but are of considerable interest. In this talk, I will present some tools I have developed to conduct statistical inference in the presence of such interference.
1. Estimation:
I will begin by discussing estimation problems, focusing on a study that examines large-sample asymptotics for treatment effect estimation under network interference, where the interference graph is a random draw from a graphon. For direct effects, we demonstrate that popular estimators in this setting are significantly more accurate than previously suggested. For indirect effects, we propose a new consistent estimator in a setting where no other consistent estimators currently exist.
2. Testing:
If time permits, I will then discuss testing problems. I will present a study focused on testing for interference in A/B testing with increasing allocation. Specifically, we introduce two permutation tests designed to detect the existence of interference, each valid under different assumptions. These procedures have been implemented at LinkedIn to detect potential interference across all their marketplace experiments.
September 19th, ARB 8th Floor Auditorium, 11:45am
Shuangge (Steven) Ma, PhD
Department Chair and Professor of Biostatistics
Yale School of Medicine
Modeling Emotional Expressions for Multiple Cancers via a Linguistic Analysis of an Online Health Community
Abstract:
The diagnosis and treatment of cancer can evoke a variety of adverse emotions. Online health communities (OHCs) provide a safe platform for cancer patients and those closely related to express emotions without fear of judgement or stigma. In the literature, linguistic analysis of OHCs is usually limited to a single disease and based on methods with various technical limitations. In this article, we analyze posts from September 2010 to September 2022 on nine cancers that are publicly available at the American Cancer Society’s Cancer Survivors Network (CSN). We propose a novel network analysis technique based on a latent space model. The proposed approach decomposes the emotional expression semantic networks into an across-cancer time-independent component (which describes the “baseline” that is shared by multiple cancers), a cancer-specific time-independent component (which describes cancer-specific properties), and an across-cancer time-dependent component (which accommodates temporal effects on multiple cancer communities). For the second and third components, respectively, we consider a novel clustering structure and a change point structure. A penalization approach is proposed, and its theoretical and computational properties are carefully examined. The analysis of the CSN data leads to sensible networks and deeper insights into emotions for cancer overall and specific cancer types.
September 26th, ARB 8th Floor Auditorium, 11:45am
Leilei Zeng, PhD
Professor/Associate Chair - Research
University of Waterloo, Department of Statistics and Actuarial Science
A Mixture Hidden Markov Model for Multiple Types of Disease
Abstract:
Multistate models are widely used for analyzing longitudinal data on disease progression over time. Many diseases manifest differently and what appears to be a coherent collection of symptoms is often the expression of a variety of distinct disease subtypes, each with a different rate of onset of symptoms and progression. We propose a mixture hidden Markov model (MHMM), where the underlying process is characterized by a finite mixture of multiple Markov chains, one for each disease subtype, while the observation process contains states corresponding to the common symptomatic stages of these diseases. Information on type of disease is partially available and reflects the pathway through certain hidden states in the corresponding disease process, facilitating the estimation of parameters involved in the proposed models. The method is demonstrated on a dataset to model the development and progression of dementia caused by Alzheimer's disease and non-AD dementia.
October 3rd, ARB 8th Floor Auditorium, 11:45am
Qi Long, PhD
Professor of Biostatistics
University of Pennsylvania, Perelman School of Medicine, Department of Biostatistics, Epidemiology & Informatics
Advancing Responsible Statistical and AI/ML Methods for Analysis of Complex EHR data
Rapid advances in technologies have enabled generation and collection of vast amounts of health data in research studies, from healthcare delivery, and from other real-world sources. While such rich data offer great promise in advancing intelligent and equitable health and medicine, they present daunting analytical challenges. One notable example is the multi-modal data from electronic health records (EHR) that are recorded at irregular time intervals with varying frequencies and include structured data such as labs and vitals, codified data such as diagnosis and procedure codes, and unstructured data such as clinical notes and pathology reports. They are typically incomplete and fraught with other data errors and biases. What’s more, data gaps and errors in EHRs are often unequally distributed across patient groups: people with less access to care, often people of color or with lower socioeconomic status, tend to have more incomplete EHRs. Such data bias, if not adequately addressed, would lead to biased results and exacerbate health inequities. In this talk, I will share my research group’s work on developing robust statistical and AI/ML methods for addressing these challenges including some recent work on large language models (LLMs). Our research experience has demonstrated that a trans-disciplinary data science approach that involves collaboration between statisticians, informaticians, computer scientists, and physician scientists can accelerate innovation in harnessing the transformative power of EHR to tackle complex real-world problems and exert powerful impact in medicine. To this end, I will also discuss some open questions and opportunities for future research.
October 10th, ARB Hess Commons, 11:45am
Menggang Yu, PhD
Professor, Biostatistics
University of Michigan, School of Public Health
Covariate-Balancing Weights for Causal Generalization with Target Sample Summary Information
In this talk, we focus on estimating the average treatment effect (ATE) of a target population when individual-level data from a source population and summary-level data (e.g., first or second moments of certain covariates) from the target population are available. In the presence of heterogeneous treatment effect, the ATE of the target population can be different from that of the source population when distributions of treatment effect modifiers are dissimilar in these two populations, a phenomenon also known as covariate shift. Many methods have been developed to adjust for covariate shift, but most require individual covariates from a representative target sample. We develop nonparametric weights for the treated and control groups within the source sample by calibration to the summary-level information from the target sample. Our approach also seeks additional covariate balance between the treated and control groups in the source sample. We will demonstrate statistical properties and numerical results of the resulting estimator.
October 17th, ARB Hess Commons, 11:45am
Laura Hatfield, PhD
Senior Fellow, NORC
University of Chicago
Transporting Difference-in-Differences Estimates for Health Equity Evaluations
The Medicare program provides medical insurance for most adults aged 65 years and older in the United States. To improve the cost, quality, and outcomes of care for Medicare beneficiaries, the Center for Medicare and Medicaid Innovation (CMMI) designs and tests novel payment and delivery models. CMMI has recently pledged to put equity at the center of its demonstrations and evaluations. However, robust methods to estimate equity impacts using quasi-experimental designs are lacking. This paper addresses the problem of transporting treatment effect estimates from CMMI models, most commonly using difference-in-differences designs, to equity-relevant target populations. We extend methods developed by Renson et al. (2023) to transport difference-in-differences treatment effects. Specifically, we apply and extend these methods to transport the effects of Comprehensive Primary Care Plus (CPC+) to a target population of Black fee-for-service (FFS) Medicare beneficiaries living outside the original 18 CPC+ regions. Our application poses a unique problem in that the treatment status of the units to which we wish to transport inferences cannot be observed. Therefore, we conducted a simulation study in which we simulated practice-level spending in sample and target units, calibrating to values from the literature and varying key parameters to create multiple realistic scenarios that varied the representativeness of the sample relative to the target population. Across our simulation scenarios, transporting the treatment effect yielded median treatment effects that varied as much as the total estimated effect. We also explored the sensitivity of the methods to violations of assumptions. I also discuss connections to our research on formulating target estimands for equity evaluations and developing identification and estimation strategies for those estimands.
October 24th, ARB Hess Commons, 11:45am
Sandrah Proctor Eckel, PhD
Associate Professor of Population and Public Health Sciences
University of Southern California, Keck School of Medicine
Biostatistics, Climate Change, and Health
Climate change poses an existential threat to human health. Public health researchers are increasingly joining multidisciplinary teams to quantify the health impacts of climate-related events, to identify key priorities for adaptation, and to provide policy-relevant information by evaluating ongoing adaptation measures as well as the public health co-benefits of mitigation efforts. I have been transitioning from research focused on methods and applications in environmental statistics, especially for air pollution epidemiology, to climate and health. In this talk, I will share my perspective on directions for biostatistical contributions to climate and health. I will provide an overview from my experience co-developing a new graduate-level, team-taught multidisciplinary course: “Data Science Methods for Climate and Health Research” and insights from ongoing research projects at the University of Southern California. For example, traffic is a key source of air pollution in Southern California. We recently related the early phase transition to electric vehicles in California to reductions in nitrogen dioxide (NO2) air pollution and reductions in asthma-related emergency department visits using classic longitudinal mixed model methods. More generally, many studies relating climate hazards to acute health outcomes have adapted methods from air pollution epidemiology. For example, analyses linking climate hazards to administrative health outcomes (e.g., daily hospitalizations or mortality) typically use methods for ecological time series analysis of counts with quasi-Poisson regression or time-stratified case-crossover sampling designs with conditional logistic regression. Case-crossover designs have been growing in popularity as they are considered individual-level rather than ecological and control for time-constant confounders by design. However, case-crossover has been noted to have worse efficiency than time series.
In a simulation study, we compared case-crossover and time series methods for studying rare binary exposures (e.g., high wildfire smoke day, or extreme heat day), showing that the reduced relative efficiency for case-crossover worsened with increasingly rare extreme exposures. In summary, climate and health research is a rapidly developing field with urgent needs. More insights will be gained by continuing to adapt existing methods, but key features of climate-related hazards may require new methodological approaches. Biostatisticians will play an important role in developing data-driven solutions for a healthy future in our changing climate.
October 31st, ARB 8th Floor Auditorium, 11:45am
Oscar Madrid Padilla, PhD
Assistant Professor in the Department of Statistics
University of California Los Angeles
Multilayer random dot product graphs: Estimation and online change point detection
We study the multilayer random dot product graph (MRDPG) model, an extension of the random dot product graph to multilayer networks. To estimate the edge probabilities, we deploy a tensor-based methodology and demonstrate its superiority over existing approaches. Moving to dynamic MRDPGs, we formulate and analyse an online change point detection framework. At every time point, we observe a realization from an MRDPG. Across layers, we assume fixed shared common node sets and latent positions but allow for different connectivity matrices. We propose efficient tensor algorithms under both fixed and random latent position cases to minimize the detection delay while controlling false alarms. Notably, in the random latent position case, we devise a novel nonparametric change point detection algorithm based on density kernel estimation that is applicable to a wide range of scenarios, including stochastic block models as special cases. Our theoretical findings are supported by extensive numerical experiments, with the code available online.
November 7th, ARB Hess Commons, 11:45am
Gang Li, PhD
Professor of Biostatistics
University of California Los Angeles Fielding School of Public Health
Prediction Performance Measures for Time-to-Event Data
Evaluating and validating the performance of prediction models is a crucial task in statistics, machine learning, and their diverse applications, including precision medicine. However, developing robust performance measures, particularly for time-to-event data, poses unique challenges. In this talk, I will highlight how conventional performance metrics for time-to-event data, such as the C Index, Brier Score, and time-dependent AUC, may yield unexpected results when comparing models. I will then introduce a novel pseudo R-squared measure and demonstrate its utility as a performance metric for both uncensored and right-censored time-to-event data. Additionally, I will discuss its extension to time-dependent performance measures and competing risks data, and showcase its effectiveness through simulations and real-world examples.
November 14th, ARB 8th Floor Auditorium, 11:45am
Michael Hudgens, PhD
Professor and Chair, Department of Biostatistics
University of North Carolina, Gillings School of Global Public Health
Causal Inference in Infectious Disease Prevention Studies
This talk will provide a high-level overview of the development and application of causal inference methods to infectious disease prevention studies, with particular focus on vaccines. Examples will include drawing inference about vaccine effects on post-infection outcomes, immunological correlates of vaccine protection, spillover effects of vaccines, and waning of vaccine effects over time.
November 21st, ARB Hess Commons, 11:45am
Cliff Meyer, PhD
Senior Research Scientist
Harvard T.H. Chan School of Public Health
Computational Biology
Decoding Epigenetic Complexity: Modeling Gene Regulation with the Cistrome Data Browser
The molecular mechanisms underlying many cancers are linked to disruptions in trans-acting factors and their interactions with cis-regulatory elements, which jointly regulate gene expression. Genomics techniques such as ChIP-seq, DNase-seq, and ATAC-seq are commonly used to map these interactions and chromatin landscapes across the genome, collectively known as "cistromes." This presentation will focus on recent updates to the Cistrome DB, a comprehensive repository of curated, quality-controlled cistrome data for human and mouse. I will also discuss multimodal approaches for integrating single-cell ATAC-seq and RNA-seq data, including methods to correct for batch effects using AI and topic models. Finally, I will introduce new Cistrome DB resources for AI-driven applications in regulatory genomics.
December 5th, ARB Hess Commons, 11:45am
Zhengwu Zhang, PhD
Assistant Professor
University of North Carolina, Statistics & Operations Research
Generative Models for Brain Network Data Analysis: VAEs, GANs, and Diffusion Models
Generative models are transformative tools for analyzing complex brain network data, enabling the capture of intricate patterns and their relationships with human traits like cognition. In this talk, we introduce generative models—specifically Variational Auto-Encoders (VAEs), Generative Adversarial Networks (GANs), and diffusion models—and their applications in neuroimaging. We begin with Graph Auto-Encoding (GATE), a VAE-based model that characterizes the population distribution of brain graphs, improving cognitive trait prediction in large datasets like ABCD and HCP. Next, we address motion artifacts in structural connectomes using a motion-invariant VAE (inv-VAE), enhancing accuracy in brain network analyses. We then discuss an interpretable GAN framework, named Disentangled Adversarial Flow (DAF), which leverages multi-source datasets to improve predictive modeling in studies with limited samples. Finally, we explore a conditional latent diffusion model for unpaired volumetric harmonization of brain MRI (HCLD), enabling efficient harmonization across sites without paired data. These advances underscore the pivotal role of generative models in enhancing neuroimaging analyses and deepening our understanding of brain structure and function.
December 6th, ARB Hess Commons, 11:45am
Ingrid Van Keilegom, PhD
Faculty of Economics and Business
KU Leuven
Semiparametric estimation of the survival function under dependent censoring
This paper proposes a novel estimator of the survival function under dependent random right censoring, a situation frequently encountered in survival analysis. We model the relation between the survival time T and the censoring C by using a parametric copula, whose association parameter is not supposed to be known. Moreover, the survival time distribution is left unspecified, while the censoring time distribution is modeled parametrically. We develop sufficient conditions under which our model for (T,C) is identifiable, and propose an estimation procedure for the distribution of the survival time T of interest. Our model and estimation procedure build further on the work on the copula-graphic estimator proposed by Zheng and Klein (1995) and Rivest and Wells (2001), which has the drawback of requiring the association parameter of the copula to be known, and on the recent work by Czado and Van Keilegom (2023), who suppose that both marginal distributions are parametric whereas we allow one margin to be unspecified. Our estimator is based on a pseudo-likelihood approach, and maintains low computational complexity. The asymptotic normality of the proposed estimator is shown. Additionally, we discuss an extension to include a cure fraction, addressing both identifiability and estimation issues. The practical performance of our method is validated through extensive simulation studies and an application to a breast cancer data set.
December 12th, ARB Hess Commons, 11:45am
Tingting Zhang, PhD
Professor
University of Pittsburgh, Department of Statistics
Analysis of Functional Brain Network Changes from Childhood to Old Age: A Study Using HCP-D, HCP-YA, and HCP-A Datasets
We present a new clustering-enabled regression approach to investigate how functional connectivity (FC) of the entire brain changes from childhood to old age. By applying this method to resting-state functional magnetic resonance imaging data aggregated from three Human Connectome Project studies, we cluster brain regions that undergo identical age-related changes in FC and reveal diverse patterns of these changes for different region clusters. While most brain connections between pairs of regions show minimal yet statistically significant FC changes with age, only a tiny proportion of connections exhibit practically significant age-related changes in FC. Among these connections, FC between region clusters from the same functional network tends to decrease over time, whereas FC between region clusters from different networks demonstrates various patterns of age-related changes. Moreover, our research uncovers sex-specific trends in FC changes. Females show much higher FC mainly within the default mode network, whereas males display higher FC across several more brain networks. These findings underscore the complexity and heterogeneity of FC changes in the brain throughout the lifespan.