Spring 2025 Departmental Seminars & Lectures
These are the archived Departmental Seminars & Lectures from Spring of 2025.
During the Fall and Spring semesters, the Department of Biostatistics holds regular seminars on Thursdays, called the Levin Lecture Series, on a wide variety of topics which are of interest to both students and faculty. Over each semester, there are also often guest lectures outside the regular Thursday Levin Lecture Series, to provide a robust schedule that covers the wide range of topics in Biostatistics. The speakers are invited guests who spend the day of their seminar discussing their research with Biostatistics faculty and students.
Spring 2025 Schedule
Thursday, January 16th, Hess Commons, 11:45am
Levin Lecture 
Charles J. Wolock, PhD
Postdoctoral Researcher, Department of Biostatistics, Epidemiology, and Informatics
University of Pennsylvania
Nonparametric approaches to assessing variable importance using health data
Abstract:
Given a collection of features available for inclusion in a predictive model, it may be of interest to quantify the relative importance of a subset of features for the prediction task at hand. This objective has given rise to a substantial literature focused on defining, estimating, and making inference on variable importance. Within this field, there is a need for tools to handle the complications characteristic of health data, including coarsening of the outcome variable. Drawing upon examples from research in infectious diseases and mental health, we present novel methods for assessing variable importance in a nonparametric, algorithm-agnostic manner. Our proposed methods allow for flexible estimation of nuisance parameters and provide asymptotically valid inference, while also enjoying robustness properties and nonparametric efficiency. We demonstrate the performance of our proposed procedures via numerical simulations and present an analysis of data from the HVTN 702 HIV vaccine trial, with the aim of informing enrollment strategies for future trials. Furthermore, we discuss several open questions surrounding variable importance and outline possible avenues of future work.
Thursday, January 23rd, Hess Commons, 11:45am 
Levin Lecture 
Harsh Parikh, PhD
Postdoctoral Fellow, Department of Biostatistics
Johns Hopkins University
Interpretable Causal Inference for Advancing Healthcare and Public Health
Abstract:
Causal inference methods are essential across healthcare, public health, and social sciences, helping understand complex systems and inform decision-making. While integrating machine learning (ML) and statistical techniques has improved causal estimation, many of these methods depend on black-box ML approaches. This raises concerns about the communicability, auditability, and trustworthiness of causal estimates, especially in high-stakes contexts. My research addresses these challenges by developing interpretable causal inference methods. In this presentation, I introduce an approach for bridging the research-to-practice gap by generalizing randomized controlled trial (RCT) findings to target populations. Although RCTs are fundamental for understanding causal effects, extending their findings to broader populations is difficult due to effect heterogeneity and the underrepresentation of certain subgroups. Our work tackles this issue by identifying and interpretably characterizing underrepresented subgroups in RCTs. Specifically, we propose the Rashomon Set of Optimal Trees (ROOT), an optimization-based method that produces interpretable characteristics of underrepresented subgroups. This approach helps researchers communicate findings more effectively. We apply ROOT to extend inferences from the Starting Treatment with Agonist Replacement Therapies (START) trial -- assessing the effectiveness of opioid use disorder medication -- to the real-world population represented by the Treatment Episode Dataset: Admissions (TEDS-A). By refining target populations using ROOT, our framework offers a systematic approach to enhance decision-making accuracy and inform future trials in diverse populations.
Friday, January 24th, Hess Commons, 11:45am 
Guest Lecture
Eric Sun
PhD Student
Stanford University
Machine learning for aging and spatial omics
Abstract:
Aging is a highly complex process and the greatest risk factor for many chronic diseases including cardiovascular disease, dementia, stroke, diabetes, and cancer. Recent spatial and single-cell omics technologies have enabled the high-dimensional profiling of complex biology including that underlying aging. As such, new machine learning and computational methods are needed to unlock important insights from spatial and single-cell omics datasets. First, I present the development of high-resolution machine learning models (‘spatial aging clocks’) that can measure the aging of individual cells in the brain. Using these spatial aging clocks, I discovered that some cell types can dramatically influence the aging of nearby cells. Next, I present new computational and statistical methods for overcoming the gene coverage limitations of existing spatial omics technologies, which have enabled the discovery of gene pathways underlying the spatial effects of brain aging. Finally, I introduce several methods for improving the reliability and robustness of high-dimensional data visualizations.
Tuesday, January 28th, 8th Floor Auditorium, 11:45am
Guest Lecture 
Yao Zhang, PhD
Postdoctoral Scholar, Statistics
Stanford University
Posterior Conformal Prediction
Abstract:
Conformal prediction is a popular technique for constructing prediction intervals with distribution-free coverage guarantees. The coverage is marginal, holding on average over the entire population but not necessarily for any specific subgroup. In this talk, I will introduce a new method, posterior conformal prediction (PCP), which generates prediction intervals with both marginal and approximate conditional coverage for clusters (or subgroups) naturally discovered in the data. PCP achieves these guarantees by modelling the conditional conformity score distribution as a mixture of cluster distributions. Compared to other methods with approximate conditional coverage, this approach produces tighter intervals, particularly when the test data is drawn from clusters that are well represented in the validation data. PCP can also be applied to guarantee conditional coverage on user-specified subgroups, in which case it achieves robust coverage on smaller subgroups within the specified subgroups. In classification, the theory underlying PCP allows for adjusting the coverage level based on the classifier’s confidence, achieving significantly smaller sets than standard conformal prediction sets. Experiments demonstrate the performance of PCP on diverse datasets from socio-economic, scientific and healthcare applications.
CANCELLED: Thursday, January 30th, Hess Commons, 11:45am
Levin Lecture 
Paul Albert, PhD
Senior Investigator, Biostatistics Branch
NCI/DCEG
Innovative applications of hidden Markov models in cancer Epidemiology and Genetics
Abstract:
During the past 30 years, Hidden Markov modeling (HMM) has had a big impact in the analysis of biomedical data, with a few important application areas in genomics, natural history modeling, environmental monitoring, and the analysis of longitudinal data. In cancer genomics, for example, the use of HMM has played an important role in uncovering both susceptibility (germline) and tumor progression (somatic) of cancer. In this talk, I will present a series of novel applications of HMMs in cancer epidemiology and genetics. I will describe the use of HMM to identify multiple subclones in next-generation sequences of tumor samples (Choo-Wosoba et al., Biostatistics 2021). I will also discuss the application of HMMs for characterizing the natural history of human papillomavirus and cervical precancer (Aron et al., Statistics in Medicine, 2021). Further, I describe the use of HMMs for investigating the effects of sleeping and activity on mortality. Last, I describe the use of HMMs for joinpoint analysis in cancer surveillance. All four examples required interesting adaptations of standard HMM estimation that will be highlighted.
Tuesday, February 4th, Hess Commons, 11:45am
Guest Lecture 
Ying Cui, PhD
Postdoctoral Scholar, Department of Biomedical Data Science
Stanford University
Advancing Biomedical Data Science: From Population Insights to Personalized Decisions
Abstract:
Rapid advances in biomedicine have enabled us to address important questions that were once intractable. There is a pressing need for analyzing massive data sets emerging from cutting-edge technologies, presenting challenges such as high-dimensionality and multi-modality. Additionally, there has been rising interest in personalized decision-making. Inspired by these challenges, my research aims to enhance the integration of statistical insights and data science innovations in biomedical research. In this talk, I will cover two projects.
The first part of the talk explores key questions about identifying covariates relevant to clinical outcomes of interest. Addressing these questions, however, can be complicated due to the presence of complex covariate effects. To tackle this problem, I developed a new testing and screening framework by adopting a global view via the novel concept of interval quantile independence. I showed that this general testing framework can naturally yield both unconditional and conditional screening procedures for ultra-high dimensional settings and enjoy the sure screening property.
In the second part of the talk, I address the feature selection problem from a personalized perspective. I designed a novel dynamic prediction rule to determine the optimal order of acquiring features in predicting clinical outcomes of interest for individual subjects. The goal is to optimize model performance while reducing the costs associated with measuring features. To achieve this, I employed reinforcement learning, where the agent decides the best action at each step: either making a final decision or continuing to collect new predictors. The proposed approach mirrors and improves real-life decision-making processes, employing a “learn-as-you-go” paradigm.
Thursday, February 6th, Hess Commons, 11:45am 
Levin Lecture
Tianyu Zhang, PhD
Postdoctoral Researcher, Department of Statistics & Data Science
Carnegie Mellon University
Adaptive and Scalable Nonparametric Estimation via Stochastic Optimization
Abstract:
Nonparametric procedures are frequently employed in predictive and inferential modeling to relate random variables without imposing specific parametric forms. In supervised learning, for instance, our focus is often on the conditional mean function that links predictive covariates to a numerical outcome of interest. While many existing statistical learning methods achieve this with optimal statistical performance, their computational expenses often do not scale favorably with increasing sample sizes. This challenge is exacerbated in certain “online settings,” where data is continuously collected and estimates require frequent updates.
Thursday, February 27th, ARB 8th Floor Auditorium, 11:45am
Levin Lecture 
Shan Yu, PhD
Assistant Professor, Department of Statistics
University of Virginia
Distributed Learning for Heterogeneous and Complex Big Spatial Data
Spatial heterogeneity plays a crucial role in various scientific domains, including social sciences, economics, environmental studies, and biomedical research. In this talk, I will introduce a generalized partially linear spatially varying coefficient (GPL-SVC) model, a powerful statistical framework for effectively capturing spatial heterogeneity while balancing model flexibility and interpretability.
A key challenge in applying such models to large-scale spatial datasets is computational scalability. To address this, we develop a novel Distributed Heterogeneity Learning (DHL) framework, which leverages bivariate spline smoothing over a domain triangulation. The DHL method is designed to be simple, scalable, and communication-efficient, with rigorous theoretical guarantees.
We apply the proposed methodology to spatially resolved transcriptomics, an emerging technology that provides unprecedented insights into gene expression heterogeneity. The spVC model, built upon the GPL-SVC framework, seamlessly integrates constant and spatially varying effects of covariates. Unlike existing methods that primarily focus on statistical significance, spVC captures continuous expression patterns and incorporates spot-level covariates, enhancing biological interpretability. Extensive simulations and real-world data applications validate the accuracy and versatility of spVC in identifying spatially variable genes.
Through this talk, I will demonstrate how these scalable statistical methodologies provide effective solutions for analyzing spatial heterogeneity across diverse data domains, bridging the gap between traditional spatial analysis and modern high-throughput biomedical research.
Thursday, March 6th, Hess Commons, 11:45am 
Levin Lecture
Lorin Crawford, PhD
Principal Researcher at Microsoft Research
Distinguished Senior Fellow in Biostatistics, Brown University
Statistical opportunities in defining, modeling, and targeting cell state in cancer
Project Ex Vivo is a joint cancer research collaboration between Microsoft and the Broad Institute of MIT and Harvard. Our group views cancers as complex (eco)systems, beyond just mutational variation, that necessitate systems-level understanding and intervention. In this talk, I will discuss a series of multimodal statistical and deep learning approaches to understand accurate representations of tumors by integrating genetic markers, expression state, and microenvironmental interactions. These representations help us precisely define and quantify the trajectory of each tumor in each patient. Our ultimate objective is to more effectively model cancer ex vivo – outside the body – in a patient-specific manner. In doing so, we aim to unlock the ability to better stratify patient populations and identify therapies that target diverse aspects of human cancers.
Thursday, March 13th, ARB 8th Floor Auditorium, 11:45am
Levin Lecture 
Fan Li, PhD
Associate Professor of Biostatistics
Yale University
How to achieve model-robust inference in stepped wedge trials with model-based methods?
A stepped wedge design is a unidirectional crossover design where clusters are randomized to distinct treatment sequences. While model-based analysis of stepped wedge designs is standard practice to evaluate treatment effects accounting for clustering and adjusting for covariates, their properties under misspecification have not been systematically explored. In this talk, we study when a potentially misspecified working model (linear mixed models and generalized estimating equations) can offer consistent estimation of the marginal treatment effect estimands, which are defined nonparametrically with potential outcomes and may be functions of calendar time and/or exposure time. We prove a central result that consistency for nonparametric estimands usually requires a correctly specified treatment effect structure, but generally not the remaining aspects of the working model (functional form of covariates, random effects, and error distribution), and valid inference can be obtained via the sandwich variance estimator. Furthermore, an additional g-computation step is required to achieve model-robust inference under non-identity link functions or for ratio estimands. The theoretical results are illustrated via several simulation experiments and re-analysis of a completed stepped wedge trial.
Friday, March 14th, Hess Commons, 11:45am
Levin Lecture 
Weining Shen, PhD
Associate Professor, Donald Bren School of Information & Computer Sciences
University of California, Irvine
Data integration in clinical trials and sports analytics
Many real-world applications require collecting and analyzing data from multiple sources. In this talk, I will present two projects related to data integration. The first project focuses on evaluating the sports understanding of mainstream large language models, using newly introduced benchmark datasets. Our evaluation covers a range of tasks, from basic queries about rules and historical facts to complex, context-specific reasoning, as well as assessing the sports reasoning capabilities of video language models. The second project is about selecting relevant external data to improve inference of long-term efficacy for gene therapies. We will introduce a new Bayesian selection framework and discuss its theoretical properties and empirical performance in a real-world gene therapy trial.
CANCELLED: Thursday, March 27th, Hess Commons, 11:45am
Levin Lecture 
James Zou, PhD
Associate Professor of Biomedical Data Science and, by courtesy, of Computer Science and Electrical Engineering
Stanford University
Talk Title & Abstract TBA
Thursday, April 3rd, Hess Commons, 11:45am 
Levin Lecture
Jennifer Hill, PhD
Professor of Applied Statistics; Co-Department Chair; Co-Director of PRIISM, Department of Applied Statistics, Social Science, and Humanities
NYU Steinhardt
Democratizing Methods
The past few decades have seen an explosion in the development of freely available software to implement statistical methods and algorithms to help explore and analyze data. However, researchers tend to assume that releasing software packages implementing specific methods is sufficient for ensuring that the tools are adopted and used correctly. Typically, very little attention is paid to the user experience. This in turn means that the tools do not get used, are used incorrectly, or the results are misinterpreted. This talk will present a case study for how software development could be different by describing a causal analysis tool that scaffolds the user experience. I will discuss lessons learned through user studies and experimental evidence. I conclude with calls to action for those that develop methods and software.
Thursday, April 17th, Hess Commons, 11:45am 
Levin Lecture
Ali Shojaie, PhD
Professor of Biostatistics & Statistics, Associate Chair of Biostatistics
University of Washington
Inference on function-valued parameters using a restricted score test
Function-valued parameters that can be defined as the minimizer of a population risk arise naturally in many applications. Examples include the conditional mean function and the density function. Although there is an extensive literature on constructing consistent estimators for function-valued risk minimizers, such estimands can typically only be estimated at a slower-than-parametric rate in nonparametric and semiparametric models, and performing calibrated inference can be challenging. In this talk, we present a general inferential framework for function-valued risk minimizers as a nonparametric extension of the classical score test. We demonstrate that our framework is applicable in a wide variety of problems and describe how the approach can be used for inference on a mean regression function under (i) nonparametric and (ii) partially additive models.
Thursday, April 24th, Hess Commons, 11:45am
Levin Lecture 
Amita Manatunga, PhD
Donna J. Brogan Professor in Biostatistics
Rollins School of Public Health, Emory University
Model-free Framework for Evaluating the Reliability of a New Device with Multiple Imperfect Reference Standards
A common practice for establishing the reliability of a new computer-aided diagnostic (CAD) device is to evaluate how well its clinical measurements agree with those of a gold standard test. However, in many clinical studies, a gold standard is unavailable, and one needs to aggregate information from multiple imperfect reference standards for evaluation. A key challenge here is the heterogeneity in diagnostic accuracy across different reference standards, which may lead to biased evaluation of a device if improperly accounted for during the aggregation process. We propose an intuitive and easy-to-use statistical framework for evaluation of a device by assessing agreement between its measurements and the weighted sum of measurements from multiple imperfect reference standards, where weights representing relative reliability of each reference standard are determined by a model-free, unsupervised inductive procedure. Specifically, the inductive procedure recursively assigns higher weights to reference standards whose assessments are more consistent with each other and form a majority opinion, while assigning lower weights to those with greater discrepancies. Unlike existing methods, our approach does not require any modeling assumptions or external data to quantify heterogeneous accuracy levels of reference standards. It only requires specifying an appropriate agreement index used for weight assignment and device evaluation. The framework is applied to evaluate a CAD device for kidney obstruction by comparing its diagnostic ratings with those of multiple nuclear medicine physicians.
This is joint work with Ying Cui, Qi Yu and Jeong H. Jang.
Thursday, May 1st, Hess Commons, 11:45am 
Levin Lecture
Qing Pan, PhD
Professor, Department of Biostatistics and Bioinformatics
George Washington University, Milken Institute School of Public Health
Predictions of Advanced Adenoma and High-Risk Pregnancies in Longitudinal Screening Studies
Panel count data is common in cancer screening. In the context of colorectal cancer screening, our work focuses on the prediction of the probability of advanced adenoma conditional on patient-level risk factors and/or event history. We implement the joint frailty model proposed by Huang et al. (2006), which involves a non‐stationary Poisson process for recurrent adenoma events and informative screening time using semi‐parametric Cox models correlated by a latent frailty variable. Coefficients and baseline intensity functions are estimated through estimating equations. The subject-specific frailty value is estimated by the borrow‐strength method (Huang and Wang 2004). In addition, marginal models for the adenoma and screening events are also applicable when average covariate effects on the population level are of interest. Predictions of individual risks based on the marginal model and predictions based on the frailty models for patients with or without screening history are compared. When a patient’s screening history is available and sufficient adenoma events are observed, the predictions based on the frailty model with estimated subject‐specific frailty are superior. However, in the cases of early censoring when adenoma events are not observed for most patients or screening history is not available, the prediction based on the marginal model has better performance. For future screening, the individualized screening intervals based on the dynamic predictions of advanced adenoma risks will detect adenomas earlier with shorter lag times between adenoma occurrences compared to the current practice of fixed screening intervals for all.
In a separate project, machine learning and deep learning models to identify pregnancies with elevated risks of adverse outcomes are compared. A novel GRU model that accommodates both static and time-varying information and allows interactions between these two kinds of covariates through additional attention layers provides better performance. Contributions of various types of covariates (questionnaires, blood tests, and ultrasound) to the prediction accuracy are compared for clinical practice in low- and middle-income countries.
Thursday, May 8th, 8th Floor Auditorium, 11:45am 
Levin Lecture
Yingying Wei, PhD
Associate Professor, Department of Statistics
The Chinese University of Hong Kong
Meta-clustering of Gene Expression Data
Traditional meta-analyses pool effect sizes across studies to improve statistical power. Likewise, there is growing interest in joint clustering across datasets to identify disease subtypes for bulk gene expression data and to discover cell types for single-cell RNA-sequencing (scRNA-seq) data. Unfortunately, due to the prevalence of technical batch effects, directly clustering samples from multiple gene expression datasets can lead to wrong results. Therefore, in the past several years, there has been very active research on the integration of multiple gene expression datasets. However, the discussion on when multiple gene expression datasets can be integrated for joint clustering is lacking. Obviously, if different subtypes are assayed in distinct batches, then meta-clustering would be impossible no matter what types of machine learning or statistical methods are used.
In this talk, I will present our Batch-effects-correction-with-Unknown-Subtypes (BUS) framework. BUS is capable of adjusting batch effects explicitly, grouping samples that share similar characteristics into subtypes, identifying genes that distinguish subtypes and enjoying a linear-order computational complexity. The BUS framework can be adapted to perform meta-clustering for bulk gene expression data, scRNA-seq data collected from a single biological condition, and scRNA-seq data collected from multiple biological conditions, respectively. The proofs for model identifiability for the corresponding models provide insights on when multiple gene expression data can be integrated for meta-clustering and guidelines on experimental designs. Simulation studies and real data analyses show the advantages of our proposed models over state-of-the-art methods, especially when performing differential inference for scRNA-seq data collected from multiple conditions.