2025 Columbia Biostatistics Annual Symposium (CBAS)

Transforming Public Health Through Innovation
This year’s one-day event will showcase our newly developed statistical innovations and highlight how they advance clinical practices and influence public health policy. We will also celebrate our educational achievements in training the next generation of biostatisticians, emphasizing the importance of effective collaboration and transdisciplinary research. CBAS is designed to engage the broader scientific community in a free exchange of research ideas and implementation for the future of health-related research and learning. Please join us to see how, as a department, we work across the university and beyond to advance biomedical research through methodological innovations.
More information on individual talks, speakers, and sessions will continue to be added to the schedule below.
2025 Schedule
This year's symposium will take place on April 7th, from 8:30am - 5:00pm. It will feature morning and afternoon sessions with meals provided.
All sessions will take place in the 8th Floor Auditorium of the Allan Rosenfield Building, 722 West 168th Street. Lunch will be served in the Riverview Lounge in the same building.
8:30am – 9:00am | Registration & Breakfast
A breakfast assortment of pastries and bagels alongside tea & coffee will be served. Please register in advance for the event here to ensure your spot.
9:00am – 9:05am | Welcome & Program Introduction
A Symposium Overview will be presented by Dr. Kiros Berhane and the Planning Committee Co-Chairs. This will be followed by brief opening remarks.
9:05am – 10:30am | Scientific Session - The Power of Interdisciplinary Collaboration in Psychiatry: The Role of Biostatistics and Digital Technologies in Advancing Mental Health Care
Speakers:
Ken Cheung, PhD (Professor of Biostatistics, Columbia University)
"The role of digital technologies in interventional studies"
Linda Valeri, PhD (Assistant Professor of Biostatistics, Columbia University)
"Causal inference framework for non-stationary time series with missing data from mHealth studies in Psychiatry"
Joshua Gordon, MD, PhD (Chair of Department of Psychiatry, Columbia University)
"Computational Approaches in Psychiatry"
Discussant:
Melanie Wall, PhD (Professor of Biostatistics [in Psychiatry], Columbia University)
10:30am – 11:00am | Break & Poster Viewing
This break provides an opportunity to discuss the talks, network, and view the student poster competition in the ARB 8th Floor Lobby.
11:00am – 12:30pm | Scientific Session - The Future of AI in Health: Integrating Data, Statistics, Engineering, and Domain Science for Trustworthy and Actionable AI for Health
Speakers:
Weijie Su (Associate Professor of Statistics and Data Science, Associate Professor of Computer and Information Science, University of Pennsylvania)
"Do Large Language Models Need Statistical Foundations?"
Abstract: In this talk, we advocate for the development of statistical foundations for large language models (LLMs) and explain why such an endeavor is both necessary and achievable. We present two fundamental characteristics of LLMs that necessitate statistical frameworks: their probabilistic, autoregressive nature in next-token prediction, and their complex, non-unique architectures whose underlying mechanisms remain largely opaque. We identify key areas where statistical foundations could advance LLM development, including uncertainty quantification, calibration, watermarking, evaluation, data mixture optimization, fairness, and privacy. We illustrate our argument through examples demonstrating how statistical insights can help develop frameworks for LLM watermarking and fairness alignment.
Tian Gu (Assistant Professor of Biostatistics, Columbia University)
"Synthetic Data for Privacy-Preserving Learning in Multi-Site Settings"
Abstract: The rise of generative AI has underscored the urgent need for methodologies that ensure privacy and generalizability—especially in contexts involving sensitive medical data. In this talk, I will discuss a line of work centered on the use of synthetic data to enable responsible learning across multiple sites. Building on recent advances in statistical learning and generative modeling, I will highlight how synthetic data can help mitigate data heterogeneity, support missing data imputation, and facilitate model development without compromising individual privacy. I will also reflect on the challenges of evaluating the utility and fidelity of synthetic data, and how statistical tools can guide the design of privacy-preserving and trustworthy AI systems.
Qixuan Chen (Associate Professor of Biostatistics, Columbia University)
"Bridging Non-Probability Samples and Population Inference in the Age of AI"
Abstract: Probability surveys have long been considered the gold standard for population inference. However, their high cost and declining response rates have raised concerns about their scalability and representativeness. In contrast, non-probability samples—often passively collected through digital platforms, electronic health records, or online studies—are abundant and central to modern data science and AI applications. But can these data reliably support population-level inference? In this talk, I present four case studies where we developed novel statistical frameworks to enhance inference from non-probability samples by integrating them with administrative data or traditional surveys. We focus on data-rich environments with high-dimensional auxiliary information, reflecting the types of datasets increasingly leveraged in artificial intelligence. When individual-level auxiliary data are available, we propose a regularized predictive inference framework that builds on Bayesian additive regression trees (BART) and its extensions to generate robust outcome predictions across the population. We further extend these methods to handle two-phase designs, privacy-preserving data settings, and cases where only population-level summaries are available. These case studies—drawn from mental health and HIV research—illustrate how AI-adjacent tools such as machine learning, predictive modeling, and high-dimensional data integration can improve the generalizability, fairness, and trustworthiness of findings from non-probability samples.
12:30pm – 1:30pm | Lunch in the Riverview Lounge
Sandwiches and refreshments will be served in the Riverview Lounge for attendees.
1:30pm – 2:00pm | Poster Viewing
We encourage attendees to view the posters in the 8th Floor Auditorium Lobby.
2:00pm – 3:00pm | Keynote Speaker - Dr. Bhramar Mukherjee: "The Importance of Statistical Thinking in an AI-augmented World"
Dr. Bhramar Mukherjee of the Yale School of Public Health will present the Keynote Speech. You can read more about her further on this page.
"The Importance of Statistical Thinking in an AI-augmented World"
In this presentation, I will first delve into the obvious: AI algorithms and systems developed on exclusionary datasets can lead to erroneous conclusions and misguided policies. However, while we strive for data equity and wait for the ideal scenario of globally representative and extensive datasets or training corpora, statisticians play a pivotal role in mitigating systematic sources of bias in analyzing LARGE healthcare data, an expertise that few other quantitative disciplines possess. I will illustrate my point with two examples: (1) handling selection bias and outcome misclassification in analyzing electronic health records, and (2) combining data across multiple biobanks/healthcare systems under heterogeneous sampling strategies. I will conclude the talk with a call to arms for statisticians to lead efforts in creating, curating, and collecting data and in pioneering new scientific studies, not just remain on the design and analytic fringes. As public health statisticians, our job is not just to predict efficiently, but to prevent effectively.
3:00pm – 3:30pm | Break & Poster Viewing with Refreshments
This break provides an opportunity to discuss the talks, network, and view the student poster competition in the ARB 8th Floor Lobby. Tea & coffee will be served.
3:30pm – 4:30pm | Education Panel - Biostatistics Education in the Era of AI: Current State & Future Directions
In this session, we will have an in-depth discussion of the current state and future directions of biostatistics graduate education in the era of data science and AI. The various degree programs at Columbia Biostatistics will be discussed as case examples in relation to the need for an optimal balance between theoretical rigor and skills in emerging algorithmic techniques, the need for dynamic curricular revisions to keep up with emerging tools, and the need to promote interdisciplinarity by integrating collaborative research with cutting-edge innovations in the biomedical sciences.
Moderator: Kiros Berhane, PhD (Chair of Biostatistics)
Panel:
Christine Mauro, PhD (Biostatistics MS Program Director)
4:30pm – 5:00pm | Winners Presentation & Closing Remarks
The winners of the poster competition will be presented, and closing remarks on the day will be given.
Keynote Speaker
Dr. Bhramar Mukherjee
Professor Bhramar Mukherjee is currently appointed as Anna M.R. Lauder Professor of Biostatistics and Professor of Chronic Disease Epidemiology at the Yale School of Public Health. Professor Mukherjee serves as the inaugural Senior Associate Dean of Public Health Data Science and Data Equity at the school. She holds a secondary appointment in the Department of Statistics and Data Science at Yale. Prior to joining Yale University in 2024, Dr. Mukherjee built a distinguished career at the University of Michigan, where she was appointed as John D. Kalbfleisch Distinguished University Professor of Biostatistics and the first woman Chair of the Department of Biostatistics (2018-2024). She is known for her contributions to statistical methods for the integration of genetic, environmental, and disease data from large healthcare databases. She is the winner of many awards, including the 2023 Karl Peace Award from the ASA for betterment of society through statistics and the 2024 Marvin Zelen Leadership in Statistical Science Award from Harvard Biostatistics. She is a fellow of the ASA and AAAS and an elected member of the US National Academy of Medicine. She has written more than 400 articles and supervised 22 PhD and 4 post-doctoral scholars. She is the founding director of a flagship undergraduate summer program on big data. Starting January 1, 2025, she is the President-elect of ENAR, an eminent professional society for biostatisticians.