
Data Science Summit Spotlights Rapid Advances in Research Methods
Artificial intelligence and other quickly advancing data science techniques are transforming public health science, according to researchers at the recent Data Science in Public Health Summit. The January 31 event brought together faculty leaders and researchers from across Columbia University and beyond to discuss how they are using data science in climate research and pandemic preparedness, as well as the role of large language models (LLMs) like ChatGPT in public health and society.
The daylong event was organized by Jeff Goldsmith, Associate Dean of Data Science and associate professor of Biostatistics, with Gary Miller, Vice Dean for Research Strategy and Innovation, and Kiros Berhane, chair of Biostatistics. (Watch a video of the Summit below.)
In opening remarks, Dean Linda P. Fried spoke about the recent development of data science at Columbia Mailman, noting that the first Data Science in Public Health Summit took place in January 2020, immediately prior to the global COVID-19 pandemic, and that two other Summits have been held since then. Dean Fried emphasized the importance of researchers learning from each other and sharing how they are harnessing data science tools for public health. More broadly, she said, public health science has repeatedly demonstrated that “health is malleable, and we can transform the opportunity for health for everyone.”
Across Columbia University, data science is ascendant, and the University is leading the way in harnessing data science to advance research for the public good. Jeannette M. Wing, Columbia’s Executive Vice President for Research, spoke about Empire AI, a consortium of six universities, including Columbia, building a computing cluster in Buffalo with support from the Simons Foundation. While much of data science is about “big compute and big data,” Wing said there is new interest in what is possible at a smaller scale, adding that there are lessons to be drawn from DeepSeek, a new LLM known for doing more with less. “It’s quite impressive,” she said. Garud Iyengar, Avanessians Director of the Columbia University Data Science Institute, presented ongoing data science initiatives in cancer and climate research and pointed to the availability of seed funding for using AI in public health projects. “Public health in my mind is one of the most essential areas where data science and AI are driving very meaningful changes,” Iyengar concluded. (He urged everyone to attend the Columbia AI Summit, the biggest such event at Columbia to date, happening on March 4.)
Data Science Summit keynote speaker Scott L. Zeger, John C. Malone Professor of Biostatistics and Medicine at Johns Hopkins School of Public Health and School of Medicine, spoke about an initiative he leads called inHealth. A decade in the making, inHealth draws from genetic insights, patient health history, environmental factors, and more, in service of personalized medicine. The goal is to help clinicians make informed decisions about treatments by predicting future outcomes. “In order to do right by individuals, you need to have a public health data science perspective,” said Zeger.
A panel discussion on pandemic preparedness led by Wafaa El-Sadr, director of ICAP at Columbia and lead on the NYC Pandemic Response Institute, emphasized cross-agency collaboration, community input, and forward-thinking strategies to strengthen public health data infrastructure. Gretchen Van Wye, an epidemiologist at the New York City Department of Health and Mental Hygiene, spoke about the Health Department’s efforts to modernize its data systems to address challenges exposed by COVID-19. That work builds on a history of mortality tracking dating to 1804 and aligns with the CDC’s Data Modernization Initiative. Jeffrey Shaman, professor in Environmental Health Sciences, introduced his work on infectious disease modeling, stressing transparent communication of uncertainties. Sam Sia, Professor of Biomedical Engineering and Vice Provost for the Fourth Purpose and Strategic Impact, discussed innovations in decentralized health sensors, such as rapid testing and wastewater monitoring.
A second panel on data science in climate and health was moderated by Marianthi-Anna Kioumourtzoglou, associate professor in Environmental Health Sciences. Donald Edmondson, director of the Center for Behavioral Cardiovascular Health at Columbia University Irving Medical Center (CUIMC), highlighted the link between rising temperatures and cardiovascular health risks, particularly in New York City, where heat exacerbates health disparities due to factors like urban heat islands. Oren Pizmony-Levy, Associate Professor of International and Comparative Education at Teachers College, shared findings on climate change education, revealing gaps in teacher engagement and student interest. Tian Zheng, chair of Statistics, addressed statistical challenges in climate data science, such as hybrid systems modeling and integrating diverse data types, emphasizing the need for scalable machine learning solutions.
A final panel on LLMs moderated by Ying Wei, professor in the Department of Biostatistics, highlighted advancements in AI for genomics and health care, alongside ongoing challenges in data privacy, reproducibility, and cross-institutional collaboration. Wenpin Hou, assistant professor in Biostatistics, presented her recent work on epigenomics, including using DNA methylation to predict gene expression in cancer. Gamze Gürsoy, Assistant Professor of Biomedical Informatics at CUIMC, addressed privacy and data silo issues in biomedical research, advocating for federated learning to train models across hospitals without sharing raw data. Andrew Rundle, professor in Epidemiology, described several studies that tested the value of ChatGPT-4 in various aspects of a public health research workflow. The results were mixed, though more detailed prompts were generally more successful. Overall, however, Rundle said graduate students are still more accurate and consistent than AI.
Read about previous Columbia Mailman Data Science for Public Health Summits in 2020, 2023, and 2024.