Data Science

A thriving new era of data science for health at Columbia University

The last two decades have been a period of tremendous growth and transformation for the field of public health in general, and the discipline of biostatistics in particular. This is due to the increasing abundance of big, high-dimensional data in almost every branch of public health and to the emergence of data science for health. As a distinct branch of data science, data science for health deals with this new deluge of information as a fusion of biostatistics, computer science, bioinformatics and almost all other disciplines in public health, such as epidemiology and environmental health.

At Columbia University, the establishment of the Data Science Institute in 2012 (one of the earliest in the United States) was a pioneering moment, as it heralded the university’s acknowledgement of the importance of data science and its commitment to developing the field across the university.

As one of the world’s leading schools of public health, the Mailman School of Public Health has been at the forefront of this revolution from the very beginning. It has led both in generating high-quality, high-dimensional data, which make up an increasingly significant proportion of the research portfolio of all its substantive area departments and centers at local, national and global levels, and in developing cutting-edge modeling and prediction techniques at the interface of hypothesis-driven, estimation- and inference-focused foundational biostatistics and the emerging areas of machine learning and artificial intelligence. The result has been a vehicle for unprecedented growth and transformation of public health research and training.

School Level Activities

The Mailman School of Public Health has shown significant commitment to the growth of data science with a focus on health, within the university-wide framework of the Columbia Data Science Institute. The commitment has been multifaceted, including:

  1. the promotion of public health data science in general,
  2. the investment in strategic hiring of numerous faculty with expertise in data science for health (especially in the Department of Biostatistics, but also throughout the school), 
  3. the arrangement of research incentives that have led to a growing multi-disciplinary research portfolio with increasing focus on big-data driven team science,
  4. the creation of educational curricula on data science for health throughout the school, providing a foundation that will propel a range of scientific areas, such as environmental health,
  5. the creation of the Artificial Intelligence/Machine Learning Laboratory led by Prof. Ying Wei, offering a cutting-edge platform for precision public health,
  6. recruitment of multidisciplinary data science faculty focused on the ethical issues associated with AI, and
  7. the creation of a new schoolwide leadership role, Associate Dean for Data Science for Health (Dr. Jeff Goldsmith).

The school’s commitment to data science for health has been evidenced by its hosting of two important summits on the topic: an inaugural national summit in early 2020, involving almost all schools and programs of public health across the United States, and a second summit in early 2023, with inward-looking deliberations to refine and advance the school’s leadership in this burgeoning field. A well-attended Grand Rounds on Data Science of Health was also held in 2021, featuring Dr. Eric Tchetgen Tchetgen from the University of Pennsylvania, a leading figure in the field, as keynote speaker, with young data scientists from several departments at Mailman as panelists.

Departmental Activities

Not surprisingly, the technical quantitative hub for data science of health at the Mailman School of Public Health is the Department of Biostatistics. But, true to the multi-disciplinary nature of data science, all departments and centers at Mailman are very active in advancing data science of health in their research portfolios and educational activities.

Department of Biostatistics

The department has a strong track record as one of the oldest and leading departments in the nation (and indeed the world) in developing impactful new analytic methods that are mathematically rigorous but also firmly grounded in real-life public health problems. The department has continued this history of leadership by taking significant strategic steps to respond to the need for big-data oriented research throughout the school and to become a leader both in research on cutting-edge new methods and in training the next generations of health data scientists. One strong example of this leadership has been on display since the emergence of the COVID-19 pandemic, during which departmental faculty and trainees deployed modeling and forecasting efforts at the domestic and global levels, created spatio-temporally detailed real-time dashboards to inform the public of trends in the pandemic, and played crucial roles in the many clinical trials conducted in the search for vaccines and treatments for COVID-19 patients. Broad activities of the Department of Biostatistics include:

  • Expanding data science oriented research portfolio: Several departmental faculty are conducting groundbreaking research that is shaping the future of data science of health in several directions. In addition to the well-established faculty who are already making their mark on data science oriented research, the department hired seven talented new junior faculty with expertise in various branches of data science, including algorithmic fairness (which is highly relevant to the equitable use of data science), deep learning, federated learning, methods for high-dimensional multi-omics and genomics data, and methods for high-dimensional data on climate and health. As part of the department’s strategic plan, a new AI/ML Lab has been established under the leadership of the Vice Chair of Research, Dr. Ying Wei, and partnerships have been strengthened with relevant Columbia units such as the Department of Biomedical Informatics and the CTSI.
  • Curriculum revisions and new educational programs: The department undertook a major revision of its curriculum a few years ago to introduce two year-long sequences of courses in data science (one oriented towards the master’s programs, and another towards the doctoral programs). In recognition of the importance of data science, the department introduced a new track in Public Health Data Science. After only three years in existence, the track has quickly emerged as a student favorite, and it will become the most chosen track in our incoming cohort of over 110 MS students for Fall 2023. The department also renamed its certificate program from “Applied Biostatistics” to “Applied Biostatistics and Public Health Data Science” a few years ago. Many other existing courses have been revamped to include data science principles, and new courses are continuously being introduced in many areas of data science for health. The department has also been leading the effort to incorporate data science competencies at the school level (e.g., in the MPH Core course offerings). Curriculum development and revision in response to the quickly evolving landscape in data science was a major topic at the department’s retreat in September 2022.
  • Pipeline programs for undergraduate students: The department has been home for more than a decade to a pioneering pipeline program, the Biostatistics and Epidemiology Summer Training (BEST) program with focus on diversity – originally initiated in 2009 by two of the department’s own doctoral students with seed money from the Mailman Dean’s Office and now continuously funded by NIH. The trainees in this program are matched with mentors from across the school and the mentoring projects they have been conducting are increasingly oriented towards data science topics. Moreover, the department was awarded another new summer pipeline program, SIBDS@Columbia, from NIH with exclusive focus on data science under its SIBS program. These two pipeline programs are ensuring that promisingly talented undergraduate students from diverse backgrounds are attracted to the field of public health data science.
  • International programs in data science for health: The student body in the department’s educational programs is composed of students from all over the world, and, true to Columbia’s stature as a global university, the department is also actively involved in training and research programs in public health data science at the global level. To mention one example, the department is home to the Advancing Public Health Research in Eastern Africa through Data Science Training (APHREA-DST) program, part of NIH’s major Data Science Initiative for Africa (DS-I Africa). Under this program, the department is partnering with the University of Nairobi in Kenya and Addis Ababa University in Ethiopia to establish MS degree programs in Public Health Data Science.
  • Interdepartmental partnerships: The department has been partnering with other units at Columbia Mailman to jointly advance public health data science training and research. As a typical example, the department partnered with the Department of Environmental Health Sciences in the launch of its new track in Environmental Health Data Science. Department faculty are also highly sought-after instructors on data science topics in summer workshop and bootcamp programs, namely the Skills for Health and Research Professionals (SHARP) program, which is mainly run by the Department of Environmental Health Sciences in partnership with our department for its data science offerings, and the episummer@Columbia program, which is run by the Department of Epidemiology.
  • Department faculty are routinely involved in team science and data science oriented grant proposals, and serve on dissertation committees of doctoral programs throughout Mailman as biostatistical and data science experts.