A Mathematician Uses AI to Find Meaning in Genomic Data
Imagine a murmuration of starlings on the wing: the flock wheels and turns as though a single organism, rising, falling, shifting in shape and form in response to cues indiscernible to the casual viewer. Zoom in and the independent behaviors of each bird that contribute to the flock’s coordinated movements become clear.
So, too, the genes in each of our cells operate in networks, activated or silenced by RNA in response to a wide array of environmental cues, at scales from single proteins within the cytoplasm of the cell itself to conditions within each tissue type and even the organism as a whole. Columbia Mailman School Assistant Professor of Biostatistics Wenpin Hou uses machine learning to analyze massive datasets from human tissue samples to reveal the biochemical activity within single cells. “My goal is to extract meaningful information from real-world, messy data using computational methods and software to drive forward the understanding of how and when genes are regulated, and how we can control gene regulation in disease prevention and treatment,” she says.
Hou’s work has already yielded multiple open-source software programs. Her latest, GPTCelltype, applies artificial intelligence to the laborious process of manually annotating the array of cell types found in complex tissue samples for single-cell RNA analyses.
You combine theoretical mathematics and statistics with emerging computational methods, machine learning, and artificial intelligence to investigate how our genes respond to our environment. How did you get started?
Hou: In my undergrad studies, I noticed the elegant ways that differential equations could capture gene behaviors. That sparked a deep dive into gene regulatory networks for my PhD. The beauty of mathematics applied to biology was irresistible. I was amazed by how artfully mathematics could capture those delicate gene dynamics and regulations.
How did you make the jump from theoretical analyses to applied biomedical and public health research?
Hou: During my PhD, I worked on the theory behind modeling gene regulatory networks, intending to find gene therapies. Later, at MD Anderson Cancer Center, I worked to develop this concept for cancer therapies, but the stark reality of noisy and incomplete data stood in the way. This challenge steered me from theory to hands-on computational genomics. I discovered a passion for solving real health problems and realized the impact my research could have on disease therapy and prevention—a truly pivotal experience.
Why can’t theoretical computational models lead directly to clinical applications?
Hou: Theoretical models often assume “clean” data, which is in stark contrast to the complexity and noise inherent in real-world patient data. This discrepancy creates significant challenges when attempting to apply these models clinically. By integrating various types of gene regulation data—including gene expression, DNA methylation, histone modification, transcription factor activity, and targeted perturbations—we have the potential to bridge this gap and catalyze the creation of new therapies.
You currently lead two five-year, NIH-funded projects that have been awarded nearly $1.9 million. Each investigates the spatial landscapes of DNA methylation and gene regulatory networks. What does this mean?
Hou: DNA methylation, an epigenetic modification that can modulate gene expression, stands at the interface of genome, environment, and development. Detailing how that process unfolds in space and time to determine gene expression could yield valuable target information for early diagnosis and drug treatment. My goal is to develop methods that can predict DNA methylation landscapes and create maps of the spatial relationships among those cells.
Can you provide an example from your ongoing work on maternal-fetal health?
Hou: I approached pediatrician and molecular epidemiologist Dr. Xiaobin Wang in 2019 to work with her on the Boston Birth Cohort, which combines demographic data and DNA methylation data. I’ve been able to contribute to a series of analyses of how maternal smoking affects genes associated with overweight and obesity in children, and how certain prenatal vitamins may buffer the effects of maternal smoking on newborn gene expression.
ChatGPT is in the news. You’ve looked at using GPT AI models for biomedical research. What do you see as the power and pitfalls of AI in your line of work?
Hou: If the machine can do well in this kind of task, it has the potential to equip experts and increase the efficiency of the work in the pipeline. On the other hand, humans provide our input to GPT-4, and the way we prompt it—the way we convey our instructions—affects the results, especially when we try to apply GPT-4 to large datasets. We recommend that human experts confirm the validity of the output from GPT-4.
You cofounded a statistical genetics and genomics working group that sponsors an ongoing seminar series. Who can participate?
Hou: We bring together faculty, fellows, and grad students from biostatistics, computational biology, engineering, and medicine from both inside and beyond Columbia to discuss broad applications of methodologies for investigating how the genome and genes affect biological function and human health. All of our invited talks are hosted virtually, and the public is welcome to attend. We discuss the latest progress in the field, including new statistical and computational methodology, new data resources in genetics and genomics, and new bioengineering technologies.