An 'oracle' for anticipating gene regulation evolution

By Science Gazette, 10 March 2022

Computational biologists have developed a neural network model that predicts how changes to non-coding DNA sequences in yeast affect gene expression. They also devised a novel two-dimensional representation of these data, making it possible to understand the past and future evolution of non-coding sequences in organisms beyond yeast – and even to design custom gene expression patterns for gene therapies and industrial applications.

Despite the sheer number of genes in each human cell, these so-called “coding” DNA sequences make up just 1 percent of our entire genome. The remaining 99 percent is “non-coding” DNA, which, unlike coding DNA, does not carry instructions for building proteins.

One important role of non-coding DNA, often called “regulatory” DNA, is to help switch genes on and off, thereby controlling how much (if any) of a protein is made. As cells copy their DNA to grow and divide, mutations often arise in these non-coding regions, sometimes altering their function and changing the way they control gene expression. Many of these mutations are inconsequential, and some are even beneficial. Occasionally, however, they are linked to an increased risk of common illnesses, such as type 2 diabetes, or more serious diseases, such as cancer.

To better understand the consequences of such mutations, researchers have been hard at work on mathematical maps that let them look at an organism’s genome, predict which genes will be expressed, and estimate how that expression will affect the organism’s observable traits. These maps, known as fitness landscapes, were conceived roughly a century ago to understand how genetic makeup affects one specific measure of organismal fitness: reproductive success. Early fitness landscapes were quite simple, often focusing on a small number of mutations. Although researchers now have access to far richer data sets, they still need additional tools to characterize and analyze such complex data. This capability would not only improve our understanding of how particular genes have changed over time, but would also help predict what sequence and expression changes might occur in the future.

In a new paper published on March 9 in Nature, a team of scientists presented a framework for studying the fitness landscapes of regulatory DNA. They developed a neural network model that, after being trained on hundreds of millions of experimental observations, could predict how changes to non-coding sequences in yeast affected gene expression. They also devised a novel two-dimensional representation of the landscapes, making it possible to understand the past and predict the future evolution of non-coding sequences in organisms other than yeast – and even to design custom gene expression patterns for gene therapies and industrial applications.

“We now have an ‘oracle’ that can be consulted to answer the question: What if we attempted every potential mutation of this sequence? Alternatively, what new sequence might we devise to get the required expression?” says Aviv Regev, senior author of the paper, a professor of biology at MIT (on leave), a core member of the Broad Institute of Harvard and MIT (on leave), and head of Genentech Research and Early Development. “Scientists may now apply the model to their own evolutionary topic or situation, as well as to other difficulties such as creating sequences that regulate gene expression in desirable ways. I’m particularly enthusiastic about the opportunities for machine learning researchers interested in interpretability; they can ask their questions backwards to better understand the underlying biology.”

Before this work, many researchers had simply trained their models on known mutations (or slight variations thereof) found in nature. Regev’s team, by contrast, wanted to go a step further by building unbiased models capable of predicting an organism’s fitness and gene expression from any possible DNA sequence – even sequences they had never seen before. Such models would also let researchers engineer cells for pharmaceutical purposes, such as developing new treatments for cancer and autoimmune disorders.

To achieve this aim, Eeshit Dhaval Vaishnav, a graduate student at MIT and co-first author, Carl de Boer, now an associate professor at the University of British Columbia, and their colleagues developed a neural network model to predict gene expression. They trained it on data generated by inserting millions of completely random non-coding DNA sequences into yeast and observing how each random sequence affected gene expression. They focused on a particular group of non-coding DNA sequences known as promoters, which act as binding sites for proteins that can switch nearby genes on or off.
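
As a rough illustration of the kind of input a sequence-to-expression model might consume, the sketch below generates a few random DNA sequences and converts each one into a one-hot matrix, a common numeric encoding for sequence models. The 80-base length, the helper names random_promoter and one_hot, and the encoding itself are assumptions chosen for illustration; the article does not describe the team's actual data format or code.

```python
import random

BASES = "ACGT"

def random_promoter(length, rng):
    """Return a random DNA string (hypothetical helper, not the authors' code)."""
    return "".join(rng.choice(BASES) for _ in range(length))

def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows, one row per base."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]

# Assumed sequence length of 80 bases and a tiny library of 5 sequences.
rng = random.Random(0)
library = [random_promoter(80, rng) for _ in range(5)]
encoded = [one_hot(seq) for seq in library]

# Each encoded sequence has 80 rows, and each row marks exactly one base.
print(len(encoded[0]), len(encoded[0][0]))
```

In a real experiment each encoded sequence would be paired with a measured expression level, and a model would be fitted to predict the second from the first.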

“This study demonstrates what possibilities emerge when we create new types of experiments to gather the proper data to train algorithms,” adds Regev. “In a larger sense, I think these techniques will be helpful for many challenges, such as understanding genetic variations in regulatory areas of the human genome that impart disease risk, but also anticipating the effect of combinations of mutations, or developing novel compounds.”

Regev, Vaishnav, de Boer, and their coauthors went on to test their model’s predictive abilities in a number of ways, demonstrating how it could help explain the evolutionary history – and likely future – of specific promoters. “Creating an accurate model was obviously an achievement,” Vaishnav says, “but it was really only a beginning point for me.”

To see whether their approach could aid synthetic biology applications such as the production of antibiotics, enzymes, and food, the researchers practiced designing promoters that could generate desired expression levels for any gene of interest. They then combed through other scientific papers for fundamental evolutionary questions to see whether their model could help answer them. The researchers even fed their model a real-world population data set from an existing study that included genetic information from yeast strains from around the globe. In this way, they were able to outline thousands of years of past selection pressures that shaped the genomes of today’s yeast.
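
The idea of designing a sequence to hit a target expression level can be sketched as a simple search: score every single-base mutant with a predictor and keep the one closest to the target, repeating until nothing improves. The code below is a minimal sketch under stated assumptions, not the researchers' method. The toy_oracle function is a hypothetical stand-in that just returns the fraction of G and C bases, and the greedy strategy and all names are illustrative.

```python
BASES = "ACGT"

def toy_oracle(seq):
    """Hypothetical stand-in for a trained model: fraction of G/C bases."""
    return sum(base in "GC" for base in seq) / len(seq)

def single_mutants(seq):
    """Yield every sequence that differs from seq at exactly one position."""
    for i, ref in enumerate(seq):
        for alt in BASES:
            if alt != ref:
                yield seq[:i] + alt + seq[i + 1:]

def design(seq, target, predict, max_steps=50):
    """Greedy search: take the best single mutation until no mutant is closer."""
    best, best_err = seq, abs(predict(seq) - target)
    for _ in range(max_steps):
        candidate = min(single_mutants(best),
                        key=lambda s: abs(predict(s) - target))
        err = abs(predict(candidate) - target)
        if err >= best_err:
            break  # no single mutation gets closer to the target
        best, best_err = candidate, err
    return best

start = "ATATATATATATATATATAT"
designed = design(start, target=0.5, predict=toy_oracle)
print(designed, toy_oracle(designed))
```

Swapping toy_oracle for a real predictor would leave the search loop unchanged, which is the appeal of treating the model as an oracle that can be queried for any sequence.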

However, to build a powerful tool capable of probing any genome, the researchers knew they would need a way to predict the evolution of non-coding sequences even without such a large population data set. To achieve this aim, Vaishnav and his colleagues devised a computational technique that let them plot the predictions from their framework on a two-dimensional graph. This allowed them to show, in a fairly straightforward way, how any non-coding DNA sequence would affect gene expression and fitness, without the need for time-consuming experiments at the lab bench.

“One of the unresolved challenges in fitness landscapes was that we didn’t have a method for visually representing them in a manner that truly conveyed the evolutionary features of sequences,” Vaishnav continues. “I was determined to find a method to fill that need and contribute to the long-held goal of building a comprehensive fitness landscape.”

The study, according to Martin Taylor, a professor of genetics at the University of Edinburgh’s Medical Research Council Human Genetics Unit who was not involved in the research, shows that artificial intelligence can not only predict the effect of regulatory DNA changes, but also reveal the underlying principles that govern millions of years of evolution.

Even though the model was trained on only a small fraction of yeast regulatory DNA under a few growth conditions, Taylor is impressed that it can make such useful predictions about the evolution of gene regulation in mammals.

“There are clear near-term uses,” he says, “such as the custom design of regulatory DNA for yeast in brewing, baking, and biotechnology. However, future research might help detect disease mutations in human regulatory DNA that are now difficult to find and usually ignored in the clinic. This research implies that AI models of gene regulation trained on larger, more complicated, and varied data sets have a promising future.”

Even before the paper was published, Vaishnav began receiving inquiries from other researchers hoping to use the model to design non-coding DNA sequences for gene therapies.

“For decades, people have been researching regulatory evolution and fitness landscapes,” Vaishnav explains. “I believe our paradigm will help us answer basic, unanswered issues regarding the development and evolvability of gene regulatory DNA – and perhaps design biological sequences for fascinating new uses.”