Deep Learning Predicts Expression from DNA Sequence

A new deep learning model acts like an "oracle" that can predict a gene's expression from its DNA sequence, says Eeshit Vaishnav, first author of the study published last week in Nature.

Consult the Oracle

For the study, researchers began by cloning random promoter sequences, each 80 base pairs in length, into S. cerevisiae (yeast) cells, upstream of a gene that encodes a yellow fluorescent protein. Cells were then sorted into 18 different 'bins' based on the expression level, or amount, of the fluorescent protein in each cell. The cells in each bin were then sequenced to determine which promoters were present.
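The sorting-and-sequencing step can be pictured with a small sketch. This is only an illustration under stated assumptions, not the study's actual pipeline: it assumes equal-width bins over a fluorescence range, and it assumes a promoter's expression is summarized as the read-weighted average of the bins it was found in. The function names assign_bin and mean_bin, and all the numbers, are hypothetical.

```python
def assign_bin(value, low, high, n_bins=18):
    """Map a fluorescence value to a bin index in 0..n_bins-1.

    Assumption: bins are equal-width over [low, high]; values outside
    the range are clamped into the first or last bin.
    """
    if high <= low:
        raise ValueError("high must be greater than low")
    clamped = min(max(value, low), high)
    width = (high - low) / n_bins
    index = int((clamped - low) / width)
    return min(index, n_bins - 1)  # the top edge belongs to the last bin


def mean_bin(read_counts):
    """Summarize one promoter as the read-weighted mean of its bin indices.

    read_counts maps bin index -> number of sequencing reads for this
    promoter in that bin (hypothetical data layout).
    """
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("promoter has no reads")
    return sum(b * n for b, n in read_counts.items()) / total


# Hypothetical fluorescence readings on a 0-18 scale.
bins = [assign_bin(v, 0.0, 18.0) for v in (0.2, 9.5, 17.9)]

# Hypothetical promoter seen 10 times in bin 3 and 30 times in bin 4.
expression_estimate = mean_bin({3: 10, 4: 30})
```

A weighted mean is only one plausible way to turn bin counts into a single expression value; the choice here is for clarity.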

A convolutional neural network was trained on the data, which included more than 20.6 million random promoter sequences for yeast cells grown in a defined medium, and more than 30.7 million random promoter sequences for yeast grown in a complex medium. The neural network generalized well, predicting expression levels with a Pearson's r of 0.960. An r of 1 would indicate a perfect linear relationship between two variables; in this case, the expression predicted from DNA sequence and the expression actually measured.
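For readers who want to see what Pearson's r measures, here is a minimal sketch that compares predicted values against measured ones. The expression values are made up for illustration and are not from the study; the helper name pearson_r is also just a label chosen here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance, up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances, up to the same factor).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression levels for five promoters.
measured = [2.0, 5.5, 9.0, 12.5, 16.0]
predicted = [2.4, 5.1, 9.6, 12.0, 15.8]

r = pearson_r(measured, predicted)
print(f"Pearson's r = {r:.3f}")
```

The closer the predictions track the measurements along a straight line, the closer r gets to 1; predictions that move in the opposite direction push it toward -1.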

The new model is 45 percent less error-prone than a prior model trained to perform this task.

In further experiments, researchers used the predictive model to 'discover' promoter sequences that would produce lots of protein (a.k.a. drive high expression). They synthesized 500 of these predicted sequences, cloned them into yeast, and tested each of them. The predicted promoters, on average, drove higher gene expression than 99 percent of promoters native to the yeast genome.

Summary: Gene expression can be predicted, ahead of time, for arbitrary promoter sequences in the model organism, S. cerevisiae. These models could help engineers design DNA sequences that produce a desired amount of protein in fewer experiments.

Read more in Nature.

