Deep Learning Predicts Expression from DNA Sequence

A new deep learning model acts like an "oracle" that can predict a gene's expression from its DNA sequence, says Eeshit Vaishnav, first author of the study published last week in Nature.

This is from Codon, my weekly newsletter.

Consult the Oracle

For the study, researchers began by cloning random promoter sequences, each 80 base pairs long, into S. cerevisiae (yeast cells), upstream of a gene that encodes a yellow fluorescent protein. Cells were then sorted into 18 'bins' based on how much of the fluorescent protein each cell expressed. The cells in each bin were then sequenced to determine which promoters were present.
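To make the sort-and-sequence step concrete, here is a minimal Python sketch of how per-bin sequencing reads could be turned into one expression estimate per promoter. It assumes the estimate is a read-weighted average of bin indices; that scoring rule, the 0 to 17 bin numbering, and the function names are illustrative assumptions, not details taken from the study.

```python
from collections import defaultdict

NUM_BINS = 18  # number of sorting bins described above


def mean_bin(counts_by_bin):
    """Read-weighted average bin index for one promoter.

    counts_by_bin maps a bin index (0..NUM_BINS-1) to a read count.
    Assumption: higher bin index means brighter cells.
    """
    total = sum(counts_by_bin.values())
    if total == 0:
        raise ValueError("promoter has no reads")
    return sum(b * n for b, n in counts_by_bin.items()) / total


def expression_estimates(reads):
    """Collapse (promoter_sequence, bin_index) reads into one score per promoter."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq, bin_index in reads:
        if not 0 <= bin_index < NUM_BINS:
            raise ValueError(f"bin index out of range: {bin_index}")
        counts[seq][bin_index] += 1
    return {seq: mean_bin(c) for seq, c in counts.items()}


if __name__ == "__main__":
    # Hypothetical reads: short made-up sequences, not real promoters.
    example_reads = [("ACGT", 0), ("ACGT", 2), ("TTTT", 17)]
    print(expression_estimates(example_reads))
```

A promoter seen mostly in the brightest bins gets a high score, and one seen mostly in the dimmest bins gets a low score.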

A convolutional neural network was trained on the data, which included more than 20.6 million random promoter sequences from yeast grown in a defined medium and more than 30.7 million from yeast grown in a complex medium. The network generalized well, predicting expression levels with a Pearson's r of 0.960. An r of 1 would indicate a perfect linear relationship between two variables; in this case, predicted and measured expression.
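For readers who want to see what that statistic measures, the sketch below computes Pearson's r between a list of predicted values and a list of measured values. The numbers in the example are made up for illustration and are not data from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, without the 1/n factor (it cancels out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical predicted vs. measured expression values.
    predicted = [2.1, 5.0, 9.8, 14.2]
    measured = [2.0, 5.5, 9.0, 15.0]
    print(round(pearson_r(predicted, measured), 3))
```

Because r only captures linear agreement, a model can score close to 1 even if its predictions are uniformly shifted or rescaled relative to the measurements.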

The new model is 45 percent less error-prone than a prior model trained to perform the same task.

In further experiments, researchers used the model to 'discover' promoter sequences predicted to drive high expression, that is, to produce lots of protein. They synthesized 500 of these sequences, cloned them into yeast, and tested each one. On average, the predicted promoters drove higher gene expression than 99 percent of promoters native to the yeast genome.
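The general idea of model-guided design can be sketched as a search that proposes mutations and keeps those the model scores higher. The Python below is a simple greedy hill climb, which is not necessarily the search method used in the study, and `toy_predict` is a hypothetical stand-in for a trained model (it just scores the fraction of A and T bases).

```python
import random

BASES = "ACGT"
PROMOTER_LENGTH = 80  # promoter length described above


def toy_predict(seq):
    """Hypothetical stand-in for a trained model: fraction of A/T bases."""
    return sum(base in "AT" for base in seq) / len(seq)


def hill_climb(predict, start, steps=500, rng=None):
    """Greedy search: try single-base mutations, keep those that score higher."""
    rng = rng or random.Random(0)
    best, best_score = start, predict(start)
    for _ in range(steps):
        pos = rng.randrange(len(best))
        # Pick a base that differs from the current one at this position.
        new_base = rng.choice([b for b in BASES if b != best[pos]])
        candidate = best[:pos] + new_base + best[pos + 1:]
        score = predict(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


if __name__ == "__main__":
    rng = random.Random(42)
    start = "".join(rng.choice(BASES) for _ in range(PROMOTER_LENGTH))
    designed, score = hill_climb(toy_predict, start, rng=rng)
    print(toy_predict(start), "->", score)
```

Swapping `toy_predict` for a real sequence-to-expression model would leave the loop unchanged; the sequences it returns are only predictions and would still need to be tested in cells, as the researchers did.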

Summary: Gene expression can be predicted, ahead of time, for arbitrary promoter sequences in the model organism S. cerevisiae. Models like this one could help engineers design DNA sequences that produce a desired amount of protein in fewer experiments.

Read more in Nature.

Other Papers

(* = open access, † = review article)

Basic Research

OpenCell: Endogenous tagging for the cartography of human cellular organization. Cho NH...Leonetti MD. Science. Link

*Probing the genomic limits of de-extinction in the Christmas Island rat. Lin J...Gilbert MTP. Current Biology. Link

*Nucleosome Positioning on Large Tandem DNA Repeats of the ‘601’ Sequence Engineered in Saccharomyces cerevisiae. Lancrey A...Boulé J. Journal of Molecular Biology. Link


*Self-attenuating adenovirus enables production of recombinant adeno-associated virus for high manufacturing yield without contamination. Su W...Cawood R. Nature Communications. Link

†New technologies for cultivated meat production. Lavon N. Trends in Biotechnology. Link


A Dual-Purpose Real-Time Indicator and Transcriptional Integrator for Calcium Detection in Living Cells. Erdenee E & Ting AY. ACS Synthetic Biology. Link

Cell-Free Systems

Constructing Cell-Free Expression Systems for Low-Cost Access. Guzman-Chavez F...Haseloff J. ACS Synthetic Biology. Link

Computation & Models

*Improved prediction of protein-protein interactions using AlphaFold2. Bryant P, Pozzati G, Elofsson A. Nature Communications. Link

*Information Decay and Enzymatic Information Recovery for DNA Data Storage. Meiser LC...Grass RN. bioRxiv (preprint). Link

The energy cost and optimal design of networks for biological discrimination. Yu Q...Igoshin OA. Journal of the Royal Society Interface. Link

CRISPR & Genetic Engineering

*Spacer2PAM: A computational framework to guide experimental determination of functional CRISPR-Cas system PAM sequences. Rybnicky GA...Jewett MC. Nucleic Acids Research. Link

Endogenous ADAR-mediated RNA editing in non-human primates using stereopure chemically modified oligonucleotides. Monian P...Vargeese C. Nature Biotechnology. Link

*CRISPR/Cas13 effectors have differing extents of off-target effects that limit their utility in eukaryotic cells. Ai Y, Liang D & Wilusz JE. Nucleic Acids Research. Link

*Cas9 exo-endonuclease eliminates chromosomal translocations during genome editing. Yin J...Hu J. Nature Communications. Link

Conjugation-Based Genome Engineering in Deinococcus radiodurans. Brumwell SL...Karas BJ. ACS Synthetic Biology. Link

DNA Assembly

*Single 3′-exonuclease-based multifragment DNA assembly method (SENAX). Dao VL...Poh CL. Scientific Reports. Link

Medicine & Diagnostics

In vivo partial reprogramming alters age-associated molecular changes during physiological aging in mice. Browder KC...Belmonte JCI. Nature Aging. Link

*Field validation of the performance of paper-based tests for the detection of the Zika and chikungunya viruses in serum samples. Karlikow M...Pardee K. Nature Biomedical Engineering. Link

*An engineered bacterial therapeutic lowers urinary oxalate in preclinical models and in silico simulations of enteric hyperoxaluria. Lubkowicz D...Hava DL. Molecular Systems Biology. Link

Reprogramming Synthetic Cells for Targeted Cancer Therapy. Lim B...Huang WE. ACS Synthetic Biology. Link

*Bacterioboat—A novel tool to increase the half-life period of the orally administered drug. Kaur P...Choudhury D. Science Advances. Link

*Synthetic glycans control gut microbiome structure and mitigate colitis in mice. Tolonen AC...van Hylckama Vlieg JET. Nature Communications. Link

Metabolic Engineering

*Oil accumulation in leaves driven by a native promoter-gene fusion created using CRISPR/Cas9 mediated genomic deletion. Bhunia RK, Menard G & Eastmond PJ. bioRxiv (preprint). Link

*Remodeling Yersinia pseudotuberculosis to generate a highly immunogenic outer membrane vesicle vaccine against pneumonic plague. Wang X...Sun W. PNAS. Link

*Regulation of protein secretion through chemical regulation of endoplasmic reticulum retention signal cleavage. Praznik A...Jerala R. Nature Communications. Link

*Production of abscisic acid in the oleaginous yeast Yarrowia lipolytica. Arnesen JA...Borodina I. FEMS Yeast Research. Link

†Refactoring transcription factors for metabolic engineering. Deng C...Liu L. Biotechnology Advances. Link

Protein Engineering

*Broad neutralization of SARS-CoV-2 variants by an inhalable bispecific single-domain antibody. Li C...Ying T. Cell. Link

*Biparatopic sybodies neutralize SARS-CoV-2 variants of concern and mitigate drug resistance. Walter JD...Seeger MA. EMBO Reports. Link

*Rationally designed immunogens enable immune focusing following SARS-CoV-2 spike imprinting. Hauser BM...Schmidt AG. Cell Reports. Link

*†Engineering Strategies to Overcome the Stability–Function Trade-Off in Proteins. Teufl M, Zajc CU & Traxlmayr MW. ACS Synthetic Biology. Link

*Protein Optimization Evolving Tool (POET) based on Genetic Programming. Bricco AR...Gilad AA. bioRxiv (preprint). Link

Tools & Technology

*High-throughput molecular recording can determine the identity and biological activity of sequences within single cells. Tu B & Esvelt KM. bioRxiv (preprint). Link

Programmable molecular transport achieved by engineering protein motors to move on DNA nanotubes. Ibusuki R...Furuta K. Science. Link

*High-throughput screening of cell-free riboswitches by fluorescence-activated droplet sorting. Tabuchi T & Yokobayashi Y. Nucleic Acids Research. Link

*Mutation-specific reporter for optimization and enrichment of prime editing. Schene IF...Fuchs SA. Nature Communications. Link

*An operator-based expression toolkit for Bacillus subtilis enables fine-tuning of gene expression and biosynthetic pathway regulation. Fu G...Zhang D. PNAS. Link

*Putative Phenotypically Neutral Genomic Insertion Points in Prokaryotes. Bernhards CB...Lux MW. ACS Synthetic Biology. Link


*PASIV: A Pooled Approach-Based Workflow to Overcome Toxicity-Induced Design of Experiments Failures and Inefficiencies. Casas A...Kitney R. ACS Synthetic Biology. Link