DNA is the fundamental layer of biological information in all living organisms. Researchers have made tremendous progress in developing technologies for reading, writing, and editing DNA, but composing new DNA sequences remains a formidable challenge. Advanced AI models, trained on genomic data from millions of organisms, now enable us to compose new genomic sequences beyond what has been generated by natural evolution.
SynGenome is a first-of-its-kind database consisting of synthetic DNA sequences generated by Evo, a genomic language model. Evo is similar to models of human language that learn to predict the next word in a sentence. Given a DNA sequence prompt, Evo will generate a DNA sequence response that continues the genomic sequence. Essentially, Evo enables “autocomplete” for the genome.
In the genomes of prokaryotes and phage, genes with related functions frequently appear directly next to each other along the DNA sequence. As a result, prompting Evo with a sequence encoding a function of interest instructs the model to generate functionally related genes. This enables function-guided generative design by prompt engineering a genomic language model.
SynGenome is organized according to the known functions, domains, and species of the prompt sequences. The corresponding response sequences are likely enriched for genes with related functions or domains, but they could contain many other interesting genes as well. The generated sequences in SynGenome may be very different from anything found in nature while still performing useful biological functions, opening up a new universe of biological discovery.
SynGenome contains more than 100 billion DNA base pairs generated by Evo, a DNA language model trained on millions of prokaryotic genomes. Evo is trained to predict the next base pair in a DNA sequence, which enables Evo to generate a response sequence that follows a DNA prompt. To prompt Evo, we used DNA sequences encoding or directly adjacent to known protein-coding genes from prokaryotic organisms and bacteriophages (as compiled by the UniProt database). While the prompt sequences come from natural genomes, the responses can be very different from any DNA sequence found in nature.
SynGenome entries are organized by the labels associated with the prompt sequence as provided by UniProt. Functional annotations take the form of Gene Ontology (GO) terms. Protein domain annotations are based on InterPro identifiers. Finally, the species labels correspond to the organism from which the prompt sequence was derived.
Many important biological systems, with applications in genome editing or DNA synthesis, have been found by searching through genomic databases in a process referred to as “genome mining.” The sequences in SynGenome could be mined similarly. But rather than mining natural sequences, these sequences are generated based on their relationship with known genes in the prompt and can be unlike natural sequences, which is why we refer to this process as “semantic mining.” Because the synthetic sequences are likely enriched for functions that are related to the prompt sequence, these sequences could form the basis of a functional screening library.
Using the same strategy for generating sequences in SynGenome, we have had high experimental success rates in low-throughput experiments in which we test only tens of variants for the desired function encoded by a genomic prompt. This includes the design of proteins with anti-CRISPR activity, as well as the design of both bacterial toxins and their cognate antitoxins. See our paper for more details.
Each prompt is associated with the coding sequence (CDS) of a UniProt entry. For each entry, we constructed six prompts, each denoted in the raw data by a distinct prompt type label. A diagram illustrating these prompts and a table describing the prompt type labels are below.
Prompt type | Description |
---|---|
Upstream | The 500 base pair (bp) sequence at the 5' end of the coding sequence (CDS). |
CDS | The 500 bp sequence beginning at the start of the CDS. If the CDS is shorter than 500 bp, the prompt is the CDS truncated to its length rounded down to the nearest 100 bp. |
Downstream | The 500 bp sequence at the 3' end of the CDS. |
RevComp Upstream | The reverse complement of the 500 base pair sequence at the 5' end of the CDS. |
RevComp | The reverse complement of the 500 bp sequence that ends with the 5' end of the CDS. If the CDS is shorter than 500 bp, its length is rounded down to the nearest 100 bp. |
RevComp Downstream | The reverse complement of the 500 base pair sequence at the 3' end of the CDS. |
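The prompt definitions in the table can be sketched in a few lines of Python. This is an illustration only, not the code used to build SynGenome: the function names, the 0-based half-open coordinates, the forward-strand assumption, and the reading of "RevComp" as the reverse complement of the CDS prompt are all assumptions made for this example.

```python
# Hypothetical helper names; coordinates are assumed to be 0-based, half-open,
# and the CDS is assumed to lie on the forward strand of `genome`.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def build_prompts(genome: str, cds_start: int, cds_end: int, flank: int = 500) -> dict:
    """Build the six prompt types for one CDS, following the table above."""
    cds_len = cds_end - cds_start
    # A CDS shorter than 500 bp is rounded down to the nearest 100 bp.
    prompt_len = flank if cds_len >= flank else (cds_len // 100) * 100

    upstream = genome[max(0, cds_start - flank):cds_start]
    cds = genome[cds_start:cds_start + prompt_len]
    downstream = genome[cds_end:cds_end + flank]

    return {
        "Upstream": upstream,
        "CDS": cds,
        "Downstream": downstream,
        "RevComp Upstream": reverse_complement(upstream),
        # Assumption: the reverse complement of the CDS prompt, which
        # therefore ends with the (complemented) 5' end of the CDS.
        "RevComp": reverse_complement(cds),
        "RevComp Downstream": reverse_complement(downstream),
    }


# Toy example: a 350 bp CDS flanked by 600 bp on each side.
toy_genome = "A" * 600 + "ATG" + "C" * 347 + "G" * 600
prompts = build_prompts(toy_genome, cds_start=600, cds_end=950)
print({name: len(seq) for name, seq in prompts.items()})
```

In the toy example the CDS is 350 bp long, so the CDS and RevComp prompts are 300 bp while the four flanking prompts are 500 bp each.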
We generate sequences with the Evo 1.5 model without any additional finetuning or post-training. We use a standard autoregressive decoding algorithm to sample new sequences with a temperature of 0.7, top-k of 4, and top-p of 1. Code for generating sequences with Evo can be found here. For each prompt type described above, we sampled 2 sequences.
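To make the sampling parameters concrete, here is a minimal sketch of a single decoding step with temperature, top-k, and top-p applied in that order. It is not Evo's implementation: the function names are hypothetical, the logits are made up, and the vocabulary is reduced to the four nucleotides for illustration.

```python
import math
import random
from typing import Dict, Optional


def filtered_distribution(logits: Dict[str, float], temperature: float = 0.7,
                          top_k: int = 4, top_p: float = 1.0) -> Dict[str, float]:
    """Apply temperature, top-k, then top-p (nucleus) filtering and renormalize."""
    # Temperature < 1 sharpens the distribution; > 1 flattens it.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda item: item[1], reverse=True)

    # Top-k: keep only the k most probable tokens.
    ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    kept_total = sum(prob for _, prob in kept)
    return {tok: prob / kept_total for tok, prob in kept}


def sample_next(logits: Dict[str, float], temperature: float = 0.7, top_k: int = 4,
                top_p: float = 1.0, rng: Optional[random.Random] = None) -> str:
    """Sample one token from the filtered distribution."""
    rng = rng or random.Random()
    dist = filtered_distribution(logits, temperature, top_k, top_p)
    return rng.choices(list(dist), weights=list(dist.values()))[0]


# Made-up logits for the next nucleotide.
toy_logits = {"A": 2.0, "C": 1.0, "G": 0.5, "T": -1.0}
print(filtered_distribution(toy_logits))
print(sample_next(toy_logits, rng=random.Random(0)))
```

In this four-token toy vocabulary, a top-k of 4 and a top-p of 1 leave the distribution untruncated, so the temperature of 0.7 is the only setting that reshapes it. Autoregressive generation repeats this step, appending each sampled token to the context before computing the next logits.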
After sampling, we ran NCBI dustmasker to remove highly repetitive sequences, for example, long stretches of sequence containing only a single repeated nucleotide. While many repetitive sequences were eliminated, we chose not to filter out all repetitive elements, given that some prokaryotic genomes naturally contain biologically meaningful repeats.
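As a simplified stand-in for this kind of filter, the sketch below flags sequences that contain a long run of a single nucleotide. It does not reproduce dustmasker, which uses a more general low-complexity score; the function names and the `max_run` threshold of 20 are hypothetical choices for illustration.

```python
from itertools import groupby


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of one repeated nucleotide (0 for an empty string)."""
    return max((sum(1 for _ in run) for _, run in groupby(seq.upper())), default=0)


def is_trivially_repetitive(seq: str, max_run: int = 20) -> bool:
    """Flag a sequence whose longest single-nucleotide run exceeds max_run (assumed threshold)."""
    return longest_homopolymer(seq) > max_run


generations = ["ACGT" * 10, "ACG" + "A" * 50 + "TTG"]
kept = [seq for seq in generations if not is_trivially_repetitive(seq)]
print(len(kept), "of", len(generations), "sequences kept")
```

A check this simple misses more complex repeats such as short tandem motifs, which is consistent with the note that some repetitive sequences remain in the database.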
The raw SynGenome data is provided in CSV format. Below are the column names and a brief description of each field. Fields corresponding to sequences generated by Evo are in bold.
Field | Description |
---|---|
UUID | Unique identifier for each entry in the dataset |
Prompt | Input text used to generate sequences |
Generated_Seq | DNA sequence output generated from the prompt |
Score | Evo log-likelihood of the sequence |
File_Derivation | Source file |
UniProt_CID | ID of the protein |
Type | Classification of the prompt (e.g., Upstream, Downstream, CDS) |
Entry | UniProt accession |
Organism | Name of the biological organism |
Gene Ontology IDs | Unique identifiers from Gene Ontology database (e.g., GO:0016020) |
Gene Ontology (biological process) | GO terms describing biological processes |
Gene Ontology (cellular component) | GO terms describing cellular locations |
Gene Ontology (GO) | All Gene Ontology annotations |
Gene Ontology (molecular function) | GO terms describing molecular activities |
Protein families | Classification of protein family membership |
CDD | CDD domain annotations |
DisProt | DisProt database annotations |
Gene3D | Gene3D database annotations |
HAMAP | HAMAP database annotations |
InterPro | InterPro domain annotations |
NCBIfam | NCBI protein family annotations |
PANTHER | PANTHER database annotations |
Pfam | Pfam database annotations |
PIRSF | PIRSF database classifications |
PRINTS | PRINTS database annotations |
PROSITE | PROSITE database annotations |
SFLD | SFLD database annotations |
SMART | SMART database annotations |
SUPFAM | SUPFAM database annotations |
Domains Compiled | Compilation of domain annotations from various sources |
Gene Names | Official or common names of genes |
Protein names | Names or descriptions of proteins |
Generation_Proteins | Generated protein sequences contained in the DNA response |
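The following sketch shows one way to read the CSV and select rows by prompt type or GO identifier using only the Python standard library. The column names come from the table above; the `filter_rows` helper, the toy rows, and the exact spelling of the `Type` values are assumptions for illustration, so check them against the real files.

```python
import csv
import io


def filter_rows(csv_file, prompt_type=None, go_id=None):
    """Yield rows matching an optional prompt type and an optional GO identifier."""
    for row in csv.DictReader(csv_file):
        if prompt_type is not None and row["Type"] != prompt_type:
            continue
        # Substring match avoids assuming how multiple GO IDs are delimited.
        if go_id is not None and go_id not in row["Gene Ontology IDs"]:
            continue
        yield row


# Toy data with a subset of the columns; the values are invented for this example.
SAMPLE = """UUID,Prompt,Generated_Seq,Score,Type,Entry,Gene Ontology IDs
u1,ATGAAA,ATGCCCGGG,-1.2,Upstream,P00001,GO:0016020; GO:0005886
u2,ATGTTT,ATGAAATAG,-0.9,CDS,P00002,GO:0016020
u3,ATGGGG,ATGCGCTAA,-1.5,Upstream,P00003,GO:0003677
"""

hits = list(filter_rows(io.StringIO(SAMPLE), prompt_type="Upstream", go_id="GO:0016020"))
print([row["UUID"] for row in hits])
```

To use this on a downloaded file, pass an open file handle instead of the `io.StringIO` wrapper, for example `open(path, newline="")` with `path` pointing to the CSV.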
When using SynGenome, please keep in mind a few important details. All of the generated DNA sequences are synthetic sequences and any downstream functional claims regarding these sequences should be based on additional experimental data. To achieve high experimental success rates, some bioinformatic filtering related to your function of interest is recommended. The “function,” “domain,” and “species” labels in SynGenome are based on the UniProt annotations of the prompt sequence. While the response sequences may be enriched for the same GO terms or InterPro domains as in the prompt, there could also be a diversity of biological structures and functions contained in the response. Generating sequences with a language model is prone to highly repetitive generations; while many trivially repetitive regions have been filtered out, some repetitive sequences, especially those with more complex motifs, have been retained.
SynGenome makes use of and is inspired by other biological sequence databases such as the UniProt database of protein sequences, the Gene Ontology database of gene annotations, and the InterPro database of protein domain annotations. Evo is trained on data from the Genome Taxonomy Database (GTDB) and the IMG/PR and IMG/VR databases from the Joint Genome Institute. We extend our gratitude to the developers and maintainers of these resources.
The full database is available for download at Hugging Face datasets: https://huggingface.co/datasets/evo-design/syngenome-uniprot.
For downloads specific to a given GO term, InterPro domain, species, or UniProt ID, please use the SynGenome browse/search functionality and click the “Download prompts and generations” button on a given entry page.
SynGenome is freely available under an MIT license. If this database or any of its contents prove useful for your research, please cite Merchant et al. (2024).
@article{merchant2024semantic,
  author = {Merchant, Aditi T and King, Samuel H and Nguyen, Eric and Hie, Brian L},
  title = {Semantic mining of functional de novo genes from a genomic language model},
  year = {2024},
  doi = {10.1101/2024.12.17.628962},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2024/12/18/2024.12.17.628962},
  journal = {bioRxiv}
}