SynGenome
Going beyond the observable evolutionary universe

DNA is the fundamental layer of biological information in all living organisms. Researchers have made tremendous progress in developing technologies for reading, writing, and editing DNA, but composing new DNA sequences remains a formidable challenge. Advanced AI models, trained on genomic data from millions of organisms, now enable us to compose new genomic sequences beyond what has been generated by natural evolution.

SynGenome is a first-of-its-kind database consisting of synthetic DNA sequences generated by Evo, a genomic language model. Evo is similar to models of human language that learn to predict the next word in a sentence. Given a DNA sequence prompt, Evo will generate a DNA sequence response that continues the genomic sequence. Essentially, Evo enables “autocomplete” for the genome.

In the genomes of prokaryotes and phage, genes with related functions frequently appear directly next to each other along the DNA sequence. As a result, prompting Evo with a sequence encoding a function of interest instructs the model to generate functionally related genes. This enables function-guided generative design by prompt engineering a genomic language model.

SynGenome is organized according to the known functions, domains, and species of the prompt sequences. The corresponding response sequences are likely enriched for genes with related functions or domains, but they could contain many other interesting genes as well. The generated sequences in SynGenome may be very different from anything found in nature while still performing useful biological functions, opening up a new universe of biological discovery.

FAQs

How was SynGenome made?

SynGenome contains more than 100 billion DNA base pairs generated by Evo, a DNA language model trained on millions of prokaryotic genomes. Evo is trained to predict the next base pair in a DNA sequence, which enables Evo to generate a response sequence that follows a DNA prompt. To prompt Evo, we used DNA sequences encoding or directly adjacent to known protein-coding genes from prokaryotic organisms and bacteriophages (as compiled by the UniProt database). While the prompt sequences come from natural genomes, the responses can be very different from any DNA sequence found in nature.

What do the “function,” “domain,” and “species” labels mean?

SynGenome entries are organized by the labels associated with the prompt sequence as provided by UniProt. Functional annotations take the form of Gene Ontology (GO) terms. Protein domain annotations are based on InterPro identifiers. Finally, the species labels correspond to the organism from which the prompt sequence was derived.

How should I use SynGenome?

Many important biological systems, with applications in genome editing or DNA synthesis, have been found by searching through genomic databases in a process referred to as “genome mining.” The sequences in SynGenome could be mined similarly. Unlike natural sequences, however, the sequences in SynGenome are generated based on their relationship to known genes in the prompt and can differ substantially from anything found in nature, which is why we refer to this process as “semantic mining.” Because the synthetic sequences are likely enriched for functions that are related to the prompt sequence, these sequences could form the basis of a functional screening library.

Have any sequences in SynGenome been experimentally validated?

Using the same strategy that produced the sequences in SynGenome, we have achieved high experimental success rates in low-throughput experiments that test only tens of variants for the desired function encoded by a genomic prompt. These include the design of proteins with anti-CRISPR activity, as well as the design of both bacterial toxins and their cognate antitoxins. See our paper for more details.

Methodology

Prompt construction

Each prompt is associated with the coding sequence (CDS) of a UniProt entry. For each entry, we constructed six prompts, each denoted by a different label in the raw data. A diagram illustrating these prompts and a table describing the prompt type labels are below.

| Prompt type | Description |
| --- | --- |
| Upstream | The 500 base pair (bp) sequence immediately upstream of the 5' end of the coding sequence (CDS). |
| CDS | The 500 bp sequence beginning with the CDS. If the CDS is shorter than 500 bp, the prompt length is rounded down to the nearest 100 bp. |
| Downstream | The 500 bp sequence immediately downstream of the 3' end of the CDS. |
| RevComp Upstream | The reverse complement of the 500 bp sequence at the 5' end of the CDS. |
| RevComp | The reverse complement of the 500 bp sequence that ends with the 5' end of the CDS. If the reverse complement of the CDS is shorter than 500 bp, the prompt length is rounded down to the nearest 100 bp. |
| RevComp Downstream | The reverse complement of the 500 bp sequence at the 3' end of the CDS. |
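
The six prompt types can be sketched in a few lines of Python. This is an illustrative reading of the table, not the code used to build SynGenome. It assumes a forward-strand CDS given as 0-based, half-open coordinates, takes "Upstream" and "Downstream" to mean the 500 bp flanking the CDS, and takes each RevComp prompt to be the reverse complement of the corresponding forward prompt. The function names `reverse_complement` and `build_prompts` are hypothetical.

```python
# Hypothetical helper names; coordinates are assumed 0-based, half-open,
# and the CDS is assumed to lie on the forward strand.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def build_prompts(genome, cds_start, cds_end, length=500):
    """Build the six prompt types for one CDS (one reading of the table)."""
    cds = genome[cds_start:cds_end]
    if len(cds) >= length:
        cds_prompt = cds[:length]
    else:
        # Short CDS: round the prompt length down to the nearest 100 bp.
        cds_prompt = cds[: (len(cds) // 100) * 100]

    # Flanking regions, clipped at the ends of the genome.
    upstream = genome[max(0, cds_start - length):cds_start]
    downstream = genome[cds_end:cds_end + length]

    return {
        "Upstream": upstream,
        "CDS": cds_prompt,
        "Downstream": downstream,
        "RevComp Upstream": reverse_complement(upstream),
        "RevComp": reverse_complement(cds_prompt),
        "RevComp Downstream": reverse_complement(downstream),
    }


# Toy genome: 600 bp of A, a 250 bp CDS, then 600 bp of G.
toy_genome = "A" * 600 + "ATG" + "C" * 247 + "G" * 600
prompts = build_prompts(toy_genome, cds_start=600, cds_end=850)
for label, seq in prompts.items():
    print(label, len(seq))
```

In this toy case the CDS is 250 bp long, so the CDS and RevComp prompts are 200 bp, while the four flanking prompts are 500 bp each.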

Model and sampling algorithm

We generated sequences with the Evo 1.5 model without any additional finetuning or post-training, using a standard autoregressive decoding algorithm with a temperature of 0.7, a top-k of 4, and a top-p of 1. Code for generating sequences with Evo can be found here. For each prompt type described above, we sampled two sequences.
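
To make the sampling parameters concrete, here is a minimal sketch of a single decoding step with temperature, top-k, and top-p filtering. It is not Evo's implementation: the function name `sample_next_token`, the toy logits, and the order in which the filters are applied (temperature, then top-k, then top-p) are assumptions for illustration. With a top-p of 1, the nucleus filter keeps every token that survives top-k.

```python
import math
import random


def sample_next_token(logits, temperature=0.7, top_k=4, top_p=1.0, rng=None):
    """Sample one token from a dict of token -> logit (illustrative only)."""
    rng = rng if rng is not None else random

    # Temperature below 1 sharpens the distribution before filtering.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}

    # Top-k: keep only the k highest-scoring tokens.
    kept = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # Softmax over the surviving tokens (shifted for numerical stability).
    max_logit = kept[0][1]
    weights = [math.exp(value - max_logit) for _, value in kept]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    cutoff = len(kept)
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if cumulative >= top_p:
            cutoff = i + 1
            break

    tokens = [tok for tok, _ in kept[:cutoff]]
    return rng.choices(tokens, weights=probs[:cutoff], k=1)[0]


# Made-up logits for one position; "N" stands in for any unlikely fifth token.
toy_logits = {"A": 2.0, "C": 1.0, "G": 0.5, "T": 0.1, "N": -5.0}
rng = random.Random(0)
print("".join(sample_next_token(toy_logits, rng=rng) for _ in range(20)))
```

A full generation loop would call a step like this repeatedly, appending each sampled base to the prompt and recomputing the logits with the model.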

Post-processing

After sampling, we ran NCBI dustmasker to remove highly repetitive sequences, for example, long stretches consisting of a single repeated base. While many repetitive sequences were eliminated, we chose not to filter out all repetitive elements, given that some prokaryotic genomes naturally contain biologically meaningful repeats.
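
As a rough illustration of the simplest case mentioned above (long single-base stretches), the sketch below flags sequences dominated by one homopolymer run. It is a simplified stand-in, not a reimplementation of dustmasker, whose low-complexity scoring is more general. The function names and the 0.5 threshold are arbitrary assumptions.

```python
from itertools import groupby


def longest_homopolymer(seq):
    """Length of the longest run of a single repeated base (0 if empty)."""
    return max((sum(1 for _ in run) for _, run in groupby(seq.upper())), default=0)


def is_trivially_repetitive(seq, max_run_fraction=0.5):
    """Flag a sequence whose longest single-base run covers most of it.

    The 0.5 threshold is an arbitrary illustration, not a SynGenome setting.
    """
    if not seq:
        return True
    return longest_homopolymer(seq) / len(seq) >= max_run_fraction


print(is_trivially_repetitive("A" * 90 + "CGTACGTACG"))  # dominated by a poly-A run
print(is_trivially_repetitive("ACGT" * 25))              # no long single-base run
```

A check like this catches only homopolymers. More complex repeats, such as short tandem motifs, would need a dedicated low-complexity tool.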

Raw data format

The raw SynGenome data is provided in CSV format. Below are the column names and a brief description of each field. Fields corresponding to sequences generated by Evo are in bold.

| Field | Description |
| --- | --- |
| UUID | Unique identifier for each entry in the dataset |
| Prompt | Input text used to generate sequences |
| Generated_Seq | DNA sequence output generated from the prompt |
| Score | Evo log-likelihood of the sequence |
| File_Derivation | Source file |
| UniProt_CID | ID of the protein |
| Type | Classification of the prompt (e.g., Upstream, Downstream, CDS) |
| Entry | UniProt accession |
| Organism | Name of the biological organism |
| Gene Ontology IDs | Unique identifiers from Gene Ontology database (e.g., GO:0016020) |
| Gene Ontology (biological process) | GO terms describing biological processes |
| Gene Ontology (cellular component) | GO terms describing cellular locations |
| Gene Ontology (GO) | All Gene Ontology annotations |
| Gene Ontology (molecular function) | GO terms describing molecular activities |
| Protein families | Classification of protein family membership |
| CDD | CDD domain annotations |
| DisProt | DisProt database annotations |
| Gene3D | Gene3D database annotations |
| HAMAP | HAMAP database annotations |
| InterPro | InterPro domain annotations |
| NCBIfam | NCBI protein family annotations |
| PANTHER | PANTHER database annotations |
| Pfam | Pfam database annotations |
| PIRSF | PIRSF database classifications |
| PRINTS | PRINTS database annotations |
| PROSITE | PROSITE database annotations |
| SFLD | SFLD database annotations |
| SMART | SMART database annotations |
| SUPFAM | SUPFAM database annotations |
| Domains Compiled | Compilation of domain annotations from various sources |
| Gene Names | Official or common names of genes |
| Protein names | Names or descriptions of proteins |
| Generation_Proteins | Generated protein sequences contained in the DNA response |
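
A minimal sketch of reading the raw CSV and selecting entries by prompt type and GO term, using only Python's standard library. The column names come from the table above. The helper name `filter_rows`, the sample rows, and the assumption that a "Gene Ontology IDs" cell may hold several IDs in one string are illustrative and not confirmed by the dataset documentation.

```python
import csv
import io


def filter_rows(csv_file, prompt_type=None, go_id=None):
    """Yield rows matching an optional prompt type and GO identifier.

    Assumes the "Gene Ontology IDs" cell may list several IDs in one string,
    so a substring match is used (GO IDs have a fixed width).
    """
    for row in csv.DictReader(csv_file):
        if prompt_type is not None and row.get("Type") != prompt_type:
            continue
        if go_id is not None and go_id not in (row.get("Gene Ontology IDs") or ""):
            continue
        yield row


# Made-up rows with a subset of the columns, for illustration only.
sample = io.StringIO(
    "UUID,Type,Gene Ontology IDs,Generated_Seq,Score\n"
    "1,Upstream,GO:0016020; GO:0005524,ACGTACGT,-1.2\n"
    "2,CDS,GO:0005524,TTGACCGA,-0.9\n"
)

for row in filter_rows(sample, prompt_type="Upstream", go_id="GO:0016020"):
    print(row["UUID"], row["Generated_Seq"], float(row["Score"]))
```

For a real file, pass an open file handle (for example, one opened with `newline=""`) in place of the in-memory sample. Streaming rows this way avoids loading the full table into memory.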

Limitations

When using SynGenome, please keep in mind a few important details. All of the generated DNA sequences are synthetic, and any downstream functional claims regarding these sequences should be based on additional experimental data. To achieve high experimental success rates, we recommend some bioinformatic filtering related to your function of interest. The “function,” “domain,” and “species” labels in SynGenome are based on the UniProt annotations of the prompt sequence. While the response sequences may be enriched for the same GO terms or InterPro domains as the prompt, the responses could also contain a diversity of other biological structures and functions. Language models are prone to producing highly repetitive generations; while many trivially repetitive regions have been filtered out, some repetitive sequences, especially those with more complex motifs, have been retained.

Acknowledgements

SynGenome makes use of and is inspired by other biological sequence databases such as the UniProt database of protein sequences, the Gene Ontology database of gene annotations, and the InterPro database of protein domain annotations. Evo is trained on data from the Genome Taxonomy Database (GTDB) and the IMG/PR and IMG/VR databases from the Joint Genome Institute. We extend our gratitude to the developers and maintainers of these resources.

Download

The full database is available for download at Hugging Face datasets: https://huggingface.co/datasets/evo-design/syngenome-uniprot.

For downloads specific to a given GO term, InterPro domain, species, or UniProt ID, please use the SynGenome browse/search functionality and click the “Download prompts and generations” button on a given entry page.

License and citation

SynGenome is freely available under an MIT license. If this database or any of its contents prove useful for your research, please cite Merchant et al. (2024).

@article{merchant2024semantic,
  author = {Merchant, Aditi T and King, Samuel H and Nguyen, Eric and Hie, Brian L},
  title = {Semantic mining of functional de novo genes from a genomic language model},
  year = {2024},
  doi = {10.1101/2024.12.17.628962},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2024/12/18/2024.12.17.628962},
  journal = {bioRxiv}
}