Ciro Santilli @cirosantilli 40


De novo DNA synthesis Updated 2025-07-16

As of 2018, Ciro Santilli believes that this could be the next big thing in biology technology.

"De novo" means "starting from scratch", that is: you type the desired sequence into a computer, and then synthesize it.

The "de novo" part is important, because it distinguishes this from the already well solved problem of duplicating DNA from an existing DNA template, which is what all our cells do daily, and which can already be done very efficiently in vitro with polymerase chain reaction.

Many companies are attempting to create more efficient de novo synthesis methods.

Notably, the dream of most of those companies is to have a machine that sits on a lab bench, which synthesises whatever you want.

TODO current de novo synthesis costs/time to delivery after ordering a custom sequence.

The initial main applications are likely going to be:

polymerase chain reaction primers (which determine which region will be amplified)
creating a custom sequence to be inserted in a plasmid, i.e. artificial gene synthesis

but the real pipe dream is building and bootstrapping entire artificial chromosomes

News coverage:

2023-03 twitter.com/sethbannon/status/1633848116154880001
AnsaBio created the world's longest DNA oligo produced using de novo synthesis! 1,005 bases! 99.9% stepwise yield
2020-10-05 www.nature.com/articles/s41587-020-0695-9 "Enzymatic DNA synthesis enters new phase"
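
To put the 99.9% stepwise yield in context, here is a quick sketch of how a per-step yield compounds over the length of an oligo. It assumes that every addition step succeeds independently with the same probability and that an N-base oligo needs N addition steps. Both are simplifications made for illustration and are not taken from the announcement.

```python
def full_length_fraction(stepwise_yield: float, n_bases: int) -> float:
    """Fraction of strands that received every base.

    Assumption: each of the n_bases steps succeeds independently
    with probability stepwise_yield.
    """
    return stepwise_yield ** n_bases


# Compare a few hypothetical stepwise yields at the 1,005 base length quoted above.
for stepwise_yield in (0.99, 0.995, 0.999):
    fraction = full_length_fraction(stepwise_yield, 1005)
    print(f"stepwise yield {stepwise_yield:.3f} -> full-length fraction {fraction:.2e}")
```

Under these assumptions, a 99.9% stepwise yield leaves roughly a third of the strands full length at 1,005 bases, while a 99% stepwise yield would leave well under 0.01%, which suggests why the stepwise figure matters so much for long oligos.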

Video 1.

Nuclera eDNA enzymatic de novo DNA synthesis explanatory animation (2021)

Source. The video shows nicely how Nuclera's enzymatic DNA synthesis works:

they provide blocked nucleotides of a single type
add them with the enzyme. They use a weird DNA polymerase called terminal deoxynucleotidyl transferase that adds one base at a time to a single-stranded DNA strand rather than copying from a template
wash everything
do the deblocking reaction
and then repeat until done
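
The cycle above can be sketched as a toy simulation. This is only an illustration of the control flow, not Nuclera's actual process: the `Strand` class and `synthesize` function are hypothetical names, the strand starts empty for simplicity, and washing is not modelled.

```python
class Strand:
    """Toy single-stranded DNA whose end can be blocked."""

    def __init__(self):
        self.bases = []
        self.blocked = False

    def extend(self, base: str) -> bool:
        # A blocked end cannot accept another nucleotide.
        if self.blocked:
            return False
        self.bases.append(base)
        self.blocked = True  # the added nucleotide carries a blocking group
        return True

    def deblock(self) -> None:
        self.blocked = False


def synthesize(target: str) -> str:
    strand = Strand()
    for base in target:
        # Supply many copies of one blocked nucleotide type:
        # only the first attempt can succeed, so exactly one base is added.
        additions = sum(strand.extend(base) for _ in range(10))
        assert additions == 1
        # Wash step: leftover nucleotides would be removed here (not modelled).
        strand.deblock()  # make the end available for the next cycle
    return "".join(strand.bases)


print(synthesize("GATTACA"))
```

The point of the blocking group in this sketch is that it turns a template-independent enzyme, which would otherwise keep adding bases, into a one-base-per-cycle step whose order is controlled by which nucleotide is supplied.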


E. Coli Whole Cell Model by Covert Lab / Source code overview Updated 2025-07-16


The key model database is located in the source code at reconstruction/ecoli/flat.

Let's try to understand some of the interesting-looking files, with a special focus on the tiny E. Coli K-12 MG1655 operon thrLABC part of the metabolism, which we have understood well at Section "E. Coli K-12 MG1655 operon thrLABC".

We'll realize that a lot of data and IDs come from/match BioCyc quite closely.

reconstruction/ecoli/flat/compartments.tsv contains cellular compartment information:
```
"abbrev" "id"
"n" "CCO-BAC-NUCLEOID"
"j" "CCO-CELL-PROJECTION"
"w" "CCO-CW-BAC-NEG"
"c" "CCO-CYTOSOL"
"e" "CCO-EXTRACELLULAR"
"m" "CCO-MEMBRANE"
"o" "CCO-OUTER-MEM"
"p" "CCO-PERI-BAC"
"l" "CCO-PILUS"
"i" "CCO-PM-BAC-NEG"
```
- CCO: "Cell Component Ontology", the BioCyc vocabulary for cellular compartments
- BAC-NUCLEOID: nucleoid
- CELL-PROJECTION: cell projection
- CW-BAC-NEG: TODO confirm: cell wall (of a Gram-negative bacterium)
- CYTOSOL: cytosol
- EXTRACELLULAR: outside the cell
- MEMBRANE: cell membrane
- OUTER-MEM: bacterial outer membrane
- PERI-BAC: periplasm
- PILUS: pilus
- PM-BAC-NEG: TODO: plasma membrane, but that is the same as cell membrane no?
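
A minimal sketch of how such a file could be loaded into a lookup table with Python's standard `csv` module. It assumes the fields are tab-separated and double-quoted, as the `.tsv` extension suggests. The `load_compartments` helper is a hypothetical name, and `SAMPLE` is a subset of the rows shown above rather than the real file.

```python
import csv
import io

# Subset of the rows shown above, assuming tab-separated quoted fields.
SAMPLE = (
    '"abbrev"\t"id"\n'
    '"n"\t"CCO-BAC-NUCLEOID"\n'
    '"c"\t"CCO-CYTOSOL"\n'
    '"p"\t"CCO-PERI-BAC"\n'
)


def load_compartments(handle) -> dict:
    """Map one-letter abbreviations to compartment IDs."""
    reader = csv.DictReader(handle, delimiter="\t", quotechar='"')
    return {row["abbrev"]: row["id"] for row in reader}


compartments = load_compartments(io.StringIO(SAMPLE))
print(compartments["c"])
```

With the real file, `io.StringIO(SAMPLE)` would be replaced by an open file handle on reconstruction/ecoli/flat/compartments.tsv.
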
reconstruction/ecoli/flat/promoters.tsv contains promoter information. Simple file, sample lines:
```
"position" "direction" "id" "name"
148 "+" "PM00249" "thrLp"
```
corresponds to E. Coli K-12 MG1655 promoter thrLp, which starts at position 148.
reconstruction/ecoli/flat/proteins.tsv contains protein information. Sample line corresponding to E. Coli K-12 MG1655 gene thrA:
```
"aaCount" "name" "seq" "comments" "codingRnaSeq" "mw" "location" "rnaId" "id" "geneId"
[91, 46, 38, 44, 12, 53, 30, 63, 14, 46, 89, 34, 23, 30, 29, 51, 34, 4, 20, 0, 69] "ThrA" "MRVL..." "Location information from Ecocyc dump." "AUGCGAGUGUUG..." [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 89103.51099999998, 0.0, 0.0, 0.0, 0.0] ["c"] "EG10998_RNA" "ASPKINIHOMOSERDEHYDROGI-MONOMER" "EG10998"
```
so we understand that:
- aaCount: amino acid count, how many of each proteinogenic amino acid there are. The sample vector has 21 entries rather than 20, so it presumably also covers selenocysteine
- seq: full sequence, using the single letter abbreviation of the proteinogenic amino acids
- mw: molecular weight? The 11 components appear to be given at reconstruction/ecoli/flat/scripts/unifyBulkFiles.py:
  molecular_weight_keys = ['23srRNA', '16srRNA', '5srRNA', 'tRNA', 'mRNA', 'miscRNA', 'protein', 'metabolite', 'water', 'DNA', 'RNA']  # 'RNA' is nonspecific RNA
  so they simply classify the weight? Presumably this exists for complexes that have multiple classes?
  - 23srRNA, 16srRNA, 5srRNA are the three structural RNAs present in the ribosome: 23S ribosomal RNA, 16S ribosomal RNA, 5S ribosomal RNA, all others are obvious:
  - tRNA
  - mRNA
  - protein. This is the seventh class, and this enzyme only contains mass in this class as expected.
  - metabolite
  - water
  - DNA
  - RNA: TODO rna vs miscRNA
- location: cell compartment where the protein is present, c defined at reconstruction/ecoli/flat/compartments.tsv as cytosol, as expected for something that will make an amino acid
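
The two vectors in the sample line can be sanity-checked with a few lines of Python. This is a sketch that hard-codes the values shown above. `mass_by_class` is a hypothetical helper, and the key order is assumed to be the one from `molecular_weight_keys`.

```python
# Key order assumed to match molecular_weight_keys in unifyBulkFiles.py.
MOLECULAR_WEIGHT_KEYS = [
    "23srRNA", "16srRNA", "5srRNA", "tRNA", "mRNA", "miscRNA",
    "protein", "metabolite", "water", "DNA", "RNA",
]

# Values copied from the ThrA sample line of proteins.tsv.
aa_count = [91, 46, 38, 44, 12, 53, 30, 63, 14, 46, 89,
            34, 23, 30, 29, 51, 34, 4, 20, 0, 69]
mw = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 89103.51099999998, 0.0, 0.0, 0.0, 0.0]


def mass_by_class(mw_vector):
    """Label each non-zero entry of an mw vector with its class name."""
    return {
        key: mass
        for key, mass in zip(MOLECULAR_WEIGHT_KEYS, mw_vector)
        if mass != 0.0
    }


protein_length = sum(aa_count)  # total number of residues in ThrA
print(len(aa_count), protein_length)
print(mass_by_class(mw))
```

The counts add up to 820 residues, and the only non-zero mass lands in the `protein` class, consistent with the reading of the columns given above.
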
reconstruction/ecoli/flat/rnas.tsv: TODO vs transcriptionUnits.tsv. Sample lines:
```
"halfLife" "name" "seq" "type" "modifiedForms" "monomerId" "comments" "mw" "location" "ntCount" "id" "geneId" "microarray expression"
174.0 "ThrA [RNA]" "AUGCGAGUGUUG..." "mRNA" [] "ASPKINIHOMOSERDEHYDROGI-MONOMER" "" [0.0, 0.0, 0.0, 0.0, 790935.00399999996, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] ["c"] [553, 615, 692, 603] "EG10998_RNA" "EG10998" 0.0005264904
```
- halfLife: half-life
- mw: molecular weight, same as in reconstruction/ecoli/flat/proteins.tsv. This molecule only has weight in the mRNA class, as expected, as it just codes for a protein
- location: same as in reconstruction/ecoli/flat/proteins.tsv
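- ntCount: nucleotide count for each of the four bases. Since this is RNA, they are A, C, G and U rather than ATGC (the order is not stated here). The four values sum to 2463, the length of the thrA gene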
- microarray expression: presumably refers to DNA microarray for gene expression profiling, but what measure exactly?
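
As a cross-check between the two files, the nucleotide counts should be consistent with the protein length. The sketch below hard-codes the sample values and assumes the transcript in `seq` is exactly the coding sequence, including one stop codon.

```python
nt_count = [553, 615, 692, 603]   # ntCount from the rnas.tsv sample line
mrna_mass = 790935.00399999996    # mRNA class of the mw vector
protein_length = 820              # sum of aaCount in the proteins.tsv sample

rna_length = sum(nt_count)
codons, remainder = divmod(rna_length, 3)

print(rna_length, codons, remainder)
# Assumption: one codon per amino acid plus a single stop codon.
print(codons == protein_length + 1)
# Average mass per nucleotide, in the same (unstated) unit as mw.
print(mrna_mass / rna_length)
```

The length divides evenly into codons and matches the 820 residues plus a stop codon, which supports reading `seq` as the coding sequence only.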

reconstruction/ecoli/flat/sequence.fasta: FASTA DNA sequence, first two lines:

>E. coli K-12 MG1655 U00096.2 (1 to 4639675 = 4639675 bp)
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTG
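
A minimal FASTA reader is enough to load this file without extra dependencies. The sketch below is a generic parser, not code from the repository. `read_fasta` is a hypothetical name, and `SAMPLE` contains only the two lines shown above.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


SAMPLE = """>E. coli K-12 MG1655 U00096.2 (1 to 4639675 = 4639675 bp)
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTG
"""

records = list(read_fasta(io.StringIO(SAMPLE)))
header, seq = records[0]
print(header)
print(len(seq))
```

On the full file, the length of the joined sequence would be expected to equal the 4639675 bp stated in the header.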

reconstruction/ecoli/flat/transcriptionUnits.tsv: transcription units. We can observe for example the two different transcription units of the E. Coli K-12 MG1655 operon thrLABC in the lines:
```
"expression_rate" "direction" "right" "terminator_id"  "name"    "promoter_id" "degradation_rate" "id"       "gene_id"                                   "left"
0.0               "f"         310     ["TERM0-1059"]   "thrL"    "PM00249"     0.198905992329492 "TU0-42486" ["EG11277"]                                  148
657.057317358791  "f"         5022    ["TERM_WC-2174"] "thrLABC" "PM00249"     0.231049060186648 "TU00178"   ["EG10998", "EG10999", "EG11000", "EG11277"] 148
```
- promoter_id: matches promoter id in reconstruction/ecoli/flat/promoters.tsv
- gene_id: matches id in reconstruction/ecoli/flat/genes.tsv
- id: matches exactly those used in BioCyc, which is quite nice, might be more or less standardized:
  - biocyc.org/ECOLI/NEW-IMAGE?object=TU0-42486
  - biocyc.org/ECOLI/NEW-IMAGE?type=OPERON&object=TU00178
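
A few quantities can be derived from these two rows. The sketch below hard-codes the sample values and makes two assumptions that the file itself does not confirm: that `left` and `right` are inclusive coordinates, and that `degradation_rate` is a first-order rate constant, so that half-life is ln(2) divided by the rate, in whatever time unit the file uses.

```python
import math

# Values copied from the two sample rows above.
transcription_units = [
    {"id": "TU0-42486", "name": "thrL", "left": 148, "right": 310,
     "gene_id": ["EG11277"], "degradation_rate": 0.198905992329492},
    {"id": "TU00178", "name": "thrLABC", "left": 148, "right": 5022,
     "gene_id": ["EG10998", "EG10999", "EG11000", "EG11277"],
     "degradation_rate": 0.231049060186648},
]


def span_length(tu):
    # Assumption: left and right are both inclusive.
    return tu["right"] - tu["left"] + 1


def half_life(tu):
    # Assumption: first-order decay, N(t) = N0 * exp(-k * t).
    return math.log(2) / tu["degradation_rate"]


for tu in transcription_units:
    print(tu["name"], span_length(tu), round(half_life(tu), 2))

short_tu, long_tu = transcription_units
# Every gene of the short unit is also part of the long one.
print(set(short_tu["gene_id"]) <= set(long_tu["gene_id"]))
```

Both units start at the same coordinate as promoter thrLp, and under the first-order assumption the thrLABC rate corresponds to a half-life of 3.0 time units, which hints that the value may have been derived from a half-life rather than measured directly.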

reconstruction/ecoli/flat/genes.tsv

"length" "name"                      "seq"             "rnaId"      "coordinate" "direction" "symbol" "type" "id"      "monomerId"
66       "thr operon leader peptide" "ATGAAACGCATT..." "EG11277_RNA" 189         "+"         "thrL"   "mRNA" "EG11277" "EG11277-MONOMER"
2463     "ThrA"                      "ATGCGAGTGTTG"    "EG10998_RNA" 336         "+"         "thrA"   "mRNA" "EG10998" "ASPKINIHOMOSERDEHYDROGI-MONOMER"

reconstruction/ecoli/flat/metabolites.tsv contains metabolite information. Sample lines:
```
"id"                       "mw7.2" "location"
"HOMO-SER"                 119.12  ["n", "j", "w", "c", "e", "m", "o", "p", "l", "i"]
"L-ASPARTATE-SEMIALDEHYDE" 117.104 ["n", "j", "w", "c", "e", "m", "o", "p", "l", "i"]
```
In the case of the enzyme thrA, one of the two reactions it catalyzes is "L-aspartate 4-semialdehyde" into "Homoserine".
Starting from the enzyme page: biocyc.org/gene?orgid=ECOLI&id=EG10998 we reach the reaction page: biocyc.org/ECOLI/NEW-IMAGE?type=REACTION&object=HOMOSERDEHYDROG-RXN which has reaction ID HOMOSERDEHYDROG-RXN, and that page clarifies the IDs:
- biocyc.org/compound?orgid=ECOLI&id=L-ASPARTATE-SEMIALDEHYDE: "L-aspartate 4-semialdehyde" has ID L-ASPARTATE-SEMIALDEHYDE
- biocyc.org/compound?orgid=ECOLI&id=HOMO-SER: "Homoserine" has ID HOMO-SER
so these are the compounds that we care about.

reconstruction/ecoli/flat/reactions.tsv contains chemical reaction information. Sample lines:

"reaction id" "stoichiometry" "is reversible" "catalyzed by"

"HOMOSERDEHYDROG-RXN-HOMO-SER/NAD//L-ASPARTATE-SEMIALDEHYDE/NADH/PROTON.51."
  {"NADH[c]": -1, "PROTON[c]": -1, "HOMO-SER[c]": 1, "L-ASPARTATE-SEMIALDEHYDE[c]": -1, "NAD[c]": 1}
  false
  ["ASPKINIIHOMOSERDEHYDROGII-CPLX", "ASPKINIHOMOSERDEHYDROGI-CPLX"]

"HOMOSERDEHYDROG-RXN-HOMO-SER/NADP//L-ASPARTATE-SEMIALDEHYDE/NADPH/PROTON.53."
  {"NADPH[c]": -1, "NADP[c]": 1, "PROTON[c]": -1, "L-ASPARTATE-SEMIALDEHYDE[c]": -1, "HOMO-SER[c]": 1}
  false
  ["ASPKINIIHOMOSERDEHYDROGII-CPLX", "ASPKINIHOMOSERDEHYDROGI-CPLX"]

catalyzed by: here we see ASPKINIHOMOSERDEHYDROGI-CPLX, which we can guess is a protein complex made out of ASPKINIHOMOSERDEHYDROGI-MONOMER, which is the ID for the thrA we care about! This is confirmed in complexationReactions.tsv.
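
The stoichiometry column is valid JSON, so it can be split into reactants and products by the sign of each coefficient. The sketch below uses the NAD-dependent sample row. `split_reaction` and `format_side` are hypothetical helper names, and reading negative coefficients as "consumed" is an assumption that matches the reaction described above.

```python
import json

# Stoichiometry of the NAD-dependent sample reaction shown above.
stoichiometry = json.loads(
    '{"NADH[c]": -1, "PROTON[c]": -1, "HOMO-SER[c]": 1, '
    '"L-ASPARTATE-SEMIALDEHYDE[c]": -1, "NAD[c]": 1}'
)


def split_reaction(stoich):
    """Negative coefficients are consumed, positive ones are produced."""
    reactants = {m: -c for m, c in stoich.items() if c < 0}
    products = {m: c for m, c in stoich.items() if c > 0}
    return reactants, products


def format_side(side):
    # Omit the coefficient when it is 1, as in a usual chemical equation.
    return " + ".join(
        m if c == 1 else f"{c} {m}" for m, c in sorted(side.items())
    )


reactants, products = split_reaction(stoichiometry)
print(format_side(reactants), "->", format_side(products))
```

Read this way, the row says that L-aspartate 4-semialdehyde, NADH and a proton are consumed in the cytosol to produce homoserine and NAD.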

reconstruction/ecoli/flat/complexationReactions.tsv contains information about chemical reactions that produce protein complexes:
```
"process" "stoichiometry" "id" "dir"
"complexation"
  [
    {
      "molecule": "ASPKINIHOMOSERDEHYDROGI-CPLX",
      "coeff": 1,
      "type": "proteincomplex",
      "location": "c",
      "form": "mature"
    },
    {
      "molecule": "ASPKINIHOMOSERDEHYDROGI-MONOMER",
      "coeff": -4,
      "type": "proteinmonomer",
      "location": "c",
      "form": "mature"
    }
  ]
"ASPKINIHOMOSERDEHYDROGI-CPLX_RXN"
1
```
The coeff is how many molecules take part in the reaction: the -4 on the monomer means that four monomers need to get together to form the final complex. This agrees with the Summary section of ecocyc.org/gene?orgid=ECOLI&id=ASPKINIHOMOSERDEHYDROGI-MONOMER:
Aspartate kinase I / homoserine dehydrogenase I comprises a dimer of ThrA dimers. Although the dimeric form is catalytically active, the binding equilibrium dramatically favors the tetrameric form. The aspartate kinase and homoserine dehydrogenase activities of each ThrA monomer are catalyzed by independent domains connected by a linker region.
Fantastic literature summary! Can't find that in database form there however.
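
The tetramer stoichiometry can also be checked against the masses in the flat files. This sketch hard-codes the `protein` class of the `mw` vectors that proteins.tsv lists for the monomer and proteinComplexes.tsv lists for the complex, and assumes the complex mass is simply the sum of its parts.

```python
import math

monomer_protein_mass = 89103.51099999998   # proteins.tsv, ThrA monomer
complex_protein_mass = 356414.04399999994  # proteinComplexes.tsv, complex
monomer_coeff = -4                         # complexationReactions.tsv

# Assumption: complex mass = number of monomers consumed * monomer mass.
expected_complex_mass = -monomer_coeff * monomer_protein_mass

print(expected_complex_mass, complex_protein_mass)
print(math.isclose(expected_complex_mass, complex_protein_mass, rel_tol=1e-9))
```

The two numbers agree, so the mass of the complex in the database is consistent with four monomers per complex.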

reconstruction/ecoli/flat/proteinComplexes.tsv contains protein complex information:

"name" "comments" "mw" "location" "reactionId" "id"
"aspartate kinase / homoserine dehydrogenase"
""
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 356414.04399999994, 0.0, 0.0, 0.0, 0.0]
["c"]
"ASPKINIHOMOSERDEHYDROGI-CPLX_RXN"
"ASPKINIHOMOSERDEHYDROGI-CPLX"

reconstruction/ecoli/flat/protein_half_lives.tsv contains the half-life of proteins. Very few proteins are listed however for some reason.

reconstruction/ecoli/flat/tfIds.csv: transcription factor information:

"TF"   "geneId"  "oneComponentId"  "twoComponentId" "nonMetaboliteBindingId" "activeId" "notes"
"arcA" "EG10061" "PHOSPHO-ARCA"    "PHOSPHO-ARCA"
"fnr"  "EG10325" "FNR-4FE-4S-CPLX" "FNR-4FE-4S-CPLX"
"dksA" "EG10230"


Mycoplasma genitalium Updated 2025-07-16


www.lgcstandards-atcc.org/products/all/49896.aspx:

£355.00 in 2019
biosafety level: 2

Size: 300 x 600 nm

Reproduction time: www.quora.com/unanswered/How-long-do-Mycoplasma-bacteria-take-to-reproduce-under-optimal-conditions

Has one of the smallest genomes known, and JCVI made a minimized strain with 473 genes: JCVI-syn3.0.

The reason why genitalium has such a small genome is that parasites tend to have smaller genomes. So it must be highlighted that genitalium can only survive in highly enriched environments: it can't even make its own amino acids, which it normally obtains from the host cells! And because it cannot do cellular respiration, it very likely replicates more slowly than, say, E. Coli. It's easy to be small in such scenarios!

Power, Sex, Suicide by Nick Lane (2006) section "How to lose the cell wall without dying" page 184 puts it very well:

One group, the Mycoplasma, comprises mostly parasites, many of which live inside other cells. Mycoplasma cells are tiny, with very small genomes. M. genitalium, discovered in 1981, has the smallest known genome of any bacterial cell, encoding fewer than  genes. Despite its simplicity, it ranks among the most common of sexually transmitted diseases, producing symptoms similar to Chlamydia infection. It is so small (less than a third of a micron in diameter, or an order of magnitude smaller than most bacteria) that it must normally be viewed under the electron microscope; and difficulties culturing it meant its significance was not appreciated until the important advances in gene sequencing in the early 1990s. Like Rickettsia, Mycoplasma have lost virtually all the genes required for making nucleotides, amino acids, and so forth. Unlike Rickettsia, however, Mycoplasma have also lost all the genes for oxygen respiration, or indeed any other form of membrane respiration: they have no cytochromes, and so must rely on fermentation for energy.

Downsides mentioned at youtu.be/PSDd3oHj548?t=293:

too small to see on light microscope
difficult to genetically manipulate. TODO why?
less literature than E. Coli.

Data:

www.ncbi.nlm.nih.gov/bioproject/97 contains genome, genes, proteins.
www.genome.jp/kegg-bin/show_pathway?mge01100 all known pathways. TODO: numerical reaction coefficients? Which enzymes mediate what? Appears to factor pathways across organisms, which is awesome.
