Ciro Santilli @cirosantilli 37

 Incoming links: RNA

E. Coli K-12 MG1655 gene thrL Updated 2025-07-16

UniProt entry: www.uniprot.org/uniprot/P0AD86.

NCBI gene entry: www.ncbi.nlm.nih.gov/gene/944742.

The first gene in the E. Coli K-12 MG1655 genome. Remember, however, that the bacterial chromosome is circular, so being the first doesn't mean much. How the starting point was chosen is covered at: Section "E. Coli genome starting point".

Part of E. Coli K-12 MG1655 operon thrLABC.

At only 66 bp, this gene is quite small and boring. For a more interesting gene, have a look at the next gene, E. Coli K-12 MG1655 gene thrA.

Has something to do with threonine: it codes for the thr operon leader peptide.

This is the first in the sequence thrL, thrA, thrB, thrC. This type of naming convention is quite common for related adjacent genes, all of which are presumably transcribed into a single RNA from the same promoter. As mentioned in the analysis of the KEGG entry for E. Coli K-12 MG1655 gene thrA, A, B and C are directly functionally linked in a single metabolic pathway.

We can see that thrL, A, B, and C are in the same transcription unit by browsing the list of promoters at: biocyc.org/group?id=:ALL-PROMOTERS&orgid=ECOLI. By finding the first one by position we reach: biocyc.org/ECOLI/NEW-IMAGE?object=TU0-42486.


E. Coli Whole Cell Model by Covert Lab / Source code overview Updated 2025-07-16

The key model database is located in the source code at reconstruction/ecoli/flat.

Let's try to understand some of the interesting looking files, with a special focus on the tiny E. Coli K-12 MG1655 operon thrLABC part of the metabolism, which we have well understood at Section "E. Coli K-12 MG1655 operon thrLABC".

We'll realize that a lot of data and IDs come from/match BioCyc quite closely.

reconstruction/ecoli/flat/compartments.tsv contains cellular compartment information:
```
"abbrev" "id"
"n" "CCO-BAC-NUCLEOID"
"j" "CCO-CELL-PROJECTION"
"w" "CCO-CW-BAC-NEG"
"c" "CCO-CYTOSOL"
"e" "CCO-EXTRACELLULAR"
"m" "CCO-MEMBRANE"
"o" "CCO-OUTER-MEM"
"p" "CCO-PERI-BAC"
"l" "CCO-PILUS"
"i" "CCO-PM-BAC-NEG"
```
- CCO: "Cellular COmpartment"
- BAC-NUCLEOID: nucleoid
- CELL-PROJECTION: cell projection
- CW-BAC-NEG: TODO confirm: cell wall (of a Gram-negative bacterium)
- CYTOSOL: cytosol
- EXTRACELLULAR: outside the cell
- MEMBRANE: cell membrane
- OUTER-MEM: bacterial outer membrane
- PERI-BAC: periplasm
- PILUS: pilus
- PM-BAC-NEG: TODO: plasma membrane, but that is the same as cell membrane no?
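
The abbreviation table can be loaded into a lookup dictionary. The following is a minimal sketch, assuming that the real file is tab-separated with double-quoted fields (the listing above is shown with spaces). `SAMPLE` and `load_compartments` are names made up for this illustration and only a few rows are reproduced.

```python
import csv
import io

# Assumption: the real file is tab-separated with double-quoted fields.
# SAMPLE reproduces three rows of the listing above for illustration.
SAMPLE = (
    '"abbrev"\t"id"\n'
    '"c"\t"CCO-CYTOSOL"\n'
    '"p"\t"CCO-PERI-BAC"\n'
    '"i"\t"CCO-PM-BAC-NEG"\n'
)

def load_compartments(text):
    """Return a dict mapping one-letter abbreviation to compartment ID."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quotechar='"')
    return {row["abbrev"]: row["id"] for row in reader}

compartments = load_compartments(SAMPLE)
print(compartments["c"])
```
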
reconstruction/ecoli/flat/promoters.tsv contains promoter information. Simple file, sample lines:
```
"position" "direction" "id" "name"
148 "+" "PM00249" "thrLp"
```
corresponds to E. Coli K-12 MG1655 promoter thrLp, which starts at position 148.
reconstruction/ecoli/flat/proteins.tsv contains protein information. Sample line corresponding to E. Coli K-12 MG1655 gene thrA:
```
"aaCount" "name" "seq" "comments" "codingRnaSeq" "mw" "location" "rnaId" "id" "geneId"
[91, 46, 38, 44, 12, 53, 30, 63, 14, 46, 89, 34, 23, 30, 29, 51, 34, 4, 20, 0, 69] "ThrA" "MRVL..." "Location information from Ecocyc dump." "AUGCGAGUGUUG..." [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 89103.51099999998, 0.0, 0.0, 0.0, 0.0] ["c"] "EG10998_RNA" "ASPKINIHOMOSERDEHYDROGI-MONOMER" "EG10998"
```
so we understand that:
- aaCount: amino acid count, how many of each proteinogenic amino acid there are. The sample has 21 entries rather than 20, so presumably selenocysteine is included as well (TODO confirm)
- seq: full sequence, using the single letter abbreviation of the proteinogenic amino acids
- mw: molecular weight? The 11 components appear to be given at reconstruction/ecoli/flat/scripts/unifyBulkFiles.py:
  molecular_weight_keys = ['23srRNA', '16srRNA', '5srRNA', 'tRNA', 'mRNA', 'miscRNA', 'protein', 'metabolite', 'water', 'DNA', 'RNA']  # 'RNA' is nonspecific RNA
  so they simply classify the weight? Presumably this exists for complexes that have multiple classes?
  - 23srRNA, 16srRNA, 5srRNA are the three structural RNAs present in the ribosome: 23S ribosomal RNA, 16S ribosomal RNA and 5S ribosomal RNA. All others are obvious:
  - tRNA
  - mRNA
  - protein. This is the seventh class, and this enzyme only contains mass in this class as expected.
  - metabolite
  - water
  - DNA
  - RNA: TODO rna vs miscRNA
- location: cell compartment where the protein is present. c is defined at reconstruction/ecoli/flat/compartments.tsv as cytosol, as expected for something that will make an amino acid
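
The aaCount and mw fields of the ThrA sample line can be cross-checked with a few lines of Python. This is only a sketch over values copied from the sample above. Pairing the 11 mw entries with `molecular_weight_keys` in the listed order is the interpretation suggested in the text, not something confirmed against the model's code.

```python
# Values copied from the ThrA sample line above.
aa_count = [91, 46, 38, 44, 12, 53, 30, 63, 14, 46, 89,
            34, 23, 30, 29, 51, 34, 4, 20, 0, 69]
mw = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 89103.51099999998, 0.0, 0.0, 0.0, 0.0]

# Key order as listed in unifyBulkFiles.py (assumed to match the mw vector).
molecular_weight_keys = [
    '23srRNA', '16srRNA', '5srRNA', 'tRNA', 'mRNA', 'miscRNA',
    'protein', 'metabolite', 'water', 'DNA', 'RNA',
]

protein_length = sum(aa_count)  # total number of residues
# Keep only the classes that actually carry mass.
mass_by_class = {k: w for k, w in zip(molecular_weight_keys, mw) if w != 0.0}

print(len(aa_count), protein_length)
print(mass_by_class)
```

The counts add up to 820 residues spread over 21 entries, and the only class with mass is `protein`.
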
reconstruction/ecoli/flat/rnas.tsv: TODO vs transcriptionUnits.tsv. Sample lines:
```
"halfLife" "name" "seq" "type" "modifiedForms" "monomerId" "comments" "mw" "location" "ntCount" "id" "geneId" "microarray expression"
174.0 "ThrA [RNA]" "AUGCGAGUGUUG..." "mRNA" [] "ASPKINIHOMOSERDEHYDROGI-MONOMER" "" [0.0, 0.0, 0.0, 0.0, 790935.00399999996, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] ["c"] [553, 615, 692, 603] "EG10998_RNA" "EG10998" 0.0005264904
```
- halfLife: half-life
- mw: molecular weight, same as in reconstruction/ecoli/flat/proteins.tsv. This molecule only has weight in the mRNA class, as expected, as it just codes for a protein
- location: same as in reconstruction/ecoli/flat/proteins.tsv
- ntCount: nucleotide count for each of the four bases (with U instead of T, since this is RNA). TODO confirm the order. The sample counts sum to 2463, the length of the thrA gene
- microarray expression: presumably refers to DNA microarray for gene expression profiling, but what measure exactly?
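
A quick consistency check of the rnas.tsv sample line, as a sketch using only values copied from it. Which base each ntCount entry corresponds to is not assumed here, so only totals and fractions are computed.

```python
nt_count = [553, 615, 692, 603]   # copied from the sample line
mrna_mass = 790935.00399999996    # the only non-zero entry of "mw"

length = sum(nt_count)                       # total nucleotides in the RNA
fractions = [n / length for n in nt_count]   # share of each base
mass_per_nt = mrna_mass / length             # average mass per nucleotide

print(length)
print([round(f, 3) for f in fractions])
print(round(mass_per_nt, 1))
```

The total comes out at 2463 nucleotides, and the average mass per nucleotide is a little over 320, which is in the range one would expect for an RNA if the masses are in g/mol.
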

reconstruction/ecoli/flat/sequence.fasta: FASTA DNA sequence, first two lines:

>E. coli K-12 MG1655 U00096.2 (1 to 4639675 = 4639675 bp)
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTG
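
A FASTA file like this one is simple enough to read without any library. The sketch below parses the two sample lines quoted above. `read_fasta` and the regular expression used to pull the declared length out of this particular header style are illustrative choices, not part of the whole cell model code.

```python
import re

# The two sample lines quoted above; the real file has many more sequence lines.
FASTA_SAMPLE = """>E. coli K-12 MG1655 U00096.2 (1 to 4639675 = 4639675 bp)
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTG
"""

def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

header, seq = next(read_fasta(FASTA_SAMPLE))
# Hypothetical pattern: extract the length declared in this header style.
declared_bp = int(re.search(r"= (\d+) bp", header).group(1))
print(declared_bp, len(seq), sorted(set(seq)))
```
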

reconstruction/ecoli/flat/transcriptionUnits.tsv: transcription units. We can observe for example the two different transcription units of the E. Coli K-12 MG1655 operon thrLABC in the lines:
```
"expression_rate" "direction" "right" "terminator_id"  "name"    "promoter_id" "degradation_rate" "id"       "gene_id"                                   "left"
0.0               "f"         310     ["TERM0-1059"]   "thrL"    "PM00249"     0.198905992329492 "TU0-42486" ["EG11277"]                                  148
657.057317358791  "f"         5022    ["TERM_WC-2174"] "thrLABC" "PM00249"     0.231049060186648 "TU00178"   ["EG10998", "EG10999", "EG11000", "EG11277"] 148
```
- promoter_id: matches promoter id in reconstruction/ecoli/flat/promoters.tsv
- gene_id: matches id in reconstruction/ecoli/flat/genes.tsv
- id: matches exactly those used in BioCyc, which is quite nice, might be more or less standardized:
  - biocyc.org/ECOLI/NEW-IMAGE?object=TU0-42486
  - biocyc.org/ECOLI/NEW-IMAGE?type=OPERON&object=TU00178
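
The two sample rows can be compared numerically. This is a sketch under two assumptions that the file itself does not state: that left and right are inclusive coordinates, and that degradation_rate is a first-order rate constant, so that the half-life is ln(2) divided by the rate.

```python
import math

# Fields copied from the two sample rows above.
units = [
    {"id": "TU0-42486", "name": "thrL", "left": 148, "right": 310,
     "degradation_rate": 0.198905992329492, "gene_id": ["EG11277"]},
    {"id": "TU00178", "name": "thrLABC", "left": 148, "right": 5022,
     "degradation_rate": 0.231049060186648,
     "gene_id": ["EG10998", "EG10999", "EG11000", "EG11277"]},
]

for tu in units:
    # Assumption: left and right are inclusive coordinates.
    tu["length"] = tu["right"] - tu["left"] + 1
    # Assumption: first-order decay, so half-life = ln(2) / rate (unit unknown).
    tu["half_life"] = math.log(2) / tu["degradation_rate"]
    print(tu["name"], tu["length"], round(tu["half_life"], 3))

# Genes present in both transcription units.
shared = set(units[0]["gene_id"]) & set(units[1]["gene_id"])
print(shared)
```

Under those assumptions the thrLABC rate corresponds to a half-life of 3.0 in whatever time unit the file uses, and the only gene shared by both units is EG11277 (thrL).
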

reconstruction/ecoli/flat/genes.tsv contains gene information. Sample lines:

"length" "name"                      "seq"             "rnaId"       "coordinate" "direction" "symbol" "type" "id"      "monomerId"
66       "thr operon leader peptide" "ATGAAACGCATT..." "EG11277_RNA" 189          "+"         "thrL"   "mRNA" "EG11277" "EG11277-MONOMER"
2463     "ThrA"                      "ATGCGAGTGTTG..." "EG10998_RNA" 336          "+"         "thrA"   "mRNA" "EG10998" "ASPKINIHOMOSERDEHYDROGI-MONOMER"
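
The gene rows tie in with the protein and RNA files. The sketch below assumes that length counts the stop codon, so a gene of 3n bases codes for n - 1 amino acids, and it checks that replacing T with U in the DNA prefix gives the codingRnaSeq prefix seen in proteins.tsv. The `genes` dictionary and `transcribe` are illustrative names.

```python
# Values copied from the two sample rows above (sequence prefixes only).
genes = {
    "thrL": {"length": 66, "coordinate": 189, "seq_prefix": "ATGAAACGCATT"},
    "thrA": {"length": 2463, "coordinate": 336, "seq_prefix": "ATGCGAGTGTTG"},
}

def transcribe(dna):
    """Coding-strand DNA to mRNA: same sequence with T replaced by U."""
    return dna.replace("T", "U")

for name, g in genes.items():
    codons, remainder = divmod(g["length"], 3)
    # Assumption: one codon is the stop codon, so the peptide has
    # codons - 1 residues.
    print(name, codons - 1, remainder, transcribe(g["seq_prefix"]))
```

For thrA this gives 820 residues, which matches the sum of aaCount in the ThrA protein sample, and the RNA prefix AUGCGAGUGUUG matches its codingRnaSeq.
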

reconstruction/ecoli/flat/metabolites.tsv contains metabolite information. Sample lines:
```
"id"                       "mw7.2" "location"
"HOMO-SER"                 119.12  ["n", "j", "w", "c", "e", "m", "o", "p", "l", "i"]
"L-ASPARTATE-SEMIALDEHYDE" 117.104 ["n", "j", "w", "c", "e", "m", "o", "p", "l", "i"]
```
In the case of the enzyme thrA, one of the two reactions it catalyzes is "L-aspartate 4-semialdehyde" into "Homoserine".
Starting from the enzyme page: biocyc.org/gene?orgid=ECOLI&id=EG10998 we reach the reaction page: biocyc.org/ECOLI/NEW-IMAGE?type=REACTION&object=HOMOSERDEHYDROG-RXN which has reaction ID HOMOSERDEHYDROG-RXN, and that page clarifies the IDs:
- biocyc.org/compound?orgid=ECOLI&id=L-ASPARTATE-SEMIALDEHYDE: "L-aspartate 4-semialdehyde" has ID L-ASPARTATE-SEMIALDEHYDE
- biocyc.org/compound?orgid=ECOLI&id=HOMO-SER: "Homoserine" has ID HOMO-SER
so these are the compounds that we care about.

reconstruction/ecoli/flat/reactions.tsv contains chemical reaction information. Sample lines:

"reaction id" "stoichiometry" "is reversible" "catalyzed by"

"HOMOSERDEHYDROG-RXN-HOMO-SER/NAD//L-ASPARTATE-SEMIALDEHYDE/NADH/PROTON.51."
  {"NADH[c]": -1, "PROTON[c]": -1, "HOMO-SER[c]": 1, "L-ASPARTATE-SEMIALDEHYDE[c]": -1, "NAD[c]": 1}
  false
  ["ASPKINIIHOMOSERDEHYDROGII-CPLX", "ASPKINIHOMOSERDEHYDROGI-CPLX"]

"HOMOSERDEHYDROG-RXN-HOMO-SER/NADP//L-ASPARTATE-SEMIALDEHYDE/NADPH/PROTON.53."
  {"NADPH[c]": -1, "NADP[c]": 1, "PROTON[c]": -1, "L-ASPARTATE-SEMIALDEHYDE[c]": -1, "HOMO-SER[c]": 1}
  false
  ["ASPKINIIHOMOSERDEHYDROGII-CPLX", "ASPKINIHOMOSERDEHYDROGI-CPLX"]

catalyzed by: here we see ASPKINIHOMOSERDEHYDROGI-CPLX, which we can guess is a protein complex made out of ASPKINIHOMOSERDEHYDROGI-MONOMER, which is the ID for the thrA we care about! This is confirmed in complexationReactions.tsv.
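
The stoichiometry column is a JSON object, so it can be split into reactants and products directly. The sketch below assumes the usual sign convention (negative coefficients are consumed, positive ones are produced). `split_reaction` and `strip_compartment` are helper names made up for this example.

```python
import json

# Stoichiometry of the NAD-dependent sample reaction above.
stoich_json = ('{"NADH[c]": -1, "PROTON[c]": -1, "HOMO-SER[c]": 1, '
               '"L-ASPARTATE-SEMIALDEHYDE[c]": -1, "NAD[c]": 1}')

def split_reaction(stoichiometry):
    """Split a {species: coefficient} dict into reactants and products."""
    # Assumption: negative coefficients are consumed, positive are produced.
    reactants = {m: -c for m, c in stoichiometry.items() if c < 0}
    products = {m: c for m, c in stoichiometry.items() if c > 0}
    return reactants, products

def strip_compartment(species):
    """Drop the trailing compartment tag, e.g. 'NAD[c]' becomes 'NAD'."""
    return species.split("[", 1)[0]

reactants, products = split_reaction(json.loads(stoich_json))
print(" + ".join(map(strip_compartment, reactants)), "->",
      " + ".join(map(strip_compartment, products)))
```

Read this way, the reaction consumes L-aspartate 4-semialdehyde, NADH and a proton, and produces homoserine and NAD.
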

reconstruction/ecoli/flat/complexationReactions.tsv contains information about chemical reactions that produce protein complexes:
```
"process" "stoichiometry" "id" "dir"
"complexation"
  [
    {
      "molecule": "ASPKINIHOMOSERDEHYDROGI-CPLX",
      "coeff": 1,
      "type": "proteincomplex",
      "location": "c",
      "form": "mature"
    },
    {
      "molecule": "ASPKINIHOMOSERDEHYDROGI-MONOMER",
      "coeff": -4,
      "type": "proteinmonomer",
      "location": "c",
      "form": "mature"
    }
  ]
"ASPKINIHOMOSERDEHYDROGI-CPLX_RXN"
1
```
The coeff is how many monomers need to get together to form the final complex: the -4 means that four monomers are consumed per complex produced. This can be seen from the Summary section of ecocyc.org/gene?orgid=ECOLI&id=ASPKINIHOMOSERDEHYDROGI-MONOMER:
Aspartate kinase I / homoserine dehydrogenase I comprises a dimer of ThrA dimers. Although the dimeric form is catalytically active, the binding equilibrium dramatically favors the tetrameric form. The aspartate kinase and homoserine dehydrogenase activities of each ThrA monomer are catalyzed by independent domains connected by a linker region.
Fantastic literature summary! Can't find that in database form there however.
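
The tetramer stoichiometry can also be cross-checked against the masses in the flat files: the ThrA monomer mass from proteins.tsv and the complex mass listed in proteinComplexes.tsv. This sketch assumes that negative coefficients mark consumed subunits and that the complex mass is just the sum of its subunits.

```python
monomer_mw = 89103.51099999998    # ThrA "protein" mass from proteins.tsv
complex_mw = 356414.04399999994   # "protein" mass from proteinComplexes.tsv

# Stoichiometry entries from complexationReactions.tsv, trimmed to two fields.
stoichiometry = [
    {"molecule": "ASPKINIHOMOSERDEHYDROGI-CPLX", "coeff": 1},
    {"molecule": "ASPKINIHOMOSERDEHYDROGI-MONOMER", "coeff": -4},
]

# Assumption: negative coefficients are the subunits consumed per complex.
subunits = {s["molecule"]: -s["coeff"] for s in stoichiometry if s["coeff"] < 0}
expected_mw = sum(n * monomer_mw for n in subunits.values())

print(subunits)
print(abs(expected_mw - complex_mw) < 1e-6)
```

Four times the monomer mass reproduces the complex mass, so the database entries are consistent with the tetramer described in the EcoCyc summary.
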

reconstruction/ecoli/flat/proteinComplexes.tsv contains protein complex information:

"name" "comments" "mw" "location" "reactionId" "id"
"aspartate kinase / homoserine dehydrogenase"
""
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 356414.04399999994, 0.0, 0.0, 0.0, 0.0]
["c"]
"ASPKINIHOMOSERDEHYDROGI-CPLX_RXN"
"ASPKINIHOMOSERDEHYDROGI-CPLX"

reconstruction/ecoli/flat/protein_half_lives.tsv contains the half-life of proteins. Very few proteins are listed however for some reason.

reconstruction/ecoli/flat/tfIds.csv: transcription factor information:

"TF"   "geneId"  "oneComponentId"  "twoComponentId" "nonMetaboliteBindingId" "activeId" "notes"
"arcA" "EG10061" "PHOSPHO-ARCA"    "PHOSPHO-ARCA"
"fnr"  "EG10325" "FNR-4FE-4S-CPLX" "FNR-4FE-4S-CPLX"
"dksA" "EG10230"


Half-life Updated 2025-07-16

The half-life of radioactive decay, which was discovered a few years before quantum mechanics was discovered and matured, was a major mystery. Why do some nuclei decay in apparently random fashion, while others don't? How is the state of different nuclei different from one another? This is mentioned in Inward Bound by Abraham Pais (1988) Chapter 6.e "Why a half-life?".

The term also sees use in other areas, notably biology, where e.g. RNAs spontaneously decay as part of the cell's control system, see e.g. mentions in E. Coli Whole Cell Model by Covert Lab.
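
In both the nuclear and the biological setting, a half-life describes first-order decay: after every half-life, half of what was there is gone. A minimal sketch follows. The value 174.0 is the halfLife quoted for the ThrA mRNA in the whole cell model data, and treating it as seconds is an assumption.

```python
import math

def decay_constant(half_life):
    """First-order rate constant k such that N(t) = N0 * exp(-k * t)."""
    return math.log(2) / half_life

def remaining_fraction(t, half_life):
    """Fraction of the initial population left after time t."""
    return 0.5 ** (t / half_life)

# Illustrative value; the time unit (seconds?) is an assumption.
half_life = 174.0
k = decay_constant(half_life)
print(k)
for t in (0.0, half_life, 2 * half_life):
    print(t, remaining_fraction(t, half_life))
```

The two formulations are equivalent: `math.exp(-k * t)` gives the same fraction as `remaining_fraction(t, half_life)` up to floating point error.
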


How to use an Oxford Nanopore MinION to extract DNA from river water and determine which bacteria live in it / Overview of the experiment Updated 2025-07-16

For those that know biology and just want to do the thing, see: Section "Protocols used".

The PuntSeq team uses an Oxford Nanopore MinION DNA sequencer made by Oxford Nanopore Technologies to sequence the 16S region of bacterial DNA, which is about 1500 nucleotides long.

This kind of "decode everything from the sample to see what species are present" approach is called "metagenomics".

This is what the MinION looks like: Figure 1. "Oxford Nanopore MinION top".

Figure 1. Oxford Nanopore MinION top. Source.

Figure 2. Oxford Nanopore MinION side. Source.

Figure 3. Oxford Nanopore MinION top open. Source.

Figure 4. Oxford Nanopore MinION side USB. Source.

The 16S region codes for one of the RNA pieces that makes the bacterial ribosome.

Before sequencing the DNA, we will do a PCR with primers that fit just before and just after the 16S DNA, in well conserved regions expected to be present in all bacteria.

The PCR replicates only the DNA region between our two selected primers a gazillion times so that only those regions will actually get picked up by the sequencing step in practice.

Eukaryotes also have an analogous ribosome part, the 18S region, but the PCR primers are selected for targets around the 16S region which are only present in prokaryotes.

This way, we amplify only the 16S region of bacteria, excluding other parts of bacterial genome, and excluding eukaryotes entirely.
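
The primer selection logic can be sketched as an "in silico PCR": find the forward primer on the template, find the site where the reverse primer anneals, and keep everything in between. The sequences below are made-up toy strings, not real 16S primers, and exact matching is a simplification since real primers tolerate some mismatches.

```python
def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[b] for b in reversed(dna))

def in_silico_pcr(template, forward, reverse):
    """Return the region from the forward primer site to the reverse primer
    site (both included), or None if either primer does not match exactly."""
    start = template.find(forward)
    if start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so on the template
    # strand we look for its reverse complement.
    site = reverse_complement(reverse)
    end = template.find(site, start + len(forward))
    if end == -1:
        return None
    return template[start:end + len(site)]

# Made-up toy sequences, not real 16S primers.
template = "TTTT" + "AGAGTT" + "CCCGGGAAACCC" + "GGTTAC" + "AAAA"
forward, reverse = "AGAGTT", "GTAACC"
amplicon = in_silico_pcr(template, forward, reverse)
no_match = in_silico_pcr("TTTTCCCGGGAAAA", forward, reverse)
print(amplicon)
print(no_match)
```

A template without the primer sites, standing in here for eukaryotic DNA or an unrelated region of the genome, yields `None` and would not be amplified.
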

Despite coding for such a fundamental piece of RNA, there is still surprising variability in the 16S region across different bacteria, and it is those differences that will allow us to identify which bacteria are present in the river.

The variability exists because certain base pairs are not fundamental for the function of the 16S region. This variability happens mostly on RNA loops as opposed to stems, i.e. parts of the RNA that don't base pair with other RNA in the RNA secondary structure as shown at: Code 1. "RNA stem-loop structure".

                A-U
               /   \
A-U-C-G-A-U-C-G     C
| | | | | | | |     |
U-A-G-C-U-A-G-C     G
               \   /
                U-A
|             ||    |
+-------------++----+
    stem        loop

Code 1. RNA stem-loop structure.
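
The stem of the diagram can be checked programmatically: every position of the upper strand must pair with the base drawn below it. This sketch only accepts Watson-Crick pairs, which is a simplification since real RNA also forms G-U wobble pairs. `is_stem` is a name made up for this example.

```python
# Watson-Crick pairs only; G-U wobble pairs are ignored in this sketch.
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}

def is_stem(strand_a, strand_b):
    """True if the two strands, both read left to right as drawn in the
    diagram, form a base pair at every position."""
    return len(strand_a) == len(strand_b) and all(
        (a, b) in PAIRS for a, b in zip(strand_a, strand_b))

top = "AUCGAUCG"      # upper strand of the stem in the diagram
bottom = "UAGCUAGC"   # lower strand, left to right as drawn

print(is_stem(top, bottom))
print(is_stem("AUCGAUCG", "UAGCUAGG"))  # last position is G against G
```

A mutation in the stem breaks a pair unless the partner base changes too, whereas loop bases have no partner to disagree with.
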

This is what the 16S RNA secondary structure looks like in its full glory: Figure 5. "16S RNA secondary structure".

Figure 5. 16S RNA secondary structure. Source.

Since loops don't base pair, they are less crucial in the determination of the secondary structure of the RNA.

The variability is such that it is possible to tell individual species apart if full sequences are known with certainty.

Given the limitations of this experiment, however, we would only be able to obtain family or genus level breakdowns.


Physics and the illusion of life Updated 2025-07-16

The natural sciences are not just a tool to predict the future.

They are a reminder that the lives that we live daily are mere illusions. Religious concepts such as Maya and Samsara come to mind.

We as individuals perceive nothing about how the materials that we touch every day really work, nor, more importantly, how our brains and cells work.

Everything is magic out of our control.

The natural sciences allow us to peek, with huge concentrated effort, into tiny little bits of those unknowns, and blow our minds as we notice that we don't know anything.

For all practical purposes in life, there is a huge macro micro gap. We are only able to directly perceive and influence the macro events. And through those we try to affect micro events. Because for good or bad, micro events reflect in the macro world.

It is as if we live in a different plane of existence above molecules, and below galaxies. The hierarchy of Figure "xkcd 435: Fields arranged by purity" puts that nicely into perspective, shame it only starts at the economical level, not going up to astronomy.

The great beauty of science is that it allows us to puncture through some of the layers of reality, either up or down, away from our daily experience.

And the great beauty of artificial intelligence research is that it allows us to peer deeper into exactly our layer of existence.

Every one or two weeks Ciro Santilli remembers that he and everything he touches are just a bunch of atoms, and that is an amazing feeling. This is Ciro's preferred source of Great doubt. Another concept that comes to mind is when you see it, you'll shit bricks.

Perhaps, the feeling of physics and the illusion of life reaches its peak in molecular biology.

Just look at your fucking hand right now.

Do you have any idea of how each of the cells in it works? Isn't it at least 100 times more complex than the materials of the table your hand is currently resting on?

This is the non-science fiction version of the Lotus-Eater Machine.

Alan Watts's "Philosopher" talk mentions related ideas:

The origin of a person who is defined as a philosopher, is one who finds that existence itself is exceedingly odd.

The toddler of a friend of Ciro Santilli's wife asked her mum:

Why doesn't my tiger doll close its eyes when we sleep?

Our perception of the macroscopic world is so magic that children have to learn the difference between living and non-living things.

James Somers put it very well in his article "I should have loved biology". This quote was brought to Ciro's attention by Bert Hubert's website ^[ref].

I should have loved biology but I found it to be a lifeless recitation of names: the Golgi apparatus and the Krebs cycle; mitosis, meiosis; DNA, RNA, mRNA, tRNA.
In the textbooks, astonishing facts were presented without astonishment. Someone probably told me that every cell in my body has the same DNA. But no one shook me by the shoulders, saying how crazy that was. I needed Lewis Thomas, who wrote in The Medusa and the Snail:
For the real amazement, if you wish to be amazed, is this process. You start out as a single cell derived from the coupling of a sperm and an egg; this divides in two, then four, then eight, and so on, and at a certain stage there emerges a single cell which has as all its progeny the human brain. The mere existence of such a cell should be one of the great astonishments of the earth. People ought to be walking around all day, all through their waking hours calling to each other in endless wonderment, talking of nothing except that cell.

The same applies to other natural sciences.

Alan Watts' "Philosopher" talk (1973)

Source. Lecture given at UCLA on 1973-02-21. Some key quotes from the talk:

The origin of a person who is defined as a philosopher, is one who finds that existence itself is exceedingly odd.

A transcript at: www.organism.earth/library/document/clarity-of-mind


Positive-strand RNA virus Updated 2025-07-16

It just has RNA that can be translated directly by the host ribosome.


Protein degradation Updated 2025-07-16

Proteins also have a half-life, much like RNA, but it tends to be longer.

www.ncbi.nlm.nih.gov/books/NBK9957/


Retrovirus Updated 2025-07-16

Integrates a DNA copy of its RNA genome into the host genome:
- first converts the RNA to DNA with reverse transcriptase
- then inserts that DNA into the host genome with integrase

Sounds complicated! The advantage is likely as in HIV: once inside the cell, it can remain hidden far away from the cell surface, but still infectious.


Reverse transcriptase Updated 2025-07-16

Converts RNA to DNA, i.e. the inverse of transcription. Found in viruses such as Retrovirus, which includes e.g. HIV.
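
At the sequence level, "the inverse of transcription" can be sketched in a few lines: reverse transcription builds the complementary DNA (cDNA) strand of an RNA, and transcribing that cDNA as a template gives the original RNA back. Both function names are made up for this illustration, and the input is just a short sample string.

```python
def reverse_transcribe(rna):
    """Return the cDNA strand complementary to an RNA, written 5' to 3'."""
    pairs = {"A": "T", "U": "A", "C": "G", "G": "C"}
    return "".join(pairs[b] for b in reversed(rna))

def transcribe_template(template_dna):
    """Return the RNA made from a DNA template strand, written 5' to 3'."""
    pairs = {"A": "U", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[b] for b in reversed(template_dna))

rna = "AUGCGAGUGUUG"  # short sample input
cdna = reverse_transcribe(rna)
print(cdna)
print(transcribe_template(cdna) == rna)
```
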


RNA-dependent RNA polymerase Updated 2025-07-16

Makes RNA from RNA.

Used in Positive-strand RNA virus to replicate.

I don't think it's present outside viruses. Well regulated organisms just transcribe more DNA instead.


RNA polymerase Updated 2025-07-16

Converts DNA to RNA.


RNA-Seq Updated 2025-07-16

Sequencing the DNA tells us what the organism can do. Sequencing the RNA tells us what the organism is actually doing at a given point in time. The problem is not killing the cell while doing that. Is it possible to just take a chunk of the cell to sequence without killing it maybe?


SARS-CoV-2 S protein Updated 2025-07-16

Spike.

Nucleocapsid phosphoprotein, sticks to the RNA inside.

www.nature.com/articles/s41467-020-20768-y mentions functions:

helps pack the viral RNA into the capsule
also has a side function in immune suppression


Sonicator Updated 2025-07-16

These can be used to break cells apart from tissue, and also break up larger DNA or RNA molecules into smaller ones, suitable for sequencing.


Uracil Updated 2025-07-16

Replaces Thymine in RNA.
