Genome: www.ncbi.nlm.nih.gov/genome/?term=txid511145. From there, there are links to:
- Download the FASTA: "Download sequences in FASTA format for genome, protein"
  For the genome, you get a compressed FASTA file with extension .fna called GCF_000005845.2_ASM584v2_genomic.fna that starts with:
  >NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome
  AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTG
  Running wc GCF_000005845.2_ASM584v2_genomic.fna gives 58022 lines: one header line, 58020 sequence lines of 80 characters each (as seen in Vim), and a final sequence line of 52 characters. So we have 58020 * 80 + 52 = 4641652 bp, or about 4.6 Mbp.
- Interactively browse the sequence on the browser viewer: "Reference genome: Escherichia coli str. K-12 substr. MG1655" which eventually leads to: www.ncbi.nlm.nih.gov/nuccore/556503834?report=graph
  If we zoom into the start and hover over the very first gene/protein, we find the famous (just kidding) E. Coli K-12 MG1655 gene thrL, at position 190-255.
  The second one is the much more interesting E. Coli K-12 MG1655 gene thrA.
- Gene list, with a total of 4,629 as of 2021: www.ncbi.nlm.nih.gov/gene/?term=txid511145
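
The line-count arithmetic from the FASTA bullet can be double-checked with a few lines of Python. This is only a minimal sketch: the helper names `sequence_length` and `fasta_length` are made up for illustration, and it assumes the download has already been decompressed and contains a single record.

```python
def sequence_length(lines):
    """Count sequence characters, skipping FASTA header lines (those starting with '>')."""
    total = 0
    for line in lines:
        line = line.strip()
        if line and not line.startswith(">"):
            total += len(line)
    return total


def fasta_length(path):
    """Same count, reading from a decompressed FASTA file on disk (hypothetical helper)."""
    with open(path) as handle:
        return sequence_length(handle)


# The arithmetic from the text: 58020 full lines of 80 characters plus one of 52.
expected_bp = 58020 * 80 + 52
print(expected_bp)  # 4641652
```

If the file is present locally, `fasta_length("GCF_000005845.2_ASM584v2_genomic.fna")` should return the same number as `expected_bp`.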

KEGG entry: www.genome.jp/pathway/eco01100+M00022

BioCyc promoter database query URL: biocyc.org/group?id=:ALL-PROMOTERS&orgid=ECOLI

E. Coli K-12 MG1655 origin of replication (3,925,744 - 3,925,975)

 0  0

BioCyc entry: biocyc.org/ECOLI/NEW-IMAGE?type=EXTRAGENIC-SITE&object=G0-10506

Note that this is not the conventional starting point for gene numbering; see Section "E. Coli genome starting point".

E. Coli K-12 MG1655 gene

 0  0

E. Coli K-12 MG1655 gene thrL (190-255, thr operon leader peptide)

 0  0

UniProt entry: www.uniprot.org/uniprot/P0AD86.

NCBI gene entry: www.ncbi.nlm.nih.gov/gene/944742.

The first gene in the E. Coli K-12 MG1655 genome. Remember, however, that the bacterial chromosome is circular, so being first doesn't mean much. For how the choice was made, see Section "E. Coli genome starting point".
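
To make the circularity concrete, here is a small sketch that measures how far thrL is from the origin of replication when the chromosome is treated as a ring. The function name `circular_distance` is made up for illustration. The genome length is the 4641652 bp computed from the FASTA, and the two positions are the start coordinates quoted in the section titles.

```python
GENOME_LENGTH = 4_641_652  # bp, from the FASTA line count

def circular_distance(a, b, genome_length=GENOME_LENGTH):
    """Shortest distance in bp between two positions on a circular chromosome."""
    direct = abs(a - b)
    # Going the other way around the ring may be shorter than the direct path.
    return min(direct, genome_length - direct)

THRL_START = 190
ORIC_START = 3_925_744

print(circular_distance(THRL_START, ORIC_START))  # 716098
```

On a linear map the two look almost 4 Mbp apart, but going around the ring they are only about 0.7 Mbp apart.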

Part of E. Coli K-12 MG1655 operon thrLABC.

At only 66 bp (positions 190 to 255 inclusive), this gene is quite small and boring. For a more interesting gene, have a look at the next gene, E. Coli K-12 MG1655 gene thrA.
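
Gene lengths follow directly from the inclusive coordinates given in the section titles. The sketch below computes them for thrL, thrA and thrB and checks that each is a whole number of codons. The names `GENES` and `gene_length` are made up for illustration. Reading the codon count as "amino acids plus one stop codon" assumes that the annotated coordinates include the stop codon.

```python
# Inclusive 1-based coordinates, copied from the section titles.
GENES = {
    "thrL": (190, 255),
    "thrA": (337, 2799),
    "thrB": (2801, 3733),
}

def gene_length(start, end):
    """Length in bp of a feature with inclusive start and end coordinates."""
    return end - start + 1

for name, (start, end) in GENES.items():
    length = gene_length(start, end)
    codons, remainder = divmod(length, 3)
    print(name, length, codons, remainder)
# thrL 66 22 0
# thrA 2463 821 0
# thrB 933 311 0
```

The easy mistake is to compute 255 - 190 = 65 and forget that both end positions are part of the gene.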

Has something to do with threonine.

This is the first in the sequence thrL, thrA, thrB, thrC. This type of naming convention is quite common for related adjacent genes, all of which are presumably transcribed into a single RNA from the same promoter. As mentioned in the analysis of the KEGG entry for E. Coli K-12 MG1655 gene thrA, A, B and C are directly functionally linked in a single metabolic pathway.

We can see that thrL, A, B, and C are in the same transcription unit by browsing the list of promoters at: biocyc.org/group?id=:ALL-PROMOTERS&orgid=ECOLI. By finding the first one by position we reach: biocyc.org/ECOLI/NEW-IMAGE?object=TU0-42486.

E. Coli K-12 MG1655 gene thrA (337-2799, fused aspartate kinase/homoserine dehydrogenase 1)

 0  0

UniProt entry: www.uniprot.org/uniprot/P00561.

NCBI entry: www.ncbi.nlm.nih.gov/gene/945803.

The second gene in the E. Coli K-12 MG1655 genome. Part of the E. Coli K-12 MG1655 operon thrLABC.

Part of a reaction that produces threonine.

This protein is an enzyme. The UniProt entry clearly shows the chemical reactions that it catalyses. In this case, there are actually two! It can transform either:

- "L-homoserine" into "L-aspartate 4-semialdehyde"
- "L-aspartate" into "4-phospho-L-aspartate"

Also interestingly, we see that both of those reactions require some extra energy to proceed, one needing adenosine triphosphate and the other NADP+.

TODO: any mention of how much faster it makes the reaction, numerically?

Since this is an enzyme, it is also interesting to search for it in the KEGG entry, starting from the organism: www.genome.jp/pathway/eco01100+M00022. Typing "thrA" in the search bar gives a long list, the last entry of which is our "thrA". Selecting it highlights two pathways in the large graph, so we understand that it catalyzes two different reactions, as suggested by the protein name itself (fused blah blah). We can now hover over:

- the edge: it shows all the enzymes that catalyze the given reaction. Both edges actually have multiple enzymes, e.g. the L-Homoserine path is also catalyzed by another enzyme called metL.
- the node: the nodes are the metabolites, e.g. one of the paths has "L-homoserine" on one node and "L-aspartate 4-semialdehyde" on the other.

Note that common cofactors are omitted from the graph: we know from the UniProt entry that this reaction uses ATP, yet ATP is not shown.

If we now click on the L-Homoserine edge, it takes us to: www.genome.jp/entry/eco:b0002+eco:b3940. Under "Pathway" we see an interesting looking pathway "Glycine, serine and threonine metabolism": www.genome.jp/pathway/eco00260+b0002 which contains a small manually selected and extremely clearly named subset of the larger graph!

Looking at the bottom of this subgraph, we clearly see that thrA, thrB and thrC form a sequence that directly transforms "L-aspartate 4-semialdehyde" into "Homoserine", then into "O-Phospho-L-homoserine" and finally into threonine. The UI is not great: you can't Ctrl+F and enzyme names are not shown, but the selected enzyme is slightly highlighted in red because it is in the URL (www.genome.jp/pathway/eco00260+b0002 vs www.genome.jp/pathway/eco00260). This makes it crystal clear that the three genes are not just located adjacently in the genome by chance: they are actually functionally related, and likely controlled by the same transcription factor. When you want one of them, you basically always want all three, because you must be lacking threonine. TODO find transcription factor!
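
The chain read off the KEGG subgraph can be written down as a tiny data structure, which makes the "each product is the next substrate" relationship explicit. This is only a sketch: the names `THR_PATHWAY`, `is_linear_chain` and `genes_needed` are made up for illustration, and the metabolite strings are simply the labels quoted in the text, not KEGG identifiers.

```python
# (gene, substrate, product) for each step, in the order seen in the subgraph.
THR_PATHWAY = [
    ("thrA", "L-aspartate 4-semialdehyde", "L-homoserine"),
    ("thrB", "L-homoserine", "O-phospho-L-homoserine"),
    ("thrC", "O-phospho-L-homoserine", "L-threonine"),
]

def is_linear_chain(steps):
    """True if every step's product is the substrate of the following step."""
    return all(steps[i][2] == steps[i + 1][1] for i in range(len(steps) - 1))

def genes_needed(steps, start, target):
    """Genes required to get from `start` to `target` along a linear chain."""
    needed = []
    current = start
    for gene, substrate, product in steps:
        if substrate != current:
            continue
        needed.append(gene)
        current = product
        if current == target:
            return needed
    raise ValueError(f"no path from {start!r} to {target!r}")

print(is_linear_chain(THR_PATHWAY))  # True
print(genes_needed(THR_PATHWAY, "L-aspartate 4-semialdehyde", "L-threonine"))
# ['thrA', 'thrB', 'thrC']
```

Starting from the semialdehyde, all three genes are required to reach threonine, which fits the idea that they are switched on together.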

The UniProt entry also shows an interactive browser of the tertiary structure of the protein. We note that there are currently two sources available: X-ray crystallography and AlphaFold. To be honest, the AlphaFold one looks quite off!!!

By inspecting the FASTA for the entire genome, or by using the NCBI open reading frame tool, we see that this gene lies entirely in its own open reading frame, so it is quite boring.

From the FASTA we see that the very first three codons at position 337 are

ATG CGA GTG

where ATG is the start codon, which itself codes for methionine, and CGA GTG are the next two codons that go into the protein:

CGA: arginine
GTG: valine
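
The codon lookup above can be reproduced with a compact standard genetic code table. This is a minimal sketch: the names `CODON_TABLE` and `translate` are made up for illustration, and the table is built from the usual T, C, A, G ordering of the standard code. It ignores bacterial details such as alternative start codons.

```python
from itertools import product

BASES = "TCAG"
# One-letter amino acids for the 64 codons in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a DNA string codon by codon, ignoring spaces and any trailing partial codon."""
    dna = dna.upper().replace(" ", "")
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

print(translate("ATG CGA GTG"))  # MRV
```

Here M is methionine (from the start codon ATG), R is arginine and V is valine.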

ecocyc.org/gene?orgid=ECOLI&id=ASPKINIHOMOSERDEHYDROGI-MONOMER mentions that the enzyme is most active as a protein complex with four copies of the same protein:

Aspartate kinase I / homoserine dehydrogenase I comprises a dimer of ThrA dimers. Although the dimeric form is catalytically active, the binding equilibrium dramatically favors the tetrameric form. The aspartate kinase and homoserine dehydrogenase activities of each ThrA monomer are catalyzed by independent domains connected by a linker region.

TODO image?

E. Coli K-12 MG1655 gene thrB (2,801 - 3,733)

 0  0

Immediately follows E. Coli K-12 MG1655 gene thrA.

Part of E. Coli K-12 MG1655 operon thrLABC.

Note that this is very close to the "end" of the genome.

NCBI: www.ncbi.nlm.nih.gov/gene/948874

UniProt: www.uniprot.org/uniprot/P0A9Q1

TODO DNA assembly structure.

E. Coli K-12 MG1655 gene ytdX (5,234 - 5,530)

 0  0

The "last" gene, and also an E. Coli K-12 MG1655 gene of unknown function.

E. Coli K-12 MG1655 gene of unknown function

 0  0

All gene names that start with a Y, such as ytdX and yaaX, appear to encode proteins of unknown function.

UniProt for example describes YaaX as "Uncharacterized protein YaaX".

As the function is discovered, the gene is renamed to something more descriptive, like the genes of the E. Coli K-12 MG1655 operon thrLABC, all of which have a clear name due to threonine.

There are many other y??? genes as of 2021, though they do tend to encode smaller molecules.

E. Coli K-12 MG1655 promoter

 0  0

biocyc.org/group?id=:ALL-PROMOTERS&orgid=ECOLI

From this we see that there is a convention of naming promoters as gene name + p, e.g. the first gene transcribed from E. Coli K-12 MG1655 promoter thrLp is thrL.

It is also possible to add numbers after the p, e.g. at biocyc.org/ECOLI/NEW-IMAGE?type=OPERON&object=PM0-45989 we see that the gene zur has two promoters:

zurp6
zurp7

TODO why 6 and 7? There don't appear to be 1, 2, etc.
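
The naming convention can be captured in a short parser. This is a sketch based only on the three names seen here (thrLp, zurp6, zurp7); the regular expression and the function name `parse_promoter` are assumptions for illustration, not an official BioCyc specification.

```python
import re

# Assumed pattern: gene name, then a literal "p", then an optional number.
PROMOTER_RE = re.compile(r"^(?P<gene>.+?)p(?P<index>\d*)$")

def parse_promoter(name):
    """Split a promoter name into (gene, index); index is None when no number is given."""
    match = PROMOTER_RE.match(name)
    if match is None:
        raise ValueError(f"not a promoter-style name: {name!r}")
    index = int(match["index"]) if match["index"] else None
    return match["gene"], index

for promoter in ["thrLp", "zurp6", "zurp7"]:
    print(promoter, parse_promoter(promoter))
# thrLp ('thrL', None)
# zurp6 ('zur', 6)
# zurp7 ('zur', 7)
```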

E. Coli K-12 MG1655 promoter thrLp (148)

 0  0

promoter for the E. Coli K-12 MG1655 operon thrLABC.

E. Coli K-12 MG1655 operon thrLABC

 0  0

Contains the genes: e. Coli K-12 MG1655 gene thrL, e. Coli K-12 MG1655 gene thrA, e. Coli K-12 MG1655 gene thrB and e. Coli K-12 MG1655 gene thrC, all of which have directly linked functionality.

We can find it by searching for the species in the BioCyc promoter database. This leads to: biocyc.org/group?id=:ALL-PROMOTERS&orgid=ECOLI.

By finding the first operon by position we reach: biocyc.org/ECOLI/NEW-IMAGE?object=TU0-42486.

That page lists several components of the promoter, which we should try to understand!

Some of the transcription factors are proteins:

After the first gene in the operon, thrL, there is a rho-independent termination. By comparing:

we understand that the presence of threonine or isoleucine variants, L-threonyl and L-isoleucyl, makes the rho-independent termination more efficient, so the control loop is quite direct! Not sure why it cares about isoleucine as well though.

TODO which factor is actually specific to that DNA region?