E. Coli K-12 MG1655 gene thrA
The second gene in the E. Coli K-12 MG1655 genome. Part of the E. Coli K-12 MG1655 operon thrLABC.
Part of a reaction that produces threonine.
This protein is an enzyme. The UniProt entry clearly shows the chemical reactions that it catalyses. In this case, there are actually two! It can transform either of the metabolites:
  • "L-homoserine" into "L-aspartate 4-semialdehyde"
  • "L-aspartate" into "4-phospho-L-aspartate"
Also interestingly, we see that both of those reactions require some extra energy to proceed, one needing adenosine triphosphate and the other NADP+.
TODO: any mention of how much faster it makes the reaction, numerically?
Since this is an enzyme, it would also be interesting to have a quick search for it in the KEGG entry starting from the organism: www.genome.jp/pathway/eco01100+M00022 We type in the search bar "thrA", it gives a long list, but the last entry is our "thrA". Selecting it highlights two pathways in the large graph, so we understand that it catalyzes two different reactions, as suggested by the protein name itself (fused blah blah). We can now hover over:
  • the edge: it shows all the enzymes that catalyze the given reaction. Both edges actually have multiple enzymes, e.g. the L-Homoserine path is also catalyzed by another enzyme called metL.
  • the node: they are the metabolites, e.g. one of the paths contains "L-homoserine" on one node and "L-aspartate 4-semialdehyde" on the other
Note that common cofactors are omitted: we've learnt from the UniProt entry that this reaction uses ATP, but ATP does not show up on the graph.
If we now click on the L-Homoserine edge, it takes us to: www.genome.jp/entry/eco:b0002+eco:b3940. Under "Pathway" we see an interesting looking pathway "Glycine, serine and threonine metabolism": www.genome.jp/pathway/eco00260+b0002 which contains a small manually selected and extremely clearly named subset of the larger graph!
But looking at the bottom of this subgraph (the UI is not great, can't Ctrl+F and enzyme names not shown, but the selected enzyme is slightly highlighted in red because it is in the URL www.genome.jp/pathway/eco00260+b0002 vs www.genome.jp/pathway/eco00260) we clearly see that thrA, thrB and thrC form a sequence that directly transforms "L-aspartate 4-semialdehyde" into "Homoserine", then into "O-Phospho-L-homoserine" and finally into threonine. This makes it crystal clear that they are not just located adjacently in the genome by chance: they are actually functionally related, and likely controlled by the same transcription factor: when you want one of them, you basically always want the three, because you must be lacking threonine. TODO find transcription factor!
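The chain described above can be written down as data and checked for continuity. This is only a sketch: pairing each gene with exactly one step, in the order the genes are listed, is our reading of the text, and the STEPS list and is_linear_chain helper are hypothetical names, not taken from KEGG.

```python
# Each step is (gene, substrate, product), in the order given in the text.
# Assumption: each gene handles exactly one step, in the listed order.
STEPS = [
    ("thrA", "L-aspartate 4-semialdehyde", "Homoserine"),
    ("thrB", "Homoserine", "O-Phospho-L-homoserine"),
    ("thrC", "O-Phospho-L-homoserine", "Threonine"),
]


def is_linear_chain(steps):
    """Return True if every step's product is the next step's substrate."""
    return all(prev[2] == nxt[1] for prev, nxt in zip(steps, steps[1:]))


# Metabolites visited along the chain, starting from the first substrate.
chain = [STEPS[0][1]] + [product for _, _, product in STEPS]

print(is_linear_chain(STEPS))
print(" -> ".join(chain))
```

Because each product feeds the next step, losing any one of the three enzymes breaks the whole route, which is the intuition behind expecting them to be regulated together.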
The UniProt entry also shows an interactive browser of the tertiary structure of the protein. We note that there are currently two sources available: X-ray crystallography and AlphaFold. To be honest, the AlphaFold one looks quite off!!!
By inspecting the FASTA for the entire genome, or by using the NCBI open reading frame tool, we see that this gene lies entirely in its own open reading frame, so it is quite boring.
From the FASTA we see that the very first three codons at position 337 are
ATG CGA GTG
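where ATG is the start codon, which also codes for the initial methionine (M), and CGA GTG code for the next two amino acids, arginine (R) and valine (V), matching the "MRVL..." sequence listed in reconstruction/ecoli/flat/proteins.tsv.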
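As a small illustration of how those codons map to amino acids, here is a sketch using a hand-written excerpt of the standard genetic code. Only the four codons needed are included, and the CODON_TO_AA table and translate helper are illustrative names, not part of any tool mentioned here. TTG is included because it is the fourth codon of the gene in the sequence excerpts later in this document.

```python
# Excerpt of the standard genetic code: only the codons needed here.
CODON_TO_AA = {
    "ATG": "M",  # methionine, also the usual start codon
    "CGA": "R",  # arginine
    "GTG": "V",  # valine
    "TTG": "L",  # leucine
}


def translate(dna):
    """Translate a DNA string codon by codon, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TO_AA[codon] for codon in codons)


first_codons = "ATG CGA GTG".replace(" ", "")
print(translate(first_codons))
print(translate("ATGCGAGTGTTG"))
```

The second call covers the first twelve bases of the gene and gives the same four letters that open the protein sequence in reconstruction/ecoli/flat/proteins.tsv.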
ecocyc.org/gene?orgid=ECOLI&id=ASPKINIHOMOSERDEHYDROGI-MONOMER mentions that the enzyme is most active as a protein complex with four copies of the same protein:
Aspartate kinase I / homoserine dehydrogenase I comprises a dimer of ThrA dimers. Although the dimeric form is catalytically active, the binding equilibrium dramatically favors the tetrameric form. The aspartate kinase and homoserine dehydrogenase activities of each ThrA monomer are catalyzed by independent domains connected by a linker region.
TODO image?
E. Coli Whole Cell Model by Covert Lab
github.com/CovertLab/WholeCellEcoliRelease is a whole cell simulation model created by Covert Lab and other collaborators.
The project is written in Python, hurray! But according to the README, it seems to use a code drop model with on-request access to master, very meh. The rationale was asked on a GitHub discussion, and they confirmed, as expected, that it is:
  • to prevent their publication ideas from being stolen. Who would steal publication ideas with public proof in an issue tracker without crediting original authors?
  • to prevent noise from non-collaborators. They only get like 2 issues a year though, people forget that it is legal to ignore other people :-)
Oh well.
The project is a followup to the earlier M. genitalium whole cell model by Covert lab which modelled Mycoplasma genitalium. E. Coli has 8x more genes (4k vs 500), but it is the undisputed bacterial model organism and as such has been studied much more thoroughly. It also reproduces faster than Mycoplasma (20 minutes vs a few hours), which is a huge advantage for validation/exploratory experiments.
The project has a partial dependency on the proprietary optimization software CPLEX, which is free for students. It is not clear what it is used for exactly, but the comment in requirements.txt suggests that the dependency is only partial.
This project makes Ciro Santilli think of the E. Coli as an optimization problem. Given some external nutrient/temperature condition, which DNA sequence makes the cell grow the fastest? Balancing metabolites feels like designing a Factorio speedrun.
There is one major thing missing in the current model: promoter/transcription factor interactions are not modelled due to the lack/low quality of experimental data: github.com/CovertLab/WholeCellEcoliRelease/issues/21. They just have a magic direct "transcription factor to gene" relationship, encoded at reconstruction/ecoli/flat/foldChanges.tsv in terms of the type "if this is present, such protein is expressed 10x more". Transcription units are not implemented at all, it appears.
Everything in this section refers to version 7e4cc9e57de76752df0f4e32eca95fb653ea64e4, the code drop from November 2020, and was tested on Ubuntu 21.04 with a docker install of docker.pkg.github.com/covertlab/wholecellecolirelease/wcm-full with image id 502c3e604265, unless otherwise noted.
Source code overview
The key model database is located in the source code at reconstruction/ecoli/flat.
Let's try to understand some interesting looking files, with a special focus on the tiny E. Coli K-12 MG1655 operon thrLABC part of the metabolism, which we have well understood at Section "E. Coli K-12 MG1655 operon thrLABC".
We'll realize that a lot of data and IDs come from/match BioCyc quite closely.
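Before going through the files, here is a minimal sketch of how such a file could be read. It assumes that the files are tab-separated and that every field is a JSON value, which is what the quoted strings, bracketed lists and bare numbers in the samples below suggest, but this was not checked against the project's own loader. The function name parse_flat_tsv is hypothetical, and the sample text is the promoters.tsv excerpt with the tabs written out explicitly.

```python
import json


def parse_flat_tsv(text):
    """Parse tab-separated text whose fields are JSON values into a list of dicts.

    Assumption: no field contains a literal tab character.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    header = [json.loads(field) for field in lines[0].split("\t")]
    rows = []
    for line in lines[1:]:
        values = [json.loads(field) for field in line.split("\t")]
        rows.append(dict(zip(header, values)))
    return rows


# Sample copied from the promoters.tsv excerpt, with tabs made explicit.
PROMOTERS_SAMPLE = (
    '"position"\t"direction"\t"id"\t"name"\n'
    '148\t"+"\t"PM00249"\t"thrLp"\n'
)

promoters = parse_flat_tsv(PROMOTERS_SAMPLE)
print(promoters[0]["name"], promoters[0]["position"])
```

Decoding each field as JSON keeps numbers as numbers and lists as lists, so a column such as "mw" or "aaCount" would come back as a Python list without extra parsing.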
  • reconstruction/ecoli/flat/compartments.tsv contains cellular compartment information:
    "abbrev" "id"
    "n" "CCO-BAC-NUCLEOID"
    "j" "CCO-CELL-PROJECTION"
    "w" "CCO-CW-BAC-NEG"
    "c" "CCO-CYTOSOL"
    "e" "CCO-EXTRACELLULAR"
    "m" "CCO-MEMBRANE"
    "o" "CCO-OUTER-MEM"
    "p" "CCO-PERI-BAC"
    "l" "CCO-PILUS"
    "i" "CCO-PM-BAC-NEG"
  • reconstruction/ecoli/flat/promoters.tsv contains promoter information. Simple file, sample lines:
    "position" "direction" "id" "name"
    148 "+" "PM00249" "thrLp"
    corresponds to E. Coli K-12 MG1655 promoter thrLp, which starts at position 148.
  • reconstruction/ecoli/flat/proteins.tsv contains protein information. Sample line corresponding to E. Coli K-12 MG1655 gene thrA:
    "aaCount" "name" "seq" "comments" "codingRnaSeq" "mw" "location" "rnaId" "id" "geneId"
    [91, 46, 38, 44, 12, 53, 30, 63, 14, 46, 89, 34, 23, 30, 29, 51, 34, 4, 20, 0, 69] "ThrA" "MRVL..." "Location information from Ecocyc dump." "AUGCGAGUGUUG..." [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 89103.51099999998, 0.0, 0.0, 0.0, 0.0] ["c"] "EG10998_RNA" "ASPKINIHOMOSERDEHYDROGI-MONOMER" "EG10998"
    so we understand that:
    • aaCount: amino acid count, how many of each proteinogenic amino acid there are. The list has 21 entries, presumably the 20 standard amino acids plus selenocysteine
    • seq: full sequence, using the single letter abbreviation of the proteinogenic amino acids
    • mw: molecular weight? The 11 components appear to be given at reconstruction/ecoli/flat/scripts/unifyBulkFiles.py:
      molecular_weight_keys = [
        '23srRNA',
        '16srRNA',
        '5srRNA',
        'tRNA',
        'mRNA',
        'miscRNA',
        'protein',
        'metabolite',
        'water',
        'DNA',
        'RNA' # nonspecific RNA
        ]
      so they simply classify the weight? Presumably this exists for complexes that have multiple classes?
    • location: cell compartment where the protein is present, c defined at reconstruction/ecoli/flat/compartments.tsv as cytosol, as expected for something that will make an amino acid
  • reconstruction/ecoli/flat/rnas.tsv: TODO vs transcriptionUnits.tsv. Sample lines:
    "halfLife" "name" "seq" "type" "modifiedForms" "monomerId" "comments" "mw" "location" "ntCount" "id" "geneId" "microarray expression"
    174.0 "ThrA [RNA]" "AUGCGAGUGUUG..." "mRNA" [] "ASPKINIHOMOSERDEHYDROGI-MONOMER" "" [0.0, 0.0, 0.0, 0.0, 790935.00399999996, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] ["c"] [553, 615, 692, 603] "EG10998_RNA" "EG10998" 0.0005264904
    • halfLife: half-life
    • mw: molecular weight, same as in reconstruction/ecoli/flat/proteins.tsv. This molecule only has weight in the mRNA class, as expected, as it just codes for a protein
    • location: same as in reconstruction/ecoli/flat/proteins.tsv
    • ntCount: nucleotide count, how many of each of the four nucleotides there are
    • microarray expression: presumably refers to DNA microarray for gene expression profiling, but what measure exactly?
  • reconstruction/ecoli/flat/sequence.fasta: FASTA DNA sequence, first two lines:
    >E. coli K-12 MG1655 U00096.2 (1 to 4639675 = 4639675 bp)
    AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTG
  • reconstruction/ecoli/flat/transcriptionUnits.tsv: transcription units. We can observe for example the two different transcription units of the E. Coli K-12 MG1655 operon thrLABC in the lines:
    "expression_rate" "direction" "right" "terminator_id"  "name"    "promoter_id" "degradation_rate" "id"       "gene_id"                                   "left"
    0.0               "f"         310     ["TERM0-1059"]   "thrL"    "PM00249"     0.198905992329492 "TU0-42486" ["EG11277"]                                  148
    657.057317358791  "f"         5022    ["TERM_WC-2174"] "thrLABC" "PM00249"     0.231049060186648 "TU00178"   ["EG10998", "EG10999", "EG11000", "EG11277"] 148
  • reconstruction/ecoli/flat/genes.tsv contains gene information. Sample lines:
    "length" "name"                      "seq"             "rnaId"       "coordinate" "direction" "symbol" "type" "id"      "monomerId"
    66       "thr operon leader peptide" "ATGAAACGCATT..." "EG11277_RNA" 189          "+"         "thrL"   "mRNA" "EG11277" "EG11277-MONOMER"
    2463     "ThrA"                      "ATGCGAGTGTTG..." "EG10998_RNA" 336          "+"         "thrA"   "mRNA" "EG10998" "ASPKINIHOMOSERDEHYDROGI-MONOMER"
  • reconstruction/ecoli/flat/metabolites.tsv contains metabolite information. Sample lines:
    "id"                       "mw7.2" "location"
    "HOMO-SER"                 119.12  ["n", "j", "w", "c", "e", "m", "o", "p", "l", "i"]
    "L-ASPARTATE-SEMIALDEHYDE" 117.104 ["n", "j", "w", "c", "e", "m", "o", "p", "l", "i"]
    In the case of the enzyme thrA, one of the two reactions it catalyzes is "L-aspartate 4-semialdehyde" into "Homoserine".
    Starting from the enzyme page: biocyc.org/gene?orgid=ECOLI&id=EG10998 we reach the reaction page: biocyc.org/ECOLI/NEW-IMAGE?type=REACTION&object=HOMOSERDEHYDROG-RXN which has reaction ID HOMOSERDEHYDROG-RXN. That page clarifies the compound IDs, so these are the compounds that we care about.
  • reconstruction/ecoli/flat/reactions.tsv contains chemical reaction information. Sample lines:
    "reaction id" "stoichiometry" "is reversible" "catalyzed by"
    
    "HOMOSERDEHYDROG-RXN-HOMO-SER/NAD//L-ASPARTATE-SEMIALDEHYDE/NADH/PROTON.51."
      {"NADH[c]": -1, "PROTON[c]": -1, "HOMO-SER[c]": 1, "L-ASPARTATE-SEMIALDEHYDE[c]": -1, "NAD[c]": 1}
      false
      ["ASPKINIIHOMOSERDEHYDROGII-CPLX", "ASPKINIHOMOSERDEHYDROGI-CPLX"]
    
    "HOMOSERDEHYDROG-RXN-HOMO-SER/NADP//L-ASPARTATE-SEMIALDEHYDE/NADPH/PROTON.53."
      {"NADPH[c]": -1, "NADP[c]": 1, "PROTON[c]": -1, "L-ASPARTATE-SEMIALDEHYDE[c]": -1, "HOMO-SER[c]": 1}
      false
      ["ASPKINIIHOMOSERDEHYDROGII-CPLX", "ASPKINIHOMOSERDEHYDROGI-CPLX"]
    • catalyzed by: here we see ASPKINIHOMOSERDEHYDROGI-CPLX, which we can guess is a protein complex made out of ASPKINIHOMOSERDEHYDROGI-MONOMER, which is the ID for the thrA we care about! This is confirmed in complexationReactions.tsv.
  • reconstruction/ecoli/flat/complexationReactions.tsv contains information about chemical reactions that produce protein complexes:
    "process" "stoichiometry" "id" "dir"
    "complexation"
      [
        {
          "molecule": "ASPKINIHOMOSERDEHYDROGI-CPLX",
          "coeff": 1,
          "type": "proteincomplex",
          "location": "c",
          "form": "mature"
        },
        {
          "molecule": "ASPKINIHOMOSERDEHYDROGI-MONOMER",
          "coeff": -4,
          "type": "proteinmonomer",
          "location": "c",
          "form": "mature"
        }
      ]
    "ASPKINIHOMOSERDEHYDROGI-CPLX_RXN"
    1
    The coeff is how many monomers need to get together to form the final complex. This can be seen from the Summary section of ecocyc.org/gene?orgid=ECOLI&id=ASPKINIHOMOSERDEHYDROGI-MONOMER:
    Aspartate kinase I / homoserine dehydrogenase I comprises a dimer of ThrA dimers. Although the dimeric form is catalytically active, the binding equilibrium dramatically favors the tetrameric form. The aspartate kinase and homoserine dehydrogenase activities of each ThrA monomer are catalyzed by independent domains connected by a linker region.
    Fantastic literature summary! Can't find that in database form there however.
  • reconstruction/ecoli/flat/proteinComplexes.tsv contains protein complex information:
    "name" "comments" "mw" "location" "reactionId" "id"
    "aspartate kinase / homoserine dehydrogenase"
    ""
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 356414.04399999994, 0.0, 0.0, 0.0, 0.0]
    ["c"]
    "ASPKINIHOMOSERDEHYDROGI-CPLX_RXN"
    "ASPKINIHOMOSERDEHYDROGI-CPLX"
  • reconstruction/ecoli/flat/protein_half_lives.tsv contains the half-life of proteins. Very few proteins are listed however for some reason.
  • reconstruction/ecoli/flat/tfIds.csv: transcription factor information:
    "TF"   "geneId"  "oneComponentId"  "twoComponentId" "nonMetaboliteBindingId" "activeId" "notes"
    "arcA" "EG10061" "PHOSPHO-ARCA"    "PHOSPHO-ARCA"
    "fnr"  "EG10325" "FNR-4FE-4S-CPLX" "FNR-4FE-4S-CPLX"
    "dksA" "EG10230"