= Source code overview

The key model database is located in the source code at `reconstruction/ecoli/flat`.
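
As a rough sketch of how such flat files could be inspected, the snippet below parses a table whose fields are JSON values. It assumes that fields are separated by tabs and that every field, including the header names, is valid JSON. This matches the samples shown in this overview, but it has not been checked against the repository's own reader. The function name `parse_flat_table` is hypothetical, and the inline sample is the `promoters.tsv` line quoted further down, with tabs written explicitly.

```python
import json


def parse_flat_table(text):
    """Parse tab-separated text whose fields are JSON values into a list of dicts.

    Assumption: one record per line, fields separated by a single tab.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    header = [json.loads(field) for field in lines[0].split("\t")]
    rows = []
    for line in lines[1:]:
        values = [json.loads(field) for field in line.split("\t")]
        rows.append(dict(zip(header, values)))
    return rows


# Inline stand-in for reconstruction/ecoli/flat/promoters.tsv.
sample = '"position"\t"direction"\t"id"\t"name"\n148\t"+"\t"PM00249"\t"thrLp"\n'
promoters = parse_flat_table(sample)
print(promoters[0]["name"], promoters[0]["position"])
```

Parsing each field as JSON means that list and dictionary columns such as `aaCount` or `stoichiometry` would come out as native Python objects, at the cost of failing loudly on any field that is not valid JSON.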

Let's try to understand some of the more interesting looking files, with a special focus on the tiny <E. Coli K-12 MG1655 operon thrLABC> part of the metabolism, which we have well understood at <E. Coli K-12 MG1655 operon thrLABC>{full}.

We'll realize that a lot of data and IDs come from/match <BioCyc> quite closely.

Before we start, there is one major thing missing from the current model: <promoters>/<transcription factor> interactions are not modelled due to the lack or low quality of experimental data: https://github.com/CovertLab/WholeCellEcoliRelease/issues/21[]. Instead, there is just a magic direct "<transcription factor> to <gene>" relationship, encoded at https://github.com/CovertLab/WholeCellEcoliRelease/blob/7e4cc9e57de76752df0f4e32eca95fb653ea64e4/reconstruction/ecoli/flat/foldChanges.tsv[reconstruction/ecoli/flat/foldChanges.tsv] in terms of the type "if this is present, such protein is expressed 10x more". <Transcription units> do not appear to be implemented at all.

* `reconstruction/ecoli/flat/compartments.tsv` contains <cellular compartment> information:
  ``
  "abbrev" "id"
  "n" "CCO-BAC-NUCLEOID"
  "j" "CCO-CELL-PROJECTION"
  "w" "CCO-CW-BAC-NEG"
  "c" "CCO-CYTOSOL"
  "e" "CCO-EXTRACELLULAR"
  "m" "CCO-MEMBRANE"
  "o" "CCO-OUTER-MEM"
  "p" "CCO-PERI-BAC"
  "l" "CCO-PILUS"
  "i" "CCO-PM-BAC-NEG"
  ``
  * `CCO`: "Cellular COmpartment"
  * `BAC-NUCLEOID`: <nucleoid>
  * `CELL-PROJECTION`: <cell projection>
  * `CW-BAC-NEG`: TODO confirm: <cell wall> (of a <Gram-negative bacteria>)
  * `CYTOSOL`: <cytosol>
  * `EXTRACELLULAR`: outside the cell
  * `MEMBRANE`: <cell membrane>
  * `OUTER-MEM`: <bacterial outer membrane>
  * `PERI-BAC`: <periplasm>
  * `PILUS`: <pilus>
  * `PM-BAC-NEG`: TODO: <plasma membrane>, but that is the same as <cell membrane> no?
* `reconstruction/ecoli/flat/promoters.tsv` contains <promoter> information. Simple file, sample lines:
  ``
  "position" "direction" "id" "name"
  148 "+" "PM00249" "thrLp"
  ``
  corresponds to <E. Coli K-12 MG1655 promoter thrLp>, which starts at position 148.
* `reconstruction/ecoli/flat/proteins.tsv` contains <protein> information. Sample line corresponding to <E. Coli K-12 MG1655 gene thrA>:
  ``
  "aaCount" "name" "seq" "comments" "codingRnaSeq" "mw" "location" "rnaId" "id" "geneId"
  [91, 46, 38, 44, 12, 53, 30, 63, 14, 46, 89, 34, 23, 30, 29, 51, 34, 4, 20, 0, 69] "ThrA" "MRVL..." "Location information from Ecocyc dump." "AUGCGAGUGUUG..." [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 89103.51099999998, 0.0, 0.0, 0.0, 0.0] ["c"] "EG10998_RNA" "ASPKINIHOMOSERDEHYDROGI-MONOMER" "EG10998"
  ``
  so we understand that:
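  * `aaCount`: <amino acid> count, i.e. how many of each <proteinogenic amino acid> the protein contains. Note that the sample vector has 21 entries rather than 20, so presumably one non-standard amino acid is tracked as well. TODO confirm which one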
  * `seq`: full sequence, using the single letter abbreviation of the <proteinogenic amino acids>
  * `mw`: molecular weight? The 11 components appear to be given at `reconstruction/ecoli/flat/scripts/unifyBulkFiles.py`:
    ``
    molecular_weight_keys = [
      '23srRNA',
      '16srRNA',
      '5srRNA',
      'tRNA',
      'mRNA',
      'miscRNA',
      'protein',
      'metabolite',
      'water',
      'DNA',
      'RNA' # nonspecific RNA
      ]
    ``
    so they simply classify the weight? Presumably this exists for complexes that have multiple classes?
    * `23srRNA`, `16srRNA`, `5srRNA` are the three structural <RNAs> present in the <ribosome>: <23S ribosomal RNA>, <16S ribosomal RNA>, <5S ribosomal RNA>, all others are obvious:
    * <tRNA>
    * <mRNA>
    * <protein>. This is the seventh class, and this enzyme only contains mass in this class as expected.
    * <metabolite>
    * <water>
    * <DNA>
    * <RNA>: TODO `RNA` vs `miscRNA`
  * `location`: <cellular compartment> where the protein is present. `c` is defined at `reconstruction/ecoli/flat/compartments.tsv` as <cytosol>, as expected for something that will make an <amino acid>
* `reconstruction/ecoli/flat/rnas.tsv`: TODO vs `transcriptionUnits.tsv`. Sample lines:
  ``
  "halfLife" "name" "seq" "type" "modifiedForms" "monomerId" "comments" "mw" "location" "ntCount" "id" "geneId" "microarray expression"
  174.0 "ThrA [RNA]" "AUGCGAGUGUUG..." "mRNA" [] "ASPKINIHOMOSERDEHYDROGI-MONOMER" "" [0.0, 0.0, 0.0, 0.0, 790935.00399999996, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] ["c"] [553, 615, 692, 603] "EG10998_RNA" "EG10998" 0.0005264904
  ``
  * `halfLife`: <half-life>
  * `mw`: molecular weight, same as in `reconstruction/ecoli/flat/proteins.tsv`. This <molecule> only has weight in the `mRNA` class, as expected, as it just codes for a protein
  * `location`: same as in `reconstruction/ecoli/flat/proteins.tsv`
  * `ntCount`: <nucleotide> count for each of the four bases (with U in place of T, since this is RNA)
  * `microarray expression`: presumably refers to <DNA microarray> for <gene expression profiling>, but what measure exactly?
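
The `proteins.tsv` and `rnas.tsv` sample lines for thrA can be cross-checked against each other. The following is a minimal sketch with the vectors copied from those sample lines. It assumes that the mRNA sequence covers exactly the coding sequence plus one stop codon, which is an assumption and not something stated in the files.

```python
# Copied from the ThrA sample line of proteins.tsv.
aa_count = [91, 46, 38, 44, 12, 53, 30, 63, 14, 46, 89,
            34, 23, 30, 29, 51, 34, 4, 20, 0, 69]
# Copied from the EG10998_RNA sample line of rnas.tsv.
nt_count = [553, 615, 692, 603]

n_residues = sum(aa_count)
n_nucleotides = sum(nt_count)

# Assumption: three nucleotides per residue, plus one stop codon.
expected_nucleotides = 3 * (n_residues + 1)

print(n_residues, n_nucleotides, expected_nucleotides)
assert n_nucleotides == expected_nucleotides
```

The nucleotide total also equals the `length` of 2463 given for `EG10998` in `reconstruction/ecoli/flat/genes.tsv`.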
* `reconstruction/ecoli/flat/sequence.fasta`: <FASTA> <DNA> sequence, first two lines:
  ``
  >E. coli K-12 MG1655 U00096.2 (1 to 4639675 = 4639675 bp)
  AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTG
  ``
* `reconstruction/ecoli/flat/transcriptionUnits.tsv`: <transcription units>. We can observe for example the two different transcription units of the <E. Coli K-12 MG1655 operon thrLABC> in the lines:
  ``
  "expression_rate" "direction" "right" "terminator_id"  "name"    "promoter_id" "degradation_rate" "id"       "gene_id"                                   "left"
  0.0               "f"         310     ["TERM0-1059"]   "thrL"    "PM00249"     0.198905992329492 "TU0-42486" ["EG11277"]                                  148
  657.057317358791  "f"         5022    ["TERM_WC-2174"] "thrLABC" "PM00249"     0.231049060186648 "TU00178"   ["EG10998", "EG10999", "EG11000", "EG11277"] 148
  ``
  * `promoter_id`: matches promoter id in `reconstruction/ecoli/flat/promoters.tsv`
  * `gene_id`: matches id in `reconstruction/ecoli/flat/genes.tsv`
  * `id`: matches exactly those used in <BioCyc>, which is quite nice, might be more or less standardized:
    * https://biocyc.org/ECOLI/NEW-IMAGE?object=TU0-42486
    * https://biocyc.org/ECOLI/NEW-IMAGE?type=OPERON&object=TU00178
* `reconstruction/ecoli/flat/genes.tsv` contains <gene> information. Sample lines:
  ``
  "length" "name"                      "seq"             "rnaId"      "coordinate" "direction" "symbol" "type" "id"      "monomerId"
  66       "thr operon leader peptide" "ATGAAACGCATT..." "EG11277_RNA" 189         "+"         "thrL"   "mRNA" "EG11277" "EG11277-MONOMER"
  2463     "ThrA"                      "ATGCGAGTGTTG..." "EG10998_RNA" 336         "+"         "thrA"   "mRNA" "EG10998" "ASPKINIHOMOSERDEHYDROGI-MONOMER"
  ``
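
The shared IDs make it possible to join `promoters.tsv`, `transcriptionUnits.tsv` and `genes.tsv`. The sketch below does this for the thrLABC sample lines quoted above, keeping only the fields it needs. The helper `genes_inside` is hypothetical, and treating `coordinate` as the start of a forward-strand gene with an end-exclusive span is an assumption about the coordinate convention.

```python
# Records copied from the sample lines of promoters.tsv,
# transcriptionUnits.tsv and genes.tsv; only the fields used here are kept.
promoters = {"PM00249": {"name": "thrLp", "position": 148}}
transcription_units = [
    {"id": "TU0-42486", "promoter_id": "PM00249", "left": 148, "right": 310,
     "gene_id": ["EG11277"]},
    {"id": "TU00178", "promoter_id": "PM00249", "left": 148, "right": 5022,
     "gene_id": ["EG10998", "EG10999", "EG11000", "EG11277"]},
]
genes = {
    "EG11277": {"symbol": "thrL", "coordinate": 189, "length": 66},
    "EG10998": {"symbol": "thrA", "coordinate": 336, "length": 2463},
}


def genes_inside(tu):
    """Return symbols of the known genes of a transcription unit that lie within its span."""
    inside = []
    for gene_id in tu["gene_id"]:
        gene = genes.get(gene_id)  # EG10999 and EG11000 were not quoted above
        if gene is None:
            continue
        start = gene["coordinate"]
        end = start + gene["length"]  # assumption: forward strand, end-exclusive
        if tu["left"] <= start and end <= tu["right"]:
            inside.append(gene["symbol"])
    return inside


for tu in transcription_units:
    promoter = promoters[tu["promoter_id"]]
    print(tu["id"], promoter["name"], genes_inside(tu))
```

For both transcription units the `left` coordinate equals the `position` of the promoter, and the quoted genes fall inside the span of the unit that lists them.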
* `reconstruction/ecoli/flat/metabolites.tsv` contains <metabolite> information. Sample lines:
  ``
  "id"                       "mw7.2" "location"
  "HOMO-SER"                 119.12  ["n", "j", "w", "c", "e", "m", "o", "p", "l", "i"]
  "L-ASPARTATE-SEMIALDEHYDE" 117.104 ["n", "j", "w", "c", "e", "m", "o", "p", "l", "i"]
  ``
  In the case of the enzyme thrA, one of the two reactions it catalyzes is the conversion of "L-aspartate 4-semialdehyde" into "Homoserine".

  Starting from the enzyme page: https://biocyc.org/gene?orgid=ECOLI&id=EG10998 we reach the reaction page: https://biocyc.org/ECOLI/NEW-IMAGE?type=REACTION&object=HOMOSERDEHYDROG-RXN[] which has reaction ID `HOMOSERDEHYDROG-RXN`, and that page clarifies the IDs:
  * https://biocyc.org/compound?orgid=ECOLI&id=L-ASPARTATE-SEMIALDEHYDE: "L-aspartate 4-semialdehyde" has ID `L-ASPARTATE-SEMIALDEHYDE`
  * https://biocyc.org/compound?orgid=ECOLI&id=HOMO-SER: "Homoserine" has ID `HOMO-SER`
  so these are the compounds that we care about.
* `reconstruction/ecoli/flat/reactions.tsv` contains <chemical reaction> information. Sample lines:
  ``
  "reaction id" "stoichiometry" "is reversible" "catalyzed by"

  "HOMOSERDEHYDROG-RXN-HOMO-SER/NAD//L-ASPARTATE-SEMIALDEHYDE/NADH/PROTON.51."
    {"NADH[c]": -1, "PROTON[c]": -1, "HOMO-SER[c]": 1, "L-ASPARTATE-SEMIALDEHYDE[c]": -1, "NAD[c]": 1}
    false
    ["ASPKINIIHOMOSERDEHYDROGII-CPLX", "ASPKINIHOMOSERDEHYDROGI-CPLX"]

  "HOMOSERDEHYDROG-RXN-HOMO-SER/NADP//L-ASPARTATE-SEMIALDEHYDE/NADPH/PROTON.53."
    {"NADPH[c]": -1, "NADP[c]": 1, "PROTON[c]": -1, "L-ASPARTATE-SEMIALDEHYDE[c]": -1, "HOMO-SER[c]": 1}
    false
    ["ASPKINIIHOMOSERDEHYDROGII-CPLX", "ASPKINIHOMOSERDEHYDROGI-CPLX"]
  ``
  * `catalyzed by`: here we see `ASPKINIHOMOSERDEHYDROGI-CPLX`, which we can guess is a <protein complex> made out of `ASPKINIHOMOSERDEHYDROGI-MONOMER`, which is the ID for the `thrA` we care about! This is confirmed in `complexationReactions.tsv`.
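
The `stoichiometry` column can be read mechanically. The sketch below splits the first sample reaction into reactants and products. It assumes that negative coefficients mean "consumed" and positive ones mean "produced", and that the bracketed suffix is the compartment abbreviation from `compartments.tsv`. The helper `split_species` is hypothetical.

```python
import re

# Copied from the first sample reaction of reactions.tsv.
stoichiometry = {"NADH[c]": -1, "PROTON[c]": -1, "HOMO-SER[c]": 1,
                 "L-ASPARTATE-SEMIALDEHYDE[c]": -1, "NAD[c]": 1}


def split_species(key):
    """Split e.g. 'NADH[c]' into the metabolite ID and the compartment abbreviation."""
    match = re.fullmatch(r"(.+)\[(\w)\]", key)
    return match.group(1), match.group(2)


# Assumption: negative coefficients are consumed, positive ones are produced.
reactants = sorted(split_species(k)[0] for k, v in stoichiometry.items() if v < 0)
products = sorted(split_species(k)[0] for k, v in stoichiometry.items() if v > 0)
compartments = {split_species(k)[1] for k in stoichiometry}

print(reactants, "->", products, "in", compartments)
```

Under that sign convention the entry reads as `L-ASPARTATE-SEMIALDEHYDE` plus `NADH` plus `PROTON` giving `HOMO-SER` plus `NAD`, all in compartment `c`, which is the thrA reaction discussed in the `metabolites.tsv` entry.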
* `reconstruction/ecoli/flat/complexationReactions.tsv` contains information about <chemical reactions> that produce <protein complexes>:
  ``
  "process" "stoichiometry" "id" "dir"
  "complexation"
    [
      {
        "molecule": "ASPKINIHOMOSERDEHYDROGI-CPLX",
        "coeff": 1,
        "type": "proteincomplex",
        "location": "c",
        "form": "mature"
      },
      {
        "molecule": "ASPKINIHOMOSERDEHYDROGI-MONOMER",
        "coeff": -4,
        "type": "proteinmonomer",
        "location": "c",
        "form": "mature"
      }
    ]
  "ASPKINIHOMOSERDEHYDROGI-CPLX_RXN"
  1
  ``
  The `coeff` of the monomer is how many monomers need to get together to form the final complex, four in this case. This can be seen from the Summary section of https://ecocyc.org/gene?orgid=ECOLI&id=ASPKINIHOMOSERDEHYDROGI-MONOMER[]:
  \Q[Aspartate kinase I / homoserine dehydrogenase I comprises a dimer of ThrA dimers. Although the dimeric form is catalytically active, the binding equilibrium dramatically favors the tetrameric form. The aspartate kinase and homoserine dehydrogenase activities of each ThrA monomer are catalyzed by independent domains connected by a linker region.]
  Fantastic literature summary! I can't find that information in database form there, however.
* `reconstruction/ecoli/flat/proteinComplexes.tsv` contains <protein complex> information:
  ``
  "name" "comments" "mw" "location" "reactionId" "id"
  "aspartate kinase / homoserine dehydrogenase"
  ""
  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 356414.04399999994, 0.0, 0.0, 0.0, 0.0]
  ["c"]
  "ASPKINIHOMOSERDEHYDROGI-CPLX_RXN"
  "ASPKINIHOMOSERDEHYDROGI-CPLX"
  ``
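
The `mw` of the complex can be compared with that of the monomer from `proteins.tsv`. A minimal sketch, assuming that mass is conserved by the complexation reaction, with values copied from the sample lines above and the class names taken from the `molecular_weight_keys` list quoted earlier. The helper `mass_by_class` is hypothetical.

```python
import math

molecular_weight_keys = [
    '23srRNA', '16srRNA', '5srRNA', 'tRNA', 'mRNA', 'miscRNA',
    'protein', 'metabolite', 'water', 'DNA', 'RNA',
]

# Copied from proteins.tsv (monomer) and proteinComplexes.tsv (complex).
monomer_mw = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 89103.51099999998, 0.0, 0.0, 0.0, 0.0]
complex_mw = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 356414.04399999994, 0.0, 0.0, 0.0, 0.0]
# Copied from complexationReactions.tsv: coefficient of the monomer.
monomer_coeff = -4


def mass_by_class(mw):
    """Label the non-zero entries of an 11-component mw vector with their class."""
    return {key: weight for key, weight in zip(molecular_weight_keys, mw) if weight != 0.0}


# Assumption: the complex weighs as much as the monomers it is made of.
monomers_per_complex = -monomer_coeff
expected_mw = [monomers_per_complex * weight for weight in monomer_mw]

print(mass_by_class(monomer_mw))
print(mass_by_class(complex_mw))
assert all(math.isclose(e, c, abs_tol=1e-6) for e, c in zip(expected_mw, complex_mw))
```

Both vectors only carry mass in the `protein` class, and the complex weighs four times the monomer to within rounding, in agreement with the `coeff` of -4.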
* `reconstruction/ecoli/flat/protein_half_lives.tsv` contains the <half-life> of <proteins>. Very few proteins are listed however for some reason.
* `reconstruction/ecoli/flat/tfIds.csv` contains <transcription factor> information:
  ``
  "TF"   "geneId"  "oneComponentId"  "twoComponentId" "nonMetaboliteBindingId" "activeId" "notes"
  "arcA" "EG10061" "PHOSPHO-ARCA"    "PHOSPHO-ARCA"
  "fnr"  "EG10325" "FNR-4FE-4S-CPLX" "FNR-4FE-4S-CPLX"
  "dksA" "EG10230"
  ``