Size: 1-2 micrometers long and about 0.25 micrometer in diameter, so: 2 * 0.5 * 0.5 * 10e-18 and thus 0.5 micrometer square.
Reference strain: E. Coli K-12 MG1655.
Genome:
  • 4k genes
  • 5 Mbps
  • www.ncbi.nlm.nih.gov/genome/167
  • wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
  • wget -O NC_000913.3.fasta 'https://www.ncbi.nlm.nih.gov/search/api/sequence/NC_000913.3/?report=fasta'
Omics modeling: www.ncbi.nlm.nih.gov/pmc/articles/PMC5611438/ Tools for Genomic and Transcriptomic Analysis of Microbes at Single-Cell Level Zixi Chen, Lei Chen, Weiwen Zhang.
20 minutes in optimal conditions, with a crazy multiple start sites mechanism: E. Coli starts DNA replication before the previous one finished.
Otherwise, naively, would take 60-90 minutes just to replicate and segregate the full DNA otherwise. So it starts copying multiple times.
Appears to have just one, other bacteria can have more. TODO position in NCBI. Sequence determined in 1979: www.ncbi.nlm.nih.gov/pmc/articles/PMC382992
The conventional starting point is not at the E. Coli K-12 MG1655 origin of replication.
biocyc.org/ECOLI/NEW-IMAGE?type=EXTRAGENIC-SITE&object=G0-10506 explains:
This site is the origin of replication of the E. coli chromosome. It contains the binding sites for DnaA, which is critical for initiation of replication. Replication proceeds bidirectionally. For historical reasons, the numbering of E. coli's circular chromosome does not start at the origin of replication, but at the origin of transfer during conjugation.
If it is a bit hard to understand what they mean by "origin of transfer" though, as that term is usually associated with the origin of transfer of bacterial conjugation.
github.com/CovertLab/WholeCellEcoliRelease is a whole cell simulation model created by Covert Lab and other collaborators.
The project is written in Python, hurray! But according to te README, it seems to be the use a code drop model with on-request access to master, very meh, asked rationale on GitHub discussion, and they confirmed as expected that it is to:
  • to prevent their publication ideas from being stolen. Who would steal publication ideas with public proof in an issue tracker without crediting original authors?
  • to prevent noise from non collaborators. They do only get like 2 issues as year though, people forget that it is legal to ignore other people :-)
Oh well.
The project is a followup to the earlier M. genitalium whole cell model by Covert lab which modelled Mycoplasma genitalium. E. Coli has 8x more genes (500 vs 4k), but it the undisputed bacterial model organism and as such has been studied much more thoroughly. It also reproduces faster than Mycoplasma (20 minutes vs a few hours), which is a huge advantages for validation/exploratory experiments.
The project has a partial dependency on the proprietary optimization software CPLEX which is freeware, for students, not sure what it is used for exactly, from the comment in the requirements.txt the dependency is only partial.
This project makes Ciro Santilli think of the E. Coli as an optimization problem. Given such external nutrient/temperature condition, which DNA sequence makes the cell grow the fastest? Balancing metabolites feels like designing a Factorio speedrun.
Everything in this section refers to version 7e4cc9e57de76752df0f4e32eca95fb653ea64e4, the code drop from November 2020, and was tested on Ubuntu 21.04 with a docker install of docker.pkg.github.com/covertlab/wholecellecolirelease/wcm-full with image id 502c3e604265, unless otherwise noted.
At 7e4cc9e57de76752df0f4e32eca95fb653ea64e4 you basically need to use the Docker image on Ubuntu 21.04 due to pip breaking changes... (not their fault). Perhaps pyenv would solve things, but who has the patience for that?!?!
The Docker setup from README does just work. The image download is a bit tedius, as it requires you to create a GitHub API key as described in the README, but there must be reasons for that.
Once the image is downloaded, you really want to run is from the root of the source tree:
sudo docker run --name=wcm -it -v "$(pwd):/wcEcoli" docker.pkg.github.com/covertlab/wholecellecolirelease/wcm-full
This mounts the host source under /wcEcoli, so you can easily edit and view output images from your host. Once inside Docker we can compile, run the simulation, and analyze results with:
make clean compile &&
python runscripts/manual/runFitter.py &&
python runscripts/manual/runSim.py &&
python runscripts/manual/analysisVariant.py &&
python runscripts/manual/analysisCohort.py &&
python runscripts/manual/analysisMultigen.py &&
python runscripts/manual/analysisSingle.py
The meaning of each of the analysis commands is described at Section "Output overview".
As a Docker refresher, after you stop the container, e.g. by restarting your computer or running sudo docker stop wcm, you can get back into it with:
sudo docker start wcm
sudo docker run -it wcm bash
runscripts/manual/runFitter.py takes about 15 minutes, and it generates files such as reconstruction/ecoli/dataclasses/process/two_component_system.py (related) which is required to run the simulation, it is basically a part of the build.
runSim.py does the main simulation, progress output contains lines of type:
Time (s)  Dry mass     Dry mass      Protein          RNA    Small mol     Expected
              (fg)  fold change  fold change  fold change  fold change  fold change
========  ========  ===========  ===========  ===========  ===========  ===========
    0.00    403.09        1.000        1.000        1.000        1.000        1.000
    0.20    403.18        1.000        1.000        1.000        1.000        1.000
and then it ended on the Lenovo ThinkPad P51 (2017) at:
 2569.18    783.09        1.943        1.910        2.005        1.950        1.963

Simulation finished:
 - Length: 0:42:49
 - Runtime: 0:09:13
when the cell had almost doubled, and presumably divided in 42 minutes of simulated time, which could make sense compared to the 20 under optimal conditions.
Run output is placed under out/:
Some of the output data is stored as .cpickle files. To observe those files, you need the original Python classes, and therefore you have to be inside Docker, from the host it won't work.
We can list all the plots that have been produced under out/ with
find -name '*.png'
Plots are also available in SVG and PDF formats, e.g.:
  • PNG: ./out/manual/plotOut/low_res_plots/massFractionSummary.png
  • SVG: ./out/manual/plotOut/svg_plots/massFractionSummary.svg The SVGs write text as polygons, see also: SVG fonts.
  • PDF: ./out/manual/plotOut/massFractionSummary.pdf
The output directory has a hierarchical structure of type:
./out/manual/wildtype_000000/000000/generation_000000/000000/
where:
  • wildtype_000000: variant conditions. wildtype is a human readable label, and 000000 is an index amongst the possible wildtype conditions. For example, we can have different simulations with different nutrients, or different DNA sequences. An example of this is shown at run variants.
  • 000000: initial random seed for the initial cell, likely fed to NumPy's np.random.seed
  • genereation_000000: this will increase with generations if we simulate multiple cells, which is supported by the model
  • 000000: this will presumably contain the cell index within a generation
We also understand that some of the top level directories contain summaries over all cells, e.g. the massFractionSummary.pdf plot exists at several levels of the hierarchy:
./out/manual/plotOut/massFractionSummary.pdf
./out/manual/wildtype_000000/plotOut/massFractionSummary.pdf
./out/manual/wildtype_000000/000000/plotOut/massFractionSummary.pdf
./out/manual/wildtype_000000/000000/generation_000000/000000/plotOut/massFractionSummary.pdf
Each of thoes four levels of plotOut is generated by a different one of the analysis scripts:
  • ./out/manual/plotOut: generated by python runscripts/manual/analysisVariant.py. Contains comparisons of different variant conditions. We confirm this by looking at the results of run variants.
  • ./out/manual/wildtype_000000/plotOut: generated by python runscripts/manual/analysisCohort.py --variant_index 0. TODO not sure how to differentiate between two different labels e.g. wildtype_000000 and somethingElse_000000. If -v is not given, a it just picks the first one alphabetically. TODO not sure how to automatically generate all of those plots without inspecting the directories.
  • ./out/manual/wildtype_000000/000000/plotOut: generated by python runscripts/manual/analysisMultigen.py --variant_index 0 --seed 0
  • ./out/manual/wildtype_000000/000000/generation_000000/000000/plotOut: generated by python runscripts/manual/analysisSingle.py --variant_index 0 --seed 0 --generation 0 --daughter 0. Contains information about a single specific cell.
Let's look into a sample plot, out/manual/plotOut/svg_plots/massFractionSummary.svg, and try to understand as much as we can about what it means and how it was generated.
This plot contains how much of each type of mass is present in all cells. Since we simulated just one cell, it will be the same as the results for that cell.
We can see that all of them grow more or less linearly, perhaps as the start of an exponential. We can see that all of them grow more or less linearly, perhaps as the start of an exponential. We can see that all of them grow more or less linearly, perhaps as the start of an exponential.
  • total dry mass (mass excluding water)
  • protein mass
  • rRNA mass
  • mRNA mass
  • DNA mass. The last label is not very visible on the plots, but we can deduce it from the source code.
By grepping the title "Cell mass fractions" in the source code, we see the files:
models/ecoli/analysis/cohort/massFractionSummary.py
models/ecoli/analysis/multigen/massFractionSummary.py
models/ecoli/analysis/variant/massFractionSummary.py
which must correspond to the different massFractionSummary plots throughout different levels of the hierarchy.
By reading models/ecoli/analysis/variant/massFractionSummary.py a little bit, we see that:
  • the plotting is done with Matplotlib, hurray
  • it is reading its data from files under ./out/manual/wildtype_000000/000000/generation_000000/000000/simOut/Mass/, more precisely ./out/manual/wildtype_000000/000000/generation_000000/000000/simOut/Mass/columns/<column-name>/data. They are binary files however.
    Looking at the source for wholecell/io/tablereader.py shows that those are just a standard NumPy serialization mechanism. Maybe they should have used the Hierarchical Data Format instead.
    We can also take this opportunity to try and find where the data is coming from. Mass from the ./out/manual/wildtype_000000/000000/generation_000000/000000/simOut/Mass/ looks like an ID, so we grep that and we reach models/ecoli/listeners/mass.py.
    From this we understand that all data that is to be saved from a simulation must be coming from listeners: likely nothing, or not much, is dumped by default, because otherwise it would take up too much disk space. You have to explicitly say what it is that you want to save via a listener that acts on each time step.
Figure 1. Minimal condition mass fraction plot. Source. File name: out/manual/plotOut/svg_plots/massFractionSummary.svg
More plot types will be explored at time series run variant, where we will contrast two runs with different growth mediums.
It would be boring if we could only simulate the same condition all the time, so let's have a look at the different boundary conditions that we can apply to the cell!
We are able to alter things like the composition of the external medium, and the genome of the bacteria, which will make the bacteria behave differently.
The variant selection is a bit cumbersome as we have to use indexes instead of names, but one you know what you are doing, it is fine.
Of course, genetic modification is limited only to experimentally known protein interactions due to the intractability of computational protein folding and computational chemistry in general, solving those would bsai.
The default run variant, if you don't pass any options, just has the minimal growth conditions set. What this means can be seen at condition.
Notably, this implies a growth medium that includes glucose and salt. It also includes oxygen, which is not strictly required, but greatly benefits cell growth, and is of course easier to have than not have as it is part of the atmosphere!
But the medium does not include amino acids, which the bacteria will have to produce by itself.
To modify the nutrients as a function of time, with To select a time series we can use something like:
python runscripts/manual/runSim.py --variant nutrientTimeSeries 25 25
As mentioned in python runscripts/manual/runSim.py --help, nutrientTimeSeries is one of the choices from github.com/CovertLab/WholeCellEcoliRelease/blob/7e4cc9e57de76752df0f4e32eca95fb653ea64e4/models/ecoli/sim/variants/__init__.py#L57
25 25 means to start from index 25 and also end at 25, so running just one simulation. 25 27 would run 25 then 26 and then 27 for example.
The timeseries with index 25 is reconstruction/ecoli/flat/condition/timeseries/000025_cut_aa.tsv and contains
"time (units.s)" "nutrients"
0 "minimal_plus_amino_acids"
1200 "minimal"
so we understand that it starts with extra amino acids in the medium, which benefit the cell, and half way through those are removed at time 1200s = 20 minutes. We would therefore expect the cell to start expressing amino acid production genes exactly at that point.
nutrients likely means condition in that file however, see bug report with 1 1 failing: github.com/CovertLab/WholeCellEcoliRelease/issues/24
When we do this the simulation ends in:
Simulation finished:
 - Length: 0:34:23
 - Runtime: 0:08:03
so we see that the doubling time was faster than the one with minimal conditions of 0:42:49, which makes sense, since during the first 20 minutes the cell had extra amino acid nutrients at its disposal.
The output directory now contains simulation output data under out/manual/nutrientTimeSeries_000025/. Let's run analysis and plots for that:
python runscripts/manual/analysisVariant.py &&
python runscripts/manual/analysisCohort.py --variant 25 &&
python runscripts/manual/analysisMultigen.py --variant 25 &&
python runscripts/manual/analysisSingle.py --variant 25
We can now compare the outputs of this run to the default wildtype_000000 run from Section "Install and first run".
  • out/manual/plotOut/svg_plots/massFractionSummary.svg: because we now have two variants in the same out/ folder, wildtype_000000 and nutrientTimeSeries_000025, we now see a side by side comparision of both on the same graph!
    The run variant where we started with amino acids initially grows faster as expected, because the cell didn't have to make it's own amino acids, so growth is a bit more efficient.
    Then, at 20 minutes, which is about 0.3 hours, we see that the cell starts growing a bit less fast as the slope of the curve decreases a bit, because we removed that free amino acid supply.
    Figure 1. Minimal condition vs amino acid cut mass fraction plot. Source. From file out/manual/plotOut/svg_plots/massFractionSummary.svg.
The following plots from under out/manual/wildtype_000000/000000/{generation_000000,nutrientTimeSeries_000025}/000000/plotOut/svg_plots have been manually joined side-by-side with:
for f in out/manual/wildtype_000000/000000/generation_000000/000000/plotOut/svg_plots/*; do
  echo $f
  svg_stack.py \
    --direction h \
    out/manual/wildtype_000000/000000/generation_000000/000000/plotOut/svg_plots/$(basename $f) \
    out/manual/nutrientTimeSeries_000025/000000/generation_000000/000000/plotOut/svg_plots/$(basename $f) \
    > tmp/$(basename $f)
done
Figure 2. Amino acid counts. Source. aaCounts.svg:
  • default: quantities just increase
  • amino acid cut: there is an abrupt fall at 20 minutes when we cut off external supply, presumably because it takes some time for the cell to start producing its own
Figure 3. External exchange fluxes of amino acids. Source. aaExchangeFluxes.svg:
  • default: no exchanges
  • amino acid cut: for all graphs except phenylalanine (PHE), either the cell was intaking the AA (negative flux), and that intake goes to 0 when the supply is cut, or the flux is always 0.
    For PHE however, the flux is at all times, except shortly after the cut. Why? And why there was no excretion on the default conditions?
Figure 4. Evaluation time. Source. evaluationTime.svg: this has nothing to do with biology, but it is rather a profile of the program runtime. We can see that the simulation gets slower and slower as time passes, presumably because there are more and more molecules to simulate.
Figure 5. mRNA count of highly expressed mRNAs. Source. From file expression_rna_03_high.svg. Each of the entries is a gene using the conventional gene naming convention of xyzW, e.g. here's the BioCyc for the first entry, tufA: biocyc.org/gene?orgid=ECOLI&id=EG11036, which comments
Elongation factor Tu (EF-Tu) is the most abundant protein in E. coli.
and
In E. coli, EF-Tu is encoded by two genes, tufA and tufB
. What they seem to mean is that tufA and tufB are two similar molecules, either of which can make up the EF-Tu of the E. Coli, which is an important part of translation.
Figure 6. External exchange fluxes. Source.
mediaExcange.svg: this one is similar to aaExchangeFluxes.svg, but it also tracks other substances. The color version makes it easier to squeeze more substances in a given space, but you lose the shape of curves a bit. The title seems reversed: red must be excretion, since that's where glucose (GLC) is.
The substances are different between the default and amino acid cut graphs, they seem to be the most exchanged substances. On the amino cut graph, first we see the cell intaking most (except phenylalanine, which is excreted for some reason). When we cut amino acids, the uptake of course stops.
Besides time series run variants, conditions can also be selected directly without a time series as in:
python runscripts/manual/runSim.py --variant condition 1 1
which select row indices from reconstruction/ecoli/flat/condition/condition_defs.tsv. The above 1 1 would mean the second line of that file which starts with:
"condition" "nutrients" "genotype perturbations" "doubling time (units.min)" "active TFs"
"basal" "minimal" {} 44.0 []
"no_oxygen" "minimal_minus_oxygen" {} 100.0 []
"with_aa" "minimal_plus_amino_acids" {} 25.0 ["CPLX-125", "MONOMER0-162", "CPLX0-7671", "CPLX0-228", "MONOMER0-155"]
so 1 means no_oxygen.
The key model database is located in the source code at reconstruction/ecoli/flat.
Let's try to understand some interesting looking, with a special focus on our understanding of the tiny E. Coli K-12 MG1655 operon thrLABC part of the metabolism, which we have well understood at Section "E. Coli K-12 MG1655 operon thrLABC".
We'll realize that a lot of data and IDs come from/match BioCyc quite closely.
Before we start, there is one major thing missing thing in the current model: promoters/transcription factor interactions are not modelled due to lack/low quality of experimental data: github.com/CovertLab/WholeCellEcoliRelease/issues/21. They just have a magic direct "transcription factor to gene" relationship, encoded at reconstruction/ecoli/flat/foldChanges.tsv in terms of type "if this is present, such protein is expressed 10x more". Transcription units are not implemented at all it appears.
  • reconstruction/ecoli/flat/compartments.tsv contains cellular compartment information:
    "abbrev" "id"
    "n" "CCO-BAC-NUCLEOID"
    "j" "CCO-CELL-PROJECTION"
    "w" "CCO-CW-BAC-NEG"
    "c" "CCO-CYTOSOL"
    "e" "CCO-EXTRACELLULAR"
    "m" "CCO-MEMBRANE"
    "o" "CCO-OUTER-MEM"
    "p" "CCO-PERI-BAC"
    "l" "CCO-PILUS"
    "i" "CCO-PM-BAC-NEG"
  • reconstruction/ecoli/flat/promoters.tsv contains promoter information. Simple file, sample lines:
    "position" "direction" "id" "name"
    148 "+" "PM00249" "thrLp"
    corresponds to E. Coli K-12 MG1655 promoter thrLp, which starts as position 148.
  • reconstruction/ecoli/flat/proteins.tsv contains protein information. Sample line corresponding to e. Coli K-12 MG1655 gene thrA:
    "aaCount" "name" "seq" "comments" "codingRnaSeq" "mw" "location" "rnaId" "id" "geneId"
    [91, 46, 38, 44, 12, 53, 30, 63, 14, 46, 89, 34, 23, 30, 29, 51, 34, 4, 20, 0, 69] "ThrA" "MRVL..." "Location information from Ecocyc dump." "AUGCGAGUGUUG..." [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 89103.51099999998, 0.0, 0.0, 0.0, 0.0] ["c"] "EG10998_RNA" "ASPKINIHOMOSERDEHYDROGI-MONOMER" "EG10998"
    so we understand that:
    • aaCount: amino acid count, how many of each of the 20 proteinogenic amino acid are there
    • seq: full sequence, using the single letter abbreviation of the proteinogenic amino acids
    • mw; molecular weight? The 11 components appear to be given at reconstruction/ecoli/flat/scripts/unifyBulkFiles.py:
      molecular_weight_keys = [
        '23srRNA',
        '16srRNA',
        '5srRNA',
        'tRNA',
        'mRNA',
        'miscRNA',
        'protein',
        'metabolite',
        'water',
        'DNA',
        'RNA' # nonspecific RNA
        ]
      so they simply classify the weight? Presumably this exists for complexes that have multiple classes?
    • location: cell compartment where the protein is present, c defined at reconstruction/ecoli/flat/compartments.tsv as cytoplasm, as expected for something that will make an amino acid
  • reconstruction/ecoli/flat/rnas.tsv: TODO vs transcriptionUnits.tsv. Sample lines:
    "halfLife" "name" "seq" "type" "modifiedForms" "monomerId" "comments" "mw" "location" "ntCount" "id" "geneId" "microarray expression"
    174.0 "ThrA [RNA]" "AUGCGAGUGUUG..." "mRNA" [] "ASPKINIHOMOSERDEHYDROGI-MONOMER" "" [0.0, 0.0, 0.0, 0.0, 790935.00399999996, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] ["c"] [553, 615, 692, 603] "EG10998_RNA" "EG10998" 0.0005264904
    • halfLife: half-life
    • mw: molecular weight, same as in reconstruction/ecoli/flat/proteins.tsv. This molecule only have weight in the mRNA class, as expected, as it just codes for a protein
    • location: same as in reconstruction/ecoli/flat/proteins.tsv
    • ntCount: nucleotide count for each of the ATGC
    • microarray expression: presumably refers to DNA microarray for gene expression profiling, but what measure exactly?
  • reconstruction/ecoli/flat/sequence.fasta: FASTA DNA sequence, first two lines:
    >E. coli K-12 MG1655 U00096.2 (1 to 4639675 = 4639675 bp)
    AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTG
  • reconstruction/ecoli/flat/transcriptionUnits.tsv: transcription units. We can observe for example the two different transcription units of the E. Coli K-12 MG1655 operon thrLABC in the lines:
    "expression_rate" "direction" "right" "terminator_id"  "name"    "promoter_id" "degradation_rate" "id"       "gene_id"                                   "left"
    0.0               "f"         310     ["TERM0-1059"]   "thrL"    "PM00249"     0.198905992329492 "TU0-42486" ["EG11277"]                                  148
    657.057317358791  "f"         5022    ["TERM_WC-2174"] "thrLABC" "PM00249"     0.231049060186648 "TU00178"   ["EG10998", "EG10999", "EG11000", "EG11277"] 148
  • reconstruction/ecoli/flat/genes.tsv
    "length" "name"                      "seq"             "rnaId"      "coordinate" "direction" "symbol" "type" "id"      "monomerId"
    66       "thr operon leader peptide" "ATGAAACGCATT..." "EG11277_RNA" 189         "+"         "thrL"   "mRNA" "EG11277" "EG11277-MONOMER"
    2463     "ThrA"                      "ATGCGAGTGTTG"    "EG10998_RNA" 336         "+"         "thrA"   "mRNA" "EG10998" "ASPKINIHOMOSERDEHYDROGI-MONOMER"
  • reconstruction/ecoli/flat/metabolites.tsv contains metabolite information. Sample lines:
    "id"                       "mw7.2" "location"
    "HOMO-SER"                 119.12  ["n", "j", "w", "c", "e", "m", "o", "p", "l", "i"]
    "L-ASPARTATE-SEMIALDEHYDE" 117.104 ["n", "j", "w", "c", "e", "m", "o", "p", "l", "i"]
    In the case of the enzyme thrA, one of the two reactions it catalyzes is "L-aspartate 4-semialdehyde" into "Homoserine".
    Starting from the enzyme page: biocyc.org/gene?orgid=ECOLI&id=EG10998 we reach the reaction page: biocyc.org/ECOLI/NEW-IMAGE?type=REACTION&object=HOMOSERDEHYDROG-RXN which has reaction ID HOMOSERDEHYDROG-RXN, and that page which clarifies the IDs:
    so these are the compounds that we care about.
  • reconstruction/ecoli/flat/reactions.tsv contains chemical reaction information. Sample lines:
    "reaction id" "stoichiometry" "is reversible" "catalyzed by"
    
    "HOMOSERDEHYDROG-RXN-HOMO-SER/NAD//L-ASPARTATE-SEMIALDEHYDE/NADH/PROTON.51."
      {"NADH[c]": -1, "PROTON[c]": -1, "HOMO-SER[c]": 1, "L-ASPARTATE-SEMIALDEHYDE[c]": -1, "NAD[c]": 1}
      false
      ["ASPKINIIHOMOSERDEHYDROGII-CPLX", "ASPKINIHOMOSERDEHYDROGI-CPLX"]
    
    "HOMOSERDEHYDROG-RXN-HOMO-SER/NADP//L-ASPARTATE-SEMIALDEHYDE/NADPH/PROTON.53."
      {"NADPH[c]": -1, "NADP[c]": 1, "PROTON[c]": -1, "L-ASPARTATE-SEMIALDEHYDE[c]": -1, "HOMO-SER[c]": 1
      false
      ["ASPKINIIHOMOSERDEHYDROGII-CPLX", "ASPKINIHOMOSERDEHYDROGI-CPLX"]
    • catalized by: here we see ASPKINIHOMOSERDEHYDROGI-CPLX, which we can guess is a protein complex made out of ASPKINIHOMOSERDEHYDROGI-MONOMER, which is the ID for the thrA we care about! This is confirmed in complexationReactions.tsv.
  • reconstruction/ecoli/flat/complexationReactions.tsv contains information about chemical reactions that produce protein complexes:
    "process" "stoichiometry" "id" "dir"
    "complexation"
      [
        {
          "molecule": "ASPKINIHOMOSERDEHYDROGI-CPLX",
          "coeff": 1,
          "type": "proteincomplex",
          "location": "c",
          "form": "mature"
        },
        {
          "molecule": "ASPKINIHOMOSERDEHYDROGI-MONOMER",
          "coeff": -4,
          "type": "proteinmonomer",
          "location": "c",
          "form": "mature"
        }
      ]
    "ASPKINIHOMOSERDEHYDROGI-CPLX_RXN"
    1
    The coeff is how many monomers need to get together for form the final complex. This can be seen from the Summary section of ecocyc.org/gene?orgid=ECOLI&id=ASPKINIHOMOSERDEHYDROGI-MONOMER:
    Aspartate kinase I / homoserine dehydrogenase I comprises a dimer of ThrA dimers. Although the dimeric form is catalytically active, the binding equilibrium dramatically favors the tetrameric form. The aspartate kinase and homoserine dehydrogenase activities of each ThrA monomer are catalyzed by independent domains connected by a linker region.
    Fantastic literature summary! Can't find that in database form there however.
  • reconstruction/ecoli/flat/proteinComplexes.tsv contains protein complex information:
    "name" "comments" "mw" "location" "reactionId" "id"
    "aspartate kinase / homoserine dehydrogenase"
    ""
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 356414.04399999994, 0.0, 0.0, 0.0, 0.0]
    ["c"]
    "ASPKINIHOMOSERDEHYDROGI-CPLX_RXN"
    "ASPKINIHOMOSERDEHYDROGI-CPLX"
  • reconstruction/ecoli/flat/protein_half_lives.tsv contains the half-life of proteins. Very few proteins are listed however for some reason.
  • reconstruction/ecoli/flat/tfIds.csv: transcription factors information:
    "TF"   "geneId"  "oneComponentId"  "twoComponentId" "nonMetaboliteBindingId" "activeId" "notes"
    "arcA" "EG10061" "PHOSPHO-ARCA"    "PHOSPHO-ARCA"
    "fnr"  "EG10325" "FNR-4FE-4S-CPLX" "FNR-4FE-4S-CPLX"
    "dksA" "EG10230"
  • reconstruction/ecoli/flat/condition/nutrient/minimal.tsv contains the nutrients in a minimal environment in which the cell survives:
    "molecule id" "lower bound (units.mmol / units.g / units.h)" "upper bound (units.mmol / units.g / units.h)"
    "ADP[c]" 3.15 3.15
    "PI[c]" 3.15 3.15
    "PROTON[c]" 3.15 3.15
    "GLC[p]" NaN 20
    "OXYGEN-MOLECULE[p]" NaN NaN
    "AMMONIUM[c]" NaN NaN
    "PI[p]" NaN NaN
    "K+[p]" NaN NaN
    "SULFATE[p]" NaN NaN
    "FE+2[p]" NaN NaN
    "CA+2[p]" NaN NaN
    "CL-[p]" NaN NaN
    "CO+2[p]" NaN NaN
    "MG+2[p]" NaN NaN
    "MN+2[p]" NaN NaN
    "NI+2[p]" NaN NaN
    "ZN+2[p]" NaN NaN
    "WATER[p]" NaN NaN
    "CARBON-DIOXIDE[p]" NaN NaN
    "CPD0-1958[p]" NaN NaN
    "L-SELENOCYSTEINE[c]" NaN NaN
    "GLC-D-LACTONE[c]" NaN NaN
    "CYTOSINE[c]" NaN NaN
    If we compare that to reconstruction/ecoli/flat/condition/nutrient/minimal_plus_amino_acids.tsv, we see that it adds the 20 amino acids on top of the minimal condition:
    "L-ALPHA-ALANINE[p]" NaN NaN
    "ARG[p]" NaN NaN
    "ASN[p]" NaN NaN
    "L-ASPARTATE[p]" NaN NaN
    "CYS[p]" NaN NaN
    "GLT[p]" NaN NaN
    "GLN[p]" NaN NaN
    "GLY[p]" NaN NaN
    "HIS[p]" NaN NaN
    "ILE[p]" NaN NaN
    "LEU[p]" NaN NaN
    "LYS[p]" NaN NaN
    "MET[p]" NaN NaN
    "PHE[p]" NaN NaN
    "PRO[p]" NaN NaN
    "SER[p]" NaN NaN
    "THR[p]" NaN NaN
    "TRP[p]" NaN NaN
    "TYR[p]" NaN NaN
    "L-SELENOCYSTEINE[c]" NaN NaN
    "VAL[p]" NaN NaN
    so we guess that NaN in the upper mound likely means infinite.
    We can try to understand the less obvious ones:
    • ADP: TODO
    • PI: TODO
    • PROTON[c]: presumably a measure of pH
    • GLC[p]: glucose, this can be seen by comparing minimal.tsv with minimal_no_glucose.tsv
    • AMMONIUM: ammonium. This appears to be the primary source of nitrogen atoms for producing amino acids.
    • CYTOSINE[c]: hmmm, why is external cytosine needed? Weird.
  • reconstruction/ecoli/flat/reconstruction/ecoli/flat/condition/timeseries/` contains sequences of conditions for each time. For example:
    * 
    reconstruction/ecoli/flat/reconstruction/ecoli/flat/condition/timeseries/000000_basal.tsv contains: "time (units.s)" "nutrients" 0 "minimal" which means just using reconstruction/ecoli/flat/condition/nutrient/minimal.tsv until infinity. That is the default one used by runSim.py, as can be seen from ./out/manual/wildtype_000000/000000/generation_000000/000000/simOut/Environment/attributes/nutrientTimeSeriesLabel which contains just 000000_basal. * reconstruction/ecoli/flat/reconstruction/ecoli/flat/condition/timeseries/000001_cut_glucose.tsv is more interesting and contains:
      "time (units.s)" "nutrients"
      0 "minimal"
      1200 "minimal_no_glucose"
      
    so we see that this will shift the conditions half-way to a condition that will eventually kill the bacteria because it will run out of glucose and thus energy!
    Timeseries can be selected with --variant nutrientTimeSeries X Y, see also: run variants.
    We can use that variant with:
      VARIANT="condition" FIRST_VARIANT_INDEX=1 LAST_VARIANT_INDEX=1 python runscripts/manual/runSim.py
      
  • reconstruction/ecoli/flat/condition/condition_defs.tsv contains lines of form:
    "condition" "nutrients"                "genotype perturbations" "doubling time (units.min)" "active TFs"
    "basal"     "minimal"                  {}                       44.0                        []
    "no_oxygen" "minimal_minus_oxygen"     {}                       100.0                       []
    "with_aa"   "minimal_plus_amino_acids" {}                       25.0                        ["CPLX-125", "MONOMER0-162", "CPLX0-7671", "CPLX0-228", "MONOMER0-155"]
    • condition refers to entries in reconstruction/ecoli/flat/condition/condition_defs.tsv
    • nutrients refers to entries under reconstruction/ecoli/flat/condition/nutrient/, e.g. reconstruction/ecoli/flat/condition/nutrient/minimal.tsv or reconstruction/ecoli/flat/condition/nutrient/minimal_plus_amino_acids.tsv
    • genotype perturbations: there aren't any in the file, but this suggests that genotype modifications can also be incorporated here
    • doubling time: TODO experimental data? Because this should be a simulation output, right? Or do they cheat and fix doubling by time?
    • active TFs: this suggests that they are cheating transcription factors here, as those would ideally be functions of other more basic inputs
TODO compare with actual datasetes.
Unfortunately, due to lack of one page to rule them all, the on-Git tree publication list is meager, some of the most relevant ones seems to be:
By Tagkopoulos lab at University of California, Davies.
Reference strain: E. Coli K-12 MG1655.
NCBI taxonomy entry: www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=511145 This links to:
  • genome: www.ncbi.nlm.nih.gov/genome/?term=txid511145 From there there are links to either:
    • Download the FASTA: "Download sequences in FASTA format for genome, protein"
      For the genome, you get a compressed FASTA file with extension .fna called GCF_000005845.2_ASM584v2_genomic.fna that starts with:
      >NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome
      AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTG
      Using wc as in wc GCF_000005845.2_ASM584v2_genomic.fna gives 58022 lines, in Vim we see that each line is 80 characters, except for the final one which is 52. So we have 58020 * 80 + 52 = 4641652 =~ 4.6 Mbp
Note that this is not the conventional starting point for gene numbering: Section "E. Coli genome starting point".
The first gene in the E. Coli K-12 MG1655 genome. Remember however that bacterial chromosome is circular, so being the first doesn't mean much, how the choice was made: Section "E. Coli genome starting point".
At only 65 bp, this gene is quite small and boring. For a more interesting gene, have a look at the next gene, e. Coli K-12 MG1655 gene thrA.
Does something to do with threonine.
This is the first in the sequence thrL, thrA, thrB, thrC. This type of naming convention is quite common on related adjacent proteins, all of which must be getting transcribed into a single RNA by the same promoter. As mentioned in the analysis of the KEGG entry for e. Coli K-12 MG1655 gene thrA, those A, B and C are actually directly functionally linked in a direct metabolic pathway.
We can see that thrL, A, B, and C are in the same transcription unit by browsing the list of promoter at: biocyc.org/group?id=:ALL-PROMOTERS&orgid=ECOLI. By finding the first one by position we reach; biocyc.org/ECOLI/NEW-IMAGE?object=TU0-42486.
The second gene in the E. Coli K-12 MG1655 genome. Part of the E. Coli K-12 MG1655 operon thrLABC.
Part of a reaction that produces threonine.
This protein is an enzyme. The UniProt entry clearly shows the chemical reactions that it catalyses. In this case, there are actually two! It can either transforming the metabolite:
  • "L-homoserine" into "L-aspartate 4-semialdehyde"
  • "L-aspartate" into "4-phospho-L-aspartate"
Also interestingly, we see that both of those reaction require some extra energy to catalyse, one needing adenosine triphosphate and the other nADP+.
TODO: any mention of how much faster it makes the reaction, numerically?
Since this is an enzyme, it would also be interesting to have a quick search for it in the KEGG entry starting from the organism: www.genome.jp/pathway/eco01100+M00022 We type in the search bar "thrA", it gives a long list, but the last entry is our "thrA". Selecting it highlights two pathways in the large graph, so we understand that it catalyzes two different reactions, as suggested by the protein name itself (fused blah blah). We can now hover over:
  • the edge: it shows all the enzymes that catalyze the given reaction. Both edges actually have multiple enzymes, e.g. the L-Homoserine path is also catalyzed by another enzyme called metL.
  • the node: they are the metabolites, e.g. one of the paths contains "L-homoserine" on one node and "L-aspartate 4-semialdehyde"
Note that common cofactor are omitted, since we've learnt from the UniProt entry that this reaction uses ATP.
If we can now click on the L-Homoserine edge, it takes us to: www.genome.jp/entry/eco:b0002+eco:b3940. Under "Pathway" we see an interesting looking pathway "Glycine, serine and threonine metabolism": www.genome.jp/pathway/eco00260+b0002 which contains a small manually selected and extremely clearly named subset of the larger graph!
But looking at the bottom of this subgraph (the UI is not great, can't Ctrl+F and enzyme names not shown, but the selected enzyme is slightly highlighted in red because it is in the URL www.genome.jp/pathway/eco00260+b0002 vs www.genome.jp/pathway/eco00260) we clearly see that thrA, thrB and thrC for a sequence that directly transforms "L-aspartate 4-semialdehyde" into "Homoserine" to "O-Phospho-L-homoserine" and finally tothreonine. This makes it crystal clear that they are not just located adjacently in the genome by chance: they are actually functionally related, and likely controlled by the same transcription factor: when you want one of them, you basically always want the three, because you must be are lacking threonine. TODO find transcription factor!
The UniProt entry also shows an interactive browser of the tertiary structure of the protein. We note that there are currently two sources available: X-ray crystallography and AlphaFold. To be honest, the AlphaFold one looks quite off!!!
By inspecting the FASTA for the entire genome, or by using the NCBI open reading frame tool, we see that this gene lies entirely in its own open reading frame, so it is quite boring
From the FASTA we see that the very first three Codons at position 337 are
ATG CGA GTG
where ATG is the start codon, and CGA GTG should be the first two that actually go into the protein:
ecocyc.org/gene?orgid=ECOLI&id=ASPKINIHOMOSERDEHYDROGI-MONOMER mentions that the enzime is most active as protein complex with four copies of the same protein:
Aspartate kinase I / homoserine dehydrogenase I comprises a dimer of ThrA dimers. Although the dimeric form is catalytically active, the binding equilibrium dramatically favors the tetrameric form. The aspartate kinase and homoserine dehydrogenase activities of each ThrA monomer are catalyzed by independent domains connected by a linker region.
TODO image?
Immediately follows e. Coli K-12 MG1655 gene thrA,
The fifth gene, and the first E. Coli K-12 MG1655 gene of unknown function as of 2021.
Note that this is very close to the "end" of the genome.
TODO DNA assembly structure.
The "last" gene, and also an E. Coli K-12 MG1655 gene of unknown function.
UniProt for example describes YaaX as "Uncharacterized protein YaaX".
As function is discovered, they then change it to a better name, e.g. to names such as the E. Coli K-12 MG1655 transcription unit thrLABC proteins all of which have a clear name due to threonine.
There are many other y??? as of 2021! Though they do tend to be smaller molecules.
From this we see that there is a convention of naming promoters as protein name + p, e.g. the first gene in E. Coli K-12 MG1655 promoter thrLp encodes protein thrL.
It is also possible to add numbers after the p, e.g. at biocyc.org/ECOLI/NEW-IMAGE?type=OPERON&object=PM0-45989 we see that the protein zur has two promoters:
  • zurp6
  • zurp7
TODO why 6 and 7? There don't appear to be 1, 2, etc.
We can find it by searching for the species in the BioCyc promoter database. This leads to: biocyc.org/group?id=:ALL-PROMOTERS&orgid=ECOLI.
By finding the first operon by position we reach: biocyc.org/ECOLI/NEW-IMAGE?object=TU0-42486.
That page lists several components of the promoter, which we should try to understand!
After the first gene in the codon, thrL, there is a rho-independent termination. By comparing:we understand that the presence of threonine or isoleucine variants, L-threonyl and L-isoleucyl, makes the rho-independent termination become more efficient, so the control loop is quite direct! Not sure why it cares about isoleucine as well though.
TODO which factor is actually specific to that DNA region?