github.com/CovertLab/WholeCellEcoliRelease is a whole cell simulation model created by Covert Lab and other collaborators.
The project is written in Python, hurray!
But according to te README, it seems to be the use a code drop model with on-request access to master. Ciro Santilli asked at rationale on GitHub discussion, and they confirmed as expected that it is to:
- to prevent their publication ideas from being stolen. Who would steal publication ideas with public proof in an issue tracker without crediting original authors? Academia is broken. Academia should be the most open form of knowledge sharing. But instead we get this silly competition for publication points.
- to prevent noise from non-collaborators. But they only get like 2 issues as year on such a meganiche subject... Did you know that you can ignore people, and even block them if they are particularly annoying? Much more likely is that no one will every hear about your project and that it will die with its last graduate student slave.
The project is a followup to the earlier M. genitalium whole cell model by Covert lab which modelled Mycoplasma genitalium. E. Coli has 8x more genes (500 vs 4k), but it the undisputed bacterial model organism and as such has been studied much more thoroughly. It also reproduces faster than Mycoplasma (20 minutes vs a few hours), which is a huge advantages for validation/exploratory experiments.
The project has a partial dependency on the proprietary optimization software CPLEX which is freeware, for students, not sure what it is used for exactly, from the comment in the 
requirements.txt the dependency is only partial.This project makes Ciro Santilli think of the E. Coli as an optimization problem. Given such external nutrient/temperature condition, which DNA sequence makes the cell grow the fastest? Balancing metabolites feels like designing a Factorio speedrun.
There is one major thing missing thing in the current model: promoters/transcription factor interactions are not modelled due to lack/low quality of experimental data: github.com/CovertLab/WholeCellEcoliRelease/issues/21. They just have a magic direct "transcription factor to gene" relationship, encoded at reconstruction/ecoli/flat/foldChanges.tsv in terms of type "if this is present, such protein is expressed 10x more". Transcription units are not implemented at all it appears.
Everything in this section refers to version 7e4cc9e57de76752df0f4e32eca95fb653ea64e4, the code drop from November 2020, and was tested on Ubuntu 21.04 with a docker install of 
docker.pkg.github.com/covertlab/wholecellecolirelease/wcm-full with image id 502c3e604265, unless otherwise noted.At 7e4cc9e57de76752df0f4e32eca95fb653ea64e4 you basically need to use the Docker image on Ubuntu 21.04 due to pip breaking changes... (not their fault). Perhaps pyenv would solve things, but who has the patience for that?!?!
The Docker setup from README does just work. The image download is a bit tedius, as it requires you to create a GitHub API key as described in the README, but there must be reasons for that.
Once the image is downloaded, you really want to run is from the root of the source tree:This mounts the host source under The meaning of each of the analysis commands is described at Section "Output overview".
sudo docker run --name=wcm -it -v "$(pwd):/wcEcoli" docker.pkg.github.com/covertlab/wholecellecolirelease/wcm-full/wcEcoli, so you can easily edit and view output images from your host. Once inside Docker we can compile, run the simulation, and analyze results with:make clean compile &&
python runscripts/manual/runFitter.py &&
python runscripts/manual/runSim.py &&
python runscripts/manual/analysisVariant.py &&
python runscripts/manual/analysisCohort.py &&
python runscripts/manual/analysisMultigen.py &&
python runscripts/manual/analysisSingle.pyAs a Docker refresher, after you stop the container, e.g. by restarting your computer or running 
sudo docker stop wcm, you can get back into it with:sudo docker start wcm
sudo docker run -it wcm bashrunscripts/manual/runFitter.py takes about 15 minutes, and it generates files such as reconstruction/ecoli/dataclasses/process/two_component_system.py (related) which is required to run the simulation, it is basically a part of the build.runSim.py does the main simulation, progress output contains lines of type:Time (s)  Dry mass     Dry mass      Protein          RNA    Small mol     Expected
              (fg)  fold change  fold change  fold change  fold change  fold change
========  ========  ===========  ===========  ===========  ===========  ===========
    0.00    403.09        1.000        1.000        1.000        1.000        1.000
    0.20    403.18        1.000        1.000        1.000        1.000        1.000 2569.18    783.09        1.943        1.910        2.005        1.950        1.963
Simulation finished:
 - Length: 0:42:49
 - Runtime: 0:09:13Run output is placed under 
out/:Some of the output data is stored as 
.cpickle files. To observe those files, you need the original Python classes, and therefore you have to be inside Docker, from the host it won't work.We can list all the plots that have been produced under Plots are also available in SVG and PDF formats, e.g.:
out/ withfind -name '*.png'The output directory has a hierarchical structure of type:where:
./out/manual/wildtype_000000/000000/generation_000000/000000/- wildtype_000000: variant conditions.- wildtypeis a human readable label, and- 000000is an index amongst the possible- wildtypeconditions. For example, we can have different simulations with different nutrients, or different DNA sequences. An example of this is shown at run variants.
- 000000: initial random seed for the initial cell, likely fed to NumPy's- np.random.seed
- genereation_000000: this will increase with generations if we simulate multiple cells, which is supported by the model
- 000000: this will presumably contain the cell index within a generation
We also understand that some of the top level directories contain summaries over all cells, e.g. the 
massFractionSummary.pdf plot exists at several levels of the hierarchy:./out/manual/plotOut/massFractionSummary.pdf
./out/manual/wildtype_000000/plotOut/massFractionSummary.pdf
./out/manual/wildtype_000000/000000/plotOut/massFractionSummary.pdf
./out/manual/wildtype_000000/000000/generation_000000/000000/plotOut/massFractionSummary.pdfEach of thoes four levels of 
plotOut is generated by a different one of the analysis scripts:- ./out/manual/plotOut: generated by- python runscripts/manual/analysisVariant.py. Contains comparisons of different variant conditions. We confirm this by looking at the results of run variants.
- ./out/manual/wildtype_000000/plotOut: generated by- python runscripts/manual/analysisCohort.py --variant_index 0. TODO not sure how to differentiate between two different labels e.g.- wildtype_000000and- somethingElse_000000. If- -vis not given, a it just picks the first one alphabetically. TODO not sure how to automatically generate all of those plots without inspecting the directories.
- ./out/manual/wildtype_000000/000000/plotOut: generated by- python runscripts/manual/analysisMultigen.py --variant_index 0 --seed 0
- ./out/manual/wildtype_000000/000000/generation_000000/000000/plotOut: generated by- python runscripts/manual/analysisSingle.py --variant_index 0 --seed 0 --generation 0 --daughter 0. Contains information about a single specific cell.
Let's look into a sample plot, 
out/manual/plotOut/svg_plots/massFractionSummary.svg, and try to understand as much as we can about what it means and how it was generated.This plot contains how much of each type of mass is present in all cells. Since we simulated just one cell, it will be the same as the results for that cell.
We can see that all of them grow more or less linearly, perhaps as the start of an exponential. We can see that all of them grow more or less linearly, perhaps as the start of an exponential. We can see that all of them grow more or less linearly, perhaps as the start of an exponential.which must correspond to the different 
By grepping the title "Cell mass fractions" in the source code, we see the files:
models/ecoli/analysis/cohort/massFractionSummary.py
models/ecoli/analysis/multigen/massFractionSummary.py
models/ecoli/analysis/variant/massFractionSummary.pymassFractionSummary plots throughout different levels of the hierarchy.By reading 
models/ecoli/analysis/variant/massFractionSummary.py a little bit, we see that:- the plotting is done with Matplotlib, hurray
- it is reading its data from files under./out/manual/wildtype_000000/000000/generation_000000/000000/simOut/Mass/, more precisely./out/manual/wildtype_000000/000000/generation_000000/000000/simOut/Mass/columns/<column-name>/data. They are binary files however.Looking at the source forwholecell/io/tablereader.pyshows that those are just a standard NumPy serialization mechanism. Maybe they should have used the Hierarchical Data Format instead.We can also take this opportunity to try and find where the data is coming from.Massfrom the./out/manual/wildtype_000000/000000/generation_000000/000000/simOut/Mass/looks like an ID, so wegrepthat and we reachmodels/ecoli/listeners/mass.py.From this we understand that all data that is to be saved from a simulation must be coming from listeners: likely nothing, or not much, is dumped by default, because otherwise it would take up too much disk space. You have to explicitly say what it is that you want to save via a listener that acts on each time step.
More plot types will be explored at time series run variant, where we will contrast two runs with different growth mediums.
It would be boring if we could only simulate the same condition all the time, so let's have a look at the different boundary conditions that we can apply to the cell!
We are able to alter things like the composition of the external medium, and the genome of the bacteria, which will make the bacteria behave differently.
The variant selection is a bit cumbersome as we have to use indexes instead of names, but one you know what you are doing, it is fine.
Of course, genetic modification is limited only to experimentally known protein interactions due to the intractability of computational protein folding and computational chemistry in general, solving those would bsai.
The default run variant, if you don't pass any options, just has the minimal growth conditions set. What this means can be seen at condition.
Notably, this implies a growth medium that includes glucose and salt. It also includes oxygen, which is not strictly required, but greatly benefits cell growth, and is of course easier to have than not have as it is part of the atmosphere!
But the medium does not include amino acids, which the bacteria will have to produce by itself.
To modify the nutrients as a function of time, with To select a time series we can use something like:As mentioned in 
python runscripts/manual/runSim.py --variant nutrientTimeSeries 25 25python runscripts/manual/runSim.py --help, nutrientTimeSeries is one of the choices from github.com/CovertLab/WholeCellEcoliRelease/blob/7e4cc9e57de76752df0f4e32eca95fb653ea64e4/models/ecoli/sim/variants/__init__.py#L5725 25 means to start from index 25 and also end at 25, so running just one simulation. 25 27 would run 25 then 26 and then 27 for example.The timeseries with index 25 is so we understand that it starts with extra amino acids in the medium, which benefit the cell, and half way through those are removed at time 1200s = 20 minutes. We would therefore expect the cell to start expressing amino acid production genes exactly at that point.
reconstruction/ecoli/flat/condition/timeseries/000025_cut_aa.tsv and contains"time (units.s)" "nutrients"
0 "minimal_plus_amino_acids"
1200 "minimal"nutrients likely means condition in that file however, see bug report with 1 1 failing:  github.com/CovertLab/WholeCellEcoliRelease/issues/24When we do this the simulation ends in:so we see that the doubling time was faster than the one with minimal conditions of 
Simulation finished:
 - Length: 0:34:23
 - Runtime: 0:08:030:42:49, which makes sense, since during the first 20 minutes the cell had extra amino acid nutrients at its disposal.The output directory now contains simulation output data under 
out/manual/nutrientTimeSeries_000025/. Let's run analysis and plots for that:python runscripts/manual/analysisVariant.py &&
python runscripts/manual/analysisCohort.py --variant 25 &&
python runscripts/manual/analysisMultigen.py --variant 25 &&
python runscripts/manual/analysisSingle.py --variant 25We can now compare the outputs of this run to the default 
wildtype_000000 run from Section "Install and first run".- out/manual/plotOut/svg_plots/massFractionSummary.svg: because we now have two variants in the same- out/folder,- wildtype_000000and- nutrientTimeSeries_000025, we now see a side by side comparision of both on the same graph!The run variant where we started with amino acids initially grows faster as expected, because the cell didn't have to make it's own amino acids, so growth is a bit more efficient.
The following plots from under 
out/manual/wildtype_000000/000000/{generation_000000,nutrientTimeSeries_000025}/000000/plotOut/svg_plots have been manually joined side-by-side with:for f in out/manual/wildtype_000000/000000/generation_000000/000000/plotOut/svg_plots/*; do
  echo $f
  svg_stack.py \
    --direction h \
    out/manual/wildtype_000000/000000/generation_000000/000000/plotOut/svg_plots/$(basename $f) \
    out/manual/nutrientTimeSeries_000025/000000/generation_000000/000000/plotOut/svg_plots/$(basename $f) \
    > tmp/$(basename $f)
doneAmino acid counts
. Source. aaCounts.svg:- default: quantities just increase
- amino acid cut: there is an abrupt fall at 20 minutes when we cut off external supply, presumably because it takes some time for the cell to start producing its own
External exchange fluxes of amino acids
. Source. aaExchangeFluxes.svg:- default: no exchanges
- amino acid cut: for all graphs except phenylalanine (PHE), either the cell was intaking the AA (negative flux), and that intake goes to 0 when the supply is cut, or the flux is always 0.
mRNA count of highly expressed mRNAs
. Source. From file expression_rna_03_high.svg. Each of the entries is a gene using the conventional gene naming convention of xyzW, e.g. here's the BioCyc for the first entry, tufA: biocyc.org/gene?orgid=ECOLI&id=EG11036, which comments Elongation factor Tu (EF-Tu) is the most abundant protein in E. coli.
External exchange fluxes
. Source. mediaExcange.svg: this one is similar to aaExchangeFluxes.svg, but it also tracks other substances. The color version makes it easier to squeeze more substances in a given space, but you lose the shape of curves a bit. The title seems reversed: red must be excretion, since that's where glucose (GLC) is.The substances are different between the default and amino acid cut graphs, they seem to be the most exchanged substances. On the amino cut graph, first we see the cell intaking most (except phenylalanine, which is excreted for some reason). When we cut amino acids, the uptake of course stops.
Besides time series run variants, conditions can also be selected directly without a time series as in:which select row indices from so 
python runscripts/manual/runSim.py --variant condition 1 1reconstruction/ecoli/flat/condition/condition_defs.tsv. The above 1 1 would mean the second line of that file which starts with:"condition" "nutrients" "genotype perturbations" "doubling time (units.min)" "active TFs"
"basal" "minimal" {} 44.0 []
"no_oxygen" "minimal_minus_oxygen" {} 100.0 []
"with_aa" "minimal_plus_amino_acids" {} 25.0 ["CPLX-125", "MONOMER0-162", "CPLX0-7671", "CPLX0-228", "MONOMER0-155"]1 means no_oxygen.Let's try to understand some interesting looking, with a special focus on our understanding of the tiny E. Coli K-12 MG1655 operon thrLABC part of the metabolism, which we have well understood at Section "E. Coli K-12 MG1655 operon thrLABC".
- reconstruction/ecoli/flat/compartments.tsvcontains cellular compartment information:- "abbrev" "id" "n" "CCO-BAC-NUCLEOID" "j" "CCO-CELL-PROJECTION" "w" "CCO-CW-BAC-NEG" "c" "CCO-CYTOSOL" "e" "CCO-EXTRACELLULAR" "m" "CCO-MEMBRANE" "o" "CCO-OUTER-MEM" "p" "CCO-PERI-BAC" "l" "CCO-PILUS" "i" "CCO-PM-BAC-NEG"- CCO: "Celular COmpartment"
- BAC-NUCLEOID: nucleoid
- CELL-PROJECTION: cell projection
- CW-BAC-NEG: TODO confirm: cell wall (of a Gram-negative bacteria)
- CYTOSOL: cytosol
- EXTRACELLULAR: outside the cell
- MEMBRANE: cell membrane
- OUTER-MEM: bacterial outer membrane
- PERI-BAC: periplasm
- PILUS: pilus
- PM-BAC-NEG: TODO: plasma membrane, but that is the same as cell membrane no?
 
- reconstruction/ecoli/flat/promoters.tsvcontains promoter information. Simple file, sample lines:corresponds to E. Coli K-12 MG1655 promoter thrLp, which starts as position 148.- "position" "direction" "id" "name" 148 "+" "PM00249" "thrLp"
- reconstruction/ecoli/flat/proteins.tsvcontains protein information. Sample line corresponding to e. Coli K-12 MG1655 gene thrA:so we understand that:- "aaCount" "name" "seq" "comments" "codingRnaSeq" "mw" "location" "rnaId" "id" "geneId" [91, 46, 38, 44, 12, 53, 30, 63, 14, 46, 89, 34, 23, 30, 29, 51, 34, 4, 20, 0, 69] "ThrA" "MRVL..." "Location information from Ecocyc dump." "AUGCGAGUGUUG..." [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 89103.51099999998, 0.0, 0.0, 0.0, 0.0] ["c"] "EG10998_RNA" "ASPKINIHOMOSERDEHYDROGI-MONOMER" "EG10998"- aaCount: amino acid count, how many of each of the 20 proteinogenic amino acid are there
- seq: full sequence, using the single letter abbreviation of the proteinogenic amino acids
- mw; molecular weight? The 11 components appear to be given at- reconstruction/ecoli/flat/scripts/unifyBulkFiles.py:so they simply classify the weight? Presumably this exists for complexes that have multiple classes?- molecular_weight_keys = [ '23srRNA', '16srRNA', '5srRNA', 'tRNA', 'mRNA', 'miscRNA', 'protein', 'metabolite', 'water', 'DNA', 'RNA' # nonspecific RNA ]- 23srRNA,- 16srRNA,- 5srRNAare the three structural RNAs present in the ribosome: 23S ribosomal RNA, 16S ribosomal RNA, 5S ribosomal RNA, all others are obvious:
- tRNA
- mRNA
- protein. This is the seventh class, and this enzyme only contains mass in this class as expected.
- metabolite
- water
- DNA
- RNA: TODO rnavsmiscRNA
 
- location: cell compartment where the protein is present,- cdefined at- reconstruction/ecoli/flat/compartments.tsvas cytoplasm, as expected for something that will make an amino acid
 
- reconstruction/ecoli/flat/rnas.tsv: TODO vs- transcriptionUnits.tsv. Sample lines:- "halfLife" "name" "seq" "type" "modifiedForms" "monomerId" "comments" "mw" "location" "ntCount" "id" "geneId" "microarray expression" 174.0 "ThrA [RNA]" "AUGCGAGUGUUG..." "mRNA" [] "ASPKINIHOMOSERDEHYDROGI-MONOMER" "" [0.0, 0.0, 0.0, 0.0, 790935.00399999996, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] ["c"] [553, 615, 692, 603] "EG10998_RNA" "EG10998" 0.0005264904- halfLife: half-life
- mw: molecular weight, same as in- reconstruction/ecoli/flat/proteins.tsv. This molecule only have weight in the- mRNAclass, as expected, as it just codes for a protein
- location: same as in- reconstruction/ecoli/flat/proteins.tsv
- ntCount: nucleotide count for each of the ATGC
- microarray expression: presumably refers to DNA microarray for gene expression profiling, but what measure exactly?
 
- reconstruction/ecoli/flat/sequence.fasta: FASTA DNA sequence, first two lines:- >E. coli K-12 MG1655 U00096.2 (1 to 4639675 = 4639675 bp) AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTG
- reconstruction/ecoli/flat/transcriptionUnits.tsv: transcription units. We can observe for example the two different transcription units of the E. Coli K-12 MG1655 operon thrLABC in the lines:- "expression_rate" "direction" "right" "terminator_id" "name" "promoter_id" "degradation_rate" "id" "gene_id" "left" 0.0 "f" 310 ["TERM0-1059"] "thrL" "PM00249" 0.198905992329492 "TU0-42486" ["EG11277"] 148 657.057317358791 "f" 5022 ["TERM_WC-2174"] "thrLABC" "PM00249" 0.231049060186648 "TU00178" ["EG10998", "EG10999", "EG11000", "EG11277"] 148- promoter_id: matches promoter id in- reconstruction/ecoli/flat/promoters.tsv
- gene_id: matches id in- reconstruction/ecoli/flat/genes.tsv
- id: matches exactly those used in BioCyc, which is quite nice, might be more or less standardized:
 
- reconstruction/ecoli/flat/genes.tsv- "length" "name" "seq" "rnaId" "coordinate" "direction" "symbol" "type" "id" "monomerId" 66 "thr operon leader peptide" "ATGAAACGCATT..." "EG11277_RNA" 189 "+" "thrL" "mRNA" "EG11277" "EG11277-MONOMER" 2463 "ThrA" "ATGCGAGTGTTG" "EG10998_RNA" 336 "+" "thrA" "mRNA" "EG10998" "ASPKINIHOMOSERDEHYDROGI-MONOMER"
- reconstruction/ecoli/flat/metabolites.tsvcontains metabolite information. Sample lines:In the case of the enzyme thrA, one of the two reactions it catalyzes is "L-aspartate 4-semialdehyde" into "Homoserine".- "id" "mw7.2" "location" "HOMO-SER" 119.12 ["n", "j", "w", "c", "e", "m", "o", "p", "l", "i"] "L-ASPARTATE-SEMIALDEHYDE" 117.104 ["n", "j", "w", "c", "e", "m", "o", "p", "l", "i"]Starting from the enzyme page: biocyc.org/gene?orgid=ECOLI&id=EG10998 we reach the reaction page: biocyc.org/ECOLI/NEW-IMAGE?type=REACTION&object=HOMOSERDEHYDROG-RXN which has reaction ID- HOMOSERDEHYDROG-RXN, and that page which clarifies the IDs:so these are the compounds that we care about.- biocyc.org/compound?orgid=ECOLI&id=L-ASPARTATE-SEMIALDEHYDE: "L-aspartate 4-semialdehyde" has ID L-ASPARTATE-SEMIALDEHYDE
- biocyc.org/compound?orgid=ECOLI&id=HOMO-SER: "Homoserine" has ID HOMO-SER
 
- biocyc.org/compound?orgid=ECOLI&id=L-ASPARTATE-SEMIALDEHYDE: "L-aspartate 4-semialdehyde" has ID 
- reconstruction/ecoli/flat/reactions.tsvcontains chemical reaction information. Sample lines:- "reaction id" "stoichiometry" "is reversible" "catalyzed by" "HOMOSERDEHYDROG-RXN-HOMO-SER/NAD//L-ASPARTATE-SEMIALDEHYDE/NADH/PROTON.51." {"NADH[c]": -1, "PROTON[c]": -1, "HOMO-SER[c]": 1, "L-ASPARTATE-SEMIALDEHYDE[c]": -1, "NAD[c]": 1} false ["ASPKINIIHOMOSERDEHYDROGII-CPLX", "ASPKINIHOMOSERDEHYDROGI-CPLX"] "HOMOSERDEHYDROG-RXN-HOMO-SER/NADP//L-ASPARTATE-SEMIALDEHYDE/NADPH/PROTON.53." {"NADPH[c]": -1, "NADP[c]": 1, "PROTON[c]": -1, "L-ASPARTATE-SEMIALDEHYDE[c]": -1, "HOMO-SER[c]": 1 false ["ASPKINIIHOMOSERDEHYDROGII-CPLX", "ASPKINIHOMOSERDEHYDROGI-CPLX"]- catalized by: here we see- ASPKINIHOMOSERDEHYDROGI-CPLX, which we can guess is a protein complex made out of- ASPKINIHOMOSERDEHYDROGI-MONOMER, which is the ID for the- thrAwe care about! This is confirmed in- complexationReactions.tsv.
 
- reconstruction/ecoli/flat/complexationReactions.tsvcontains information about chemical reactions that produce protein complexes:The- "process" "stoichiometry" "id" "dir" "complexation" [ { "molecule": "ASPKINIHOMOSERDEHYDROGI-CPLX", "coeff": 1, "type": "proteincomplex", "location": "c", "form": "mature" }, { "molecule": "ASPKINIHOMOSERDEHYDROGI-MONOMER", "coeff": -4, "type": "proteinmonomer", "location": "c", "form": "mature" } ] "ASPKINIHOMOSERDEHYDROGI-CPLX_RXN" 1- coeffis how many monomers need to get together for form the final complex. This can be seen from the Summary section of ecocyc.org/gene?orgid=ECOLI&id=ASPKINIHOMOSERDEHYDROGI-MONOMER:Fantastic literature summary! Can't find that in database form there however.- Aspartate kinase I / homoserine dehydrogenase I comprises a dimer of ThrA dimers. Although the dimeric form is catalytically active, the binding equilibrium dramatically favors the tetrameric form. The aspartate kinase and homoserine dehydrogenase activities of each ThrA monomer are catalyzed by independent domains connected by a linker region. 
- reconstruction/ecoli/flat/proteinComplexes.tsvcontains protein complex information:- "name" "comments" "mw" "location" "reactionId" "id" "aspartate kinase / homoserine dehydrogenase" "" [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 356414.04399999994, 0.0, 0.0, 0.0, 0.0] ["c"] "ASPKINIHOMOSERDEHYDROGI-CPLX_RXN" "ASPKINIHOMOSERDEHYDROGI-CPLX"
- reconstruction/ecoli/flat/protein_half_lives.tsvcontains the half-life of proteins. Very few proteins are listed however for some reason.
- reconstruction/ecoli/flat/tfIds.csv: transcription factors information:- "TF" "geneId" "oneComponentId" "twoComponentId" "nonMetaboliteBindingId" "activeId" "notes" "arcA" "EG10061" "PHOSPHO-ARCA" "PHOSPHO-ARCA" "fnr" "EG10325" "FNR-4FE-4S-CPLX" "FNR-4FE-4S-CPLX" "dksA" "EG10230"
- reconstruction/ecoli/flat/condition/nutrient/minimal.tsvcontains the nutrients in a minimal environment in which the cell survives:If we compare that to- "molecule id" "lower bound (units.mmol / units.g / units.h)" "upper bound (units.mmol / units.g / units.h)" "ADP[c]" 3.15 3.15 "PI[c]" 3.15 3.15 "PROTON[c]" 3.15 3.15 "GLC[p]" NaN 20 "OXYGEN-MOLECULE[p]" NaN NaN "AMMONIUM[c]" NaN NaN "PI[p]" NaN NaN "K+[p]" NaN NaN "SULFATE[p]" NaN NaN "FE+2[p]" NaN NaN "CA+2[p]" NaN NaN "CL-[p]" NaN NaN "CO+2[p]" NaN NaN "MG+2[p]" NaN NaN "MN+2[p]" NaN NaN "NI+2[p]" NaN NaN "ZN+2[p]" NaN NaN "WATER[p]" NaN NaN "CARBON-DIOXIDE[p]" NaN NaN "CPD0-1958[p]" NaN NaN "L-SELENOCYSTEINE[c]" NaN NaN "GLC-D-LACTONE[c]" NaN NaN "CYTOSINE[c]" NaN NaN- reconstruction/ecoli/flat/condition/nutrient/minimal_plus_amino_acids.tsv, we see that it adds the 20 amino acids on top of the minimal condition:so we guess that- "L-ALPHA-ALANINE[p]" NaN NaN "ARG[p]" NaN NaN "ASN[p]" NaN NaN "L-ASPARTATE[p]" NaN NaN "CYS[p]" NaN NaN "GLT[p]" NaN NaN "GLN[p]" NaN NaN "GLY[p]" NaN NaN "HIS[p]" NaN NaN "ILE[p]" NaN NaN "LEU[p]" NaN NaN "LYS[p]" NaN NaN "MET[p]" NaN NaN "PHE[p]" NaN NaN "PRO[p]" NaN NaN "SER[p]" NaN NaN "THR[p]" NaN NaN "TRP[p]" NaN NaN "TYR[p]" NaN NaN "L-SELENOCYSTEINE[c]" NaN NaN "VAL[p]" NaN NaN- NaNin the- upper moundlikely means infinite.We can try to understand the less obvious ones:- ADP: TODO
- PI: TODO
- PROTON[c]: presumably a measure of pH
- GLC[p]: glucose, this can be seen by comparing- minimal.tsvwith- minimal_no_glucose.tsv
- AMMONIUM: ammonium. This appears to be the primary source of nitrogen atoms for producing amino acids.
- CYTOSINE[c]: hmmm, why is external cytosine needed? Weird.
 
- reconstruction/ecoli/flat/reconstruction/ecoli/flat/condition/timeseries/contains sequences of conditions for each time. For example:- reconstruction/ecoli/flat/reconstruction/ecoli/flat/condition/timeseries/000000_basal.tsvcontains:which means just using- "time (units.s)" "nutrients" 0 "minimal"- reconstruction/ecoli/flat/condition/nutrient/minimal.tsvuntil infinity. That is the default one used by- runSim.py, as can be seen from- ./out/manual/wildtype_000000/000000/generation_000000/000000/simOut/Environment/attributes/nutrientTimeSeriesLabelwhich contains just- 000000_basal.
- reconstruction/ecoli/flat/reconstruction/ecoli/flat/condition/timeseries/000001_cut_glucose.tsvis more interesting and contains:so we see that this will shift the conditions half-way to a condition that will eventually kill the bacteria because it will run out of glucose and thus energy!- "time (units.s)" "nutrients" 0 "minimal" 1200 "minimal_no_glucose"
 Timeseries can be selected with- --variant nutrientTimeSeries X Y, see also: run variants.We can use that variant with:- VARIANT="condition" FIRST_VARIANT_INDEX=1 LAST_VARIANT_INDEX=1 python runscripts/manual/runSim.py
- reconstruction/ecoli/flat/condition/condition_defs.tsvcontains lines of form:- "condition" "nutrients" "genotype perturbations" "doubling time (units.min)" "active TFs" "basal" "minimal" {} 44.0 [] "no_oxygen" "minimal_minus_oxygen" {} 100.0 [] "with_aa" "minimal_plus_amino_acids" {} 25.0 ["CPLX-125", "MONOMER0-162", "CPLX0-7671", "CPLX0-228", "MONOMER0-155"]- conditionrefers to entries in- reconstruction/ecoli/flat/condition/condition_defs.tsv
- nutrientsrefers to entries under- reconstruction/ecoli/flat/condition/nutrient/, e.g.- reconstruction/ecoli/flat/condition/nutrient/minimal.tsvor- reconstruction/ecoli/flat/condition/nutrient/minimal_plus_amino_acids.tsv
- genotype perturbations: there aren't any in the file, but this suggests that genotype modifications can also be incorporated here
- doubling time: TODO experimental data? Because this should be a simulation output, right? Or do they cheat and fix doubling by time?
- active TFs: this suggests that they are cheating transcription factors here, as those would ideally be functions of other more basic inputs
 
TODO compare with actual datasetes.
Unfortunately, due to lack of one page to rule them all, the on-Git tree publication list is meager, some of the most relevant ones seems to be:
- 2021 open access review paper: journals.asm.org/doi/full/10.1128/ecosalplus.ESP-0001-2020 "The E. coli Whole-Cell Modeling Project". They should just past that stuff in a README :-) The article mentions that it is a follow up to the previous M. genitalium whole cell model by Covert lab. Only 43% of known genes modelled at this point however, a shame.
- 2020 under Science paywall: www.science.org/doi/10.1126/science.aav3751 "Simultaneous cross-evaluation of heterogeneous E. coli datasets via mechanistic simulation"
By Tagkopoulos lab at University of California, Davies.
- www.nature.com/articles/ncomms13090 Multi-omics integration accurately predicts cellular state in unexplored conditions for Escherichia coli (2016)
- www.sciencedaily.com/releases/2016/10/161027173552.htm
 Articles by others on the same topic
There are currently no matching articles.