1. Introduction

A central goal of biological science is to quantitatively understand how genotype influences phenotype. However, despite decades of research, a growing wealth of experimental data, and extensive knowledge of individual molecules and individual pathways, we still do not understand how biological behavior emerges from the molecular level. For example, we do not understand how transcription factors, non-coding RNA, localization signals, degradation tags, and other regulatory systems interact to control protein expression.

Consequently, physicians still cannot interpret the pathophysiological consequences of genetic variation, and bioengineers still cannot rationally design microorganisms. Instead, patients often have to try multiple drugs to find a single effective drug, which exposes patients to unnecessary drugs, prolongs disease, and increases costs. Similarly, bioengineers often have to rely on time-consuming and expensive trial-and-error methods such as directed evolution [4][5].

Many engineering fields use mechanistic models to help understand and design complex systems such as cars [6], buildings [7], and transportation networks [8]. In particular, mechanistic models can help researchers conduct experiments with complete control, precision, and reproducibility.

To comprehensively understand cells, we must develop whole-cell (WC) computational models that predict cellular behavior by representing all of the biochemical activity inside cells [9][10][11][12]. WC models could accelerate biological science by helping researchers unify our knowledge of cell biology, identify gaps in our understanding, and conduct complex experiments that would be infeasible in vitro. WC models could also help bioengineers design microorganisms and help physicians personalize medicine.

Since the 1950s, researchers have been using modeling to understand cells. This has led to numerous models of individual pathways, including models of cell cycle regulation, chemical and electrical signaling, circadian rhythms, metabolism, and transcriptional regulation. Collectively, these efforts have used a wide range of mathematical formalisms including Boolean networks, flux balance analysis (FBA) [13][14][15], ordinary differential equations (ODEs), partial differential equations (PDEs), and stochastic simulation [16].

Over the last 20 years, researchers have begun to build more comprehensive models that represent multiple pathways [17][18][19][20][21][22][23]. Many of these models have been built by combining multiple mathematically dissimilar submodels of individual pathways into a single multi-algorithmic model [24][3].

Although we do not yet have all of the data and methods needed to model entire cells, we believe that WC models are rapidly becoming feasible due to ongoing advances in measurement and computational technology. In particular, we now have a wide array of experimental methods for characterizing cells, numerous repositories which contain much of the data needed for WC modeling, and a variety of tools for extrapolating experimental data to other organisms and conditions. In addition, we now have a wide range of modeling and simulation tools, including tools for designing models, rule-based model formats for describing complex models, and tools for simulating multi-algorithmic models. However, few of these resources support the scale required for WC modeling, and many of these resources remain siloed.

Nevertheless, we and others are beginning to model entire cells [17][25][26][27][28]. In 2012, we and others reported the first dynamical model that represents all of the characterized genes in a cell [27]. The model represents 28 pathways of the small bacterium Mycoplasma genitalium and predicts the essentiality of its genes with 80% accuracy.

However, several bottlenecks to building more comprehensive and more accurate WC models remain. In particular, we do not yet have all of the data needed for WC modeling or tools for designing, describing, or simulating WC models. To accelerate WC modeling, we must develop new methods for characterizing the single-cell dynamics of each metabolite and protein; develop new methods for scalably designing, simulating, and calibrating high-dimensional dynamical models; develop new standards for describing and verifying dynamical models; and assemble an interdisciplinary WC modeling community.

In this part, we summarize the scientific, engineering, and medical problems which are motivating WC modeling; propose the phenotypes that WC models should aim to predict and the molecular mechanisms that WC models should aim to represent; outline the fundamental challenges of WC modeling; describe why WC models are feasible by reviewing the existing methods, data, and models which could be leveraged for WC modeling; review the latest WC models and their limitations; outline the most immediate bottlenecks to WC modeling; propose a plan for achieving WC models; and summarize ongoing efforts to advance WC modeling.

1.1. Motivation for WC modeling

In our opinion, WC modeling is motivated by the need to understand biology, personalize medicine, and design microorganisms. Biological science needs comprehensive models that represent the sequence, function, and interactions of each gene to help scientists holistically understand cell biology. Similarly, precision medicine needs comprehensive models that predict phenotype from genotype to help physicians interpret the pathophysiological impact of genetic variation, which can occur in any gene, and synthetic biology requires comprehensive models to help bioengineers rationally design microbial genomes for a wide range of applications.

In addition, WC models could help researchers address specific scientific problems such as determining how transcriptional regulation, non-coding RNA, and other pathways combine to regulate protein expression. Furthermore, each WC model could be used to address multiple questions, avoiding the need to build separate models for each question. However, few scientific problems require WC models, and we believe that most scientific problems would be more easily addressed with focused modeling.

Here, we describe the main applications which are motivating WC modeling. In the following sections, we define the biology that WC models must represent to support these applications and describe how to achieve such WC models.

1.1.1. Biological science: understand how genotype influences phenotype

Historically, the main motivation for WC modeling has been to help scientists understand how genotype and the environment determine phenotype, including how each individual gene, reaction, and pathway contributes to cellular behavior. For example, WC models could help researchers integrate heterogeneous experimental data about multiple genes and pathways. WC models could also help researchers gain novel insights into how pathways interact to control behavior. By comparison to experimental data, WC models could also help researchers identify gaps in our understanding. In addition, WC models would enable researchers to conduct experiments with complete control, infinite scope, and unlimited resolution, which would allow researchers to conduct complex experiments that would be infeasible in vitro.

1.1.2. Medicine: personalize medicine for individual genomes

Recent studies have shown that each patient has a unique genome, that genetic variation can occur in any gene and pathway, and that small genetic differences can cause patients to respond differentially to the same drugs. Together, this suggests that medicine could be improved by tailoring therapy to each patient’s genome. Physicians are beginning to use data-driven models to tailor medicine to a small number of well-established genetic variants that have large phenotypic effects. Tailoring medicine for all genetic variation requires WC models that represent every gene and that can predict the phenotypic effect of any combination of genetic variation. Such WC models would help physicians predict the most likely prognosis for each patient and identify the best combination of drugs for each patient (Figure 1.1). For example, WC models could help oncologists conduct personalized in silico drug trials to identify the best chemotherapy regimen for each patient. Similarly, WC models could help obstetricians identify diseases in early fetuses. In addition, WC models could help pharmacologists avoid harmful gene-drug interactions.

Figure 1.1 WC models could transform medicine by helping physicians use patient-specific models informed by genomic data to design personalized prognoses and therapies.

1.1.3. Synthetic biology: rationally design microbial genomes

Synthetic biology promises to create microorganisms for a wide range of industrial, medical, and security applications such as cheaply producing chemicals, drugs, and fuels; quickly detecting diseased tissue; killing pathogenic bacteria; and decontaminating industrial waste. Currently, microorganisms are often engineered using directed evolution [4][5]. However, directed evolution is often time-consuming and limited to small phenotypic changes. Recently, researchers at the JCVI have begun to pioneer methods for chemically synthesizing entire genomes [29]. Realizing the full potential of this methodology requires WC models that can help bioengineers design entire genomes. For example, WC models could help bioengineers analyze the impact of synthetic circuits on host cells, design efficient chassis for synthetic circuits, and design bacterial drug delivery systems that can detect diseased tissue and synthesize drugs in situ.

1.2. The biology that WC models should aim to represent and predict

In the previous section, we argued that medicine and bioengineering need comprehensive models that can predict phenotype from genotype. Here, we outline the specific phenotypes that we believe WC models should aim to predict and the specific physicochemical mechanisms that we believe WC models should aim to represent to support medicine and bioengineering (Figure 1.2). In the following sections, we outline why we believe that WC models are becoming feasible and describe how to build and simulate WC models.

Figure 1.2 The physical and chemical mechanisms that WC models should aim to represent (a) and the phenotypes that WC models should aim to predict (b).

1.2.1. Phenotypes that WC models should aim to predict

To support medicine and bioengineering, we believe that WC models should aim to predict the phenotypes of individual cells over their entire life cycles (Figure 1.2b). Specifically, we believe that WC models should aim to predict the following five levels of phenotypes:

Stochastic dynamics: To help physicians understand how genetic variation affects how cells respond to drugs, and to help bioengineers design microorganisms that are robust to stochastic variation, WC models should predict the stochastic behavior of each molecular species and molecular interaction. For example, this would help physicians design drugs that are robust to variation in RNA splicing, protein modification, and protein complexation. This would also help bioengineers design feedback loops that can control the expression of key RNA and proteins.

Temporal dynamics: To help physicians understand the impact of genetic variation on cell cycle regulation, and to help bioengineers control the temporal dynamics of microorganisms, WC models should predict the temporal dynamics of the concentration of each molecular species. For example, this would help physicians identify genetic variation that can disrupt cell cycle regulation and cause cancer. This would also help bioengineers design microorganisms that can perform specific tasks at specific times.

Spatial dynamics: To help physicians predict the intracellular distribution of drugs, and to help bioengineers use space to concentrate and insulate molecular interactions, WC models should predict the concentration of each molecular species in each spatial domain. For example, this would help physicians predict whether drugs interact with their intended targets and predict how quickly cells metabolize drugs. This would also help bioengineers maximize the metabolic activity of microorganisms by co-localizing enzymes and their substrates.

Single-cell variation: To help physicians understand how drugs affect populations of heterogeneous cells, and to help bioengineers design robust microorganisms, WC models should predict the single-cell variation of cellular behavior. For example, this would help physicians understand how chemotherapies affect heterogeneous tumors, and help bioengineers design reliable biosensors that activate at the same threshold irrespective of stochastic variation in RNA and protein expression.

Complex phenotypes: To help physicians understand the impact of variation on complex phenotypes and to help bioengineers design microorganisms that exhibit complex phenotypes, WC models should predict complex phenotypes such as cell shape, growth rate, and fate. For example, this would help physicians identify the primary variants responsible for disease and help physicians screen drugs in silico. This would also help bioengineers design sophisticated strains that can detect tumors, synthesize chemotherapeutics, and deliver drugs directly to tumors.
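
As a rough illustration of the stochastic-dynamics and single-cell-variation levels above, the following sketch simulates a toy birth-death model of a single RNA species with the Gillespie direct method, a standard stochastic simulation algorithm. It is not taken from any WC model: the function name `simulate_rna_count`, the rate constants, and the population size are hypothetical choices made only for illustration.

```python
import random
import statistics


def simulate_rna_count(k_syn, k_deg, t_end, seed, n0=0):
    """Return the RNA copy number of one simulated cell at time t_end.

    Toy model (assumed for illustration): synthesis at constant rate k_syn
    and first-order degradation at rate k_deg per molecule, simulated with
    the Gillespie direct method.
    """
    rng = random.Random(seed)  # one independent random stream per cell
    t, n = 0.0, n0
    while True:
        a_syn = k_syn        # propensity of synthesis
        a_deg = k_deg * n    # propensity of degradation
        a_total = a_syn + a_deg
        if a_total == 0.0:
            return n  # no reaction can fire any more
        # The waiting time to the next reaction is exponentially distributed.
        t += rng.expovariate(a_total)
        if t > t_end:
            return n
        # Pick which reaction fires, in proportion to its propensity.
        if rng.random() * a_total < a_syn:
            n += 1
        else:
            n -= 1


# Hypothetical parameters: 2 molecules made per minute, 0.1 per minute lost
# per molecule, observed after 100 minutes in 200 independent cells.
counts = [simulate_rna_count(k_syn=2.0, k_deg=0.1, t_end=100.0, seed=s)
          for s in range(200)]
mean_count = statistics.mean(counts)
variance = statistics.pvariance(counts)
print(f"mean = {mean_count:.1f}, variance = {variance:.1f}")
```

For this toy model the stationary copy-number distribution is Poisson with mean k_syn / k_deg, so the simulated cells should scatter around 20 molecules with a variance of similar size. A WC model would need to capture this kind of cell-to-cell variation for every molecular species.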

1.2.2. Physics and chemistry that WC models should aim to represent

To predict these phenotypes, we believe that WC models should aim to represent all of the chemical reactions inside cells and all of the physical processes that influence their rates (Figure 1.2a). Specifically, we propose that WC models aim to represent the following seven aspects of cells:

Sequences: To predict how genotype influences phenotype, including the contribution of each individual variant and gene, WC models should represent the sequence of each chromosome, RNA, and protein; the location of each feature of each chromosome such as genes, operons, promoters, and terminators; and the location of each site of each RNA and protein.

Structures: To predict how molecular species interact and react, WC models should represent the structure of each molecule, including atom-level information about small molecules, the domains and sites of macromolecules, and the subunit composition of complexes. For example, this would enable WC models to predict the metabolism of novel compounds.

Subcellular organization: To capture the molecular interactions that occur inside cells, WC models should represent the spatial organization of cells and the localization of each metabolite, RNA, and protein species. For example, this would enable WC models to predict the spatial compartments in which molecular interactions occur.

Concentrations: To capture the molecular interactions that can occur inside cells, WC models should also represent the concentration of each molecular species in each organelle and spatial domain.

Molecular interactions: To capture how cells behave over time, WC models should represent the participants and effect of each molecular interaction, including the molecules that are consumed, produced, and transported, the molecular sites that are modified, and the bonds that are broken and formed. For example, this would enable WC models to capture the reactions responsible for cellular growth and homeostatic maintenance.

Kinetic parameters: To predict the temporal dynamics of cell behavior, WC models should represent the kinetic parameters of each interaction such as the maximum rate of each reaction and the affinity of each enzyme for its substrates and inhibitors. For example, this would enable WC models to predict the impact of genetic variation on the function of each enzyme.

Extracellular environment: To predict how the extracellular environment, including nutrients, hormones, and drugs, influences cell behavior, WC models should represent the concentration of each species in the extracellular environment. For example, this would enable WC models to predict the minimal medium required for growth.
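
To make the kinetic parameters in the list above concrete, one textbook rate law that combines a maximum rate with substrate and inhibitor affinities is Michaelis-Menten kinetics with competitive inhibition. This is offered only as a familiar example: the list above does not prescribe a particular rate law, and the symbols below are introduced here for illustration.

\[
v = \frac{V_{\max}\,[S]}{K_M \left(1 + \frac{[I]}{K_I}\right) + [S]}, \qquad V_{\max} = k_{\mathrm{cat}}\,[E]
\]

Here \(v\) is the reaction rate; \([S]\), \([I]\), and \([E]\) are the substrate, inhibitor, and enzyme concentrations; \(K_M\) and \(K_I\) are the Michaelis and inhibition constants; and \(k_{\mathrm{cat}}\) is the catalytic rate constant. Under this form, a genetic variant that alters an enzyme's \(k_{\mathrm{cat}}\) or \(K_M\) changes the predicted rate directly.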

1.3. Fundamental challenges to WC modeling

In the previous section, we defined the biology that WC models should represent and predict. Building WC models that represent all of the biochemical activity inside cells and that can predict any cellular phenotype is challenging because this requires integrating molecular behavior to the cellular level across several spatial and temporal scales; assembling a complete molecular understanding of cell biology from incomplete, imprecise, and heterogeneous data; and simulating, calibrating, and validating computationally-expensive, high-dimensional models. Here, we describe these challenges to WC modeling. In the following sections, we describe emerging methods for overcoming these challenges to achieve WC models.

1.3.1. Integrating molecular behavior to the cell level over several spatiotemporal scales

The most fundamental challenge to WC modeling is integrating the behavior of individual species and reactions to the cellular level over several spatial and temporal scales. This is challenging because it requires accurate parameter values and scalable methods for simulating large models. Here, we summarize these challenges.

1.3.1.1. Sensitivity of phenotypic predictions to molecular parameter values

The first challenge to integrating molecular behavior to the cellular level is the sensitivity of model predictions to the values of critical parameters, which necessitates accurate parameter values. Accurately identifying these values is challenging because, as described below, it is challenging to optimize high-dimensional functions and because, as described in Section 1.3.2, our experimental data is incomplete and imprecise.

1.3.1.2. High computational cost of simulating large fine-grained models

A second challenge to integrating molecular behavior to the cellular level is the high computational cost of simulating entire cells with molecular granularity. For example, simulating one cell cycle of our first WC model of the smallest known freely living organism took a full core-day of an Intel E5520 CPU, or approximately \(1 \times 10^{15}\) floating-point operations [27]. Based on this data, the fact that human cells are approximately 10⁶ times larger, and the fact that a typical WC simulation experiment will require at least 1,000 simulation runs, a typical WC simulation experiment of a human cell will require approximately 10⁶ core-years. To simulate larger and more complex organisms, we must develop faster parallel simulators.
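
The arithmetic behind this estimate can be reproduced in a few lines. The sketch below uses only the figures quoted in the paragraph above and assumes, as a simplification, that simulation cost scales linearly with cell size; the variable and function names are illustrative.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365.25

# Figures quoted in the paragraph above.
core_days_per_cell_cycle = 1.0   # M. genitalium model: one core-day per cell cycle
flop_per_cell_cycle = 1e15       # approximate floating-point operations per cell cycle
human_size_factor = 1e6          # size of a human cell relative to M. genitalium
runs_per_experiment = 1_000      # minimum simulation runs per experiment


def core_years(core_days_per_run, size_factor, n_runs):
    """Scale a per-run cost to a whole simulation experiment.

    Assumes cost grows linearly with cell size, which is a simplification.
    """
    return core_days_per_run * size_factor * n_runs / DAYS_PER_YEAR


human_experiment_core_years = core_years(
    core_days_per_cell_cycle, human_size_factor, runs_per_experiment)

# Sustained rate implied by one core-day corresponding to 1e15 operations.
implied_flop_per_second = flop_per_cell_cycle / (
    core_days_per_cell_cycle * SECONDS_PER_DAY)

print(f"human-cell experiment: {human_experiment_core_years:.1e} core-years")
print(f"implied sustained rate: {implied_flop_per_second:.1e} FLOP/s per core")
```

Under these assumptions the estimate comes to roughly 2.7 million core-years, in line with the order of magnitude quoted above.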

1.3.2. Assembling a unified molecular understanding of cells from imperfect data

In our opinion, the greatest challenge to WC modeling is assembling a unified molecular understanding of cell biology. As illustrated in Figure 1.3, this requires assembling comprehensive data about every molecular species and molecular interaction. For example, to model M. genitalium we reconstructed (a) its subcellular organization; (b) its chromosome sequence; (c) the location, length, direction and essentiality of each gene; (d) the organization and promoter of each transcription unit; (e) the expression and degradation rate of each RNA transcript; (f) the specific folding and maturation pathway of each RNA and protein species including the localization, N-terminal cleavage, signal sequence, prosthetic groups, disulfide bonds and chaperone interactions of each protein species; (g) the subunit composition of each macromolecular complex; (h) its genetic code; (i) the binding sites and footprint of every DNA-binding protein; (j) the structure, charge and hydrophobicity of every metabolite; (k) the stoichiometry, catalysis, coenzymes, energetics and kinetics of every chemical reaction; (l) the regulatory role of each transcription factor; (m) its chemical composition and (n) the composition of its growth medium [30].

Figure 1.3 WC models require comprehensive data about every molecular species and molecular interaction.

This is challenging because our data is incomplete, imprecise, heterogeneous, scattered, and poorly annotated. Here, we summarize these limitations and the challenges they present for WC modeling.

1.3.2.1. Incomplete data

The biggest limitation of our experimental data is that we do not have a complete experimental characterization of a cell. In particular, we have limited genome-scale data about individual metabolites and proteins, limited data about cell cycle dynamics, limited data about cell-to-cell variation, limited data about culture media, and limited data about cellular responses to genetic and environmental perturbations. Many genome-scale datasets are also incomplete. For example, most metabolomics and proteomics methods can only measure small numbers of metabolites and proteins.

1.3.2.2. Imprecise and noisy data

A second limitation of our experimental data is that many of our measurement methods are imprecise and noisy. For example, fluorescence microscopy cannot precisely quantitate single-cell protein abundances, single-cell RNA sequencing cannot reliably discern unexpressed RNA, and mass spectrometry cannot reliably discern unexpressed proteins.

1.3.2.3. Heterogeneous experimental methods

A third limitation of our experimental data is that our data is highly heterogeneous because we do not have a single experimental technology that is capable of completely characterizing a cell. Rather, we have a wide range of methods for characterizing different aspects of cells at different scales with different levels of resolution. For example, mass spectrometry can quantitate the concentrations of tens of metabolites, deep sequencing can quantitate the concentrations of tens of thousands of RNAs, and each biochemical experiment can quantitate one or a few kinetic parameters.

Consequently, our experimental data also spans a wide range of scales and units. For example, we have extensive molecular information about the participants in each metabolic reaction and their stoichiometries, but we only have limited information about the substrates of each protein chaperone. As a second example, we have extensive single-cell information about RNA expression, but we have limited single-cell data about metabolite concentrations.

1.3.2.4. Heterogeneous organisms and environmental conditions

A fourth limitation of our data is that we only have a small amount of data about each organism and environmental condition, and only a small amount of data from each laboratory. However, collectively, we have a large amount of data.

1.3.2.5. Siloed data

Another limitation of our data is that no resource contains all of the data needed for WC modeling. Rather, our data is scattered across a wide range of databases, websites, textbooks, publications, supplementary materials, and other resources. For example, ArrayExpress [31] and the Gene Expression Omnibus [32] (GEO) only contain RNA abundance data, PaxDb only contains protein abundance data [33], and SABIO-RK only contains kinetic data [34]. Furthermore, many of these data sources use different identifiers and different units.

1.3.2.6. Insufficient annotation

Furthermore, much of our data is insufficiently annotated to understand its biological semantic meaning and provenance. For example, few RNA-seq datasets in ArrayExpress [31] have sufficient metadata to understand the environmental condition that was measured, including the concentration of each metabolite in the growth media and the temperature and pH of the growth media. Similarly, few kinetic measurements in SABIO-RK [34] have sufficient metadata to understand the strain that was measured.

1.3.3. Selecting, calibrating and validating high-dimensional models

A third fundamental challenge to WC modeling is the high dimensionality of WC models, which makes WC models susceptible to the “curse of dimensionality”, the need for more data to constrain high-dimensional models [1]. In particular, the curse of dimensionality makes it challenging to select, calibrate, and validate WC models because we do not yet have sufficient data to select among multiple possible WC models, avoid overfitting WC models, precisely determine the value of each parameter, or test the accuracy of every possible prediction. Furthermore, it is computationally expensive to select, calibrate, and validate high-dimensional models.
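
A minimal way to see why high-dimensional models need so much data is to count how many simulations or measurements an exhaustive scan of parameter space would take. The sketch below assumes a hypothetical full-factorial design in which each parameter is tried at a fixed number of levels. Real calibration methods are more economical, and the function name and numbers are illustrative only.

```python
def grid_points(levels_per_parameter: int, n_parameters: int) -> int:
    """Number of parameter combinations in a full-factorial scan."""
    return levels_per_parameter ** n_parameters


# Three levels per parameter (for example low, nominal, high).
for n_parameters in (1, 2, 10, 100):
    n_points = grid_points(3, n_parameters)
    print(f"{n_parameters:>3} parameters -> {n_points:.3e} combinations")
```

Even at three levels per parameter, ten parameters already require 59,049 combinations, and the count triples with every additional parameter.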

1.4. Feasibility of WC models

Despite the numerous challenges to WC modeling described in the previous section, we believe that WC modeling is rapidly becoming feasible due to ongoing technological advances throughout computational systems biology, bioinformatics, genomics, molecular cell biology, applied mathematics, computer science, and software engineering, including methods for experimentally characterizing cells, repositories for sharing data, tools for building and simulating dynamical models, models of individual pathways, and model repositories. While substantial work remains to adapt and integrate these technologies into a unified framework for WC modeling, these technologies are already forming a strong intellectual foundation for WC modeling. Here, we review the technologies that we believe are making WC modeling feasible, and describe their present limitations for WC modeling. In the following section, we describe how we are beginning to leverage these technologies to build and simulate WC models.

1.4.1. Experimental methods, data, and repositories

Here, we review advances in measurement methods, data repositories, and bioinformatics tools that are generating the data needed for WC modeling, aggregating this data into repositories, and producing tools for extrapolating data to other genotypes and environments.

1.4.1.1. Measurement methods

Advances in biochemical, genomic, and single-cell measurement are rapidly generating the data needed for WC modeling [35][36][37] (Table 1.1). For example, Meth-Seq can assess epigenetic modifications [38], Hi-C can determine the average structure of chromosomes [39], ChIP-seq can determine protein-DNA interactions [40], fluorescence microscopy can determine protein localizations, mass spectrometry can quantitate average metabolite concentrations, scRNA-seq [41][42] can quantitate the single-cell variation of each RNA [41], FISH [43] can quantitate the spatiotemporal dynamics and single-cell variation of the abundances of a few RNA, mass spectrometry can quantitate the average abundances of hundreds of proteins [44][45], mass cytometry can quantitate the single-cell variation of the abundances of tens of proteins [46], and fluorescence microscopy can quantitate the spatiotemporal dynamics and single-cell variation of the abundances of a few proteins. However, improved methods are still needed to measure the dynamics of the entire metabolome and proteome.

Table 1.1 Types of experimental data that can be used to build, calibrate, and validate WC models.
Data type	URL	Reference
Metabolites
Structure
Mass spectrometry	http://doi.org/10.1002/mas.20108	Dettmer et al., 2007
Concentration
Fluorescence microscopy	http://doi.org/10.1126/science.1243259	Zenobi, 2013
Mass spectrometry	http://doi.org/10.1002/mas.20108	Dettmer et al., 2007
Spectrophotometry	http://doi.org/10.1016/B978-0-12-416618-9.00005-4	TeSlaa and Teitell, 2014
DNA
Structure
DNA sequencing	http://doi.org/10.1038/nbt1486	Shendure and Ji, 2008
Methylation sequencing	http://doi.org/10.1038/nrg2732	Laird, 2010
Chromosome conformation capture	http://doi.org/10.1038/nrg3454	Dekker et al., 2013
Concentration
Flow cytometry	http://doi.org/10.1016/j.it.2012.02.010	Bendall et al., 2012
RNA
Structure
RNA sequencing	http://doi.org/10.1038/nrg2934	Ozsolak and Milos, 2011
Modification sequencing (ICE, MeRIP-Seq)	http://doi.org/10.1016/j.trsl.2014.04.003	Liu and Pan, 2015
X-ray crystallography	http://doi.org/10.1016/S0076-6879(09)69006-6	Reyes et al., 2009
Localization
Fluorescence in situ hybridization	http://doi.org/10.1126/science.1250212	Lee et al., 2014
Transcription rate
ChIP-seq	http://doi.org/10.1038/nrg2641	Park, 2009
GRO-seq	http://doi.org/10.1126/science.1162228	Core et al., 2008
Half-life
Microarray timecourse	http://doi.org/10.1101/gr.912603	Selinger et al., 2003
RNA sequencing timecourse	http://doi.org/10.1038/nature10098	Schwanhäusser et al., 2011
Concentration
Microarray	http://doi.org/10.1038/35087138	Schulze and Downward, 2001
RNA sequencing	http://doi.org/10.1038/nrg2934	Ozsolak and Milos, 2011
Fluorescence in situ hybridization	http://doi.org/10.1126/science.1188308	Taniguchi et al., 2010
Proteins
Structure
Mass spectrometry	http://doi.org/10.1126/science.1124619	Domon and Aebersold, 2006
Nuclear magnetic resonance spectroscopy	http://doi.org/10.1146/annurev.biochem.73.011303.074004	Tugarinov et al., 2004
RNA sequencing	http://doi.org/10.1038/nrg2934	Ozsolak and Milos, 2011
X-ray crystallography	http://doi.org/10.1007/978-1-60327-159-2_3	Ilari and Savino, 2008
Localization
Fluorescence microscopy	http://doi.org/10.1126/science.1124618	Giepmans et al., 2006
Translation rate
Ribosome profiling	http://doi.org/10.1038/nrg3645	Ingolia, 2014
Half-life
Fluorescence timecourse	http://doi.org/10.1098/rsob.140002	Knop and Edgar, 2014
Mass spectrometry timecourse	http://doi.org/10.1038/nature10098	Schwanhäusser et al., 2011
Concentration
Flow cytometry	http://doi.org/10.1016/j.it.2012.02.010	Bendall et al., 2012
Fluorescence microscopy	http://doi.org/10.1126/science.1124618	Giepmans et al., 2006
Mass cytometry	http://doi.org/10.1016/j.it.2012.02.010	Bendall et al., 2012
Mass spectrometry	http://doi.org/10.1126/science.1124619	Domon and Aebersold, 2006
Spectrophotometry	http://doi.org/10.1016/S0076-6879(09)63008-1	Noble and Bailey, 2009
Interactions
RNA-DNA
ChIRP-seq	http://doi.org/10.1016/j.molcel.2011.08.027	Chu et al., 2011
Protein-metabolite
Mass spectrometry	http://doi.org/10.1126/science.1124619	Domon and Aebersold, 2006
Protein-DNA
ChIP-seq	http://doi.org/10.1038/nrg2641	Park, 2009
DNase-seq	http://doi.org/10.1101/pdb.prot5384	Song and Crawford, 2010
Protein-RNA
CLIP-seq	http://doi.org/10.1002/wrna.31	Darnell, 2010
RIP-seq	http://doi.org/10.1016/j.molcel.2010.12.011	Zhao et al., 2010
Protein-protein
Co-immunoprecipitation	http://doi.org/10.1101/pdb.prot3898	Sambrook and Russell, 2006
Tandem affinity purification	http://doi.org/10.1016/j.pep.2010.04.009	Xu et al., 2010
Two-hybrid screen	http://doi.org/10.3390/ijms10062763	Brückner et al., 2009
Reaction fluxes
Isotopic labeling	http://doi.org/10.1002/wsbm.1167	Klein and Heinzle, 2012
Phenotypic data
Cell size
Fluorescence microscopy	http://doi.org/10.1146/annurev.cellbio.042308.113408	Muzzey and van Oudenaarden, 2009
Growth rates
Spectrophotometry	http://doi.org/10.1177/2211068214555414	Jensen et al., 2015
Division times
Fluorescence microscopy	http://doi.org/10.1002/cyto.a.20812	Wang et al., 2010
Motility, chemotaxis
Fluorescence microscopy	http://doi.org/10.1038/sj.emboj.7601227	Dormann and Weijer, 2006

1.4.1.2. Data repositories

Researchers are rapidly aggregating the experimental data needed for WC modeling into repositories (Table 1.2). This includes specialized repositories for individual types of data such as ECMDB [47] and YMDB [48] for metabolite concentrations; ArrayExpress [31] and the Gene Expression Omnibus [32] (GEO) for RNA abundances; PaxDb [33] for protein abundances; BiGG [49] for metabolic reactions; and SABIO-RK [34] for kinetic parameters, as well as general-purpose repositories such as FigShare [50], SimTk [51], and Zenodo [52].

Some researchers are making the data in these repositories more accessible by providing common interfaces to multiple repositories such as BioMart [53], BioServices [54], and InterMine [55].

Other researchers are making the data in these repositories more accessible by integrating the data into meta-databases. For example, KEGG contains a variety of information about metabolites, proteins, reactions, and pathways [56]; Pathway Commons contains extensive information about protein-protein interactions and pathways [57]; and UniProt contains a multitude of information about proteins [58].

In addition, some researchers are integrating information about individual organisms into PGDBs such as the BioCyc family of databases [59][60]. These databases contain a wide range of information including the stoichiometries of individual reactions, the compositions of individual protein complexes, and the genes regulated by individual transcription factors. Because PGDBs already contain integrated data about a single organism, PGDBs could readily be leveraged to build WC models. In fact, Latendresse developed MetaFlux to build constraint-based models of metabolism from EcoCyc [61].

Furthermore, meta-databases such as Nucleic Acids Research’s Database Summary [62] and re3data.org [63] contain lists of data repositories.

Most of these repositories have been developed by encouraging individual researchers to deposit their data or by employing curators to manually extract data from publications, supplementary files, and websites. In addition, researchers are beginning to use natural language processing to develop tools for automatically extracting data from publications [64].

Table 1.2 Repositories that contain experimental data which can be used to build, calibrate, and validate WC models.¶
Database	Content	URL	Reference
Species structures
Metabolites
ChEBI	Compound structures	https://www.ebi.ac.uk/chebi	Hastings et al., 2016
KEGG Compound	Compound structures	http://www.genome.jp/kegg/compound	Kanehisa et al., 2017
KEGG Glycan	Glycan structures	http://www.genome.jp/kegg/glycan	Hashimoto et al., 2006
Metabolomics Workbench Metabolite Database	Compound structures	http://www.metabolomicsworkbench.org	Sud et al., 2016
LIPID MAPS	Lipid structures	http://www.lipidmaps.org/data/structure	Sud et al., 2007
PubChem	Compound structures	https://pubchem.ncbi.nlm.nih.gov	Kim et al., 2016
DNA
ArrayExpress	Functional genomics data including Hi-C data	http://www.ebi.ac.uk/arrayexpress	Kolesnikov et al., 2015
GenBank	DNA sequences	https://www.ncbi.nlm.nih.gov/genbank	Benson et al., 2017
GEO	Functional genomics data including Hi-C data	https://www.ncbi.nlm.nih.gov/geo	Clough and Barrett, 2016
MethDB	Methylation sequencing data	http://www.methdb.net	Grunau et al., 2001
RNA
ArrayExpress	Functional genomics data including RNA-seq data that encompasses initiation and termination sites	http://www.ebi.ac.uk/arrayexpress	Kolesnikov et al., 2015
GEO	Functional genomics data including RNA-seq data that encompasses initiation and termination sites	https://www.ncbi.nlm.nih.gov/geo	Clough and Barrett, 2016
MODOMICS	Post-transcriptional modifications	http://modomics.genesilico.pl	Machnicka et al., 2013
RNA Modification Database	Post-transcriptional modifications	http://mods.rna.albany.edu	Cantara et al., 2011
Protein
3d-footprint	3-dimensional footprints	http://floresta.eead.csic.es/3dfootprint	Contreras-Moreira, 2010
dbPTM	Post-translational modifications	http://dbptm.mbc.nctu.edu.tw	Huang et al., 2016
PDB	3-dimensional structures	http://www.rcsb.org	Rose et al., 2017
RESID	Post-translational modifications	http://pir.georgetown.edu/resid	Garavelli, 2004
UniMod	Post-translational modifications	http://www.unimod.org	Creasy and Cottrell, 2004
UniProt	Functional protein annotations including post-translational modifications	http://www.uniprot.org	The UniProt Consortium, 2017
Localization and signal sequences
RNA
Fly-FISH	RNA localizations	http://fly-fish.ccbr.utoronto.ca	Wilk et al., 2006
RNALocate	RNA localizations	http://www.rna-society.org/rnalocate	Zhang et al., 2017
Protein
COMPARTMENTS	Protein localizations for Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Mus musculus, and Rattus norvegicus	http://compartments.jensenlab.org	Binder et al., 2014
Human Protein Reference Database	Protein localizations for Homo sapiens	http://www.hprd.org	Prasad et al., 2009
LOCATE	Protein localizations for Homo sapiens and Mus musculus	http://locate.imb.uq.edu.au	Sprenger et al., 2008
LocDB	Protein localizations for Arabidopsis thaliana and Homo sapiens	https://www.rostlab.org/services/locDB	Rastogi and Rost, 2011
LocSigDB	Protein localizations for eukaryotes	http://genome.unmc.edu/LocSigDB	Negi et al., 2015
OrganelleDB	Protein localizations	http://labs.mcdb.lsa.umich.edu/organelledb	Wiwatwattana et al., 2007
PSORTdb	Protein localizations for bacteria and archaea	http://db.psort.org	Peabody et al., 2016
UniProt	Functional protein annotations including protein localizations	http://www.uniprot.org	The UniProt Consortium, 2017
Concentrations
Metabolites
BioNumbers	Quantitative measurements of physical, chemical, and biological properties including metabolite concentrations	http://bionumbers.hms.harvard.edu	Milo et al., 2010
ECMDB	Metabolite concentrations in Escherichia coli	http://www.ecmdb.ca	Sajed et al., 2016
HMDB	Metabolite concentrations in Homo sapiens	http://www.hmdb.ca	Wishart et al., 2013
MetaboLights		https://www.ebi.ac.uk/metabolights	Kale et al., 2016
YMDB	Metabolite concentrations in Saccharomyces cerevisiae	http://www.ymdb.ca	Ramirez-Gaona et al., 2017
RNA
ArrayExpress	Functional genomics data including RNA abundances from microarray and RNA-seq experiments	http://www.ebi.ac.uk/arrayexpress	Kolesnikov et al., 2015
Expression Atlas	RNA abundances across organisms and environmental conditions	http://www.ebi.ac.uk/gxa	Petryszak et al., 2016
GEO	Functional genomics data including RNA abundances from microarray and RNA-seq experiments	https://www.ncbi.nlm.nih.gov/geo	Clough and Barrett, 2016
Proteins
Review		http://doi.org/10.1002/pmic.201400302	Perez-Riverol et al., 2015
Human Protein Atlas	Protein abundances for Homo sapiens	http://www.proteinatlas.org	Uhlén et al., 2015
PaxDb	Protein abundances	http://pax-db.org	Wang et al., 2015
Plasma Proteome Database	Protein abundances for Homo sapiens plasma	http://plasmaproteomedatabase.org	Nanjappa et al., 2014
PRIDE	Mass-spectrometry proteomics data	http://www.ebi.ac.uk/pride	Vizcaíno et al., 2016
Interactions
Protein-Metabolite, See also: Cofactors
Review		http://doi.org/10.1016/j.jchromb.2013.11.043	Matsuda et al., 2014
DrugBank	Drugs and their targets	https://www.drugbank.ca	Law et al., 2014
STITCH	Drugs and their targets	http://stitch.embl.de	Szklarczyk et al., 2016
SuperTarget	Drugs and their targets	http://insilico.charite.de/supertarget	Hecker et al., 2012
Therapeutic Targets Database	Drugs and their targets	http://bidd.nus.edu.sg/group/cjttd	Zhu et al., 2012
Protein-DNA
ArrayExpress	Functional genomics data including ChIP-seq data of protein-DNA interactions	http://www.ebi.ac.uk/arrayexpress	Kolesnikov et al., 2015
GEO	Functional genomics data including ChIP-seq data of protein-DNA interactions	https://www.ncbi.nlm.nih.gov/geo	Clough and Barrett, 2016
DBD	Predicted transcription factors	http://www.transcriptionfactor.org	Wilson et al., 2008
DBTBS	Bacillus subtilis transcription factors and the operons they regulate	http://dbtbs.hgc.jp	Sierro et al., 2008
ORegAnno	Transcription factor binding sites	http://www.oreganno.org	Lesurf et al., 2016
TRANSFAC	Transcription factor binding motifs	http://genexplain.com/transfac	Matys et al., 2003
UniProbe	Transcription factor binding motifs	http://thebrain.bwh.harvard.edu/uniprobe	Hume et al., 2015
Protein-Protein
Review		http://doi.org/10.1186/1479-7364-3-3-291	Lehne et al., 2009
ConsensusPathDB	Homo sapiens molecular interactions including protein-protein interactions	http://cpdb.molgen.mpg.de	Kamburov et al., 2013
BioGRID	Protein-protein interactions	https://thebiogrid.org	Chatr-aryamontri et al., 2017
CORUM	Protein complex composition	http://mips.helmholtz-muenchen.de/corum/
DIP	Protein-protein interactions	http://dip.doe-mbi.ucla.edu	Salwinski et al., 2004
IntAct	Molecular interactions including protein-protein interactions	http://www.ebi.ac.uk/intact	Kerrien et al., 2012
STRING	Protein-protein interactions	https://string-db.org	Szklarczyk et al., 2017
UniProt	Functional protein annotations including protein complex compositions	http://www.uniprot.org	The UniProt Consortium, 2017
Reactions
Stoichiometries, catalysis
BioCyc	Reaction stoichiometries and catalysts	https://biocyc.org	Caspi et al., 2016
KEGG	Reaction stoichiometries and catalysts	http://www.genome.jp/kegg	Kanehisa et al., 2017
MACiE	Detailed reaction mechanisms	http://www.ebi.ac.uk/thornton-srv/databases/MACiE	Holliday et al., 2012
Rhea	Reaction stoichiometries	http://www.rhea-db.org	Morgat et al., 2017
UniProt	Reaction stoichiometries and catalysts	http://www.uniprot.org	The UniProt Consortium, 2017
Cofactors
CoFactor	Organic enzyme cofactors	http://www.ebi.ac.uk/thornton-srv/databases/CoFactor	Fischer et al., 2010
PDB	3-dimensional protein structures including cofactors	http://www.rcsb.org	Rose et al., 2017
UniProt	Functional protein annotations including cofactors	http://www.uniprot.org	The UniProt Consortium, 2017
Rate laws and rate constants
BioNumbers	Quantitative measurements of physical, chemical, and biological properties including kinetic parameters	http://bionumbers.hms.harvard.edu	Milo et al., 2010
BRENDA	Kinetic parameters and rate laws	http://www.brenda-enzymes.org	Schomburg et al., 2017
SABIO-RK	Kinetic parameters and rate laws	http://sabio.h-its.org	Wittig et al., 2012
Pathways
Metabolic
Review		https://doi.org/10.1007/s00204-011-0705-2	Karp and Caspi, 2011
BioCyc	Species-specific pathways	https://biocyc.org	Caspi et al., 2016
KEGG PATHWAY	Species-specific pathways	http://www.genome.jp/kegg/pathway.html	Kanehisa et al., 2017
Signaling
Review		https://doi.org/10.1093/database/bau126	Chowdhury and Sarkar, 2015
hiPathDB	Metadatabase of Homo sapiens signaling pathways	http://hipathdb.kobic.re.kr	Yu et al., 2012
KEGG PATHWAY	Pathways including signaling pathways	http://www.genome.jp/kegg/pathway.html	Kanehisa et al., 2017
NetPath	Immune signaling pathways	http://www.netpath.org	Kandasamy et al., 2010
PANTHER Pathway	Pathways including signaling pathways	http://www.pantherdb.org/pathway	Mi et al., 2017
Pathway Commons	Metadatabase of signaling pathways	http://www.pathwaycommons.org	Cerami et al., 2011
Reactome	Pathways including signaling pathways	http://www.reactome.org	Fabregat et al., 2016
WikiPathways	Community curated pathways including signaling pathways	http://www.wikipathways.org	Kutmon et al., 2016
Meta-databases and meta-database tools
Review		http://doi.org/10.1002/minf.201600035	Urdidiales-Nieto et al., 2017
BioCatalogue	List of web services	https://www.biocatalogue.org	Bhagat et al., 2010
BioMart	Tools for integrating data from multiple repositories	http://www.biomart.org	Kasprzyk, 2010
BioMoby	Ontology-based messaging system for discovering data	http://biomoby.open-bio.org	BioMoby Consortium et al., 2008
BIOSERVICES	Python APIs to several popular repositories	https://pythonhosted.org/bioservices	Cokelaer et al., 2013
BioSWR	List of web services	http://inb.bsc.es/BioSWR	Repchevsky and Gelpi, 2014
ELIXIR	Effort to develop a common data infrastructure for Europe	https://www.elixir-europe.org	Crosswell and Thornton, 2012
NAR Database Summary	List of database papers published in Nucleic Acids Research database issues	http://www.oxfordjournals.org/nar/database/c	Galperin et al., 2017
re3data.org Registry	List of data repositories	http://www.re3data.org	Pampel et al., 2013

1.4.1.3. Prediction tools¶

Accurate prediction tools can be a useful alternative to constraining models with direct experimental evidence. Currently, many tools can predict molecular properties such as the organization of genes into operons, RNA folds, and protein localizations (Table 1.3). For example, PSORTb can predict the localization of bacterial proteins [65] and TargetScan can predict the mRNA targets of small non-coding RNAs [66]. In particular, these tools can be used to impute missing data and extrapolate observations to other organisms, genetic conditions, and environmental conditions. However, many current prediction tools are not sufficiently accurate for WC modeling.

Table 1.3 Computational prediction tools that can generate data which can be used to build, calibrate, and validate WC models.¶
Tool	Prediction(s)	Language	URL	Reference
Metabolites
Physical properties
Review	Survey of several chemoinformatic packages		http://doi.org/10.1186/1758-2946-3-37	O’Boyle et al., 2011
Chemistry Development Kit (CDK)	Java libraries for processing chemical information	Java	https://cdk.github.io	Steinbeck et al., 2006
Cinfony	A common API to several cheminformatics toolkits	Python	http://cinfony.github.io	O’Boyle and Hutchison, 2008
Indigo	A toolkit for molecular fingerprinting, substructure searching, and visualization	C++, Java, .Net, Python	http://lifescience.opensource.epam.com/indigo
JChem	Tools for drawing and visualizing molecules and searching chemical databases	Java, .Net, REST	https://www.chemaxon.com/download/jchem-suite	Csizmadia, 2000
Open Babel	Tools for searching, converting, analyzing, and storing chemical structures	C++, Java, .Net, Python	http://openbabel.org	O’Boyle et al., 2011
RDKit	Cheminformatics toolkit	C++, Python	http://www.rdkit.org
Thermodynamics
UManSysProp	Estimates the standard Gibbs free energy of formation of organic molecules using the Joback group contribution method	Python, REST	http://umansysprop.seaes.manchester.ac.uk	Joback and Reid, 1987; Topping et al., 2016
Web GCM	Estimates the standard Gibbs free energy of formation of organic molecules using the Mavrovouniotis group contribution method	REST	http://doi.org/10.1529/biophysj.107.124784	Jankowski et al., 2008
DNA
Promoters
Review	Review of promoter prediction methods for Homo sapiens		http://doi.org/10.1093/bioinformatics/17.suppl_1.S90	Pedersen et al., 1999
PePPER	Predicts prokaryote promoters	REST	http://pepper.molgenrug.nl/index.php/prokaryote-promoters	de Jong et al., 2012
Promoter	Predicts vertebrate PolII promoters	REST	http://www.cbs.dtu.dk/services/Promoter	Knudsen, 1999
PromoterHunter	Predicts prokaryote promoters	REST	http://www.phisite.org/promoterhunter	Klucar et al., 2010
Genes
Review	Review of several gene prediction software tools		https://cmgm.stanford.edu/biochem218/Projects%202007/Mcelwain.pdf	McElwain, 2007
GeneMark	Family of tools for predicting viral, prokaryotic, archaeal, and eukaryotic genes	Linux executable, REST	http://exon.gatech.edu/GeneMark	Borodovsky and Lomsadze, 2011
GENSCAN	Predicts plant and vertebrate genes	Linux executable, REST	http://genes.mit.edu/GENSCAN.html	Burge and Karlin, 1997
GLIMMER	Predicts viral, prokaryotic, and archaeal genes	C, REST	https://ccb.jhu.edu/software/glimmer	Salzberg et al., 1998
Operons
Review	Survey of several operon prediction methods		https://doi.org/10.1093/bib/bbn019	Brouwer et al., 2008
DOOR	Predicts prokaryotic operons	REST	http://csbl.bmb.uga.edu/DOOR	Mao et al., 2014
OperonDB	Estimates the likelihood that pairs of genes are in the same operon	Perl, REST	http://operondb.cbcb.umd.edu/cgi-bin/operondb/operons.cgi	Ermolaeva et al., 2001
ProOpDB	Predicts prokaryotic operons	Java, REST	http://operons.ibt.unam.mx/OperonPredictor	Taboada et al., 2010
VIMSS	Predicts prokaryotic and archaeal operons	R, REST	http://www.microbesonline.org/operons	Price et al., 2005
Variant interpretation
PolyPhen-2	Predicts the functional effects of amino acid substitutions	C, REST	http://genetics.bwh.harvard.edu/pph2	Adzhubei et al., 2013
PROVEAN	Predicts the functional effects of amino acid substitutions and indels	C++, REST	http://provean.jcvi.org	Choi and Chan, 2015
SIFT	Predicts the functional effects of amino acid indels	C++, REST	http://sift.bii.a-star.edu.sg	Hu and Ng, 2013
RNA
Splice sites
Review	Review of methods for predicting splice sites		http://www.umd.be/HSF/Desmet_2010.pdf	Desmet et al., 2010
GeneSplicer	Predicts eukaryotic splice sites	Java	https://ccb.jhu.edu/software/genesplicer	Pertea et al., 2001
Human Splicing Finder	Identify and predict mutations’ effect on human splicing motifs	REST	http://www.umd.be/HSF3/	Desmet et al., 2009
NetGene2	Predicts splice sites in Arabidopsis thaliana, Caenorhabditis elegans, and Homo sapiens	REST	http://www.cbs.dtu.dk/services/NetGene2	Hebsgaard et al., 1996
NNSplice	Predicts splice sites in Drosophila melanogaster and Homo sapiens	REST	http://www.fruitfly.org/seq_tools/splice.html	Reese et al., 1997
Secondary structure
Review	Review of methods for predicting RNA secondary structures		http://doi.org/10.1016/j.ymeth.2016.04.004	Lorenz et al., 2016
Mfold	Predicts RNA secondary structures	C, REST	http://unafold.rna.albany.edu/?q=mfold	Zuker, 2003
RNAstructure	Predicts RNA and DNA secondary structures	C++, Java	http://rna.urmc.rochester.edu/RNAstructure.html	Reuter and Mathews, 2010
ViennaRNA	Predicts RNA secondary structures	C, Perl, Python	https://www.tbi.univie.ac.at/RNA	Lorenz et al., 2011
Open reading frame
ORF Finder	Predicts open reading frames	Linux executable, REST	https://www.ncbi.nlm.nih.gov/orffinder	Rombel et al., 2002
ORF Investigator	Predicts open reading frames	Windows executable	https://sites.google.com/site/dwivediplanet/ORF-Investigator	Dhar and Kumar, 2012
ORFPredictor	Predicts open reading frames from EST and cDNA sequences	Perl, REST	http://bioinformatics.ysu.edu/tools/OrfPredictor.html	Min et al., 2005
Terminators
Review	Review of prokaryotic transcription termination that cites several methods for predicting terminators.		http://doi.org/10.1016/j.jmb.2011.03.036	Peters et al., 2011
ARNold	Predicts prokaryotic rho-independent terminators	REST	http://rna.igmors.u-psud.fr/toolbox/arnold	Gautheret and Lambert, 2001
FindTerm	Predicts prokaryotic rho-independent terminators	REST	http://www.softberry.com/berry.phtml?topic=findterm&group=programs&subgroup=gfindb	Solovyev and Salamov, 2011
GeSTer	Predicts prokaryotic rho-independent terminators	REST	http://pallab.serc.iisc.ernet.in/gester	Mitra et al., 2011
TransTermHP	Predicts prokaryotic rho-independent terminators	C++	http://transterm.cbcb.umd.edu	Kingsford et al., 2007
Proteins
Localization
Review	Review of methods for predicting the subcellular localization of prokaryotic and eukaryotic proteins		http://doi.org/10.1002/pmic.201000274	Imai and Nakai, 2010
Review	Review of methods for predicting the subcellular localization of prokaryotic proteins		http://doi.org/10.1038/nrmicro1494	Gardy and Brinkman, 2006
Cell-PLoc	Predicts the subcellular localization of proteins for multiple species	REST	http://www.csbio.sjtu.edu.cn/bioinf/Cell-PLoc-2	Chou and Shen, 2010
MultiLoc	Predicts the subcellular localization of proteins for multiple species	Python, REST	http://abi.inf.uni-tuebingen.de/Services/MultiLoc2	Blum et al., 2009
PSORTb	Predicts the subcellular localization of prokaryotic and archaeal proteins	C++, Perl, REST	http://www.psort.org/psortb	Yu et al., 2010
SecretomeP	Predicts signal peptide-independent protein secretion	REST, tcsh	http://www.cbs.dtu.dk/services/SecretomeP	Bendtsen et al., 2004
WoLF PSORT	Predicts the subcellular localization of eukaryotic proteins	Perl, REST	https://wolfpsort.hgc.jp	Horton et al., 2007
Signal sequence
Review	Architecture, function and prediction of long signal peptides		http://doi.org/10.1093/bib/bbp030	Hiss and Schneider, 2009
Phobius	Predict protein transmembrane topology and signal peptides from AA sequences	Java, REST	http://phobius.sbc.su.se	Käll et al., 2007
PRED-LIPO	Predict lipoprotein and secretory signal peptides in gram-positive bacteria	REST	http://bioinformatics.biol.uoa.gr/PRED-LIPO	Bagos et al., 2008
PRED-SIGNAL	Predict signal peptides in archaea	REST	http://bioinformatics.biol.uoa.gr/PRED-SIGNAL	Bagos et al., 2009
SignalP	Predict signal peptide cleavage sites in prokaryotic and eukaryotic proteins	Perl, REST	http://www.cbs.dtu.dk/services/SignalP	Petersen et al., 2011
Disulfide bonds
Review	Review of methods predicting disulfide bonds		http://doi.org/10.2174/138920307780831848	Tsai et al., 2007
Review	Review of methods predicting disulfide bonds		http://doi.org/10.4137/EBO.S25349	Márquez-Chamorro and Aguilar-Ruiz, 2015
Cyscon	A consensus model for predicting disulfide bonds	REST	http://www.csbio.sjtu.edu.cn/bioinf/Cyscon	Yang et al., 2015
DIANNA	Predicts disulfide bonds	Python, REST	http://clavius.bc.edu/~clotelab/DiANNA	Ferrè and Clote, 2006
Dinosolve	Predicts disulfide bonds	REST	http://hpcr.cs.odu.edu/dinosolve	Yaseen and Li, 2013
DIPro	Predicts disulfide bonds	REST, Perl	http://scratch.proteomics.ics.uci.edu	Cheng et al., 2006
DISULFIND	Predicts disulfide bonds	REST	http://disulfind.dsi.unifi.it	Ceroni et al., 2006
Complex abundance
SiComPre	Predicts the abundances of Homo sapiens and Saccharomyces cerevisiae protein complexes	C++, Java, Python	http://www.cosbi.eu/research/prototypes/sicompre	Rizzetto et al., 2015
Half-lives
N-End rule	Predicts the half-lives of Escherichia coli, Saccharomyces cerevisiae and mammalian (rabbit) proteins	REST	http://web.expasy.org/protparam	Bachmair et al., 1986
Interactions
miRNA targets
Review	Review of methods for predicting miRNA targets		http://doi.org/10.1016/j.cell.2009.01.002	Bartel, 2009
Review	Review of methods for predicting miRNA targets		https://doi.org/10.3389/fgene.2014.00023	Peterson et al., 2014
DIANA-microT-CDS	Predicts miRNA targets in Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, and Mus musculus	REST	http://www.microrna.gr/microT-CDS	Reczko et al., 2012
miRSearch	Predicts miRNA targets in Homo sapiens, Mus musculus, and Rattus norvegicus	REST	https://www.exiqon.com/miRSearch
MirTarget	Predicts miRNA targets in several animals	REST	http://mirdb.org	Wang, 2016
PITA	Predicts miRNA targets in Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, and Mus musculus	C, Perl, REST	https://genie.weizmann.ac.il/pubs/mir07	Kertesz et al., 2007
STarMir	Predicts miRNA targets in Caenorhabditis elegans, Homo sapiens, and Mus musculus	Perl, R, REST	http://sfold.wadsworth.org/cgi-bin/starmirtest2.pl	Liu et al., 2013
TargetScan	Predicts miRNA targets in several animals	Perl, REST	http://www.targetscan.org	Agarwal et al., 2015
Protein-DNA binding sites
Review	Review of tools for predicting transcription factor binding sites		http://doi.org/10.1038/nbt1053	Tompa et al., 2005
Review	Review of tools for predicting transcription factor binding sites		http://doi.org/10.1186/s12859-016-1298-9	Jayaram et al., 2016
DBD	Predicts DNA-binding domains of transcription factors	REST	http://www.transcriptionfactor.org	Wilson et al., 2008
JASPAR	Predicts transcription factor binding motifs	Perl, Python, R, REST, Ruby	http://jaspar.genereg.net	Mathelier et al., 2016
Weeder	Predicts likely transcription factor binding motifs	C++, REST	http://doi.org/10.1093/nar/gkh465	Pavesi et al., 2004
Chaperones
BiPPred	Predicts the interactions of mammalian proteins with chaperone BiP	REST	https://www.bioinformatics.wzw.tum.de/bippred	Schneider et al., 2016
cleverSuite	Predicts the interactions of Escherichia coli proteins with chaperone DnaK/GroEL	REST	http://s.tartaglialab.com/clever_suite	Klus et al., 2014
LIMBO	Predicts the interactions of Escherichia coli proteins with chaperone DnaK	REST	http://limbo.switchlab.org/limbo-analysis	Van Durme et al., 2009
Reaction center and atom mapping
Review	Review of methods for reaction mapping and reaction center detection		http://doi.org/10.1002/wcms.1140	Chen et al., 2013
CAM	Predicts the mapping of reactant to product atoms	C++	http://www.bioinf.uni-freiburg.de/Software/CAM	Mann et al., 2014
CLCA	Predicts the mapping of reactant to product atoms	REST	http://www.maranasgroup.com/metrxn	Kumar and Maranas, 2014
MWED	Predicts the mapping of reactant to product atoms	Lisp	http://doi.org/10.1021/ci3002217	Latendresse et al., 2012
ReactionDecoder	Predicts the mapping of reactant to product atoms	Java	https://github.com/asad/ReactionDecoder	Rahman et al., 2016
ReactionMap	Predicts the mapping of reactant to product atoms	REST	http://cdb.ics.uci.edu/cgibin/reactionmap/ReactionMapWeb.py	Fooshee et al., 2013

1.4.2. Modeling and simulation tools¶

Here, we review several advances in modeling and simulation technology that we believe are beginning to enable researchers to aggregate and organize the data needed for WC modeling, and to design, describe, simulate, calibrate, verify, and analyze WC models.

1.4.2.1. Data aggregation and organization tools¶

To make the large amount of publicly available data usable for modeling, researchers are developing tools such as BioServices [54] for programmatically accessing repositories, and are using PGDBs to organize the data needed for modeling. PGDBs are well-suited to organizing the data needed for WC models because they support structured representations of metabolites, DNA, RNA, proteins, and their interactions. However, traditional PGDBs provide limited support for non-metabolic pathways and quantitative data. Consequently, we are developing WholeCellKB, a PGDB specifically designed for WC modeling [30].

1.4.2.2. Model design tools¶

Several software tools have been developed for designing models of individual cellular pathways (Table 1.4). These include BioUML [67], CellDesigner [68], COPASI [69], JDesigner [70], and Virtual Cell [71], which support dynamical modeling; RuleBender [72], which supports rule-based modeling; and COBRApy [73], FAME [74], and RAVEN [75], which support constraint-based metabolic modeling.

Recently, researchers have developed several tools that support some of the features needed for WC modeling. These include SEEK, which helps researchers design models from data tables [76]; Virtual Cell, which helps researchers design models from KEGG pathways [71][56]; MetaFlux, which helps researchers design metabolic models from PGDBs [61]; the Cell Collective [77] and JWS Online [78], which help researchers build models collaboratively; PySB, which helps researchers design models programmatically [79]; and semanticSBML [80] and SemGen [81], which help researchers merge models.

Table 1.4 Software tools that can be used to help build, calibrate, validate, simulate, visualize, and analyze WC models.¶
Tool	URL	Reference
Data aggregation tools
BioCatalogue	https://www.biocatalogue.org	Bhagat et al., 2010
BIOSERVICES	https://pythonhosted.org/bioservices	Cokelaer et al., 2013
Data organization tools
GMOD	http://gmod.org	Papanicolaou and Heckel, 2010
Pathway Tools	http://brg.ai.sri.com/ptools	Karp et al., 2016
WholeCellKB	http://www.wholecellkb.org	Karr et al., 2013
Model design tools
CellDesigner	http://www.celldesigner.org	Matsuoka et al., 2014
COPASI	http://copasi.org	Mendes et al., 2009
JWS Online	http://jjj.biochem.sun.ac.za	Olivier and Snoep, 2004
MetaFlux	http://brg.ai.sri.com/ptools	Latendresse et al., 2012
PhysioDesigner	http://www.physiodesigner.org	Asai et al., 2012
RAVEN	http://biomet-toolbox.org/index.php?page=downtools-raven	Agren et al., 2013
RuleBender	http://bionetgen.org/index.php/Quick_Start	Smith et al., 2012
VirtualCell	http://vcell.org	Schaff et al., 2016
Model testing and verification tools
biolab	http://www.lehman.edu/academics/cmacs/bio-lab.php	Clarke et al., 2008
MEMOTE	https://memote.readthedocs.io
SBML-to-PRISM	http://www.prismmodelchecker.org/sbml
Model description languages
BioNetGen	http://bionetgen.org	Harris et al., 2016
BioPAX	http://www.biopax.org	Demir et al., 2010
CellML	https://www.cellml.org	Cuellar et al., 2015
kappa	http://dev.executableknowledge.org	Wilson-Kanamori et al., 2015
ML-Rules	http://jamesii.informatik.uni-rostock.de/jamesii.org/	Maus et al., 2011
PySB	http://pysb.org/	Lopez et al., 2013
SBML	http://sbml.org	Hucka et al., 2015
Simulation description languages
SED-ML	http://sed-ml.org	Waltemath et al., 2011
SESSL	http://sessl.org	Ewald and Uhrmacher, 2014
Simulators
cobrapy	http://opencobra.github.io/cobrapy	Ebrahim et al., 2013
COPASI	http://copasi.org	Mendes et al., 2009
ECell	http://www.e-cell.org	Takahashi et al., 2003
Lattice Microbes	http://www.scs.illinois.edu/schulten/lm	Hallock et al., 2014
libRoadRunner	http://libroadrunner.org	Somogyi et al., 2015
NFsim	http://michaelsneddon.net/nfsim	Sneddon et al., 2011
VirtualCell	http://vcell.org	Schaff et al., 2016
Simulation result formats
HDF5	https://support.hdfgroup.org/HDF5	Folk et al., 2011
NuML	https://github.com/numl/numl	Dada et al., 2017
SBRML	http://www.comp-sys-bio.org/SBRML.html	Dada et al., 2010
Simulation result databases
Bookshelf	http://sbcb.bioch.ox.ac.uk/bookshelf	Vohra et al., 2010
Dynameomics	http://www.dynameomics.org	van der Kamp et al., 2010
SEEK	https://fair-dom.org/platform/seek	Wolstencroft et al., 2011
WholeCellSimDB	http://www.wholecellsimdb.org	Karr et al., 2014
Visualization tools
Vega	https://vega.github.io	Satyanarayan et al., 2017
The Visualization Toolkit (VTK)	http://www.vtk.org	Hanwell et al., 2015
WholeCellViz	http://www.wholecellviz.org	Lee et al., 2013
Workflow management tools
Galaxy	https://usegalaxy.org	Walker et al., 2016
Taverna	http://www.taverna.org.uk	Wolstencroft et al., 2013
VisTrails	https://www.vistrails.org	Freire and Silva, 2012

However, none of these tools is well-suited to WC modeling because none supports all of the features that WC modeling requires, including programmatically designing models from large data sources such as PGDBs; collaboratively designing models over a web-based interface; designing composite, multi-algorithmic models; representing models in terms of rule patterns; and recording the data sources and assumptions used to build models.

1.4.2.3. Model selection tools¶

Several methods have also been developed to help researchers select among multiple potential models, including likelihood-based, Bayesian, and heuristic methods [82]. ABC-SysBio [83][84], ModelMage [85], and SYSBIONS [86] are some of the most advanced model selection tools. However, these tools only support deterministic dynamical models.
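As a rough illustration of likelihood-based model selection, the sketch below compares two candidate models of the same data with the Akaike information criterion (AIC). The AIC, the toy data, the Gaussian error assumption, and all function names are our illustrative assumptions; they are not taken from ABC-SysBio, ModelMage, or SYSBIONS.

```python
import math

def fit_constant(xs, ys):
    """Residuals of the best-fit constant model y = c."""
    mean = sum(ys) / len(ys)
    return [y - mean for y in ys]

def fit_line(xs, ys):
    """Residuals of the ordinary least-squares line y = a + b * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

def gaussian_log_likelihood(residuals):
    """Maximized log-likelihood assuming i.i.d. Gaussian errors
    (variance set to its maximum-likelihood estimate)."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2.0 * math.pi * sigma2) + 1.0)

def aic(residuals, n_params):
    """AIC = 2k - 2 ln(L); lower values indicate a preferred model."""
    return 2.0 * n_params - 2.0 * gaussian_log_likelihood(residuals)

# Hypothetical data: a linear trend plus small fixed perturbations.
xs = [float(i) for i in range(10)]
noise = [0.3, -0.2, 0.1, -0.4, 0.25, -0.1, 0.35, -0.3, 0.05, -0.05]
ys = [1.0 + 2.0 * x + e for x, e in zip(xs, noise)]

aic_constant = aic(fit_constant(xs, ys), n_params=2)  # mean + error variance
aic_line = aic(fit_line(xs, ys), n_params=3)          # slope + intercept + error variance
print("constant:", aic_constant, "line:", aic_line)
```

The criterion rewards goodness of fit and penalizes each additional parameter, so the candidate with the lower AIC is preferred.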

1.4.2.5. Model formats¶

Several formats have been developed to represent cell models including formats such as CellML [100] that represent models as collections of variables and equations, formats such as SBML [101] that represent models as collections of species and reactions, and more abstract formats such as BioNetGen [102], Kappa [103], and ML-Rules [104] that represent models as collections of species and rule patterns.

The Systems Biology Markup Language (SBML) was developed in 2002 to represent dynamical models that can be simulated by integrating ordinary differential equations or using the stochastic simulation algorithm, as well as the semantic biological meaning of models. Recently, SBML has been extended to support a wide range of models through the development of several new packages. The flux balance constraints package supports constraint-based models, the qualitative models package supports logical models, the spatial processes package supports spatial models that can be simulated by integrating PDEs, the multistate multicomponent species package supports rule-based model descriptions, and the hierarchical model composition package supports composite models. SBML is by far the most widely supported and commonly used format for representing cell models. For example, SBML is supported by COPASI [69], the most commonly used cell modeling software program, and by BioModels [105], the most commonly used cell model repository. However, SBML has several limitations: its model descriptions are verbose, the multistate multicomponent species package only supports a few types of combinatorial complexity, SBML does not directly support multi-algorithmic models, and SBML cannot represent model provenance, including the data sources and assumptions used to build models [106].
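To make the species-and-reactions representation concrete, the sketch below parses an abbreviated SBML-style document with Python's standard XML library and extracts the net stoichiometry of each reaction. The toy model, its identifiers, and the `read_stoichiometry` helper are hypothetical, and the fragment is deliberately incomplete (for example, it omits compartment definitions and rate laws), so it has not been validated against the SBML specification.

```python
import xml.etree.ElementTree as ET

SBML_NS = {"sbml": "http://www.sbml.org/sbml/level3/version1/core"}

# Abbreviated, illustrative SBML-style document (not a validated SBML file).
DOCUMENT = """<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core" level="3" version="1">
  <model id="toy_model">
    <listOfSpecies>
      <species id="A" compartment="c"/>
      <species id="B" compartment="c"/>
      <species id="AB" compartment="c"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="binding" reversible="false">
        <listOfReactants>
          <speciesReference species="A" stoichiometry="1"/>
          <speciesReference species="B" stoichiometry="1"/>
        </listOfReactants>
        <listOfProducts>
          <speciesReference species="AB" stoichiometry="1"/>
        </listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>"""

def read_stoichiometry(document):
    """Return (species ids, {reaction id: {species id: net coefficient}})."""
    root = ET.fromstring(document)
    species = [s.get("id")
               for s in root.findall(".//sbml:listOfSpecies/sbml:species", SBML_NS)]
    stoichiometry = {}
    for reaction in root.findall(".//sbml:listOfReactions/sbml:reaction", SBML_NS):
        coefficients = {}
        # Reactants are consumed (negative sign); products are produced (positive sign).
        for list_name, sign in (("listOfReactants", -1.0), ("listOfProducts", 1.0)):
            path = "sbml:" + list_name + "/sbml:speciesReference"
            for ref in reaction.findall(path, SBML_NS):
                name = ref.get("species")
                amount = float(ref.get("stoichiometry", "1"))
                coefficients[name] = coefficients.get(name, 0.0) + sign * amount
        stoichiometry[reaction.get("id")] = coefficients
    return species, stoichiometry

species, stoichiometry = read_stoichiometry(DOCUMENT)
print(species)
print(stoichiometry)
```

Even this one-reaction model needs roughly twenty lines of XML, which illustrates the verbosity noted above.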

More recently, Faeder and others have developed BioNetGen [102] and other rule-based formats to efficiently describe the combinatorial complexity of protein-protein interactions. These formats enable researchers to describe models in terms of species and reaction patterns which can be evaluated to generate all of the individual species and reactions in a model. This abstraction helps researchers describe reactions directly in terms of their chemistry, describe large models concisely, and avoid errors in enumerating species and reactions. Models that are described in rule-based formats such as BioNetGen can be simulated either by enumerating all of the possible species and reactions and then simulating the expanded model via conventional deterministic or stochastic dynamical simulation methods, or via network-free simulation, which iteratively discovers individual species and reactions during simulation [107]. BioNetGen is the most commonly used rule-based modeling format and NFsim is the most commonly used network-free simulator. However, BioNetGen only supports a few types of combinatorial complexity, does not support composite or multi-algorithmic models, cannot represent the semantic biological meaning of models, and cannot represent model provenance.
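The following sketch illustrates how a single rule pattern can expand into many species and reactions. It assumes a hypothetical protein with n independent phosphorylation sites and one rule, "any unphosphorylated site can be phosphorylated"; the function name and the tuple encoding of states are ours, and the code does not use BioNetGen syntax.

```python
from itertools import product

def expand_phosphorylation_rule(n_sites):
    """Enumerate every species and reaction implied by one rule:
    'any unphosphorylated site (0) can become phosphorylated (1)'."""
    # Each species is one combination of site states, e.g. (0, 1, 0).
    species = list(product((0, 1), repeat=n_sites))
    reactions = []
    for state in species:
        for site, value in enumerate(state):
            if value == 0:  # the rule's pattern matches this site
                new_state = state[:site] + (1,) + state[site + 1:]
                reactions.append((state, new_state))
    return species, reactions

for n in (1, 3, 10):
    species, reactions = expand_phosphorylation_rule(n)
    print(n, "sites:", len(species), "species,", len(reactions), "reactions")
```

One rule generates 2^n species and n * 2^(n-1) reactions, which is why enumerating the expanded network by hand is error-prone and becomes infeasible as n grows.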

1.4.2.6. Simulation algorithms¶

Several algorithms have been developed to simulate cells with a wide range of granularity including algorithms for integrating systems of ODEs and PDEs, stochastic simulation algorithms, algorithms for simulating logical networks and Petri nets, and hybrid algorithms for co-simulating models that are composed of mathematically-dissimilar submodels.

The most commonly used algorithms to simulate cell models include algorithms for integrating systems of ODEs. These algorithms are best suited to simulating well-characterized and well-mixed systems that involve large concentrations that are robust to stochastic fluctuations. These algorithms are poorly suited to simulating stochastic processes that involve small concentrations, as well as poorly characterized pathways with little kinetic data. Consequently, ODE integration algorithms are poorly suited for WC modeling.
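To make the deterministic approach concrete, here is a minimal sketch that integrates a single first-order reaction, A → B, with a hand-written classical fourth-order Runge-Kutta step. The rate constant, step size, and initial concentrations are arbitrary illustrative values, and production tools would typically use adaptive, stiff-capable integrators rather than this fixed-step scheme.

```python
import math


def rk4_step(f, t, y, h):
    """Advance y' = f(t, y) by one classical fourth-order Runge-Kutta step."""
    k1 = f(t, y)
    k2 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + h, [yi + h * ki for yi, ki in zip(y, k3)])
    return [yi + h / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]


K = 0.5  # hypothetical first-order rate constant (1/s)


def rates(t, y):
    """Mass-action kinetics for A -> B; y holds the concentrations [A, B]."""
    a, _ = y
    return [-K * a, K * a]


y, t, h = [1.0, 0.0], 0.0, 0.01
for _ in range(1000):  # integrate from t = 0 s to t = 10 s
    y = rk4_step(rates, t, y, h)
    t += h

# Compare the numerical result with the analytical solution A(t) = exp(-K t)
print(y[0], math.exp(-K * 10.0))
```

Because the state is a continuous concentration, this kind of simulation cannot capture the random fluctuations that matter when only a few molecules are present.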

Stochastic simulation algorithms such as the Stochastic Simulation Algorithm (SSA) or Gillespie’s Algorithm [108], newer, more efficient implementations of SSA such as the Gibson-Bruck method and RSSA-CR [109], and approximations of SSA such as tau leaping, are commonly used to simulate pathways that involve small concentrations that are susceptible to stochastic variation. However, these algorithms are only suitable for dynamical models which require substantial kinetic data, they are computationally expensive, especially for models that include reactions that have high fluxes, and they are limited to models with small state spaces. Consequently, stochastic simulation algorithms are poorly suited for simulating WC models.
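The following is a minimal sketch of the direct-method SSA for a hypothetical birth-death process (constant synthesis of X and first-order degradation of X). The function name, the rate constants, and the fixed random seed are illustrative assumptions; the sketch also shows why cost grows with flux, since every individual reaction event is simulated.

```python
import random


def gillespie_birth_death(k_syn, k_deg, x0, t_end, seed=0):
    """Direct-method SSA for synthesis (0 -> X) and degradation (X -> 0)."""
    rng = random.Random(seed)
    t, x = 0.0, x0
    times, counts = [t], [x]
    while True:
        a_syn = k_syn          # propensity of synthesis
        a_deg = k_deg * x      # propensity of degradation
        a_total = a_syn + a_deg
        if a_total == 0:
            break              # no reaction can fire
        t += rng.expovariate(a_total)  # exponentially distributed waiting time
        if t > t_end:
            break
        # Choose which reaction fires, in proportion to its propensity
        if rng.random() * a_total < a_syn:
            x += 1
        else:
            x -= 1
        times.append(t)
        counts.append(x)
    return times, counts


times, counts = gillespie_birth_death(k_syn=10.0, k_deg=1.0, x0=0, t_end=50.0)
print(len(times), counts[-1])
```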

Network-free simulation algorithms are stochastic simulation algorithms for efficiently simulating rule-based models without enumerating every possible species and reaction prior to simulation and instead discovering the active species and reactions during simulation. Unlike traditional stochastic simulation algorithms, network-free simulation algorithms can represent large models that have combinatorially large or even infinite state spaces. Otherwise, network-free stochastic simulation algorithms have the same limitations as other stochastic simulation algorithms.

FBA is the second-most commonly used algorithm for simulating cell models. FBA predicts the steady-state flux of each metabolic reaction using detailed information about the stoichiometry and catalysis of each reaction, a small amount of quantitative data about the chemical composition of cells, a small amount of data about the exchange rate of each extracellular nutrient, and the assumption that metabolism has evolved to maximize the rate of cellular growth. However, FBA has limited ability to predict metabolite concentrations and temporal dynamics, and its assumptions are largely only applicable to microbial metabolism. Consequently, FBA is not well-suited to simulating entire cells.
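In its most common form, the FBA problem described above can be written as a linear program. The notation here is an assumption introduced for illustration: \(S\) is the stoichiometric matrix, \(v\) is the vector of reaction fluxes, \(c\) selects the growth (biomass) reaction, and \(v^{\min}\) and \(v^{\max}\) encode reaction directionality and the measured nutrient exchange rates.

\[
\begin{aligned}
\max_{v} \quad & c^{\mathsf{T}} v \\
\text{subject to} \quad & S v = 0, \\
& v^{\min} \le v \le v^{\max}.
\end{aligned}
\]

The steady-state constraint \(S v = 0\) removes metabolite concentrations and time from the problem, which is why FBA needs little kinetic data but also has limited ability to predict concentrations and dynamics.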

Logical simulation algorithms are frequently used for coarse-grained simulations of transcriptional regulation and other pathways for which we have limited kinetic data. Logical simulations are computationally efficient because they are coarse-grained. However, logical simulation algorithms are poorly suited to WC modeling because they cannot generate detailed quantitative predictions, and therefore have limited utility for medicine and bioengineering.
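A minimal sketch of synchronous logical simulation follows, using a hypothetical three-gene negative feedback loop in which A activates B, B activates C, and C represses A. The circuit and the `step` helper are illustrative assumptions and do not correspond to any published model.

```python
def step(state, rules):
    """Synchronously update every node from the current state."""
    return {node: rule(state) for node, rule in rules.items()}


# Hypothetical regulatory functions for a three-gene circuit
rules = {
    "A": lambda s: not s["C"],  # C represses A
    "B": lambda s: s["A"],      # A activates B
    "C": lambda s: s["B"],      # B activates C
}

state = {"A": True, "B": False, "C": False}
trajectory = [state]
for _ in range(6):
    state = step(state, rules)
    trajectory.append(state)

for s in trajectory:
    print(s)
```

Each update costs only one evaluation per node, which is what makes logical simulation cheap, but the states carry no quantitative information about concentrations or rates.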

Multi-algorithmic simulations are ideal for WC modeling because they can simulate models that include fine-grained representations of well-characterized pathways, as well as coarse-grained representations of poorly-characterized pathways. Takahashi et al. developed one of the first algorithms for co-simulating multiple mathematically-dissimilar submodels [3]. However, their algorithm is not well-suited to WC modeling because it does not support FBA or network-free simulation. Recently, we and others developed a multi-algorithm simulation meta-algorithm which supports ODE integration, conventional stochastic simulation, network-free stochastic simulation, FBA, and logical simulation [27]. However, our algorithm violates the arrow of time and is not scalable to large models.

1.4.2.7. Simulation experiment formats¶

The Minimum Information About a Simulation Experiment (MIASE) guidelines have been developed to establish the minimum metadata that should be provided about a simulation experiment to enable other researchers to reproduce and understand the simulation [2]. The Simulation Experiment Description Markup Language (SED-ML) [110] and the Simulation Experiment Specification via a Scala Layer (SESSL) [111] formats have been developed to represent simulation experiments. Both formats are capable of representing all of the model parameters and simulator arguments needed to simulate a model. However, both formats are limited to a small range of model formats and simulators. SED-ML is limited to models that are represented using XML-based formats such as SBML, and SESSL is currently limited to Java-based simulators. Consequently, neither is currently well-suited to WC modeling.

1.4.2.8. Simulation tools¶

Numerous tools have been developed to simulate cell models including BioUML [67], Cell Collective [77], COBRApy [73], COPASI [69], E-Cell [112], FAME [74], iBioSim [113], libRoadRunner [114], JWS Online [78], NFsim [107], RAVEN [75], and Virtual Cell [71].

COPASI is the most commonly used simulation tool. COPASI supports several deterministic, stochastic, and hybrid deterministic/stochastic simulation algorithms. However, COPASI does not support network-free stochastic simulation, FBA, logical, or multi-algorithmic simulation and COPASI does not support high-performance parallel simulation of large models.

Virtual Cell supports several deterministic, stochastic, hybrid deterministic/stochastic, network-free, and spatial simulation algorithms. However, Virtual Cell does not support FBA or multi-algorithmic simulations and Virtual Cell does not support high-performance parallel simulation of large models.

COBRApy, FAME, and RAVEN support FBA of metabolic models. However, these packages provide no support for other types of models.

E-Cell is one of the only simulation programs that supports multi-algorithmic simulation. However, E-Cell does not support FBA or rule-based simulation, and E-Cell does not scale well to large models.

Several tools including cupSODA [115], cuTauLeaping [116], and Rensselaer’s Optimistic Simulation System (ROSS) [117] have been developed to simulate models in parallel. However, cupSODA only supports deterministic simulation, cuTauLeaping only supports network-based stochastic simulation, cupSODA and cuTauLeaping only support GPUs, and ROSS is a low-level, general-purpose framework for distributed CPU simulation.

1.4.2.9. Calibration tools¶

Accurate parameter values are essential for reliable predictions. Many methods have been developed to calibrate models by numerically optimizing the values of their parameters, including derivative-based initial value methods and stochastic multiple shooting methods [118].

Several complementary methods have also been developed to optimize computationally-expensive, high-dimensional functions, including surrogate modeling, distributed optimization, and automatic differentiation. Surrogate modeling, which is also referred to as function approximation, metamodeling, response surface modeling, and model emulation, promises to reduce the computational cost of numerical optimization by optimizing a computationally cheaper model which approximates the original model [119][120][121][122]. Surrogate modeling has been used in several fields including aerospace engineering [123], hydrology [124], and petroleum engineering [125]. However, further work is needed to develop methods for efficiently generating reduced surrogate WC models.

Distributed optimization is also a promising approach for optimizing computationally expensive functions. Distributed optimization uses multiple agents, each simultaneously employing the same algorithm on different regions, to quickly identify optima [126][127]. Furthermore, agents can cooperate by exchanging information. Distributed optimization has been used in several fields including aerospace and electrical engineering [128][129] and molecular dynamics [130].

Another promising approach for optimizing computationally expensive functions is automatic differentiation. Automatic differentiation is an efficient technique for analytically computing the derivative of a function [131]. Automatic differentiation can be used to make derivative-based optimization methods tractable in cases where finite difference calculations are prohibitively expensive. Automatic differentiation has been used to identify parameters in chemical engineering [132], biomechanics [133], and physiology [134].
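To illustrate the idea, here is a minimal sketch of forward-mode automatic differentiation using dual numbers. The `Dual` class, the `exp` and `derivative` helpers, and the example function are assumptions made for illustration; practical calibration would rely on an established automatic differentiation library and usually on reverse mode when there are many parameters.

```python
import math


class Dual:
    """Number of the form value + deriv * eps, where eps**2 == 0."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def exp(x):
    """Exponential of a Dual, using the chain rule (e^u)' = e^u * u'."""
    e = math.exp(x.value)
    return Dual(e, e * x.deriv)


def derivative(f, x):
    """Evaluate df/dx at x by seeding the derivative part with 1."""
    return f(Dual(x, 1.0)).deriv


def f(k):
    """Hypothetical model output as a function of one parameter k."""
    return 3.0 * k * k + exp(2.0 * k)


grad = derivative(f, 0.5)
# The analytical derivative is 6k + 2 exp(2k)
print(grad, 6 * 0.5 + 2 * math.exp(1.0))
```

Unlike finite differences, this computes the derivative in a single evaluation of the function and without truncation error from a step size.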

Several software tools have also been developed for calibrating cell models [135][136][137][138][139]. Some of the most advanced model calibration tools include DAISY which can evaluate the identifiability of a model [140], ABC-SysBio which uses approximate Bayesian computation [83], saCeSS which supports distributed, collaborative optimization [141], and SBSI which supports several distributed optimization methods [142]. Some of the most popular modeling tools, including COPASI [69] and Virtual Cell [71], also provide model calibration tools. However, none of these tools support multi-algorithmic models. To efficiently calibrate WC models, we should combine numerical optimization methods with additional techniques such as reduced surrogate modeling, distributed computing, and automatic differentiation.

1.4.2.10. Verification tools¶

Several tools have been developed to verify cell models, including formal verification tools that seek to prove or refute mathematical properties of models and informal verification tools that help modelers organize and evaluate computational tests of models. BioLab [143] and PRISM [144] are formal tools for verifying BioNetGen-encoded and SBML-encoded models, respectively. Memote [145] and SciUnit [146] are unit testing frameworks for organizing computational tests of models. Continuous integration tools such as CircleCI [147] and Jenkins [148] can be used to regularly verify models each time they are modified and pushed to a version control system (VCS) such as Git [149].

1.4.2.11. Simulation results formats¶

HDF5 is an ideal format for storing simulation results [150]. In particular, HDF5 supports hierarchical data structures, HDF5 supports compression, HDF5 supports chunking to facilitate fast retrieval of small slices of large datasets, HDF5 can store both simulation results and their metadata, and there are HDF5 libraries available for several languages including C++, Java, MATLAB, Python, and R.

1.4.2.12. Simulation results databases¶

Several database systems have been developed to organize simulation results for visual and mathematical analysis and disseminate simulation results to the community [151][152][153][154][155][156][157]. We developed WholeCellSimDB, a hybrid relational/HDF5 database, to organize, search, and share WC simulation results [158]. WholeCellSimDB uses HDF5 to store simulation results and a relational database to store their metadata. This enables WholeCellSimDB to efficiently store simulation results, quickly search simulations by their metadata, and quickly retrieve slices of simulation results. WholeCellSimDB provides two interfaces to deposit simulation results; a web-based interface to search, browse, and visualize simulation results; and a JSON web service to retrieve simulation results. However, further work is needed to scale WholeCellSimDB to larger models and to develop tools for quickly searching WholeCellSimDB.

1.4.2.13. Simulation results analysis¶

Several tools have been developed to analyze and visualize simulation results. The most popular simulation software programs, including COPASI [69], E-Cell [112], and Virtual Cell [71], provide basic tools for visualizing simulation results. Tools such as Escher [159] and Pathway Tools Omics Viewer [160] can also be used to visualize simulation results.

We developed WholeCellViz to visualize WC simulation results in their biological context [161]. WholeCellViz provides users time series plots and interactive animations to visualize model predictions, and enables users to arrange grids of plots and animations to help users compare predictions across multiple simulation runs and simulated conditions. However, further work is needed to scale WholeCellViz to larger models and to make it easier to incorporate new visualizations into WholeCellViz.

1.4.3. Models of individual pathways and model repositories¶

Since the 1950’s, researchers have been using the tools described above to model cells. This has led to numerous models that represent individual pathways. Here, we review our progress in modeling individual pathways, building repositories of cell models, and their utility for WC modeling.

1.4.3.1. Models of individual pathways¶

Over the past 30 years, researchers have developed a wide range of models of individual cellular pathways [105] (Figure 1.4, Table 1.5). In particular, researchers have developed models of cell cycle regulation [162]; circadian rhythms [163]; electrical signaling [164]; metabolism [165][166][167]; signaling pathways such as the JAK/STAT, NF-\(\kappa\)B, p53, and TGF\(\beta\) pathways [168]; transcriptional regulation [169]; and multicellular processes such as developmental patterning [170] and infection. However, many pathways have not been modeled at the scale of entire cells, including several well-studied pathways. For example, although we have extensive knowledge of the mutations responsible for cancer, we have few models of DNA repair; although we have extensive structural and catalytic information about RNA modification, we have few kinetic models of RNA modification; and although we have detailed atomistic models of protein folding, we have few cell-scale models of chaperone-mediated folding.

Figure 1.4 WC models can be built by leveraging existing models of well-studied processes (colors) and developing new models of other processes (gray).¶

Table 1.5 Pathway distribution, computational representations, and taxonomic distribution of the models contained in the BioModels model repository (Chelliah et al., 2015).¶
		Number of models in BioModels, by kingdom				Mean model size
Pathway	Formalisms	Viruses	Eukaryotes	Bacteria	Unannotated	Species	Reactions	Parameters
Cell cycle	ODEs, SSA		44			14.0	19.4	33.6
Cell death	ODEs		11		2	24.5	33.6	42.2
Circadian regulation	ODEs		38		1	17.3	31.2	65.5
DNA repair	ODEs		1			23.0	25.0	26.0
Electrical signaling	ODEs		34		5	12.7	26.4	37.5
Gene expression regulation	Boolean network		9	10	5	11.9	14.0	15.5
Host-pathogen interaction	ODEs	1	2	1		24.3	44.5	58.0
Intracellular transport	ODEs		2		2	7.8	12.8	16.3
Macromolecule modification	ODEs		1		2	10.7	26.0	19.7
Metabolism	FBA, ODEs		100	16	5	57.0	39.7	195.6
Motility	ODEs, PDEs		2	2		40.8	48.3	79.5
Organismal process	ODEs	1	66	2	2	17.2	20.1	48.8
Regulation, other	ODEs		5		14	12.0	17.8	22.2
Signal transduction	ODEs, SSA		144	3	30	35.3	54.1	67.8
Stress response	ODEs		9			16.6	19.4	46.2

Collectively, these models span a broad range of scales. For example, although most of these models represent the chemical transformations responsible for each pathway, some of these models, such as most transcriptional regulation models, use coarser representations. As a second example, although most of these models represent temporal dynamics, most metabolic models only represent the steady-state behavior of metabolism [13]. Similarly, although most of these models represent cells as well-mixed bags, some of these models represent the spatial distribution of individual compounds including nutrients and hormones [171][172][173]. In addition, although most of these models represent the mean behavior of cells, averaged over multiple cells and cell cycle phases, a few of these models represent the temporal dynamics of the cell cycle and the variation among single cells.

Collectively, these models also use a wide range of computational representations and simulation algorithms. Many of these models are represented as reaction networks. However, some of the largest of these models must be represented using rules [102] or Boolean networks. Many of these models can be simulated by integrating ODEs. However, some of the largest models must be simulated using network-free methods [107], the steady-state metabolism models must be simulated with FBA [13], some of the spatiotemporal models must be simulated by integrating PDEs, and some of the network models must be simulated by iteratively evaluating Boolean regulatory functions [174].

These pathway models could be used to help build WC models. However, substantial work would be required to integrate these models into a single model because these models describe different scales, make different assumptions, are represented using different mathematical formalisms, are calibrated to different organisms and conditions, and are represented using different identifiers and formats. To avoid needing to substantially revise pathway models for incorporation into WC models, modelers should build pathway models explicitly for integration into WC models. This requires the modeling community to embrace a common format, common identifiers, common units, and common standards for model calibration and validation.

1.4.4. Models of multiple pathways¶

Since 1999 when Tomita et al. reported one of the first models of multiple pathways of M. genitalium [17], researchers have been trying to build increasingly comprehensive models of multiple pathways. In particular, this has led to models of Escherichia coli and Saccharomyces cerevisiae which describe their metabolism and transcriptional regulation [18][19]; their metabolism, signaling, and transcriptional regulation [20][21][22]; and their metabolism and RNA and protein synthesis and degradation [23]. Table 1.6 summarizes several recently published and proposed models of multiple pathways. Despite this progress, these models only represent a small number of pathways and a small number of organisms.

Table 1.6 Models of multiple cellular pathways and their computational representations.¶
Number of pathways	Pathways	Computational representation	Species	Status	References
28	Chromosome Condensation, Chromosome Segregation, Cytokinesis, DNA damage, DNA repair, DNA supercoiling, FtsZ Polymerization, Host interaction, Macromolecular complexation, Metabolism, Protein activation, Protein decay, Protein folding, Protein modification, Protein processing I, Protein processing II, Protein translocation, Replication, Replication Initiation, Ribosome assembly, RNA decay, RNA modification, RNA processing, Terminal organelle assembly, Transcription, Transcriptional regulation, Translation, tRNA aminoacylation	Hybrid: Boolean, flux balance analysis, ordinary differential equations, stochastic simulation	Mycoplasma genitalium	Published	Karr et al., 2012
6	Metabolism, protein complexation, RNA maturation, RNA modification, transcription, translation	Flux balance analysis	Escherichia coli	Published	Thiele et al., 2009
5	Metabolism, protein degradation, RNA degradation, transcription, translation	Ordinary differential equations	Mycoplasma genitalium	Published	Tomita et al., 1999
3	Circadian rhythms, metabolism, transcriptional regulation	Hybrid: flux balance analysis, ordinary differential equations	Synechocystis sp. PCC 6803	Proposed	Steuer et al., 2012
3	Contraction, electrical signaling, metabolism	Ordinary differential equations	Homo sapiens	Proposed	Bassingthwaighte et al., 2005
3	Metabolism, signal transduction, transcriptional regulation	Hybrid: Boolean, flux balance analysis, ordinary differential equations	Escherichia coli	Published	Covert et al., 2008
3	Metabolism, signal transduction, transcriptional regulation	Hybrid: constraint-based modeling, ordinary differential equations, phenomenological modeling	Escherichia coli	Published	Carrera et al., 2014
3	Metabolism, signal transduction, transcriptional regulation	Ordinary differential equations	Saccharomyces cerevisiae	Published	Klipp et al., 2005
3	Metabolism, signal transduction, transcriptional regulation	Hybrid: Boolean, flux balance analysis, ordinary differential equations	Saccharomyces cerevisiae	Published	Lee et al., 2008
3	Metabolism, signal transduction, transcriptional regulation	Hybrid: Boolean, flux balance analysis, ordinary differential equations	N/A	Review	Gonçalves et al., 2013
2	Cell cycle regulation, metabolism	Hybrid: Flux balance analysis, ordinary differential equations	Saccharomyces cerevisiae	Proposed	Barberis et al., 2017
2	Cell cycle regulation, signal transduction	Logical model	Homo sapiens	Published	Huard et al., 2012
2	Contraction, electrical signaling	Ordinary differential equations	Homo sapiens	Published	Greenstein et al., 2006
2	Metabolism, signal transduction	Ordinary differential equations	Homo sapiens	Published	König et al., 2012
2	Metabolism, signal transduction	Ordinary differential equations	Homo sapiens	Published	Mosca et al., 2012
2	Metabolism, transcriptional regulation	Hybrid: Boolean, flux balance analysis	Escherichia coli	Published	Covert et al., 2004
2	Metabolism, transcriptional regulation	Hybrid: Bayesian, flux balance analysis	Escherichia coli	Published	Chandrasekaran and Price, 2010
2	Metabolism, transcriptional regulation	Hybrid: Boolean, flux balance analysis	Escherichia coli	Published	Shlomi et al., 2007
2	Electrical signaling, tension development	Ordinary differential equations	Homo sapiens	Published	Niederer and Smith, 2007
2	Signal transduction, transcriptional regulation	Ordinary differential equations	Homo sapiens	Published	Nakakuki et al., 2010
2	Signal transduction, transcriptional regulation	Ordinary differential equations	Homo sapiens	Published	Stelniec-Klotz et al., 2012
2	Metabolism, transcriptional regulation	Hybrid: Bayesian, flux balance analysis	Mycobacterium tuberculosis	Published	Chandrasekaran and Price, 2010
2	Metabolism, transcriptional regulation	Hybrid: Bayesian, flux balance analysis	Mycobacterium tuberculosis	Published	Ma et al., 2015

To represent multiple pathways, most of these models have been developed by combining separate submodels of each pathway, using the most appropriate mathematical representation for each pathway. This has led to multi-algorithmic models which must be simulated by co-simulating the individual submodels. Because there are few multi-algorithmic simulation tools and most of these models only combine two or three submodels, the developers of most of these models have developed ad hoc methods to simulate their models. For example, Covert et al. developed an ad hoc method to simulate their hybrid dynamic FBA / Boolean model of the metabolism and transcriptional regulation of E. coli [18] and Chandrasekaran and Price developed a different ad hoc method to simulate their hybrid FBA / Bayesian model of the metabolism and transcriptional regulation of E. coli [19]. Because there are few tools for working with such integrative models, these models have also been described with different ad hoc formats and identifiers, simulated with different ad hoc simulation software programs, and calibrated and validated with different ad hoc methods.

1.4.4.1. Model repositories¶

Several model repositories, including BioModels [105] and the Physiome Model Repository [175], have been developed to make it easy to find models (Table 1.7). However, only a few of these repositories support integrated models; most of these repositories only support a limited number of model formats; many reported models are never deposited to any model repository; many of the models that are deposited are not sufficiently annotated for other researchers to understand, reuse, and extend the models; and only a few of the repositories also support the information needed to simulate models such as parameter values.

Table 1.7 Repositories that contain published models that can be modified, extended, and combined to create WC models.¶
Repository	Content	URL	Reference
BiGG	Repository for constraint-based models of metabolism	http://bigg.ucsd.edu	King et al., 2016
BioModels	Repository for SBML-encoded models that contains many cell cycle, circadian, electrical signaling, metabolism, and signal transduction models	http://www.ebi.ac.uk/biomodels-main	Chelliah et al., 2015
FigShare	Repository for supplemental materials that contains some models	https://figshare.com
GitHub	Repository for code that contains some models	https://github.com
JWS Online	Online environment for systems biology modeling that includes a model repository	http://jjj.biochem.sun.ac.za	Peters et al., 2017
Open Source Brain	Repository for NeuroML-encoded models of neurophysiology	http://www.opensourcebrain.org	Gleeson et al., 2012
Physiome Model Repository	Repository for CellML-encoded models that contains physiological models	https://models.physiomeproject.org	Yu et al., 2011
SimTK	Repository for data and code that contains several biomechanics models	https://simtk.org

1.5. Emerging principles and methods for WC modeling¶

In the previous section, we outlined the ongoing technological advances that are making WC modeling feasible. Here, we propose several principles for WC modeling and describe how we and others are adapting and integrating these technologies into a methodology for WC modeling. In the following sections, we outline the major remaining bottlenecks to WC modeling, highlight ongoing efforts to overcome these bottlenecks, and describe how we are beginning to use this methodology to build WC models.

1.5.1. Principles of WC modeling¶

Based on our experience, we propose several guiding principles for WC modeling (Figure 1.5).

Modular modeling. Similar to other large engineered systems such as software, WC models should be built by partitioning cells into pathways, outlining the interfaces among these pathways, building submodels of each pathway, and combining these submodels into a single model. This approach reduces the dimensionality of model construction, calibration, and validation and facilitates collaborative modeling.

Multi-algorithmic simulation. Furthermore, to capture both well- and poorly-characterized pathways, each pathway should be represented using the most appropriate mathematical representation given our knowledge and data about each pathway. In particular, multi-algorithmic simulation should be used to create identifiable models which can be calibrated from our experimental data.

Experimental calibration and validation. WC models should be rigorously calibrated and extensively validated via comparison to detailed experimental data across a wide range of molecular mechanisms, phenotypes, and scales.

Systemization and standards. To scale modeling to entire cells and facilitate collaboration, we should systemize every aspect of dynamical modeling, develop standards for describing WC models and standard protocols for validating and merging model components, and encourage researchers to embrace these standard protocols and formats.

Technology development. To enable WC modeling, we must develop technologies for systematically and scalably building, calibrating, simulating, and validating WC models. These technologies should be modular to facilitate collaborative technology development and integrated into a unified framework to provide modelers user-friendly modeling and simulation tools.

Leverage existing methods and data. Where possible, WC modeling should take advantage of existing computational methods and experimental data. For example, WC modeling should take advantage of parallel simulation methods developed by computer science and WC models should be built, in large part, from data aggregated from public repositories.

Focus on critical problems and clear, achievable goals. To maximize our efforts, we should periodically identify the key bottlenecks to WC modeling and periodically refocus our efforts on overcoming these bottlenecks. Based on lessons learned from other “big science” projects [176][177], we should also delineate clear goals and clearly define the responsibilities of each researcher.

Focus on model organisms. To facilitate collaboration, early WC modeling efforts should focus on a small number of organisms and cell lines that are easy to culture, well-characterized, karyotypically and phenotypically “normal”, genomically stable, and relevant to a wide range of basic science, medicine, and bioengineering. This includes well-characterized bacteria such as Escherichia coli and well-characterized human cell lines such as the H1 human embryonic stem cell (hESC) line.

Reproducibility, transparency, extensibility, and openness. To facilitate collaboration and maximize impact, WC models and simulations should be reproducible, comprehensible, and extensible. For example, to enable other modelers to understand a model, the biological semantic meaning of each species and reaction should be annotated, the data sources and assumptions used to design the model should be annotated, and the parameter values used to produce each simulation result should be recorded. Furthermore, each WC model and WC modeling technology should be free and open-source.

Constant innovation. Because we do not yet know exactly what WC models should represent, what WC models should predict, or how to build WC models, we should periodically evaluate the quality of our models and methods and iteratively improve our models and methods as we learn more about cell biology and WC modeling. This should include how we partition cells into pathways, the interfaces that we define among the pathways, and how we simulate multi-algorithmic models.

Interdisciplinary collaboration. WC modeling should be an interdisciplinary collaboration among modelers, experimentalists, computer scientists, engineers, and research sponsors. Furthermore, there should be open and frequent communication among the WC modeling community.

Figure 1.5 Principles of WC modeling.¶

1.5.2. Methods for WC modeling¶

To enable WC models, we and others are adapting and integrating the technologies described in Section 1.4.2 into a workflow for scalably building, simulating, and validating WC models (Figure 1.6). (1) Modelers will use Datanator to aggregate, standardize, and integrate the experimental data that they will need to build, calibrate, and validate their model into a single dataset. (2) Modelers will use this data to design submodels of each individual pathway using the most appropriate mathematical representation for each pathway, and encode their model in wc_rules, a rule-based format for describing WC models. (3) Modelers will construct reduced models, and use them to calibrate each submodel and their entire model. (4) Modelers will use formal verification and/or unit testing to verify that their model functions as intended and recapitulates the data used to build the model. (5) Modelers will use wc_sim, a scalable, network-free, multi-algorithmic simulator, to simulate their model. (6) Modelers will use WholeCellSimDB to organize their simulation results and use WholeCellViz to visually analyze these results. Importantly, every tool in this workflow will facilitate collaboration to help researchers work together, and these tools will be modular to enable us and others to continuously improve this methodology. We plan to implement this workflow by leveraging recent advances in computational and experimental technology (Section 1.4). Here, we describe the six steps of this emerging workflow.

Figure 1.6 Emerging workflow for scalably building, simulating, and validating WC models. (a) Modelers will aggregate the data for WC modeling into a single dataset. (b) Modelers will use this data to design multi-algorithmic WC models. (c,d) Modelers will use reduced models to calibrate, verify, and validate models. (e) Modelers will simulate multi-algorithmic WC models by co-simulating their submodels. (f) Modelers will visualize and analyze their results to discover new biology, personalize medicine, and design microorganisms.¶

1.5.2.1. Data aggregation, standardization, and integration¶

The first step of WC modeling is to aggregate, standardize, integrate, and select the experimental data needed for WC modeling into a single dataset for model building, calibration, and validation (Figure 1.6a).

First, we must aggregate a wide range of experimental data from a wide range of databases, such as biochemical data about metabolite concentrations from ECMDB [47], RNA-seq data about RNA concentrations from ArrayExpress [31], and mass-spectrometry data about protein concentrations from PaxDb [33]. Where possible, data should be aggregated using database downloads and web services. Otherwise, data should be aggregated by scraping webpages. In addition to aggregating data from databases, we should also aggregate data from collaborators, individual publications, and bioinformatics prediction tools such as PSORTb [65] and TargetScan [66].

To the extent possible, we should record the provenance of this data, including the biosample (e.g., species, strain, genetic variants) and environmental conditions (e.g., temperature, pH, growth media) that were measured, the experimental method used to generate the data, the computational method used to analyze the data, and the citation for the original data. This provenance will help us select the most relevant data for modeling and trace models back to their data sources.

Second, we must standardize the identifiers and units used to describe this data. For example, metabolites should be identified using the IUPAC International Chemical Identifier (InChI) format [178] and RNA should be identified by their genomic coordinates. Similarly, all units should be standardized to SI units or combinations of SI units.
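The unit standardization described above can be illustrated with a minimal sketch. The conversion table, the function name `to_si_concentration`, and the example values are hypothetical and are not taken from Datanator or WholeCellKB; the sketch only shows concentrations being mapped onto a single combination of SI units (mol/m^3).

```python
# Hypothetical conversion factors from common concentration units to mol/m^3.
# 1 M = 1 mol/L = 1000 mol/m^3.
TO_MOL_PER_M3 = {
    "M": 1e3,
    "mM": 1.0,
    "uM": 1e-3,
    "nM": 1e-6,
}


def to_si_concentration(value, unit):
    """Convert a concentration to mol/m^3; raise on unrecognized units."""
    try:
        return value * TO_MOL_PER_M3[unit]
    except KeyError:
        raise ValueError(f"unrecognized concentration unit: {unit!r}")


# Illustrative observations reported in different units.
observations = [(2.5, "mM"), (300.0, "uM"), (0.001, "M")]
standardized = [to_si_concentration(v, u) for v, u in observations]
```

Rejecting unknown units, rather than passing values through unchanged, keeps silently mis-scaled data out of the integrated dataset.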

Third, we must integrate this data by linking the data together through common metabolites, chromosomes, RNA, proteins, and interactions. To enable this data to be quickly searched and explored, this data should be organized into a relational database.

Fourth, we must identify the most relevant data within our database for the species and environmental condition that we want to model. For each experimental measurement that we need to constrain a model, we must search our database for data observed for similar biology (e.g., metabolites, RNA, proteins, and interactions), genotypes (e.g., species, strain, and genetic variants), and environmental conditions (e.g., temperature, pH, growth media); calculate the relevance of each experimental observation; and calculate the consensus of the relevant observations, weighted by their relevance.
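The relevance-weighted consensus described above could be computed along the following lines. This is a sketch under stated assumptions: the `Observation` fields, the form of the `relevance` score, and its scale parameters are all hypothetical choices made for illustration, not the scoring used by any existing tool.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    value: float          # measured value, already in standardized units
    taxon_distance: int   # hypothetical: 0 = same strain, larger = more distant
    temperature_c: float  # temperature at which the value was measured


def relevance(obs, target_temperature_c, taxon_scale=2.0, temp_scale=10.0):
    """Hypothetical relevance score in (0, 1] that decays with taxonomic
    distance and with the difference from the target temperature."""
    taxon_term = 1.0 / (1.0 + obs.taxon_distance / taxon_scale)
    temp_term = 1.0 / (1.0 + abs(obs.temperature_c - target_temperature_c) / temp_scale)
    return taxon_term * temp_term


def consensus(observations, target_temperature_c):
    """Relevance-weighted mean of the observed values."""
    weights = [relevance(o, target_temperature_c) for o in observations]
    total = sum(weights)
    if total == 0:
        raise ValueError("no relevant observations")
    return sum(w * o.value for w, o in zip(weights, observations)) / total
```

For example, an observation from the target strain at the target temperature receives weight 1, whereas an observation at taxonomic distance 2 receives weight 0.5 under the default scales.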

Fifth, we should organize these consensus experimental values and their provenance (experimental evidence and the method used to calculate the consensus value) into a single dataset. Pathway/genome databases (PGDB) can be used to organize this information because PGDBs are well-suited to representing relationships among experimental data about a single species. We have developed the WholeCellKB PGDB to organize the data needed for WC modeling. WholeCellKB provides users three interfaces to deposit experimental data for WC models, extensive functionality for validating this data, a web-based user interface to search and browse this data, and a JSON web service to programmatically retrieve data for model construction.

1.5.2.2. Model design¶

The second step of WC modeling is to use the data aggregated in the first step to design models, including each species and interaction (Figure 1.6b). To represent the details of well-characterized pathways, as well as coarsely represent poorly-characterized pathways, WC models should be built by partitioning cells into pathways, modeling each pathway using the most appropriate mathematical representation, and combining pathway submodels into composite, multi-algorithmic models.

To capture the large number of possible cellular phenotypes, WC models should also capture the combinatorial complexity of cellular biochemistry. For example, WC models should represent the combinatorial number of RNA transcripts that can be produced from the interactions of transcription, RNA editing, RNA folding, and RNA degradation; the combinatorial number of possible interactions among the subunits of protein complexes; and the combinatorial number of phosphorylation states of each protein complex.

To generate accurate predictions, WC models should also aim to represent the aggregate physiology of poorly understood biology such as uncharacterized genes, uncharacterized small peptides, and uncharacterized non-coding RNA. This can be accomplished by including lumped reactions that represent the aggregate physiology of all unknown biology. For example, to accurately predict metabolic reaction fluxes, like FBA models, WC models can include reactions that capture the aggregate energy usage of all uncharacterized interactions.

To scalably and reproducibly build WC models, WC models should be programmatically built from PGDBs using scripting tools such as PySB [79].

Because WC models will never be complete, WC models should be built by designing an initial model and then iteratively improving the model until the model accurately predicts new experimental measurements. In particular, WC models can be systematically refined by identifying gaps between their bottom-up descriptions of cellular biochemistry and our physiological knowledge, searching for reactions and gene products that might fill those gaps, and parsimoniously adding species and reactions to models so they recapitulate experimental observations. Model selection methods can also be used to select among multiple potential model designs. Furthermore, version control systems such as Git [149] should be used to track model changes and enable collaborators to refine models in parallel and merge their refined models.

To enable other researchers to reproduce, understand, reuse, and extend WC models, WC models should be encoded in rule-based formats such as BioNetGen and extensively annotated. In particular, rule-based formats enable researchers to concisely describe the combinatorial complexity of cell biology. Model annotations should include semantic annotations about the biological meaning of each species and interaction such as the chemical structure of each metabolite in InChI format [178] and provenance annotations about the data sources, assumptions, and design decisions behind each species, interaction, and pathway.

1.5.2.3. Model calibration¶

The third step in WC modeling is to calibrate model parameters (Figure 1.6c). This should be done by using numerical optimization methods to minimize the distance between the model’s predictions and related experimental observations. One promising method for calibrating composite WC models is to (a) use multi-algorithmic modeling to only create parameters whose values can be constrained by one or a small number of experimental measurements, (b) estimate the value of each individual parameter using one or a small number of experimental observations, (c) construct a set of reduced models, one for each submodel, to estimate the joint values of the parameters, and (d) use distributed global optimization tools such as saCeSS [141] to refine the joint values of the parameters [179]. This method avoids the need to calibrate large numbers of parameters against physiological data; performs the majority of model calibration using low dimensional models of individual species, reactions, and pathways; and generates successively better starting points for more refined calibration.
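The staged strategy above can be sketched for a single hypothetical protein. The measurements, the birth-death reduced model, and the function names are illustrative assumptions. The `refine` function is a simple coordinate pattern search that stands in for a distributed global optimizer such as saCeSS; it is not that tool's algorithm.

```python
import math

# Hypothetical measurements for one protein (illustrative values only).
observed_half_life = 20.0      # min
observed_copy_number = 500.0   # molecules per cell at steady state
observations = {"copy_number": observed_copy_number, "half_life": observed_half_life}

# Step (b): estimate each parameter from a single observation.
k_deg = math.log(2) / observed_half_life   # degradation rate, 1/min
k_syn = observed_copy_number * k_deg       # synthesis rate, molecules/min


def reduced_model(k_syn, k_deg):
    """Reduced submodel: steady state and half-life of a birth-death process."""
    return {"copy_number": k_syn / k_deg, "half_life": math.log(2) / k_deg}


def distance(params, observations):
    """Sum of squared relative errors between predictions and observations."""
    pred = reduced_model(*params)
    return sum(((pred[k] - v) / v) ** 2 for k, v in observations.items())


def refine(params, observations, step=0.1, tol=1e-9, max_iter=10_000):
    """Stand-in for steps (c) and (d): multiplicative coordinate pattern
    search that only accepts moves which reduce the distance."""
    params = list(params)
    best = distance(params, observations)
    for _ in range(max_iter):
        improved = False
        for i in range(len(params)):
            for factor in (1 + step, 1 - step):
                trial = params.copy()
                trial[i] *= factor
                d = distance(trial, observations)
                if d < best:
                    params, best, improved = trial, d, True
        if not improved:
            step /= 2          # no move helped: search on a finer scale
            if step < tol:
                break
    return params, best


# Refinement from a deliberately poor starting point recovers the estimates.
refined, residual = refine([2 * k_syn, 0.5 * k_deg], observations)
```

Because each parameter is first estimated from its own observation, the joint refinement starts close to a good solution, which is the motivation for calibrating in stages.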

1.5.2.4. Model verification and validation¶

The fourth step in WC modeling is to verify that models behave as intended and validate that models recapitulate the true biology (Figure 1.6d). First, WC models should be verified using a series of increasingly comprehensive unit tests that test each individual species, reaction, and pathway, as well as groups of pathways and entire models. Importantly, these tests should cover all of the logic of the model. For example, these tests should test the edge cases of every rate law. Reduced models should be used to efficiently test individual species, reactions, and pathway submodels. Furthermore, to quickly identify errors, continuous integration systems such as Jenkins [148] should be used to automatically execute tests each time models are revised. Alternatively, models can be verified using formal verification systems such as PRISM [144]. However, substantial work remains to adapt formal verification to multi-algorithmic dynamical modeling.
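As a minimal sketch of testing the edge cases of a rate law, the following uses Python's standard `unittest` module. The Michaelis-Menten function and its parameter values are hypothetical examples and are not taken from any particular WC model.

```python
import unittest


def michaelis_menten(v_max, k_m, substrate):
    """Hypothetical rate law: v = v_max * S / (K_m + S)."""
    if substrate < 0:
        raise ValueError("substrate concentration must be non-negative")
    return v_max * substrate / (k_m + substrate)


class TestMichaelisMenten(unittest.TestCase):
    def test_zero_substrate_gives_zero_rate(self):
        self.assertEqual(michaelis_menten(10.0, 2.0, 0.0), 0.0)

    def test_half_saturation(self):
        # At S = K_m the rate should be half of v_max.
        self.assertAlmostEqual(michaelis_menten(10.0, 2.0, 2.0), 5.0)

    def test_saturation_limit(self):
        # At very high substrate the rate should approach v_max.
        self.assertAlmostEqual(michaelis_menten(10.0, 2.0, 1e12), 10.0, places=6)

    def test_negative_substrate_rejected(self):
        with self.assertRaises(ValueError):
            michaelis_menten(10.0, 2.0, -1.0)
```

A continuous integration system could run such a test suite on every model revision, for example with `python -m unittest`.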

Second, WC models should be validated by comparing their simulation results to independent experimental data that was not used for model construction or calibration. To be effective, models should be tested using a broad range of data that spans different types of predictions, genetic perturbations, and environmental conditions.

Third, because it is infeasible to validate every possible model prediction, modelers should annotate how models were validated to help other modelers know which model predictions can be trusted, know which predictions still need to be validated, and reuse the validation data to validate improved and/or extended models. These annotations should include which data were used for validation, which predictions were validated, and how well the model recapitulated each experimental observation. We believe that this metadata will be critical for medicine where therapy should only be driven by validated model predictions.

1.5.2.5. Network-free multi-algorithmic simulation¶

The fifth step of WC modeling is to numerically simulate WC models (Figure 1.6e). Because WC models should be described using rules and composed of multiple mathematically-dissimilar submodels, WC models should be simulated by co-simulating their submodels. This can be achieved in three steps. First, all of the submodels should be converted to explicit time-driven submodels. For example, Boolean submodels should be converted to SSA submodels by assuming typical concentrations and kinetic rates. Second, all of the mathematically-similar submodels should be analytically merged into a single mathematically-equivalent submodel. Third, for WC models that are composed only of FBA, ODE, and SSA submodels, (a) the SSA submodel should be used as the master clock for the integration and synchronization of the submodels, (b) each time the SSA submodel advances to the next iteration, the FBA and ODE submodels should be synchronized with the SSA submodel and integrated for the same timestep as the SSA submodel, and (c) the SSA submodel should be synchronized with the FBA and ODE submodels. If the FBA or ODE submodels generate unphysical states such as negative concentrations, they must be rolled back and reintegrated for multiple smaller timesteps. To efficiently simulate WC models, the FBA and ODE submodels should only be evaluated periodically.
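The following sketch illustrates the idea of using an SSA submodel as the master clock for an ODE submodel. It is a toy example under several assumptions: the two submodels (mRNA birth and death, protein synthesis and decay), the rate constants, and the function names are hypothetical; the ODE is integrated with forward Euler for brevity; and an FBA submodel is omitted. It is not the algorithm of wc_sim or of any published simulator.

```python
import math
import random


def integrate_ode(protein, mrna, dt, k_tl, k_pdeg, max_halvings=20):
    """Forward-Euler integration of dP/dt = k_tl * mrna - k_pdeg * P over dt.
    If a step produces a negative amount, roll back to the initial state
    and retry with twice as many, smaller substeps."""
    n_steps = 1
    for _ in range(max_halvings):
        p, h = protein, dt / n_steps
        for _ in range(n_steps):
            p += h * (k_tl * mrna - k_pdeg * p)
            if p < 0:
                break           # unphysical state: abandon this attempt
        else:
            return p
        n_steps *= 2
    raise RuntimeError("ODE submodel produced a negative state at every step size tried")


def cosimulate(t_end, seed=0):
    """Co-simulate an SSA submodel with an ODE submodel, using the SSA
    submodel as the master clock. All rate constants are hypothetical."""
    rng = random.Random(seed)
    k_tx, k_mdeg = 2.0, 0.2     # SSA: mRNA synthesis and decay (1/min)
    k_tl, k_pdeg = 5.0, 0.05    # ODE: translation per mRNA and protein decay (1/min)
    t, mrna, protein = 0.0, 0, 0.0
    while True:
        # SSA submodel: sample the time to the next event and which event fires.
        propensities = [k_tx, k_mdeg * mrna]
        total = sum(propensities)
        dt = -math.log(1.0 - rng.random()) / total
        if t + dt > t_end:
            dt, fire = t_end - t, None      # truncate the final interval
        else:
            fire = 0 if rng.random() * total < propensities[0] else 1
        # ODE submodel: integrate over the same interval, with the mRNA count
        # held at its value at the start of the interval.
        protein = integrate_ode(protein, mrna, dt, k_tl, k_pdeg)
        t += dt
        if fire is None:
            return t, mrna, protein
        # Synchronize: apply the SSA event to the shared state.
        mrna += 1 if fire == 0 else -1
```

Passing an explicit seed makes each run reproducible, for example `cosimulate(50.0, seed=1)`.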

To efficiently simulate the combinatorial complexity represented by WC models, most submodels should be simulated using SSA and SSA should be implemented using network-free graph-based methods. Specifically, SSA should be implemented by representing each molecule as a graph, representing each reaction rule as a graph, searching for matching pairs of species-reaction graphs to determine the rate of each reaction, randomly selecting a reaction to fire, updating the species involved in the selected reaction, and using a species-reaction dependency graph to update the rates of all affected reactions. This methodology will enable WC simulations to scale to large numbers of possible species and reactions by only representing the configuration of each active molecule rather than representing the copy number of each possible species.
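A heavily simplified sketch of the network-free idea follows. It assumes a hypothetical protein with several independent phosphorylation sites and two rules with made-up rate constants. For brevity, each molecule is a list of site flags rather than a graph, and rule matches are counted directly rather than through a dependency graph. The key point it illustrates is that the state stores the configuration of each molecule instead of a copy number for each of the 2^n possible species.

```python
import math
import random


def simulate_network_free(n_molecules, n_sites, t_end, k_on=1.0, k_off=0.5, seed=0):
    """Network-free SSA for multi-site phosphorylation (hypothetical rates).

    Two rules act on any site of any molecule:
      phosphorylate:   unphosphorylated site -> phosphorylated site (k_on)
      dephosphorylate: phosphorylated site -> unphosphorylated site (k_off)
    """
    rng = random.Random(seed)
    # One configuration per molecule; no enumeration of possible species.
    molecules = [[False] * n_sites for _ in range(n_molecules)]
    n_total = n_molecules * n_sites
    n_phos = 0      # number of sites that match the dephosphorylation rule
    t = 0.0
    while True:
        # Each rule's propensity is proportional to its number of matches.
        a_on = k_on * (n_total - n_phos)
        a_off = k_off * n_phos
        a_sum = a_on + a_off
        if a_sum == 0:
            return molecules
        t += -math.log(1.0 - rng.random()) / a_sum
        if t > t_end:
            return molecules
        # Select a rule; the guards ensure that it has at least one match.
        fire_on = n_phos == 0 or (n_phos < n_total and rng.random() * a_sum < a_on)
        # Select one matching site uniformly by rejection sampling.
        while True:
            m, s = rng.randrange(n_molecules), rng.randrange(n_sites)
            if molecules[m][s] != fire_on:
                break
        # Update only the affected molecule and the affected match count.
        molecules[m][s] = fire_on
        n_phos += 1 if fire_on else -1
```

With 50 molecules and 8 sites, the state holds 400 flags, whereas a network-based simulation would need to track counts for up to 256 distinct species and the reactions among them.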

To simulate WC models quickly, WC models should be simulated using a distributed simulation framework such as parallel discrete event simulation (PDES) and by partitioning WC models into cliques of tightly connected species and reactions.

To make WC simulations comprehensible and reproducible, WC simulations should be represented using a common format such as SED-ML or SESSL.

1.5.2.6. Visualization and analysis of simulation results¶

The sixth step of WC modeling is to visualize and analyze WC simulation results to discover new biology, personalize medicine, or design microbial genomes (Figure 1.6f). First, all of the metadata needed to understand and reproduce simulation results should be recorded, including the model, the version of the model, the parameter values, and the random number generator seed that were used. Second, simulation results should be logged and stored in HDF5 format [150]. Third, WC simulation results and their metadata should be organized using a tool such as WholeCellSimDB that helps researchers search, slice, reduce, and share simulation results. Fourth, researchers should use tools such as WholeCellViz to visually analyze WC simulation results and use visualization grammars such as Vega [180] to develop custom diagrams.
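The metadata described above can be captured with a few lines of standard-library Python. This is a sketch only: the field names, the toy simulation, and the use of a SHA-256 digest as a record identifier are assumptions made for illustration, and writing the results to HDF5 is omitted because it requires a third-party library.

```python
import hashlib
import json
import random


def run_with_metadata(model_name, model_version, parameters, seed):
    """Run a toy stochastic simulation and return (results, metadata, metadata_id)."""
    rng = random.Random(seed)
    # Stand-in for a real simulation: five draws from a normal distribution.
    results = [rng.gauss(parameters["mean"], parameters["sd"]) for _ in range(5)]
    metadata = {
        "model": model_name,
        "model_version": model_version,
        "parameters": parameters,
        "random_seed": seed,
    }
    # A digest of the sorted metadata gives a stable key for indexing results.
    encoded = json.dumps(metadata, sort_keys=True).encode("utf-8")
    metadata_id = hashlib.sha256(encoded).hexdigest()
    return results, metadata, metadata_id
```

Rerunning with the same recorded model version, parameters, and seed reproduces the same results, which is what makes the metadata sufficient for reproduction.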

1.6. Latest WC models and their limitations¶

Because it is not yet possible to completely model a cell, researchers are pursuing several complementary approaches to modeling entire cells. Historically, researchers such as Michael Shuler focused on building coarse-grained models of the major functions of cells [25][181]. Over the last ten years, researchers have begun to leverage the growing wealth of experimental data and our increasing computational power to build fine-grained models of the molecular biology of entire cells. This includes bottom-up efforts to represent the contribution of each gene to cellular behavior starting from genome sequences and annotations [27], top-down efforts to represent the integrated behavior of each cellular process, and bottom-up efforts to model diffusion at the cell scale [182][183][26]. More recently, researchers have begun to merge these fine-grained approaches. For example, Luthey-Schulten and her colleagues recently demonstrated a hybrid FBA-diffusion model of E. coli [184]. Here, we describe recent progress in each of these major approaches to WC modeling.

1.6.1. Coarse-grained models¶

In addition to fine-grained models, researchers have also developed several coarse-grained models of multiple cellular processes [25][181]. These models could be used to help inform the global structure and mathematical behavior of WC models. However, they generally cannot be directly incorporated into WC models because they use coarse-grained representations that are incompatible with that of fine-grained WC models.

1.6.2. Genomically-centric bottom-up fine-grained models¶

As a step toward WC models, we and others recently demonstrated the first model that represents every characterized gene function of a cell [27] (Figure 1.7a). The model represents 28 pathways of M. genitalium. The model was developed by annotating the M. genitalium genome, reconstructing the species encoded by each gene and the reactions catalyzed by each gene using data from over 900 databases and publications, partitioning the species and reactions into 28 pathways, developing separate submodels of each pathway, and integrating the submodels into a single model. To help us organize the data used to build the model, we developed WholeCellKB, a pathway/genome database (PGDB) software system tailored for WC modeling [30], and developed scripts to generate the model from the PGDB.

Figure 1.7 A WC model of M. genitalium predicts high-level cellular behaviors from the molecular level. (a) The model combines multiple submodels of individual cellular subsystems. We validated the model by comparing its outputs to experimental data which describes its rate of growth (b) and RNA polymerase occupancy (c). We have used the model to understand how cells regulate their cell cycle (d) and allocate energy (e).¶

To capture our varying level of knowledge about each pathway, we described each pathway using the most appropriate mathematical representation. For example, we represented transcription and translation as stochastic models, represented metabolism using FBA, and represented cell division with ODEs. We combined the submodels into a single model by mapping their inputs and outputs onto a common set of global variables that we formed by taking the union of the state variables of the individual submodels.

We developed a novel algorithm to simulate the combined model by co-simulating the submodels. The algorithm co-simulated the submodels by partitioning the copy number variables into separate pools for each submodel proportional to their anticipated consumption, iteratively integrating the submodels, updating the global variables by merging the pools associated with the submodels, and updating all other state variables. To help us analyze the model’s simulation results, we also developed WholeCellSimDB, a database for organizing, storing, and sharing WC simulation results [158] and WholeCellViz, a web-based software tool for visualizing high-dimensional WC simulation results in their biological context [161].
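The partitioning step can be illustrated with a short sketch. It is an illustration of proportional partitioning in general, not the published implementation: the function name, the integer demand estimates, and the largest-remainder rounding are assumptions chosen so that the pools are integers that sum exactly to the shared copy number.

```python
def partition_pools(copy_number, anticipated_demand):
    """Split an integer copy number among submodels in proportion to their
    anticipated consumption (non-negative integers), using largest-remainder
    rounding so that the pools sum exactly to the original copy number."""
    total = sum(anticipated_demand.values())
    if total == 0:
        # No submodel expects to consume this species; nothing is allocated.
        return {name: 0 for name in anticipated_demand}
    pools, remainders = {}, {}
    for name, demand in anticipated_demand.items():
        # Integer arithmetic avoids floating-point rounding problems.
        pools[name], remainders[name] = divmod(copy_number * demand, total)
    leftover = copy_number - sum(pools.values())
    # Hand the remaining molecules to the largest fractional shares.
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        pools[name] += 1
    return pools


# Hypothetical example: 1,000 ATP molecules shared by three submodels.
atp_pools = partition_pools(
    1000, {"transcription": 300, "translation": 600, "replication": 100}
)
```

After each submodel has been integrated against its own pool, the global copy number is recovered by summing the pools, which corresponds to the merging step described above.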

We calibrated the model by constructing a set of reduced models that focused on each pathway submodel, calibrating the individual submodels, and using the parameter values learned from calibrating the individual submodels as a starting point for calibrating the entire model [179].

We validated the model by constructing numerous reduced models that focused on individual submodels and groups of submodels, checking that the submodels and groups of submodels are consistent with our knowledge such as the Central Dogma, and checking that the submodels and groups of submodels are consistent with the experimental data that we used to build the model and additional independent experimental data (Figure 1.7b,c). In particular, we demonstrated that the model recapitulates the observed M. genitalium growth rate and predicts the essentiality of each gene with 80% accuracy.

In addition, we have used the model to demonstrate how WC models could be used to help design synthetic circuits [185] and how WC models could help reposition antibiotics among distant bacteria [186].

Despite this progress, the model does not represent several important cell functions such as the maintenance of electrochemical gradients across the cell membrane, and the model mispredicts several important phenotypes such as the growth rates of many single-gene deletion strains. Furthermore, the model took over 10 person-years to construct because it was largely built by hand; the model is difficult to understand, reuse, and extend because it was described directly in terms of its numerical simulation rather than using a high-level format such as SBML; the model’s simulation software is not reusable because it was built to simulate a single model; and the model’s simulation algorithm violates the arrow of time and is unscalable because it only partitions a portion of the state variables among the submodels.

1.6.3. Physiologically-centric top-down fine-grained models¶

In parallel, researchers such as Edda Klipp are taking a top-down, physiologically-centric approach to WC modeling that complements our genomically-centric bottom-up approach. In contrast to our approach, which starts from annotated genomes, Edda Klipp and her colleagues are modeling entire cells by enumerating the major processes present in cells, developing submodels of each process, and combining the submodels into a single model.

1.6.4. Spatially-centric bottom-up fine-grained models¶

In parallel, researchers such as Elijah Roberts and Zaida Luthey-Schulten are taking another complementary spatially-centric approach to WC modeling [182][183][26]. This approach focuses on representing the spatial distribution and diffusion of each molecular species, and uses molecular dynamics simulation methods to predict their spatiotemporal dynamics. However, because it is computationally expensive to simulate diffusion on the scale of entire cells, this approach is currently limited to second-scale simulations.

1.6.5. Hybrid models¶

As introduced above, Zaida Luthey-Schulten and her colleagues have begun to merge these fine-grained approaches to WC modeling by combining a diffusion model with an FBA model [184].

1.7. Bottlenecks to more comprehensive and predictive WC models¶

In the previous sections, we described how we and others are beginning to build WC models. Despite this progress, it is still challenging to build and simulate WC models. To help focus the community’s efforts to accelerate WC modeling, here, we summarize the major remaining bottlenecks to WC modeling (Figure 1.8). These bottlenecks are based on our own experience and a community survey of the bottlenecks to biomodeling that we conducted in 2017 [16]. In the following sections, we suggest ways to overcome these bottlenecks.

Figure 1.8 Major bottlenecks to WC modeling and the major methods, tools, and resources needed to advance WC modeling.¶

1.7.1. Inadequate experimental methods and data repositories¶

In our opinion, one of the biggest bottlenecks to WC modeling is collecting and aggregating enough high-quality experimental data to build WC models. This is a significant bottleneck because WC models require extensive data, and because, as described in Section 1.3.1, we do not yet have sufficient methods for characterizing cells, sufficient tools for annotating the semantic meaning of experimental data, sufficient repositories for aggregating and integrating experimental data, and sufficient incentives for researchers to share their data.

New measurement methods, data repositories, and data aggregation tools are needed to overcome this bottleneck: (a) improved proteome-wide methods for measuring protein abundances would facilitate more accurate models of many pathways; (b) improved metabolome-wide methods for measuring metabolite concentrations would enable more accurate models of metabolism; (c) new single-cell measurement methods would facilitate more accurate models of the phenotypic variation of single cells; (d) a new central data repository that uses consistent representations, identifiers, and units would accelerate data aggregation [187]; and (e) new tools for searching this repository would help researchers identify relevant data for WC modeling, including data from related organisms and environments.

1.7.2. Incomplete, inconsistent, scattered, and poorly annotated pathway models¶

As discussed in Section 1.5, the most promising strategy for building WC models is to combine multiple separate models. However, the lack of a complete set of compatible, well-annotated, and high-quality pathway models is a major bottleneck to WC modeling [80][81][188][24]. Here, we summarize the limitations of our pathway models.

1.7.2.1. Incomplete models¶

Despite decades of modeling research and detailed models of several pathways, we still do not have models of most pathways. For example, we do not have models of the numerous DNA repair mechanisms, the mechanisms responsible for RNA editing, or the role of chaperones in protein folding.

1.7.2.2. Poorly validated and unreliable models¶

Many of our existing pathway models are not sufficiently validated or reliable to be effective components of WC models. Furthermore, few models are published with enough information about what data was used to validate the model, which simulation predictions were validated, and which simulation predictions are reliable. Without this information, other researchers cannot know the limitations of a model or how to properly reuse it.

1.7.2.3. Inconsistent models¶

Furthermore, many of our existing pathway models are inconsistent. In particular, many existing models are described with different assumptions, granularities, mathematical representations, identifiers, units, and formats.

1.7.2.4. Unpublished and scattered models¶

Unfortunately, our published models are scattered across a large number of resources, including model repositories such as BioModels, Simtk, supplementary materials, GitHub, and individual lab web pages, and many reported models are never published.

1.7.2.5. Incompletely annotated models¶

Many reported models are also not sufficiently well-annotated to combine them into WC models. For example, the biological semantic meaning of a model is often not annotated. This makes it difficult for other researchers to understand the meaning of each variable and equation which, in turn, makes it difficult for other researchers to merge models. The provenance of a model is also rarely annotated. This makes it difficult for other researchers to understand how a model was calibrated, recalibrate the model to represent a different organism and/or condition, and merge a model with models of other organisms and/or conditions. In addition, the assumptions of a model are also rarely annotated. Similarly, this makes it difficult for other researchers to understand how a model was developed, revise a model to represent other organisms and conditions, and merge models from different organisms and conditions.

1.7.3. Inadequate software tools for WC modeling¶

As described in Section 1.4, a wide range of tools have been developed for modeling individual pathways. However, few of these tools support all of the features needed for WC modeling. In particular, few of these tools support the scale required for WC modeling, few of these tools support composite, multi-algorithmic modeling, few of these tools support collaboration, and these tools do not support all of the metadata needed to understand models and their provenance.

1.7.4. Inadequate model formats¶

As described in Section 1.4.2.5, several formats have been developed to describe cell models. However, the lack of a format that supports all of the features needed for WC modeling is a major bottleneck. In particular, no existing format can represent (a) the combinatorial complexity of pathways such as transcription elongation which involve billions of sequence-based reactions; (b) the multiple scales that must be represented by WC models such as the sequence of each protein, the subunit composition of each complex, and the DNA binding of each complex; and (c) multi-algorithmic models that are composed of multiple mathematically-distinct submodels [106].

1.7.5. Lack of coordination among the cell modeling community¶

Another major bottleneck to WC modeling is the lack of coordination among the cell modeling community. Currently, the lack of coordination leads modelers to build competing models of the same pathways and describe models with inconsistent identifiers and formats.

1.8. Technologies needed to advance WC modeling¶

In the previous section, we outlined the major remaining bottlenecks to WC modeling. To overcome these bottlenecks, we must develop a wide range of computational and experimental technologies. Here, we describe the most critically needed technologies to advance WC modeling. In the following sections, we highlight our and others’ ongoing efforts to develop these technologies.

1.8.1. Experimental methods for characterizing cells¶

While substantial data about cellular populations already exists, additional data would enable better WC models. In particular, we should develop new experimental methods for quantitating the dynamics and single-cell variation of each metabolite and protein. Additionally, we should develop methods for measuring kinetic parameters at the interactome scale, as well as methods for measuring cellular phenotypes across multiple genetic and environmental conditions.

1.8.2. Tools for aggregating, standardizing, and integrating heterogeneous data¶

As described in Sections 1.4.1.1-1.4.1.2, extensive data is now available for WC modeling. However, this data spans a wide range of data types, organisms, and environments; the data is often not annotated and normalized; it is scattered across many repositories and publications; and it is described using inconsistent identifiers and units. To make this data more usable for modeling, we must develop tools for aggregating data from multiple sources; merging data from multiple specimens, environmental conditions, and experimental procedures; standardizing data to common identifiers and units; identifying the most relevant data for a model; and averaging across multiple imprecise and noisy observations.

1.8.3. Tools for scalably designing models from large datasets¶

To scalably build WC models, we must develop tools for defining the interfaces among pathway submodels, collaboratively designing composite, multi-algorithmic models directly from large datasets, automatically identifying inconsistencies and gaps in dynamical models, recording how data and assumptions are used to build models, and encoding models in a rule-based format. As described in Section 1.4.2.2-1.4.2.4, several tools support each of these features. To accelerate WC modeling, we should develop a single tool that supports all of these functions at the scale required for WC modeling.

1.8.4. Rule-based format for representing models¶

Several formats can represent individual biological processes. However, no existing format is well-suited to representing the scale or mathematical diversity required for WC modeling [106][189]. To succinctly represent WC models, we should develop a rule-based format that can (a) represent models in terms of high-level biological constructs such as DNA, RNA, and proteins; (b) represent each molecular species at multiple levels of granularity (for example, as a single species, as a set of sites, and as a sequence); (c) represent all of the combinatorial complexity of molecular biology including the complexity of interactions among protein sites, as well as the complexity of protein-metabolite, protein-DNA, and protein-RNA interactions and the complexity of template-based polymerization reactions such as the combinatorial number of RNA that arise from the interaction of RNA splicing, editing, and mutations; (d) represent composite, multi-algorithmic models; (e) represent the biological semantic meaning of each species and interaction using database-independent formats such as InChI [178] and DNA, RNA, and protein sequences; and (f) represent model provenance including the data and assumptions used to build models.

1.8.5. Scalable network-free, multi-algorithmic simulator¶

To simultaneously represent well-characterized pathways with fine detail and coarsely represent poorly-characterized pathways, WC modeling requires a multi-algorithmic simulator that can scalably co-simulate mathematically-dissimilar submodels that are described using rule patterns. However, no existing simulator supports network-free, multi-algorithmic, and parallel simulation. To scalably simulate WC models, we should develop a parallel, network-free, multi-algorithmic simulator [190]. At a minimum, the simulator should support FBA, ODE integration, and stochastic simulation.

1.8.6. Scalable tools for calibrating models¶

As discussed in Section 1.4.2.9, several tools are available for calibrating small single-algorithm models. However, these tools are not well-suited to calibrating large multi-algorithmic models. To calibrate WC models, we must develop new methods and software tools for scalably calibrating rule-based multi-algorithmic models. We and others have begun to explore using reduced models to efficiently calibrate WC models [179]. However, further work is needed to formalize these methods, including developing automated methods for reducing WC models.

1.8.7. Scalable tools for verifying models¶

To fulfill our vision of using WC models to drive medicine and bioengineering, it will be critical for modelers to rigorously verify that WC models function as intended. As discussed in Section 1.4.2.10, researchers are beginning to adapt tools from computer science and software engineering to verify cell models. However, none of the existing or planned tools support rule-based, multi-algorithmic models. To help modelers verify WC models, we must adapt formal verification and/or unit testing for WC modeling. Furthermore, to help researchers quickly verify models, these tools should help researchers verify entire WC models, as well as help researchers verify reduced models and individual submodels.

1.8.8. Additional tools that would help accelerate WC modeling¶

In addition to these essential tools, we believe that WC modeling would also be accelerated by additional tools for annotating and imputing data, additional tools for sharing WC models and simulation results, additional tools for visualizing simulation results, and community standards for designing, annotating, and verifying WC models.

Tools and standards for annotating data. To make our experimental data more useful for modeling, we should develop software tools that help researchers annotate their data and encourage experimentalists to use these tools to annotate their data.

Bioinformatics prediction tools. While existing bioinformatics tools can predict many properties of metabolites, DNA, RNA, and proteins, additional tools are needed to accurately predict the molecular effects of insertions, deletions, and structural variants. Such tools would help researchers use WC models to design microbial genomes and predict the phenotypes of individual patients.

Repositories for WC models. To help researchers share whole-cell models, BioModels and other model repositories should be extended to support WC models. In addition, these repositories should be extended to support provenance metadata, validation metadata, simulation experiments, and simulation results.

Version control system for WC models. To help researchers collaboratively develop WC models, we should develop a version control system for tracking the changes to WC models contributed by individual collaborators and merging WC model components developed by collaborators. This system could be developed by combining Git [149] with a custom program for differencing WC models.
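As a rough sketch of what a program for differencing WC models could compute, the Python function below compares two versions of a model and reports the components that were added, removed, or changed. It assumes, purely for illustration, that each model is stored as a dictionary that maps component identifiers to dictionaries of attributes; the identifiers and attribute names are hypothetical.

```python
def diff_models(old, new):
    """Compare two models stored as {component_id: {attribute: value}} dictionaries."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = {}
    for key in sorted(old.keys() & new.keys()):
        if old[key] != new[key]:
            attributes = old[key].keys() | new[key].keys()
            # Record (old value, new value) for each attribute that differs.
            changed[key] = {
                name: (old[key].get(name), new[key].get(name))
                for name in sorted(attributes)
                if old[key].get(name) != new[key].get(name)
            }
    return {"added": added, "removed": removed, "changed": changed}


base = {
    "rxn_1": {"equation": "A -> B", "k_cat": 1.0},
    "rxn_2": {"equation": "B -> C", "k_cat": 0.5},
}
edited = {
    "rxn_1": {"equation": "A -> B", "k_cat": 2.0},
    "rxn_3": {"equation": "C -> D", "k_cat": 0.1},
}
print(diff_models(base, edited))
```

Unlike a line-based text diff, a structural comparison of this kind is insensitive to the order in which components are serialized.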

Simulation format. SED-ML and SESSL can represent simulations of models that are encoded in XML-based formats such as SBML and Java-based formats such as ML-Rules. However, neither is well-suited to representing simulations of models that are encoded in other formats such as BioNetGen. To accelerate WC modeling, we should extend SED-ML to support non-XML-based models or extend SESSL to support other programming languages such as Python and C++.

Database for organizing simulation results. We and others have begun to develop tools for organizing simulation results. However, these tools have limited functionality. To help researchers analyze WC simulation results, we must develop an improved database for simulation results that helps researchers quickly search simulation results for specific features and quickly retrieve specific slices of large simulation results datasets. This database should be implemented using a distributed database and/or data processing technologies such as Apache Spark.

Tools for visualizing simulation results. We and others have also begun to develop tools for visualizing high-dimensional simulation results. However, these tools have limited functionality, they are not easily extensible, and they struggle to handle large datasets. To help researchers analyze WC models to gain new biological insights, we must develop a new tool for visually exploring and analyzing WC simulation results. To enable researchers to incorporate new visual layouts, this tool should support a standard visualization grammar such as Vega [180]. Furthermore, to handle terabyte-scale simulation result datasets, this tool should be implemented using a high-performance visualization toolkit such as VTK [191].

Community standards. To facilitate collaboration, we should develop guidelines for designing WC models, standards for annotating and verifying WC models, and a protocol for merging WC model components. The model design guidelines should describe the preferred granularity of WC model components and the preferred interfaces among WC model components. The standards for annotating and verifying WC models should describe the minimum acceptable semantic and provenance metadata for WC models. The protocol for merging WC model components should describe how to incorporate a new component into a WC model, how to test the new component and the merged model, and how to either accept the new component or reject the candidate component if it cannot be verified or is not properly annotated.

1.9. A plan for achieving comprehensive WC models as a community¶

In the previous sections, we described the potential of WC models to advance medicine and bioengineering, summarized the major bottlenecks to WC modeling, and outlined several technological solutions to these bottlenecks. To make the most of our efforts to achieve WC models, we believe that we should begin to develop a concrete plan. Here, we propose a three-phase plan to achieve the first comprehensive WC model (Figure 1.9). The plan focuses on developing a WC model of H1-hESCs because we believe that the community should initially focus on a single cell line and because H1-hESCs are relatively easy to culture, well-characterized, karyotypically and phenotypically “normal”, genomically stable, and relevant to a wide range of basic science, medicine, and bioengineering. Although the plan focuses on a single cell line, the methods and tools developed under the plan would be applicable to any organism, and the H1-hESC model could be contextualized to represent other cell lines, cell types, and individuals.

Figure 1.9 The first WC models can be achieved in three phases: (1) demonstrating the feasibility of WC models by developing scalable modeling tools and using them to model several core processes, (2) demonstrating the feasibility of collaborative modeling by developing a collaborative modeling platform and using it to model additional processes, and (3) developing a comprehensive model as a community.¶

1.9.1. Phase I: Piloting the core technologies and concepts of WC modeling¶

Phase I should demonstrate the feasibility of WC models by developing the core technologies needed for WC modeling, and using these tools to build a model of a few critical pathways of H1-hESCs. First, we should develop tools for aggregating the data needed for WC modeling, tools for designing models directly from data, a rule-based format for describing models, tools for quickly simulating multi-algorithmic models, tools for efficiently calibrating and validating high-dimensional models, and tools for visualizing and analyzing high-dimensional simulation results. Second, a small group of researchers should use these tools and public data to build a model of the core pathways of H1-hESCs including several key signal transduction pathways, metabolism, DNA replication, transcription, translation, and RNA and protein degradation. Phase I should also begin to form a WC modeling community by organizing meetings and courses, developing WC modeling training materials, and discussing potential WC modeling standards.

1.9.2. Phase II: Piloting collaborative WC modeling¶

Phase II should focus on demonstrating the feasibility of collaborative WC modeling by developing collaborative modeling tools, and using them to expand the H1-hESC model begun in Phase I. First, we should combine the technologies developed in Phase I into a collaborative web-based WC modeling platform to enable multiple experts to build models together. Second, the community should develop standards for describing, validating, and merging submodels. Third, a modest consortium of modelers and experimentalists should expand the H1-hESC model developed in Phase I by partitioning H1-hESCs into distinct pathways, outlining the interfaces among these pathways, and tasking individual researchers with modeling additional pathways such as cell cycle regulation, DNA repair, and cell division. Fourth, we should extensively validate the combined model. Phase II should also continue to develop the fundamental technologies needed for WC modeling and continue to build a WC community by organizing meetings, courses, and other community events.

1.9.3. Phase III: Community modeling and model validation¶

Phase III should produce the first comprehensive WC model. First, we should assemble a large community of modelers and experimentalists and train them to use the platform developed in Phases I and II. Second, individual researchers should volunteer to model individual pathways and merge them into the global H1-hESC model. Third, we should continue to validate the combined model. Fourth, researchers should use the model to generate testable hypotheses to discover new biology, new disease mechanisms, and new drug targets. Fifth, we should also begin to develop methods for contextualizing the H1-hESC model to represent other cell lines, cell types, and individuals. In addition, the community should continue to develop the core technologies and standards needed for WC modeling, continue to refine the partitioning of cells into pathways, continue to refine the interfaces among the pathways, continue to organize meetings and courses, and continue to develop WC modeling tutorials.

1.10. Ongoing efforts to advance WC modeling¶

In the previous section, we proposed a plan for achieving the first comprehensive WC model as a community. Although we do not yet have an organized WC modeling community, we and others are beginning to pilot WC models and the technology needed to achieve them. Here, we summarize the ongoing efforts to pioneer WC modeling.

1.10.1. Genomically-centric models¶

Currently, three genomically-centric WC models are in development: models of Mycoplasma pneumoniae, E. coli, and H1-hESCs.

1.10.1.1. Mycoplasma pneumoniae¶

To explore how to build more comprehensive and more accurate models, we are working with Drs. Maria Lluch-Senar and Luis Serrano to develop a comprehensive model that represents all of the characterized genes of the bacterium M. pneumoniae.

M. pneumoniae is a small gram-positive bacterium that has one of the smallest genomes among all known free-living organisms and that is one of the most common causes of walking pneumonia. M. pneumoniae is tractable to WC modeling because it has a small genome and a small mass; because Dr. Lluch-Senar, Dr. Serrano, and others have extensively characterized M. pneumoniae; and because most of its genome is functionally annotated. However, M. pneumoniae can be difficult to characterize because it grows slowly and because there are few experimental methods for manipulating it; some aspects of M. pneumoniae are challenging to model because there is no known defined growth medium for it; and the M. pneumoniae research community is small. Because M. pneumoniae has such a small genome, M. pneumoniae is frequently used to study the minimal requirements of cellular life, explore the origins of cellular life, and pilot genome-scale synthetic biology methods such as whole-genome synthesis and genome transplantation. M. pneumoniae is also frequently studied to gain insights into the pathophysiology of walking pneumonia.

The model will be based on both genomic, transcriptomic, and proteomic data about M. pneumoniae collected by Drs. Lluch-Senar and Serrano and a broad range of biochemical and single-cell data about related species aggregated from public databases and publications. In addition to using the model to demonstrate the feasibility of more comprehensive models and drive the development of WC modeling methods, we hope to use this model to engineer a fast-growing, efficient chassis for future bioengineering projects.

1.10.1.2. Escherichia coli¶

To explore how to model more complex bacteria, Prof. Markus Covert and his group at Stanford University are modeling the model gram-negative bacterium E. coli. The project focuses on E. coli because E. coli is the best-characterized bacterium and because there are a wide variety of experimental methods for manipulating and characterizing E. coli. Because E. coli is substantially more complex than reduced bacteria such as M. genitalium and M. pneumoniae, initially, this project will focus on modeling core pathways such as metabolism, RNA and protein synthesis and degradation, DNA replication, and cell division. The model will be based primarily on data observed for E. coli aggregated from a wide range of sources. Prof. Covert and his group are using this model to demonstrate the feasibility of more comprehensive WC models, as well as gain novel insights into the pathogenesis of E. coli.

1.10.1.3. H1 human embryonic stem cells (hESCs)¶

To explore how to model eukaryotic cells, we are also beginning to model H1-hESCs. ESCs are pluripotent cells derived from the inner cell mass of a blastocyst at 4-5 days post-fertilization that can generate all three primary germ layers. We have chosen to pilot human WC models with hESCs because they are karyotypically and phenotypically “normal”; they are genomically stable; they can self-renew; and they are relevant to a wide range of basic science, medicine, and tissue engineering.

Furthermore, we have chosen to focus on H1-hESCs because they can be cultured with feeder-free media and because they have been extensively characterized. For example, H1 was one of the three cell lines most deeply characterized by the ENCODE project [192]. In addition, H1 was one of the first five hESC lines [193], H1 was the first cell line to be approved under NIH’s Guidelines for Stem Cell Research, and, as of 2010, H1 was studied in 30% of all hESC studies [194].

Because human cells are vastly more complex than bacteria, we are beginning by modeling the core pathways responsible for stem cell growth, maintenance, and self-renewal, including metabolism, transcription, translation, RNA and protein degradation, signal transduction, and cell cycle regulation. This model will also be based on both genomic, transcriptomic, and proteomic data about H1-hESCs aggregated from publications and biochemical and single-cell data about related cell lines aggregated from several databases. In addition to using the model to demonstrate the feasibility of human WC models and drive the development of WC modeling methods, we hope to use the model to gain new insights into the biochemical mechanisms responsible for regulating the rate of stem cell growth.

1.10.2. Physiologically-centric, spatially-centric, and hybrid models¶

As described in Sections 1.6.3-1.6.4, Klipp, Roberts, and others are also developing physiologically-centric models of S. cerevisiae, spatially-centric models of E. coli, and hybrid spatially-centric/FBA models of E. coli.

1.10.3. Technology development¶

Currently, we are developing technologies for three tasks: aggregating the data needed for WC modeling; concisely representing multi-algorithmic WC models using rules; and simulating rule-based, multi-algorithmic models.

1.10.3.1. Data aggregation¶

WC modeling requires a wide range of data. Unfortunately, as described in Section 1.7.1, aggregating this data is a major bottleneck to WC modeling because this data is scattered across a wide range of databases and publications. To help modelers obtain the data needed for WC modeling, we are developing a methodology for systematically and scalably identifying, aggregating, standardizing, and integrating the data needed for WC modeling, and we are developing a software program called Datanator which implements this methodology. The methodology consists of eight steps:

Aggregation. Modelers should retrieve a wide range of data from a wide range of sources such as metabolite concentrations from ECMDB, RNA concentrations from ArrayExpress, protein concentrations from PaxDb, reaction stoichiometries from KEGG, and kinetic parameters from SABIO-RK. Where possible, this should be implemented using downloads and web services. Where this is not possible, this should be implemented by scraping web pages and manually curating individual publications. Importantly, modelers should also record the provenance of each downloaded dataset.

Parsing. Modelers should parse each data source into an easily manipulatable data structure.

Standardization. Modelers should standardize the identifiers, metadata, and units of their data. The metadata should include the species and environmental conditions that were observed, the method used to measure the data, the investigators who collected the data, and the citation of the original data. We recommend using absolute identifiers such as InChI to describe all possible measurements, using ontologies such as the Measurement Method Ontology (MMO) to describe metadata consistently, and using SI units.

Integration. Modelers should merge the aggregated data into a single dataset. We recommend that modelers use relational databases such as SQLite to organize their data and make their data searchable.

Filtering. For each model parameter that modelers would like to constrain with experimental data, modelers should identify the most relevant observations within their dataset by scoring the similarity between the physical properties of the parameter and each observation, the species that they want to model and the observed species, and the environmental condition that they want to model and the observed conditions.

Reduction. For each model parameter, modelers should reduce the relevant data to constraints on the value of the parameter by calculating the mean and standard deviation of the relevant data, weighted by its similarity to the physical property, species, and environmental condition that the modeler wants to model.

Review. Because it is difficult to fully describe the context of experimental measurements and, therefore, difficult to automatically identify relevant data for a model, modelers should manually review the least relevant data to potentially select alternative observations or integrate more relevant data from other sources.

Storage. Lastly, modelers should store the reduced data and its provenance in a data structure that is conducive to building models. We recommend organizing this data using a specialized PGDB such as WholeCellKB.
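The integration, filtering, and reduction steps above can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the table schema, the parameter name, the observations, the similarity scores, and the 0.25 cutoff are all made-up placeholders, and the similarity of each observation to the modeled context is assumed to have been scored already. It is not the Datanator implementation.

```python
import math
import sqlite3


def weighted_mean_std(values, weights):
    """Similarity-weighted mean and (population) standard deviation."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    variance = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, math.sqrt(variance)


# Integration: merge standardized observations into one searchable dataset.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observation ("
    " parameter TEXT, value REAL, units TEXT, species TEXT, source TEXT,"
    " similarity REAL)"
)
observations = [  # placeholder values, not real measurements
    ("k_cat_enzyme_x", 10.0, "s^-1", "species A", "source 1", 1.0),
    ("k_cat_enzyme_x", 14.0, "s^-1", "species B", "source 2", 0.5),
    ("k_cat_enzyme_x", 50.0, "s^-1", "species C", "source 3", 0.1),
]
conn.executemany("INSERT INTO observation VALUES (?, ?, ?, ?, ?, ?)", observations)

# Filtering: keep only the observations that are sufficiently similar.
rows = conn.execute(
    "SELECT value, similarity FROM observation"
    " WHERE parameter = ? AND similarity >= ? ORDER BY similarity DESC",
    ("k_cat_enzyme_x", 0.25),
).fetchall()

# Reduction: summarize the relevant observations as a constraint on the parameter.
values = [row[0] for row in rows]
weights = [row[1] for row in rows]
mean, std = weighted_mean_std(values, weights)
print(f"k_cat_enzyme_x = {mean:.2f} +/- {std:.2f} s^-1 from {len(rows)} observations")
```

Keeping the source of each observation in the same table makes it straightforward to carry provenance through to the stored constraint.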

We have already developed a common platform which implements this methodology, and data aggregation modules for the most critical data types for WC modeling. Going forward, we plan to develop additional modules for aggregating data from a wider range of sources and we plan to develop a user-friendly web-based interface for using Datanator. In addition, we hope to explore additional data aggregation methods such as natural language processing and crowdsourcing.

1.10.3.2. Model representation¶

As described in Section 1.7.4, no existing format is well-suited to representing composite, multi-algorithmic WC models. In particular, there is no format which is well-suited to describing all of the combinatorial complexity of cellular biochemistry, representing composite, multi-algorithmic models, and representing the semantic biological meaning and provenance of models.

To accelerate WC modeling, we are developing wc_rules, a more abstract rule-based format for describing WC models. The format will be able to represent each molecular species at multiple levels of granularity (for example, as a single species, as a set of sites, and as a sequence); represent all of the combinatorial complexity of each molecular species and interaction; represent composite, multi-algorithmic models; represent the data, assumptions, and design decisions used to build models; and represent the semantic biological meaning of models. We are developing tools to export models described with wc_rules to BioNetGen and SBML, as well as a simulator for simulating models described with wc_rules.

1.10.3.3. Simulation of genomically-centric models¶

As described in Section 1.7.3, no existing simulator is well-suited to simulating computationally-expensive, high-dimensional, rule-based, multi-algorithmic WC models. In particular, there are only a few parallel simulators, only a few rule-based simulators, only a couple of multi-algorithmic simulators, and no simulator which supports all of these technologies.

To accelerate WC modeling, we are beginning to use the Viatra [195] graph transformation engine and the ROSS [117] PDES engine to develop wc_sim, a parallel, network-free, multi-algorithmic simulator that can simulate models described in wc_rules [190]. Simulations will consist of six steps:

Compile models to a low-level format. We will compile models described with wc_rules to a low-level format which can be interpreted by the simulation engine.

Merge mathematically compatible submodels. We will analytically merge all mathematically-compatible submodels, producing a model which is composed of at most one FBA, one ODE, and one SSA submodel.

Partition submodels into cliques. To use multiple machines to simulate models, we will partition models into cliques that can be simulated on separate machines with minimal communication to synchronize the cliques.

Assign cliques to cores. We will use ROSS to assign each clique to a separate machine and use event messages and rollback to synchronize their states.

Co-simulate mathematically-distinct submodels. We will co-simulate the FBA, ODE, and SSA submodels by periodically calculating the fluxes predicted by FBA and ODE models and interpolating them with each SSA event.

Rule-based simulation of SSA cliques. We will use Viatra to represent each species and reaction pattern as a graph and iteratively select reactions, fire reactions, and update the species graphs. To efficiently simulate both sparsely and densely concentrated species, we will use a hybrid population/particle representation in which each species graph will represent a species and its copy number, and we will periodically merge identical graphs that represent the same species.
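Two of the steps above lend themselves to a short sketch: merging mathematically-compatible submodels, and merging identical species graphs into populations. The Python example below is a deliberately simplified illustration, not the wc_sim implementation. It assumes that submodels are dictionaries with hypothetical "algorithm" and "reactions" keys, that merging amounts to pooling reactions, and that each species graph has already been reduced to a canonical, hashable tuple of site states so that identical graphs compare equal.

```python
from collections import Counter, defaultdict


def merge_submodels(submodels):
    """Pool the reactions of submodels that share an algorithm.

    The result contains at most one entry per algorithm (FBA, ODE, SSA).
    """
    merged = defaultdict(list)
    for submodel in submodels:
        merged[submodel["algorithm"]].extend(submodel["reactions"])
    return dict(merged)


def merge_identical_species(particles):
    """Collapse individually tracked particles into species copy numbers."""
    return Counter(particles)


submodels = [
    {"id": "transcription", "algorithm": "SSA", "reactions": ["rna_synthesis"]},
    {"id": "rna_decay", "algorithm": "SSA", "reactions": ["rna_degradation"]},
    {"id": "metabolism", "algorithm": "FBA", "reactions": ["glycolysis_1", "glycolysis_2"]},
    {"id": "signaling", "algorithm": "ODE", "reactions": ["receptor_binding"]},
]
merged = merge_submodels(submodels)
print(merged)

# Each particle is a canonical tuple of (site, state) pairs; P = phosphorylated.
particles = [
    (("site_1", "P"), ("site_2", "U")),
    (("site_1", "P"), ("site_2", "U")),
    (("site_1", "U"), ("site_2", "U")),
]
populations = merge_identical_species(particles)
print(populations)
```

In a real rule-based simulator, deciding whether two species graphs are identical requires a canonical labeling or an isomorphism check, which this sketch sidesteps by assuming canonical tuples.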

1.11. Resources for learning about WC modeling¶

To learn more about WC modeling, we recommend attending a WC modeling summer school or participating in the WC modeling forum. Below are brief descriptions of these resources.

1.11.1. Summer schools¶

We and others organize annual WC modeling summer schools [106][196][197] for graduate students and postdoctoral scholars. The schools teach the fundamental principles of WC modeling through brief lectures and hands-on exercises. The schools also provide opportunities to network with other WC researchers. Please see http://wholecell.org for information about upcoming schools.

1.11.2. Online forum¶

The WC modeling forum is an online platform which enables researchers to initiate and participate in discussions about WC modeling.

1.12. Outlook¶

Despite several challenges, we believe that WC models are rapidly becoming feasible thanks to ongoing advances in experimental and computational technology. In particular, in Section 1.9, we have proposed a three-phase plan to achieve comprehensive WC models as a community. The cornerstones of this plan include developing practical solutions to the key bottlenecks; forming a collaborative interdisciplinary community; and adhering to common interfaces, formats, identifiers, and protocols. We have already developed tools for organizing the data needed for WC modeling, organizing WC simulation results, and visualizing WC simulation results, and we have begun to organize a WC modeling community. Currently, we are developing tools for aggregating the data needed for WC modeling, concisely describing WC models, and scalably simulating WC models, and we are continuing to organize WC modeling meetings. We are eager to advance WC modeling, and hope you will join us!