1. Overview

datanator is a software tool for finding the experimental data needed to build and calibrate dynamical models of cellular biochemistry, such as metabolite, RNA, and protein abundances; protein complex compositions; transcription factor binding motifs; and kinetic parameters. datanator is particularly useful for building large models, such as whole-cell models, that require large amounts of data to constrain large numbers of parameters.

1.1. Motivation

Large models such as whole-cell models are needed to help researchers, bioengineers, and physicians predict how genotype and the environment determine phenotype. In particular, large models that represent the function of each individual gene are needed to help bioengineers rationally design microorganisms to perform specific functions, such as producing drugs and neutralizing pathogens, and to help physicians interpret personal genomes and design personalized therapies tailored to each patient’s unique genome.

Despite their potential, large models are hard to build, in part because the large amount of data they require is scattered across numerous siloed repositories and is therefore hard to aggregate. Without data aggregation tools such as datanator, this data must be aggregated manually:

- Modelers must identify appropriate data sources for their model.
- Modelers must identify the relevant subset of this data for their model.
- Modelers must merge inconsistently annotated data from multiple sources.
- Modelers must reduce multiple experimental observations to individual consensus parameter value recommendations.
- Modelers must record the provenance of each recommendation so their model is comprehensible and reproducible.

This requires extensive time and effort, is unscalable, introduces substantial selection bias into the data underlying models, and is irreproducible.

To address these problems, we have developed datanator to systemize and accelerate the aggregation of data for biomodels.

1.2. Methodology

- datanator downloads data from the data repositories listed below.
- datanator parses this data, normalizes it to a common schema and common identifiers, and converts it to common units.
- datanator stores the normalized data and its provenance in a local SQLite database.
- datanator helps researchers find relevant data to model a specific organism and environmental condition from similar species, reactions, genotypes, and environments according to the following filters:
  - Chemical similarity: datanator helps researchers identify data observed for (a) molecularly similar species, as determined by the Tanimoto similarity of their molecular fingerprints, and (b) chemically similar reactions that involve similar reaction centers and reaction mechanisms.
  - Taxonomic similarity: datanator helps researchers identify data observed for taxonomically close taxa and similar genetic variants.
  - Environmental similarity: datanator helps researchers identify data observed at similar temperatures and pH values and in similar growth media.
- datanator reduces multiple relevant observations to a single consensus recommended parameter value.
- datanator records the provenance of the data underlying each consensus recommendation.
- datanator exports the consensus recommendations and their provenance to Excel workbooks.
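
The filtering and consensus steps above can be illustrated with a minimal, standalone sketch. It does not use datanator's API: the function names, thresholds, record layout, and all observation values below are hypothetical and chosen only for illustration. Fingerprints are assumed to be sets of on-bit indices, and the consensus is assumed to be the median, which is one possible reduction among many.

```python
from statistics import median


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


# Hypothetical observations; every value and source label here is made up.
observations = [
    {"value": 1.0, "units": "mM", "fingerprint": {1, 2, 3, 4},
     "temperature": 37.0, "ph": 7.0, "source": "obs-A"},
    {"value": 2.0, "units": "mM", "fingerprint": {1, 2, 3, 5},
     "temperature": 30.0, "ph": 7.4, "source": "obs-B"},
    {"value": 4.0, "units": "mM", "fingerprint": {1, 2, 3, 4},
     "temperature": 37.0, "ph": 7.2, "source": "obs-C"},
    {"value": 50.0, "units": "mM", "fingerprint": {7, 8, 9},
     "temperature": 37.0, "ph": 7.0, "source": "obs-D"},  # chemically dissimilar
    {"value": 9.0, "units": "mM", "fingerprint": {1, 2, 3, 4},
     "temperature": 60.0, "ph": 7.0, "source": "obs-E"},  # temperature too far off
]


def select_relevant(observations, query_fp, temperature, ph,
                    min_similarity=0.5, max_temp_diff=10.0, max_ph_diff=1.0):
    """Keep observations that pass the chemical and environmental filters."""
    return [
        obs for obs in observations
        if tanimoto(query_fp, obs["fingerprint"]) >= min_similarity
        and abs(obs["temperature"] - temperature) <= max_temp_diff
        and abs(obs["ph"] - ph) <= max_ph_diff
    ]


def consensus(relevant):
    """Reduce observations to one recommended value, keeping their sources."""
    if not relevant:
        return None
    return {
        "value": median(obs["value"] for obs in relevant),
        "units": relevant[0]["units"],  # assumes units were already normalized
        "sources": [obs["source"] for obs in relevant],
    }


relevant = select_relevant(observations, {1, 2, 3, 4}, temperature=37.0, ph=7.0)
recommendation = consensus(relevant)
print(recommendation)
```

Keeping the list of contributing sources next to the consensus value is what lets each recommendation be traced back to the observations that produced it.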

Currently, datanator provides researchers with a Python API to programmatically aggregate data for models. To make datanator easier to use, we plan to develop user-friendly command-line and web-based interfaces that will help users find data for SBML-encoded models, review this data, and generate consensus recommendations for parameter values.

1.3. Supported data types and data sources

datanator currently supports the following data types and data sources. We are also actively integrating additional data types and data sources into datanator.

Databases
- Metabolite concentrations: ECMDB and YMDB
- RNA abundance: ArrayExpress
- Protein abundance: PaxDb
- Protein complex composition: CORUM
- Transcription factor binding motifs: JASPAR
- Reaction kinetics: SABIO-RK
- Taxonomy: NCBI Taxonomy
Prediction tools
- Metabolite properties (charge, pK_a, protonation): Open Babel
- EC number: E-zyme

1.4. Advantages of `datanator` versus manual data aggregation

- datanator accelerates data aggregation by merging data from multiple repositories into a single location.
- datanator reduces selection bias in the data used to build models by systemizing how researchers find data for models.
- datanator helps researchers find more data for models by also retrieving data from chemically similar species and reactions, taxonomically similar organisms, and similar environmental conditions (temperature, pH, and media).
- datanator increases the comprehensibility and reproducibility of models by automatically tracking the provenance of each recommended parameter value.
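
As a rough illustration of how merged data and its provenance might be kept together in a single SQLite database, the sketch below uses Python's standard `sqlite3` module. The schema, table and column names, the `to_molar` helper, and the source labels and values are all hypothetical; datanator's actual schema is not described here.

```python
import sqlite3

# Conversion factors to a common unit (molar); an illustrative subset only.
TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6}


def to_molar(value, units):
    """Convert a concentration to molar so observations are comparable."""
    return value * TO_MOLAR[units]


conn = sqlite3.connect(":memory:")  # an on-disk path would be used in practice
conn.executescript("""
CREATE TABLE source (
    id INTEGER PRIMARY KEY,
    database_name TEXT NOT NULL,
    record_id TEXT NOT NULL
);
CREATE TABLE observation (
    id INTEGER PRIMARY KEY,
    entity TEXT NOT NULL,
    property TEXT NOT NULL,
    value REAL NOT NULL,
    units TEXT NOT NULL,
    source_id INTEGER NOT NULL REFERENCES source(id)
);
""")


def add_observation(conn, entity, prop, value, units, database_name, record_id):
    """Store a normalized observation together with the record it came from."""
    cur = conn.execute(
        "INSERT INTO source (database_name, record_id) VALUES (?, ?)",
        (database_name, record_id),
    )
    conn.execute(
        "INSERT INTO observation (entity, property, value, units, source_id) "
        "VALUES (?, ?, ?, ?, ?)",
        (entity, prop, to_molar(value, units), "M", cur.lastrowid),
    )
    conn.commit()


# Hypothetical records from two placeholder repositories.
add_observation(conn, "ATP", "concentration", 2.0, "mM", "example-db-1", "rec-001")
add_observation(conn, "ATP", "concentration", 500.0, "uM", "example-db-2", "rec-042")

# Every stored value can be traced back to its source record.
rows = conn.execute(
    "SELECT o.value, o.units, s.database_name, s.record_id "
    "FROM observation o JOIN source s ON s.id = o.source_id "
    "WHERE o.entity = ? ORDER BY o.id",
    ("ATP",),
).fetchall()
for row in rows:
    print(row)
```

Storing the source in its own table and referencing it from each observation means a recommended value can always be joined back to the repository record it was derived from.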
