1. Overview
datanator is a software tool for finding experimental data for building and calibrating dynamical models of cellular biochemistry. Such data include metabolite, RNA, and protein abundances; protein complex compositions; transcription factor binding motifs; and kinetic parameters. datanator is particularly useful for building large models, such as whole-cell models, that require large amounts of data to constrain large numbers of parameters.
1.1. Motivation
Large models such as whole-cell models are needed to help researchers, bioengineers, and physicians predict how genotype and the environment determine phenotype. In particular, large models that represent the function of each individual gene are needed to help bioengineers rationally design microorganisms to perform specific functions such as producing drugs and neutralizing pathogens and to help physicians interpret personal genomes and design personalized therapies tailored to each patient’s unique genome.
Despite their potential, large models are hard to build, in part because the large amount of data they require is scattered across numerous siloed repositories and is therefore hard to aggregate. Without data aggregation tools such as datanator, this data must be aggregated manually:
Modelers must identify appropriate data sources for their model.
Modelers must identify the relevant subset of this data for their model.
Modelers must merge inconsistently annotated data from multiple sources.
Modelers must reduce multiple experimental observations to individual consensus parameter value recommendations.
Modelers must record the provenance of each recommendation so their model is comprehensible and reproducible.
This manual process requires extensive time and effort, does not scale, introduces substantial selection bias into the data underlying models, and is irreproducible.
To address these problems, we have developed datanator to systemize and accelerate the aggregation of data for biomodels.
1.2. Methodology
datanator downloads data from the data repositories listed below.
datanator parses this data, normalizes this data to a common schema and common identifiers, and converts this data to common units.
datanator stores the normalized data and its provenance to a local SQLite database.
datanator helps researchers find relevant data to model a specific organism and environmental condition from similar species, reactions, genotypes, and environments according to the following filters:
  Chemical similarity: datanator helps researchers identify data observed for (a) molecularly-similar species as determined by the Tanimoto similarity of their molecular fingerprints and (b) chemically-similar reactions that involve similar reaction centers and reaction mechanisms.
  Taxonomic similarity: datanator helps researchers identify data observed for taxonomically-close taxa and similar genetic variants.
  Environmental similarity: datanator helps researchers identify data observed for similar temperatures, pHs, and growth media.
datanator reduces multiple relevant observations to a single consensus recommended parameter value.
datanator records the provenance of the data underlying each consensus recommendation.
datanator exports the consensus recommendations and their provenance to Excel workbooks.
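The chemical-similarity filter and the consensus step described above can be illustrated with a minimal sketch. This is not datanator's implementation or API: the `Observation` class, the function names, the 0.6 similarity threshold, and the use of the median as the consensus statistic are all assumptions made for illustration, the values are made up, and molecular fingerprints are represented simply as sets of on-bit indices.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Observation:
    """Hypothetical record: one measured value plus its provenance."""
    value: float            # assumed to be already converted to common units
    fingerprint: frozenset  # on-bit indices of a molecular fingerprint
    source: str             # where the value came from (provenance)


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def consensus(observations, query_fingerprint, min_similarity=0.6):
    """Keep sufficiently similar observations and reduce them to one value.

    Returns (consensus_value, supporting_observations). The second element
    serves as the provenance of the recommendation. The median is an
    assumed choice of statistic, not necessarily the one datanator uses.
    """
    relevant = [
        o for o in observations
        if tanimoto(o.fingerprint, query_fingerprint) >= min_similarity
    ]
    if not relevant:
        return None, []
    return median(o.value for o in relevant), relevant


# Made-up example data: three observations, one of them for a dissimilar species.
query = frozenset({1, 2, 3, 4})
observations = [
    Observation(10.0, frozenset({1, 2, 3, 4}), "source A"),  # similarity 1.0
    Observation(14.0, frozenset({1, 2, 3}), "source B"),     # similarity 0.75
    Observation(100.0, frozenset({7, 8}), "source C"),       # similarity 0.0
]
value, provenance = consensus(observations, query)
print(value, [o.source for o in provenance])  # 12.0 ['source A', 'source B']
```

Returning the supporting observations alongside the value keeps each recommendation traceable to the data it was derived from, which is the point of the provenance step.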
Currently, datanator provides researchers with a Python API to programmatically aggregate data for models. To make datanator easier to use, we plan to develop user-friendly command line and web-based interfaces that will help users find data for SBML-encoded models, review this data, and generate consensus recommendations for parameter values.
1.3. Supported data types and data sources
datanator currently supports the following data types and data sources. We are also actively integrating additional data types and data sources into datanator.
Databases
RNA abundance: ArrayExpress
Protein abundance: PaxDb
Protein complex composition: CORUM
Transcription factor binding motifs: JASPAR
Reaction kinetics: SABIO-RK
Taxonomy: NCBI Taxonomy
Prediction tools
Metabolite properties (charge, pKa, protonation): Open Babel
EC number: E-zyme
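As a rough illustration of how records from sources like those listed above might be converted to common units and stored with their provenance in a local SQLite database, consider the sketch below. The table layout, the `normalize` and `store` helpers, the conversion table, and the example values are all hypothetical; datanator's actual schema is not shown here.

```python
import sqlite3

# Assumed conversion factors to a common concentration unit (molar).
TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}


def normalize(value, units):
    """Convert a concentration to molar; raises KeyError for unknown units."""
    return value * TO_MOLAR[units]


def store(conn, records):
    """Insert (data_type, entity, value, units, source) tuples after unit conversion."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observation ("
        " data_type TEXT, entity TEXT, value_molar REAL, source TEXT)"
    )
    conn.executemany(
        "INSERT INTO observation VALUES (?, ?, ?, ?)",
        [(t, e, normalize(v, u), s) for t, e, v, u, s in records],
    )
    conn.commit()


# Made-up records reported in different units by the same kind of source.
conn = sqlite3.connect(":memory:")
store(conn, [
    ("kinetics", "Km of an example enzyme", 250.0, "uM", "SABIO-RK"),
    ("kinetics", "Km of an example enzyme", 0.3, "mM", "SABIO-RK"),
])
rows = conn.execute(
    "SELECT value_molar, source FROM observation ORDER BY value_molar"
).fetchall()
```

Storing the source next to each normalized value means that any value retrieved later can still be traced back to the repository it came from.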
1.4. Advantages of datanator versus manual data aggregation
datanator accelerates data aggregation by merging data from multiple repositories into a single location.
datanator reduces selection bias in the data used to build models by systemizing how researchers find data for models.
datanator helps researchers find more data for models by helping researchers find data from chemically-similar species and reactions, taxonomically-similar organisms, and similar environmental conditions based on their temperature, pH, and media.
datanator increases the comprehensibility and reproducibility of models by automatically tracking the provenance of each recommended parameter value.