datanator is a software tool for finding experimental data, such as metabolite, RNA, and protein abundances; protein complex compositions; transcription factor binding motifs; and kinetic parameters, for building and calibrating dynamical models of cellular biochemistry.
datanator is particularly useful for building large models such as whole-cell models that require large amounts of data to constrain large numbers of parameters.
Large models such as whole-cell models are needed to help researchers, bioengineers, and physicians predict how genotype and the environment determine phenotype. In particular, large models that represent the function of each individual gene are needed to help bioengineers rationally design microorganisms to perform specific functions such as producing drugs and neutralizing pathogens and to help physicians interpret personal genomes and design personalized therapies tailored to each patient’s unique genome.
Despite their potential, large models are hard to build, in part because the large amount of data they require is scattered across numerous siloed repositories. Without data aggregation tools such as datanator, this data must be aggregated manually:
Modelers must identify appropriate data sources for their model.
Modelers must identify the relevant subset of this data for their model.
Modelers must merge inconsistently annotated data from multiple sources.
Modelers must reduce multiple experimental observations to individual consensus parameter value recommendations.
Modelers must record the provenance of each recommendation so their model is comprehensible and reproducible.
This manual process requires extensive time and effort, does not scale, introduces substantial selection bias into the data underlying models, and is irreproducible.
To address these problems, we have developed datanator to systemize and accelerate the aggregation of data for biomodels.
datanator downloads data from the data repositories listed below.
datanator parses this data, normalizes this data to a common schema and common identifiers, and converts this data to common units.
datanator stores the normalized data and its provenance to a local SQLite database.
datanator helps researchers find relevant data to model a specific organism and environmental condition from similar species, reactions, genotypes, and environments according to the following filters:
datanator helps researchers identify data observed for (a) molecularly-similar species as determined by the Tanimoto similarity of their molecular fingerprints and (b) chemically-similar reactions that involve similar reaction centers and reaction mechanisms.
datanator helps researchers identify data observed for taxonomically-close taxa and similar genetic variants.
datanator helps researchers identify data observed for similar temperatures, pHs, and growth media.
datanator reduces multiple relevant observations to a single consensus recommended parameter value.
datanator records the provenance of the data underlying each consensus recommendation.
datanator exports the consensus recommendations and their provenance to Excel workbooks.
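The molecular-similarity filter above relies on the Tanimoto similarity of molecular fingerprints. The following is a minimal sketch of that measure, not datanator's own implementation or API. It assumes each fingerprint is represented as a set of "on" bit indices. The function names, the example fingerprints, and the 0.3 threshold are all hypothetical and chosen only for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bit indices."""
    if not fp_a and not fp_b:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 1.0
    shared = len(fp_a & fp_b)
    # |A ∩ B| / |A ∪ B|, with the union size computed from the set sizes.
    return shared / (len(fp_a) + len(fp_b) - shared)


def rank_by_similarity(query_fp, candidates, threshold):
    """Return (name, score) pairs scoring at least `threshold`, most similar first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in candidates.items()]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical fingerprints; real ones would come from a cheminformatics toolkit.
query = frozenset({1, 4, 7, 9})
candidates = {
    "metabolite_a": frozenset({1, 4, 7, 9}),
    "metabolite_b": frozenset({1, 4, 8}),
    "metabolite_c": frozenset({2, 3}),
}

ranked = rank_by_similarity(query, candidates, threshold=0.3)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

A stricter threshold trades recall for confidence that data observed for one species can stand in for another.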
datanator provides researchers a Python API to programmatically aggregate data for models. To make datanator easier to use, we plan to develop user-friendly command line and web-based interfaces which will help users find data for SBML-encoded models, review this data, and generate consensus recommendations for parameter values.
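To make the idea of a consensus recommendation with provenance concrete, here is a minimal sketch that does not use datanator's Python API. It assumes the median as the consensus statistic, which is an illustrative choice and not a method confirmed by this documentation. The `Observation` and `Consensus` classes, the `recommend` function, and the source labels are hypothetical.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Observation:
    value: float
    units: str
    source: str  # hypothetical label for the repository record of the observation


@dataclass(frozen=True)
class Consensus:
    value: float
    units: str
    method: str
    sources: tuple  # provenance: every observation that fed the recommendation


def recommend(observations):
    """Reduce several observations to one value while keeping their provenance."""
    if not observations:
        raise ValueError("at least one observation is required")
    units = {obs.units for obs in observations}
    if len(units) != 1:
        raise ValueError("convert all observations to common units first")
    return Consensus(
        value=median(obs.value for obs in observations),
        units=units.pop(),
        method="median",
        sources=tuple(obs.source for obs in observations),
    )


# Hypothetical observations of one kinetic parameter, already in common units.
observations = [
    Observation(1.2, "1/s", "repository_x:record_1"),
    Observation(1.5, "1/s", "repository_x:record_2"),
    Observation(9.0, "1/s", "repository_y:record_7"),
]

rec = recommend(observations)
print(rec.value, rec.units, rec.method, rec.sources)
```

The median is used here because it is less sensitive to a single outlying observation than the mean. Keeping the sources on the result is what lets a reviewer trace a recommended value back to the underlying data.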
1.3. Supported data types and data sources
datanator currently supports the following data types and data sources. We are also actively integrating additional data types and data sources into datanator.
1.4. Advantages of datanator versus manual data aggregation
datanator accelerates data aggregation by merging data from multiple repositories into a single location.
datanator reduces selection bias in the data used to build models by systemizing how researchers find data for models.
datanator helps researchers find more data for models by surfacing data from chemically-similar species and reactions, taxonomically-similar organisms, and similar environmental conditions (temperature, pH, and media).
datanator increases the comprehensibility and reproducibility of models by automatically tracking the provenance of each recommended parameter value.
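As an illustration of how provenance can be kept alongside normalized data in SQLite, the sketch below uses Python's standard `sqlite3` module with an in-memory database. The two-table schema, the column names, and the example record are assumptions made for this example and do not reflect datanator's actual schema.

```python
import sqlite3

# In-memory database for illustration; a local file path would persist the data.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source (
        id INTEGER PRIMARY KEY,
        repository TEXT NOT NULL,
        record_id TEXT NOT NULL
    );
    CREATE TABLE observation (
        id INTEGER PRIMARY KEY,
        parameter TEXT NOT NULL,
        value REAL NOT NULL,
        units TEXT NOT NULL,
        source_id INTEGER NOT NULL REFERENCES source(id)
    );
    """
)

# Hypothetical source record and observation.
conn.execute(
    "INSERT INTO source (id, repository, record_id) VALUES (?, ?, ?)",
    (1, "example_repository", "record-001"),
)
conn.execute(
    "INSERT INTO observation (parameter, value, units, source_id) VALUES (?, ?, ?, ?)",
    ("k_cat", 12.5, "1/s", 1),
)
conn.commit()

# Each observation can be traced back to the record it came from.
rows = conn.execute(
    """
    SELECT o.parameter, o.value, o.units, s.repository, s.record_id
    FROM observation AS o
    JOIN source AS s ON s.id = o.source_id
    WHERE o.parameter = ?
    """,
    ("k_cat",),
).fetchall()
print(rows)
```

Storing the source as a separate table means many observations can point to the same record without duplicating its details.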