4.1.1.3. datanator.data_source package

4.1.1.3.2. Submodules

4.1.1.3.3. datanator.data_source.corum_nosql module

class datanator.data_source.corum_nosql.CorumNoSQL(MongoDB, db, replicaSet=None, verbose=False, max_entries=inf, username=None, password=None, authSource='admin', cache_dirname=None)[source]

Bases: datanator.util.mongo_util.MongoUtil

load_content(endpoint='corum')[source]

Collect and parse all data from CORUM website into JSON files and add to NoSQL database

datanator.data_source.corum_nosql.correct_protein_name_list(lst)[source]

Correct a list of protein names with incorrect separators involving ‘[Cleaved into: …]’

Parameters

lst (str) – list of protein names with incorrect separators

Returns

corrected list of protein names

Return type

str

datanator.data_source.corum_nosql.main()[source]
datanator.data_source.corum_nosql.parse_list(str_lst)[source]

Parse a semicolon-separated list of strings into a list, ignoring semicolons that are inside square brackets

Parameters

str_lst (str) – semicolon-separated encoding of a list

Returns

list

Return type

list of str
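
The bracket-aware splitting described for parse_list can be illustrated with a short sketch. This is not the module's implementation; the function name parse_semicolon_list and the whitespace stripping are assumptions made for the example.

```python
def parse_semicolon_list(str_lst):
    """Split on ';' unless the ';' sits inside square brackets (illustrative sketch)."""
    items, current, depth = [], [], 0
    for ch in str_lst:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth = max(depth - 1, 0)
        if ch == ";" and depth == 0:
            # Top-level separator: close the current item
            items.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    last = "".join(current).strip()
    if last:
        items.append(last)
    return items


print(parse_semicolon_list("A; B [Cleaved into: C; D]; E"))
```

Tracking bracket depth rather than using a single regular expression keeps the sketch correct if brackets are ever nested.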

datanator.data_source.corum_nosql.parse_subunits(subunits)[source]

Given enzyme subunits list, separate uniprot_id and variant. e.g. [“P78381-2”] -> [“P78381-2”, “P78381”]

Parameters

subunits (list) – corum enzyme subunit string representation
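
As a rough illustration of the example given for parse_subunits, the sketch below keeps each identifier and appends the base UniProt ID when an isoform suffix is present. The function name expand_isoform_ids and the rule "everything before the first hyphen is the base ID" are assumptions, not the module's code.

```python
def expand_isoform_ids(subunits):
    """Return each ID plus its base ID when a '-<variant>' suffix is present (sketch)."""
    result = []
    for subunit in subunits:
        result.append(subunit)
        base, sep, _variant = subunit.partition("-")
        # Only add the base ID when a variant suffix was found and it is not listed yet
        if sep and base not in result:
            result.append(base)
    return result


print(expand_isoform_ids(["P78381-2"]))
```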

4.1.1.3.4. datanator.data_source.ec module

class datanator.data_source.ec.EC(server=None, db=None, username=None, password=None, authSource='admin', readPreference='nearest', collection_str='ec', verbose=True, max_entries=inf, cache_dir=None)[source]

Bases: datanator_query_python.util.mongo_util.MongoUtil

establish_ftp()[source]

Establish FTP connection (ftp://ftp.expasy.org/databases/enzyme/enzyme.dat).

make_doc(lines)[source]

Turn a block of EC info into a dictionary object

Parameters

lines (list of str) – list consists of lines of information on one EC group.

Returns

dictionary object.

Return type

(dict)
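
To make the idea behind make_doc concrete, here is a minimal sketch that groups the lines of one enzyme.dat entry by their two-letter line code. The function name block_to_dict and the choice of line codes as dictionary keys are assumptions; the field names produced by make_doc itself may differ.

```python
from collections import defaultdict


def block_to_dict(lines):
    """Group the lines of one enzyme.dat entry by their two-letter line code (sketch)."""
    doc = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("//"):
            # '//' terminates an entry in enzyme.dat
            continue
        code, value = line[:2], line[2:].strip()
        doc[code].append(value)
    return dict(doc)


block = [
    "ID   1.1.1.1\n",
    "DE   Alcohol dehydrogenase.\n",
    "AN   Aldehyde reductase.\n",
    "//\n",
]
print(block_to_dict(block))
```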

parse_content(file_location)[source]

Parse enzyme.dat file.

Parameters

file_location (str) – location of enzyme.dat file.

retrieve_content()[source]

Retrieve content of “enzyme.dat”

datanator.data_source.ec.main()[source]

4.1.1.3.5. datanator.data_source.gene_ortholog module

class datanator.data_source.gene_ortholog.KeggGeneOrtholog(server, src_db='datanator', des_db='datanator', collection_str='uniprot', username=None, password=None, readPreference='nearest', authSource='admin', verbose=True, max_entries=inf)[source]

Bases: datanator_query_python.util.mongo_util.MongoUtil

get_html(query)[source]

Get HTML file based on org:gene_code string, e.g. aly:ARALYDRAFT_486312.

Parameters

query (str) – org:gene_code string.

load_data(skip=0, top_hits=10)[source]

Loading data.

Parameters
  • skip (int, optional) – Beginning of the documents. Defaults to 0.

  • top_hits (int, optional) – Number of top hits to iterate through. Defaults to 10.

parse_gene_info(gene)[source]

Use mygene.info to get protein information given a string of gene code.

Parameters

gene (str) – Gene information.

Returns

List of protein IDs.

Return type

(list of str)

parse_html(soup)[source]

Parse out gene_orthologs from HTML (https://www.kegg.jp/ssdb-bin/ssdb_best?org_gene=aly:ARALYDRAFT_486312).

Parameters

soup (BeautifulSoup) – BeautifulSoup object

uniprot_to_org_gene(uniprot_id)[source]

Given uniprot_id, convert to kegg org_gene format.

Parameters

uniprot_id (str) – Uniprot ID.

Returns

Kegg org_gene format.

Return type

(str)

datanator.data_source.gene_ortholog.main()[source]

4.1.1.3.6. datanator.data_source.intact_nosql module

Downloads and parses the IntAct database of protein-protein interactions

class datanator.data_source.intact_nosql.IntActNoSQL(cache_dirname=None, MongoDB=None, db=None, replicaSet=None, verbose=False, max_entries=inf, username=None, password=None, authSource='admin')[source]

Bases: datanator.util.mongo_util.MongoUtil

A local MongoDB copy of the IntAct database

add_complexes()[source]

Parse complexes from data and add complexes to MongoDB

add_interactions()[source]

Parse interactions from data and add interactions to mongodb database

download_content()[source]

Download data from FTP server

find_between(string, first, last)[source]

Get the substring between the first occurrence of the substring first and the last occurrence of the substring last

Parameters
  • string (str) – string

  • first (str) – starting substring

  • last (str) – ending substring

Returns

substring between the first occurrence of the substring first and the last occurrence of the substring last

Return type

str
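
The behaviour documented for find_between can be sketched with str.index and str.rindex. The name substring_between and the choice to return an empty string when a delimiter is missing are assumptions for illustration; the method itself may handle missing delimiters differently.

```python
def substring_between(string, first, last):
    """Text between the first occurrence of `first` and the last occurrence of `last` (sketch)."""
    try:
        start = string.index(first) + len(first)
        end = string.rindex(last)
    except ValueError:
        # One of the delimiters does not occur in the string
        return ""
    return string[start:end]


print(substring_between('psi-mi:"MI:0326"(protein)', "(", ")"))
```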

find_between_psi_mi_parentheses(string)[source]

Find the text between parentheses in values of psi-mi key-value pairs

Parameters

string (str) – string

Returns

text between the parentheses

Return type

str

find_protein_gene(interactor, alias)[source]

Parse the protein and gene identifiers from key-value pairs of interactors and their aliases

Parameters
  • interactor (str) – key-value pairs of interactor

  • alias (str) – key-value pairs of the alias of the interactor

Returns

  • str: protein identifier

  • str: gene identifier

Return type

tuple

find_pubmed_id(string)[source]

Parse PubMed identifier from annotated key-value pair of publication type-identifier

Parameters

string (str) – key-value pair of publication type-identifier

Returns

PubMed identifier

Return type

str

load_content()[source]

Load the content of the local copy of the data source

split_colon(string)[source]

Split a string into substrings separated by ‘:’

Parameters

string (str) – string

Returns

substrings separated by ‘:’

Return type

list

split_line(string)[source]

Split a string into substrings separated by ‘|’

Parameters

string (str) – string

Returns

substrings separated by ‘|’

Return type

list

4.1.1.3.7. datanator.data_source.kegg_org_code module

class datanator.data_source.kegg_org_code.KeggOrgCode(MongoDB, db, cache_dirname=None, replicaSet=None, verbose=False, max_entries=inf, username=None, password=None, readPreference=None, authSource='admin', collection_str='kegg_organism_code')[source]

Bases: datanator_query_python.util.mongo_util.MongoUtil

bulk_load(bulk_size=100)[source]

Loading bulk data into MongoDB.

Parameters

bulk_size (int) – number of entries per insertion. Defaults to 100.

get_ncbi_id(name)[source]

Given the name of a species, look up its ncbi_taxonomy_id in the official NCBI database by parsing the HTML webpage.

Parameters

name (str) – name of the organism.

Returns

NCBI Taxonomy ID.

Return type

(int)

get_ncbi_id_rest(name)[source]

Get ncbi taxonomy id of an organism using api.datanator.info

Parameters

name (str) – Name of the organism.

Returns

NCBI Taxonomy ID.

Return type

(int)

has_align_but_no_rowspan(tag)[source]
make_bulk(offset=0, bulk_size=100)[source]

Make bulk objects to be inserted into MongoDB.

Parameters
  • offset (int) – Position of beginning (zero-indexed). Defaults to 0.

  • bulk_size (int) – number of objects. Defaults to 100.

Returns

list of objects to be inserted.

Return type

(list of dict)
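
The offset and bulk_size parameters of make_bulk describe a windowed read over a sequence of records. A minimal sketch of that pattern follows, assuming the records are already available as an in-memory list; make_batch and iter_batches are hypothetical names, and the insertion into MongoDB is left out.

```python
from itertools import islice


def make_batch(records, offset=0, bulk_size=100):
    """Return up to `bulk_size` records starting at zero-indexed `offset` (sketch)."""
    return list(islice(records, offset, offset + bulk_size))


def iter_batches(records, bulk_size=100):
    """Yield consecutive batches until the records are exhausted."""
    offset = 0
    while True:
        batch = make_batch(records, offset, bulk_size)
        if not batch:
            break
        yield batch
        offset += bulk_size


records = [{"n": i} for i in range(250)]
print([len(batch) for batch in iter_batches(records)])
```

Each yielded batch could then be handed to a bulk insert call, which is roughly the role bulk_load plays according to its description.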

parse_html_iter()[source]

Parse org code HTML iteratively.

Yields

(obj) – {‘kegg_organism_id’: , ‘org_name’: , ‘org_synonym’: }

datanator.data_source.kegg_org_code.main()[source]

4.1.1.3.8. datanator.data_source.kegg_orthology module

class datanator.data_source.kegg_orthology.KeggOrthology(cache_dirname=None, MongoDB=None, db=None, replicaSet='', verbose=False, max_entries=inf, username=None, password=None, authSource='admin')[source]

Bases: datanator.util.mongo_util.MongoUtil

download_ko(name)[source]
load_content()[source]

Load kegg_orthologs into MongoDB

parse_definition(line)[source]
Parse a definition line. A definition line could look like the following:

fructose-bisphosphate aldolase / 6-deoxy-5-ketofructose 1-phosphate synthase [NADP…] [EC:4.1.2.13 2.2.1.11]

The EC code is optional.
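
One way to separate the names from the optional EC codes in such a definition line is sketched below. The function name split_definition, the " / " name separator, and the returned (names, ec_codes) pair are assumptions; the return shape of parse_definition is not documented here.

```python
import re

# Trailing "[EC:...]" group; it may be absent
EC_PATTERN = re.compile(r"\[EC:([^\]]+)\]\s*$")


def split_definition(line):
    """Split a KO definition line into enzyme names and EC codes (sketch)."""
    line = line.strip()
    match = EC_PATTERN.search(line)
    ec_codes = match.group(1).split() if match else []
    names_part = line[: match.start()] if match else line
    names = [name.strip() for name in names_part.split(" / ") if name.strip()]
    return names, ec_codes


print(split_definition("malate dehydrogenase [EC:1.1.1.37]"))
```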

parse_gene(lines)[source]

Parse GENES category (http://rest.kegg.jp/get/ko:K00023)

Parameters

lines (readlines()) – Lines for genes.

Returns

list of parsed genes.

Return type

(list of dict)

parse_ko_txt(filename)[source]

Parse kegg_ortho txt file into dictionary object

parse_pathway_disease(lines, category='pathway')[source]

Parse pathway, disease, or module information

Parameters
  • lines (readlines()) – pathway lines.

  • category (str) – which category to parse. Defaults to pathway.

Returns

list of pathways [{“kegg_pathway_code”: …, “pathway_description”: …}]

Return type

(list of dict)
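
The return value documented above can be illustrated with a small sketch that splits each line into a code and a description. It assumes the section label (e.g. PATHWAY) has already been stripped from the lines; the name parse_pathway_lines and the way the category is folded into the dictionary keys are assumptions modelled on the pathway keys shown above.

```python
def parse_pathway_lines(lines, category="pathway"):
    """Turn '<code>  <description>' lines into a list of dictionaries (sketch)."""
    parsed = []
    for line in lines:
        parts = line.strip().split(None, 1)
        if not parts:
            continue  # skip blank lines
        code = parts[0]
        description = parts[1] if len(parts) > 1 else ""
        parsed.append({
            f"kegg_{category}_code": code,
            f"{category}_description": description,
        })
    return parsed


print(parse_pathway_lines(["ko00010  Glycolysis / Gluconeogenesis\n"]))
```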

datanator.data_source.kegg_orthology.main()[source]

4.1.1.3.9. datanator.data_source.kegg_reaction_class module

class datanator.data_source.kegg_reaction_class.KeggReaction(cache_dirname, MongoDB, db, replicaSet=None, verbose=False, max_entries=inf, username=None, password=None)[source]

Bases: datanator.util.mongo_util.MongoUtil

download_rxn(name)[source]
download_rxn_cls(cls)[source]
load_content()[source]

Load kegg_reactions into MongoDB

parse_rc_multiline(lines)[source]
Input:
DEFINITION C1y-C2y:-:C1b+C8y+N1y-C1b+C8y+N2y

N1y-N2y:-:C1a+C1x+C1y-C1a+C1x+C2y … … O1a-O2x:*-C1z:C1b-C1x

Output:

[C1y-C2y:-:C1b+C8y+N1y-C1b+C8y+N2y, N1y-N2y:-:C1a+C1x+C1y-C1a+C1x+C2y, …]

parse_rc_orthology(lines)[source]
Input:

ORTHOLOGY K00260 glutamate dehydrogenase [EC:1.4.1.2]
K00261 glutamate dehydrogenase (NAD(P)+) [EC:1.4.1.3]
K00262 glutamate dehydrogenase (NADP+) [EC:1.4.1.4]
K00263 leucine dehydrogenase [EC:1.4.1.9]
…
K13547 L-glutamine:2-deoxy-scyllo-inosose/3-amino-2,3-dideoxy-scyllo-inosose aminotransferase [EC:2.6.1.100 2.6.1.101]
..

Output

[K00260, K00261, …]
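
A minimal sketch of the input-to-output mapping above, assuming each orthology entry sits on its own line and the K number is the first token after the optional ORTHOLOGY label. The name extract_k_numbers is hypothetical and this is not the method's actual code.

```python
import re

# Optional section label, optional indentation, then a K number such as K00260
K_NUMBER = re.compile(r"^(?:ORTHOLOGY)?\s*(K\d{5})\b")


def extract_k_numbers(lines):
    """Collect the KEGG orthology IDs that start each ORTHOLOGY line (sketch)."""
    ids = []
    for line in lines:
        match = K_NUMBER.match(line)
        if match:
            ids.append(match.group(1))
    return ids


lines = [
    "ORTHOLOGY   K00260  glutamate dehydrogenase [EC:1.4.1.2]",
    "            K00261  glutamate dehydrogenase (NAD(P)+) [EC:1.4.1.3]",
]
print(extract_k_numbers(lines))
```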

parse_root_json()[source]

Parse root json file and return reaction classes

parse_rxn_cls_txt(filename)[source]

Parse kegg_ortho txt file into dictionary object.

categories = [‘ENTRY’, ‘DEFINITION’, ‘RPAIR’, ‘REACTION’, ‘ENZYME’, ‘PATHWAY’, ‘ORTHOLOGY’]

datanator.data_source.kegg_reaction_class.main()[source]

4.1.1.3.10. datanator.data_source.metabolite_nosql module

Author

Zhouyang Lian <zhouyang.lian@familian.life>

Author

Jonathan <jonrkarr@gmail.com>

Date

2019-04-02

Copyright

2019, Karr Lab

License

MIT

class datanator.data_source.metabolite_nosql.MetaboliteNoSQL(output_directory, source, MongoDB, db, verbose=True, max_entries=inf, username=None, password=None, authSource='admin', replicaSet=None)[source]

Bases: datanator.util.mongo_util.MongoUtil

Loads metabolite information into MongoDB and outputs documents as JSON files for each metabolite

Attributes:

  • source: source database, e.g. ‘ecmdb’ or ‘ymdb’

  • MongoDB: MongoDB server address, e.g. ‘mongodb://localhost:27017/’

  • max_entries: maximum number of documents to be processed

  • output_directory: directory in which JSON files will be stored

write_to_json()[source]

4.1.1.3.11. datanator.data_source.metabolites_meta_collection module

class datanator.data_source.metabolites_meta_collection.MetabolitesMeta(cache_dirname=None, MongoDB=None, replicaSet=None, db=None, verbose=False, max_entries=inf, username=None, password=None, authSource='admin', meta_loc=None)[source]

Bases: datanator_query_python.query.query_sabiork.QuerySabio

meta_loc: database location to save the meta collection

fill_metabolite_fields(fields=None, collection_src=None, collection_des=None)[source]

Fill in values of fields of interest from metabolite collection: ecmdb or ymdb

Parameters
  • fields – list of fields of interest

  • collection_src – collection in which the query will be done

  • collection_des – collection in which the result will be updated

fill_names()[source]

Fill names of metabolites in ‘name’ field

fill_standard_id(skip=0)[source]

Fill meta collection with chebi_id, pubmed_id, and kegg_id.

Parameters

skip (int) – skip first n number of records.

load_content()[source]
remove_dups(_key)[source]

Remove entries with the same _key.

Parameters

_key (str) – Name of fields in which dups will be identified.
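
The deduplication idea behind remove_dups can be shown on an in-memory list of documents. This sketch keeps the first document seen for each value of the key; the name drop_duplicates, the keep-first policy, and the use of plain dictionaries instead of a MongoDB collection are all assumptions.

```python
def drop_duplicates(docs, key):
    """Keep only the first document seen for each value of `key` (sketch)."""
    seen = set()
    unique = []
    for doc in docs:
        value = doc.get(key)  # values must be hashable for this sketch
        if value in seen:
            continue
        seen.add(value)
        unique.append(doc)
    return unique


docs = [{"inchi": "A", "name": "x"}, {"inchi": "A", "name": "y"}, {"inchi": "B", "name": "z"}]
print(drop_duplicates(docs, "inchi"))
```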

replace_key_in_similar_compounds()[source]
reset_cellular_locations(start=0)[source]

Github (https://github.com/KarrLab/datanator_rest_api/issues/69)

datanator.data_source.metabolites_meta_collection.main()[source]

4.1.1.3.12. datanator.data_source.modomics module

Use BpForms to gather rRNA and tRNA modification information from MODOMICS <https://iimcb.genesilico.pl/modomics/>.

Author

Jonathan Karr <karr@mssm.edu>

Date

2020-04-23

Copyright

2019, Karr Lab

License

MIT

class datanator.data_source.modomics.Modomics[source]

Bases: object

Use BpForms to gather rRNA and tRNA modification information from MODOMICS <https://iimcb.genesilico.pl/modomics/>.

ENDPOINT = 'https://iimcb.genesilico.pl/modomics/sequences/'[source]
run()[source]

4.1.1.3.13. datanator.data_source.pax_nosql module

class datanator.data_source.pax_nosql.PaxNoSQL(cache_dirname, MongoDB, db, verbose=False, max_entries=inf, username=None, password=None, authSource='admin', replicaSet=None)[source]

Bases: datanator.util.mongo_util.MongoUtil

load_content()[source]

Collects and parses all data from the Pax DB website and adds it to MongoDB

parse_paxDB_files()[source]

This function parses pax DB files and adds them to the NoSQL database

datanator.data_source.pax_nosql.find_files(path)[source]

Scan a directory (and its subdirectories) for files and sort by ncbi_id

Parameters

path (str) – Path containing the data_files

Returns

list of files to add to DB

Return type

list
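
A recursive scan of this kind can be sketched with os.walk. The example assumes that each file name begins with the NCBI taxonomy ID followed by a hyphen (for instance 9606-example.txt); that naming convention, the function name find_data_files, and the placement of unparseable names at the end are assumptions rather than documented behaviour.

```python
import os


def find_data_files(path):
    """Recursively list files under `path`, ordered by a leading NCBI ID (sketch)."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(path):
        for filename in filenames:
            found.append(os.path.join(dirpath, filename))

    def ncbi_id(file_path):
        # Assumed convention: '<ncbi_id>-<rest of name>'
        prefix = os.path.basename(file_path).split("-", 1)[0]
        return int(prefix) if prefix.isdigit() else float("inf")

    # Sort numerically by ID, then by path to keep the order deterministic
    return sorted(found, key=lambda p: (ncbi_id(p), p))
```

Sorting numerically rather than lexically keeps, for example, 4932 ahead of 511145.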

4.1.1.3.14. datanator.data_source.protein_aggregate module

class datanator.data_source.protein_aggregate.ProteinAggregate(username=None, password=None, server=None, authSource='admin', src_database='datanator', max_entries=inf, verbose=True, collection='protein', destination_database='datanator', cache_dir=None)[source]

Bases: object

copy_uniprot()[source]

Copy relevant information from uniprot collection

load_abundance_from_pax()[source]

Load protein abundance data by iterating over the pax collection.

load_bad_kinlaw()[source]

Load kinlaw IDs whose enzymes have names as “1” or “2”

load_ko()[source]

Load ko number for uniprot_id if such information exists

load_ko_from_uniprot()[source]

Load ko number from uniprot collection into aggregate collection (one shot after building new uniprot collection)

load_taxon()[source]

Load taxon ancestor information

load_unreviewed_abundance()[source]

Load abundance info for proteins that are not reviewed in uniprot

datanator.data_source.protein_aggregate.main()[source]

4.1.1.3.15. datanator.data_source.sabio_compound module

class datanator.data_source.sabio_compound.SabioCompound(username=None, password=None, server=None, authSource='admin', src_database='datanator', dest_database=None, max_entries=inf, verbose=True, src_collection='sabio_compound', dest_collection=None, cache_dir=None)[source]

Bases: object

add_inchi_key()[source]

Add inchi_key field to sabio_compound collection in MongoDB

datanator.data_source.sabio_compound.main()[source]

4.1.1.3.16. datanator.data_source.sabio_reaction module

class datanator.data_source.sabio_reaction.RxnAggregate(username=None, password=None, server=None, authSource='admin', src_database='datanator', max_entries=inf, verbose=True, collection='sabio_reaction_entries', destination_database='datanator', cache_dir=None)[source]

Bases: datanator_query_python.util.mongo_util.MongoUtil

create_reactants(doc)[source]
extract_enzyme_names(doc)[source]

Extract enzyme names

Parameters

doc (dict) – sabio_rk_old document

Returns

list of enzyme names

Return type

(list)

extract_reactant_names(doc)[source]

Extract compound information from doc dictionary

Parameters

doc (dict) – sabio_rk_old document

Returns

substrates and products names [[],[],…,[]], [[],[],…,[]]

Return type

(tuple)

fill_collection()[source]
get_ec(doc)[source]
get_rxn_id(doc)[source]
hash_null_reactants(start=0)[source]
(https://github.com/KarrLab/datanator/issues/50)

(https://github.com/KarrLab/datanator_rest_api/issues/116)

Parameters

start (int, optional) – Start of document. Defaults to 0.

label_existence(start=0)[source]

Label reactant’s existence in metabolites collections.

datanator.data_source.sabio_reaction.main()[source]

4.1.1.3.17. datanator.data_source.sabio_rk module

Author

Yosef Roth <yosefdroth@gmail.com>

Author

Jonathan Karr <jonrkarr@gmail.com>

Date

2017-05-04

Copyright

2017, Karr Lab

License

MIT

class datanator.data_source.sabio_rk.Compartment(**kwargs)[source]

Bases: datanator.data_source.sabio_rk.Entry

Represents a compartment in the SABIO-RK database

kinetic_laws[source]

list of kinetic laws

Type

list of KineticLaw

created[source]
cross_references[source]
id[source]
modified[source]
name[source]
synonyms[source]
class datanator.data_source.sabio_rk.Compound(**kwargs)[source]

Bases: datanator.data_source.sabio_rk.Entry

Represents a compound in the SABIO-RK database

_is_name_ambiguous[source]

if True, the currently stored compound name should not be trusted because multiple names for the same compound have been discovered. The consensus name must be obtained using download_compounds

Type

bool

structures[source]

structures

Type

list of CompoundStructure

reaction_participants[source]

list of reaction participants

Type

list of ReactionParticipant

parameters[source]

list of parameters

Type

list of Parameter

created[source]
cross_references[source]
get_inchi_structures()[source]

Get InChI-formatted structures

Returns

list of structures in InChI format

Return type

list of str

get_smiles_structures()[source]

Get SMILES-formatted structures

Returns

list of structures in SMILES format

Return type

list of str

id[source]
modified[source]
name[source]
structures[source]
synonyms[source]
class datanator.data_source.sabio_rk.CompoundStructure(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

Represents the structure of a compound and its format

compounds[source]

list of compounds

Type

list of Compound

value[source]

the structure in InChI, SMILES, etc. format

Type

str

format[source]

format (InChI, SMILES, etc.) of the structure

Type

str

_value_inchi[source]

structure in InChI format

Type

str

_value_inchi_formula_connectivity[source]

empirical formula (without hydrogen) and connectivity InChI layers; used to quickly search for compound structures

Type

str

calc_inchi_formula_connectivity()[source]

Calculate a searchable structure

  • InChI format

  • Core InChI format

    • Formula layer (without hydrogen)

    • Connectivity layer
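
To illustrate what a formula-plus-connectivity key might look like, the sketch below takes an InChI string, removes hydrogen from the formula layer, and appends the connectivity layer (the layer beginning with "c"). The function name formula_connectivity and the exact output format are assumptions; the value stored in _value_inchi_formula_connectivity may be formatted differently.

```python
import re

# 'H' plus an optional count, but not two-letter symbols such as Hg or He
HYDROGEN = re.compile(r"H(?![a-z])\d*")


def formula_connectivity(inchi):
    """Build a hydrogen-free formula/connectivity key from an InChI string (sketch)."""
    layers = inchi.split("/")
    formula = HYDROGEN.sub("", layers[1]) if len(layers) > 1 else ""
    connectivity = next((layer for layer in layers[2:] if layer.startswith("c")), "")
    return f"{formula}/{connectivity}" if connectivity else formula


# Ethanol
print(formula_connectivity("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"))
```

Dropping the hydrogen and stereo layers means that protonation states and stereoisomers of the same skeleton map to the same key, which is what makes the value useful for quick searches.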

format[source]
value[source]
class datanator.data_source.sabio_rk.Entry(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

Represents an entry in the SABIO-RK database

id[source]

external identifier

Type

int

name[source]

name

Type

str

synonyms[source]

list of synonyms

Type

list of Synonym

cross_references[source]

list of cross references

Type

list of Resource

created[source]

date that the sqlite object was created

Type

datetime.datetime

updated[source]

date that the sqlite object was last updated

Type

datetime.datetime

created[source]
cross_references[source]
id[source]
modified[source]
name[source]
synonyms[source]
class datanator.data_source.sabio_rk.Enzyme(**kwargs)[source]

Bases: datanator.data_source.sabio_rk.Entry

Represents an enzyme in the SABIO-RK database

subunits[source]

list of subunits

Type

list of EnzymeSubunit

kinetic_laws[source]

list of kinetic laws

Type

list of KineticLaw

molecular_weight[source]

molecular weight in Daltons

Type

float

parameters[source]

list of parameters

Type

list of Parameter

created[source]
cross_references[source]
id[source]
modified[source]
molecular_weight[source]
name[source]
synonyms[source]
class datanator.data_source.sabio_rk.EnzymeSubunit(**kwargs)[source]

Bases: datanator.data_source.sabio_rk.Entry

Represents an enzyme subunit in the SABIO-RK database

enzyme[source]

enzyme

Type

Enzyme

coefficient[source]

stoichiometry of the subunit in the enzyme

Type

int

sequence[source]

amino acid sequence

Type

str

molecular_weight[source]

molecular weight in Daltons

Type

float

coefficient[source]
created[source]
cross_references[source]
enzyme[source]
enzyme_id[source]
id[source]
modified[source]
molecular_weight[source]
name[source]
sequence[source]
synonyms[source]
class datanator.data_source.sabio_rk.KineticLaw(**kwargs)[source]

Bases: datanator.data_source.sabio_rk.Entry

Represents a kinetic law in the SABIO-RK database

reactants[source]

list of reactants

Type

list of ReactionParticipant

products[source]

list of products

Type

list of ReactionParticipant

enzyme[source]

enzyme

Type

Enzyme

enzyme_compartment[source]

compartment

Type

Compartment

enzyme_type[source]

type of the enzyme (e.g. Modifier-Catalyst)

Type

str

tissue[source]

tissue

Type

str

mechanism[source]

mechanism of enzymatic catalysis (e.g. Michaelis-Menten)

Type

str

equation[source]

equation

Type

str

parameters[source]

list of parameters

Type

list of Parameter

modifiers[source]

list of modifiers

Type

list of ReactionParticipant

taxon[source]

taxon

Type

str

taxon_wildtype[source]

if True, the taxon represents the wild type

Type

bool

taxon_variant[source]

variant of the taxon

Type

str

temperature[source]

temperature in C

Type

float

ph[source]

pH

Type

float

media[source]

media

Type

str

references[source]

list of PubMed references

Type

list of Resource

created[source]
cross_references[source]
enzyme[source]
enzyme_compartment[source]
enzyme_compartment_id[source]
enzyme_id[source]
enzyme_type[source]
equation[source]
id[source]
mechanism[source]
media[source]
modified[source]
modifiers[source]
name[source]
parameters[source]
ph[source]
products[source]
reactants[source]
references[source]
synonyms[source]
taxon[source]
taxon_variant[source]
taxon_wildtype[source]
temperature[source]
tissue[source]
class datanator.data_source.sabio_rk.Parameter(**kwargs)[source]

Bases: datanator.data_source.sabio_rk.Entry

Represents a parameter in the SABIO-RK database

kinetic_law[source]

kinetic law

Type

KineticLaw

type[source]

SBO term

Type

int

compound[source]

compound

Type

Compound

enzyme[source]

enzyme

Type

Enzyme

compartment[source]

compartment

Type

Compartment

value[source]

normalized value

Type

float

error[source]

normalized error

Type

float

units[source]

normalized units

Type

str

observed_name[source]

name

Type

str

observed_type[source]

SBO term

Type

int

observed_value[source]

observed value

Type

float

observed_error[source]

observed error

Type

float

observed_units[source]

observed units

Type

str

TYPES

dictionary of SBO terms and their canonical string symbols

Type

dict of int: str

UNITS

dictionary of SBO terms and their canonical units

Type

dict of int: str

TYPES = {25: 'k_cat', 27: 'k_m', 186: 'v_max', 261: 'k_i'}[source]
compartment[source]
compartment_id[source]
compound[source]
compound_id[source]
created[source]
cross_references[source]
enzyme[source]
enzyme_id[source]
error[source]
id[source]
kinetic_law_id[source]
modified[source]
name[source]
observed_error[source]
observed_name[source]
observed_type[source]
observed_units[source]
observed_value[source]
synonyms[source]
type[source]
units[source]
value[source]
class datanator.data_source.sabio_rk.ReactionParticipant(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

Represents a participant in a SABIO-RK reaction

compound[source]

compound

Type

Compound

compartment[source]

compartment

Type

Compartment

coefficient[source]

coefficient

Type

float

type[source]

type

Type

str

reactant_kinetic_law[source]

kinetic law in which the participant appears as a reactant

Type

KineticLaw

product_kinetic_law[source]

kinetic law in which the participant appears as a product

Type

KineticLaw

coefficient[source]
compartment[source]
compartment_id[source]
compound[source]
compound_id[source]
modifier_kinetic_law_id[source]
product_kinetic_law_id[source]
reactant_kinetic_law_id[source]
type[source]
class datanator.data_source.sabio_rk.Resource(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

Represents an external resource

namespace[source]

external namespace

Type

str

id[source]

external identifier

Type

str

entries[source]

entries

Type

list of Entry

kinetic_laws[source]

kinetic laws

Type

list of KineticLaw

id[source]
namespace[source]
class datanator.data_source.sabio_rk.SabioRk(name=None, cache_dirname=None, clear_content=False, load_content=False, max_entries=inf, commit_intermediate_results=False, download_backups=False, verbose=False, clear_requests_cache=False, download_request_backup=False, webservice_batch_size=1, excel_batch_size=100, quilt_owner=None, quilt_package=None)[source]

Bases: datanator.core.data_source.HttpDataSource

A local sqlite copy of the SABIO-RK database

webservice_batch_size[source]

default size of batches to download kinetic information from the SABIO webservice. Note: this should be set to one because SABIO exports units incorrectly when multiple kinetic laws are requested

Type

int

excel_batch_size[source]

default size of batches to download kinetic information from the SABIO Excel download service

Type

int

URL to obtain a list of the ids of all of the kinetic laws in SABIO-RK

Type

str

ENDPOINT_WEBSERVICE[source]

URL for the SABIO-RK webservice

Type

str

ENDPOINT_EXCEL_EXPORT[source]

URL to download kinetic data as a table in TSV format

Type

str

ENDPOINT_COMPOUNDS_PAGE[source]

URL to download information about a SABIO-RK compound

Type

str

SKIP_KINETIC_LAW_IDS[source]

IDs of kinetic laws that should be skipped (because they contain errors and can’t be downloaded from SABIO)

Type

tuple of int

PUBCHEM_MAX_TRIES[source]

maximum number of times to try querying PubChem before failing

Type

int

PUBCHEM_TRY_DELAY[source]

delay in seconds between PubChem queries (to avoid overloading the server)

Type

float
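
The two PubChem attributes above describe a bounded retry with a pause between attempts. The sketch below shows that general pattern; it is not the class's code. The helper name query_with_retries and the injectable sleep argument are assumptions, and the default values mirror the constants listed for this class.

```python
import time

PUBCHEM_MAX_TRIES = 10    # value listed for SabioRk.PUBCHEM_MAX_TRIES
PUBCHEM_TRY_DELAY = 0.25  # value listed for SabioRk.PUBCHEM_TRY_DELAY (seconds)


def query_with_retries(query, max_tries=PUBCHEM_MAX_TRIES,
                       delay=PUBCHEM_TRY_DELAY, sleep=time.sleep):
    """Call `query()` until it succeeds or `max_tries` attempts have failed (sketch)."""
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    for attempt in range(1, max_tries + 1):
        try:
            return query()
        except Exception:
            if attempt == max_tries:
                raise  # give up and surface the last error
            sleep(delay)  # pause before the next attempt
```

Passing sleep as a parameter makes the helper easy to exercise in tests without actually waiting.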

ENDPOINT_COMPOUNDS_PAGE = 'http://sabiork.h-its.org/compdetails.jsp'[source]
ENDPOINT_DOMAINS = {'sabio_rk': 'http://sabiork.h-its.org', 'uniprot': 'http://www.uniprot.org'}[source]
ENDPOINT_EXCEL_EXPORT = 'http://sabiork.h-its.org/entry/exportToExcelCustomizable'[source]
ENDPOINT_KINETIC_LAWS_PAGE = 'http://sabiork.h-its.org/kindatadirectiframe.jsp'[source]
ENDPOINT_KINETIC_LAWS_SEARCH = 'http://sabiork.h-its.org/sabioRestWebServices/searchKineticLaws/entryIDs'[source]
ENDPOINT_WEBSERVICE = 'http://sabiork.h-its.org/sabioRestWebServices/kineticLaws'[source]
PUBCHEM_MAX_TRIES = 10[source]
PUBCHEM_TRY_DELAY = 0.25[source]
SKIP_KINETIC_LAW_IDS = (51286,)[source]
base_model[source]

alias of sqlalchemy.ext.declarative.api.Base

calc_enzyme_molecular_weights(enzymes)[source]

Calculate the molecular weight of each enzyme

Parameters

enzymes (list of Enzyme) – list of enzymes

calc_stats()[source]

Calculate statistics about SABIO-RK

Returns

list of list of statistics

Return type

list of list of obj

create_compartment_from_sbml(sbml)[source]

Add a compartment to the local sqlite database

Parameters

sbml (libsbml.Compartment) – SBML-representation of a compartment

Returns

compartment

Return type

Compartment

create_cross_references_from_sbml(sbml)[source]

Add cross references to the local sqlite database for an SBML object

Parameters

sbml (libsbml.SBase) – object in an SBML document

Returns

list of resources

Return type

list of Resource

create_kinetic_law_from_sbml(id, sbml, specie_properties, functions, units)[source]

Add a kinetic law to the local sqlite database

Parameters
  • id (int) – identifier

  • sbml (libsbml.KineticLaw) – SBML-representation of a reaction

  • specie_properties (dict) –

    additional properties of the compounds/enzymes

    • is_wildtype (bool): indicates if the enzyme is wildtype or mutant

    • variant (str): description of the variant of the enzyme

    • modifier_type (str): type of the enzyme (e.g. Modifier-Catalyst)

  • functions (dict of str: str) – dictionary of rate law equations (keys = IDs in SBML, values = equations)

  • units (dict of str: str) – dictionary of units (keys = IDs in SBML, values = names)

Returns

kinetic law

Return type

KineticLaw

Raises

ValueError – if the temperature is expressed in an unsupported unit

create_kinetic_laws_from_sbml(ids, sbml)[source]

Add kinetic laws defined in an SBML file to the local sqlite database

Parameters
  • ids (list of int) – list of kinetic law IDs

  • sbml (str) – SBML representation of one or more kinetic laws

Returns

Return type

tuple

create_specie_from_sbml(sbml)[source]

Add a species to the local sqlite database

Parameters

sbml (libsbml.Species) – SBML-representation of a compound or enzyme

Returns

  • Compound or Enzyme: compound or enzyme

  • dict: additional properties of the compound/enzyme

    • is_wildtype (bool): indicates if the enzyme is wildtype or mutant

    • variant (str): description of the variant of the enzyme

    • modifier_type (str): type of the enzyme (e.g. Modifier-Catalyst)

Return type

tuple

Raises

ValueError – if a species is of an unsupported type (i.e. not a compound or enzyme)

export_stats(stats, filename=None)[source]

Export statistics to an Excel workbook

Parameters
  • stats (list of list of obj) – list of list of statistics

  • filename (str, optional) – path to export statistics

get_parameter_by_properties(kinetic_law, parameter_properties)[source]

Get the parameter of kinetic_law whose attribute values are equal to that of parameter_properties

Parameters
  • kinetic_law (KineticLaw) – kinetic law to find parameter of

  • parameter_properties (dict) – properties of parameter to find

Returns

parameter with attribute values equal to values of parameter_properties

Return type

Parameter

get_specie_reference_from_sbml(specie_id)[source]

Get the compound/enzyme associated with an SBML species by its ID

Parameters

specie_id (str) – ID of an SBML species

Returns

Return type

tuple

Raises

ValueError – if the species is not a compound or enzyme, no species with id = specie_id exists, or no compartment with name = compartment_name exists

infer_compound_structures_from_names(compounds)[source]

Try to use PubChem to infer the structure of compounds from their names

Notes: we don’t try to look up structures from their cross references because SABIO has already gathered all structures from their cross references to ChEBI, KEGG, and PubChem

Parameters

compounds (list of Compound) – list of compounds

load_compounds(compounds=None)[source]

Download information from SABIO-RK about all of the compounds stored in the local sqlite copy of SABIO-RK

Parameters

compounds (list of Compound) – list of compounds to download

Raises

Error – if an HTTP request fails

load_content()[source]

Download the content of SABIO-RK and store it to a local sqlite database.

load_kinetic_law_ids()[source]

Download the IDs of all of the kinetic laws stored in SABIO-RK

Returns

list of kinetic law IDs

Return type

list of int

Raises

Error – if an HTTP request fails or the expected number of kinetic laws is not returned

load_kinetic_laws(ids)[source]

Download kinetic laws from SABIO-RK

Parameters

ids (list of int) – list of IDs of kinetic laws to download

Raises

Error – if an HTTP request fails

load_missing_enzyme_information_from_html(ids)[source]

Loading enzyme subunit information from html

Parameters

ids (list of int) – list of IDs of kinetic laws to download

load_missing_kinetic_law_information_from_tsv(ids)[source]

Update the properties of kinetic laws in the local sqlite database based on content downloaded from SABIO in TSV format.

Parameters

ids (list of int) – list of IDs of kinetic laws to download

load_missing_kinetic_law_information_from_tsv_helper(tsv)[source]

Update the properties of kinetic laws in the local sqlite database based on content downloaded from SABIO in TSV format.

Note: this method is necessary because neither SABIO’s SBML export nor its Excel export provides all of SABIO’s content.

Parameters

tsv (str) – TSV-formatted table

Raises

ValueError – if a kinetic law or compartment is not contained in the local sqlite database

normalize_kinetic_laws(ids)[source]

Normalize parameter values

Parameters

ids (list of int) – list of IDs of kinetic laws to normalize

normalize_parameter_value(name, type, value, error, units, enzyme_molecular_weight)[source]
Parameters
  • name (str) – parameter name

  • type (int) – parameter type (SBO term id)

  • value (float) – observed value

  • error (float) – observed error

  • units (str) – observed units

  • enzyme_molecular_weight (float) – enzyme molecular weight

Returns

normalized name and its type (SBO term), value, error, and units

Return type

tuple of str, int, float, float, str

Raises

ValueError – if units is not a supported unit of type

parse_complex_subunit_structure(text)[source]

Parse the subunit structure of a complex into a dictionary of subunit coefficients

Parameters

text (str) – subunit structure described with nested parentheses

Returns

dictionary of subunit coefficients

Return type

dict of str, int
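As an illustration only, the following sketch shows one way a nested-parentheses subunit string could be turned into a dictionary of coefficients. The input grammar (identifiers wrapped in parentheses, with an optional `*n` multiplier after a closing parenthesis) and the name `parse_subunit_structure_sketch` are assumptions made for this example; they are not taken from the documented method.

```python
import re
from collections import Counter

# Tokens: "(", ")", "*<int>", or an identifier (assumed grammar, see lead-in)
_TOKEN = re.compile(r"\(|\)|\*\d+|[A-Za-z0-9_-]+")


def parse_subunit_structure_sketch(text):
    """Hypothetical parser: return {subunit id: total coefficient}."""
    tokens = _TOKEN.findall(text.replace(" ", ""))
    pos = 0

    def parse_group():
        nonlocal pos
        counts = Counter()
        while pos < len(tokens) and tokens[pos] != ")":
            tok = tokens[pos]
            if tok == "(":
                pos += 1
                inner = parse_group()
                if pos >= len(tokens) or tokens[pos] != ")":
                    raise ValueError("unbalanced parentheses")
                pos += 1
            elif tok.startswith("*"):
                raise ValueError("multiplier without a preceding group")
            else:
                inner = Counter({tok: 1})
                pos += 1
            # An optional "*n" multiplies everything in the preceding group
            multiplier = 1
            if pos < len(tokens) and tokens[pos].startswith("*"):
                multiplier = int(tokens[pos][1:])
                pos += 1
            for key, value in inner.items():
                counts[key] += value * multiplier
        return counts

    result = parse_group()
    if pos != len(tokens):
        raise ValueError("unbalanced parentheses")
    return dict(result)


coefficients = parse_subunit_structure_sketch("((P12345)*2(Q67890))*3")
print(coefficients)  # {'P12345': 6, 'Q67890': 3}
```

Multipliers compound from the inside out, so the inner `*2` and the outer `*3` give a coefficient of 6 for the first identifier.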

parse_enzyme_name(sbml)[source]

Parse the name of an enzyme in SBML for the enzyme name, wild type status, and variant description that it contains.

Parameters

sbml (str) – enzyme name in SBML

Returns

  • str: name

  • bool: if True, the enzyme is wild type

  • str: variant

Return type

tuple

Raises

ValueError – if the enzyme name is formatted in an unsupported format

class datanator.data_source.sabio_rk.Synonym(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

Represents a synonym to a SABIO-RK entry

name[source]

name of the synonym

Type

str

entries[source]

list of entries with the synonym

Type

list of Entry

4.1.1.3.18. datanator.data_source.sabio_rk_json_mongo module

Parse SabioRk json files into MongoDB documents

(json files acquired by running sqlite_to_json.py)

Author

Zhouyang Lian <zhouyang.lian@familian.life>

Author

Jonathan <jonrkarr@gmail.com>

Date

2019-04-02

Copyright

2019, Karr Lab

License

MIT

class datanator.data_source.sabio_rk_json_mongo.SabioRkNoSQL(db=None, MongoDB=None, cache_directory=None, verbose=False, max_entries=inf, replicaSet=None, username=None, password=None, authSource='admin')[source]

Bases: datanator.util.mongo_util.MongoUtil

add_inchi_hash(ids=None)[source]

Add inchi key values of _value_inchi in sabio_rk collection

add_taxon_info()[source]

Fill in taxonomic information

fill_ec_meta(start=0)[source]

Fill sabio documents with ec meta information.

fill_kegg_meta(start=0)[source]

Fill kegg information for reactions.

Parameters

start (int, optional) – Starting document. Defaults to 0.

fill_reactant_aggregate_name(start=0)[source]

Fill sabio documents with reactant aggregate information by reactants’ name.

load_json()[source]
make_doc(file_names, file_dict)[source]
datanator.data_source.sabio_rk_json_mongo.main()[source]

4.1.1.3.19. datanator.data_source.sabio_rk_nosql module

class datanator.data_source.sabio_rk_nosql.SabioRk(cache_dirname=None, MongoDB=None, replicaSet=None, db=None, verbose=False, max_entries=inf, username=None, password=None, authSource='admin', webservice_batch_size=50, excel_batch_size=50)[source]

Bases: object

add_inchi_hash(ids)[source]

Add sha224 hashed values of _value_inchi in sabio_rk collection
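A minimal sketch of hashing an InChI string with SHA-224 using Python's standard `hashlib`. The helper name `inchi_sha224` and the choice of UTF-8 encoding are assumptions for illustration; the description above only states that SHA-224 hashes of `_value_inchi` are added.

```python
import hashlib


def inchi_sha224(inchi):
    """Hypothetical helper: hexadecimal SHA-224 digest of an InChI string."""
    # hashlib operates on bytes, so the string is encoded first (UTF-8 assumed)
    return hashlib.sha224(inchi.encode("utf-8")).hexdigest()


digest = inchi_sha224("InChI=1S/H2O/h1H2")
print(len(digest))  # 56, since a SHA-224 digest is 224 bits = 56 hex characters
```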

calc_enzyme_molecular_weights(enzymes, length)[source]

Calculate the molecular weight of each enzyme

Parameters

enzymes (list of dict) – list of enzymes

Returns

list of enzymes

Return type

enzymes (list of dict)

calc_inchi_formula_connectivity(structure)[source]

Calculate searchable structures:

  • InChI format

  • Core InChI format

    • Formula layer (without hydrogen)

    • Connectivity layer
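The layers listed above can be pulled out of an InChI string with plain string handling. The sketch below is an assumption-laden illustration, not the method's actual code: the helper name `formula_and_connectivity_sketch` is hypothetical, and it relies only on the InChI convention that layers are separated by `/`, the formula layer comes directly after the version prefix, and the connectivity layer starts with `c`.

```python
import re


def formula_and_connectivity_sketch(inchi):
    """Hypothetical helper: (formula layer without hydrogen, connectivity layer)."""
    layers = inchi.split("/")
    # layers[0] is the version prefix (e.g. "InChI=1S"); layers[1] is the formula
    formula = layers[1] if len(layers) > 1 else ""
    # Drop hydrogen ("H" plus optional count) but keep elements such as Hg or He
    formula_no_h = re.sub(r"H(?![a-z])\d*", "", formula)
    # The connectivity layer, when present, is the layer that starts with "c"
    connectivity = next((layer for layer in layers[2:] if layer.startswith("c")), "")
    return formula_no_h, connectivity


ethanol = formula_and_connectivity_sketch("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(ethanol)  # ('C2O', 'c1-2-3')
```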

create_cross_references_from_sbml(sbml)[source]

Look up cross references from an SBML object and return them as a list of dictionaries

Parameters

sbml (libsbml.SBase) – object in an SBML document

Returns

list of resources

Return type

list of dictionary

create_kinetic_law_from_sbml(id, sbml, root_species, specie_properties, functions, units)[source]

Make a kinetic law doc for mongoDB

Parameters
  • id (int) – identifier

  • sbml (libsbml.KineticLaw) – SBML-representation of a reaction (reaction_sbml)

  • root_species (list) – list of species in root sbml

  • specie_properties (dict) –

    additional properties of the compounds/enzymes

    • is_wildtype (bool): indicates if the enzyme is wildtype or mutant

    • variant (str): description of the variant of the enzyme

    • modifier_type (str): type of the enzyme (e.g. Modifier-Catalyst)

  • functions (dict of str: str) – dictionary of rate law equations (keys = IDs in SBML, values = equations)

  • units (dict of str: str) – dictionary of units (keys = IDs in SBML, values = names)

Returns

kinetic law

Return type

dictionary

Raises

ValueError – if the temperature is expressed in an unsupported unit

create_kinetic_laws_from_sbml(ids, sbml)[source]

Add kinetic laws defined in an SBML file to the local mongodb database

Parameters
  • ids (list of int) – list of kinetic law IDs

  • sbml (str) – SBML representation of one or more kinetic laws (root)

Returns

  • list of KineticLaw: list of kinetic laws

  • list of Compound or Enzyme: list of species (compounds or enzymes)

  • list of Compartment: list of compartments

Return type

tuple

get_compartment_from_sbml(sbml)[source]

Get compartment from SBML

Parameters

sbml (libsbml.Compartment) – SBML-representation of a compartment

Returns

dictionary: compartment

get_parameter_by_properties(kinetic_law, parameter_properties)[source]

Get the parameter of kinetic_law whose attribute values are equal to those of parameter_properties

Parameters
  • kinetic_law (KineticLaw) – kinetic law to find parameter of

  • parameter_properties (dict) – properties of parameter to find

Returns

parameter with attribute values equal to values of parameter_properties

Return type

Parameter

get_specie_from_sbml(sbml)[source]

Get species information from SBML

Parameters

sbml (libsbml.Species) – SBML-representation of a compound or enzyme

Returns

  • Compound or Enzyme: compound or enzyme

  • dict: additional properties of the compound/enzyme

    • is_wildtype (bool): indicates if the enzyme is wildtype or mutant

    • variant (str): description of the variant of the enzyme

    • modifier_type (str): type of the enzyme (e.g. Modifier-Catalyst)

Return type

tuple

Raises

ValueError – if a species is of an unsupported type (i.e. not a compound or enzyme)

get_specie_reference_from_sbml(specie_id, species)[source]

Get the compound/enzyme associated with an SBML species by its ID

Parameters

specie_id (str) – ID of an SBML species

Returns

  • Compound or Enzyme: compound or enzyme

  • Compartment: compartment

Return type

tuple

Raises

ValueError – if the species is not a compound or enzyme, no species with id = specie_id exists, or no compartment with name = compartment_name exists

infer_compound_structures_from_names(compounds)[source]

Try to use PubChem to infer the structure of compounds from their names

Notes: we don’t try to look up structures from their cross references because SABIO has already gathered all structures from their cross references to ChEBI, KEGG, and PubChem

Parameters

compounds (list of dict) – list of compounds

load_compounds(compounds=None)[source]

Download information from SABIO-RK about all of the compounds stored in the sabio_compounds collection

Parameters

compounds (list of obj) – list of compounds to download

Raises

Error – if an HTTP request fails

load_content()[source]

Download the content of SABIO-RK and store it to a remote mongoDB.

load_kinetic_law_ids()[source]

Download the IDs of all of the kinetic laws stored in SABIO-RK

Returns

list of kinetic law IDs

Return type

list of int

load_kinetic_laws(ids)[source]

Download kinetic laws from SABIO-RK

Parameters

ids (list of int) – list of IDs of kinetic laws to download

Raises

Error – if an HTTP request fails

load_missing_enzyme_information_from_html(ids, start=0)[source]

Load enzyme subunit information from HTML

Parameters
  • ids (list of int) – list of IDs of kinetic laws to download

  • start (int) – starting point for iterator

load_missing_kinetic_law_information_from_tsv(ids)[source]

Update the properties of kinetic laws in mongodb based on content downloaded from SABIO in TSV format.

Parameters
  • ids (list of int) – list of IDs of kinetic laws to download

  • start (int) – starting row

load_missing_kinetic_law_information_from_tsv_helper(tsv, start=0)[source]

Update the properties of kinetic laws in the mongodb based on content downloaded from SABIO in TSV format.

Note: this method is necessary because neither SABIO’s SBML export nor its Excel export provides all of SABIO’s content.

Parameters
  • tsv (str) – TSV-formatted table

  • start (int) – starting row

Raises

ValueError – if a kinetic law or compartment is not contained in the local sqlite database

normalize_kinetic_laws(new_ids)[source]

Normalize parameter values.

Parameters

new_ids (list of int) – list of IDs of kinetic laws to normalize

normalize_parameter_value(name, type, value, error, units, enzyme_molecular_weight)[source]
Parameters
  • name (str) – parameter name

  • type (int) – parameter type (SBO term id)

  • value (float) – observed value

  • error (float) – observed error

  • units (str) – observed units

  • enzyme_molecular_weight (float) – enzyme molecular weight

Returns

normalized name and its type (SBO term), value, error, and units

Return type

tuple of str, int, float, float, str

Raises

ValueError – if units is not a supported unit of type

parse_complex_subunit_structure(text)[source]

Parse the subunit structure of a complex into a dictionary of subunit coefficients

Parameters

text (str) – subunit structure described with nested parentheses

Returns

dictionary of subunit coefficients

Return type

dict of str, int

parse_enzyme_name(sbml)[source]

Parse the name of an enzyme in SBML for the enzyme name, wild type status, and variant description that it contains.

Parameters

sbml (str) – enzyme name in SBML

Returns

  • str: name

  • bool: if True, the enzyme is wild type

  • str: variant

Return type

tuple

Raises

ValueError – if the enzyme name is formatted in an unsupported format

datanator.data_source.sabio_rk_nosql.main()[source]

4.1.1.3.20. datanator.data_source.sqlite_to_json module

Converts tables in SQLite into json files

database

path to sqlite database

datanator.data_source.sqlite_to_json.query[source]

query execution command in string format
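As a rough sketch of the conversion this module describes, the example below dumps every row of one SQLite table to a JSON string using only the standard library. The helper name `table_to_json_sketch`, the in-memory database, and the `compound` table are all hypothetical and chosen for illustration; they do not reflect the module's actual interface.

```python
import json
import sqlite3


def table_to_json_sketch(connection, table):
    """Hypothetical helper: serialize every row of `table` as a JSON array of objects."""
    # sqlite3.Row lets each row be converted to a dict keyed by column name
    connection.row_factory = sqlite3.Row
    # Table names cannot be bound as parameters, so quote the identifier explicitly
    quoted = '"' + table.replace('"', '""') + '"'
    rows = connection.execute(f"SELECT * FROM {quoted}").fetchall()
    return json.dumps([dict(row) for row in rows])


# Illustrative in-memory database with a made-up table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compound (id INTEGER, name TEXT)")
conn.execute("INSERT INTO compound VALUES (1, 'ATP')")
print(table_to_json_sketch(conn, "compound"))  # [{"id": 1, "name": "ATP"}]
```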

class datanator.data_source.sqlite_to_json.SQLToJSON(query, cache_dirname=None)[source]

Bases: object

db()[source]
query_table(table, one=True)[source]
table()[source]
datanator.data_source.sqlite_to_json.main()[source]

4.1.1.3.21. datanator.data_source.taxon_tree module

class datanator.data_source.taxon_tree.TaxonTree(cache_dirname, MongoDB, db, replicaSet=None, verbose=False, max_entries=inf, username=None, password=None, authSource='admin')[source]

Bases: datanator.util.mongo_util.MongoUtil

count_line(file)[source]

Efficiently count total number of lines in a given file
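One common way to count lines without loading a whole file into memory is to read fixed-size binary chunks and count newline bytes. The sketch below illustrates that approach; the name `count_lines_sketch` and the 1 MiB chunk size are assumptions, and the source does not say which technique `count_line` uses.

```python
import os
import tempfile


def count_lines_sketch(path, chunk_size=1024 * 1024):
    """Hypothetical helper: count newline bytes by reading fixed-size chunks."""
    count = 0
    with open(path, "rb") as handle:
        # iter() with a sentinel stops at the first empty read (end of file)
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            count += chunk.count(b"\n")
    return count


# Usage on a small temporary file
with tempfile.NamedTemporaryFile("wb", suffix=".dmp", delete=False) as tmp:
    tmp.write(b"a\nb\nc\n")
n_lines = count_lines_sketch(tmp.name)
os.remove(tmp.name)
print(n_lines)  # 3
```

Because only newline bytes are counted, a final line without a trailing newline is not included in the total.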

download_dump()[source]
insert_canon_anc(start=0)[source]

Insert two arrays to each document, one is canon_anc_id, the other is canon_anc_name

load_content()[source]

Load contents of several .dmp files into MongoDB

parse_division()[source]

division.dmp

parse_fullname_line(line)[source]

Parses lines in file fullnamelineage.dmp and return elements in a list

parse_fullname_taxid()[source]

Parse fullnamelineage.dmp and taxidlineage.dmp and store them in MongoDB. Always run this first, before loading anything else (insert_one).

parse_gencode()[source]

gencode.dmp

parse_names()[source]

names.dmp

1 | all | | synonym |
1 | root | | scientific name |
2 | bacteria | bacteria <blast2> | blast name |
2 | Bacteria | Bacteria <prokaryotes> | scientific name |
2 | eubacteria | | genbank common name

parse_nodes()[source]

nodes.dmp

parse_nodes_line(line)[source]

Parse lines in nodes.dmp

parse_taxid_line(line)[source]

Parse a line of taxidlineage.dmp and return its elements in a list.

Records are delimited by "\t|\n" (tab, vertical bar, and newline) characters. Each record consists of one or more fields delimited by "\t|\t" (tab, vertical bar, and tab) characters.
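A minimal sketch of splitting one .dmp record into its fields, assuming a tab, vertical bar, newline record terminator and a tab, vertical bar, tab field separator. The helper name `parse_dmp_line_sketch` is hypothetical and the sample record is illustrative and truncated; neither is taken from the source.

```python
def parse_dmp_line_sketch(line):
    """Hypothetical helper: split one .dmp record into a list of fields."""
    # Remove the newline, then the trailing "\t|" that closes the record
    record = line.rstrip("\n")
    if record.endswith("\t|"):
        record = record[:-2]
    # Fields are separated by tab, vertical bar, tab
    return [field.strip() for field in record.split("\t|\t")]


# Illustrative record: a taxon ID followed by a space-separated lineage
fields = parse_dmp_line_sketch("9606\t|\t131567 2759 33154 \t|\n")
print(fields)  # ['9606', '131567 2759 33154']
```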

datanator.data_source.taxon_tree.main()[source]

4.1.1.3.22. datanator.data_source.uniprot_nosql module

Author

Zhouyang Lian <zhouyang.lian@familian.life>

Author

Jonathan <jonrkarr@gmail.com>

Date

2019-04-02

Copyright

2019, Karr Lab

License

MIT

class datanator.data_source.uniprot_nosql.UniprotNoSQL(MongoDB=None, db=None, max_entries=inf, verbose=False, username=None, password=None, authSource='admin', replicaSet=None, collection_str='uniprot')[source]

Bases: datanator_query_python.util.mongo_util.MongoUtil

embl_helper(s)[source]

Process EMBL or RefSeq strings into a list in a standard format. “NP_796298.2 [E9PXF8-1];XP_006507989.1 [E9PXF8-2];” -> [‘NP_796298.2’, ‘XP_006507989.1’].

Parameters

s (pandas.Dataframe) – object to be processed.

Returns

list of processed strings

Return type

(list of str)
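The transformation in the example above can be sketched with a split on semicolons and a regular expression that drops the bracketed isoform tag. The helper name `strip_isoform_tags_sketch` is hypothetical, and the sketch takes a plain string as in the docstring example rather than a pandas object.

```python
import re


def strip_isoform_tags_sketch(s):
    """Hypothetical helper: 'ID [isoform];ID [isoform];' -> ['ID', 'ID']."""
    entries = []
    for item in s.split(";"):
        # Remove a bracketed annotation such as "[E9PXF8-1]" and the space before it
        cleaned = re.sub(r"\s*\[[^\]]*\]", "", item).strip()
        if cleaned:  # skip the empty piece after the final semicolon
            entries.append(cleaned)
    return entries


result = strip_isoform_tags_sketch("NP_796298.2 [E9PXF8-1];XP_006507989.1 [E9PXF8-2];")
print(result)  # ['NP_796298.2', 'XP_006507989.1']
```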

fill_abundance_publication(start=0)[source]

(https://github.com/KarrLab/datanator/issues/51)

Parameters

start (int, optional) – beginning of documents.

fill_ko_name()[source]
fill_reactions(start=0)[source]

Fill reactions in which the protein acts as a catalyst.

Parameters

start (int, optional) – Starting document in sabio_rk. Defaults to 0.

fill_species_info()[source]

Fill ancestor information.

fill_species_name()[source]
fill_species_name_new(start=0)[source]

Some documents don’t have a species_name field.

load_abundance_from_pax()[source]

Load protein abundance data by iterating over the pax collection.

load_df(df)[source]
load_uniprot(query=False, msg='', species=None)[source]

Build dataframe

Parameters
  • query (bool, optional) – Whether to download all reviewed entries or perform individual queries. Defaults to False.

  • msg (str, optional) – Query message. Defaults to ‘’.

  • species (list, optional) – species information to extract from df and load into uniprot. Defaults to None.

remove_redudant_id()[source]

Remove redundant entries in uniprot collection. Priority:

  1. Has ‘abundances’ field

  1. Has ‘ko_name’ field

  2. Has ‘kegg_org_gene’ field

  3. Has ‘orthologs’ field

datanator.data_source.uniprot_nosql.main()[source]

4.1.1.3.23. Module contents