4.1.1.5. datanator.util package¶

4.1.1.5.1. Submodules¶

4.1.1.5.2. datanator.util.base26 module¶

Fork from git@github.com:mnowotka/chembl_ikey.git

datanator.util.base26.base26_dublet_for_bits_28_to_36(a)[source]¶

datanator.util.base26.base26_dublet_for_bits_56_to_64(a)[source]¶

datanator.util.base26.base26_triplet_1(a)[source]¶

datanator.util.base26.base26_triplet_2(a)[source]¶

datanator.util.base26.base26_triplet_3(a)[source]¶

datanator.util.base26.base26_triplet_4(a)[source]¶

4.1.1.5.3. datanator.util.build_util module¶

datanator.util.build_util.continuousload(method)[source]¶

datanator.util.build_util.timeloadcontent(method)[source]¶

datanator.util.build_util.timemethod(method)[source]¶

4.1.1.5.4. datanator.util.calc_tanimoto module¶

class datanator.util.calc_tanimoto.CalcTanimoto(cache_dirname=None, MongoDB=None, replicaSet=None, db=None, verbose=True, max_entries=inf, username=None, password=None, authSource='admin')[source]¶

Bases: datanator_query_python.util.mongo_util.MongoUtil

Calculates the Tanimoto similarity matrix for two compound collections, e.g. ECMDB and YMDB

get_tanimoto(mol1, mol2, str_format='inchi', rounding=3)[source]¶

Calculates tanimoto coefficients between two molecules, mol1 and mol2

Parameters

mol1 – molecule 1 in some format
mol2 – molecule 2 in same format as molecule 1
str_format – format of the molecular representation; supported formats are those provided by Pybel
rounding – rounding of the final results

Returns

tani – rounded Tanimoto coefficient
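For reference, the Tanimoto coefficient of two binary fingerprints is the number of on-bits they share divided by the number of bits set in either one. The following is a minimal pure-Python sketch, assuming fingerprints are given as sets of on-bit indices. The name `tanimoto` and this representation are illustrative only; datanator itself delegates fingerprinting to Pybel.

```python
def tanimoto(bits_a, bits_b, rounding=3):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices.

    Hypothetical helper for illustration; not the datanator implementation.
    """
    bits_a, bits_b = set(bits_a), set(bits_b)
    union = bits_a | bits_b
    if not union:
        # Two empty fingerprints: this sketch defines their similarity as 0.0.
        return 0.0
    return round(len(bits_a & bits_b) / len(union), rounding)


# 2 shared bits out of 5 distinct bits
print(tanimoto({1, 4, 7, 9}, {1, 4, 8}))
```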

many_to_many(collection_str1='metabolites_meta', collection_str2='metabolites_meta', field1='inchi', field2='inchi', lookup1='InChI_Key', lookup2='InChI_Key', num=100)[source]¶

Go through collection_str1 and assign each compound its top ‘num’ most similar compounds in collection_str2

Parameters

collection_str1 – collection from which each compound is drawn
collection_str2 – collection against which comparisons are made
field1 – field of interest in collection_str1
field2 – field of interest in collection_str2
num – number of most similar compounds to keep
batch_size – batch size for each server round trip

one_to_many(inchi, collection_str='metabolites_meta', field='inchi', lookup='InChI_Key', num=100)[source]¶

Calculate Tanimoto coefficients between one metabolite and the rest of ‘collection_str’

Parameters

inchi – chosen chemical compound in InChI format
collection_str – collection in which comparisons are made
field – field that holds the chemical structure
lookup – field that has been previously indexed
num – maximum number of compounds to return, sorted by Tanimoto coefficient

Returns

sorted_coeff – sorted numpy array of the top num Tanimoto coefficients
sorted_inchi – sorted top num InChI strings
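The ranking step can be sketched independently of MongoDB. The example below assumes the coefficients have already been computed and are held in a dict keyed by InChI string. `top_similar` is a hypothetical name, and the sketch returns plain lists rather than the numpy array described above.

```python
import heapq


def top_similar(coefficients, num=100):
    """Return (sorted_coeff, sorted_inchi) for the `num` highest coefficients.

    `coefficients` maps an InChI string to its Tanimoto coefficient.
    Hypothetical helper for illustration; not the datanator implementation.
    """
    # nlargest returns the items ordered from highest to lowest coefficient.
    best = heapq.nlargest(num, coefficients.items(), key=lambda item: item[1])
    sorted_inchi = [inchi for inchi, _ in best]
    sorted_coeff = [coeff for _, coeff in best]
    return sorted_coeff, sorted_inchi


scores = {
    'InChI=1S/H2O/h1H2': 0.2,
    'InChI=1S/CH4/h1H4': 0.9,
    'InChI=1S/H3N/h1H3': 0.5,
}
print(top_similar(scores, num=2))
```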

datanator.util.calc_tanimoto.main()[source]¶

4.1.1.5.5. datanator.util.chem_util module¶

class datanator.util.chem_util.ChemUtil[source]¶

Bases: object

get_sha256(text)[source]¶

hash_inchi(inchi='InChI = None')[source]¶: Hash inchi string using sha224

inchi_to_inchikey(szINCHISource)[source]¶: fork from git@github.com:mnowotka/chembl_ikey.git

simplify_inchi(inchi='InChI = None')[source]¶: Remove the molecule’s protonation state, e.g. “InChI=1S/H2O/h1H2” => “InChI=1S/H2O”
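The two InChI helpers above can be approximated with the standard library. This is a sketch under an assumption drawn from the docstring example: that removing the protonation state means truncating the string at the hydrogen (/h) layer. The function names are hypothetical, and the actual ChemUtil methods may handle more cases.

```python
import hashlib


def simplify_inchi_sketch(inchi):
    """Drop the hydrogen (/h) layer and everything after it.

    Assumption: truncation at '/h' is what "removing the protonation
    state" means, as the docstring example suggests.
    """
    return inchi.split('/h', 1)[0]


def hash_inchi_sketch(inchi):
    """Return the SHA-224 hex digest of an InChI string."""
    return hashlib.sha224(inchi.encode('utf-8')).hexdigest()


simplified = simplify_inchi_sketch('InChI=1S/H2O/h1H2')
print(simplified)
print(hash_inchi_sketch(simplified))
```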

4.1.1.5.6. datanator.util.constants module¶

4.1.1.5.7. datanator.util.file_util module¶

class datanator.util.file_util.FileUtil[source]¶

Bases: object

access_dict_by_index(_dict, count)[source]¶

Assuming the dictionary has an order, return its first count items

Parameters

_dict – dictionary, e.g. { ‘a’:1, ‘b’:2, ‘c’:3, … }
count – number of items to return

Returns

result – a dictionary with the first count items from _dict, e.g. {‘a’:1}
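A minimal sketch of the same idea, relying on the fact that Python 3.7+ dictionaries preserve insertion order. `first_items` is a hypothetical name, not the FileUtil method itself.

```python
from itertools import islice


def first_items(_dict, count):
    """Return a new dict holding the first `count` items of `_dict`."""
    # islice stops after `count` items without copying the whole dict.
    return dict(islice(_dict.items(), count))


print(first_items({'a': 1, 'b': 2, 'c': 3}, 1))
```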

exists_key_value_pair(dictionary, k, v)[source]¶

Test if a key/value pair exists in a dictionary

Parameters

dictionary (dict) – dictionary to be checked
k (str) – key to be matched
v – value to be matched

Returns

True if the key/value pair exists, False otherwise

Return type

bool

extract_values(obj, key)[source]¶: Pull all values of specified key from nested JSON.

flatten_json(nested_json)[source]¶

Flatten json object with nested keys into a single level, e.g.

{a: b, c: [{d: e}, {f: g}]} => {a: b, d: e, f: g}

Parameters: nested_json – A nested json object.
Returns: The flattened json object if successful, None otherwise.
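One way to obtain the behaviour shown in the docstring example is a recursive walk that keeps only leaf key/value pairs. This is a sketch, not the FileUtil implementation: `flatten` is a hypothetical name, later duplicate keys overwrite earlier ones, and the real method may instead build compound key names from the path.

```python
def flatten(nested):
    """Collect every leaf key/value pair of a nested JSON-like object."""
    flat = {}

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if isinstance(value, (dict, list)):
                    walk(value)          # descend; the container key is dropped
                else:
                    flat[key] = value    # leaf: keep the pair as-is
        elif isinstance(node, list):
            for item in node:
                walk(item)               # scalars directly in a list have no key and are skipped

    walk(nested)
    return flat


print(flatten({'a': 'b', 'c': [{'d': 'e'}, {'f': 'g'}]}))
```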

get_common(list1, list2)[source]¶

Given two lists, find the closest common ancestor

Parameters

list1 – first list, e.g. [a, b, c, f, g]
list2 – second list, e.g. [a, b, d, e]

Returns

result – the closest common ancestor; in the above example this would be b
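A short sketch of the lookup, assuming both lists are ordered from the root downwards so that the closest common ancestor is the last element of their shared prefix. `closest_common` is a hypothetical name.

```python
def closest_common(list1, list2):
    """Return the last element of the common prefix of two lists, or None."""
    common = None
    for left, right in zip(list1, list2):
        if left != right:
            break            # the lineages diverge here
        common = left
    return common


print(closest_common(['a', 'b', 'c', 'f', 'g'], ['a', 'b', 'd', 'e']))
```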

get_val_from_dict_list(dict_list, key)[source]¶

Get values for key from a list of dictionaries

Parameters

dict_list (list of dict) – list of dictionaries to query
key (str) – key for which to get the value

Returns

list of values

Return type

list

make_dict(keys, values)[source]¶

Given two lists, make a dictionary

Parameters

keys – list of keys, e.g. [a, b, c, d, …]
values – list of values, e.g. [1, 2, 3, 4]

Returns

dic – dictionary, e.g. {‘a’: 1, ‘b’: 2, ‘c’: 3, …}

merge_dict(dicts)[source]¶

Merge a list of dictionaries

Parameters: dicts (list of dict) – list of dictionaries
Returns: merged dictionary
Return type: dict

replace_dict_key(_dict, replacements)[source]¶

Replace the keys in a dictionary, in order, with the keys in replacements, e.g. {‘a’: 0, ‘b’: 1, ‘c’: 2}, [‘d’, ‘e’, ‘f’] => {‘d’: 0, ‘e’: 1, ‘f’: 2}

Parameters

_dict – dictionary whose keys are to be replaced
replacements – list of replacement keys

Returns

result – dictionary with replaced keys
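The positional replacement in the example can be sketched with `zip`. `replace_keys` is a hypothetical name; note that `zip` silently truncates to the shorter input, which may differ from how the FileUtil method treats mismatched lengths.

```python
def replace_keys(_dict, replacements):
    """Pair each replacement key with the value at the same position."""
    return dict(zip(replacements, _dict.values()))


print(replace_keys({'a': 0, 'b': 1, 'c': 2}, ['d', 'e', 'f']))
```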

replace_list_dict_key(_list, replacements)[source]¶

Replace the keys in a list of dictionaries, in order, with the keys in replacements, e.g. [{‘a’: 0}, {‘b’: 1}, {‘c’: 2}], [‘d’, ‘e’, ‘f’] => [{‘d’: 0}, {‘e’: 1}, {‘f’: 2}]

Parameters

_list (list of dict) – list of dictionaries whose keys are to be replaced
replacements (list) – list of replacement keys

Returns

list of dictionaries with replaced keys

Return type

list of dict

search_dict_list(dict_list, key, value='')[source]¶

Find the dictionary with key/value pair in a list of dictionaries

Parameters

dict_list (list of dict) – list of dictionaries
key (str) – key in the dictionary
value – value to be matched; if value is None, only search for key

Returns

list of dictionaries with the key/value pair

Return type

list of dict
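The search reduces to a filter over the list. The sketch below follows the docstring in treating `None` as "match on key only"; the documented signature defaults `value` to an empty string, so the real method may treat that case differently. `search` is a hypothetical name.

```python
def search(dict_list, key, value=None):
    """Return the dictionaries that contain `key` (and `value`, if given)."""
    if value is None:
        return [d for d in dict_list if key in d]
    return [d for d in dict_list if key in d and d[key] == value]


records = [{'name': 'ATP', 'charge': -4}, {'name': 'water'}, {'id': 7}]
print(search(records, 'name'))
print(search(records, 'name', 'water'))
```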

unpack_list(_list)[source]¶

Unpack sublists in a list

Parameters

_list – a list containing sublists, e.g. [ […], […], … ]

Returns

result – unpacked list, e.g. [ …. ]
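Unpacking one level of nesting is a standard `itertools` idiom. This sketch assumes every element of the input is itself iterable; `unpack` is a hypothetical name.

```python
from itertools import chain


def unpack(_list):
    """Concatenate the sublists of `_list` into one flat list (one level deep)."""
    return list(chain.from_iterable(_list))


print(unpack([[1, 2], [3], []]))
```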

unzip_file(url, directory)[source]¶

Unzip a zip file into directory

Parameters

url (str) – url for the zip file
directory (str) – directory into which files will be unzipped

4.1.1.5.8. datanator.util.index_collection module¶

Index collections in MongoDB accordingly

class datanator.util.index_collection.IndexCollection(cache_dirname=None, MongoDB=None, replicaSet=None, db=None, verbose=False, max_entries=inf, username=None, password=None, authSource='admin')[source]¶

Bases: datanator.util.mongo_util.MongoUtil

index_corum(collection_str)[source]¶: Index fields in corum collection

index_intact_complex(collection_str='intact_complex')[source]¶: Index intact_complex collection

index_metabolites_meta(collection_str='metabolites_meta')[source]¶: Index metabolites_meta collection

index_pax(collection_str='pax')[source]¶: Index Pax collection

index_sabio(collection_str='sabio_rk')[source]¶: Index relevant fields in sabio_rk collection

index_strdb(collection_str='ecmdb')[source]¶: Index relevant fields in string only collections: ecmdb, ymdb, and intact_interaction

index_uniprot(collection_str='uniprot')[source]¶: Index uniprot collection

datanator.util.index_collection.main()[source]¶

4.1.1.5.9. datanator.util.molecule_util module¶

Utilities for dealing with molecules

Author: Yosef Roth <yosefdroth@gmail.com>
Author: Jonathan <jonrkarr@gmail.com>
Date: 2017-04-12
Copyright: 2017, Karr Lab
License: MIT

class datanator.util.molecule_util.InchiMolecule(structure)[source]¶

Bases: object

Represents the InChI-encoded structure of a molecule

formula[source]¶

empirical formula layer

Type: str

connections[source]¶

atomic connections (c) layer

Type: str

hydrogens[source]¶

hydrogen (h) layer

Type: str

protons[source]¶

proton (p) layer

Type: str

charge[source]¶

charge (q) layer

Type: str

double_bonds[source]¶

double bonds (b) layer

Type: str

stereochemistry[source]¶

stereochemistry (t) layer

Type: str

stereochemistry_parity[source]¶

stereochemistry parity (m) layer

Type: str

stereochemistry_type[source]¶

stereochemistry type (s) layer

Type: str

isotopes[source]¶

isotope (i) layer

Type: str

fixed_hydrogens[source]¶

fixed hydrogens (f) layer

Type: str

reconnected_metals[source]¶

reconnected metal (r) layer

Type: str

LAYERS[source]¶

dictionary of layer prefixes and names

Type: dict

LAYERS = {'': 'formula', 'b': 'double_bonds', 'c': 'connections', 'f': 'fixed_hydrogens', 'h': 'hydrogens', 'i': 'isotopes', 'm': 'stereochemistry_parity', 'p': 'protons', 'q': 'charge', 'r': 'reconnected_metals', 's': 'stereochemistry_type', 't': 'stereochemistry'}[source]

__str__()[source]¶

Generate an InChI string representation of the molecule

Returns: InChI string representation of the molecule
Return type: str

get_formula_and_connectivity()[source]¶

Get a string representation of the formula and connectivity

Returns: string representation of the formula and connectivity
Return type: str

is_equal(other, check_protonation=True, check_double_bonds=True, check_stereochemistry=True, check_isotopes=True, check_fixed_hydrogens=True, check_reconnected_metals=True)[source]¶

Determine if two molecules are semantically equal (all of their layers are equal).

Parameters

other (InchiMolecule) – other molecule
check_protonation (bool, optional) – if True, check that the protonation states (h, p, q) are equal
check_double_bonds (bool, optional) – if True, check that the double bond layers (b) are equal
check_stereochemistry (bool, optional) – if True, check that the stereochemistry layers (t, m, s) are equal
check_isotopes (bool, optional) – if True, check that the isotopic layers (i) are equal
check_fixed_hydrogens (bool, optional) – if True, check that the fixed hydrogen layers (f) are equal
check_reconnected_metals (bool, optional) – if True, check that the reconnected metals layers (r) are equal

Returns

True if the molecules are semantically equal

Return type

bool

is_protonation_isomer(other)[source]¶

Determine if two molecules are protonation isomers

Parameters: other (InchiMolecule) – other molecule
Returns: True if the molecules are protonation isomers
Return type: bool

is_stereoisomer(other)[source]¶

Determine if two molecules are stereoisomers

Parameters: other (InchiMolecule) – other molecule
Returns: True if the molecules are stereoisomers
Return type: bool

is_tautomer(other)[source]¶

Determine if two molecules are tautomers

Parameters: other (InchiMolecule) – other molecule
Returns: True if the molecules are tautomers
Return type: bool

remove_layer(layer)[source]¶

Remove a layer from a structure

Parameters: layer (str) – name of the layer
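The layer-based comparisons above can be illustrated with a small parser that uses the LAYERS prefix table. This is a simplified sketch, not the InchiMolecule implementation: `parse_layers` and `same_ignoring_protonation` are hypothetical names, and the sketch ignores the sub-layers that can follow the fixed hydrogen (f) and reconnected metal (r) layers.

```python
LAYERS = {
    '': 'formula', 'b': 'double_bonds', 'c': 'connections',
    'f': 'fixed_hydrogens', 'h': 'hydrogens', 'i': 'isotopes',
    'm': 'stereochemistry_parity', 'p': 'protons', 'q': 'charge',
    'r': 'reconnected_metals', 's': 'stereochemistry_type',
    't': 'stereochemistry',
}
PROTONATION_LAYERS = ('hydrogens', 'protons', 'charge')  # the h, p and q layers


def parse_layers(inchi):
    """Split an InChI string into a {layer name: value} dict."""
    layers = {}
    for part in inchi.split('/')[1:]:   # skip the 'InChI=1S' version prefix
        if not part:
            continue
        # Layers start with a lowercase prefix; the formula has no prefix.
        prefix = part[0] if part[0] in LAYERS else ''
        layers[LAYERS[prefix]] = part[1:] if prefix else part
    return layers


def same_ignoring_protonation(inchi_a, inchi_b):
    """Compare two structures on every parsed layer except h, p and q."""
    def strip(layers):
        return {k: v for k, v in layers.items() if k not in PROTONATION_LAYERS}
    return strip(parse_layers(inchi_a)) == strip(parse_layers(inchi_b))


ethanol = 'InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3'
water = 'InChI=1S/H2O/h1H2'
hydroxide = 'InChI=1S/H2O/h1H2/p-1'
print(parse_layers(ethanol))
print(same_ignoring_protonation(water, hydroxide))
```

The second comparison corresponds roughly to calling is_equal with check_protonation=False.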

class datanator.util.molecule_util.Molecule(id='', name='', structure='', cross_references=None)[source]¶

Bases: object

Represents a molecule

id[source]¶

identifier

Type: str

name[source]¶

name

Type: str

structure[source]¶

structure in InChI, MOL, or canonical SMILES format

Type: str

cross_references[source]¶

list of cross references

Type: list of CrossReference

get_fingerprint(type='fp2')[source]¶

Calculate a fingerprint

Parameters: type (str, optional) – fingerprint type to calculate
Returns: fingerprint
Return type: pybel.Fingerprint

static get_fingerprint_types()[source]¶

Get list of fingerprint types

Returns: list of fingerprint types
Return type: list of str

get_format()[source]¶

Get the format of the structure

Returns: format
Return type: str

get_similarity(other, fingerprint_type='fp2')[source]¶

Calculate the similarity with another molecule

Parameters

other (Molecule) – a second molecule
fingerprint_type (str, optional) – fingerprint type to use to calculate similarity

Returns

the similarity with the other molecule

Return type

float

to_format(format)[source]¶

Get the structure in a format

Parameters: format (str) – format such as inchi, mol, smiles

Returns: structure in a format
Return type: str

to_inchi()[source]¶

Get the structure in InChI format

Returns: structure in InChI format
Return type: str

to_mol()[source]¶

Get the structure in MOL format

Returns: structure in MOL format
Return type: str

to_openbabel()[source]¶

Create an Open Babel molecule for the molecule

Returns: Open Babel molecule
Return type: openbabel.OBMol

to_pybel()[source]¶

Create a pybel molecule for the molecule

Returns: pybel molecule
Return type: pybel.Molecule

to_smiles()[source]¶

Get the structure in SMILES format

Returns: structure in SMILES format
Return type: str

4.1.1.5.10. datanator.util.mongo_util module¶

class datanator.util.mongo_util.MongoUtil(cache_dirname=None, MongoDB=None, replicaSet=None, db='test', verbose=False, max_entries=inf, username=None, password=None, authSource='admin', readPreference='nearest')[source]¶

Bases: object

con_db(collection_str)[source]¶

fill_db(collection_str)[source]¶

Check if collection is already in MongoDB

If the collection is already in MongoDB, do nothing; otherwise, load the data into the database from quiltdata (karrlab/datanator).

Parameters: collection_str – name of collection (e.g. ‘ecmdb’, ‘pax’, etc)

flatten_collection(collection_str)[source]¶

Flatten a collection

c is omitted because it does not have a non-object value associated with it

list_all_collections()[source]¶: List all non-system collections within database

print_schema(collection_str)[source]¶: Print out the schema of a collection. ‘_id’ is removed from the collection due to its object type and universality.

4.1.1.5.11. datanator.util.reaction_util module¶

Utilities for dealing with reactions

Author: Yosef Roth <yosefdroth@gmail.com>
Author: Jonathan <jonrkarr@gmail.com>
Date: 2017-04-13
Copyright: 2017, Karr Lab
License: MIT

datanator.util.reaction_util.calc_reactant_product_pairs(reaction)[source]¶

Get list of pairs of similar reactants and products using a greedy algorithm.

Parameters: reaction (data_model.Reaction) – reaction
Returns: list of pairs of similar reactants and products
Return type: list of tuple of (data_model.Specie, data_model.Specie)
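A greedy pairing of this kind can be sketched as follows: score every reactant/product combination, then repeatedly accept the highest-scoring pair whose members are both still unmatched. This is an illustration only. The function name, the plain-string species, and the character-overlap similarity are assumptions made for the example; the actual function works on data_model.Specie objects, and how it scores them or treats unmatched species is not documented here.

```python
def greedy_pairs(reactants, products, similarity):
    """Greedily pair reactants with products by descending similarity."""
    scored = sorted(
        ((similarity(r, p), i, j)
         for i, r in enumerate(reactants)
         for j, p in enumerate(products)),
        key=lambda entry: entry[0],
        reverse=True,
    )
    used_reactants, used_products, pairs = set(), set(), []
    for _, i, j in scored:
        if i not in used_reactants and j not in used_products:
            used_reactants.add(i)
            used_products.add(j)
            pairs.append((reactants[i], products[j]))
    return pairs  # species left without a partner are dropped in this sketch


def char_overlap(a, b):
    """Toy similarity: Jaccard index of the characters in two names."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)


print(greedy_pairs(['ATP', 'H2O'], ['ADP', 'H3PO4'], char_overlap))
```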

4.1.1.5.12. datanator.util.rna_halflife_util module¶

class datanator.util.rna_halflife_util.RnaHLUtil(server=None, username=None, password=None, src_db=None, des_db=None, protein_col=None, rna_col=None, authDB='admin', readPreference=None, max_entries=inf, verbose=False, cache_dir=None)[source]¶

Bases: datanator_query_python.util.mongo_util.MongoUtil

fill_uniprot_by_embl(embl, species=None)[source]¶

Fill uniprot collection using EMBL data

Parameters

embl (str) – sequence embl data
species (list) – NCBI Taxonomy ID of the species

fill_uniprot_by_gn(gene_name, species=None)[source]¶

Fill uniprot collection using gene name

Parameters

gene_name (str) – gene name
species (list) – NCBI Taxonomy ID of the species

fill_uniprot_by_oln(oln, species=None)[source]¶

Fill uniprot collection using ordered locus name

Parameters

oln (str) – Ordered locus name
species (list) – NCBI Taxonomy ID of the species

fill_uniprot_with_df(df, identifier, identifier_type='oln', species=None)[source]¶

Fill uniprot collection with ordered_locus_name from excel sheet

Parameters

df (pandas.DataFrame) – dataframe to be inserted into uniprot collection. Assumes df conforms to the schemas required by the load_uniprot function in uniprot.py.
identifier (str) – name of column that stores ordered locus name information.
identifier_type (str) – type of identifier, i.e. ‘oln’, ‘gene_name’
species (list) – NCBI Taxonomy ID of the species.

make_df(url, sheet_name, header=0, names=None, usecols=None, skiprows=None, nrows=None, na_values=None, file_type='xlsx', file_name=None)[source]¶

Read online excel file as dataframe

Parameters

url (str) – excel file url
sheet_name (str) – name of sheet in xlsx
header (int) – Row (0-indexed) to use for the column labels of the parsed DataFrame.
names (list) – list of column names to use
usecols (int or list or str) – Return a subset of the columns.
nrows (int) – number of rows to parse. Defaults to None.
file_type (str) – downloaded file type. Defaults to xlsx.
file_name (str) – name of the file of interest.

Returns

xlsx transformed to pandas.DataFrame

Return type

(pandas.DataFrame)

uniprot_names(results, count)[source]¶

Extract protein_name and gene_name from returned tuple of uniprot query function

Parameters

results (Iter) – pymongo cursor object.
count (int) – Number of documents found.

Returns

gene_name and protein_name

Return type

(tuple of str)

4.1.1.5.13. datanator.util.rna_seq_util module¶

Utilities for RNA-seq data

Author: Jonathan Karr <jonrkarr@gmail.com>
Author: Yosef Roth <yosefdroth@gmail.com>
Date: 2018-01-15
License: MIT

class datanator.util.rna_seq_util.Kallisto[source]¶

Bases: object

Python interface to kallisto.

index(fasta_filenames, index_filename=None, kmer_size=31, make_unique=False)[source]¶

Generate index from FASTA files

Parameters

fasta_filenames (list of str) – paths to FASTA files
index_filename (str, optional) – path to the kallisto index file to be created
kmer_size (int, optional) – k-mer length
make_unique (bool, optional) – if True, replace repeated target names with unique names

quant(fastq_filenames, index_filename=None, output_dirname=None, bias=False, bootstrap_samples=0, seed=42, plaintext=False, fusion=False, single_end_reads=False, forward_stranded=False, reverse_stranded=False, fragment_length=None, fragment_length_std=None, threads=1, pseudobam=False)[source]¶

Process RNA-seq FASTQ files

Parameters

fastq_filenames (list of str) – paths to FASTQ files
index_filename (str, optional) – path to the kallisto index file to be used for quantification
output_dirname (str, optional) – path to the output directory
single_end_reads (bool, optional) – if True, quantify single-end reads
fragment_length (float, optional) – estimated average fragment length
fragment_length_std (float, optional) – estimated standard deviation of fragment length
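For orientation, the documented parameters correspond to options of the `kallisto quant` command line. The sketch below only assembles an argument list and does not run anything. The flags used (`-i`, `-o`, `-t`, `--single`, `-l`, `-s`) are kallisto command-line options, but `build_quant_command` is a hypothetical helper; how the Kallisto class actually assembles and executes the command is not shown in this reference.

```python
def build_quant_command(fastq_filenames, index_filename, output_dirname,
                        single_end_reads=False, fragment_length=None,
                        fragment_length_std=None, threads=1):
    """Assemble a `kallisto quant` argument list (illustrative only)."""
    cmd = ['kallisto', 'quant',
           '-i', index_filename,
           '-o', output_dirname,
           '-t', str(threads)]
    if single_end_reads:
        cmd.append('--single')
    if fragment_length is not None:
        cmd += ['-l', str(fragment_length)]      # estimated average fragment length
    if fragment_length_std is not None:
        cmd += ['-s', str(fragment_length_std)]  # estimated standard deviation
    cmd.extend(fastq_filenames)
    return cmd


print(build_quant_command(['reads.fastq'], 'transcripts.idx', 'out',
                          single_end_reads=True,
                          fragment_length=200, fragment_length_std=20))
```

The resulting list could be passed to `subprocess.run` on a machine where kallisto is installed.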

4.1.1.5.14. datanator.util.taxonomy_util module¶

Utilities for dealing with taxa

Author: Yosef Roth <yosefdroth@gmail.com>
Author: Jonathan <jonrkarr@gmail.com>
Date: 2017-04-11
License: MIT

class datanator.util.taxonomy_util.Taxon(id='', name='', ncbi_id=None, cross_references=None)[source]¶

Bases: object

Represents a taxon such as a genus, species, or strain

id[source]¶

identifier

Type: str

name[source]¶

name of the taxon

Type: str

id_of_nearest_ncbi_taxon[source]¶

ID of the nearest parent taxon which is in the NCBI database

Type: int

distance_from_nearest_ncbi_taxon[source]¶

distance from the taxon to its nearest parent which is in the NCBI database

Type: int

additional_name_beyond_nearest_ncbi_taxon[source]¶

additional part of the taxon’s name beyond that of its nearest parent in the NCBI database

Type: str

cross_references[source]¶

list of cross references

Type: list of CrossReference

get_common_ancestor(other)[source]¶

Get the latest common ancestor of two taxa

Parameters: other (Taxon) – a second taxon
Returns: latest common ancestor
Return type: Taxon

get_distance_to_common_ancestor(other)[source]¶

Calculate the number of links in the NCBI taxonomic tree between two taxa and their latest common ancestor

Note: This distance depends on the granularity of the lineage of the taxon. For example, there are only 7 links between most bacteria species and the Bacteria superkingdom. However, there are 28 links between the Homo sapiens species and the Eukaryota superkingdom.

Parameters

other (Taxon) – a second taxon

Returns

number of links between self and its latest common ancestor with other in the NCBI taxonomic tree

Return type

int
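The distance can be illustrated with lineages written as lists ordered from the root to the taxon itself. The function name and the abbreviated toy lineages below are assumptions for the example; they are not real NCBI lineages, and the Taxon class obtains lineages from the NCBI database rather than from lists.

```python
def distance_to_common_ancestor(lineage_self, lineage_other):
    """Number of links from the last taxon in `lineage_self` up to the
    latest ancestor it shares with `lineage_other`.

    Both lineages are ordered root first, taxon last.
    """
    shared = 0
    for a, b in zip(lineage_self, lineage_other):
        if a != b:
            break
        shared += 1
    if shared == 0:
        raise ValueError('lineages share no common ancestor')
    # The common ancestor sits at index shared - 1 and the taxon at
    # index len - 1, so the number of links is their difference.
    return len(lineage_self) - shared


# Abbreviated toy lineages, not the full NCBI ranks
e_coli = ['root', 'Bacteria', 'Proteobacteria', 'Escherichia', 'Escherichia coli']
b_subtilis = ['root', 'Bacteria', 'Firmicutes', 'Bacillus', 'Bacillus subtilis']
print(distance_to_common_ancestor(e_coli, b_subtilis))
```

Because the result counts list entries, a lineage recorded at finer granularity yields a larger distance to the same ancestor, which is the effect the note above describes.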

get_distance_to_root()[source]¶

Get the distance from the taxon to the root of the NCBI taxonomy tree

Returns: distance from the taxon to the root
Return type: int

get_max_distance_to_common_ancestor()[source]¶

Get the maximum distance from the taxon to a common ancestor with another taxon

Returns: maximum distance from the taxon to a common ancestor with another taxon
Return type: int

get_ncbi_id()[source]¶

Get the ID of the taxon within the NCBI database

Returns

ID of the taxon within the NCBI database or None if the taxon isn’t in the NCBI database

Return type

int or None

get_parent_taxa()[source]¶

Get parent taxa

Returns: list of parent taxa
Return type: list of Taxon

get_rank()[source]¶

Get the rank of the taxon

Returns: rank of the taxon
Return type: str

datanator.util.taxonomy_util.setup_database(force_update=False)[source]¶

Set up a local SQLite copy of the NCBI Taxonomy database. If force_update is False, download the content from NCBI and build the SQLite database only if a local database doesn’t already exist. If force_update is True, always download the content from NCBI and rebuild the SQLite copy of the database.

Parameters

force_update (bool, optional) –

False: download the content for the database and build a local SQLite database only if a local SQLite copy of the database doesn’t already exist
True: always download the content for the database from NCBI and rebuild the local SQLite database
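The force_update behaviour follows a common "build if missing, rebuild on request" pattern, sketched generically below. The function `ensure_database`, its `db_path` argument, and the `build` callable are hypothetical; the actual setup_database takes only force_update and performs the download and build itself.

```python
from pathlib import Path


def ensure_database(db_path, build, force_update=False):
    """Call `build(db_path)` if the file is missing or an update is forced.

    Returns True if the database was (re)built, False if the existing
    copy was kept.
    """
    db_path = Path(db_path)
    if force_update or not db_path.exists():
        build(db_path)
        return True
    return False


built = []  # records each path the stand-in builder is asked to create
missing = Path('no_such_directory_for_this_sketch') / 'taxa.sqlite'
print(ensure_database(missing, built.append))                   # missing, so it is built
print(ensure_database(Path('.'), built.append))                 # path exists, so it is kept
print(ensure_database(Path('.'), built.append, force_update=True))
```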

4.1.1.5.15. datanator.util.warning_util module¶

Warning utilities

Author: Yosef Roth <yosefdroth@gmail.com>
Author: Jonathan Karr <jonrkarr@gmail.com>
Date: 2017-04-13
License: MIT

datanator.util.warning_util.disable_warnings()[source]¶: Disable warning messages from openbabel and urllib

datanator.util.warning_util.enable_warnings()[source]¶: Enable warning messages from openbabel and urllib

4.1.1.5.16. Module contents¶