4.1.1.5. datanator.util package

4.1.1.5.1. Submodules

4.1.1.5.2. datanator.util.base26 module

Forked from git@github.com:mnowotka/chembl_ikey.git

datanator.util.base26.base26_dublet_for_bits_28_to_36(a)[source]
datanator.util.base26.base26_dublet_for_bits_56_to_64(a)[source]
datanator.util.base26.base26_triplet_1(a)[source]
datanator.util.base26.base26_triplet_2(a)[source]
datanator.util.base26.base26_triplet_3(a)[source]
datanator.util.base26.base26_triplet_4(a)[source]

4.1.1.5.3. datanator.util.build_util module

datanator.util.build_util.continuousload(method)[source]
datanator.util.build_util.timeloadcontent(method)[source]
datanator.util.build_util.timemethod(method)[source]

4.1.1.5.4. datanator.util.calc_tanimoto module

class datanator.util.calc_tanimoto.CalcTanimoto(cache_dirname=None, MongoDB=None, replicaSet=None, db=None, verbose=True, max_entries=inf, username=None, password=None, authSource='admin')[source]

Bases: datanator_query_python.util.mongo_util.MongoUtil

Calculates the Tanimoto similarity matrix for two compound collections, e.g. ECMDB and YMDB

get_tanimoto(mol1, mol2, str_format='inchi', rounding=3)[source]

Calculates the Tanimoto coefficient between two molecules, mol1 and mol2

Parameters
  • mol1 – molecule 1 in some format

  • mol2 – molecule 2 in same format as molecule 1

  • str_format – format of the molecular representation; supported formats are provided by Pybel

  • rounding – rounding of the final results

Returns

tani: rounded Tanimoto coefficient
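As a reminder of what the coefficient measures, the sketch below computes it from two fingerprints given as plain sets of set-bit positions: shared bits divided by bits set in either fingerprint. It is an illustration of the formula only, not this method's implementation (which, per the description, relies on Pybel); the name tanimoto_from_bits is hypothetical.

```python
def tanimoto_from_bits(bits1, bits2, rounding=3):
    """Tanimoto coefficient of two fingerprints given as sets of set-bit positions."""
    bits1, bits2 = set(bits1), set(bits2)
    union = bits1 | bits2
    if not union:
        # Both fingerprints are empty; returning 0.0 is a convention chosen for this sketch.
        return 0.0
    # |A & B| / |A | B|, rounded like the `rounding` parameter documented above
    return round(len(bits1 & bits2) / len(union), rounding)
```

Identical fingerprints give 1.0 and fingerprints with no bits in common give 0.0.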

many_to_many(collection_str1='metabolites_meta', collection_str2='metabolites_meta', field1='inchi', field2='inchi', lookup1='InChI_Key', lookup2='InChI_Key', num=100)[source]

Go through collection_str1 and assign each compound the top ‘num’ most similar compounds

Parameters
  • collection_str1 – collection from which the compound is drawn

  • collection_str2 – collection in which the comparison is made

  • field1 – field of interest in collection_str1

  • field2 – field of interest in collection_str2

  • num – number of most similar compounds

  • batch_size – batch size for each server round trip

one_to_many(inchi, collection_str='metabolites_meta', field='inchi', lookup='InChI_Key', num=100)[source]

Calculate Tanimoto coefficients between one metabolite and the rest of the ‘collection_str’

Parameters
  • inchi – chosen chemical compound in InChI format

  • collection_str – collection in which comparisons are made

  • field – field that has the chemical structure

  • lookup – field that has been previously indexed

  • num – max number of compounds to be returned, sorted by Tanimoto coefficient

Returns

sorted_coeff: sorted numpy array of the top num Tanimoto coefficients; sorted_inchi: sorted top num InChI strings
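The "top num, sorted by coefficient" step can be pictured with the standard library alone. This is a hedged sketch, not the method's code: it assumes the coefficients have already been computed into a mapping from a compound identifier to its score, returns plain lists rather than numpy arrays, and the name top_similar is hypothetical.

```python
import heapq

def top_similar(scores, num=100):
    """Return (sorted_coeff, sorted_ids) for the num highest-scoring compounds.

    scores maps a compound identifier (hypothetical, e.g. an InChI string)
    to its Tanimoto coefficient against the query compound.
    """
    # nlargest returns the items in descending order of the key
    best = heapq.nlargest(num, scores.items(), key=lambda item: item[1])
    sorted_ids = [identifier for identifier, _ in best]
    sorted_coeff = [coeff for _, coeff in best]
    return sorted_coeff, sorted_ids
```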

datanator.util.calc_tanimoto.main()[source]

4.1.1.5.5. datanator.util.chem_util module

class datanator.util.chem_util.ChemUtil[source]

Bases: object

get_sha256(text)[source]
hash_inchi(inchi='InChI = None')[source]

Hash inchi string using sha224

inchi_to_inchikey(szINCHISource)[source]

Forked from git@github.com:mnowotka/chembl_ikey.git

simplify_inchi(inchi='InChI = None')[source]

Remove the molecule’s protonation state, e.g. “InChI=1S/H2O/h1H2” => “InChI=1S/H2O”
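One way to reproduce the example above is to drop the layers that carry protonation information. The sketch below assumes that means the h, p and q layers and that the first layer after the version prefix is the formula; both are assumptions, and strip_protonation_layers is a hypothetical name, not the method itself.

```python
def strip_protonation_layers(inchi):
    """Drop the h, p and q layers from an InChI string.

    Assumes the segment after the version prefix is the formula layer,
    which is always kept.
    """
    prefix, *layers = inchi.split('/')
    if not layers:
        return inchi
    formula, *rest = layers
    # Layers are identified by their one-letter lowercase prefix
    kept = [layer for layer in rest if layer[:1] not in ('h', 'p', 'q')]
    return '/'.join([prefix, formula] + kept)
```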

4.1.1.5.6. datanator.util.constants module

4.1.1.5.7. datanator.util.file_util module

class datanator.util.file_util.FileUtil[source]

Bases: object

access_dict_by_index(_dict, count)[source]

Assuming the dict has an order, return the first count elements of the dictionary

Parameters
  • _dict – { ‘a’:1, ‘b’:2, ‘c’:3, … }

  • count – number of items to return

Returns

result: a dictionary with the first count items from _dict, e.g. {‘a’:1}
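A minimal sketch of the same idea, relying on the insertion order that Python 3.7+ dictionaries preserve. The helper name first_items is hypothetical and this is not necessarily how the method is written.

```python
from itertools import islice

def first_items(_dict, count):
    """Return a new dict holding the first count items of _dict in insertion order."""
    return dict(islice(_dict.items(), count))
```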

exists_key_value_pair(dictionary, k, v)[source]

Test if a key/value pair exists in a dictionary

Parameters
  • dictionary (dict) – dictionary to be checked

  • k (str) – key to be matched

  • v – value to be matched

Returns

result: True or False

Return type

bool

extract_values(obj, key)[source]

Pull all values of specified key from nested JSON.
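A recursive walk is one common way to do this. The sketch below is an assumption about the behavior, not the method's source: it visits every dict and list, collects each value stored under the key, and keeps descending into matched values. The name extract_values_sketch is hypothetical.

```python
def extract_values_sketch(obj, key):
    """Collect every value stored under key anywhere in a nested dict/list structure."""
    found = []

    def walk(node):
        if isinstance(node, dict):
            for k, v in node.items():
                if k == key:
                    found.append(v)
                walk(v)  # keep descending; scalars are ignored by walk
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(obj)
    return found
```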

flatten_json(nested_json)[source]

Flatten json object with nested keys into a single level, e.g.

{a: b, c: [{d: e}, {f: g}]} => {a: b, d: e, f: g}

Parameters

nested_json – A nested json object.

Returns

The flattened json object if successful, None otherwise.
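The sketch below reproduces the documented example, in which {a: b, c: [{d: e}, {f: g}]} becomes {a: b, d: e, f: g}: keys whose values are containers are dropped and their leaf pairs are hoisted to the top level. How the real method handles duplicate leaf keys or scalar list items is not stated, so the choices here (last value wins, scalars inside lists ignored) are assumptions, and flatten_sketch is a hypothetical name.

```python
def flatten_sketch(nested):
    """Hoist leaf key/value pairs from nested dicts and lists into one flat dict."""
    flat = {}

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if isinstance(value, (dict, list)):
                    walk(value)        # container: descend, dropping its own key
                else:
                    flat[key] = value  # leaf: keep; a later duplicate overwrites
        elif isinstance(node, list):
            for item in node:
                walk(item)             # scalar items fall through and are ignored

    walk(nested)
    return flat
```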

get_common(list1, list2)[source]

Given two lists, find the closest common ancestor

Parameters
  • list1 – [a, b, c, f, g]

  • list2 – [a, b, d, e]

Returns

result: the closest common ancestor; in the above example it would be b
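Reading the two lists as root-first lineages, the closest common ancestor is the last element of their shared leading run. That reading is an assumption drawn from the example, and closest_common is a hypothetical name for this sketch.

```python
def closest_common(list1, list2):
    """Return the last element of the shared leading run of two lists, or None."""
    result = None
    for left, right in zip(list1, list2):
        if left != right:
            break  # lineages diverge here
        result = left
    return result
```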

get_val_from_dict_list(dict_list, key)[source]

Get values for key from a list of dictionaries

Parameters
  • dict_list (list of dict) – list of dictionaries to query

  • key (str) – key for which to get the value

Returns

results: list of values

Return type

list

make_dict(keys, values)[source]

Given two lists, make a dictionary

Parameters
  • keys – [a, b, c, d, …]

  • values – [1, 2, 3, 4]

Returns

dic: {‘a’: 1, ‘b’: 2, ‘c’: 3, …}

merge_dict(dicts)[source]

Merge a list of dictionaries

Parameters

dicts (list of dict) – list of dictionaries

Returns

result: merged dictionaries

Return type

dict
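A minimal sketch of such a merge follows. The description does not say what happens when two dictionaries share a key; this sketch assumes the later dictionary wins, and merge_dicts_sketch is a hypothetical name.

```python
def merge_dicts_sketch(dicts):
    """Merge a list of dictionaries left to right; later keys overwrite earlier ones."""
    result = {}
    for d in dicts:
        result.update(d)
    return result
```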

replace_dict_key(_dict, replacements)[source]

Replace keys in a dictionary with the order in replacements, e.g. {‘a’: 0, ‘b’: 1, ‘c’: 2}, [‘d’, ‘e’, ‘f’] => {‘d’: 0, ‘e’: 1, ‘f’: 2}

Parameters
  • _dict – dictionary whose keys are to be replaced

  • replacements – list of replacement keys

Returns

result: dictionary with replaced keys
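The positional replacement in the example can be sketched with zip. The length check is a guard added for this sketch only; the documented method may handle mismatched lengths differently, and replace_keys is a hypothetical name.

```python
def replace_keys(_dict, replacements):
    """Pair each value of _dict, in insertion order, with the key at the same
    position in replacements."""
    if len(_dict) != len(replacements):
        # Guard chosen for this sketch, not taken from the documented method.
        raise ValueError('need exactly one replacement key per existing key')
    return dict(zip(replacements, _dict.values()))
```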

replace_list_dict_key(_list, replacements)[source]

Replace keys in a list of dictionaries with the order in replacements, e.g. [{‘a’: 0}, {‘b’: 1}, {‘c’: 2}], [‘d’, ‘e’, ‘f’] => [{‘d’: 0}, {‘e’: 1}, {‘f’: 2}]

Parameters
  • _list (list of dict) – list of dictionaries whose keys are to be replaced

  • replacements (list) – list of replacement keys

Returns

result: list of dictionaries with replaced keys

Return type

list of dict

search_dict_list(dict_list, key, value='')[source]

Find the dictionary with key/value pair in a list of dictionaries

Parameters
  • dict_list (list of dict) – list of dictionaries

  • key (str) – key in the dictionary

  • value – value to be matched; if value==None, then only search for key

Returns

result: list of dictionaries with the key/value pair

Return type

list of dict
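The two search modes described above (key only, or key and value) can be sketched as follows. The sketch uses None as the "key only" marker as the parameter description says, whereas the signature above defaults to an empty string, so treat the default as an assumption; search_dicts is a hypothetical name.

```python
def search_dicts(dict_list, key, value=None):
    """Return the dictionaries that contain key, optionally also requiring
    d[key] == value."""
    if value is None:
        # Key-only search
        return [d for d in dict_list if key in d]
    return [d for d in dict_list if key in d and d[key] == value]
```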

unpack_list(_list)[source]

Unpack sublists in a list

Parameters

_list – a list containing sublists, e.g. [ […], […], … ]

Returns

result: unpacked list, e.g. [ …. ]
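A one-level flatten is the simplest reading of this description. The sketch assumes every element of the input is itself a list (or other iterable) and removes exactly one level of nesting; unpack is a hypothetical name.

```python
from itertools import chain

def unpack(_list):
    """Flatten one level of nesting: [[1, 2], [3]] -> [1, 2, 3]."""
    return list(chain.from_iterable(_list))
```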

unzip_file(url, directory)[source]

Unzip a zip file into directory

Parameters
  • url (str) – url for the zip file

  • directory (str) – directory into which files will be unzipped

4.1.1.5.8. datanator.util.index_collection module

Index collections in MongoDB accordingly

class datanator.util.index_collection.IndexCollection(cache_dirname=None, MongoDB=None, replicaSet=None, db=None, verbose=False, max_entries=inf, username=None, password=None, authSource='admin')[source]

Bases: datanator.util.mongo_util.MongoUtil

index_corum(collection_str)[source]

Index fields in corum collection

index_intact_complex(collection_str='intact_complex')[source]

Index intact_complex collection

index_metabolites_meta(collection_str='metabolites_meta')[source]

Index metabolites_meta collection

index_pax(collection_str='pax')[source]

Index Pax collection

index_sabio(collection_str='sabio_rk')[source]

Index relevant fields in sabio_rk collection

index_strdb(collection_str='ecmdb')[source]

Index relevant fields in string only collections: ecmdb, ymdb, and intact_interaction

index_uniprot(collection_str='uniprot')[source]

Index uniprot collection

datanator.util.index_collection.main()[source]

4.1.1.5.9. datanator.util.molecule_util module

Utilities for dealing with molecules

Author

Yosef Roth <yosefdroth@gmail.com>

Author

Jonathan Karr <jonrkarr@gmail.com>

Date

2017-04-12

Copyright

2017, Karr Lab

License

MIT

class datanator.util.molecule_util.InchiMolecule(structure)[source]

Bases: object

Represents the InChI-encoded structure of a molecule

formula[source]

empirical formula layer

Type

str

connections[source]

atomic connections (c) layer

Type

str

hydrogens[source]

hydrogen (h) layer

Type

str

protons[source]

proton (p) layer

Type

str

charge[source]

charge (q) layer

Type

str

double_bonds[source]

double bonds (b) layer

Type

str

stereochemistry[source]

stereochemistry (t) layer

Type

str

stereochemistry_parity[source]

stereochemistry parity (m) layer

Type

str

stereochemistry_type[source]

stereochemistry type (s) layer

Type

str

isotopes[source]

isotope (i) layer

Type

str

fixed_hydrogens[source]

fixed hydrogens (f) layer

Type

str

reconnected_metals[source]

reconnected metal (r) layer

Type

str

LAYERS[source]

dictionary of layer prefixes and names

Type

dict

LAYERS = {'': 'formula', 'b': 'double_bonds', 'c': 'connections', 'f': 'fixed_hydrogens', 'h': 'hydrogens', 'i': 'isotopes', 'm': 'stereochemistry_parity', 'p': 'protons', 'q': 'charge', 'r': 'reconnected_metals', 's': 'stereochemistry_type', 't': 'stereochemistry'}[source]
__str__()[source]

Generate an InChI string representation of the molecule

Returns

InChI string representation of the molecule

Return type

str

get_formula_and_connectivity()[source]

Get a string representation of the formula and connectivity

Returns

string representation of the formula and connectivity

Return type

str

is_equal(other, check_protonation=True, check_double_bonds=True, check_stereochemistry=True, check_isotopes=True, check_fixed_hydrogens=True, check_reconnected_metals=True)[source]

Determine if two molecules are semantically equal (all of their layers are equal).

Parameters
  • other (InchiMolecule) – other molecule

  • check_protonation (bool, optional) – if True, check that the protonation states (h, p, q) are equal

  • check_double_bonds (bool, optional) – if True, check that the double bond layers (b) are equal

  • check_stereochemistry (bool, optional) – if True, check that the stereochemistry layers (t, m, s) are equal

  • check_isotopes (bool, optional) – if True, check that the isotopic layers (i) are equal

  • check_fixed_hydrogens (bool, optional) – if True, check that the fixed hydrogen layers (f) are equal

  • check_reconnected_metals (bool, optional) – if True, check that the reconnected metals layers (r) are equal

Returns

True if the molecules are semantically equal

Return type

bool
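The layer-by-layer comparison can be pictured with a small parser over the InChI string. This is a simplified sketch under stated assumptions, not the class's implementation: the first layer is taken to be the formula, each later layer is keyed by its one-letter prefix, and a repeated prefix (which real InChI strings can contain after the i, f or r layers) simply overwrites the earlier one. The names LAYER_GROUPS, parse_layers and layers_equal are hypothetical.

```python
# Layer prefixes grouped as in the check_* parameters above
LAYER_GROUPS = {
    'protonation': ('h', 'p', 'q'),
    'double_bonds': ('b',),
    'stereochemistry': ('t', 'm', 's'),
    'isotopes': ('i',),
    'fixed_hydrogens': ('f',),
    'reconnected_metals': ('r',),
}

def parse_layers(inchi):
    """Split an InChI string into {prefix: content}; '' holds the formula layer."""
    _, *parts = inchi.split('/')
    layers = {}
    for position, part in enumerate(parts):
        if position == 0:
            layers[''] = part             # assumed: first layer is the formula
        else:
            layers[part[:1]] = part[1:]   # a repeated prefix overwrites
    return layers

def layers_equal(inchi1, inchi2, skip=()):
    """Compare two InChI strings layer by layer, ignoring the groups named in skip."""
    ignored = {prefix for group in skip for prefix in LAYER_GROUPS[group]}
    a, b = parse_layers(inchi1), parse_layers(inchi2)
    keys = (set(a) | set(b)) - ignored
    return all(a.get(k) == b.get(k) for k in keys)
```

Passing skip=('protonation',) here plays the role of check_protonation=False in the method above.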

is_protonation_isomer(other)[source]

Determine if two molecules are protonation isomers

Parameters

other (InchiMolecule) – other molecule

Returns

True if the molecules are protonation isomers

Return type

bool

is_stereoisomer(other)[source]

Determine if two molecules are stereoisomers

Parameters

other (InchiMolecule) – other molecule

Returns

True if the molecules are stereoisomers

Return type

bool

is_tautomer(other)[source]

Determine if two molecules are tautomers

Parameters

other (InchiMolecule) – other molecule

Returns

True if the molecules are tautomers

Return type

bool

remove_layer(layer)[source]

Remove a layer from a structure

Parameters

layer (str) – name of the layer

class datanator.util.molecule_util.Molecule(id='', name='', structure='', cross_references=None)[source]

Bases: object

Represents a molecule

id[source]

identifier

Type

str

name[source]

name

Type

str

structure[source]

structure in InChI, MOL, or canonical SMILES format

Type

str

cross_references[source]

list of cross references

Type

list of CrossReference

get_fingerprint(type='fp2')[source]

Calculate a fingerprint

Parameters

type (str, optional) – fingerprint type to calculate

Returns

fingerprint

Return type

pybel.Fingerprint

static get_fingerprint_types()[source]

Get list of fingerprint types

Returns

list of fingerprint types

Return type

list of str

get_format()[source]

Get the format of the structure

Returns

format

Return type

str

get_similarity(other, fingerprint_type='fp2')[source]

Calculate the similarity with another molecule

Parameters
  • other (Molecule) – a second molecule

  • fingerprint_type (str, optional) – fingerprint type to use to calculate similarity

Returns

the similarity with the other molecule

Return type

float

to_format(format)[source]

Get the structure in a format

Parameters

format (str) – format such as inchi, mol, smiles

Returns

structure in a format

Return type

str

to_inchi()[source]

Get the structure in InChI format

Returns

structure in InChI format

Return type

str

to_mol()[source]

Get the structure in MOL format

Returns

structure in MOL format

Return type

str

to_openbabel()[source]

Create an Open Babel molecule for the molecule

Returns

Open Babel molecule

Return type

openbabel.OBMol

to_pybel()[source]

Create a pybel molecule for the molecule

Returns

pybel molecule

Return type

pybel.Molecule

to_smiles()[source]

Get the structure in SMILES format

Returns

structure in SMILES format

Return type

str

4.1.1.5.10. datanator.util.mongo_util module

class datanator.util.mongo_util.MongoUtil(cache_dirname=None, MongoDB=None, replicaSet=None, db='test', verbose=False, max_entries=inf, username=None, password=None, authSource='admin', readPreference='nearest')[source]

Bases: object

con_db(collection_str)[source]
fill_db(collection_str)[source]

Check if collection is already in MongoDB

If already in MongoDB:

Do nothing

Else:

Load data into db from quiltdata (karrlab/datanator)

Parameters

collection_str – name of collection (e.g. ‘ecmdb’, ‘pax’, etc)

flatten_collection(collection_str)[source]

Flatten a collection

c is omitted because it does not have a non-object value associated with it

list_all_collections()[source]

List all non-system collections within database

print_schema(collection_str)[source]

Print out the schema of a collection. The ‘_id’ field is removed from the collection because of its object type and universality.

4.1.1.5.11. datanator.util.reaction_util module

Utilities for dealing with reactions

Author

Yosef Roth <yosefdroth@gmail.com>

Author

Jonathan Karr <jonrkarr@gmail.com>

Date

2017-04-13

Copyright

2017, Karr Lab

License

MIT

datanator.util.reaction_util.calc_reactant_product_pairs(reaction)[source]

Get list of pairs of similar reactants and products using a greedy algorithm.

Parameters

reaction (data_model.Reaction) – reaction

Returns

list of pairs of similar reactants and products

Return type

list of tuple of (data_model.Specie, data_model.Specie)
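A greedy pairing of this kind can be sketched generically: score every reactant/product combination, then accept pairs from the highest score down, using each item at most once. The sketch below works on any items and takes the similarity function as an argument; it is not the function's source, and the name greedy_pairs and the tie-breaking by position are assumptions.

```python
def greedy_pairs(reactants, products, similarity):
    """Pair reactants with products, highest similarity first, using each item
    at most once."""
    candidates = sorted(
        ((similarity(r, p), i, j)
         for i, r in enumerate(reactants)
         for j, p in enumerate(products)),
        key=lambda c: (-c[0], c[1], c[2]),  # best score first, ties by position
    )
    used_r, used_p, pairs = set(), set(), []
    for _, i, j in candidates:
        if i not in used_r and j not in used_p:
            used_r.add(i)
            used_p.add(j)
            pairs.append((reactants[i], products[j]))
    return pairs
```

Being greedy, the result is not guaranteed to maximize the total similarity over all pairs; it only takes the best remaining pair at each step.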

4.1.1.5.12. datanator.util.rna_halflife_util module

class datanator.util.rna_halflife_util.RnaHLUtil(server=None, username=None, password=None, src_db=None, des_db=None, protein_col=None, rna_col=None, authDB='admin', readPreference=None, max_entries=inf, verbose=False, cache_dir=None)[source]

Bases: datanator_query_python.util.mongo_util.MongoUtil

fill_uniprot_by_embl(embl, species=None)[source]

Fill uniprot collection using EMBL data

Parameters
  • embl (str) – sequence embl data

  • species (list) – NCBI Taxonomy ID of the species

fill_uniprot_by_gn(gene_name, species=None)[source]

Fill uniprot collection using gene name

Parameters
  • gene_name (str) – gene name

  • species (list) – NCBI Taxonomy ID of the species

fill_uniprot_by_oln(oln, species=None)[source]

Fill uniprot collection using ordered locus name

Parameters
  • oln (str) – Ordered locus name

  • species (list) – NCBI Taxonomy ID of the species

fill_uniprot_with_df(df, identifier, identifier_type='oln', species=None)[source]

Fill uniprot collection with ordered_locus_name from excel sheet

Parameters
  • df (pandas.DataFrame) – dataframe to be inserted into uniprot collection. Assuming df conforms to the schemas required by load_uniprot function in uniprot.py.

  • identifier (str) – name of column that stores ordered locus name information.

  • identifier_type (str) – type of identifier, i.e. ‘oln’, ‘gene_name’

  • species (list) – NCBI Taxonomy ID of the species.

make_df(url, sheet_name, header=0, names=None, usecols=None, skiprows=None, nrows=None, na_values=None, file_type='xlsx', file_name=None)[source]

Read online excel file as dataframe

Parameters
  • url (str) – excel file url

  • sheet_name (str) – name of sheet in xlsx

  • header (int) – Row (0-indexed) to use for the column labels of the parsed DataFrame.

  • names (list) – list of column names to use

  • usecols (int or list or str) – Return a subset of the columns.

  • nrows (int) – number of rows to parse. Defaults to None.

  • file_type (str) – downloaded file type. Defaults to xlsx.

  • file_name (str) – name of the file of interest.

Returns

xlsx transformed to pandas.DataFrame

Return type

(pandas.DataFrame)

uniprot_names(results, count)[source]

Extract protein_name and gene_name from returned tuple of uniprot query function

Parameters
  • results (Iter) – pymongo cursor object.

  • count (int) – Number of documents found.

Returns

gene_name and protein_name

Return type

(tuple of str)

4.1.1.5.13. datanator.util.rna_seq_util module

Utilities for RNA-seq data

Author

Jonathan Karr <jonrkarr@gmail.com>

Author

Yosef Roth <yosefdroth@gmail.com>

Date

2018-01-15

Copyright

2018, Karr Lab

License

MIT

class datanator.util.rna_seq_util.Kallisto[source]

Bases: object

Python interface to kallisto.

index(fasta_filenames, index_filename=None, kmer_size=31, make_unique=False)[source]

Generate index from FASTA files

Parameters
  • fasta_filenames (list of str) – paths to FASTA files

  • index_filename (str, optional) – path to the kallisto index file to be created

  • kmer_size (int, optional) – k-mer length

  • make_unique (bool, optional) – if True, replace repeated target names with unique names

quant(fastq_filenames, index_filename=None, output_dirname=None, bias=False, bootstrap_samples=0, seed=42, plaintext=False, fusion=False, single_end_reads=False, forward_stranded=False, reverse_stranded=False, fragment_length=None, fragment_length_std=None, threads=1, pseudobam=False)[source]

Process RNA-seq FASTQ files

Parameters
  • fastq_filenames (list of str) – paths to FASTQ files

  • index_filename (str, optional) – path to the kallisto index file to be used for quantification

  • output_dirname (str, optional) – path to the output directory

  • single_end_reads (bool, optional) – if True, quantify single-end reads

  • fragment_length (float, optional) – estimated average fragment length

  • fragment_length_std (float, optional) – estimated standard deviation of fragment length
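To show how a few of these parameters relate to the kallisto command line, the sketch below assembles an argument list for kallisto quant without running it. It covers only a subset of the options and is not the class's implementation; build_quant_command is a hypothetical name. The flags used (-i, -o, -t, --single, -l, -s) are kallisto's own, and kallisto requires a fragment length and standard deviation for single-end reads.

```python
def build_quant_command(fastq_filenames, index_filename, output_dirname,
                        single_end_reads=False, fragment_length=None,
                        fragment_length_std=None, threads=1):
    """Assemble an argument list for `kallisto quant` without running it."""
    cmd = ['kallisto', 'quant', '-i', index_filename, '-o', output_dirname,
           '-t', str(threads)]
    if single_end_reads:
        if fragment_length is None or fragment_length_std is None:
            raise ValueError('single-end quantification needs a fragment '
                             'length and its standard deviation')
        cmd += ['--single', '-l', str(fragment_length),
                '-s', str(fragment_length_std)]
    # FASTQ files come last on the command line
    return cmd + list(fastq_filenames)
```

The resulting list could be handed to subprocess.run on a machine where kallisto is installed.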

4.1.1.5.14. datanator.util.taxonomy_util module

Utilities for dealing with taxa

Author

Yosef Roth <yosefdroth@gmail.com>

Author

Jonathan Karr <jonrkarr@gmail.com>

Date

2017-04-11

Copyright

2017, Karr Lab

License

MIT

class datanator.util.taxonomy_util.Taxon(id='', name='', ncbi_id=None, cross_references=None)[source]

Bases: object

Represents a taxon such as a genus, species, or strain

id[source]

identifier

Type

str

name[source]

name of the taxon

Type

str

id_of_nearest_ncbi_taxon[source]

ID of the nearest parent taxon which is in the NCBI database

Type

int

distance_from_nearest_ncbi_taxon[source]

distance from the taxon to its nearest parent which is in the NCBI database

Type

int

additional_name_beyond_nearest_ncbi_taxon[source]

additional part of the taxon’s name beyond that of its nearest parent in the NCBI database

Type

str

cross_references[source]

list of cross references

Type

list of CrossReference

get_common_ancestor(other)[source]

Get the latest common ancestor of two taxa

Parameters

other (Taxon) – a second taxon

Returns

latest common ancestor

Return type

Taxon

get_distance_to_common_ancestor(other)[source]

Calculate the number of links in the NCBI taxonomic tree between two taxa and their latest common ancestor

Note: This distance depends on the granularity of the lineage of the taxon. For example, there are only 7 links between most bacteria species and the Bacteria superkingdom. However, there are 28 links between the Homo sapiens species and the Eukaryota superkingdom.

Parameters

other (Taxon) – a second taxon

Returns

number of links between self and its latest common ancestor with other in the NCBI taxonomic tree

Return type

int
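The distance can be illustrated on two root-first lineages held as plain lists: find where they diverge, then count the links from the end of the first lineage back up to the last shared node. This is a sketch of the idea only; the method itself works on the NCBI taxonomy, and the function name and the abbreviated lineages in the usage note are illustrative assumptions.

```python
def distance_to_common_ancestor(lineage_self, lineage_other):
    """Return (latest common ancestor, links from self up to it) for two
    root-first lineages."""
    shared = 0
    for a, b in zip(lineage_self, lineage_other):
        if a != b:
            break
        shared += 1
    if shared == 0:
        return None, None  # no common ancestor in the given lineages
    return lineage_self[shared - 1], len(lineage_self) - shared
```

With the abbreviated lineages ['root', 'Bacteria', 'Proteobacteria', 'Escherichia', 'Escherichia coli'] and ['root', 'Bacteria', 'Firmicutes', 'Bacillus'], the first taxon is three links below the shared node 'Bacteria'. As the note above says, the count depends on how finely each lineage is resolved.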

get_distance_to_root()[source]

Get the distance from the taxon to the root of the NCBI taxonomy tree

Returns

distance from the taxon to the root

Return type

int

get_max_distance_to_common_ancestor()[source]

Get the maximum distance from the taxon to a common ancestor with another taxon

Returns

maximum distance from the taxon to a common ancestor with another taxon

Return type

int

get_ncbi_id()[source]

Get the ID of the taxon within the NCBI database

Returns

ID of the taxon within the NCBI database or None if the taxon isn’t in the NCBI database

Return type

int or None

get_parent_taxa()[source]

Get parent taxa

Returns

list of parent taxa

Return type

list of Taxon

get_rank()[source]

Get the rank of the taxon

Returns

rank of the taxon

Return type

str

datanator.util.taxonomy_util.setup_database(force_update=False)[source]

Set up a local SQLite copy of the NCBI Taxonomy database. If force_update is False, then only download the content from NCBI and build the SQLite database if a local database doesn’t already exist. If force_update is True, then always download the content from NCBI and rebuild the SQLite copy of the database.

Parameters

force_update (bool, optional) –

  • False: only download the content for the database and build a local SQLite database if a local SQLite copy of the database doesn’t already exist

  • True: always download the content for the database from NCBI and rebuild a local SQLite database
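The force_update rule reduces to a small "build if missing or forced" check. The sketch below shows only that decision; the download and database build are stood in for by a caller-supplied function, and ensure_database, path and build are hypothetical names rather than part of this module.

```python
import os

def ensure_database(path, build, force_update=False):
    """Call build(path) only when forced or when no file exists at path.

    Returns True if build was called, False if the existing file was kept.
    """
    if force_update or not os.path.exists(path):
        build(path)
        return True
    return False
```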

4.1.1.5.15. datanator.util.warning_util module

Warning utilities

Author

Yosef Roth <yosefdroth@gmail.com>

Author

Jonathan Karr <jonrkarr@gmail.com>

Date

2017-04-13

Copyright

2017, Karr Lab

License

MIT

datanator.util.warning_util.disable_warnings()[source]

Disable warning messages from openbabel and urllib

datanator.util.warning_util.enable_warnings()[source]

Enable warning messages from openbabel and urllib

4.1.1.5.16. Module contents