3.1.1.2. h1_hesc.kb_gen package

3.1.1.2.1. Submodules

3.1.1.2.2. h1_hesc.kb_gen.chromosomes module

Reconstruct the sequences of H1-hESC chromosomes

Author:Yin Hoon Chew <yinhoon.chew@mssm.edu>
Author:Jonathan Karr <jonrkarr@gmail.com>
Date:2017-07-13
Copyright:2017, Karr Lab
License:MIT
class h1_hesc.kb_gen.chromosomes.ChromosomesGenerator(knowledge_base, options=None)[source]

Bases: wc_kb_gen.core.KbComponentGenerator

Create chromosomes for the knowledge base

clean_and_validate_options()[source]

Apply default options and validate options

Options:
  • data_path(str): path to genome sequence fasta file
gen_components()[source]

Construct chromosomes for the knowledge base

get_data()[source]

Get data for knowledge base

process_data()[source]

Process data for the knowledge base
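
The four methods above suggest a staged lifecycle. The following is a minimal sketch of that pattern, assuming the stages run in the order options, data, processing, components. `SketchComponentGenerator`, `ToyChromosomesGenerator`, the `fasta_text` option, and the dictionary used as a knowledge base are hypothetical stand-ins for illustration; they are not the wc_kb_gen API.

```python
import io


class SketchComponentGenerator:
    """Hypothetical stand-in for a component generator base class."""

    def __init__(self, knowledge_base, options=None):
        self.knowledge_base = knowledge_base
        self.options = dict(options or {})
        self.data = None

    def run(self):
        # Assumed ordering of the four stages.
        self.clean_and_validate_options()
        self.get_data()
        self.process_data()
        self.gen_components()
        return self.knowledge_base


class ToyChromosomesGenerator(SketchComponentGenerator):
    def clean_and_validate_options(self):
        # Hypothetical option: FASTA content passed as text instead of a file path.
        if not isinstance(self.options.get('fasta_text'), str):
            raise ValueError('fasta_text must be a string')

    def get_data(self):
        self.data = io.StringIO(self.options['fasta_text']).read().splitlines()

    def process_data(self):
        # Collect the sequence lines that follow each '>' header.
        sequences, name = {}, None
        for line in self.data:
            if line.startswith('>'):
                name = line[1:].split()[0]
                sequences[name] = []
            elif name is not None and line.strip():
                sequences[name].append(line.strip())
        self.data = {key: ''.join(parts) for key, parts in sequences.items()}

    def gen_components(self):
        self.knowledge_base['chromosomes'] = self.data


kb = ToyChromosomesGenerator(
    {}, {'fasta_text': '>chr1 toy\nACGT\nAC\n>chrM\nGG\n'}).run()
```

Keeping each stage in its own method lets a subclass override one stage, for example only the parsing, without touching the others.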

3.1.1.2.3. h1_hesc.kb_gen.compartments module

Reconstruct the compartments of a cell

Author:Yin Hoon Chew <yinhoon.chew@mssm.edu>
Date:2018-10-04
Copyright:2018, Karr Lab
License:MIT
class h1_hesc.kb_gen.compartments.CompartmentsGenerator(knowledge_base, options=None)[source]

Bases: wc_kb_gen.core.KbComponentGenerator

Creates compartments and their volumetric fractions for the knowledge base from input files.

Options:
  • data_path (str): path of manually curated data on volumetric fractions of compartments
clean_and_validate_options()[source]

Apply default options and validate options

gen_components()[source]

Construct compartments for the knowledge base

get_data()[source]

Get compartment data from input file

3.1.1.2.4. h1_hesc.kb_gen.complexes module

Reconstruct protein complexes

Author:Yin Hoon Chew <yinhoon.chew@mssm.edu>
Date:2017-11-28
Copyright:2017, Karr Lab
License:MIT

class h1_hesc.kb_gen.complexes.ComplexesGenerator(knowledge_base, options=None)[source]

Bases: wc_kb_gen.core.KbComponentGenerator

Create macromolecular complexes for the knowledge base

Data on protein complexes from Human Recon 2.2, HumanCyc, Corum, and Complex Portal are merged into a consensus set. Each reconstructed protein complex records its protein and cofactor subunits, together with their stoichiometry wherever available.

The databases differ in the information they provide. Complex Portal has both stoichiometric and isomer information (e.g. P14618-1, P14618-2). Neither Recon nor Corum has stoichiometric information, but Corum includes protein isomer information (e.g. P78381-1, P78381-2). HumanCyc has stoichiometric information.

A consensus set is compiled by consecutively evaluating and adding each complex entry from each database:

  1. All complex entries in Complex Portal are added. If two complexes have protein subunits that differ only in their isomer information, they are considered the same complex because the mapping of UniProt isomers to Ensembl proteins is not complete.
  2. Entries from Recon are added if no complex with similar protein subunits can be found in the compiled set.
  3. Entries from Corum are added if no complex with similar protein subunits can be found in the compiled set. Isomer information is not considered.
  4. HumanCyc defines complexes inconsistently with the other databases, i.e. some HumanCyc complexes are only subsets of complexes from other databases (see example of P03886). We therefore use only its stoichiometric information, to fill in the missing stoichiometries of entries in the compiled set where all protein subunits are similar.

Finally, for each entry in the consensus set, if its protein subunits are a subset of the protein subunits of another entry, we lift the stoichiometry of the superset entry to the subset entry.
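
The merging procedure above can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: each complex is represented as a dictionary from subunit id to stoichiometry (`None` when unknown), "similar protein subunits" is taken to mean identical subunit sets after dropping the isomer suffix, and lifting only fills stoichiometries that are still missing. The function names and the ids `A` to `E` and `Z` are hypothetical.

```python
def strip_isoform(uniprot_id):
    """Drop a suffix such as '-1' from a UniProt accession."""
    return uniprot_id.split('-')[0]


def subunit_key(subunits):
    """Suffix-insensitive set of protein subunits."""
    return frozenset(strip_isoform(s) for s in subunits)


def build_consensus(complex_portal, recon, corum, humancyc):
    """Each argument is a list of dicts: subunit id -> stoichiometry or None."""
    consensus = {}
    # Steps 1-3: add entries in priority order, skipping known subunit sets.
    for source in (complex_portal, recon, corum):
        for entry in source:
            key = subunit_key(entry)
            if key not in consensus:
                consensus[key] = {strip_isoform(s): n for s, n in entry.items()}
    # Step 4: HumanCyc only fills missing stoichiometries of matching entries.
    for entry in humancyc:
        target = consensus.get(subunit_key(entry))
        if target is not None:
            for s, n in entry.items():
                if target[strip_isoform(s)] is None:
                    target[strip_isoform(s)] = n
    # Final step: lift stoichiometries from superset entries to subset entries.
    for key, entry in consensus.items():
        for other_key, other in consensus.items():
            if key < other_key:  # proper subset
                for s in key:
                    if entry[s] is None:
                        entry[s] = other[s]
    return consensus


consensus = build_consensus(
    complex_portal=[{'P14618-1': 4}, {'P14618-2': 4}, {'A': 1, 'B': 2, 'C': 1}],
    recon=[{'A': None, 'B': None}, {'D': None}],
    corum=[{'D-1': None}, {'E': None}],
    humancyc=[{'D': 2}, {'Z': 1}],
)
```

Keying the compiled set on a frozen set of subunits makes both the duplicate check and the subset test a single set operation.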

Options:
  • id_conversion_path (str): path of gene-transcript-protein id conversion table
  • humancyc_enzyme_path (str): path of HumanCyc data on enzymes
  • humancyc_transporter_path (str): path of HumanCyc data on transporters
  • humancyc_geneproduct_path (str): path of HumanCyc data on gene products
  • humancyc_gene_path (str): path of HumanCyc data on genes
  • non_protein_conversion (dict): dictionary where the keys are the
    non-protein subunit ids used in Complex Portal while the values are references to the species type objects the ids refer to
  • test_protein (list): list of protein Uniprot IDs for which complexes
    are to be retrieved from Corum (for unit testing purpose)
  • test_complex (list, optional): list of complexes to be reconstructed
    (for unit testing purpose)
clean_and_validate_options()[source]

Apply default options and validate options

gen_components()[source]

Construct complex species types for the knowledge base

get_data()[source]

Get protein complex data from Human Recon 2.2, HumanCyc, Corum, and Complex Portal

process_data()[source]

Merge data of protein complexes from the various databases

3.1.1.2.5. h1_hesc.kb_gen.core module

Generator for the knowledge base

Author:Yin Hoon Chew <yinhoon.chew@mssm.edu>
Author:Jonathan Karr <jonrkarr@gmail.com>
Date:2018-01-29
Copyright:2018, Karr Lab
License:MIT
class h1_hesc.kb_gen.core.KbGenerator(component_generators=None, options=None)[source]

Bases: wc_kb_gen.core.KbGenerator

Generator for the knowledge base

Options:
  • id
  • name
  • version
  • component
      • TaxonGenerator
      • ChromosomesGenerator
      • GenesTranscriptsExonsProteinsGenerator
      • RegulatoryModulesGenerator
      • CompartmentsGenerator
      • MetabolitesGenerator
      • PropertiesGenerator
      • TranscriptsGenerator
      • ProteinsGenerator
      • MetabolicNetworkGenerator
      • ComplexesGenerator
      • MetabolicReactionKineticsGenerator
      • ProteinModificationsGenerator
DEFAULT_COMPONENT_GENERATORS = (<class 'h1_hesc.kb_gen.taxon.TaxonGenerator'>, <class 'h1_hesc.kb_gen.chromosomes.ChromosomesGenerator'>, <class 'h1_hesc.kb_gen.genes_transcripts_exons_proteins.GenesTranscriptsExonsProteinsGenerator'>, <class 'h1_hesc.kb_gen.regulatory_modules.RegulatoryModulesGenerator'>, <class 'h1_hesc.kb_gen.compartments.CompartmentsGenerator'>, <class 'h1_hesc.kb_gen.metabolites.MetabolitesGenerator'>, <class 'h1_hesc.kb_gen.properties.PropertiesGenerator'>, <class 'h1_hesc.kb_gen.transcripts.TranscriptsGenerator'>, <class 'h1_hesc.kb_gen.proteins.ProteinsGenerator'>, <class 'h1_hesc.kb_gen.metabolic_network.MetabolicNetworkGenerator'>, <class 'h1_hesc.kb_gen.complexes.ComplexesGenerator'>, <class 'h1_hesc.kb_gen.metabolic_reaction_kinetics.MetabolicReactionKineticsGenerator'>, <class 'h1_hesc.kb_gen.protein_modifications.ProteinModificationsGenerator'>)[source]
clean_and_validate_options()[source]

Apply default options and validate options

run()[source]

Generate a knowledge base of experimental data for a whole-cell model

Returns:knowledge base
Return type:wc_kb.core.KnowledgeBase

h1_hesc.kb_gen.core.main(core_path=None, seq_path=None, read=False, test_kb=None)[source]

Generates a knowledge base for H1 hESCs

Parameters:
  • core_path (str, optional) – path to read/write the knowledge base
  • seq_path (str, optional) – path to the genome sequence
  • read (bool, optional) – if True, the knowledge base is loaded by deserializing a previously created Excel spreadsheet at core_path; otherwise, it is generated by running the generator and the output is serialized
  • test_kb (wc_kb.core.KnowledgeBase, optional) – for test purpose only

3.1.1.2.6. h1_hesc.kb_gen.genes_transcripts_exons_proteins module

Reconstruct genes, pre-rnas, transcripts, exons, CDSs and proteins

Author:Yin Hoon Chew <yinhoon.chew@mssm.edu>
Author:Jonathan Karr <jonrkarr@gmail.com>
Date:2018-01-30
Copyright:2018, Karr Lab
License:MIT
class h1_hesc.kb_gen.genes_transcripts_exons_proteins.GenesTranscriptsExonsProteinsGenerator(knowledge_base, options=None)[source]

Bases: wc_kb_gen.core.KbComponentGenerator

Create genes, pre-rnas, transcripts, exons, CDSs and proteins for the knowledge base by parsing a genome annotation file. To speed up processing, the genome annotation file is divided into smaller chunks, which are then grouped into batches. The batches are processed sequentially while the chunks in each batch are processed in parallel. At the end of each batch, the intermediate results are dumped into a pickle file.
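
The chunk-and-batch scheme described above can be sketched as follows. This is a simplified illustration, not the generator's code: the helper names, the tab-separated GFF-like toy lines, the per-feature counting used as a stand-in parser, and the use of a thread pool for the parallel step are all assumptions.

```python
import os
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, chunk_size):
    """Group whole lines into chunks of roughly chunk_size bytes."""
    chunks, current, size = [], [], 0
    for line in lines:
        current.append(line)
        size += len(line.encode('utf-8'))
        if size >= chunk_size:
            chunks.append(current)
            current, size = [], 0
    if current:
        chunks.append(current)
    return chunks


def parse_chunk(chunk):
    """Toy parser: count annotation records per feature type (third column)."""
    counts = {}
    for line in chunk:
        if line.startswith('#') or not line.strip():
            continue
        feature = line.split('\t')[2]
        counts[feature] = counts.get(feature, 0) + 1
    return counts


def process_in_batches(lines, chunk_size, batch_size, pickle_dir):
    chunks = split_into_chunks(lines, chunk_size)
    totals = {}
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        # Chunks within a batch run concurrently; batches run one after another.
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(parse_chunk, batch))
        for counts in results:
            for feature, n in counts.items():
                totals[feature] = totals.get(feature, 0) + n
        # Checkpoint the intermediate results at the end of each batch.
        path = os.path.join(pickle_dir, 'batch_{}.pkl'.format(start // batch_size))
        with open(path, 'wb') as handle:
            pickle.dump(totals, handle)
    return totals


annotation = [
    '##gff-version 3\n',
    'chr1\tsketch\tgene\t1\t100\t.\t+\t.\tID=g1\n',
    'chr1\tsketch\ttranscript\t1\t100\t.\t+\t.\tID=t1\n',
    'chr1\tsketch\texon\t1\t50\t.\t+\t.\tID=e1\n',
    'chr1\tsketch\texon\t60\t100\t.\t+\t.\tID=e2\n',
]
with tempfile.TemporaryDirectory() as tmp:
    totals = process_in_batches(annotation, chunk_size=40, batch_size=2, pickle_dir=tmp)
    n_checkpoints = len(os.listdir(tmp))
```

Writing a pickle per batch means an interrupted run loses at most one batch of work, at the cost of some extra disk I/O.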

clean_and_validate_options()[source]

Apply default options and validate options

Options:
  • data_path (str, optional): path to genome annotation lifted from reference genome,
    default is system path where an annotation file will be downloaded from Quilt
  • pickle_path (str): path to directory to store pickle files of constructed genes,
    pre-rnas, transcripts, exons, CDSs and proteins
  • id_conversion_path (str): path of gene-transcript-protein id conversion table
  • chunk_size (int, optional): byte size of each chunk when breaking genome annotation
    file into smaller chunks, default is 1024*1024
  • batch_size (int, optional): number of chunks in each batch, default is 300
gen_components()[source]

Construct genes, pre-rnas, exons, transcripts and proteins for the knowledge base

get_data()[source]

Get data for the knowledge base

process_data()[source]

Read and process the data

3.1.1.2.7. h1_hesc.kb_gen.metabolic_network module

Reconstruct the metabolic network

Author:Yin Hoon Chew <yinhoon.chew@mssm.edu>
Date:2018-08-10
Copyright:2018, Karr Lab
License:MIT
class h1_hesc.kb_gen.metabolic_network.MetabolicNetworkGenerator(knowledge_base, options=None)[source]

Bases: wc_kb_gen.core.KbComponentGenerator

Reconstruct metabolic network for the knowledge base

Starting from Recon 2.2, existing reactions are modified and new reactions and their associated genes are added to the network to:

  1. transport nutrients from the media, mTeSR1, that is commonly used in hESC cultures
  2. produce and recycle metabolites that are involved in non-metabolic pathways
  3. serve as the alternative pathways to non-essential reactions
  4. improve the representation of energy metabolism by modifying mitochondrial metabolism in Recon 2.2 based on MitoCore
Options:
  • metabolic_network_path (str): path of human metabolic reconstruction
  • new_reactions_data_path (str): path of new reactions and modified reactions
clean_and_validate_options()[source]

Apply default options and validate options

gen_components()[source]

Construct metabolite species types and concentrations for the knowledge base

get_data()[source]

Get human metabolic network and manually curated reactions from input files

3.1.1.2.8. h1_hesc.kb_gen.metabolic_reaction_kinetics module

Retrieve kinetic data from SABIO-RK for a taxon such as mammals (40674)

Author:Jonathan Karr <jonrkarr@gmail.com>
Author:Yin Hoon Chew <yinhoon.chew@mssm.edu>
Date:2017-07-25
Copyright:2018, Karr Lab
License:MIT
class h1_hesc.kb_gen.metabolic_reaction_kinetics.MetabolicReactionKineticsGenerator(knowledge_base, options=None)[source]

Bases: wc_kb_gen.core.KbComponentGenerator

Reconstruct the rate laws and rate parameters of metabolic reactions from Sabio-RK

First, the InChI structures of metabolites are used to find the Sabio-RK compound for each species. Then, missing EC numbers of reactions are imputed. Data measured in organisms from the same taxon as the modeled cell are retrieved from Sabio-RK and matched to the reactions. Each kinetic parameter associated with a reaction is then matched to the protein complexes.
Options:
  • id_conversion_path (str): path of gene-transcript-protein id conversion table
  • mutant_inclusion (bool): mutant inclusion parameter for querying Sabio-RK
  • min_temp (float): minimum temperature for querying Sabio-RK
  • max_temp (float): maximum temperature for querying Sabio-RK
  • min_ph (float): minimum pH for querying Sabio-RK
  • max_ph (float): maximum pH for querying Sabio-RK
  • j_threshold (float): the minimum Jaccard similarity between the gene-reaction rules in the metabolic network
    and the reconstructed complexes that determines whether a protein complex is assigned to the reaction
clean_and_validate_options()[source]

Apply default options and validate options

gen_components()[source]

Construct metabolic reaction kinetic laws and parameters for the knowledge base

get_data()[source]

Get kinetic laws and rate parameters from Sabio-RK

process_data()[source]

Match reaction kinetics to complexes

First, Jaccard similarity is used to compare and assign reconstructed protein complexes to each reaction. Then, each protein complex is assigned the kinetic data whose enzyme subunits are a subset of the protein complex.
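
A minimal sketch of the two matching steps, assuming that a gene-reaction rule can be flattened to a plain set of gene ids and that kinetic records carry a list of enzyme subunits. The function names, record layout, and ids are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def assign_complexes(reaction_genes, complexes, j_threshold):
    """Ids of complexes whose subunits are similar enough to the reaction's genes."""
    return sorted(
        cid for cid, subunits in complexes.items()
        if jaccard(reaction_genes, subunits) >= j_threshold
    )


def assign_kinetics(complex_subunits, kinetic_records):
    """Keep kinetic records whose enzyme subunits are a subset of the complex."""
    subunits = set(complex_subunits)
    return [rec for rec in kinetic_records
            if set(rec['enzyme_subunits']) <= subunits]


complexes = {
    'cplx_1': {'g1', 'g2'},
    'cplx_2': {'g1', 'g2', 'g3', 'g4'},
    'cplx_3': {'g9'},
}
matched = assign_complexes({'g1', 'g2', 'g3'}, complexes, j_threshold=0.7)
kinetics = assign_kinetics(complexes['cplx_2'], [
    {'id': 'k1', 'enzyme_subunits': ['g1', 'g2']},
    {'id': 'k2', 'enzyme_subunits': ['g1', 'g5']},
])
```

With a threshold of 0.7, only `cplx_2` (similarity 3/4) is assigned; `cplx_1` (2/3) falls just below it, which shows how `j_threshold` trades coverage for specificity.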

3.1.1.2.9. h1_hesc.kb_gen.metabolites module

Reconstruct the metabolites in a cell

Author:Yin Hoon Chew <yinhoon.chew@mssm.edu>
Date:2018-08-05
Copyright:2018, Karr Lab
License:MIT
class h1_hesc.kb_gen.metabolites.MetabolitesGenerator(knowledge_base, options=None)[source]

Bases: wc_kb_gen.core.KbComponentGenerator

Creates metabolites for the knowledge base from input files.

Options:
  • metabolic_network_path (str): path of human metabolic reconstruction
  • new_metabolites_data_path (str): path of manually curated new metabolites
  • concentration_data_path (str): path of manually curated metabolite concentrations

clean_and_validate_options()[source]

Apply default options and validate options

gen_components()[source]

Construct metabolite species types and concentrations for the knowledge base

get_data()[source]
  1. Get metabolite species from human metabolic network and manually added metabolite species from input file
  2. Get intracellular concentrations of metabolites from input file and Datanator
process_data()[source]

Merge and assign the concentration data to the metabolite species type in the knowledge base using the following steps:

  1. Assign concentration data from manual curation to the metabolite species
  2. For the remaining metabolite species:
    a. Exactly match the InChI strings of metabolites in the database
    b. If no match is found in step a, match InChI strings by successively ignoring
      charge and stereochemistry layers
    c. If no match is found in step b, extract non-matching InChI strings that are not
      cardiolipin for manual matching
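
The successive InChI matching can be sketched as below. The sketch assumes that "ignoring" a layer means deleting it from the string, that charge is dropped before stereochemistry, and that the database is a plain dictionary from InChI string to concentration. It relies on the InChI layer prefixes `/q` and `/p` (charge and protonation) and `/b`, `/t`, `/m`, `/s` (stereochemistry). The function names and the concentration value are hypothetical, and the cardiolipin filter and manual matching are not modeled.

```python
CHARGE_LAYERS = ('q', 'p')
STEREO_LAYERS = ('b', 't', 'm', 's')


def drop_layers(inchi, prefixes):
    """Remove InChI layers whose one-letter prefix is in prefixes."""
    head, *layers = inchi.split('/')
    # The formula layer starts with an uppercase element symbol, so the
    # lowercase prefixes never match it.
    kept = [layer for layer in layers if layer[:1] not in prefixes]
    return '/'.join([head] + kept)


def match_concentration(inchi, database):
    """Return (value, match level) or None. database: InChI -> concentration."""
    if inchi in database:
        return database[inchi], 'exact'
    dropped = ()
    for level, layers in (('no_charge', CHARGE_LAYERS),
                          ('no_charge_no_stereo', STEREO_LAYERS)):
        dropped += layers
        reduced = {drop_layers(key, dropped): value for key, value in database.items()}
        key = drop_layers(inchi, dropped)
        if key in reduced:
            return reduced[key], level
    return None


L_ALANINE = 'InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1'
L_ALANINATE = 'InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/p-1/t2-/m0/s1'
D_ALANINE = 'InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1'
GLYCINE = 'InChI=1S/C2H5NO2/c3-1-2(4)5/h1,3H2,(H,4,5)'

database = {L_ALANINE: 1.0}  # toy concentration value
results = [match_concentration(q, database)
           for q in (L_ALANINE, L_ALANINATE, D_ALANINE, GLYCINE)]
```

Returning the match level alongside the value keeps a record of how loose each match was, which is useful when reviewing the entries that still need manual matching.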

3.1.1.2.10. h1_hesc.kb_gen.properties module

Reconstruction of cell properties

Author:Yin Hoon Chew <yinhoon.chew@mssm.edu>
Date:2018-11-15
Copyright:2018, Karr Lab
License:MIT
class h1_hesc.kb_gen.properties.PropertiesGenerator(knowledge_base, options=None)[source]

Bases: wc_kb_gen.core.KbComponentGenerator

Create cell properties for the knowledge base

Options:
  • data_path (dict): path to data of cell properties
clean_and_validate_options()[source]

Apply default options and validate options

gen_components()[source]

Construct cell properties for the knowledge base

get_data()[source]

Get data for cell properties

process_data()[source]

Process data for knowledge base components

3.1.1.2.11. h1_hesc.kb_gen.protein_modifications module

Phosphorylation sites and ratio calculation from the H1-ESC iTRAQ-8plex data (Phanstiel et al., Nat Methods 2011)

Author:Yassmine Chebaro <chebaro@igbmc.fr>
Date:2019-01-24
Copyright:2019, Karr Lab
License:MIT
class h1_hesc.kb_gen.protein_modifications.ProteinModificationsGenerator(knowledge_base, options=None)[source]

Bases: wc_kb_gen.core.KbComponentGenerator

Create phosphorylation states for the knowledge base

The phosphorylation ratio of each protein is calculated from the quantification provided in the supplementary material of Phanstiel et al., as follows:

quantification(phos)/(quantification(proteome)+quantification(phos))

In cases where quantification of a protein is NA or absent, no ratio for the corresponding phospho-protein is calculated.
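
The ratio above, together with the handling of NA or absent values, can be sketched as follows. The function names are hypothetical, missing values are assumed to arrive as `None` or nan, and averaging across replicates after discarding nan ratios is shown as one plausible way to combine them.

```python
import math


def phospho_ratio(phos, proteome):
    """quantification(phos) / (quantification(proteome) + quantification(phos))."""
    if phos is None or proteome is None:
        return math.nan
    if math.isnan(phos) or math.isnan(proteome):
        return math.nan
    total = phos + proteome
    return phos / total if total else math.nan


def mean_ratio(replicate_pairs):
    """Average the ratios of (phos, proteome) pairs, ignoring nan ratios."""
    ratios = [phospho_ratio(phos, proteome) for phos, proteome in replicate_pairs]
    ratios = [r for r in ratios if not math.isnan(r)]
    return sum(ratios) / len(ratios) if ratios else None


# Toy quantifications for one phospho-site in three replicates.
site_ratio = mean_ratio([(1.0, 3.0), (None, 2.0), (2.0, 2.0)])
```

Keeping the nan ratios until the averaging step, rather than dropping them on creation, leaves room to treat missing data differently later.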

Options:
  • proteome_data_path (str): path for the proteome xlsx spreadsheet data from Phanstiel et al
  • phosproteome_data_path (str): path for the phosphoproteome xlsx spreadsheet data from Phanstiel et al
  • ipi_to_ensp (str): path to IPI-ensembl conversion table
clean_and_validate_options()[source]

Apply default options and validate options

gen_components()[source]

Construct protein modifications for the knowledge base

get_data()[source]

Get proteome and phosphoproteome data from xlsx spreadsheets and get IPI mapping to ensembl protein from csv file map_ipi_to_ensp.csv

process_data()[source]

The phosphorylation ratio per site for each protein is calculated from the MS quantification data obtained from Phanstiel et al., Nat Methods 2011. The calculation is split into the following steps to allow more flexibility in the future if needed (e.g. in deciding what to do with N/A or absent data):

  1. Extract phosphorylated proteins whose non-phosphorylated equivalent is present in the proteomics data, and calculate the ratio for each phospho-site per protein per set of experiments. Three replicates are available for H1 data: ES (H1) iTRAQ-117, ES (H1) iTRAQ-113 and ES (H1) iTRAQ-117 (changed to ES (H1) iTRAQ-117b here for simplification).
  2. Clean the ratios: where N/A or no value is present, the previous step outputs nan, and these ratios are replaced with nan. This is not combined with the previous step in case one does not want to remove the nans.
  3. Calculate the average ratio per site per protein and map the IPI to Ensembl protein after removing the nan ratios.

3.1.1.2.12. h1_hesc.kb_gen.proteins module

Reconstruction of proteins

Author:Yin Hoon Chew <yinhoon.chew@mssm.edu>
Date:2017-07-19
Copyright:2018, Karr Lab
License:MIT
class h1_hesc.kb_gen.proteins.ProteinsGenerator(knowledge_base, options=None)[source]

Bases: wc_kb_gen.core.KbComponentGenerator

Create proteins for the knowledge base

First, the protein species in each compartment are generated by merging protein localization data from the metabolic network and Cell Atlas into a consensus set. Transporter proteins, which are proteins that catalyze reactions involving metabolites from multiple compartments, are localized to the membrane compartments.

Second, H1 protein abundance data are extracted from PaxDB and supplemented by merging with protein abundance data from other human cell lines. In cases where multiple supplementary datasets contain measurements for the same protein, the supplementary dataset that correlates best with the main dataset is selected. A housekeeping protein is used to normalize the datasets. The abundances of proteins without any measurement are set to the median measured abundance. Inconsistencies between the proteomic and transcriptomic data are then resolved.

Third, the concentration of each protein species in each compartment is set by distributing whole-cell protein concentration based on the volumetric fraction of the compartment.
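
The second step above (normalizing by a housekeeping protein and preferring the best-correlated supplementary dataset) can be sketched as follows. The source does not state which correlation measure is used, so Pearson correlation over shared proteins is an assumption, as are the function names, the dataset layout (protein id to abundance), and the toy values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (0.0 if undefined)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def normalize(dataset, housekeeping_id):
    """Express every abundance relative to the housekeeping protein."""
    ref = dataset[housekeeping_id]
    return {pid: value / ref for pid, value in dataset.items()}


def supplement(main, supplementary, housekeeping_id):
    """Fill proteins missing from main using the best-correlated dataset first."""
    main = normalize(main, housekeeping_id)
    scored = []
    for name, dataset in supplementary.items():
        dataset = normalize(dataset, housekeeping_id)
        shared = sorted(set(main) & set(dataset))
        r = 0.0
        if len(shared) > 1:
            r = pearson([main[p] for p in shared], [dataset[p] for p in shared])
        scored.append((r, name, dataset))
    merged = dict(main)
    # Visit datasets from best to worst correlation; the first value wins.
    for _, _, dataset in sorted(scored, key=lambda item: item[0], reverse=True):
        for pid, value in dataset.items():
            merged.setdefault(pid, value)
    return merged


main = {'HK': 10.0, 'P1': 20.0, 'P2': 30.0}
supplementary = {
    'line_a': {'HK': 5.0, 'P1': 10.0, 'P2': 15.0, 'P3': 40.0},
    'line_b': {'HK': 1.0, 'P1': 9.0, 'P2': 2.0, 'P3': 7.0},
}
merged = supplement(main, supplementary, 'HK')
```

Here `P3` is taken from `line_a`, whose normalized values track the main dataset exactly, rather than from the poorly correlated `line_b`.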

Options:
  • id_conversion_path (str): path of gene-transcript-protein id conversion table
  • paxdb_data_path (str): path to PaxDB database file (this will be replaced by
    querying directly from Datanator once PaxDB has been fully integrated)
  • localization_data_path (str): path to localization data from Cell Atlas
  • substructure_to_compartment (dict): dictionary to translate subcellular
    structures defined in Cell Atlas (keys) to compartments defined in Recon 2.2
  • housekeeping_gene (str): the Ensembl Protein id of the housekeeping protein
  • supplementary_dataset (list): list of dataset ids for other cell lines in PaxDB
    that will be the supplementary datasets
clean_and_validate_options()[source]

Apply default options and validate options

gen_components()[source]

Construct protein concentrations for the knowledge base

get_data()[source]

Get protein localization data from Cell Atlas and protein abundance data from PaxDB

process_data()[source]

Protein localisation data from Cell Atlas and the metabolic network are merged

Whole-cell absolute protein abundances (mol/cell) are calculated using the following steps:

  1. Combine relative abundances measured in H1 with data lifted over from other cell lines
  2. Set the relative abundances of non-measured proteins to the median protein abundance reconstructed in Step 1.
  3. Resolve inconsistencies between the reconstructed relative protein abundances and observed transcript abundances:
    a. If the relative protein abundance is zero but the transcript has been shown to be expressed in H1, set the protein abundance to be the median abundance reconstructed in Step 1
    b. If the relative protein abundance is not zero but the measured transcript abundance is zero, check whether the proteomic data was measured in H1 or lifted from another cell line. If the protein abundance was lifted from another cell line, set the protein abundance to zero. Otherwise, set the transcript abundance to the median transcript abundance
  4. Calculate absolute protein abundances by multiplying relative protein abundances by the total H1 protein mass

Because each protein defined above represents one generic product of each gene, its abundance is further distributed among its protein variants based on the relative abundance of the transcript that is translated into each variant.
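
Steps 2 and 3 above, and the distribution among protein variants, can be sketched as follows. This is an illustration under assumptions: the function names and argument layout are hypothetical, each protein is mapped to a single transcript, proteins without any measurement are left untouched by step 3b, and an even split is used when no variant transcript is expressed. Step 4 is omitted.

```python
from statistics import median


def reconcile(measured, source, transcripts, all_proteins, protein_to_transcript):
    """
    measured: protein id -> relative abundance (result of Step 1)
    source: protein id -> 'H1' or 'other' (where the measurement came from)
    transcripts: transcript id -> abundance
    """
    protein_median = median(measured.values())
    transcript_median = median(transcripts.values())
    # Step 2: non-measured proteins get the median abundance.
    proteins = {p: measured.get(p, protein_median) for p in all_proteins}
    transcripts = dict(transcripts)
    # Step 3: resolve protein/transcript inconsistencies.
    for p, abundance in proteins.items():
        t = protein_to_transcript[p]
        expressed = transcripts.get(t, 0) > 0
        if abundance == 0 and expressed:
            proteins[p] = protein_median             # 3a
        elif abundance > 0 and not expressed:
            if source.get(p) == 'other':
                proteins[p] = 0                      # 3b: lifted data
            elif source.get(p) == 'H1':
                transcripts[t] = transcript_median   # 3b: measured in H1
    return proteins, transcripts


def distribute_to_variants(gene_abundance, variant_transcript_abundance):
    """Split a gene-level abundance in proportion to variant transcript levels."""
    total = sum(variant_transcript_abundance.values())
    if total == 0:
        share = 1 / len(variant_transcript_abundance)
        return {v: gene_abundance * share for v in variant_transcript_abundance}
    return {v: gene_abundance * a / total
            for v, a in variant_transcript_abundance.items()}


proteins, transcripts = reconcile(
    measured={'P1': 4.0, 'P2': 0.0, 'P3': 2.0, 'P4': 6.0},
    source={'P1': 'H1', 'P2': 'H1', 'P3': 'other', 'P4': 'H1'},
    transcripts={'T1': 5.0, 'T2': 3.0, 'T3': 0.0, 'T4': 0.0, 'T5': 1.0},
    all_proteins=['P1', 'P2', 'P3', 'P4', 'P5'],
    protein_to_transcript={'P1': 'T1', 'P2': 'T2', 'P3': 'T3', 'P4': 'T4', 'P5': 'T5'},
)
variants = distribute_to_variants(6.0, {'v1': 3.0, 'v2': 1.0})
```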

3.1.1.2.13. h1_hesc.kb_gen.regulatory_modules module

Reconstruct regulatory features

Author:Yin Hoon Chew <yinhoon.chew@mssm.edu>
Date:2018-11-27
Copyright:2018, Karr Lab
License:MIT
class h1_hesc.kb_gen.regulatory_modules.RegulatoryModulesGenerator(knowledge_base, options=None)[source]

Bases: wc_kb_gen.core.KbComponentGenerator

Create regulatory elements and modules for the knowledge base

clean_and_validate_options()[source]

Apply default options and validate options

Options:
  • reg_activity_data_path (str): path to data on regulatory features and activities
  • motif_feature_data_path (str): path to data on motif features
gen_components()[source]

Construct regulatory elements for the knowledge base

get_data()[source]

Get data for the knowledge base

process_data()[source]

Read and process the data

3.1.1.2.14. h1_hesc.kb_gen.taxon module

Reconstruct the taxon

Author:Jonathan Karr <jonrkarr@gmail.com>
Author:Yin Hoon Chew <yinhoon.chew@mssm.edu>
Date:2018-02-05
Copyright:2017, Karr Lab
License:MIT
class h1_hesc.kb_gen.taxon.TaxonGenerator(knowledge_base, options=None)[source]

Bases: wc_kb_gen.core.KbComponentGenerator

Generator for the taxon of the organism

Options:
  • scientific_name (str): the scientific name of organism
clean_and_validate_options()[source]

Apply default options and validate options

gen_components()[source]

Construct taxon for the knowledge base

get_data()[source]

Get taxon id of organism

3.1.1.2.15. h1_hesc.kb_gen.transcripts module

Reconstruction of transcripts

Author:Yin Hoon Chew <yinhoon.chew@mssm.edu>
Date:2017-08-29
Copyright:2018, Karr Lab
License:MIT
class h1_hesc.kb_gen.transcripts.TranscriptsGenerator(knowledge_base, options=None)[source]

Bases: wc_kb_gen.core.KbComponentGenerator

Create transcripts for the knowledge base

The copy number of each transcript species type is calculated by dividing RNA biomass by the sum of the products of the relative abundance and molecular weight of each RNA, and multiplying by the relative abundance of the RNA in TPM (transcripts per million).

The half-life (in seconds) of each transcript species type is reconstructed based on its Gene Ontology (biological process) by referring to the correlation published in Yang et al (2003). The average value is used if a transcript species type is associated with more than one GO. If no association is found, a median half-life of 10 hrs from BioNumbers is used.
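
The two calculations above can be sketched as follows. The units are assumptions: RNA biomass in grams per cell and molecular weights in g/mol, so the result is multiplied by the Avogadro constant to convert moles to copies. The function names, the GO identifiers, and all numeric inputs are hypothetical toy values.

```python
import math

AVOGADRO = 6.02214076e23  # molecules per mole
DEFAULT_HALF_LIFE_S = 10 * 3600  # median half-life of 10 hrs, in seconds


def copy_numbers(tpm, molecular_weight, rna_mass_g):
    """tpm and molecular_weight (g/mol) are dicts keyed by transcript id."""
    weighted = sum(tpm[t] * molecular_weight[t] for t in tpm)
    # Biomass divided by the abundance-weighted molecular weight, in copies.
    scale = rna_mass_g / weighted * AVOGADRO
    return {t: scale * tpm[t] for t in tpm}


def half_life(go_terms, go_to_half_life_s):
    """Average half-life over the associated GO terms, else the default."""
    known = [go_to_half_life_s[g] for g in go_terms if g in go_to_half_life_s]
    return sum(known) / len(known) if known else DEFAULT_HALF_LIFE_S


tpm = {'t1': 750000.0, 't2': 250000.0}
molecular_weight = {'t1': 5.0e5, 't2': 1.0e6}
copies = copy_numbers(tpm, molecular_weight, rna_mass_g=2.0e-11)

# The copies must add back up to the RNA biomass that was distributed.
recovered_mass = sum(copies[t] * molecular_weight[t] for t in copies) / AVOGADRO
averaged = half_life(['GO:a', 'GO:b'], {'GO:a': 3600, 'GO:b': 7200})
fallback = half_life(['GO:c'], {'GO:a': 3600})
```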

Options:
  • data_path (str): path to transcriptomic data
  • correlation_path (str): path to correlation between half-life and GO
  • id_conversion_path (str): path of gene-transcript-protein id conversion table
clean_and_validate_options()[source]

Apply default options and validate options

gen_components()[source]

Construct transcript concentrations for the knowledge base

get_data()[source]

Get transcript abundance data in TPM and the correlation between GO and half-life from input files

process_data()[source]

Calculate the copy number and estimate the half-life of each transcript

3.1.1.2.16. Module contents

Reconstruction building API

Author:Jonathan Karr <jonrkarr@gmail.com>
Date:2018-01-30
Copyright:2018, Karr Lab
License:MIT