3.1.1.2. h1_hesc.kb_gen package
3.1.1.2.1. Submodules
3.1.1.2.2. h1_hesc.kb_gen.chromosomes module
Reconstruct the sequences of H1-hESC chromosomes
- Author
Yin Hoon Chew <yinhoon.chew@mssm.edu>
- Author
Jonathan Karr <jonrkarr@gmail.com>
- Date
2017-07-13
- Copyright
2017, Karr Lab
- License
MIT
class h1_hesc.kb_gen.chromosomes.ChromosomesGenerator(knowledge_base, options=None)
Bases: wc_kb_gen.core.KbComponentGenerator
Create chromosomes for the knowledge base
3.1.1.2.3. h1_hesc.kb_gen.compartments module
Reconstruct the compartments of a cell
- Author
Yin Hoon Chew <yinhoon.chew@mssm.edu>
- Date
2018-10-04
- Copyright
2018, Karr Lab
- License
MIT
class h1_hesc.kb_gen.compartments.CompartmentsGenerator(knowledge_base, options=None)
Bases: wc_kb_gen.core.KbComponentGenerator
Creates compartments and their volumetric fractions for the knowledge base from input files.
Options:
* data_path (str): path of manually curated data on the volumetric fractions of compartments
3.1.1.2.4. h1_hesc.kb_gen.complexes module
Reconstruct protein complexes
- Author
Yin Hoon Chew <yinhoon.chew@mssm.edu>
- Date
2017-11-28
- Copyright
2017, Karr Lab
- License
MIT
class h1_hesc.kb_gen.complexes.ComplexesGenerator(knowledge_base, options=None)
Bases: wc_kb_gen.core.KbComponentGenerator
Create macromolecular complexes for the knowledge base
Data on protein complexes from Human Recon 2.2, HumanCyc, Corum, and Complex Portal are merged into a consensus set. Each reconstructed protein complex contains its protein and cofactor subunits, together with their stoichiometries wherever available.
The databases differ in the information they provide. Complex Portal has both stoichiometric and isoform information (e.g. P14618-1, P14618-2). Recon and Corum have no stoichiometric information, but Corum includes protein isoform information (e.g. P78381-1, P78381-2). HumanCyc has stoichiometric information.
The consensus set is compiled by evaluating and adding the complex entries from each database in turn:
1) All complex entries in Complex Portal are added. If two complexes have protein subunits that differ only in their isoform information, they are considered the same complex because the mapping of UniProt isoforms to Ensembl proteins is incomplete.
2) Entries from Recon are added if no complex with similar protein subunits is found in the compiled set.
3) Entries from Corum are added if no complex with similar protein subunits is found in the compiled set. Isoform information is not considered.
4) HumanCyc defines complexes inconsistently with the other databases: some HumanCyc complexes are only subsets of complexes from other databases (see the example of P03886). We therefore use only its stoichiometric information, to fill in the missing stoichiometries of entries in the compiled set where all protein subunits are similar.
5) Finally, for each entry in the consensus set, if its protein subunits are a subset of the protein subunits of another entry, the stoichiometry of the superset entry is lifted to the subset entry.
- Options:
  - id_conversion_path (str): path of gene-transcript-protein id conversion table
  - manually_curated_data_path (str): path of manually curated enzyme complexes
  - humancyc_enzyme_path (str): path of HumanCyc data on enzymes
  - humancyc_transporter_path (str): path of HumanCyc data on transporters
  - humancyc_geneproduct_path (str): path of HumanCyc data on gene products
  - humancyc_gene_path (str): path of HumanCyc data on genes
  - non_protein_conversion (dict): dictionary where the keys are the non-protein subunit ids used in Complex Portal and the values are references to the species type objects the ids refer to
  - test_protein (list): list of protein UniProt IDs for which complexes are to be retrieved from Corum (for unit testing purposes)
  - test_complex (list, optional): list of complexes to be reconstructed (for unit testing purposes)
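The compilation of the consensus set can be sketched as follows. This is a minimal illustration, not the generator's code: the function names, the dictionary layout, and the reading of "similar protein subunits" as "identical sets of accessions once isoform suffixes are dropped" are all assumptions, and the Q-prefixed accessions are placeholders.

```python
import re


def strip_isoform(accession):
    """Drop a trailing isoform suffix such as '-1' from a UniProt accession."""
    return re.sub(r"-\d+$", "", accession)


def subunit_key(subunits):
    """Isoform-insensitive key for a complex: the set of base accessions."""
    return frozenset(strip_isoform(s) for s in subunits)


def compile_consensus(databases):
    """Add complexes database by database, skipping ones already present.

    `databases` is an ordered list of (name, complexes) pairs. Each complex
    maps a subunit accession to its stoichiometry (None if unknown).
    """
    consensus = {}
    for name, complexes in databases:
        for subunits in complexes:
            key = subunit_key(subunits)
            if key not in consensus:
                consensus[key] = {
                    "source": name,
                    "stoichiometry": {
                        strip_isoform(s): n for s, n in subunits.items()
                    },
                }
    return consensus


def fill_stoichiometry(consensus, donor_complexes):
    """Fill missing stoichiometries from donor complexes with matching subunits."""
    for subunits in donor_complexes:
        entry = consensus.get(subunit_key(subunits))
        if entry is None:
            continue  # the donor complex has no counterpart in the consensus set
        for s, n in subunits.items():
            base = strip_isoform(s)
            if entry["stoichiometry"].get(base) is None:
                entry["stoichiometry"][base] = n


# Toy data: one entry shared by two databases, one entry only in Recon
complex_portal = [{"P14618-1": 4}]
recon = [{"P14618": None}, {"Q00001": None, "Q00002": None}]
humancyc = [{"Q00001": 1, "Q00002": 2}]

consensus = compile_consensus([("Complex Portal", complex_portal), ("Recon", recon)])
fill_stoichiometry(consensus, humancyc)
```

Keying each complex by a frozen set of base accessions makes the isoform-insensitive comparison a plain dictionary lookup. The final subset-to-superset lifting step is left out of the sketch.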
3.1.1.2.5. h1_hesc.kb_gen.core module
Generator for the knowledge base
- Author
Yin Hoon Chew <yinhoon.chew@mssm.edu>
- Author
Jonathan Karr <jonrkarr@gmail.com>
- Date
2018-01-29
- Copyright
2018, Karr Lab
- License
MIT
class h1_hesc.kb_gen.core.KbGenerator(component_generators=None, options=None)
Bases: wc_kb_gen.core.KbGenerator
Generator for the knowledge base
Options:
* id
* name
* version
* component
  * TaxonGenerator
  * CompartmentsGenerator
  * ChromosomesGenerator
  * GenesTranscriptsExonsProteinsGenerator
  * RegulatoryModulesGenerator
  * MetabolitesGenerator
  * PropertiesGenerator
  * TranscriptsGenerator
  * MetabolicNetworkGenerator
  * ProteinsGenerator
  * ComplexesGenerator
  * MetabolicReactionKineticsGenerator
  * ProteinModificationsGenerator
DEFAULT_COMPONENT_GENERATORS = (<class 'h1_hesc.kb_gen.taxon.TaxonGenerator'>, <class 'h1_hesc.kb_gen.compartments.CompartmentsGenerator'>, <class 'h1_hesc.kb_gen.chromosomes.ChromosomesGenerator'>, <class 'h1_hesc.kb_gen.genes_transcripts_exons_proteins.GenesTranscriptsExonsProteinsGenerator'>, <class 'h1_hesc.kb_gen.regulatory_modules.RegulatoryModulesGenerator'>, <class 'h1_hesc.kb_gen.metabolites.MetabolitesGenerator'>, <class 'h1_hesc.kb_gen.properties.PropertiesGenerator'>, <class 'h1_hesc.kb_gen.transcripts.TranscriptsGenerator'>, <class 'h1_hesc.kb_gen.metabolic_network.MetabolicNetworkGenerator'>, <class 'h1_hesc.kb_gen.proteins.ProteinsGenerator'>, <class 'h1_hesc.kb_gen.complexes.ComplexesGenerator'>, <class 'h1_hesc.kb_gen.metabolic_reaction_kinetics.MetabolicReactionKineticsGenerator'>, <class 'h1_hesc.kb_gen.protein_modifications.ProteinModificationsGenerator'>)
h1_hesc.kb_gen.core.main(core_path=None, seq_path=None, read=False, test_kb=None, test_mode=False, in_phase=False)
Generates a knowledge base for H1 hESCs
- Parameters
  - core_path (str, optional) – path to read/write the knowledge base
  - seq_path (str, optional) – path to the genome sequence
  - read (bool, optional) – if True, the knowledge base is generated by deserializing a previously created Excel spreadsheet given by core_path; otherwise, it is generated by running the generator and serializing the output
  - test_kb (wc_kb.core.KnowledgeBase, optional) – KB object provided for test purposes only
  - test_mode (bool, optional) – if True, the knowledge base generator runs in test mode using a subset of the genome
  - in_phase (bool, optional) – if True, the knowledge base generator runs in phases and writes to the path between phases
3.1.1.2.6. h1_hesc.kb_gen.genes_transcripts_exons_proteins module
Reconstruct genes, pre-rnas, transcripts, exons, CDSs and proteins
- Author
Yin Hoon Chew <yinhoon.chew@mssm.edu>
- Author
Jonathan Karr <jonrkarr@gmail.com>
- Date
2018-01-30
- Copyright
2018, Karr Lab
- License
MIT
class h1_hesc.kb_gen.genes_transcripts_exons_proteins.GenesTranscriptsExonsProteinsGenerator(knowledge_base, options=None)
Bases: wc_kb_gen.core.KbComponentGenerator
Create genes, pre-RNAs, transcripts, exons, CDSs and proteins for the knowledge base by parsing a genome annotation file. To speed up processing, the genome annotation file is divided into smaller chunks, which are grouped into batches. The batches are processed sequentially while the chunks in each batch are processed in parallel. At the end of each batch, the intermediate results are dumped into a pickle file.
clean_and_validate_options()
Apply default options and validate options
- Options:
  - data_path (str, optional): path to genome annotation lifted from the reference genome, default is a system path where an annotation file will be downloaded from Quilt
  - trna_data_path (str, optional): path to data on cytoplasmic tRNA
  - mt_trna_data_path (str, optional): path to data on mitochondrial tRNA
  - rrna_list (list of str, optional): list of the gene names of each representative nuclear-encoded rRNA in the genome annotation
  - rrna_18S_28S_list (list of str, optional): list of the transcript names of representative 18S and 28S rRNAs not in the genome annotation
  - pickle_path (str): path to directory to store pickle files of constructed genes, pre-RNAs, transcripts, exons, CDSs and proteins
  - id_conversion_path (str): path of gene-transcript-protein id conversion table
  - chunk_size (int, optional): byte size of each chunk when breaking the genome annotation file into smaller chunks, default is 1024*1024
  - batch_size (int, optional): number of chunks in each batch, default is 300
  - test_mode (bool, optional): mode for testing and debugging where only the first batch is processed if set to True, default is False
  - scaled_down_gene_list (list, optional): list of Ensembl protein-producing gene IDs that, if provided, will be the only genes constructed
  - scaled_down_trna_list (list, optional): list of nuclear-encoded tRNA gene IDs that, if provided, will be the only nuclear-encoded tRNA genes constructed
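The chunk-and-batch scheme described for this generator can be illustrated with a small, self-contained sketch. It is not the generator's code: the helper names, the tab-separated toy annotation, and the use of a thread pool (where the real generator may use separate processes) are assumptions made to keep the example short and runnable.

```python
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def split_into_chunks(text, chunk_size):
    """Split text into chunks of at least chunk_size characters, ending on line breaks."""
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        current.append(line)
        size += len(line)
        if size >= chunk_size:
            chunks.append("".join(current))
            current, size = [], 0
    if current:
        chunks.append("".join(current))
    return chunks


def parse_chunk(chunk):
    """Stand-in parser: collect the feature type (third column) of each record."""
    return [
        line.split("\t")[2]
        for line in chunk.splitlines()
        if line and not line.startswith("#")
    ]


def process_in_batches(chunks, batch_size, pickle_dir):
    """Process batches sequentially and the chunks within a batch in parallel."""
    results = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        with ThreadPoolExecutor() as pool:
            # pool.map returns results in the order of the input chunks
            for parsed in pool.map(parse_chunk, batch):
                results.extend(parsed)
        # Dump the intermediate results at the end of each batch
        batch_file = Path(pickle_dir) / f"batch_{start // batch_size}.pkl"
        with open(batch_file, "wb") as fh:
            pickle.dump(results, fh)
    return results


# Toy annotation with one header line and four records
annotation = "##gff-version 3\n" + "".join(
    f"chr1\tsrc\t{kind}\t1\t10\n" for kind in ["gene", "mRNA", "exon", "CDS"]
)

with tempfile.TemporaryDirectory() as tmp:
    features = process_in_batches(
        split_into_chunks(annotation, 30), batch_size=2, pickle_dir=tmp
    )
    n_pickles = len(list(Path(tmp).glob("batch_*.pkl")))
```

Writing a pickle after every batch means an interrupted run loses at most one batch of work, which is presumably why the intermediate dumps exist. Cutting chunks only at line breaks keeps a record from being split across two chunks.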
3.1.1.2.7. h1_hesc.kb_gen.metabolic_network module
Reconstruct the metabolic network
- Author
Yin Hoon Chew <yinhoon.chew@mssm.edu>
- Date
2018-08-10
- Copyright
2018, Karr Lab
- License
MIT
class h1_hesc.kb_gen.metabolic_network.MetabolicNetworkGenerator(knowledge_base, options=None)
Bases: wc_kb_gen.core.KbComponentGenerator
Reconstruct metabolic network for the knowledge base
Starting from Recon 2.2, existing reactions are modified and new reactions and their associated genes are added to the network to:
- transport nutrients from mTeSR1, a medium that is commonly used in hESC cultures
- produce and recycle metabolites that are involved in non-metabolic pathways
- serve as alternative pathways to non-essential reactions
- improve the representation of energy metabolism by modifying mitochondrial metabolism in Recon 2.2 based on MitoCore
- Options:
  - metabolic_network_path (str): path of human metabolic reconstruction
  - new_reactions_data_path (str): path of new reactions and modified reactions
  - scaled_down_reaction_list (list, optional): list of metabolic reaction IDs that, if provided, will be the only metabolic reactions constructed
3.1.1.2.8. h1_hesc.kb_gen.metabolic_reaction_kinetics module
Retrieve kinetic data from SABIO-RK for a taxon such as mammals (40674)
- Author
Jonathan Karr <jonrkarr@gmail.com>
- Author
Yin Hoon Chew <yinhoon.chew@mssm.edu>
- Date
2017-07-25
- Copyright
2018, Karr Lab
- License
MIT
class h1_hesc.kb_gen.metabolic_reaction_kinetics.MetabolicReactionKineticsGenerator(knowledge_base, options=None)
Bases: wc_kb_gen.core.KbComponentGenerator
Reconstruct the rate laws and rate parameters of metabolic reactions from SABIO-RK
First, the InChI structures of metabolites are used to find the SABIO-RK compound for each species. Then, the missing EC numbers of reactions are imputed. Data measured in organisms within the same taxon as the modeled cell are retrieved from SABIO-RK and matched to the reactions. Each kinetic parameter associated with a reaction is then matched to the protein complexes.
- Options:
  - id_conversion_path (str): path of gene-transcript-protein id conversion table
  - mutant_inclusion (bool): mutant inclusion parameter for querying SABIO-RK
  - min_temp (float): minimum temperature for querying SABIO-RK
  - max_temp (float): maximum temperature for querying SABIO-RK
  - min_ph (float): minimum pH for querying SABIO-RK
  - max_ph (float): maximum pH for querying SABIO-RK
  - j_threshold (float): the minimum Jaccard similarity between the gene-reaction rules in the metabolic network and the reconstructed complexes that determines whether a protein complex is assigned to the reaction
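The role of j_threshold can be illustrated with a short sketch of Jaccard-based matching. The function names and gene identifiers are hypothetical. It is also an assumption here that a gene-reaction rule can be treated as a flat set of genes, and that a complex is accepted when its similarity is greater than or equal to the threshold.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of gene ids."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def assign_complexes(rule_genes, complexes, j_threshold):
    """Return the ids of complexes whose subunit genes are similar enough to the rule."""
    return [
        complex_id
        for complex_id, genes in complexes.items()
        if jaccard(rule_genes, genes) >= j_threshold
    ]


# Toy data: three complexes compared against one gene-reaction rule
complexes = {
    "cplx_1": {"g1", "g2", "g3"},  # shares 2 of 3 genes with the rule
    "cplx_2": {"g1", "g4"},        # shares 1 of 3 genes with the rule
    "cplx_3": {"g5"},              # shares no gene with the rule
}
matched = assign_complexes({"g1", "g2"}, complexes, j_threshold=0.5)
```

With a threshold of 0.5, only cplx_1 (similarity 2/3) is assigned to the reaction. Raising the threshold towards 1 demands a near-exact match between rule and complex.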
3.1.1.2.9. h1_hesc.kb_gen.metabolites module
Reconstruct the metabolites in a cell
- Author
Yin Hoon Chew <yinhoon.chew@mssm.edu>
- Date
2018-08-05
- Copyright
2018, Karr Lab
- License
MIT
class h1_hesc.kb_gen.metabolites.MetabolitesGenerator(knowledge_base, options=None)
Bases: wc_kb_gen.core.KbComponentGenerator
Creates metabolites for the knowledge base from input files.
Options:
* metabolic_network_path (str): path of human metabolic reconstruction
* new_metabolites_data_path (str): path of manually curated new metabolites
* concentration_data_path (str): path of manually curated metabolite concentrations
* scaled_down_metabolite_list (list, optional): list of metabolite IDs that, if provided, will be the only metabolites whose free-pool concentrations are constructed
gen_components()
Construct metabolite species types and concentrations for the knowledge base
get_data()
Get metabolite species from the human metabolic network and manually added metabolite species from the input file
Get intracellular concentrations of metabolites from the input file and Datanator
process_data()
Merge and assign the concentration data to the metabolite species types in the knowledge base using the following steps:
1) Assign concentration data from manual curation to the metabolite species
2) For the remaining metabolite species:
  a) Exactly match the InChI strings of metabolites in the database
  b) If no match is found in step a, match InChI strings by successively ignoring the charge and stereochemistry layers
  c) If no match is found in step b, extract the non-matching InChI strings that are not cardiolipin for manual matching
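The successive relaxation of the InChI comparison in step 2 can be sketched as below. This is an illustration only: the function names are hypothetical, the order of relaxation (charge first, then stereochemistry) is an assumption, and layers are identified simply by their one-letter prefixes (q and p for charge, b, t, m and s for stereochemistry).

```python
CHARGE_LAYERS = ("q", "p")            # net charge and (de)protonation layers
STEREO_LAYERS = ("b", "t", "m", "s")  # double-bond and tetrahedral stereo layers


def drop_layers(inchi, prefixes):
    """Remove every InChI layer whose one-letter prefix is in `prefixes`."""
    head, *layers = inchi.split("/")
    kept = [layer for layer in layers if layer[:1] not in prefixes]
    return "/".join([head] + kept)


def match_inchi(query, database):
    """Return (database key, match level), or (None, None) if nothing matches."""
    levels = [
        ("exact", ()),
        ("no charge", CHARGE_LAYERS),
        ("no charge or stereochemistry", CHARGE_LAYERS + STEREO_LAYERS),
    ]
    for level, prefixes in levels:
        target = drop_layers(query, prefixes)
        for name, inchi in database.items():
            if drop_layers(inchi, prefixes) == target:
                return name, level
    return None, None  # candidate for manual matching


# Toy database with a single entry
database = {
    "L-alanine": "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1",
}
deprotonated = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/p-1/t2-/m0/s1"
no_stereo = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"

charge_match = match_inchi(deprotonated, database)
stereo_match = match_inchi(no_stereo, database)
```

Returning the match level alongside the key records how much structural detail had to be ignored, which is useful when reviewing the assigned concentrations.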
3.1.1.2.10. h1_hesc.kb_gen.properties module
Reconstruction of cell properties
- Author
Yin Hoon Chew <yinhoon.chew@mssm.edu>
- Date
2018-11-15
- Copyright
2018, Karr Lab
- License
MIT
3.1.1.2.11. h1_hesc.kb_gen.protein_modifications module
Phosphorylation sites and ratio calculation from the H1-ESC iTRAQ-8plex data (Phanstiel et al., Nat Methods 2011)
- Author
Yassmine Chebaro <chebaro@igbmc.fr>
- Date
2019-01-24
- Copyright
2019, Karr Lab
- License
MIT
class h1_hesc.kb_gen.protein_modifications.ProteinModificationsGenerator(knowledge_base, options=None)
Bases: wc_kb_gen.core.KbComponentGenerator
Create phosphorylation states for the knowledge base
The phosphorylation ratio of each protein is calculated from the quantification provided in the supplementary material of Phanstiel et al., as follows:
quantification(phos)/(quantification(proteome)+quantification(phos))
In cases where the quantification of a protein is NA or absent, no ratio is calculated for the corresponding phospho-protein.
- Options:
  - phosphoproteomics_data_path (str): path of the phosphoproteome data from Phanstiel et al.
  - id_conversion_path (str): path of gene-transcript-protein id conversion table
get_data()
Get proteome and phosphoproteome data from xlsx spreadsheets and get the mapping of IPI to Ensembl protein
process_data()
The phosphorylation ratio per site for each protein is calculated from the MS quantification data obtained from Phanstiel et al., Nat Methods 2011. The following steps are performed to allow for more flexibility in the future if needed (e.g. in deciding what to do with N/A or absent data):
1) Extract phosphorylated proteins whose non-phosphorylated equivalent is present in the proteomics data and calculate the ratio for each phospho-site per protein per set of experiments. Three replicates are available for the H1 data: ES (H1) iTRAQ-117, ES (H1) iTRAQ-113 and ES (H1) iTRAQ-117 (changed to ES (H1) iTRAQ-117b here for simplification). Clean the ratios: where N/A or no value is present, the previous step outputs nan, and these ratios are replaced with nan. This is not combined with the previous step in case one does not want to remove the nans.
2) Calculate the average ratio per site per protein and map the IPI to Ensembl protein after removing the nan ratios
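The ratio and the averaging over replicates can be written out as a short sketch. The function names and the toy quantification values are made up for illustration, and missing quantifications are represented here by None, which is an assumption about the input format.

```python
import math


def phospho_ratio(phos, proteome):
    """quantification(phos) / (quantification(proteome) + quantification(phos)).

    Returns nan when either quantification is missing (None) or the sum is zero.
    """
    if phos is None or proteome is None:
        return math.nan
    total = phos + proteome
    return phos / total if total else math.nan


def mean_ratio(replicates):
    """Average the per-replicate ratios of one phospho-site, ignoring nan values."""
    ratios = [phospho_ratio(phos, proteome) for phos, proteome in replicates]
    valid = [r for r in ratios if not math.isnan(r)]
    return sum(valid) / len(valid) if valid else math.nan


# (phospho, proteome) quantifications of one site in three replicates;
# the second replicate has no phospho quantification
site_replicates = [(1.0, 3.0), (None, 2.0), (2.0, 2.0)]
average = mean_ratio(site_replicates)
```

Keeping the per-replicate calculation separate from the averaging mirrors the two steps above, so a different policy for missing values could be substituted without touching the ratio itself.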
3.1.1.2.12. h1_hesc.kb_gen.proteins module
Reconstruction of proteins
- Author
Yin Hoon Chew <yinhoon.chew@mssm.edu>
- Date
2017-07-19
- Copyright
2018, Karr Lab
- License
MIT
class h1_hesc.kb_gen.proteins.ProteinsGenerator(knowledge_base, options=None)
Bases: wc_kb_gen.core.KbComponentGenerator
Create proteins for the knowledge base
First, the protein species in each compartment are generated by merging protein localization data from the metabolic network and Cell Atlas into a consensus set. Transporter proteins in the metabolic network that catalyze reactions involving metabolites from multiple compartments are localized to the membrane compartments to ensure the reactions will occur in the model. For proteins not included in the metabolic network or Cell Atlas, localization data from UniProt are used. Proteins not included in any of the data sources above are assumed to localize to the cytosol.
Second, H1 protein abundance data are extracted from PaxDB and supplemented by merging with protein abundance data from other human cell lines. In cases where multiple supplementary datasets contain measurements for the same protein, the supplementary dataset that correlates best with the main dataset is selected. A housekeeping protein is used to normalize the datasets. The abundances of proteins without any measurement are set to the median measured abundance. Inconsistencies between the proteomic and transcriptomic data are then resolved.
Third, the concentration of each protein species in each compartment is set by distributing the whole-cell protein concentration based on the volumetric fraction of the compartment.
Fourth, the half-lives of proteins are taken from the literature. In cases where measurements are not available, the half-lives are estimated based on the correlation between half-life and protein abundance that has been reported in the literature.
- Options:
  - id_conversion_path (str): path to gene-transcript-protein id conversion table
  - localization_data_path (str): path to localization data from Cell Atlas
  - uniprot_localization_data_path (str): path to localization data from UniProt
  - substructure_to_compartment (dict): dictionary to translate subcellular structures defined in Cell Atlas (keys) to compartments defined in Recon 2.2
  - uniprot_location_to_compartment (dict): dictionary to translate subcellular locations defined in UniProt (keys) to compartments defined in Recon 2.2
  - housekeeping_gene (str): the Ensembl protein id of the housekeeping protein
  - main_dataset (str): the filename of the H1 dataset in PaxDB
  - half_life_data_path (str): path to protein half-life data
  - rrna_18S_seq_path (str, optional): path to sequence fasta file for representative 18S rRNA
  - rrna_28S_seq_path (str, optional): path to sequence fasta file for representative 28S rRNA
get_data()
Get protein localization data from Cell Atlas, protein abundance data from PaxDB and protein half-life data
process_data()
Protein localization data from Cell Atlas, the metabolic network and UniProt are merged
Whole-cell absolute protein abundances (mol/cell) are calculated using the following steps:
1) Combine relative abundances measured in H1 with data lifted over from other cell lines
2) Set the relative abundances of non-measured proteins to the median protein abundance reconstructed in Step 1
3) Resolve inconsistencies between the reconstructed relative protein abundances and observed transcript abundances:
  a) If the relative protein abundance is zero but the transcript has been shown to be expressed in H1, set the protein abundance to the median abundance reconstructed in Step 1
  b) If the relative protein abundance is not zero but the measured transcript abundance is zero, check whether the proteomic data were measured in H1 or lifted from another cell line. If the protein abundance was lifted from another cell line, set the protein abundance to zero. Otherwise, set the transcript abundance to the median transcript abundance
4) Calculate absolute protein abundances by multiplying relative protein abundances by the total H1 protein mass
Because each protein defined above represents one generic product of each gene, its abundance is further distributed among its protein variants based on the relative abundance of the transcript that is translated into each variant.
The correlation between protein half-life and protein abundance is then determined and used to estimate the half-lives of proteins for which measurements are not available.
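Steps 2 and 3 of the abundance reconstruction can be sketched as follows. The function name, the data layout, and the identifiers are hypothetical. Several details are assumptions rather than documented behaviour: the medians are taken over all available values (zeros included), and proteins filled with the median in step 2 then go through the same consistency checks as measured ones.

```python
from statistics import median


def reconcile(protein_abundance, transcript_abundance, measured_in_h1):
    """Fill unmeasured proteins and resolve protein/transcript inconsistencies.

    protein_abundance: protein id -> relative abundance, or None if unmeasured
    transcript_abundance: protein id -> abundance of the encoding transcript
    measured_in_h1: ids whose protein abundance was measured in H1 itself
    """
    measured = [v for v in protein_abundance.values() if v is not None]
    protein_median = median(measured)
    transcript_median = median(transcript_abundance.values())

    proteins, transcripts = {}, dict(transcript_abundance)
    for pid, value in protein_abundance.items():
        if value is None:
            value = protein_median  # step 2: no measurement at all
        rna = transcripts.get(pid, 0.0)
        if value == 0 and rna > 0:
            value = protein_median  # step 3a: transcript expressed, protein zero
        elif value > 0 and rna == 0:
            if pid in measured_in_h1:
                transcripts[pid] = transcript_median  # step 3b: trust the H1 protein data
            else:
                value = 0.0  # step 3b: abundance was lifted from another cell line
        proteins[pid] = value
    return proteins, transcripts


# Toy data: p3 is unmeasured, p5 was lifted from another cell line
protein_abundance = {"p1": 10.0, "p2": 0.0, "p3": None, "p4": 4.0, "p5": 6.0}
transcript_abundance = {"p1": 5.0, "p2": 3.0, "p3": 1.0, "p4": 0.0, "p5": 0.0}
proteins, transcripts = reconcile(
    protein_abundance, transcript_abundance, measured_in_h1={"p1", "p2", "p4"}
)
```

In this toy case p2 and p3 receive the median protein abundance, p4 keeps its H1 measurement while its transcript is raised to the median, and the lifted value of p5 is discarded.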
3.1.1.2.13. h1_hesc.kb_gen.regulatory_modules module
Reconstruct regulatory features
- Author
Yin Hoon Chew <yinhoon.chew@mssm.edu>
- Date
2018-11-27
- Copyright
2018, Karr Lab
- License
MIT
class h1_hesc.kb_gen.regulatory_modules.RegulatoryModulesGenerator(knowledge_base, options=None)
Bases: wc_kb_gen.core.KbComponentGenerator
Create regulatory elements and modules for the knowledge base
3.1.1.2.14. h1_hesc.kb_gen.taxon module
Reconstruct the taxon
- Author
Jonathan Karr <jonrkarr@gmail.com>
- Author
Yin Hoon Chew <yinhoon.chew@mssm.edu>
- Date
2018-02-05
- Copyright
2017, Karr Lab
- License
MIT
3.1.1.2.15. h1_hesc.kb_gen.transcripts module
Reconstruction of transcripts
- Author
Yin Hoon Chew <yinhoon.chew@mssm.edu>
- Date
2017-08-29
- Copyright
2018, Karr Lab
- License
MIT
class h1_hesc.kb_gen.transcripts.TranscriptsGenerator(knowledge_base, options=None)
Bases: wc_kb_gen.core.KbComponentGenerator
Create transcripts for the knowledge base
The copy number of each transcript species type is calculated by dividing the RNA biomass by the sum of the products of the relative abundance and molecular weight of each RNA, and multiplying the result by the relative abundance of the RNA in TPM (transcripts per million).
The half-life (in seconds) of each transcript species type is reconstructed based on its Gene Ontology (biological process) terms by referring to the correlation published in Yang et al. (2003). The average value is used if a transcript species type is associated with more than one GO term. If no association is found, a median half-life of 10 hrs from BioNumbers is used.
- Options:
  - data_path (str): path to transcriptomic data
  - trna_data_path (str, optional): path to tRNA
  - rrna_variants_path (str, optional): path to directory where pickle file of extra rRNA variants is stored
  - rrna_18S_seq_path (str, optional): path to sequence fasta file for representative 18S rRNA
  - rrna_28S_seq_path (str, optional): path to sequence fasta file for representative 28S rRNA
  - correlation_path (str): path to correlation between half-life and GO
  - id_conversion_path (str): path of gene-transcript-protein id conversion table
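The copy-number and half-life rules described for this generator can be made concrete with a small sketch. The function names, units, and numbers are assumptions for illustration: RNA mass is taken in grams per cell and molecular weights in g/mol, so Avogadro's number is introduced here to convert moles to molecules, and the GO identifiers and half-lives are invented.

```python
AVOGADRO = 6.02214076e23   # molecules per mole
DEFAULT_HALF_LIFE_S = 10 * 3600  # 10 hrs, used when no GO association is found


def copy_numbers(rna_mass_g, tpm, mol_weight):
    """copies_i = N_A * mass / sum_j(tpm_j * MW_j) * tpm_i."""
    weighted = sum(tpm[t] * mol_weight[t] for t in tpm)
    scale = rna_mass_g * AVOGADRO / weighted
    return {t: scale * tpm[t] for t in tpm}


def half_life(go_terms, go_half_life_s):
    """Average the half-lives of the associated GO terms, else use the default."""
    known = [go_half_life_s[g] for g in go_terms if g in go_half_life_s]
    return sum(known) / len(known) if known else DEFAULT_HALF_LIFE_S


# Toy data: two transcripts sharing 20 pg of RNA
tpm = {"t1": 750000.0, "t2": 250000.0}
mol_weight = {"t1": 5.0e5, "t2": 1.0e6}
copies = copy_numbers(2.0e-11, tpm, mol_weight)

# Mass check: the copy numbers should add back up to the RNA mass
recovered_mass = sum(copies[t] * mol_weight[t] for t in copies) / AVOGADRO

go_half_life_s = {"GO:0000001": 3600.0, "GO:0000002": 7200.0}
averaged = half_life(["GO:0000001", "GO:0000002"], go_half_life_s)
fallback = half_life([], go_half_life_s)
```

Because the relative abundances appear in both the numerator and the denominator, their normalization (TPM or fractions) cancels out. The mass check at the end is a convenient sanity test for the unit handling.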
3.1.1.2.16. Module contents
Reconstruction building API
- Author
Jonathan Karr <jonrkarr@gmail.com>
- Date
2018-01-30
- Copyright
2018, Karr Lab
- License
MIT