2.1. BpForms notation

The BpForms notation unambiguously represents the primary structure of biopolymer forms that contain canonical and non-canonical monomers using (a) a syntax similar to FASTA and (b) extended alphabets for DNA, RNA, and proteins based on DNAmod, MODOMICS, and RESID. This enables BpForms to calculate the formula, molecular weight, and charge of biopolymer forms.

  • Canonical monomers are indicated by their single upper case character codes.
  • Non-canonical monomers defined in the alphabets that have single-character codes are indicated by these codes.
  • Non-canonical monomers defined in the alphabets that have multiple-character codes are indicated by these codes delimited by curly brackets.
  • Non-canonical monomers that are not defined in the alphabet can be defined “inline” with multiple attributes separated by vertical pipes (“|”) enclosed inside square brackets. The structures of these monomers are defined in SMILES format using the structure attribute. Additional attributes can provide metadata about monomers such as their ids and names.

BpForms contains three pre-built canonical DNA, RNA and protein alphabets and three extended DNA, RNA, and protein alphabets based on DNAmod, MODOMICS, and RESID. Users can also create additional custom alphabets.

Examples:

  • [id: "dI" | name: "deoxyinosine"]ACGC: represents deoxyinosine at the first position
  • AC[id: "dI" | name: "deoxyinosine"]GC: represents deoxyinosine at the third position
  • ACGC{6A}: represents N6-methyladenosine at the last position
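
The three kinds of monomer references above (single-character codes, multi-character codes in curly brackets, and inline definitions in square brackets) can be separated with a small tokenizer. The following is a minimal sketch using only the Python standard library; it is not the BpForms parser, and the names TOKEN and split_monomers are hypothetical. It assumes that square brackets may appear inside quoted values (as in SMILES strings) and that whitespace between monomers can be skipped.

```python
import re

# One alternative per notation rule. Quoted values inside an inline
# definition may contain "]" (e.g. SMILES such as "[NH3+]"), so quoted
# strings are consumed as a unit.
TOKEN = re.compile(r'''
    (?P<inline>\[(?:[^"\]]|"(?:[^"\\]|\\.)*")*\])   # inline definition
  | (?P<multi>\{[^\[\]{}"]+\})                      # multi-character code
  | (?P<single>[^\[\]{}"\s])                        # single-character code
''', re.VERBOSE)


def split_monomers(seq):
    """Return a list of (kind, text) pairs, one per monomer in seq."""
    tokens = []
    pos = 0
    while pos < len(seq):
        if seq[pos].isspace():  # assumption: whitespace between monomers is skipped
            pos += 1
            continue
        match = TOKEN.match(seq, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}: {seq[pos]!r}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    for kind, text in split_monomers('AC[id: "dI" | name: "deoxyinosine"]GC{6A}'):
        print(kind, text)
```

The position of each pair in the returned list corresponds to the position of the monomer in the sequence, so the inline deoxyinosine definition in the usage line is the third element.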

2.1.1. Structure

The structure attribute describes the chemical structure of the monomer as a SMILES-encoded string:

[id: "dI" |
    structure: "O=C1NC=NC2=C1N=CN2"
    ]

The monomer-bond-atom, monomer-displaced-atom, left-bond-atom, left-displaced-atom, right-bond-atom, and right-displaced-atom attributes describe the linkages between the monomer and the backbone and between successive monomers:

[id: "dI" |
    structure: "O=C1NC=NC2=C1N=CN2" |
    monomer-bond-atom: Monomer / N / 10 / 0 |
    monomer-displaced-atom: Monomer / H / 10 / 0
    ]

We recommend defining these attributes for each monomer. These attributes must be defined to calculate the structure, formula, molecular weight, and charge of the biopolymer.
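
Each of these attributes takes an atom value with four fields separated by "/": molecule (Monomer or Backbone), element, position, and charge, as defined by the atom rule in the grammar (Section 2.1.5). The following minimal sketch decomposes such a value with the Python standard library. The Atom type and parse_atom function are hypothetical names used for illustration and are not part of BpForms.

```python
import re
from typing import NamedTuple


class Atom(NamedTuple):
    molecule: str   # "Monomer" or "Backbone"
    element: str    # element symbol, e.g. "N"
    position: int   # non-negative integer
    charge: int     # signed integer


# Field patterns copied from the atom_* rules of the grammar; optional
# whitespace is allowed around each "/" separator.
ATOM = re.compile(
    r'\s*(Monomer|Backbone)\s*/\s*([A-Z][a-z]?)\s*/\s*([0-9]+)\s*/\s*(-?[0-9]+)\s*'
)


def parse_atom(text):
    """Parse a value such as 'Monomer / N / 10 / 0' into an Atom."""
    match = ATOM.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid atom value: {text!r}")
    molecule, element, position, charge = match.groups()
    return Atom(molecule, element, int(position), int(charge))


if __name__ == "__main__":
    print(parse_atom("Monomer / N / 10 / 0"))
```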

2.1.2. Uncertainty

BpForms can also represent two types of uncertainty in the structures of biopolymer forms.

  • The delta-mass and delta-charge attributes can describe uncertainty in the chemical identities of monomers. For example, [id: "dAMP" | delta-mass: 1 | delta-charge: 1] indicates the presence of an additional proton whose exact location is not known.
  • The position attribute can describe uncertainty in the position of a monomer. For example, [id: "5mC" | position: 2-3] indicates that 5mC may occur at any position from the second through the third.
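
The grammar (Section 2.1.5) makes both the start and the end of a position range optional. The following minimal sketch parses such a range and lists the candidate positions it covers. The function names are hypothetical, and the treatment of an omitted bound (start of the sequence or end of the sequence) is an assumption made for illustration, not documented BpForms behavior.

```python
import re

# START_POSITION? "-" END_POSITION? from the grammar, tolerating whitespace.
POSITION = re.compile(r'\s*([0-9]+)?\s*-\s*([0-9]+)?\s*')


def parse_position(text):
    """Return (start, end); a missing bound is returned as None."""
    match = POSITION.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid position range: {text!r}")
    start, end = match.groups()
    return (int(start) if start else None, int(end) if end else None)


def candidate_positions(start, end, seq_len):
    """List the 1-based positions covered by a range.

    Assumption: an omitted start means position 1 and an omitted end
    means the last position of the sequence.
    """
    first = 1 if start is None else start
    last = seq_len if end is None else end
    return list(range(first, last + 1))


if __name__ == "__main__":
    start, end = parse_position("2-3")
    print(candidate_positions(start, end, seq_len=5))
```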

2.1.3. Metadata

BpForms can also represent several types of metadata:

  • The id and name attributes provide human-readable labels for monomers. Only one id and one name are allowed per monomer:

    [id: "dI"
        | name: "deoxyinosine"
        ]
    
  • The synonym attribute provides additional human-readable labels. Each monomer can have multiple synonyms:

    [id: "dI"
        | synonym: "2'-deoxyinosine"
        | synonym: "2'-deoxyinosine, 9-[(2R,4S,5R)-4-hydroxy-5-(hydroxymethyl)tetrahydrofuran-2-yl]-9H-purin-6-ol"
        ]
    
  • The identifier attribute describes references to entries in external databases. Each monomer can have multiple identifiers. The namespace and id of each identifier must be separated by “/”:

    [id: "dI"
        | identifier: "biocyc.compound" / "DEOXYINOSINE"
        | identifier: "chebi" / "CHEBI:28997"
        | identifier: "pubchem.compound" / "65058"
        ]
    
  • The base-monomer attribute describes the monomer(s) from which this monomer is generated. The value of this attribute must be the code of a monomer in the alphabet. Each monomer can have one or more bases. This annotation is needed to generate more informative FASTA sequences for BpForms:

    [id: "m2A"
        | base-monomer: "A"
        ]
    
  • The comments attribute describes additional information about each monomer. Each monomer can only have one comment:

    [id: "dI"
        | comments: "A purine 2'-deoxyribonucleoside that is inosine in which the
                     hydroxy group at position 2' is replaced by a hydrogen."
        ]
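
To illustrate why the base-monomer annotation makes FASTA output more informative, the following minimal sketch replaces each non-canonical monomer with the canonical monomer it is generated from. This is not the BpForms implementation. The to_fasta function, the base_of mapping, and the "N" placeholder for monomers without a known canonical base are all assumptions made for illustration, and each monomer is limited to a single base for simplicity.

```python
def to_fasta(monomers, base_of, canonical, unknown="N"):
    """Map a list of monomer codes to a FASTA-like string.

    monomers  -- monomer codes in sequence order
    base_of   -- dict mapping a monomer code to its base-monomer code
    canonical -- set of canonical single-character codes
    unknown   -- placeholder for monomers with no canonical base (assumption)
    """
    letters = []
    for code in monomers:
        seen = set()
        # Follow base-monomer links until a canonical code is reached,
        # guarding against cycles in the annotations.
        while code not in canonical and code in base_of and code not in seen:
            seen.add(code)
            code = base_of[code]
        letters.append(code if code in canonical else unknown)
    return "".join(letters)


if __name__ == "__main__":
    # m2A is annotated with base-monomer "A", as in the example above.
    print(to_fasta(["A", "C", "m2A", "G"], {"m2A": "A"}, set("ACGU")))
```

Without the annotation, m2A could only be written as a placeholder; with it, the FASTA string keeps the information that the position derives from A.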
    

2.1.4. Syntax

  • Monomers that are in the alphabet are indicated by a single character or multiple characters delimited by curly brackets.
  • Monomers that are not in the alphabet are defined “inline” with one or more attributes separated by vertical pipes (“|”) inside square brackets.
    • All of the attributes are optional. However, the structure attribute is required to compute the formula, molecular weight, and charge of the biopolymer.
    • Attributes are separated by vertical pipes (“|”).
    • Attributes and their values are separated by colons (“:”).
    • White spaces are ignored.
    • The values of the id, name, synonym, and comments attributes must be enclosed in double quotes (‘"’).
    • The namespace and id of each identifier must be separated by “/”.
  • The position of each monomer in the string indicates its location in the sequence.
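
The rules for inline monomers above can be sketched as a short attribute splitter: strip the square brackets, split on vertical pipes that are outside quoted values, and split each attribute at its first colon. This is an illustrative sketch using the Python standard library, not the BpForms parser; the name parse_inline_monomer is hypothetical, and values are returned as raw text with their quotes intact.

```python
import re

# A run of characters that are not "|" or '"', or a complete quoted string.
# This keeps pipes and colons inside quoted values from acting as separators.
ATTR = re.compile(r'(?:[^|"]|"(?:[^"\\]|\\.)*")+')


def parse_inline_monomer(text):
    """Return a list of (attribute, raw value) pairs for an inline monomer.

    A list is used rather than a dict because attributes such as synonym
    and identifier may occur more than once.
    """
    text = text.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("inline monomers must be enclosed in square brackets")
    attrs = []
    for chunk in ATTR.findall(text[1:-1]):
        name, sep, value = chunk.partition(":")  # split at the first colon only
        if not sep:
            raise ValueError(f"attribute without a value: {chunk.strip()!r}")
        attrs.append((name.strip(), value.strip()))
    return attrs


if __name__ == "__main__":
    monomer = '[id: "dI" | identifier: "chebi" / "CHEBI:28997"]'
    for name, value in parse_inline_monomer(monomer):
        print(name, "=", value)
```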

2.1.5. Grammar

The following is the definition of the BpForms grammar. The grammar is defined in Lark syntax, which is based on EBNF syntax.

?start: seq
seq: monomer+
?monomer: alphabet_monomer | inline_monomer
alphabet_monomer: CHAR | DELIMITED_CHARS
inline_monomer: "[" WS* inline_monomer_attr (ATTR_SEP inline_monomer_attr)* WS* "]"
?inline_monomer_attr: id | name | synonym | identifier | structure | monomer_bond_atom | monomer_displaced_atom | left_bond_atom | right_bond_atom | left_displaced_atom | right_displaced_atom | delta_mass | delta_charge | position | base_monomer | comments
?id: "id" FIELD_SEP ESCAPED_STRING
?name: "name" FIELD_SEP ESCAPED_STRING
?synonym: "synonym" FIELD_SEP ESCAPED_STRING
?identifier: "identifier" FIELD_SEP identifier_ns IDENTIFIER_SEP identifier_id
?identifier_ns: ESCAPED_STRING
?identifier_id: ESCAPED_STRING
?structure: "structure" FIELD_SEP "\"" SMILES "\""
?monomer_bond_atom: "monomer-bond-atom" FIELD_SEP atom
?monomer_displaced_atom: "monomer-displaced-atom" FIELD_SEP atom
?left_bond_atom: "left-bond-atom" FIELD_SEP atom
?right_bond_atom: "right-bond-atom" FIELD_SEP atom
?left_displaced_atom: "left-displaced-atom" FIELD_SEP atom
?right_displaced_atom: "right-displaced-atom" FIELD_SEP atom
?atom: atom_molecule IDENTIFIER_SEP atom_element IDENTIFIER_SEP atom_position IDENTIFIER_SEP atom_charge
?atom_molecule: /(Monomer|Backbone)/
?atom_element: /[A-Z][a-z]?/
?atom_position: /[0-9]+/
?atom_charge: /\-?[0-9]+/
?delta_mass: "delta-mass" FIELD_SEP DALTON
?delta_charge: "delta-charge" FIELD_SEP CHARGE
?position: "position" FIELD_SEP START_POSITION? "-" END_POSITION?
?base_monomer: "base-monomer" FIELD_SEP ESCAPED_CHARS
?comments: "comments" FIELD_SEP ESCAPED_STRING
ATTR_SEP: WS* "|" WS*
FIELD_SEP: WS* ":" WS*
IDENTIFIER_SEP: WS* "/" WS*
CHAR: /[^\[\]\{\}"]/
CHARS: CHAR+
ESCAPED_CHARS: "\"" CHARS "\""
DELIMITED_CHARS: "{" CHARS "}"
SMILES: /[^"]+/
DALTON: /[\-\+]?[0-9]+(\.[0-9]*)?/
CHARGE: /[\-\+]?[0-9]+/
START_POSITION: INT
END_POSITION: INT
%import common.WS
%import common.ESCAPED_STRING
%import common.INT
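
As a quick check of what the DALTON and CHARGE terminals accept, the following sketch applies the same regular expressions with Python's re module. It only exercises these two patterns in isolation and does not load the grammar; the helper names are hypothetical.

```python
import re

# Patterns copied verbatim from the DALTON and CHARGE terminals above.
DALTON = re.compile(r'[\-\+]?[0-9]+(\.[0-9]*)?')
CHARGE = re.compile(r'[\-\+]?[0-9]+')


def is_valid_delta_mass(text):
    """True if text matches the DALTON terminal in full."""
    return DALTON.fullmatch(text) is not None


def is_valid_delta_charge(text):
    """True if text matches the CHARGE terminal in full."""
    return CHARGE.fullmatch(text) is not None


if __name__ == "__main__":
    for value in ["1", "-2.5", "+1.", ".5"]:
        print(repr(value), is_valid_delta_mass(value))
    for value in ["1", "-1", "1.0"]:
        print(repr(value), is_valid_delta_charge(value))
```

Under these patterns, a delta-mass may carry a sign and a fractional part but must start with at least one digit, while a delta-charge must be a signed or unsigned integer.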

2.1.6. Examples

  • DNA:

    ACGT[id: "dI" | structure: "O=C1NC=NC2=C1N=CN2" | monomer-bond-atom: Monomer / N / 10 / 0 | monomer-displaced-atom: Monomer / H / 10 / 0]AG{m2A}
    
  • RNA:

    {01G}CGU[id: "01A" | structure: "COC1C(O)C(OC1n1cnc2c1ncn(c2=N)C)CO"]AG[id: "019A" | structure: "COC1C(O)C(OC1n1cnc2c1ncn(c2=O)C)CO"]
    
  • Protein:

    ARGKL[id: "AA0318" | structure: "COC(=O)[C@@H]([NH3+])CCCC[NH3+]"]YRCG[id: "AA0567" | structure: "CC=CC(=O)NCCCC[C@@H](C=O)[NH3+]"]