3.1. BpForms grammar

The BpForms grammar unambiguously represents the primary structure of biopolymer forms that contain canonical and non-canonical monomeric forms using (a) a syntax similar to IUPAC/IUBMB and (b) extended alphabets for DNA, RNA, and proteins to describe monomeric forms.

BpForms.org contains a detailed overview of the grammar and examples.

3.1.1. Grammar

The following is the definition of the BpForms grammar. The grammar is defined in Lark syntax which is based on EBNF syntax.

start: seq (attr_sep global_attr)* WS?
seq: (WS|monomer)((WS? knick)? (WS|monomer))*
?monomer: alphabet_monomer | inline_monomer
alphabet_monomer: CHAR | DELIMITED_CHARS
inline_monomer: OPEN_SQ_BRACKET WS? inline_monomer_attr (attr_sep inline_monomer_attr)* WS? CLOSE_SQ_BRACKET
?inline_monomer_attr: id | name | synonym | identifier | structure | backbone_bond_atom | backbone_displaced_atom | r_bond_atom | l_bond_atom | r_displaced_atom | l_displaced_atom | delta_mass | delta_charge | position | base_monomer | comments
id: ID_FIELD_NAME field_sep ESCAPED_STRING
name: NAME_FIELD_NAME field_sep ESCAPED_STRING
synonym: SYNONYM_FIELD_NAME field_sep ESCAPED_STRING
identifier: IDENTIFIER_FIELD_NAME field_sep identifier_id identifier_sep identifier_ns
?identifier_ns: ESCAPED_STRING
?identifier_id: ESCAPED_STRING
structure: STRUCTURE_FIELD_NAME field_sep QUOTE_DELIMITER SMILES QUOTE_DELIMITER
backbone_bond_atom: BACKBONE_BOND_ATOM_FIELD_NAME field_sep atom
backbone_displaced_atom: BACKBONE_DISPLACED_ATOM_FIELD_NAME field_sep atom
r_bond_atom: RIGHT_BOND_ATOM_FIELD_NAME field_sep atom
l_bond_atom: LEFT_BOND_ATOM_FIELD_NAME field_sep atom
r_displaced_atom: RIGHT_DISPLACED_ATOM_FIELD_NAME field_sep atom
l_displaced_atom: LEFT_DISPLACED_ATOM_FIELD_NAME field_sep atom
atom: atom_element INT atom_charge?
?atom_element: /[A-Z][a-z]?/
?atom_charge: /[\+\-][0-9]+/
delta_mass: DELTA_MASS_FIELD_NAME field_sep DALTON
delta_charge: DELTA_CHARGE_FIELD_NAME field_sep CHARGE
position: POSITION_FIELD_NAME field_sep INT? DASH INT? monomers_position?
monomers_position: WS? OPEN_SQ_BRACKET (CHAR|CHARS) (WS? "|" WS? (CHAR|CHARS))* CLOSE_SQ_BRACKET
base_monomer: BASE_MONOMER_FIELD_NAME field_sep QUOTE_DELIMITER (CHAR|CHARS) QUOTE_DELIMITER
comments: COMMENTS_FIELD_NAME field_sep ESCAPED_STRING
?global_attr: crosslink | circular

crosslink: CROSSLINK_FIELD_NAME field_sep OPEN_SQ_BRACKET WS? (onto_crosslink | inline_crosslink) WS? CLOSE_SQ_BRACKET

onto_crosslink: onto_crosslink_attr (attr_sep onto_crosslink_attr)*
onto_crosslink_attr: onto_crosslink_type | onto_crosslink_monomer
onto_crosslink_monomer: onto_crosslink_monomer_type field_sep INT
onto_crosslink_monomer_type: /[lr]/
onto_crosslink_type: "type" field_sep QUOTE_DELIMITER CHARS QUOTE_DELIMITER

inline_crosslink: (inline_crosslink_attr (attr_sep inline_crosslink_attr)*)?
inline_crosslink_attr: inline_crosslink_atom | inline_crosslink_comments
inline_crosslink_atom: inline_crosslink_atom_type field_sep INT atom_element INT atom_charge?
inline_crosslink_atom_type: (LEFT_BOND_ATOM_FIELD_NAME | RIGHT_BOND_ATOM_FIELD_NAME | LEFT_DISPLACED_ATOM_FIELD_NAME | RIGHT_DISPLACED_ATOM_FIELD_NAME)
inline_crosslink_comments: COMMENTS_FIELD_NAME field_sep ESCAPED_STRING

knick: ":"

circular: "circular"

ID_FIELD_NAME: "id"
NAME_FIELD_NAME: "name"
SYNONYM_FIELD_NAME: "synonym"
IDENTIFIER_FIELD_NAME: "identifier"
STRUCTURE_FIELD_NAME: "structure"
BACKBONE_BOND_ATOM_FIELD_NAME: "backbone-bond-atom"
BACKBONE_DISPLACED_ATOM_FIELD_NAME: "backbone-displaced-atom"
LEFT_BOND_ATOM_FIELD_NAME: "l-bond-atom"
LEFT_DISPLACED_ATOM_FIELD_NAME: "l-displaced-atom"
RIGHT_BOND_ATOM_FIELD_NAME: "r-bond-atom"
RIGHT_DISPLACED_ATOM_FIELD_NAME: "r-displaced-atom"
DELTA_MASS_FIELD_NAME: "delta-mass"
DELTA_CHARGE_FIELD_NAME: "delta-charge"
POSITION_FIELD_NAME: "position"
BASE_MONOMER_FIELD_NAME: "base-monomer"
COMMENTS_FIELD_NAME: "comments"
CROSSLINK_FIELD_NAME: "x-link"

?attr_sep: WS? "|" WS?
?field_sep: WS? ":" WS?
?identifier_sep: WS? "@" WS?

CHAR: /[^\[\]\{\}":\| \t\f\r\n]/
CHARS: CHAR /[^\[\]\{\}":\|\t\f\r\n]*/ CHAR
DELIMITED_CHARS: "{" (CHAR|CHARS) "}"
SMILES: /[^"]+/
DALTON: /[\-\+]?[0-9]+(\.[0-9]*)?/
CHARGE: /[\-\+]?[0-9]+/
OPEN_SQ_BRACKET: "["
CLOSE_SQ_BRACKET: "]"
QUOTE_DELIMITER: "\""
DASH: "-"
WS: /[ \t\f\r\n]+/
INT: /[0-9]+/
ESCAPED_STRING: /"(?:[^"\\]|\\.)*"/