A chemical formula parser for boost::spirit

Spirit is the parser generator framework of the Boost C++ libraries. As an exercise, this post presents a boost::spirit parser for chemical formulas.

The parser targets Boost version 1.40, hence Spirit Classic is used.

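The grammar, written in an EBNF-style notation (square brackets mark an optional part, * means zero or more repetitions, and integer stands for a decimal integer), is: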

lowercase_alphabetic_character ::= 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z'
uppercase_alphabetic_character ::= 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T' | 'U' | 'V' | 'W' | 'X' | 'Y' | 'Z'
atom        ::= uppercase_alphabetic_character [lowercase_alphabetic_character]
formula     ::= (atom [integer])*
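
To make the grammar concrete before turning to Spirit, here is a minimal hand-written recogniser for it. It is only a sketch and not part of the original source: the names AtomList and recognise_formula are invented for illustration, a missing count is recorded as 1, the integer is assumed to be an unsigned run of digits, and whitespace is not skipped.

```cpp
#include <string>
#include <utility>
#include <vector>

// One (symbol, count) pair per atom. Recording a missing count as 1 is an
// assumption of this sketch; the grammar itself assigns no meaning to it.
typedef std::vector<std::pair<std::string, int> > AtomList;

// Hand-written recogniser for: formula ::= (atom [integer])*
// Returns true only if the whole input matches the grammar.
bool recognise_formula(const std::string& text, AtomList& out)
{
    std::size_t i = 0;
    while (i < text.size()) {
        // atom ::= one uppercase letter, optionally followed by one lowercase letter
        if (text[i] < 'A' || text[i] > 'Z') return false;
        std::string symbol(1, text[i++]);
        if (i < text.size() && text[i] >= 'a' && text[i] <= 'z')
            symbol += text[i++];

        // optional integer: a run of decimal digits (no overflow check)
        int count = 0;
        bool has_count = false;
        while (i < text.size() && text[i] >= '0' && text[i] <= '9') {
            count = count * 10 + (text[i++] - '0');
            has_count = true;
        }
        out.push_back(std::make_pair(symbol, has_count ? count : 1));
    }
    return true;
}
```

Calling recognise_formula("C6H12O6", atoms) fills atoms with (C, 6), (H, 12) and (O, 6). An input such as "h2o" is rejected because every atom must start with an uppercase letter.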

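The grammar is converted to Spirit’s syntax as follows. In Spirit, >> denotes a sequence, a prefix ! marks an optional parser, a prefix * means zero or more repetitions, and | separates alternatives. The trailing ", space_p" is not part of the grammar: it is the skip parser, passed as the last argument of the parse() call.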

*( ( (range_p('A','Z') >> !range_p('a','z')) >> int_p ) | (range_p('A','Z') >> !range_p('a','z')) ) , space_p

Finally the semantic actions are added:

*( ( (range_p('A','Z') >> !range_p('a','z'))[&prefetch_s] >> int_p[&prefetch_n] )[&fetch]
|
(range_p('A','Z') >> !range_p('a','z'))[&fetch_s]
)
,
space_p

The source code is based on the boost example number_list.cpp and can be found here: chemical_formula_parser.cc.

The parser accepts repeated atoms and does not check whether an atom actually exists. Atom symbols may be one or two characters long, which is fine unless you need one of the elements with a three-character symbol: ununbium (Uub), ununtrium (Uut), ununquadium (Uuq), ununpentium (Uup), ununhexium (Uuh) or ununoctium (Uuo) (but honestly, who uses those in a formula anyway?).