“Beautiful” recursive descent grammars using C++ short-circuits

I am author of ‘neoleo’, a curses spreadsheet forked from the abandoned GNU oleo project.

One thing I’m doing is rewriting the parsing of spreadsheet formulas using recursive descent parsing instead of yacc/lex.

Recursive descent parsing traditionally looks like a lot of manual coding. But I think there’s a better way: using the C/C++ standard-mandated short-circuiting mechanism of && and ||, and possibly a few others.

Take one section of the parser, where I parse for a cell reference. A cell ref may either be a number, or enclosed in brackets, like so: “42” , or “[42]”. Now, if you were to write it using “code”, you would have to look at the next token, and decide if was a “[“, then look for a number, then a “]”, or else it is a number, or else it is a parse error. Here is the actual implementation of that function:

bool abs_or_rel(tokens_t& tokes, CELLREF& n, bool& relative)
{
        relative = false;
        auto set_relative = [&relative]() {relative = true; return true;} ;
        return (is_char(tokes, '[') && is_cellref(tokes, n) && is_char(tokes, ']') && set_relative() )
                || is_cellref(tokes, n)
                || parse_error();
}

Note that it looks rather like a grammar rule without much in the way of boilerplate code. The && and the || almost act as a DSL for you. They are just the standard C/C++ logical and and or operators. I haven’t defined them specially. The magic comes from the fact that C/C++ uses short-circuiting in these logical operators to follow the choices or not.

Now tokens_t is a vector of tokens. “Tokens” are pairs consisting of the token type (a number, string, etc.) followed by its actual value, in string form. Passing the value in string form is more convenient. If you need to convert it to something else, like an actual number, then you can.

So, is_char() takes the vector of tokens, and compares it against the supplied char. If there’s a match, then it pops off a token, and returns true. Otherwise it returns false. is_cellref() does a similar thing. If expects a number (although negative is allowed). If it finds one, it pops the token, sets the value for n, and returns true. Otherwise it returns false. Here are their implementations:

bool is_char(tokens_t& tokes, char c)
{

        if( tokes.front().first != c)
                return false;
        //cout << "is_char:" << c << ":true\n";
        pop(tokes);
        return true;
}

bool is_cellref(tokens_t& tokes, CELLREF& ref)
{
        bool neg = is_char(tokes, '-');

        if(tokes.front().first != NUMBER)
                return false;
        ref = stoi(tokes.front().second);
        if(neg) ref = - ref;
        pop(tokes);
        return true;
}

It is a mechanism that is quite general. I believe that you can vastly improve the readability of recursive descent parsing code by adopting this “trick”.

About mcturra2000

Computer programmer living in Scotland.
This entry was posted in Uncategorized. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s