Expressive C++: Introduction

This entry is part of a series, Expressive C++»

Hidden in C++ lurks another language—countless other languages, in fact—all of which are better than C++ at solving some types of problems. These domain-specific languages, be they languages for linear algebra, relational query or what-have-you, may only do one thing, but they do it well. We can create and use these other languages directly within C++ itself, using the power and flexibility of C++ to cut away the general-purpose parts of C++ we don’t want and replacing them with the domain-specific parts we do. In this series of articles, we’ll be taking a close look at domain-specific languages, what they’re good for, and how they can be easily implemented in C++ (with emphasis on easily) with the help of Boost.Proto.

Domain-Specific Languages

Back in 1958, John Backus was working on a vexing problem: how to precisely nail down the syntax of the ALGOL programming language. What he came up with was a system for describing the syntax of any programming langauge. That system, much expanded over the years, is now known as EBNF, and it turns up in various guises wherever a computer language needs to be defined: official specifications, compiler-construction toolkits, wherever.

EBNF’s ubiquity is a testament to its simple, elegant power. With just a few lines we can, for instance, describe the syntax, precedence, and associativity of the infix arithmetic operators:

expression = term ( ( ‘+’ term ) | ( ‘-’ term ) )* ;
term = factor ( ( ‘*’ factor ) | ( ‘/’ factor ) )* ;
factor = UINT | ‘(‘ expression ‘)’ | ‘-’ factor ;

EBNF is a prime example of a domain-specific language; just the ticket for anyone dealing with syntax and parsing, but pretty useless for everybody else. The fact that it’s highly specialized is part of its appeal. It lets you deal directly with the concepts relevant to the task at hand without distracting you with superflous fluff. There’s a downside though: if you need to do something that falls just outside of EBNF’s purview, then EBNF is useless. This is the bugaboo of all domain-specific languages: you often need a back-door so you can escape the confines of your domain when you have to.

In contrast, C++ is a general-purpose programming langauge. Anything that can be done at all can be done with C++. That power and flexibility comes with a cost: there are many things for which C++ isn’t particularly great out of the box. Compiler construction comes to mind; C++ doesn’t directly support anything like EBNF. C++’s answer is to be great at defining libraries. The onus then is on library writers to provide great domain-specific programming experiences. And by putting the domain-specific knowledge in a C++ library, users don’t fall off a cliff when they need to do something just outside of that library’s scope; they can always fall back on good ol’ C++.

A Language in a Library

But what’s the best way to provide C++ library users with the domain-level notation and concepts with which they’re already familiar? “Modern” C++ library writers have come up with a creative solution: use C++ expressions to approximate familiar domain-specific notation. Because they look like little languages built from parts of a bigger language, these libraries are called domain-specific embedded languages, or DSELs. If it’s done right, the results can really be quite impressive.

Take, for instance, the Boost.Spirit library, a parser generator library that gives the user a EBNF-like language with which to define parsers. Here is the Spirit equivalent to the above EBNF grammar:

/////////////////////////////////////////////////////////////////////
//  A Spirit grammar that recognizes infix calculator expressions
/////////////////////////////////////////////////////////////////////
 
// boost::spirit::qi is where Boost.Spirit's parser generator lives.
boost::spirit::qi::rule<Iterator> expression, term, factor;
 
// An expression is a term followed by zero or more
// other terms separated by '+' or '-'
/* The EBNF equivalent is:
expression = term     ( ( '+'    term ) | ( '-'    term ) )* ; */
expression = term >> *( ( '+' >> term ) | ( '-' >> term ) )  ;
 
// A term is a factor followed by zero or more
// other factors separated by '*' or '/'
/* The EBNF equivalent is:
term = factor     ( ( '*'    factor ) | ( '/'    factor ) )* ; */
term = factor >> *( ( '*' >> factor ) | ( '/' >> factor ) )  ;
 
// A factor is an unsigned integer or
// a parenthesized expression or
// a negated factor
using boost::spirit::qi::uint_;
/* The EBNF equivalent is:
factor = UINT  | '('    expression    ')' | '-'    factor ; */
factor = uint_ | '(' >> expression >> ')' | '-' >> factor ;

If you squint a bit, you can see how the Spirit grammar corresponds to the EBNF. The correspondence isn’t perfect, but it captures the flavor.

A rule in Spirit is the C++ equivalent of a grammar rule in EBNF; the operators >> and | and * have all been overloaded and are taken by Spirit to mean grammar sequencing, alternation and repetition, respectively. How does that work? The obvious (and sub-optimal) way would be for Spirit to make the right-shift operator do “right-shifting” and eagerly return a rule object. Instead, Spirit makes the right-shift operator return some dummy right-shift temporary object that simply holds on to its left and right operands. Likewise for all the other operators. The result is that an expression like:

factor >> *( ( '*' >> factor ) | ( '/' >> factor ) )

captures the whole expression—types, structure, values, and all—in an expression tree. When that tree gets assigned to a rule object, Spirit walks the tree and interprets it as a grammar rule. When all is said and done, the variable expression above holds a parser we can use to validate infix arithmetic expressions. Nice.

This is the general approach taken by expression-based DSELs: use operator overloading to build trees that capture the expression. Only after the whole expression is captured do you actually do any work. At that point, the tree can be analyzed, optimized, transformed and evaluated in domain-specific ways.

Spirit is a so-called compiler-construction toolkit like yacc. In addition to EBNF grammars, it provides other abstractions that compiler-construction experts have come to expect, like:

  • Abstract syntax trees: a standard tree representation of a parsed program,
  • Semantic actions1: bits of executable code you can attach to grammar rules that make your parser do stuff, and
  • A host of compiler algorithms that operate on ASTs.

… all of which can be expressed directly in your C++ program as expressions. These basic abstractions—grammars, ASTs, semantic actions, and compiler algorithms—all become important later in a different context. Read on.

This is the essence of it: by rigging things so that domain-level concepts can be expressed directly and concisely with C++ expressions, we can make users of our library comfortable and productive.

What’s more, we can often do it while realizing huge wins in performance. With full access to a complete expression tree, a domain-specific library can analyze the tree and come up with optimal evaluation strategies. A good example is Blitz++, a matrix crunching library that bested FORTRAN, something that wasn’t thought possible at the time.2 It achieves its remarkable performance by analyzing a complete expression tree and applying complex, domain-specific transformations like loop fusion. That’s only possible because Blitz++ has access to the complete expression tree.

Compiler Construction Toolkits

The term is “domain-specific embedded language“, but you could be forgiven for thinking of Spirit quite simply as a “library”. Imagine what would change if we all really believed to our core that DSELs actually were languages. Follow this logic: the libraries that implement DSELs would be like compilers, and writing such a library would be like writing a compiler. And as we’ve been discussing, writing compilers is a common enough task that tools like yacc and Spirit exist to make the job easier. It follows that there should be a tool for DSEL writers that provides the sorts of abstractions compiler writers care about: grammars, ASTs, semantic actions, and compiler algorithms. What is needed is a compiler-construction toolkit for DSELs. Boost.Proto is that toolkit.

An analogy with Spirit is apt since Spirit and Proto do pretty much the same thing, only at different times. Spirit generates programs that parse text at runtime. Proto generates programs that parse C++ expression trees at compile time.3 Your run-of-the-mill compiler might be implemented with Spirit. In turn, Spirit—an expression-based DSEL—might be implemented with Proto, and in fact is.

I haven’t discussed how libraries like Spirit are typically implemented today, but suffice it to say that it involves a fair bit of template metaprogramming, a notoriously difficult task. Proto lets us write the Spirit library in terms of EBNF-like grammars and semantic actions, not the impenetrable and brittle template spaghetti code that is so often associated with template metaprograms. This would be a big step forward:

  1. DSELs would be easier to develop and maintain;
  2. DSELs would be rigorously and robustly defined, leading to friendlier interfaces and better error messages; and most importantly,
  3. Code transformations that were prohibitively complex before would now be tractable.

The potenitial for complex code transformation opens endless possibilities. Can you see the possibilities yet? With such a toolkit, we could rip apart expressions, analyze the parts, and reassemble them to, say, do loop fusion for matrix calculations, or define our own syntax for transactions, or interpret an expression as a relational query, or … ??!!

Hello, Boost.Proto

Boost.Proto has been in Boost since version 1.37. Since it’s a compiler-construction toolkit in the tradition of yacc and Spirit, Proto can build an AST for you. In Proto it’s as simple as this:

#include <iostream>
#include <boost/proto/proto.hpp>
namespace proto = boost::proto;
 
int main()
{
    // Create a Proto terminal that wraps a string.
    // Let's be cheeky and call it "cout".
    proto::terminal< char const * >::type cout = { "cout" };
 
    // Create an expression tree and pass it to display_expr
    // for pretty-printing.
    proto::display_expr(
        cout << "hello" << ' ' << "proto!"
    );
}

The argument to proto::display_expr isn’t an output expression, although we cheekily made it look like one; instead, the expression builds a tree that represents the expression. What does it mean to left-shift strings and characters? It doesn’t matter. It’s just a tree that can be evaluated later, any number of times and any number of ways. The << operator in the expression is provided by Proto, along with every other operator. Building an AST with Proto is simple. You don’t have to define a single template or implement a single operator overload.

The result of running the above program is below:

shift_left(
    shift_left(
        shift_left(
            terminal(cout)
          , terminal(hello)
        )
      , terminal( )
    )
  , terminal(proto!)
)

You can clearly see that Proto built a tree for us whose structure corresponds to the precedence and associativity of the C++ operators.

What’s To Come

As you might suspect, Proto comes with tools for defining grammars that match expressions, and semantic actions that you can attach to grammar rules that breathe life into your DSEL. I have so much more to say, but I’ll leave the rest for later. In future articles, I’ll describe Proto in more detail, build some simple but useful DSELs and work slowly toward a radical goal: the design of a DSEL—to be released shortly—called Boost.Phoenix3, a library that blurs the line between C++ code and data, promising to bring the power of Lisp macros to C++ programmers. By the time the series is done, you’ll be like Neo in the Matrix, seeing your own code as data that you can bend to your will.


  1. From the Wikipedia entry Compiler-compiler: “A typical parser generator associates executable code with each of the rules of the grammar that should be executed when these rules are applied by the parser. These pieces of code are sometimes referred to as semantic action routines since they define the semantics of the syntactic structure that is analysed by the parser. Depending upon the type of parser that should be generated, these routines may construct a parse tree (or AST), or generate executable code directly.” 

  2. http://www.oonumerics.org/blitz/benchmarks/ 

  3. It is a curious property of strongly-typed expression trees that they have rich structure both in their types and in their values. Any effort to manipulate them means that two parallel computations must occur: one on types and another on values. This is an extra complication that a good DSEL toolkit should handle effortlessly. 

Posted Monday, August 30th, 2010 under Boost.

33 Responses to “Expressive C++: Introduction”

  1. Nicholas Makin says:

    The one problem I see with these articles (which are fantastic) is the errors in the syntax hilighting that result in includes going missing. It becomes a game of: “I know two header files are used… which two could they be?”

      Quote
    • Eric Niebler says:

      You’re right. The site recently changed hosts, and some wordpress config didn’t survive the move, leading to the problems with markup and other things that you may have noticed. The right people know and will fix it when they get a chance. Sorry for the trouble.

        Quote
      • Nicholas Makin says:

        Well I guessed wrong! I guessed it was the std string header and made the terminal of type std::string.

          Quote
  2. Rob says:

    Great article, Eric.

    I noticed these typos:

    • “remarkable performance by analyzing”
    • “endless possibilities”
      Quote
  3. David Sankel says:

    From the article:

    [C++ DSELs] are very good at getting you 90% of the way to your ideal syntax

    I think its worth pointing out that the more important part of a DSEL is that it approaches 100% conformity to the domain itself (the D in DSEL). Once that has been achieved, the surface syntax is of minimal importance.

    For example, the following code,

    factor = uint_ | '(' >> expression >> ')' | '-' >> factor ;

    is just as readable as,

    factor = alternatives( uint_
                         , sequence( '(', expression, ')' )
                         , sequence( '-', factor )
                         );

    for those familiar with the EBNF domain.

      Quote
    • Eric Niebler says:

      EBNF isn’t a domain. It’s a meta-syntax. There’s even a standard (though nobody uses it AFAICT). My point is: there is an established syntax with a history stretching back 50 years or more. The “ideal syntax” I refer to is precisely the syntax that domain experts are accustomed to. Are domain experts smart enough to use another? Sure, but why should they have to?

      We could disagree about which is more important to get right in a DSEL: syntax or concepts (semantics)? But a good DSEL should strive for both.

        Quote
      • David Sankel says:
        EBNF isn’t a domain. It’s a meta-syntax.

        Any language can be used as a semantic domain. A semantic domain is simply the codomain (a bit counter-intuitive) of the meaning function that maps an abstract syntax tree to its meaning. For those interested in learning more about denotational semantics, I highly recommend this book or this accessible blog post.

        There’s even a standard (though nobody uses it AFAICT). My point is: there is an established syntax with a history stretching back 50 years or more. The “ideal syntax” I refer to is precisely the syntax that domain experts are accustomed to. Are domain experts smart enough to use another? Sure, but why should they have to?

        A better question would be “What features can we offer the domain expert that would make them eager to give up a DSEL that mimics the surface syntax of the language they’re familiar with?”. I’d say:

        • Language Consistency. ‘*’, which represents zero or more, works well on rules *(rule), but fails on literals *'a'. If we used normal function syntax as our surface syntax, zeroOrMore, we gain consistency in this and many other cases.

        • Avoidance of counter-intuitive infix precedence. cout << 12+13 works well while cout << 12 != 13 results in a compiler error. This doesn’t match the intuition of the stream domain expert. Using a different surface syntax again resolves this situation.

        • Composability. This is one of the most exciting features of DSELs and our domain expert will certainly want to have it. By stepping away from surface syntax mimicry, we can offer them a language that integrates seamlessly with both C++ and other DSELs. A lambda DSEL combined with EBNF is a powerful example and could reduce the redundancy in your example above.

        We could disagree about which is more important to get right in a DSEL: syntax or concepts (semantics)? But a good DSEL should strive for both.

        A good DSEL should strive for a quality syntax (quality not measured by syntactic mimicry), yet that is secondary to correct coverage of an appropriate semantic domain (this paper goes into this a bit).

          Quote
        • Eric Niebler says:

          These are valuable insights, David. Yes, much goes into choosing the best DSEL syntax, and getting as close as possible to the accepted notation for the domain is only one of many considerations. There are cases where the C++ operators have the wrong precedence or associativity for what you’d like to do with them. (For instance, in the ISO EBNF standard, the comma is used for sequencing, but in C++ the comma has such low precedence that it would be a horrible choice.) These are issues I’ll be discussing in the 2nd installment.

          I do not, however, go as far as you do and eschew infix notation entirely. I still believe DSELs can make good use of the operators, even when the user needs to sometimes help the compiler along.

          And incidentally, there is nothing about using infix notation that inhibits DSEL composability. Spirit uses Phoenix lambdas for semantic actions, and both Spirit and Phoenix make liberal use of infix notation.

            Quote
          • David Sankel says:
            I do not, however, go as far as you do and eschew infix notation entirely.

            I never said that I’m against all infix notation. I think they can aid clarity in some cases. The best example I can think of is the function composition operator in semantic editor combinators.

            I still believe DSELs can make good use of the operators, even when the user needs to sometimes help the compiler along. And incidentally, there is nothing about using infix notation that inhibits DSEL composability.

            I disagree. Higher order functions cannot take operators as parameters. You could argue that lambda can mitigate this. However consider the following function:

            auto sequence = _1 >> _2;

            What is the type of sequence? It is not a function that takes in two EBNF expressions and returns their sequence. It is overgeneralized in a non-meaningful way (if these operators had a proof obligation, as in Haskell, the generalization would have some sort of meaning).

            So, when I use infix operators, I always define them in terms of a type-correct callable. This way I get the composability, infix convenience, and a simple alternative when infix fails.

              Quote
          • Eric Niebler says:

            What are you calling overgeneralized? The fact that the lambda _1 >> _2 always does right-shifting, in a context that depends on its arguments? std::plus can be used to add integers and concatenate strings. Is that overgeneralized? As a C++ programmer, I find that unsurprising and actually quite useful.

            I understand what you’re saying, though. You want a named binary function object that only accepts two Spirit parsers and places them in sequence. You can pass this to a higher order function, or implement operator<< in terms of it. <nod> That would be nice.

            For the record, all of Proto’s operator overloads have named function object equivalents. See proto::functional::make_expr. And since Spirit’s >> operator is really just Proto’s, Spirit too has a named function object that does sequencing. It’s type is proto::functional::make_expr< proto::tag::shift_right >.

              Quote
          • David Sankel says:
            What are you calling overgeneralized? The fact that the lambda _1 >> _2 always does right-shifting, in a context that depends on its arguments?

            The intent of the programmer, in the sequence = case, is to define a function that takes two spirit parsers and places them in sequence. Lets call this s₀ which has type (Sp,Sp) → Sp where Sp is the type of spirit parsers.

            A different, but similar function can be defined:

            auto s₁ = _1 >> _2;

            s₁ is a generalization of s₀ because its domain is larger and it agrees with s₀ for all inputs in the domain of s₀ (for all a,b ∈ Sp, s₀(a,b) = x ⇒ s₁(a,b) = x).

            The reason I call s₁ an over-generalization is because the programmer never intended sequence(12,4) to be well defined.

            I also said that a charactaristic of s₁ is that it lacks meaning. By reading the implementation of s₁, we cannot say anything about what it actually does for an arbitrary input that type checks. This doesn’t have to be the case, but I’ll defer explaining that to a future blog post on polymorphic classes.

            std::plus can be used to add integers and concatenate strings. Is that overgeneralized? As a C++ programmer, I find that unsurprising and actually quite useful.

            If the authors used the word plus to refer to the character ‘+’, then no it is not over-generalized, although that function doesn’t have nice semantics. If the authors used the word plus to refer to the addition operation of a ring then yes, it is over-generalized.

            For the record, all of Proto’s operator overloads have named function object equivalents. See proto::functional::make_expr. And since Spirit’s >> operator is really just Proto’s, Spirit too has a named function object that does sequencing. It’s type is proto::functional::make_expr< proto::tag::shift_right >.   

            Great! This helps a bit.

            auto s₂ = proto::functional::make_expr< proto::tag::shift_right >;

            So now we have s₁ which generalizes s₂ which generalizes s₀.

            s₂ is still overgeneralized, but not as badly as s₁. s₂ has more meaning as well, although the meaning is strictly an EDSL implementation detail.

              Quote
  4. litb says:

    I’m looking forward to the next installment of your articles. I always wanted to check into boost.proto, but never had time and motivation to do so.

    This now is an interesting series I hope, so finally maybe I’m going to be proto-typed! Ahaha

      Quote
  5. Vasilis Vasaitis says:

    Great article, thanks!

      Quote
  6. DI says:

    Another one with another approach is the PEGTL at http://code.google.com/p/pegtl

      Quote
    • Eric Niebler says:

      PEGTL, from what I can tell, has little to do with Boost.Proto. It’s an alternative to Boost.Spirit, which is an example of a DSEL. But Proto (which is what this article is really about) is a tool for building DSELs. It’s one level up from DSELs like Spirit and PEGTL.

        Quote
  7. LPR says:

    Boost.Proto + machine code generating back-end = clean, maintainable emulator core? daydreams

      Quote
    • Eric Niebler says:

      What kind of emulator are you daydreaming about?

        Quote
      • LPR says:

        CPU emulation. Specifically the binary translation stage.

        asm += add_[eax, imm32(var)]; this->block = x86_emitter.eval(asm);

          Quote
        • Eric Niebler says:

          Well, you wouldn’t want the invocation to eval and the assignment to asm be in different statements. Something like this might work:

          this->block =
              asm[
                  add_[eax, imm32(var)],
                  ...
              ]();

          Multiple asm statements within the asm block would be comma-separated. I don’t know what this->block is or the assignment to it is for, though.

            Quote
          • LPR says:

            As you can see, I don’t know anything about Proto. Anyway, the idea was that eval would return (but not execute) a block of x86 machine code.

              Quote
          • Eric Niebler says:

            And I’ll admit I know nothing about inline assembly and how it interacts with function inlining. If a function with inline asm can’t be inlined, then Proto won’t be able to help. But I honestly don’t know.

              Quote
          • LPR says:

            The main thing to consider is that an arbitrary sequence of opcodes must composed, at runtime, to form the basic block. If each translated opcode is a function, then the basic block becomes a sequence of function pointers, one pointer per opcode. I’m not sure how any compiler could then guarantee that a basic block would form a single unit of execution as is the case when a basic block is nothing more than a sequence of raw bytes (machine code).

              Quote
          • Eric Niebler says:

            AFAICT, the opcodes don’t need to be composed at runtime. They can be composed at compile time. Proto can do that by composing the larger asm block out of function (object)s with inline asm. That’s no problem. The only question is whether such a composite can be inlined by the compiler leaving only the inline asm and none of the function calls. That’s the part I don’t know. It would be pretty easy to test that, though.

              Quote
          • LPR says:

            Sorry, I’m just not seeing it.

            Link for definition of “Basic block”: http://en.wikipedia.org/wiki/Basic_block

            It is the basic block which defines a unit of execution. Basic blocks are built at runtime from the input opcode sequence. What you describe would be better suited to a simple fetch/decode/execute loop.

              Quote
          • Eric Niebler says:

            I read the link, but I still don’t see the need to build an opcode sequence at runtime. When you have something of the form:

            this->block =
                asm[
                    add_[eax, imm32(var)],
                    ...
                ];

            then the opcodes (e.g. add_) are known at compile time. If you assign the block to a block variable, then the block can be stored in type-erased form, but the code associated with the block itself is generated entirely at compile time. (BTW, when you say “opcode” do you really mean the code needs to be interpreted at runtime, or are you going for portable inline assembly? The former is easy, the latter depends on the ability of the compiler to inline functions with inline asm blocks, as I’ve said.)

            Hopefully, this will all be more clear once I publish future articles. In fact, if I could understand your requirements a bit, this might make a really interesting example for an article. Do you have code that already does what you want? Could you publish it?

              Quote
  8. I’m just throwing this in a comment because frankly, I’m not sure whether it really fits in this article (series): it’s usually important to point out, when introducing DS(E)Ls, their declarative (vs. procedural) nature and how that impacts readability, maintainability, and correctness.

      Quote
    • Eric Niebler says:

      I considered mentioning that—even wrote it out—but decided against it because there’s nothing inherently declarative about DSELs. If the language you’re embedding is procedural (e.g. Boost.Lambda, Boost.Phoenix), then your DSEL is procedural. Likewise, the benefits of a DSEL could be enhanced correctness but it isn’t necessarily so; I can write bugs in Phoenix just as effectively as I can with straight C++. They can help readability for people who are already familiar with the domain’s notation and concepts, but hurt readability for others. People often gripe about the readability of regexes, but you can pry them out of my cold dead hands.

      The one consistent property I can discern in DSELs is that they all aim to provide notation and concepts that are well-suited to a particular problem domain.

        Quote
  9. df says:

    Looks awesome, I can’t wait for the sequel :P

      Quote
  10. Joel Falcou says:

    Great piece of writings, I like the historical introduction :) Can’t wait for the rest of the series ;)

      Quote
  11. Ahmed Charles says:

    I was wondering, where did the ‘std::’ come from in ‘terminal(std::cout)’? Or is there more magic going on?

      Quote

Leave a Comment (post replies using links below individual comments)