Finite Improbability

Just programming and math, no spontaneously jumping undergarments

Archive for November, 2009

Lambda Calculus compiler, Part II: Wading in with arithmetic

Table of contents for A Lambda Calculus compiler for LLVM

  1. A compiler for Lambda Calculus to LLVM, Part 1
  2. Lambda Calculus compiler, Part II: Wading in with arithmetic
  3. Lambda Calculus Compiler: Part III: First-Order Functions

Rather than dive entirely into working on a compiler, I’m going to approach the task in pieces. This post will go over how the Haskell LLVM bindings work while creating support for computing algebraic expressions. The next post will cover first-order functions, with closures to follow.

This calculator will be a standalone program that takes in algebraic equations on the command line, compiles them into LLVM bytecode, runs that code and prints the result. At the time of this writing, the LLVM bindings cannot load functions generated at runtime in ghci, so if you’re following along you’ll have to statically compile the result.

As before, this post is literate Haskell. It needs the LLVM bindings installed and set up correctly, and uses the parser from the last post. Please note that I needed to make a couple of changes to the parser module, so if you downloaded a copy before this post existed, you should grab an update.

> module Main where

> import Parser

> import Control.Monad.State
> import Control.Monad(join)

> import Data.Int

> import LLVM.Core
> import LLVM.ExecutionEngine
> import LLVM.Util.Arithmetic
> import LLVM.Util.File(writeCodeGenModule)

> import System.IO

The Haskell LLVM bindings can be split into two levels. There’s a low level interface that doesn’t give much more than direct bindings to the C++ api. This is great as a fallback, but can be rather tedious to work with.

The high level api provides translation between Haskell and LLVM values, and heavy-duty abstractions for generating LLVM bytecode. Most of this abstraction is provided by a CodeGen monad, which keeps track of basic blocks and uses the syntactic sugar of do blocks to make your code look significantly like assembly psuedocode.

Simplifying Assumptions

In order to get very far in compiling arithmetic expressions in a reasonable amount of time, we’ll need to make a few simplifying assumptions. Some are obvious, like that we won’t be dealing with functions at this point and can unhelpfully error out when someone tries to use one. Others are slightly more subtle.

In reality, after we’ve parsed the input, we should really be doing some checks on the output to ensure it is valid. This includes things like ensuring someone doesn’t try to add together a number and a lambda expression. If you look back at the parser, you’ll notice it was carefully structured to avoid this specific problem.

An issue we aren’t going to deal with is name conflicts. A real compiler would add an identifier renaming pass that ensures all name references are unique. Since this is a budget compiler, we’re going to assume the program was careful and did that for us. It isn’t exactly hard to do this, just a bit tedious, and might be worth experimenting with as an exercise.

It also makes life a lot easier if we require that names be defined before they are used. Haskell allows code like this:

let f = g + 5
    g = 10 in
    ...

In order to compile f, we need to know what g is. The obvious answer is to come up with a way to wait on compiling f until we’ve either compiled g or compiled everything there is to compile and not seen g, but that is a lot of work. This also might be something worth doing for practice though.

Another simplifying assumption that we’re making is assuming our programmers are only interested in creating functions that add or multiply 32-bit integer values. Partly this is done to avoid work, like supporting floating point values, that I feel isn’t quite interesting enough to be worth the effort for this kind of toy project. The 32-bit part is just plain cheating so we can easily convert from a Haskell Int32 to an LLVM integer on any kind of system.

Compilation Miscellany

In order to compile our programs, we need an environment that tracks the register a value was compiled into. This can be done with a simple association list of identifier names to register values, which the prelude handily defines a few functions to work with.

The type: Value Int32 corresponds to an llvm register storing a 32-bit integer. This environment would obviously need some adjusting if we wanted to support other kinds of values. The LLVM bindings approach this through limiting Value a based on appropriate type classes and the existential types extension. That would probably be the best bet to proceed from here.

> type Env = [(String, Value Int32)]

The LLVM bindings provide add and mul instructions that take two registers (or immediate values) and produce the appropriate result. We’ll need a function to translate the operations from our AST into the equivalent LLVM functions.

> getOpt Add = add
> getOpt Mul = mul

Compiling Expressions

Our entire compilation step can be expressed with a single function. Since we will want to reference variables bound in a containing let, that function will obviously need to accept the environment as an argument. The function will walk the AST, compiling results as it goes and trusting the CodeGen monad to do all of the bookkeeping.

The result of our expression compilation will be a register, bound up in the CodeGen monad. This is about the area where my understanding of Haskell gets fuzzy, so I can’t fully explain the reasoning behind the return value of this function.

> compileExp :: Env -> Exp -> CodeGenFunction r (Value Int32)

Compiling a constant is simple, just convert the AST’s string (we already know it’s an integer, thanks to the parser) to a Haskell Int32, then pass that to valueOf for conversion to an LLVM Int32, and bind the whole thing up in our monad.

> compileExp _ (Con x)       = return $ valueOf ((fromIntegral x)::Int32)

Likewise, compiling a variable is as simple as looking up the register it was previously compiled into based on the variable name. Since we’re not trying to be useful compiler writers, we’ll just give an error and exit if we can’t find the variable name.

> compileExp e (EVar v)      = case lookup v e of
>                             Just e  -> return e
>                             Nothing -> error $ "could not find variable: " ++ v

Compiling an arithmetic expression is a simple matter of compiling each side, then performing the arithmetic on the result.

> compileExp e (EOp o l r)   = do
>                              l' <- compileExp e l
>                              r' <- compileExp e r
>                              (getOpt o) l' r'

Let bindings are the only way to introduce variable names into the environment. The body of the let needs the variable to be bound to the value. For now we’re going to assume the value is not a lambda expression, in which case we can simply compute it into a register and associate that register with the variable name when compiling the body of the let.

> compileExp e (Let v val body) = do
>                  v' <- compileExp e val
>                  let e' = ((v::String),v'):e in
>                           compileExp e' body

Just to be clear, we’ll make sure that everything else (function application and lambda expressions) gives an error.

> compileExp _ _            = error "none of that, I'm still cheating"

Our “calculator” portion is formed by wrapping the expression in an LLVM function that computes it and returns the result. Here’s a simple function that does just that.

Since this is the first time we’ve actually made a function, it warrants going over what’s happening here. The createFunction function does pretty much what you would expect. It needs an argument detailing the link-time visibility of the function. For our specific purposes the parameter is irrelevant, but it’s there if you need it.

Notice that we aren’t providing a name for the function. Even according to our bindings we’re still working with anonymous functions! Of course, if you decompile the resulting bytecode you’ll see that LLVM comes up with names like @_fun1.

Our function takes no arguments, and simply compiles the expression, stores the result into a register, and returns it. Couldn’t be easier than that. The Function type represents this. If our function had taken a single integer argument its type would look like Function (Int32 -> IO Int32). Since we’re calling out to bindings that do all kinds of IO work, it’s reasonable to expect the IO monad to show up. With this type of declaration it’s impossible to call the function we’re making without being in the IO monad or committing irredeemable sins. We’ll try to keep our functional hearts pure, at least for the rest of this module.

> compileFun :: Exp -> CodeGenModule (Function (IO Int32))
> compileFun exp = createFunction ExternalLinkage $ do
>                    t <- compileExp [] exp
>                    ret t

Now we can wrap up the process of compiling our function and binding to it in Haskell. The LLVM bindings provide simpleFunction for just this task.

> compileToFun :: Exp -> IO Int32
> compileToFun exp = join $ simpleFunction $ compileFun exp

Our main function will simply loop forever, pulling in user input and running it through something to respond to it.

> main =
>     forever $ 
>     do putStr "> "
>        hFlush stdout -- have to flush for proper output ordering
>        s <- getLine
>        go s

Here’s the heart of our calculator. It attempts to parse the user’s input (generating an error message on failure), then compiles it and runs the containing function to get the result. Finally it writes the bytecode out to Calc.bc.

> go s =
>   case runwith letexpr id s of
>   Left err  -> putStrLn $ show err
>   Right exp -> do let fun = compileToFun exp in
>                             do r <- fun
>                                putStrLn $ "result: " ++ (show r)
>                                writeCodeGenModule "Calc.bc" $ compileFun exp
>                                                          

This file can be disassembled with llvm-dis, which generates somewhat surprising output. No matter how complicated your let bindings or arithmetic expression, you’ll always get a function that simply returns the pre-computed result. LLVM has an extensive number of optimization passes, and a good selection of them are enabled for you by default in the Haskell bindings. One of these is the constant propogation pass, which looks for arithmetic that can be done at compile time and eliminates the intermediary computation.

I’ll walk through an example, to show how the magic works. We’ll start with this expression:

let f = 5; g = 2 in 4 * f + (20 + 2)

It parses into:

Let "f" (Con 5)
  (Let "g" (Con 2)
    (EOp Add
      (EOp Mul (Con 4) (EVar "f"))
      (EOp Add (Con 20) (Con 2))))

A (very) naive “hand-compilation” would give us something like the following:

define internal i32 @_fun1() {
_L1:
        %1 = $5
        %2 = $2
        %3 = mul $4 %1
        %4 = add $20 %2
        %5 = add %3 %4
        ret i32 %5
}

First off, we obviously don’t need the %1 and %2 registers, all they do is hold constants. It’s simple to push their values into the relevant expressions. This act of recognizing a constant and using it to replace the register it was being stored in is constant propagation. Once that push has happened, the %1 and %2 registers become unused, and can be dropped.

define internal i32 @_fun1() {
_L1:
        %3 = mul $4 $5
        %4 = add $20 $2
        %5 = add %3 %4
        ret i32 %5
}

Now we can look at the %3 register, which just multiplies two constants. We know how to do that too, and it’s no stretch to handle the addition on %4. This optimization, called constant folding, is closely related to constant propagation.

define internal i32 @_fun1() {
_L1:
        %3 = $20
        %4 = $22
        %5 = add %3 %4
        ret i32 %5
}

But now we’ve got more registers that are nothing but labels for constants, you know what to do.

define internal i32 @_fun1() {
_L1:
        %5 = add $20 $22
        ret i32 %5
}

Now we’re another constant fold & propagate step away from the exact output LLVM gives us, which is:

define internal i32 @_fun1() {
_L1:
        ret i32 $42
}

Tune in next time for first-order functions.

No comments

A compiler for Lambda Calculus to LLVM, Part 1

Table of contents for A Lambda Calculus compiler for LLVM

  1. A compiler for Lambda Calculus to LLVM, Part 1
  2. Lambda Calculus compiler, Part II: Wading in with arithmetic
  3. Lambda Calculus Compiler: Part III: First-Order Functions

In this series of posts, I’m going to walk through building a compiler for a simple Lambda Calculus to llvm. I’ll try to keep things to the point that nothing more than a general understanding of Haskell, Parsec, and the basic concepts of LLVM are required.

If you’d like to follow along, this post is a Literate Haskell file. That is unless Wordpress broke it, in which case let me know and I’ll fix it.

This first post concentrates on the parser. We’re going to target the abstract syntax found below. It’s a relatively simple untyped Lambda Calculus, with the conceit of adding let bindings to make life a bit more interesting. Note that we’re simplifying things considerably by limiting our types and arithmetic operations to stay within the realm of positive integers.

For educational value, I will not be using Text.ParserCombinators.Parsec.Language. Please note that, if you’re trying to implement a “real” language parser, that’s a much more sane option.

Let’s get started. We’re going to need to define a module for htis, and we obviously need some parsec libraries as well as the AST we’re translating to, so let’s take care of that.

> module Parser(letexpr, runwith, runIO,
>               Exp(EVar, Let, App, Lam, EOp, Con),
>               Op(Add, Mul),
>               Var) where

> import Data.List

> import Text.ParserCombinators.Parsec
> import Text.ParserCombinators.Parsec.Error(messageString, errorMessages)

Here’s the AST I will be using:

> data Exp  = EVar Var
>           | Let Var Exp Exp
>           | App Exp Exp
>           | Lam Var Exp
>           | EOp Op Exp Exp
>           | Con Int
>             deriving Show

> type Var  = String

> data Op = Add
>         | Mul
>           deriving Show

I’m going to start out by defining a few useful tools. It’s entirely likely that these already exist in Parsec, and furthermore that my implementations are inferior, so please let me know if you spot a problem.

Expressing “this should be between spaces” is a pain, so we’ll make something to handle that. But, we don’t want “expected space or…” in all of our errors, so this isn’t as intuitive as it would seem:

> spaced x = between spaces' spaces' x where
>     spaces' = spaces <?> ""

We’ll need a similar construct for things in parentheses, and while we’re at it a “this string should be spaced out” construct wouldn’t hurt either:

> symbol s = spaced $ string s

> parens = between (symbol "(") (symbol ")")

Here’s a few example expressions to “sketch out” what the language should look like:

The identity function:

\x. x

A sample let expression:

let id = \x. x in
id 5

Some math:

\x. x * 3 + 2

Closures

\x. (\y. x+y)

From this, we can see that there are only two keywords, “let” and “in”. We’ll define those now:

> letKW = try $ do string "let"
>                  notFollowedBy alphaNum

> inKW = try $ do string "in"
>                 notFollowedBy alphaNum

> keywords = ["in", "let"]

A “this is a keyword” test will also be useful:

> isKW c = null ([c] \\ keywords)

Pretty much entirely stolen from the code for Parsec’s token parser library.

> kwfail f = do{ x <- f
>              ; if isKW x
>                then unexpected ("reserved word '" ++ x ++ "'")
>                else return x
>              }

Variable names are pretty simple, just have to check that the identifier we’ve parsed is not a keyword.

> var = spaced $ do
>                name <- kwfail ident
>                return $ EVar name

> ident = do
>         c <- letter <|> char '_'
>         cs <- many (letter <|> char '_' <|> digit)
>         return (c:cs)

We’re going to have literal integer values, which are pretty straightforward except that we need to convert the value into an integer when putting it in the ast.

> literal = spaced $ do
>                    n <- many1 digit
>                    return $ Con ((read n)::Int)

We’re going to have two arithmetic operations, addition and multiplication. Just to be fancy, we’ll abstract out the process of recognizing an operator. The following code is a function that takes a symbol and the AST representation and generates a function that, given a left and right value, generates the AST for that operation.

> mkOp sym val = do
>      symbol sym
>      return (\l r -> EOp val l r)

Now we can trivially define operators for addition and multiplication (and anything else, when we need to):

> add = mkOp "+" Add
> mul = mkOp "*" Mul

Order of operations can be tricky here. The standard “definition” (stolen from wikipedia) looks like this:

expression ::= additive-expression
additive-expression ::= multiplicative-expression ( '+' multiplicative-expression ) *
multiplicative-expression ::= primary ( '*' primary ) *
primary ::= '(' expression ')' | NUMBER | VARIABLE

Unfortunately, it’s left recursive, which will throw Parsec into an infinite loop. Fortunately, there’s tools to deal with this, namely chainl1. Our arithmetic expression looks pretty similar to the above, once that’s taken into account:

> arithExpr    = arithTerm   `chainl1` add
> arithTerm    = arithFactor `chainl1` mul
> arithFactor  = parens aexpr <|> var <|> literal

Obviously we need a definition for an expression. Since we’re only concerned with what is useful in the context of arithmetic expressions, we’ll limit what can happen here to the things that could have an arithmetic value.

> aexpr = spaced $ do
>   n <- (try app) <|> (try $ parens app) <|> arithExpr <|> literal <|> var <|> parens aexpr
>   return n

A function application is composed of a (parenthesized) lambda expression or variable on the left and an arithmetic expression on the right.

> app = do
>   f <- (parens lam) <|> var
>   x <- (try $ parens arithExpr) <|>
>        (try $ parens app)       <|>
>        var                      <|>
>        literal
>   return $ App f x

A let expression binds a var to either a value or a lambda expression. Nested let expressions are allowed, but we have to take care to ensure that ultimately an arithmetic expression is evaluated. We’re also going to allow multiple bindings in a single let, immediately desugaring them into equivalent nested let expressions. Note that, unlike in Haskell, the order of variable naming is relevant. Fixing that is an exercise left to the reader.

Sample desugaring:

let f = 5
    g = \x. x + x in
    g f

let f = 5 in
let g = \x. x + x in
g f

First, we’ll start with the name binding. To aid in using this for our syntactic sugar, it produces a (Var, Exp) pair for the expression bound to a given name:

> binding = do
>   (EVar v) <- var
>   spaced (char '=')
>   x <- aexpr <|> lam <|> var <|> literal
>   return $ (v,x)

Desugaring the lets is as simple as taking a list of the binding results, along with the eventual body, and walking the list to nest each binding in the let expression for the binding before it. For completeness, we’ll handle the empty list case as well, even though it won’t be permitted by the parser:

> desugarLets :: [(Var,Exp)] -> Exp -> Exp
> desugarLets [] b         = b
> desugarLets ((v,x):[]) b = Let v x b
> desugarLets ((v,x):xs) b = Let v x (desugarLets xs b)

> letexpr = do
>   spaced letKW
>   xs <- sepBy1 binding (newline <|> char ';')
>   spaced inKW
>   many $ newline <|> char ';'
>   b <- letexpr <|> aexpr
>   return (desugarLets xs b)

A lambda expression starts with a backslash, introduces a variable name, and has either a let expression, a raw arithmetic expression, or another lambda expression for its body.

> lam = do
>   symbol "\\"
>   (EVar v) <- var
>   symbol "."
>   x <- letexpr <|> (try aexpr) <|> (try app) <|> lam <|>
>        parens lam
>   return $ Lam v x

We’ll need this function later on to ease the process of running functions over the parser’s output.

> runwith :: Parser a -> (a -> b) -> String -> Either ParseError b
> runwith p conv input
>         = case (parse p "" input) of
>                 Left err -> Left err
>                 Right x  -> Right $ conv x

Finally, we have a simple function to run a parser and print the output.

> runIO :: Show a => Parser a -> String -> IO ()
> runIO p input
>         = case (parse p "" input) of
>             Left err -> do{ putStr "parse error at "
>                           ; print err
>                           }
>             Right x  -> print x

Next time, we’ll start working on an llvm-based simple calculator.

2 comments