Lisp on LLVM

Status

This is a work-in-progress. Right now the language doesn’t have cons so you can’t really call it a lisp.

Lisp

I need to get my head around LLVM, so I’m following the Kaleidoscope tutorial and writing a language. I thought it might be a good idea to write a lisp instead of a python-like language as I follow along, so I can engage with LLVM and the docs rather than just copy code out of a tutorial. I have no idea how to write a lisp though, so I’m also reading Peter Norvig’s (How to Write a (Lisp) Interpreter (in Python)) and translating it to C++ and LLVM concepts.

The tokenizer

C++ doesn’t really have nice string facilities, so I can’t just split the string in one line of code¹. Thankfully, lisp is simple and I can iterate over my characters. Special tokens are (, and ), and then anything separated by parens or whitespace is also a token. Simple!

// tokenizer ------------------------------------------------------

std::vector<std::string> tokenize(const std::string &chars) {
  std::vector<std::string> tokens;
  std::string symbol = "";

  // I got sick of writing this over and over
  // It just adds the current token to our token list
  auto l = [&]() {
    if (symbol.length() > 0) {
      tokens.push_back(symbol);
      symbol = "";
    }
  };

  for (auto it = chars.begin(); it != chars.end(); ++it) {
    if (*it == '(') {
      l();
      tokens.push_back("(");
    } else if (*it == ')') {
      l();
      tokens.push_back(")");
    } else if (*it == ' ' || *it == '\n' || *it == '\r' || *it == '\t') {
      l();
    } else {
      symbol.push_back(*it);
    }
  }
  l();

  return tokens;
}

The AST

I’m representing the AST as the Kaleidoscope tutorial does, by using an abstract class.

class ASTVisitor;

class ASTNode {
public:
  virtual ~ASTNode()=0;
  virtual void accept(ASTVisitor&)=0;
};
ASTNode::~ASTNode(){}

I’m using the Visitor pattern for operations on the nodes. The first visitor I wrote was a printer for debugging purposes. Right now my nodes are either symbols, floating point numbers, or sexps. Most of the following code is just noise, the key things to note are a SymbolNode has a text attribute, an FPNode has a double attribute, and a SexpNode just has ASTNode children.

class SymbolNode;
class FPNode;
class SexpNode;

class ASTVisitor {
public:
  virtual ~ASTVisitor()=0;
  virtual void visit (SymbolNode &node) = 0;
  virtual void visit (FPNode &node) = 0;
  virtual void visit (SexpNode &node) = 0;
};
ASTVisitor::~ASTVisitor(){}

class SymbolNode : public ASTNode {
public:
  std::string text;

  SymbolNode(const std::string &text) {
    this->text = text;
  }

  ~SymbolNode()=default;

  void accept(ASTVisitor& visitor) {
    visitor.visit(*this);
  }
};

class FPNode : public ASTNode {
public:
  double val;

  FPNode(const double val) {
    this->val = val;
  }

  ~FPNode()=default;

  void accept(ASTVisitor& visitor) {
    visitor.visit(*this);
  }
};

class SexpNode : public ASTNode {
public:
  std::vector<std::unique_ptr<ASTNode> > forms;
  SexpNode()=default;
  ~SexpNode()=default;

  void addForm(std::unique_ptr<ASTNode> form) {
    forms.push_back(std::move(form));
  }

  void accept(ASTVisitor& visitor) {
    visitor.visit(*this);
  }
};

Parser

Okay, we have a tokenizer and an AST, now we need to parse. I’ve copied this almost straight from the Norvig page, but as C++. The original code was popping tokens from the front of the token list, which is not something I really want to do with std::vector, and using std::queue would require reversing my list², so I’m just passing an iterator around that points at the current token. Here’s the code that produces a SymbolNode or an FPNode by trying to convert the string:

std::unique_ptr<ASTNode> atom(const std::string &token) {
  try {
    double f = std::stof(token);
    return std::make_unique<FPNode>(f);
  } catch (const std::invalid_argument &e) {
    return std::make_unique<SymbolNode>(token);
  }
}

And the parser:

std::unique_ptr<ASTNode> read_from_tokens(const std::vector<std::string> &tokens,
                                          std::vector<std::string>::iterator &it) {
  if (it == tokens.end()) {
    throw std::runtime_error("Syntax error: unexpected EOF!");
  }

  std::string token = *it;
  ++it;

  if (token.compare("(") == 0) {
    std::unique_ptr<SexpNode> sexp = std::make_unique<SexpNode>();
    while (it->compare(")") != 0) {
      sexp->addForm(read_from_tokens(tokens, it));
    }
    ++it; // pop ")"
    return sexp;
  } else if (token.compare(")") == 0) {
    throw std::runtime_error("Syntax error: unexpected )");
  } else {
    return atom(token);
  }
}

I know the iterator thing is a bad hack that makes it hard to reason about the state of the parser, but it works. As always, this code could do with more useful error messages, maybe highlighted syntax errors or line numbers or something?

For completeness, here is the parse function:

std::unique_ptr<ASTNode> parse(const std::string &program) {
  std::vector<std::string> tokens = tokenize(program);
  std::vector<std::string>::iterator it = tokens.begin();
  return read_from_tokens(tokens, it);
}

TODO Code Generation

This is harder than I thought it would be, and for now I’m ignoring first class symbols and lists (cons, car, cdr) until I know how to work with memory, data structures, and external functions in LLVM. I have basic function definitions and calls working. The only data type is double and the only function I have implemented right now is add2, which just turns into LLVM’s fadd instruction. So right now you can write things like:

> (defn f (a b) (add2 (add2 a b) (add2 a b)))
> (f 1.0 2.0)
Evaluated to: 6

Which is really cool to see. Hopefully it will do useful things soon.

Laundry list

+ , - , * , /

do / begin

Nice to haves

LLVM IR

It’s really easy for me to provide functions externally through C++ code, like so:

extern "C" double sub2(double a, double b) {
  return a-b;
}

(extern sub2 (a b) ()) ; I'll fix this syntax soon
(sub2 2.0 1.0) => 1.0

I would prefer to have it written directly in LLVM for tighter integration, but emitting LLVM IR from C++ is painful. I think it would be cool to emit IR from lisp and redefine the library like that. Maybe something along these lines:

(let ((add2 (llvm-func "add2" double '(double double)))
      (a (llvm-args add2 0))
      (b (llvm-args add2 1))
      (bb "entry"
          (llvm-ret
           (llvm-fadd a b "addtmp")))))

For extra points this could be compiled into an object and linked dynamically on next load rather than compiled every time. Maybe this could lead to bootstrapping?

TODO A type system

TODO Pattern matching & Algebraic data types

What would this look like?

Footnotes:

I have since learned about <regex>, so I could clean this up a little.

Although now that I think about it, I could just use std::deque.