Ocaml track: assignment 5: Implementing Scheme, part 1

Goals

This assignment and assignment 6 will involve implementing an interpreter for a simple programming language. This language will be a very stripped-down dialect of the Scheme programming language called "bogoscheme". Despite the name, the language will still possess many features of real Scheme, including first-class functions and proper lexical scoping. Along the way, we will learn to use some of the special tools that ocaml provides for use in constructing language parsers.

If you've never seen Scheme before, you should try to familiarize yourself with the basics of the language before attempting this lab. Fortunately, you will not need to know much Scheme to write this lab.

Concepts covered this week

Lexer generators
Parser generators
S-expressions
Abstract syntax trees

Language tools covered this week

ocamllex
ocamlyacc

Reading

Read the standard ocamllex and ocamlyacc documentation here.

Another excellent resource is this ocamlyacc tutorial, which goes into the material in considerably more depth than the manual from the language designers. Note especially this page on recursive grammar rules -- we guarantee that it will be useful to you.

Another useful ocamllex tutorial is here.

Overview

Basic sequence of interpretation

The process of writing a programming language interpreter consists of a series of steps which lead us all the way from the original source code (which is treated as a big long string of characters) to the result of executing the program. Although the specific steps can vary depending on the sophistication of the interpreter, for our purposes the steps will include:

Lexing: Converting the source code into a stream of tokens, which are primitive syntactic entities like numbers, punctuation marks, and identifiers. This stage is called "lexical analysis" or "lexing" for short, and programs/functions that do it are called "lexers".
Parsing: Converting the stream of tokens into an abstract representation of the program, usually referred to as an abstract syntax tree or AST. This stage is called "parsing" and programs/functions that do it are called "parsers". Sometimes people informally use the word "parsing" to mean the combination of lexing and parsing, but we'll use the more restricted form of the word in what follows.
Interpretation: Executing the abstract representation of the program.

Compilers (and more sophisticated interpreters) invariably have many more stages than this, though they also start off by doing lexing and parsing, just as we will do.

Lexer and parser generators

Lexers and parsers have been written since the early days of programming languages, and there is an extensive literature associated with the theory behind them. At some point, someone realized that

Writing lexers and parsers is very boring work.
It's possible to automate most of this boring work away.

As a result, people started writing lexer generators and parser generators, which are programs that read some kind of textual specification of what a lexer or parser is supposed to do, and generate the code to do it. This is similar to the kind of things compilers do, but it's much more restricted. The best-known lexer generator is called lex and generates C code for a lexer (there is a better version of this program called flex which is available on most Linux systems). The best-known parser generator is called yacc (short for "Yet Another Compiler-Compiler" and also a self-deprecating name) and generates C code for a parser (there is a better version of this program called bison (the name is a pun on "yacc" as well as "gnu", since it's a GNU program) which is available on most Linux systems).

These programs are very popular, but they only generate C code, so it wasn't long before people started writing versions that could generate code in other languages. Now, many languages have the equivalent of lex and yacc. Ocaml, in particular, has the programs ocamllex and ocamlyacc which generate ocaml code. These are the programs we'll be using. Their usage will be described in the lectures and is also described in the ocaml manual (but not in Jason Hickey's textbook!).

Goal of the next two labs

The goal of the next two labs is to write a program that will enable us to interpret the following Scheme code, which computes the factorial of 10:

;; bogoscheme program to compute factorials.

(define factorial
  (lambda (n)
    (if (= n 0)
        1
        (* n (factorial (- n 1))))))

(print (factorial 10))

Note that "print" is not standard Scheme; it simply prints its argument to the terminal, followed by a newline.

This bit of code (which will be in a file called "factorial.bs") contains only a few Scheme constructs:

constants (numbers)
variables (identifiers)
define
lambda
if
function application

Abstract syntax trees

The goal of the current lab is more modest; we simply want to write a program that will take the code above and convert it into an intermediate representation called an "abstract syntax tree" or AST for short. An AST is a representation of a program (or part of a program) in terms of an ocaml datatype. One reason ocaml is such a great language for writing compilers in is that its union types (also known as algebraic data types) are perfectly suited to representing abstract syntax trees. For instance, the AST representation of our Scheme language will be the following:

(* Type of Scheme identifiers. *)
type id = string

(* Type of Scheme expressions. *)
type expr =
   | Expr_unit
   | Expr_bool   of bool
   | Expr_int    of int
   | Expr_id     of id
   | Expr_define of id * expr
   | Expr_if     of expr * expr * expr
   | Expr_lambda of id list * expr list
   | Expr_apply  of expr * expr list

This code says that the type id (for "identifier") is just an alias for the string type, and the type expr has eight different cases. Three of them represent data values (the unit value, boolean values and integer values). One represents identifiers. One each represent Scheme define, if, and lambda expressions. Finally, one represents the application of a function to its arguments. The unit type (which is also not standard Scheme) will be used as the return type for any expression which doesn't return any meaningful value. Note that the only value that has the unit type is the unit value, just like in ocaml (the name "unit" means "type that has only one (meaningless) value"). In our program, the print function will return a unit value.

The AST contains a complete description of the syntax of any valid piece of code we might read in, in a form which is easy for an ocaml program to manipulate. Thus, our goal for this lab is to take the factorial program given above and convert it into this AST. In the next lab, we will see how to interpret the AST to compute the answer.

S-expressions

In a section above I said that interpreted languages are processed first by lexing, then parsing, then interpreting. This is true for most languages. However, languages derived from Lisp (like Scheme) take a more indirect approach:

First, they use a lexer to convert an input string (i.e. the contents of a file) into tokens. This is no different from the approach we described above.
Then, they convert the token stream into an intermediate representation called an "S-expression" (which originally meant "symbolic expression"). I'll describe this below.
Then the S-expression gets converted into an "abstract syntax tree" (AST), which we described above.
Finally, the AST is given to an evaluator function, which evaluates it and returns the result.

Most language interpreters skip the second step and go straight from the token stream to the AST. This is what is usually called "parsing". Here, we use the word "parsing" to mean the process of going from the token stream to an S-expression. Converting from the S-expression to an AST is a subsequent step.

The advantage of S-expressions is that they are much more generic than an AST. If you ever decide to add a new fundamental form to your language, for instance, you will almost certainly have to change your AST, and this would normally mean changing the parser. But with our approach, we usually won't have to change the parser; we'll only have to change the functions that convert S-expressions to ASTs, and that's usually easier (sometimes much easier).

The S-expression data type is very simple:

(* Type of atomic expressions. *)
type atom =
   | Atom_unit
   | Atom_bool of bool
   | Atom_int  of int
   | Atom_id   of string

(* Type of all S-expressions. *)
type expr =
   | Expr_atom of atom
   | Expr_list of expr list

In short, an S-expression is either an "atom", which is a single scalar value like a number, a boolean, a unit value or an identifier, or it's a list of S-expressions. Note that this means that an S-expression can contain nested lists (lists of lists, or lists of lists of lists, etc.).

Notice that the S-expression data type is called expr, as is the AST data type. This is OK, because they will be in different modules, so the full type name of an S-expression will be Sexpr.expr, and AST expressions will have the type Ast.expr.

Program to write

You will write a program which will take an input file, convert it into an AST, and print out the AST representation. The output you should get will look something like this:

DEFINE[factorial
  LAMBDA[(n)
    IF[
      APPLY[
        ID[ = ]
        ID[ n ]
        INT[ 0 ] ]
      INT[ 1 ]
      APPLY[
        ID[ * ]
        ID[ n ]
        APPLY[
          ID[ factorial ]
          APPLY[
            ID[ - ]
            ID[ n ]
            INT[ 1 ] ] ] ] ] ] ]


APPLY[
  ID[ print ]
  APPLY[
    ID[ factorial ]
    INT[ 10 ] ] ]

You should also write two other programs, which will help you as you debug your program:

A program that will do the lexing only, and which will then print out the tokens that have been read in.
A program that will do the lexing as well as the parsing to generate the S-expression representation of the program, and which will then print out the S-expression.

The main focus of this program is to write the lexer and the parser. The lexer will go into a file called lexer.mll and the parser will go into a file called parser.mll. In addition, you will have to write the code to convert an S-expression into an AST expression. This amounts to less than 80 lines in all. All of the rest of the code, including code to convert S-expressions and AST expressions to strings, will be supplied for you.

The locations in the file where you have to add code are marked by comments including the word "TODO". You should remove these comments as you fill in the code.

Lexer

This will be done in the file lexer.mll. This is an ocamllex file, which is not the same as a real ocaml source code file; instead, ocamllex will use what's in this file to generate the real ocaml file lexer.ml. You should fill in the part of the file following the line

rule lex = parse
    (* fill in code here *)

Your lexer should convert chunks of the input string into one of the tokens defined in the parser (see below). Specifically, your lexer should handle:

Single-line comments, starting with ; and going to the end of the line. These should be discarded. A useful trick here is to simply call the lexer recursively to get another token if a comment is encountered. Note that to do this, you should realize that the lex lexer (that comes after the word rule in lexer.mll) is converted by ocamllex into a one-argument ocaml function; the argument is the lexing buffer, called lexbuf. Knowing this will allow you to call the lexer recursively from the action part of the lexing rule.
Whitespace, which should be discarded as well (use the same trick you used with comments).
Left and right parentheses, each of which is a distinct token type (TOK_LPAREN and TOK_RPAREN).
End-of-file, which will yield a token of type TOK_EOF.
Unit values, which are written as "#u". This will yield a token of type TOK_UNIT.
Boolean values, which are written as "#t" (for true) and "#f" (for false). These will become tokens of type TOK_BOOL. TOK_BOOL tokens have an associated value (true or false).
Integer values (possibly including a leading minus sign). These will become tokens of type TOK_INT, which also have an associated (integer) value. Don't forget to convert the string token to an integer before creating the TOK_INT value.
Identifiers, which will become tokens of type TOK_ID. These include the string which represents the identifier.

Anything else is an error.

Parser

The purpose of the parser is to take a sequence of tokens and convert it into S-expressions. Each time you call the parser it will return a new S-expression. If you hit EOF, the parser will return None. Thus, the return type of the parser is Sexpr.expr option.

The token types are defined for you in the file parser.mly. Look at this file and you'll see that there are tokens for left and right parentheses, unit values, boolean values, integers, etc. Note that some tokens have associated data and others don't.

We've also provided stubs for the five nonterminals of the grammar:

parse: This represents an S-expression, or None if EOF is encountered.
sexpr: This represents an S-expression, which can be either an atom or a list of S-expressions.
atom: This represents an atom, which can be either a unit value, a boolean value, an integer value, or an identifier.
slist: This represents a list of S-expressions, surrounded by parentheses.
sexpr_list: This represents the contents of a list of S-expressions, without the parentheses. Note that empty lists are valid S-expressions.

All the types of each of the above nonterminals are given in the file.

This parser is just about the most trivial language parser imaginable; you only need to fill in 11 lines of code. Despite this, S-expressions are very flexible and are a useful way to structure syntax. In fact, XML documents are often considered to be S-expressions on steroids.

S-expression to AST conversion

This should be done in the file ast.ml, in a function called ast_of_sexpr. You will need to pattern-match on the S-expression to figure out which AST expression it corresponds to, or raise an exception if it doesn't correspond to any valid AST expression. This will be the trickiest part of the lab, so make sure your parser is working correctly (using the parser_test program) before you attempt it.

Supporting files

Since there are a lot of supporting files for this lab, we've prepared a tarball of all the files, which is located here. To unpack it, do:

% tar xvf lab5.tar

Then edit the files lexer.mll, parser.mly, and ast.ml in order to complete the lab. The rest of the files should be left alone.

Note that we provide a Makefile which will compile all the code for the lab, so you won't have to invoke ocamllex or ocamlyacc manually. However, feel free to do so anyway, to see what it does.

Testing the code

It's highly advisable to test the code at multiple points to make sure that each stage is working properly. To that end, the Makefile that we supply has targets for programs called lexer_test, parser_test, and ast_test. You should compile all of these programs (typing make will do this), run all of them on the factorial.bs file, and check that the output is what you expect it to be. An easy way to do this is to type make test_lexer (to test the lexer), make test_parser (to test the parser), and make test_ast (to test the ast). The lexer test uses a file called tokens.bs which includes the contents of factorial.bs plus a few other tokens put at the end of the file.

Typing make clean will remove any files that have been generated by the compiler.

Note that the parser and the lexer are interdependent, so you can't test the lexer until the parser is working (more specifically, the parser has to be working well enough that it can pass through ocamlyacc without causing an error, though it doesn't have to parse correctly for the lexer to work properly). Sound confusing? It is.

To hand in

The files lexer.mll, parser.mly, and ast.ml.

References

The standard ocamllex and ocamlyacc documentation, which is available here.
Another ocamllex tutorial, available here.
Another ocamlyacc tutorial, available here. Note especially this page on recursive grammar rules.