This assignment and assignment 6 will involve implementing an interpreter for a simple programming language. This language will be a very stripped-down dialect of the Scheme programming language called "bogoscheme". Despite the name, the language will still possess many features of real Scheme, including first-class functions and proper lexical scoping. Along the way, we will learn to use some of the special tools that ocaml provides for use in constructing language parsers.
If you've never seen Scheme before, you should try to familiarize yourself with the basics of the language before attempting this lab. Fortunately, you will not need to know much Scheme to write this lab.
ocamllex
ocamlyacc
Read the standard ocamllex
and ocamlyacc
documentation here.
Another excellent resource is this
ocamlyacc
tutorial, which goes into the material in considerably
more depth than the manual from the language designers. Note especially this
page on recursive grammar rules -- we guarantee that it will be useful to
you.
Another useful ocamllex
tutorial is
here.
The process of writing a programming language interpreter consists of a series of steps which lead us all the way from the original source code (which is treated as a big long string of characters) to the result of executing the program. Although the specific steps can vary depending on the sophistication of the interpreter, for our purposes the steps will include:
Lexing: Converting the source code into a stream of tokens, which are primitive syntactic entities like numbers, punctuation marks, and identifiers. This stage is called "lexical analysis" or "lexing" for short, and programs/functions that do it are called "lexers".
Parsing: Converting the stream of tokens into an abstract representation of the program, usually referred to as an abstract syntax tree or AST. This stage is called "parsing" and programs/functions that do it are called "parsers". Sometimes people informally use the word "parsing" to mean the combination of lexing and parsing, but we'll use the more restricted form of the word in what follows.
Interpretation: Executing the abstract representation of the program.
Compilers (and more sophisticated interpreters) invariably have many more stages than this, though they also start off by doing lexing and parsing, just as we will do.
Lexers and parsers have been written since the early days of programming languages, and there is an extensive literature associated with the theory behind them. At some point, someone realized that
As a result, people started writing lexer generators and parser
generators, which are programs that read some kind of textual
specification of what a lexer or parser is supposed to do, and generate the
code to do it. This is similar to the kind of things compilers do, but it's
much more restricted. The best-known lexer generator is called
lex
and generates C code for a lexer (there is a better
version of this program called flex
which is available on
most Linux systems). The best-known parser generator is called
yacc
(short for "Yet Another Compiler-Compiler" and also
a self-deprecating name) and generates C code for a parser (there is a better
version of this program called bison
(the name is
a pun on "yacc" as well as "gnu", since it's a GNU program) which is
available on most Linux systems).
These programs are very popular, but they only generate C code, so it
wasn't long before people started writing versions that could generate code
in other languages. Now, many languages have the equivalent of
lex
and yacc
. Ocaml, in particular, has the
programs ocamllex
and ocamlyacc
which generate
ocaml code. These are the programs we'll be using. Their usage will be
described in the lectures and is also described in the ocaml manual (but not
in Jason Hickey's textbook!).
The goal of the next two labs is to write a program that will enable us to interpret the following Scheme code, which computes the factorial of 10:
;; bogoscheme program to compute factorials. (define factorial (lambda (n) (if (= n 0) 1 (* n (factorial (- n 1)))))) (print (factorial 10))
Note that "print"
is not standard Scheme; it simply prints
its argument to the terminal, followed by a newline.
This bit of code (which will be in a file called
"factorial.bs"
) contains only a few Scheme constructs:
define
lambda
if
The goal of the current lab is more modest; we simply want to write a program that will take the code above and convert it into an intermediate representation called an "abstract syntax tree" or AST for short. An AST is a representation of a program (or part of a program) in terms of an ocaml datatype. One reason ocaml is such a great language for writing compilers in is that its union types (also known as algebraic data types) are perfectly suited to representing abstract syntax trees. For instance, the AST representation of our Scheme language will be the following:
(* Type of Scheme identifiers. *) type id = string (* Type of Scheme expressions. *) type expr = | Expr_unit | Expr_bool of bool | Expr_int of int | Expr_id of id | Expr_define of id * expr | Expr_if of expr * expr * expr | Expr_lambda of id list * expr list | Expr_apply of expr * expr list
This code says that the type id
(for "identifier") is just an
alias for the string
type, and the type expr
has
eight different cases. Three of them represent data values (the unit value,
boolean values and integer values). One represents identifiers. One each
represent Scheme define
, if
, and
lambda
expressions. Finally, one represents the application of
a function to its arguments. The unit type (which is also not standard
Scheme) will be used as the return type for any expression which doesn't
return any meaningful value. Note that the only value that has the unit type
is the unit value, just like in ocaml (the name "unit" means "type that has
only one (meaningless) value"). In our program, the print
function will return a unit value.
The AST contains a complete description of the syntax of any valid piece of code we might read in, in a form which is easy for an ocaml program to manipulate. Thus, our goal for this lab is to take the factorial program given above and convert it into this AST. In the next lab, we will see how to interpret the AST to compute the answer.
In a section above I said that interpreted languages are processed first by lexing, then parsing, then interpreting. This is true for most languages. However, languages derived from Lisp (like Scheme) take a more indirect approach:
First, they use a lexer to convert an input string (i.e. the contents of a file) into tokens. This is no different from the approach we described above.
Then, they convert the token stream into an intermediate representation called an "S-expression" (which originally meant "symbolic expression"). I'll describe this below.
Then the S-expression gets converted into an "abstract syntax tree" (AST), which we described above.
Finally, the AST is given to an evaluator function, which evaluates it and returns the result.
Most language interpreters skip the second step and go straight from the token stream to the AST. This is what is usually called "parsing". Here, we use the word "parsing" to mean the process of going from the token stream to an S-expression. Converting from the S-expression to an AST is a subsequent step.
The advantage of S-expressions is that they are much more generic than an AST. If you ever decide to add a new fundamental form to your language, for instance, you will almost certainly have to change your AST, and this would normally mean changing the parser. But with our approach, we usually won't have to change the parser; we'll only have to change the functions that convert S-expressions to ASTs, and that's usually easier (sometimes much easier).
The S-expression data type is very simple:
(* Type of atomic expressions. *) type atom = | Atom_unit | Atom_bool of bool | Atom_int of int | Atom_id of string (* Type of all S-expressions. *) type expr = | Expr_atom of atom | Expr_list of expr list
In short, an S-expression is either an "atom", which is a single scalar value like a number, a boolean, a unit value or an identifier, or it's a list of S-expressions. Note that this means that an S-expression can contain nested lists (lists of lists, or lists of lists of lists, etc.).
Notice that the S-expression data type is called expr
, as is
the AST data type. This is OK, because they will be in different modules, so
the full type name of an S-expression will be Sexpr.expr
, and
AST expressions will have the type Ast.expr
.
You will write a program which will take an input file, convert it into an AST, and print out the AST representation. The output you should get will look something like this:
DEFINE[factorial LAMBDA[(n) IF[ APPLY[ ID[ = ] ID[ n ] INT[ 0 ] ] INT[ 1 ] APPLY[ ID[ * ] ID[ n ] APPLY[ ID[ factorial ] APPLY[ ID[ - ] ID[ n ] INT[ 1 ] ] ] ] ] ] ] APPLY[ ID[ print ] APPLY[ ID[ factorial ] INT[ 10 ] ] ]
You should also write two other programs, which will help you as you debug your program:
A program that will do the lexing only, and which will then print out the tokens that have been read in.
A program that will do the lexing as well as the parsing to generate the S-expression representation of the program, and which will then print out the S-expression.
The main focus of this program is to write the lexer and the parser. The
lexer will go into a file called lexer.mll
and the parser will
go into a file called parser.mll
. In addition, you will have to
write the code to convert an S-expression into an AST expression. This
amounts to less than 80 lines in all. All of the rest of the code, including
code to convert S-expressions and AST expressions to strings, will be
supplied for you.
The locations in the file where you have to add code are marked by comments including the word "TODO". You should remove these comments as you fill in the code.
This will be done in the file lexer.mll
. This is an
ocamllex
file, which is not the same as a real ocaml source code
file; instead, ocamllex
will use what's in this file to generate the
real ocaml file lexer.ml
. You should fill in the part of the file
following the line
rule lex = parse (* fill in code here *)
Your lexer should convert chunks of the input string into one of the tokens defined in the parser (see below). Specifically, your lexer should handle:
Single-line comments, starting with ; and going to the end of the
line. These should be discarded. A useful trick here is to simply call the
lexer recursively to get another token if a comment is encountered. Note
that to do this, you should realize that the lex
lexer (that
comes after the word rule
in lexer.mll
) is
converted by ocamllex
into a one-argument ocaml function; the
argument is the lexing buffer, called lexbuf
. Knowing this will
allow you to call the lexer recursively from the action part of the lexing
rule.
Whitespace, which should be discarded as well (use the same trick you used with comments).
Left and right parentheses, each of which is a distinct token type (TOK_LPAREN and TOK_RPAREN).
End-of-file, which will yield a token of type TOK_EOF.
Unit values, which are written as "#u". This will yield a token of type TOK_UNIT.
Boolean values, which are written as "#t" (for true) and "#f" (for false). These will become tokens of type TOK_BOOL. TOK_BOOL tokens have an associated value (true or false).
Integer values (possibly including a leading minus sign). These will become tokens of type TOK_INT, which also have an associated (integer) value. Don't forget to convert the string token to an integer before creating the TOK_INT value.
Identifiers, which will become tokens of type TOK_ID. These include the string which represents the identifier.
Anything else is an error.
The purpose of the parser is to take a sequence of tokens and convert it
into S-expressions. Each time you call the parser it will return a new
S-expression. If you hit EOF, the parser will return None
.
Thus, the return type of the parser is Sexpr.expr option
.
The token types are defined for you in the file parser.mly
. Look
at this file and you'll see that there are tokens for left and right
parentheses, unit values, boolean values, integers, etc. Note that some
tokens have associated data and others don't.
We've also provided stubs for the five nonterminals of the grammar:
parse: This represents an S-expression, or None
if
EOF is encountered.
sexpr: This represents an S-expression, which can be either an atom or a list of S-expressions.
atom: This represents an atom, which can be either a unit value, a boolean value, an integer value, or an identifier.
slist: This represents a list of S-expressions, surrounded by parentheses.
sexpr_list: This represents the contents of a list of S-expressions, without the parentheses. Note that empty lists are valid S-expressions.
All the types of each of the above nonterminals are given in the file.
This parser is just about the most trivial language parser imaginable; you only need to fill in 11 lines of code. Despite this, S-expressions are very flexible and are a useful way to structure syntax. In fact, XML documents are often considered to be S-expressions on steroids.
This should be done in the file ast.ml
, in a function called
ast_of_sexpr
. You will need to pattern-match on the
S-expression to figure out which AST expression it corresponds to, or raise
an exception if it doesn't correspond to any valid AST expression. This will
be the trickiest part of the lab, so make sure your parser is working
correctly (using the parser_test
program) before you attempt it.
Since there are a lot of supporting files for this lab, we've prepared a tarball of all the files, which is located here. To unpack it, do:
% tar xvf lab5.tar
Then edit the files lexer.mll
, parser.mly
, and
ast.ml
in order to complete the lab. The rest of the files should
be left alone.
Note that we provide a Makefile
which will compile all the code
for the lab, so you won't have to invoke ocamllex
or
ocamlyacc
manually. However, feel free to do so anyway, to see what
it does.
It's highly advisable to test the code at multiple points to make
sure that each stage is working properly. To that end, the
Makefile
that we supply has targets for programs called
lexer_test
, parser_test
, and ast_test
.
You should compile all of these programs (typing make
will do this), run all of them on the factorial.bs
file,
and check that the output is what you expect it to be. An easy way to do
this is to type make test_lexer
(to test the lexer),
make test_parser
(to test the parser), and
make test_ast
(to test the ast). The lexer test
uses a file called tokens.bs
which includes the contents of
factorial.bs
plus a few other tokens put at the end of the
file.
Typing make clean
will remove any files that
have been generated by the compiler.
Note that the parser and the lexer are interdependent, so you can't test
the lexer until the parser is working (more specifically, the parser has to
be working well enough that it can pass through ocamlyacc
without
causing an error, though it doesn't have to parse correctly for the lexer to
work properly). Sound confusing? It is.
The files lexer.mll
, parser.mly
, and
ast.ml
.
The standard ocamllex
and ocamlyacc
documentation, which is available here.
Another ocamllex
tutorial, available here.
Another ocamlyacc
tutorial, available here.
Note especially this
page on recursive grammar rules.