This post is about Assignment 1 on on the Coursera Compilers course, which is to write a flex specification for lexing COOL.

This was pretty challenging for me, I took about 4 hours in total, spread over a couple of nights to complete this.

The challenging part was handling edge cases, and thinking about how the lexer will match strings. For example, because the lexer matches greedily, an over-optimistic regex will cause group that you won't think of capturing to be captured.

The really helpful part was running the grading script, and checking the differences between the reference lexer and my lexer. This allowed me to drill down to specific portions of my regex or code, such as areas I should be increasing the line count.

Matching operators and keywords are not difficult, the most tedious part was matching multiline nested comments and also string constants.

To match nested comments, for every ")" matched, we need to determine if this is matching a opening "(", or if it is a stray closing comment and hence an error. This can be done using a start condition.

"(*" { BEGIN(comment);}
<comment>"*)" {}
"*)" { return ERROR; }

The last rule for matching "*)" feels pretty unintuitive, but it works!

Matching string constants is even more complicated. We need to handle all the valid escaped sequences, \b, \t, \n, \f, \", transform non escaped sequences with a slash into it's normal counter part, e.g. \a will be transformed into a. Also, null characters are illegal within string constants, the same for EOF. And any new lines need to be escaped first. It becomes unobvious how to match certain sequences, for example, how do you match a new line in a string?

<str>\n { return ERROR; }
<str>\\n { /* matches a slash followed by a 'n' in the input */ }
<str>\\\n { /* matches a slash followed by a new line character */ }

The first line matches a new line while reading a string constant. The second line is a slash followed by a character 'n'. Finally, the third line matches an escaped new line. \\ is an escaped \, and hence matches a literal \ within the input, followed by \n, which matches a new line character.

Here's other tips that may come in helpful:

curr_lineno indicates the current line in the source file, this needs to be updated in multiple places, for example, in string constants, in comments, in general source code definition.

yyinput() reads one character from the stream

<<EOF>> matches EOF, this is useful when scanning for string constants as an EOF while in string is an error. However <<EOF>> only works on the pattern, as in <str><<EOF>>, when in the <str> context.

Useful links:

A quick introductin to flex: http://dinosaur.compilertools.net/lex/index.html

Good sample lex file for ANSI C grammar: http://www.lysator.liu.se/c/ANSI-C-grammar-l.html#count

How to match EOF: http://flex.sourceforge.net/manual/EOF.html

Might be helpful, flex description for C/C++ scanner: http://scottmcpeak.com/elkhound/sources/elsa/cc.lex

The coursera forums are helpful as well.