An English Compiler

by Peter Chapman

Over the years, I have written several compilers/interpreters: Forth, Pascal, Basic, HyperTalk (from HyperCard), SQL, and dBase, to name a few. As the old adage goes, “if you have a hammer, everything looks like a nail”, and this problem set looked like just another compiler to me, albeit one with a few extra wrinkles.

So, this post discusses the current state of the English compiler:

1. Sentence determination: First, you have to determine sentences from an input stream. Sadly, this is not as easy as you might think, because periods are sprinkled throughout English in things like “Dr.” and “$4.99”. I advocate that we remove all periods except those at the end of sentences. In the meantime, I wrote code to determine a true end of sentence.
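
A rough sketch of the idea (not the actual code from the compiler), assuming a small hand-maintained abbreviation list; a real implementation would need a much larger one:

```python
import re

# Illustrative abbreviation list; a real one would be far more complete.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "st.", "etc.", "e.g.", "i.e."}

def split_sentences(text):
    """Split on ., ?, and !, skipping periods that belong to
    abbreviations or to numbers such as "$4.99"."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]", text):
        i = match.start()
        # Skip a period inside a number like "$4.99".
        if text[i] == "." and i + 1 < len(text) and text[i + 1].isdigit():
            continue
        # Skip a period that ends a known abbreviation like "Dr.".
        last_word = text[start:i + 1].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue
        sentences.append(text[start:i + 1].strip())
        start = i + 1
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Chapman paid $4.99 for it. He was pleased."))
# ['Dr. Chapman paid $4.99 for it.', 'He was pleased.']
```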

2. Lexical analysis breaks the English text into small pieces called tokens. Each token is a single atomic unit of the language. This phase is also called lexing or scanning, and the software doing lexical analysis is called a lexical analyzer or scanner. I’ve decided to do word-level parsing instead of a character-based one.
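
A minimal word-level lexer sketch, where each word, number, or punctuation mark becomes one token; the token categories here are illustrative, not the compiler’s actual set:

```python
import re

# Words, numbers (including "$4.99"), and single punctuation marks.
TOKEN_PATTERN = re.compile(r"\$?\d[\d,]*(?:\.\d+)?|\w+(?:'\w+)?|[^\w\s]")

def lex(sentence):
    """Return (kind, text, position) tuples for a single sentence."""
    tokens = []
    for match in TOKEN_PATTERN.finditer(sentence):
        text = match.group()
        if text[0].isdigit() or text.startswith("$"):
            kind = "NUMBER"
        elif text[0].isalnum() or text[0] == "_":
            kind = "WORD"
        else:
            kind = "PUNCT"
        tokens.append((kind, text, match.start()))
    return tokens

print(lex("Peter wrote $4.99 worth of code."))
# [('WORD', 'Peter', 0), ('WORD', 'wrote', 6), ('NUMBER', '$4.99', 12),
#  ('WORD', 'worth', 18), ('WORD', 'of', 24), ('WORD', 'code', 27),
#  ('PUNCT', '.', 31)]
```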

3. Preprocessing: Unlike a traditional compiler, where macro expansion, etc. occurs in this step, I decided now would be a good time to detect known character patterns such as times, dates, units, zip codes, telephone numbers, etc. At this point the original string is still available, so it’s easier than later, where multiple tokens might get created for these known patterns.
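
A sketch of this pattern-recognition pass, using a few illustrative regular expressions; the actual set of recognized patterns would be much larger:

```python
import re

# Illustrative patterns only; times, dates, units, zip codes, and phone
# numbers come in many more formats than these.
KNOWN_PATTERNS = [
    ("TIME",  re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\s?(?:am|pm|AM|PM)?\b")),
    ("DATE",  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
    ("PHONE", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),
    ("ZIP",   re.compile(r"\b\d{5}(?:-\d{4})?\b")),
]

def tag_known_patterns(sentence):
    """Return (label, matched_text, position) triples for every known
    pattern, found while the original string is still intact."""
    found = []
    for label, pattern in KNOWN_PATTERNS:
        for match in pattern.finditer(sentence):
            found.append((label, match.group(), match.start()))
    return sorted(found, key=lambda item: item[2])

print(tag_known_patterns("Call 555-867-5309 before 10:30 am on 4/15/2024."))
# [('PHONE', '555-867-5309', 5), ('TIME', '10:30 am', 25), ('DATE', '4/15/2024', 37)]
```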

4. Syntax analysis involves parsing the token sequence to identify the syntactic structure. This phase builds a parse tree, which replaces the linear sequence of tokens with a tree structure built according to the rules of a formal grammar that define the language’s syntax. The parse tree is often analyzed, augmented, and transformed by later phases in the compiler. It’s during this phase that we add word-level tagging and determine word relationships.
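
A minimal sketch of the kind of tree this phase might build, with word-level tags on the leaves; the grammar labels and tag names below are illustrative, not the compiler’s actual grammar:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                 # grammar label such as "S", "NP", "VP", or a word
    tag: str = ""              # word-level tag such as "PROPN" or "VERB"
    children: list = field(default_factory=list)

    def pretty(self, depth=0):
        line = "  " * depth + self.label + (f"/{self.tag}" if self.tag else "")
        return "\n".join([line] + [c.pretty(depth + 1) for c in self.children])

# Hand-built tree for "Peter wrote code", standing in for the parser's output.
tree = Node("S", children=[
    Node("NP", children=[Node("Peter", tag="PROPN")]),
    Node("VP", children=[
        Node("wrote", tag="VERB"),
        Node("NP", children=[Node("code", tag="NOUN")]),
    ]),
])
print(tree.pretty())
```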

5. Stemming, lemmatization, word identification, word joining, and metadata adornment are done next. In a previous post, we talked about stemming and lemmatization, so I won’t go into them further here. During this phase, we combine adjacent proper nouns to form full nouns. For instance, the two tokens for my name, “Peter” and “Chapman”, would be combined into a single token, “Peter Chapman”. Also during this step, we examine each token and its corresponding word and adorn the token with additional information required in future steps. For instance, we might add metadata that a particular token is a floating-point number.
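
A sketch of the joining and adornment pass, assuming each token is a dict with "text" and "tag" fields produced by the earlier phases; the field names are illustrative:

```python
def join_proper_nouns(tokens):
    """Merge runs of adjacent proper-noun tokens into one token,
    e.g. "Peter" + "Chapman" -> "Peter Chapman"."""
    merged = []
    for token in tokens:
        if merged and token["tag"] == "PROPN" and merged[-1]["tag"] == "PROPN":
            merged[-1]["text"] += " " + token["text"]
        else:
            merged.append(dict(token))
    return merged

def adorn(tokens):
    """Attach metadata later phases will need, e.g. numeric type."""
    for token in tokens:
        text = token["text"].lstrip("$")
        if text.replace(".", "", 1).replace(",", "").isdigit():
            token["is_number"] = True
            token["is_float"] = "." in text
    return tokens

tokens = [
    {"text": "Peter", "tag": "PROPN"},
    {"text": "Chapman", "tag": "PROPN"},
    {"text": "paid", "tag": "VERB"},
    {"text": "$4.99", "tag": "NUM"},
]
print(adorn(join_proper_nouns(tokens)))
# [{'text': 'Peter Chapman', 'tag': 'PROPN'}, {'text': 'paid', 'tag': 'VERB'},
#  {'text': '$4.99', 'tag': 'NUM', 'is_number': True, 'is_float': True}]
```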

This is where I am today. The next step is what I am going to work on tomorrow.

6. Semantic analysis is the phase in which the compiler adds semantic information to the parse tree and performs semantic checking. Semantic analysis usually requires a complete parse tree, meaning that this phase logically follows the parsing phase, and logically precedes the code generation phase, though it is often possible to fold multiple phases into one pass over the code in a compiler implementation.
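
Since this phase isn’t written yet, here is only an illustration of the kind of annotation a semantic pass might add to the parse tree built earlier; the role names (“agent”, “theme”) and tree shape are assumptions, not the compiler’s design:

```python
def annotate_roles(node, parent_label=None):
    """Walk a {label, tag, children} tree and attach crude semantic
    roles: the NP directly under S is the agent, an NP inside the VP
    is the theme."""
    label, children = node["label"], node["children"]
    if label == "NP" and parent_label == "S":
        node["role"] = "agent"
    elif label == "NP" and parent_label == "VP":
        node["role"] = "theme"
    for child in children:
        annotate_roles(child, label)
    return node

# Hand-built parse tree for "Peter wrote code".
sentence = {"label": "S", "children": [
    {"label": "NP", "children": [{"label": "Peter", "tag": "PROPN", "children": []}]},
    {"label": "VP", "children": [
        {"label": "wrote", "tag": "VERB", "children": []},
        {"label": "NP", "children": [{"label": "code", "tag": "NOUN", "children": []}]},
    ]},
]}
annotate_roles(sentence)
print(sentence["children"][0]["role"], sentence["children"][1]["children"][1]["role"])
# agent theme
```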