
Tiny Basic Interpreter
created by: Katrin Becker, 1999
adapted from: ACM programming contest problem set
CONTENTS of THIS PAGE...
Introduction (just below)
English Examples
Explanation
What is an Interpreter?
What is Parsing?
Tiny Basic Language Definition
Tiny Basic Formal Syntax
Outlining a Solution
The Scanner
Reading Tokens & Converting them to Descriptors
Scanning Statements
Scanning Expressions
Memory
The Interpreter
Executing a Program
Evaluating Expressions
Pictures of the Objects

Some problems don't lend themselves to solutions described only in terms of objects. This is one of them.
 
Where to begin.....
At first glance this project may seem like an impossible task, but once it is properly broken down it's not so bad. So let's begin with some terminology:
 
A Lexical analyzer takes the "raw" words in a statement and breaks the statement up into its various components (tokens). Then the Parser takes over; its job is to determine whether or not these tokens constitute a properly formed statement according to the rules of the language (the rules of syntax). The final stage is to determine the meaning (semantics) of the raw words given their context. This is done by the Interpreter.
 
In natural languages (especially English) the rules of syntax are fairly well known, but the interpretation of the meanings of the words is often ambiguous, and the same statement can be broken up in various ways.
For example:

"British Left Waffles on Falkland Islands" can be interpreted at least two ways:
[British Left] [Waffles] [on Falkland Islands]
as [subject] [verb] [prepositional phrase] --- result 1 from the lexical analyzer; the parser says this is syntactically correct; the interpreter may proceed
-or-
[British] [Left] [Waffles] [on Falkland Islands]
as [subject] [verb] [object] [prepositional phrase] --- result 2 from the lexical analyzer; parser says this is also syntactically correct; the interpreter may proceed here too but the meaning (interpretation) is now different.

"Squad helps dog bite victim" can also be interpreted two ways:
[Squad] [helps] [dog bite victim]
as [subject] [verb] [object] --- result 1 from the lexical analyzer; parser says this is syntactically correct; the interpreter may proceed
-or-
[Squad] [helps] [dog] [bite] [victim]
as [subject] [verb, part 1] [object 1] [verb, part 2] [object 2] --- result 2 from the lexical analyzer; parser says this is also syntactically correct; the interpreter may proceed here too but the meaning (interpretation) is again different.

"Enraged Cow Injures Farmer With Ax" can be interpreted two ways too:
[Enraged Cow] [Injures] [Farmer With Ax]
as [subject] [verb] [object] --- result 1 from the lexical analyzer; parser says this is syntactically correct; the interpreter may proceed
-or-
[Enraged Cow] [Injures] [Farmer] [With Ax]
as [subject] [verb] [object] [prepositional phrase] --- result 2 from the lexical analyzer; parser says this is also syntactically correct; the interpreter may proceed here too but the meaning (interpretation) is different.

Well, you get the picture. What we do when we analyze these statements is to apply known rules of syntax to decompose the statement into its syntactic components (parts of speech) so we can apply the appropriate meaning (semantics) to each part. Before we even get to the semantics we must isolate each word (called a token) and classify it (as "subject", "verb", etc.). The resultant syntactic components are called descriptors. This allows us to check that the statement is syntactically correct (to parse it). We can conclude that the statement is syntactically correct when the descriptors match one of the known rules of syntax. In other words, if the statement can be reduced to the descriptors [subject] [verb] [object] then it is syntactically correct and we can proceed to the next step, namely to interpret the meaning.

Humans are wonderfully adapted to do all of these jobs, and once we "know" a language, we can go through these steps without even being conscious of it. The process is more obvious when we are forced to deal with a language with which we are not that familiar.
 
Computers can interpret languages too, but in order to do this by computer we must find a way of eliminating ALL ambiguities. In human languages, we can adjust our "lexical analyzer" on the fly to resolve ambiguities as they come up. We are very good at using context for this ("It can't mean this because that makes no sense, so it must mean that"); computers are not. As a result, the rules of syntax for a computer language contain no ambiguities. Each language has a precisely outlined set of syntactic rules and an equally precise set of semantic rules to go along with them. With this done, we can write a program that carries out the steps outlined above automatically.

1. What is an Interpreter? (from The Computer Desktop Encyclopedia Copyright © 1981-2000 The Computer Language Company Inc. All rights reserved. Ver. 13.1, 1st Quarter 2000)
A high-level programming language translator that translates and runs the program at the same time. It translates one program statement into machine language, executes it, then proceeds to the next statement. This differs from regular executable programs that are presented to the computer as binary-coded instructions. Interpreted programs remain in the same source language format the programmer wrote in: as text files, not machine language files.
Interpreted programs run slower than their compiler counterparts. Whereas the compiler translates the entire program before it is run, interpreters translate a line at a time while the program is run. However, it is very convenient to write an interpreted program, since a single line of code can be tested interactively.
Interpreted programs must always be run with the interpreter. For example, in order to run a BASIC or dBASE program, the BASIC or dBASE interpreter must be in the target computer.
If a language can be both interpreted and compiled, a program may be developed with the interpreter for ease of testing and debugging and later compiled for production use.
Interpreters and Compilers
Unlike compiled languages, which have been translated into machine language ahead of time, interpreted languages must be translated at runtime. With languages such as dBASE and BASIC, the interpreter translates the original source code. Languages such as Java and Visual Basic use an intermediate language: the source code is compiled into "byte code," and the byte code is interpreted at runtime.


2. What is Parsing? (a somewhat more formal explanation)
An algorithm or program to determine the syntactic structure of a sentence or string of symbols in some language. A parser normally takes as input a sequence of tokens output by a lexical analyzer. It may produce some kind of abstract syntax tree as output.

3. Tiny Basic Language Definition: Its Syntax Rules, or "How to Read the Notation & What it Means"
The Notation......
::= means 'is defined as'
<something> means an entity that is further defined (usually later); this is a token
{<something>} means that the thing enclosed in braces is optional and can be omitted
<X> | <Y> means that either '<X>' or '<Y>' can be used here
all other marks must appear literally

 

Rules of Syntax for Tiny Basic:
Each line of a Tiny Basic program has the following form:
{<spaces>}<line number><spaces><statement>
 
Which means:
1. there may be optional leading blank spaces, which we just scan past and ignore
2. there will be a line number (required) which starts with 1, and increases by one on each line
3. there will be one or more blank spaces following the line number; we only need to remember that there were spaces, not how many
4. the statement may be one of the following:
LET<spaces><variable>=<expression>
PRINT<spaces><variable>
GOTO<spaces><expression>
IF<spaces><expression>
STOP
 
Complete Formal Syntax Rules:
 
<program> ::= <line> | <line> {<line>}
 
<line> ::= {<spaces>}<line number><spaces><statement>
 
<statement> ::=
LET<spaces><variable>=<expression> |
PRINT<spaces><variable> |
GOTO<spaces><expression> |
IF<spaces><expression> |
STOP
 
<expression> ::= <constant> | <variable> |
<variable><operator><variable>
<operator> ::= + | >
<constant> ::= -10000 ... 10000
<variable> ::= A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z
<spaces> ::= \040 | \040 {<spaces>}
<line number> ::= 1 | 2 | 3 | ... | 100
 
<variable> > <variable> evaluates to 1 if true; 0 if false
 
There are to be no blanks surrounding the "=" of a LET statement or within an expression.
The maximum number of lines in a Tiny Basic program is 100.

4. Outline of solution:
 
Here's one approach:
The first step, and perhaps the most difficult, is to identify those things in our solution that can be described as objects. To keep the solution as straightforward as possible we will limit the number of objects we define.
 
Step 1: identify those things that are objects in this problem.
Each object:
- consists of .. (is/ has)
- operates on... (does)
- reports on... (output?)
Some possibilities:
maybe token
descriptor
memory
scanner (the lexical analyzer)
interpreter
 
Step 2: lay out the main algorithm
- usual intro/ initializations
- get the arguments (command line arguments would work well here)
- open the appropriate file(s)
- scan / parse
- while not eof(file)
scan-a-line
translate into descriptors
write to "memory"
- interpret
- execute the "program"
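
Here is one possible sketch of that main flow in C (the document's own language, given the char token[32]; that appears below). The helper names scan_program() and interpret() are assumptions; they stand in for the two engines developed later on this page.

#include <stdio.h>
#include <stdlib.h>

/* hypothetical stand-ins for the two engines described below */
void scan_program(FILE *src);   /* scan/parse: load descriptors into memory */
void interpret(void);           /* execute the "program" held in memory */

int main(int argc, char *argv[])
{
    /* get the arguments (command line arguments work well here) */
    if (argc != 2) {
        fprintf(stderr, "usage: %s <program.tb>\n", argv[0]);
        return EXIT_FAILURE;
    }
    /* open the appropriate file */
    FILE *src = fopen(argv[1], "r");
    if (src == NULL) {
        perror(argv[1]);
        return EXIT_FAILURE;
    }
    scan_program(src);   /* while not eof: scan a line, write descriptors */
    fclose(src);
    interpret();         /* run the stored "program" */
    return EXIT_SUCCESS;
}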
 
The scanner/parser is the only part of our solution that deals with tokens. Tokens are simple (and passive) enough that we don't need to define them as actual objects.

The role of "Token" will be played by a simple string:
char token[32];
Each token, once read, can be converted into a fairly simple data-object, namely a descriptor. Descriptors need to maintain 3 pieces of information:
1. a type: VARIABLE, RESERVED_WORD, NUMBER, OPERATOR, EQUALS
2. a name: for variables, operators, equals, and reserved_words (values could be: 'A'..'Z', 'l','p','g','i','s','+','>','=')
3. a number: for numbers (could also be used for variables to store the "address" of the variable in memory)
After reading a token, we can convert it to a descriptor. Descriptors aren't really active objects - they merely store information.
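
In C, a descriptor might be sketched like this (the names are illustrative; the enum values happen to match the numeric type codes used in the summary and memory tables below, with 0 marking an empty slot):

typedef enum {
    D_NONE,            /* 0: an empty slot in memory */
    D_NUMBER,          /* 1 */
    D_RESERVED_WORD,   /* 2 */
    D_VARIABLE,        /* 3 */
    D_EQUALS,          /* 4 */
    D_OPERATOR         /* 5 */
} DescType;

typedef struct {
    DescType type;   /* one of the five types above */
    char     name;   /* 'A'..'Z', 'l','p','g','i','s', '+','>','=' */
    int      value;  /* the number itself, or a variable's "address" */
} Descriptor;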
 
Descriptors are "created" by the Scanner/Parser and stored in a Memory object. For simplicity, the memory object will store descriptors directly in two regions: the program-region and the data-region. The program-region can be a 2-D array of descriptors: we can have 1 row for each line of the original TB-program (100 in all) - each with 5 columns (which is the maximum number of descriptors we will have per line). The data-region is an array of 26 descriptors, one for each possible variable. What we need to be able to do with memory is Fetch and Store descriptors. We also need to be able to retrieve and set the value of a variable. Since the value of the variable is something that "belongs" to the descriptor, we will have to include method(s) in the descriptor that will allow us to change a variable.

 

That leaves us with the two main "engines" needed to drive the program:
1. scanner ( the lexical analyzer, which in our solution also contains the parser )
- its job will be to read the original TinyBasic program from the source file and load it into memory
2. interpreter
- it acts somewhat like the CPU of a real machine
- it retrieves instructions (descriptors) from our memory and "executes" them

The Scanner:
Reading Tokens and Converting them to Descriptors:
A token is simply a logical entity from our original source. In our TinyBasic language, these would be numbers (including line numbers), variables, reserved words, and a few special characters: '=', '+', '>'.

The part of our program that gets the next token is the part that actually reads character by character, and it knows the following things about tokens:
1. Our solution will treat white-space as insignificant. There are only a few places where we can have white-space: before the line number, before the reserved word, and after the reserved word. There are no spaces allowed anywhere else. We will not treat white-space as tokens; we'll just skip it when we encounter it.

2. If the current character is a <digit> or a <minus sign>, we must keep reading characters until we get something that is NOT a digit. We can then convert what we've got to a number, save that as the value and set the descriptor's type to [number]

Note that the allowable values of the number are determined by context: if this number is found when we are looking for a <line number> it must be in a certain range. If found when we are expecting a <constant> it must be in a certain different range. This is handled elsewhere.

3. If the current character is a <letter> then we must keep reading characters until we get one that is NOT a letter. At this point we can try to match what we've got against one of the <reserved words>. If our match is successful, the descriptor's value is the <reserved word> (or some code for it; we could use the first letter of the word in lower case) and its type is [reserved word]. If it is NOT a reserved word, it must be a single letter, in which case its value is the variable's name and its type is [variable]

4. All the rest of the tokens are single characters and must be one of:
=  +  >
The = can only appear in one place; its value is [=] and its type is [equals]
The other two can also only appear in one place; their values are [+] and [>] respectively, and their type is [operator]
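
These four rules translate almost directly into a character-by-character token reader. Here is a sketch in C; the function name and the end-of-line handling are assumptions:

#include <stdio.h>
#include <ctype.h>

/* Reads the next token from `in` into `token`; returns 0 at end of file.
   A newline comes back as a one-character token so the caller can tell
   when a line has ended. */
int get_token(FILE *in, char token[32])
{
    int c = fgetc(in);
    while (c == ' ') c = fgetc(in);              /* rule 1: skip blanks */
    if (c == EOF) return 0;

    int i = 0;
    if (isdigit(c) || c == '-') {                /* rule 2: a number */
        do { token[i++] = (char)c; c = fgetc(in); } while (isdigit(c));
        ungetc(c, in);
    } else if (isalpha(c)) {                     /* rule 3: word or variable */
        do { token[i++] = (char)c; c = fgetc(in); } while (isalpha(c));
        ungetc(c, in);
    } else {                                     /* rule 4: = + > (or '\n') */
        token[i++] = (char)c;
    }
    token[i] = '\0';
    return 1;
}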

Summary of descriptors:
[1:number] [<value>]
[2:reserved word] [<name>]
[3:variable] [<name>]
[4:equals] [=]
[5:operator] [<name>]
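
Converting a token string into one of these five descriptors is then a small classification step. A sketch, reusing the Descriptor type from earlier (the function name is an assumption; since we assume a correct program, any multi-letter word must be a reserved word):

#include <ctype.h>
#include <stdlib.h>
#include <string.h>

Descriptor classify(const char *token)
{
    Descriptor d = { D_NONE, '\0', 0 };
    if (isdigit((unsigned char)token[0]) || token[0] == '-') {
        d.type = D_NUMBER;                       /* [1:number] [<value>] */
        d.value = atoi(token);
    } else if (strlen(token) > 1) {              /* [2:reserved word] */
        d.type = D_RESERVED_WORD;
        d.name = (char)tolower((unsigned char)token[0]); /* 'l','p','g','i','s' */
    } else if (isupper((unsigned char)token[0])) {
        d.type = D_VARIABLE;                     /* [3:variable] [<name>] */
        d.name = token[0];
        d.value = token[0] - 'A';                /* its "address" in memory */
    } else if (token[0] == '=') {
        d.type = D_EQUALS;                       /* [4:equals] [=] */
        d.name = '=';
    } else {
        d.type = D_OPERATOR;                     /* [5:operator] '+' or '>' */
        d.name = token[0];
    }
    return d;
}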

 


5. How to parse statements (with some examples)
Typically, the rules of syntax are listed from the "biggest" (all-encompassing) to the "smallest" (where the literals are defined).
The lexical analyzer begins with the original source (contained in a text file) and starts reading and examining things character by character. Since it "knows" the rules, it is looking for a limited sub-set of acceptable tokens at each step. It first tries to identify individual tokens such as <variable>, <spaces>, <line number>, <constant>. It will also recognize words like LET as tokens; these are our "reserved words".
 
Looking at it from the "top down", the lexical analyzer (LA) is looking for a <program>, which is its starting point. A <program> consists of <line>s so it will proceed as follows:
while not EOF
{
do_a_line
}
each <line> consists of {<spaces>}<line number><spaces><statement> so we now continue:
skip_blanks // where spaces are optional
get_line_number
get_spaces // where at least the first blank is not optional
do_statement
There is a set of rules for each of these that we can follow precisely.

 

An Example:
1 LET A=10
2 LET I=0
3 LET X=I+I
4 LET T=1
5 LET X=X+T
6 PRINT X
7 LET T=1
8 LET I=I+T
9 IF A>I
10 GOTO 3
11 STOP

Let's step through the preceding program and see where we end up.
1 LET A=10
1. we skip the leading spaces
2. we get the next token (which has a place for the value: [1] and its type: [number])
Since we are assuming the program is correct, we don't really need to save the line numbers anywhere. The statements will get stored in our memory, row by row, starting with line 1 (row 1; for simplicity we'll leave row 0 blank)
It's in the allowable range so we can save it and
3. begin to deal with a statement. First we get the next token (this time we get value: [l] and type: [reserved word]). The type of this token tells us which statement we have begun, and so it tells us what we are looking for next...
switch [reserved word]
LET means we need
    a. <space>
    b. <variable>... value: [A] type [variable]
    c. <equals>...value: [=] type [equals]
    d. <expression> {we need something to deal with expressions}
PRINT means we need
    a. <space>
    b. <variable>
GOTO means we need
    a. <space>
    b. <expression>
IF means we need
    a. <space>
    b. <expression>
STOP means we don't need anything else
4. We're done with this line. Go on to the next line.
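
As a C sketch, that dispatch might look like this, writing descriptors into the current memory row (get_descriptor(), a hypothetical helper, combines the token reader and classifier above; do_expression() is sketched in the next section; Descr1..Descr5 map to columns 0..4):

Descriptor get_descriptor(FILE *src);            /* get_token + classify */
void do_expression(FILE *src, int row, int col); /* sketched below */

void do_statement(FILE *src, int row)
{
    Descriptor d = get_descriptor(src);   /* the reserved word */
    store(row, 0, d);                     /* column 0 holds Descr1 */
    switch (d.name) {
    case 'l':                                  /* LET <var>=<expression> */
        store(row, 1, get_descriptor(src));    /* the <variable> */
        get_descriptor(src);                   /* the '=' is not stored */
        do_expression(src, row, 2);
        break;
    case 'p':                                  /* PRINT <variable> */
        store(row, 1, get_descriptor(src));
        break;
    case 'g':                                  /* GOTO <expression> */
    case 'i':                                  /* IF <expression> */
        do_expression(src, row, 1);
        break;
    case 's':                                  /* STOP: nothing follows */
        break;
    }
}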

Dealing With Expressions:
Whenever we need to deal with (parse) an expression we end up doing the same thing, regardless of where the expression occurs, so we can create one routine to handle expressions in general:
get the next token
if it's a <constant>, we're done
otherwise
if there is a next token
it must be the <operator>
get the last token (which is also a <variable>)
else
it's just the one <variable>
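
In C, that general expression routine might look like this (has_next_token(), which peeks ahead for more tokens on the current line, is a hypothetical helper):

int has_next_token(FILE *src);   /* hypothetical: more tokens on this line? */

void do_expression(FILE *src, int row, int col)
{
    Descriptor d = get_descriptor(src);
    store(row, col, d);                 /* a <constant> or a <variable> */
    if (d.type == D_NUMBER) return;     /* a <constant>: we're done */
    if (has_next_token(src)) {
        store(row, col + 1, get_descriptor(src));  /* the <operator> */
        store(row, col + 2, get_descriptor(src));  /* the last <variable> */
    }
}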

Memory:
If everything went well, when we are done with the lexical analyzer, we will have an internal representation of the original program that has been reduced to a series of lines that contain a number of descriptors, each one represented by two pieces of information. See if you can match the descriptor type/value pairs with the original program (line numbers have been left as they were):

 

LINE #    Descr1    Descr2    Descr3    Descr4    Descr5
1 [2][l] [3][A] [1][10]
2 [2][l] [3][I] [1][0]
3 [2][l] [3][X] [3][I] [5][+] [3][I]
4 [2][l] [3][T] [1][1]
5 [2][l] [3][X] [3][X] [5][+] [3][T]
6 [2][p] [3][X]
7 [2][l] [3][T] [1][1]
8 [2][l] [3][I] [3][I] [5][+] [3][T]
9 [2][i] [3][A] [5][>] [3][I]
10 [2][g] [1][3]
11 [2][s]
Note that we have left out the '=' - it's not needed for the next step so we can omit it.
 
It would not be difficult to store the entire program in an array (100 rows, 5 columns) where each row corresponds to one line and can hold up to 5 descriptors.
 
Since we also know we will have no more than 26 variables, we can set those up ahead of time too (maybe in another array?)
 
We are now ready for the next step:

The Interpreter:
This is the part that takes our intermediate "program" stored in our very own memory and runs it.
 
The interpreter can be made to look and act like a CPU. It needs:
- program counter (simple integer)
- instruction register (a descriptor)
- an address register (a descriptor)
- two places for arguments retrieved from memory (two descriptors)
- an accumulator (a simple integer)
- two registers (both integers)
 
Executing a TinyBasic program will then consist of:
- start at first row of memory (program counter is 1)
- get the first reserved word (into instruction register)
- based on its 'name'
we will retrieve the rest of the arguments
do what we need to
- get the next instruction according to what the program counter says
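
Sketched in C, reusing the memory helpers from earlier (this folds the argument/address registers into the helpers; execute() is filled in under "Executing a Program" below):

int        program_counter = 1;   /* start at the first row of memory */
Descriptor instruction;           /* instruction register */
int        accumulator, reg1, reg2;

int execute(Descriptor instr);    /* dispatch on its 'name'; see below */

void interpret(void)
{
    int running = 1;
    while (running) {
        instruction = fetch(program_counter, 0);  /* the reserved word */
        program_counter++;           /* default: fall through to next line */
        running = execute(instruction);
    }
}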

Executing a Program:
 
Once we have reduced the original program to descriptors, we can begin to interpret them (i.e. run the program). We know the first descriptor in the row always tells us which statement we have, so we don't even have to look at its type. We get its value and proceed:
switch on the value of the first descriptor in the current "line"
l
REG1 = value of first argument (Descr3)
REG2 = value of third argument (Descr5)
OP = value of 2nd argument (Descr4)
ACCUMULATOR = evaluate_the_expression
- set the variable named in Descr2 to ACCUMULATOR
p
retrieve value of Descr2 and print it out
g
ACCUMULATOR = evaluate_the_expression
set program_counter to value of ACCUMULATOR
i
if value of Descr2 > value of Descr4 is false
then program_counter++
s
end program
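
A C sketch of this dispatch (evaluate() is the expression evaluator sketched in the next section; remember that Descr1..Descr5 map to columns 0..4):

#include <stdio.h>

int evaluate(int row, int col);   /* sketched under "Evaluating Expressions" */

/* Returns 0 when the program should stop, 1 otherwise. */
int execute(Descriptor instr)
{
    int row = program_counter - 1;   /* the row this instruction came from */
    switch (instr.name) {
    case 'l':   /* LET: expression is Descr3..Descr5 (columns 2..4) */
        accumulator = evaluate(row, 2);
        set_var(fetch(row, 1).name, accumulator);  /* variable in Descr2 */
        break;
    case 'p':   /* PRINT the variable named in Descr2 */
        printf("%d\n", get_var(fetch(row, 1).name));
        break;
    case 'g':   /* GOTO: jump to the line the expression evaluates to */
        accumulator = evaluate(row, 1);
        program_counter = accumulator;
        break;
    case 'i':   /* IF: when the condition is false, skip the next line */
        accumulator = evaluate(row, 1);
        if (accumulator == 0) program_counter++;
        break;
    case 's':   /* STOP */
        return 0;
    }
    return 1;
}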

 


Evaluating Expressions:
We need to be told the location of the first descriptor (operand). It will be either Descriptor2 or Descriptor3, depending on whether we're processing a LET or a GOTO or an IF.
 
Examine the first descriptor
if it's a variable
    retrieve its value and save it (in REG1)
    If there is a second <operand> (which will be two descriptors "to the right")
        retrieve its value and save it (in REG2)
        retrieve the <operator>
        if it's +
            ACCUMULATOR = REG1 + REG2
        otherwise it must be >
            ACCUMULATOR = REG1 > REG2 ? 1 : 0;
    otherwise there is no second <operand>
        ACCUMULATOR = REG1
otherwise the first descriptor is a <constant>
    this is our ACCUMULATOR
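
And in C (col is 1 for GOTO/IF and 2 for LET, as noted above; as in the earlier sketches, an empty memory slot has type D_NONE, which is how we detect that there is no second operand):

int evaluate(int row, int col)
{
    Descriptor d = fetch(row, col);
    if (d.type == D_NUMBER)             /* a <constant> */
        return d.value;
    reg1 = get_var(d.name);             /* the first <variable> */
    Descriptor op = fetch(row, col + 1);
    if (op.type != D_OPERATOR)          /* no second operand */
        return reg1;
    reg2 = get_var(fetch(row, col + 2).name);
    if (op.name == '+')
        return reg1 + reg2;             /* ACCUMULATOR = REG1 + REG2 */
    return reg1 > reg2 ? 1 : 0;         /* ACCUMULATOR = REG1 > REG2 ? 1 : 0 */
}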
 

The interpreter can continue in this way, with the next statement to execute determined by the program counter, until it gets a STOP
 
And TA-DA!!!! we've got it.

Pictures of the Objects:




Updated: August 6, 2005 12:11 AM