University of Calgary

Polish Postfix Converter
Source: Katrin Becker, 2001

CONTENTS of THIS PAGE...
Introduction (just below)

Ian Craw's (University of Aberdeen) Excellent (!) notes
English Examples & Explanation
What is an Interpreter?
What is Parsing?
Language Definition
Formal Syntax
Outlining a Solution
The Parser: Reading Tokens & Converting them to Descriptors
Parsing Statements
Dealing with Expressions
Evaluating Expressions
Pictures of the Objects

Ian Craw's (University of Aberdeen) Excellent (!) notes..... 

Grammars and Parsing

Formal Languages
Formal Grammars
First Example
Second Example
Property P
Arithmetic Expressions
AE Grammar
Postfix Expressions
Postfix Grammar
Converting Infix to Postfix
FBIE Grammar
Evaluating a Postfix Expression
Questions

Some problems don't lend themselves to solutions described only in terms of objects. This is one of them.

Where to begin.....

At first glance this project may seem like an impossible task, but once it is properly broken down it's not so bad. So let's begin with some terminology:

A lexical analyzer takes the "raw" words in a statement and breaks the statement up into its various components (tokens). The parser then takes over; its job is to determine whether or not these tokens constitute a properly formed statement according to the rules of the language (rules of syntax). The final stage is to determine the meaning (semantics) of the raw words given their context. This is done by the interpreter.

In natural languages (especially English) the rules of syntax are fairly well known, but the interpretation of the meanings of the words is often ambiguous, and the same statement can be broken up in various ways.

For example:


"British Left Waffles on Falkland Islands" can be interpreted at least two ways:

[British Left] [Waffles] [on Falkland Islands]

as [subject] [verb] [prepositional phrase] --- result 1 from the lexical analyzer; the parser says this is syntactically correct; the interpreter may proceed

-or-

[British] [Left] [Waffles] [on Falkland Islands]

as [subject] [verb] [object] [prepositional phrase] --- result 2 from the lexical analyzer; parser says this is also syntactically correct; the interpreter may proceed here too but the meaning (interpretation) is now different.

"Squad helps dog bite victim" can also be interpreted two ways:

[Squad] [helps] [dog bite victim]

as [subject] [verb] [object] --- result 1 from the lexical analyzer; parser says this is syntactically correct; the interpreter may proceed

-or-

[Squad] [helps] [dog] [bite] [victim]

as [subject] [verb, part 1] [object 1] [verb, part 2] [object 2] --- result 2 from the lexical analyzer; parser says this is also syntactically correct; the interpreter may proceed here too but the meaning (interpretation) is again different.

"Enraged Cow Injures Farmer With Ax" can be interpreted two ways too:

[Enraged Cow] [Injures] [Farmer With Ax]

as [subject] [verb] [object] --- result 1 from the lexical analyzer; parser says this is syntactically correct; the interpreter may proceed

-or-

[Enraged Cow] [Injures] [Farmer] [With Ax]

as [subject] [verb] [object] [prepositional phrase] --- result 2 from the lexical analyzer; parser says this is also syntactically correct; the interpreter may proceed here too but the meaning (interpretation) is different.

Well, you get the picture. What we do when we analyze these statements is apply known rules of syntax to decompose the statement into its syntactic components (parts of speech) so we can apply the appropriate meaning (semantics) to each part. Before we even get to the semantics we must isolate each word (called a token) and classify it (as "subject", "verb", etc.). The resultant syntactic components are called descriptors. This allows us to check that the statement is syntactically correct (to parse it). We can conclude that the statement is syntactically correct when the descriptors match one of the known rules of syntax. In other words, if the statement can be reduced to the descriptors [subject] [verb] [object] then it is syntactically correct and we can proceed to the next step, namely to interpret the meaning. Humans are wonderfully adapted to do all of these jobs, and once we "know" a language we can go through these steps without even being conscious of it. The process is more obvious when we are forced to deal with a language with which we are not that familiar.

Computers can interpret languages too, but in order to do this by computer we must find a way of eliminating ALL ambiguities. In human languages, we can adjust our "lexical analyzer" on the fly to resolve ambiguities as they come up. We are very good at using context for this ("It can't mean this because that makes no sense, so it must mean that"). Computers are not. As a result, the rules of syntax for a computer language contain no ambiguities. Each language has a precisely outlined set of syntactic rules and an equally precise set of semantic rules to go along with them. With this done, we can write a program that performs the steps outlined above automatically.
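To make the first stage concrete, here is a minimal Python sketch of a lexical analyzer for the single-digit arithmetic language defined further down this page. The function and token names are illustrative, not part of any required interface:

```python
# A minimal sketch of a lexical analyzer for simple arithmetic
# statements such as "3+4*2;". Token tag names are illustrative.
def tokenize(statement):
    tokens = []
    for ch in statement:
        if ch.isspace():
            continue                    # blanks separate tokens but mean nothing
        if ch.isdigit():
            tokens.append(("NUM", ch))  # single-digit numbers, per the grammar
        elif ch in "+-*/();":
            tokens.append(("SYM", ch))  # operators, brackets, and the terminator
        else:
            raise ValueError("unexpected character: " + ch)
    return tokens

print(tokenize("3 + 4 * 2;"))
# → [('NUM', '3'), ('SYM', '+'), ('NUM', '4'), ('SYM', '*'), ('NUM', '2'), ('SYM', ';')]
```

The parser would then consume this token list and check it against the rules of syntax given below.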


1. What is an Interpreter? (From The Computer Desktop Encyclopedia Copyright © 1981-2000 The Computer Language Company Inc. All rights reserved. Ver. 13.1, 1st Quarter 2000)

A high-level programming language translator that translates and runs the program at the same time. It translates one program statement into machine language, executes it, then proceeds to the next statement. This differs from regular executable programs that are presented to the computer as binary-coded instructions. Interpreted programs remain in the same source language format the programmer wrote in: as text files, not machine language files.

Interpreted programs run slower than their compiled counterparts. Whereas the compiler translates the entire program before it is run, an interpreter translates a line at a time while the program runs. However, it is very convenient to write an interpreted program, since a single line of code can be tested interactively.

Interpreted programs must always be run with the interpreter. For example, in order to run a BASIC or dBASE program, the BASIC or dBASE interpreter must be in the target computer.

If a language can be both interpreted and compiled, a program may be developed with the interpreter for ease of testing and debugging and later compiled for production use.

Interpreters and Compilers
Unlike compiled languages, which have already been translated into machine language, interpreted languages must be translated at runtime. With languages such as dBASE and BASIC, the interpreter translates the original source code. Languages such as Java and Visual Basic use an intermediate language: the source code is compiled into "bytecode," and the bytecode is interpreted at runtime.


2. What is Parsing? (a somewhat more formal explanation)

An algorithm or program to determine the syntactic structure of a sentence or string of symbols in some language. A parser normally takes as input a sequence of tokens output by a lexical analyzer. It may produce some kind of abstract syntax tree as output.


3. Arithmetic Expression "Language" definition:

Its Syntax Rules, or "How to Read the Notation & What it Means"

The Notation......(called BNF, or Backus-Naur Form)

::= means 'is defined as'
<something> means an entity that is further defined (usually later); this is a token
{<something>} means that the thing enclosed in braces is optional and may be repeated (zero or more occurrences)
<X> | <Y> means that either '<X>' or '<Y>' can be used here
all other marks must appear literally 

Rules of Syntax:

Each statement has the following form:
<statement> ::= <expression>;
 
Which means: a <statement> consists of an <expression> followed by a semi-colon

More on "Backus-Naur Form" (the notation used above)

Backus-Naur Form (from FOLDOC (on-line computer dictionary)

<language, grammar> (BNF, originally "Backus Normal Form") A formal metasyntax used to express context-free grammars. Backus Normal Form was renamed Backus-Naur Form at the suggestion of Donald Knuth.

BNF is one of the most commonly used metasyntactic notations for specifying the syntax of programming languages, command sets, and the like. It is widely used for language descriptions but seldom documented anywhere (how do you document a metasyntax?), so that it must usually be learned by osmosis (but see RFC 2234).

Consider this BNF for a US postal address:

<postal-address> ::= <name-part> <street-address> <zip-part>
<personal-part> ::= <name> | <initial> "."
<name-part> ::= <personal-part> <last-name> [<jr-part>] <EOL>
| <personal-part> <name-part>
<street-address> ::= [<apt>] <house-num> <street-name> <EOL>
<zip-part> ::= <town-name> "," <state-code> <ZIP-code> <EOL>

This translates into English as: "A postal-address consists of a name-part, followed by a street-address part, followed by a zip-code part. A personal-part consists of either a first name or an initial followed by a dot. A name-part consists of either: a personal-part followed by a last name followed by an optional "jr-part" (Jr., Sr., or dynastic number) and end-of-line, or a personal part followed by a name part (this rule illustrates the use of recursion in BNFs, covering the case of people who use multiple first and middle names and/or initials). A street address consists of an optional apartment specifier, followed by a street number, followed by a street name. A zip-part consists of a town-name, followed by a comma, followed by a state code, followed by a ZIP-code followed by an end-of-line."

Note that many things (such as the format of a personal-part, apartment specifier, or ZIP-code) are left unspecified. These lexical details are presumed to be obvious from context or specified somewhere nearby.

There are many variants and extensions of BNF, possibly containing some or all of the regexp wild cards such as "*" or "+". EBNF is a common one. In fact the example above isn't the pure form invented for the ALGOL 60 report. "[]" was introduced a few years later in IBM's PL/I definition but is now universally recognised. ABNF is another extension.

(1997-11-23)


Complete Formal Syntax Rules (BNF and Syntax Diagrams):
<statement> ::= <expression>;
<expression> ::=
<number> |
(<expression>) |
<expression><operator><expression>
<operator> ::= + | - | * | / (binary operators only; evaluated left to right; normal precedence: i.e. * and / before + and -)
<spaces> ::= \040 | \040 {<spaces>}
<number> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
 
Full Definition of Expressions adapted from Pascal Language Definition
 
<unsigned constant> ::= <unsigned number>
<factor> ::= <unsigned constant> | (<expression>)
<term> ::= <factor> | <term> <multiplying operator> <factor>
<simple expression> ::= <term> |
    <simple expression> <adding operator> <term> |
    <sign> <factor>
<expression> ::= <simple expression>
<multiplying operator> ::= * | /
<adding operator> ::= + | -
<unsigned number> ::= <unsigned integer> | <unsigned real>
<unsigned integer> ::= <digit> { <digit> }
<unsigned real> ::= <unsigned integer>.<digit> { <digit> } |
<unsigned integer>.<digit> { <digit> } E <scale factor> |
<unsigned integer> E <scale factor>
<scale factor> ::= <unsigned integer> | <sign> <unsigned integer>
<sign> ::= + | -
 

4. Outline of solution:
 
Here's one approach:
The first step, and perhaps the most difficult, is to identify those things in our solution that can be described as objects. To keep the solution as straightforward as possible, we will limit the number of objects we define.
 
Step 1: identify those things that are objects in this problem.
Each object:
- consists of .. (is/ has)
- operates on... (does)
- reports on... (output?)
Some possibilities:
- token
- descriptor
- scanner (the lexical analyzer)
- the stack(s) and deque
 
Step 2: lay out the main algorithm
- usual intro/ initializations
- get the arguments (command line arguments would work well here)
- open the appropriate file(s) 

The Parser:
Reading Symbols and Converting them to Tokens:

5. How to parse statements (with some examples)
Rough outline, without parentheses and with no error detection/recovery:
Approach 1 (Infix to Postfix):
getToken                      // it should be a number
pushD                         // push it onto the deque
getToken                      // should be an operator
while NOT (semi-colon)
    if holding-stack empty OR precedence(current token) > precedence(top-of-stack)
        pushS                 // push it onto the holding stack
    else
        while holding-stack NOT empty AND precedence(current token) <= precedence(top-of-stack)
            popS              // get the operator off the holding stack, and...
            pushD             // push it onto the deque
        pushS                 // push the current operator onto the holding stack
    getToken                  // it should be a number
    pushD                     // push it onto the deque
    getToken                  // should be an operator
END-while
// finally
while holding-stack NOT empty
    popS
    pushD
ALL-DONE

A + B * C - D ==> A B C * + D -
A / B / C ==> A B / C /
A * B + C / D - E + F ==> A B * C D / + E - F +
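The outline above can be sketched in a few lines of Python. This is a hedged sketch, not a required implementation: the names `infix_to_postfix`, `holding`, and `out` are illustrative, and `collections.deque` stands in for the deque object:

```python
from collections import deque

PREC = {'+': 1, '-': 1, '*': 2, '/': 2}

def infix_to_postfix(tokens):
    """Convert a bracket-free infix token list to postfix, following
    the holding-stack / deque outline above (no error detection)."""
    holding = []          # the holding stack (pushS / popS)
    out = deque()         # the deque the postfix expression is built in (pushD)
    for tok in tokens:
        if tok not in PREC:
            out.append(tok)                       # operand: straight to the deque
        else:
            # pop operators of equal or higher precedence first
            while holding and PREC[holding[-1]] >= PREC[tok]:
                out.append(holding.pop())
            holding.append(tok)
    while holding:                                # finally: flush the holding stack
        out.append(holding.pop())
    return list(out)

print(infix_to_postfix(list("A+B*C-D")))
# → ['A', 'B', 'C', '*', '+', 'D', '-']
```

Running it on the three examples above reproduces the postfix forms shown.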

Now, if we want to handle brackets (it isn't that much different from before):
getToken                      // it should be a number or '('
while ('(')                   // opening brackets go onto the holding stack
    pushS
    getToken
pushD                         // push the number onto the deque
getToken                      // should be an operator, ')' or ';'
while NOT (semi-colon)
    if (')')                  // a closing bracket: pop back to the matching '('
        popS
        while NOT ('(')
            pushD
            popS              // the '(' itself is discarded, and so is the ')'
        getToken              // next should be an operator, ')' or ';'
    else
        if holding-stack empty OR (top-of-stack = '(') OR precedence(current token) > precedence(top-of-stack)
            pushS             // push it onto the holding stack
        else
            while holding-stack NOT empty AND NOT (top-of-stack = '(') AND precedence(current token) <= precedence(top-of-stack)
                popS          // get the operator off the holding stack, and...
                pushD         // push it onto the deque
            pushS             // push the current operator onto the holding stack
        getToken              // it should be a number or '('
        while ('(')
            pushS
            getToken
        pushD                 // push the number onto the deque
        getToken              // should be an operator, ')' or ';'
END-while
// finally
while holding-stack NOT empty
    popS
    pushD
ALL-DONE

AND.... if you were paying attention you might have noticed that we don't put the '(' ')' ANYWHERE in the Post-fix version. By our clever "re-arrangement" of the original operators and operands, we have eliminated the need for the brackets. The Post-fix can now be evaluated EXACTLY the same way. If this were a compiler, we would have spent a bit of extra time compiling and generating machine code, but what we gain is efficient evaluation (i.e. execution) of the expressions.


A + B * C - D ==> A B C * + D -
but (A + B) * (C - D) ==> A B + C D - *
A / B * C ==> A B / C *
but A / (B * C) ==> A B C * /
A * B + C / D - E + F ==> A B * C D / + E - F +
but (A * B + C) / (D - E) + F ==> A B * C + D E - / F +
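The bracketed outline above can also be sketched in Python. As before, this is an illustrative sketch (names are not prescribed); the only changes from the bracket-free case are the two branches that hold '(' on the stack and pop back to it when ')' arrives:

```python
from collections import deque

PREC = {'+': 1, '-': 1, '*': 2, '/': 2}

def infix_to_postfix(tokens):
    """Convert an infix token list (with brackets) to postfix,
    following the bracketed outline above (no error detection)."""
    holding = []          # the holding stack
    out = deque()         # the deque the postfix expression is built in
    for tok in tokens:
        if tok == '(':
            holding.append(tok)           # '(' waits on the holding stack
        elif tok == ')':
            while holding[-1] != '(':     # pop operators back to the matching '('
                out.append(holding.pop())
            holding.pop()                 # discard the '(' itself; ')' is never kept
        elif tok in PREC:
            while (holding and holding[-1] != '('
                   and PREC[holding[-1]] >= PREC[tok]):
                out.append(holding.pop())
            holding.append(tok)
        else:
            out.append(tok)               # operand: straight to the deque
    while holding:                        # finally: flush the holding stack
        out.append(holding.pop())
    return list(out)

print(infix_to_postfix(list("(A+B)*(C-D)")))
# → ['A', 'B', '+', 'C', 'D', '-', '*']
```

Note that no bracket ever reaches the output deque, exactly as observed above.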

Dealing With Expressions:
Another Approach: (using the syntax diagrams as shown):
doExpression()
{
    doTerm();
    while ((sy=='+') || (sy=='-'))
    {
        X = sy;
        nextSy();
        doTerm();
        printSy(X);
    }
}

doTerm()
{
    doFactor();
    while ((sy=='*') || (sy=='/'))
    {
        X = sy;
        nextSy();
        doFactor();
        printSy(X);
    }
}

doFactor()
{
    if (sy.isDigit)
    {
        X = sy;
        nextSy();
        printSy(value-of-X);
    }
    else if (sy=='(')
    {
        nextSy();        // consume the '('
        doExpression();
        nextSy();        // consume the ')'
    }
}

Parser()
{
    nextSy();
    doExpression();
}
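The doExpression/doTerm/doFactor routines above translate almost line for line into runnable Python. This is a hedged sketch (single-digit numbers only, no error recovery, names are illustrative) that collects the postfix output in a list instead of printing it:

```python
def to_postfix(text):
    """Recursive-descent translation of an infix statement such as
    "3+4*2;" into a postfix token list, mirroring doExpression /
    doTerm / doFactor above. Assumes well-formed input."""
    tokens = [c for c in text if not c.isspace()]
    pos = 0
    out = []

    def sy():                             # the current symbol
        return tokens[pos] if pos < len(tokens) else None

    def next_sy():                        # advance to the next symbol
        nonlocal pos
        pos += 1

    def do_expression():                  # <term> { (+|-) <term> }
        do_term()
        while sy() in ('+', '-'):
            op = sy(); next_sy()
            do_term()
            out.append(op)                # operator comes out AFTER its operands

    def do_term():                        # <factor> { (*|/) <factor> }
        do_factor()
        while sy() in ('*', '/'):
            op = sy(); next_sy()
            do_factor()
            out.append(op)

    def do_factor():                      # <digit> | ( <expression> )
        if sy() is not None and sy().isdigit():
            out.append(sy()); next_sy()
        elif sy() == '(':
            next_sy()                     # consume the '('
            do_expression()
            next_sy()                     # consume the ')'

    do_expression()
    return out

print(to_postfix("3+4*2;"))
# → ['3', '4', '2', '*', '+']
```

Because the grammar puts * and / one level deeper than + and -, the precedence handling falls out of the call structure; no explicit operator stack is needed here.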

Evaluating Expressions:

You need to use two stacks: one is a regular stack, used for temporary holding of descriptors; the other is a deque. The deque is where we will "build" the postfix expression. While building the expression the deque is treated as an ordinary stack, but once we are finished we will want to pull values off from the "bottom".

Example:
A + B * C - D;
once we are done, the deque looks like this (left is bottom):
A B C * + D -
To evaluate, we need to get at the elements of the deque from bottom to top (i.e. remove items from what is now the bottom of the deque).
What we will do is pull values off, again using the 'regular' stack for temporary holding. When we get an operator, we pop the top 2 values off the 'regular' stack, do the operation, and push the result back on the regular stack. When the deque is empty, the answer will be the only thing left on the stack.
And TA-DA!!!! we've got it.

Once we've got the Post-fix in the deque, we can proceed as follows:
Approach 1 (Evaluate Postfix):
while deque NOT empty
    pullD                 // get the item at the bottom of the deque
    if isOP
        X1 = popS         // get an operand from the holding stack
        X2 = popS         // get another
        R = X2 OP X1      // do the operation
        pushS(R)          // push the result back onto the holding stack
    else
        pushS             // otherwise just push it onto the holding stack
And that's it. Evaluating expressions in Post-fix form is actually quite simple and efficient.
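As a concreteness check, here is a minimal Python sketch of the evaluation loop above, using collections.deque for the deque and a plain list for the holding stack (function and variable names are illustrative):

```python
from collections import deque

OPS = {'+': lambda a, b: a + b, '-': lambda a, b: a - b,
       '*': lambda a, b: a * b, '/': lambda a, b: a / b}

def eval_postfix(postfix):
    """Evaluate a postfix expression held in a deque, pulling items
    from the bottom and using a regular stack for operands."""
    d = deque(postfix)    # the left end is the "bottom" of the deque
    stack = []            # the regular holding stack
    while d:
        item = d.popleft()                    # pullD: take from the bottom
        if item in OPS:
            x1 = stack.pop()                  # top operand (right-hand side)
            x2 = stack.pop()                  # next operand (left-hand side)
            stack.append(OPS[item](x2, x1))   # R = X2 OP X1; push R back
        else:
            stack.append(item)                # operand: onto the holding stack
    return stack.pop()    # the answer is the only thing left on the stack

print(eval_postfix([3, 4, 2, '*', '+']))
# → 11
```

Note the operand order: X2 OP X1, not X1 OP X2, which matters for - and /.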

Pictures of the Objects:



Updated: October 19, 2005 11:32 AM