|
Parsing
Lab Notes on Latin-English Translation
|
- Formal Syntax Rules for our WeeEnglish (BNF and Syntax Diagrams):
- <STATEMENT> ::= <Sentence> <PUNCT>
- <Sentence> ::= <NounPhrase> <VerbPhrase> |
- <NounPhrase> <VerbPhrase> <Conjunction> <Sentence>
- <NounPhrase> ::=
- <ProNoun> |
- <ProperNoun> |
- <Article> <AdjectiveList> <Noun> |
- <Article> <Noun> |
- <Noun>
- <VerbPhrase> ::=
- <AdverbList> <Verb> |
- <Verb> <AdverbList> |
- <Verb> <NounPhrase> <AdverbList> |
- <AdverbList> <Verb> <NounPhrase> |
- <Verb>
- <AdjectiveList> ::=
- <AdjectiveList> <Adjective> |
- <nothing>
- <AdverbList> ::=
- <AdverbList> <Conjunction> <Adverb> |
- <Adverb>
- <ProNoun> ::= they | it | he | she .......
- <ProperNoun> ::= name of a person or place
- <Article> ::= <determiner like 'a', 'the',...; see ArticleList in Dictionary>
- <Noun> ::= <person, place, or thing; see NounList in Dictionary>
- <Conjunction> ::= <a word that joins two phrases, like: 'and', 'or', 'although',...; see ConjunctionList in Dictionary>
- <Adverb> ::= <modifies a verb, like: 'slowly', 'hourly', 'lightly', '; see AdverbList in Dictionary>
- <Adjective> ::= <modifies a noun, like: 'slow', 'black', '; see AdjectiveList in Dictionary>
- <Verb> ::= <action see VerbList in Dictionary>
- <PUNCT> ::= . |? | ! | ...
-
-
-
- Outline of solution:
-
- Here's one approach:
- The first step, and perhaps the most difficult is to identify those things in our solution that can be described as objects. To keep the solution as straight-forward as possible we will limit the number of objects we define.
-
- Step 1: identify those things that are objects in this problem.
- Each object:
- - consists of .. (is/ has/ knows)
- - operates on... (does)
- - reports on... (output?)
- Some possibilities:
- maybe token (each "thing" in a sentence) *****
- scanner or parser (the lexical analyzer)
- the queue (where the parser will deposit each token)
- the translator (the part that takes the tokens from the queue and translates them)
-
- Step 2: lay out the main algorithm
- - usual intro/ initializations
- - get the arguments (command line arguments might work well here)
- - open the appropriate file(s)
- The Token: This is a do-dad that often gives people grief. Tokens are not magical entities. In our application each identifiable THING in our sentence is a token. Most of them are words, but some are punctuation. A token is simply a catch-all word, so we don't have to keep saying, "words and punctuation". In some applications there are more than two kinds of things in a sentence. Using tokens also gives us some leeway - we can identify other entities as tokens that are usually not considered to be part of a sentence but that we need in order to write a program to parse sentences. An example of this would be the end of the sentence - that is, the part that comes AFTER the punctuation (the EOLN marker).
- The Queue
- Words VS Tokens
- Parser VS Translator
- The Parser:
- Reading Symbols and Converting them to Tokens:
- each "token" will be a single word but you'll need to be able to skip white space between tokens.
- you'll probably need the following information associated with each token:
- it's value ( "the", "cat", or "happy", etc.)
- it's type (see below)
- you will probably end up with the following "kinds" of tokens:
- noun
- pronoun
- propernoun
- article
- conjunction
- adverb
- adjective
- verb
- punctuation
- end-of-statement
-
- How to Get Started
-
- Rough outline, , no error detection/recovery:
- First Crack:
- Let's take stock: we need to see what we already have to avoid duplication.
- What do we already have?
- 1. that annoying Debug Class
- we sort of know how it works and how to use it so we can afford to ignore this for now.
- 2. A Word Class
- looks big and ugly - we'll come back to it later - for now let's just assume it is an object that deals with the words in the sentence.
- 3. A Dictionary Class
- ditto - let's assume it has lists of words in different languages & ways to search for them
- 4. A ListElement Class
- huh?
- Maybe if we ignore it it won't matter (I have inside information and I know this to be true, but generally speaking, it is inefficient to get stuck on one part of a solution because we don't quite understand it. If we get stuck, we will try to continue on and see if we can come back later with more information and make sense of it then.)
- 5. A MyQueue Class
- We sort of know what a queue is - let's look at this and confirm
- - the goal is to understand this well enough that we can ignore the insides of this too
- 6. NotAWordException.java
- We'll also ignore this for now.
- The Queue: If we can figure out the public stuff of this Queue class we may not have to look at it further, so....
- It uses things of type "Object". Our Word & Token classes are both Objects (by inheritance) so we can use this class to hold either Words or Tokens. To avoid confusion, and since I'm going to be putting Tokens in my queue I'll just pretend I see "Token" where ever there is "Object"
- Things I can do with this queue:
- 1. get a reference to the Token at the front of the Queue (front)
- 2. put a Token in the Queue (add)
- 3. pop a Token off the Queue (pop)
- 4. see if the Queue is empty (empty)
- 5. display the entire contents of the Queue (without changing it) (display)
- I also discover that the Queue class uses the ListElement class but I don't appear to need to know anything about the ListElement class in order to use the Queue class. Good. I'll continue to ignore the ListElement class. Time to move on.
- Take a second look at the Words class - let's re-organize for our purposes:
- It has:
- 1. a bunch of integer codes for the parts of speech (NOUN, VERB, etc.)
- 2. a bunch of booleans that will say if a Word is this particular part of speech (isNoun, etc.)
- 3. access methods for the word, it's type and it's transaltion
- 4. a static method that "translates" a type like "noun" into the integer code (NOUN)
- 5. a display function
- 6. some constructors
- This makes it a little less intimidating. Continuing....
- The Dictionary Class:
- The constructor opens a text file, reads it, and builds a dictionary.
- There is an add method that lets me add a word to the dictionary (it assumes English is first)
- There is a find that uses a simple String (and ENGLISH or LATIN) and there is one that uses a Word - but it only uses the first part of the Word - it ignores the translation. There are actually a whole bunch of find functions - they all have the same basic goals, but they take different parameters and some behave a little differently from the others.
- I can ignore the rest of the stuff in that class.
- We have established what we HAVE.
- Next step is to start figuring out what we NEED.
- What do we need?
- 1. We need to be able to cope with lines of text read from the keyboard. We can use tio to help us or do our own. We only need something to read a line of text so it doesn't need to be complicated. This will do it:
- public int getNewLine() {
- System.out.print("Please enter your sentence. \nSeperate each thing with"
- + " at least one blank \n- " +
- "including the punctuation at the end .\n");
- try {
- String line = in.readLine();
- pos = 0;
- max = line.length();
- return line.length();
- }
- catch (IOException E) {
- System.out.print("getNextLine::IO Error detected.\n\n");
- max = 0;
- return 0;
- }
- }
|
| where in (used in "in.readLine()") is: |
- BufferedReader in =
- new BufferedReader(new InputStreamReader(System.in));
|
| We need try & catch because this object can throw an IOException and Java forces us to handle them. |
- What we are aiming for is something to "capture" words. We will assume words are seperated from each other by white space. This will also be true for the punctuation. The end-of-line will signify the end of the sentence even if we don't have the punctuation yet. For this we can make a Line Class. It should have:
- a string : one whole line of input
- a pos : an integer keeping track of where we are now in the line
- max : the length of the line
It should be able to do:
- getNewLine - prompt the user for a new line and set it up in the class
- getNext - return the next non-blank string from the current line
- moreLine : true until we see the end-of-line
We can write this and build a little test program to make sure it works. This should do it:
- public static void main (String args[]) {
- String s;
- Line l = new Line();
-
- try {
- Debug.setFlag(true);
- System.out.print("Here we go...\n");
- l.getNewLine();
- while (l.moreLine()){
- s = l.getNext();
- System.out.print("=" + s + "=\n");
- }
- System.out.print("Done.\n\n");
- }
- catch (Exception E) {
- System.out.print("Uh-Oh!: " + E + "\n\n");
- }
- }
|
After this there is almost nothing left for the Token to do and if it weren't for that pesky punctuation, we could probably avoid using tokens altogether and deal directly with Words. Oh well - the Token class is easier now.
- Now, What do we need?
- There's that Token do-dad, the scanner (aka lexical analyzer, aka parser), and the translator.
- Let's see if we can figure out how this will need to be packaged (as in, what does the main program look like and do?):
The main program will need:
- the Dictionary
- a queue of Tokens ( we will give the queue to the parser to fill up)
- a parser (which needs to know about the queue and the dictionary and needs to use the line)
- a translator (which needs the queue of tokens but not the dictionary)
Maybe if the parser is the only one that needs the dictionary it should be the one to own it? On the other hand, it might allow us more flexibility later if we gave the dictionary to the translator too. This opens up the possibility of providing the translator with a different dictionary at some other point.
So now, what does the main program look like?
- public static void main (String args[]) {
- try {
Dictionary dic = new Dictionary("List.txt");
- Queue Q = new Queue();
- Parser P = new Parser(dic,Q);
- Translator T = new Translator(dic,Q);
- Line l = new Line();
-
- Debug.setFlag(true);
- System.out.print("Here we go...\n");
- P.go(l); // start the parser on the current line
- Q.display(); // show the contents of the queue
- T.go(); // do the translation
- System.out.print("Done.\n\n");
- }
- catch (Exception E) {
- System.out.print("Uh-Oh!: " + E + "\n\n");
- }
- }
|
- The queue stuff is done. The translator will simply go through the queue, one token at a time and print out the translation to go with the word we have. That's not too hard so we can do that now and test it by putting a bunch of made-up words in the queue or we can leave it till later.
-
- That pretty much just leaves the Parser. That's for the next section........
- Dealing With Syntax:
- Second Crack: (using the syntax diagrams as shown):
- The Syntax Diagrams actually show us what the parser should be doing - and it is broken up into parts already. Each one of the diagrams can be translated into an algorithm. Here's how. The following is an algorithm for the "Sentence" syntax diagram.
The basic idea is that each diagram (not each slide but each set of connected entities) can be translated into a function. If all we are interested in is determining whether we have a legal sentence or no, then the result of each function can be true or false. An alternative is to throw an exception when we discover a problem. The sample below doesn't really handle errors, but it DOES follow through a legal sentence properly. Code should be designed for the correct input - errors should be treated as special conditions.
NOTE: This is NOT complete code. This is provided as a guide. If you can't follow and understand it, typing it in and hoping it will work , won't.
- doSentence()
- {
-
- // we're assuming each part will return with the NEXT
-
- // thing in the line - the one we HAVEN'T handled yet
- doNounPhrase(); // handle the noun phrase
- doVerbPhrase(); // handle the verb phrase
- if ((symbol.isConj()) // this part is optional in the sentence
- {
- queue.push(symbol); // push it on the queue
- symbol = symbol.getNext(); // get the next token
- doSentence(); // parse a sentence
- }
- else if ((symbol.isPunct()) {
- queue.push(symbol); // push it on the queue
- symbol = symbol.getNext(); // get the next token
- }
- else {
- error();
- }
- }
-
-
- Translating Sentences:
-
- The translator will simply go through the queue, one token at a time and print out the translation that goes with the word we have.
-
- And TA-DA!!!! we've got it.
- Once we've got the syntactically correct words in the queue, we can proceed as follows:
- Translate:
- while queue not empty
- pop the next token off the queue
- print out it's translation
-
- And that's it.
-
-
- Pictures of the Objects: