|
Parsing
English - Latin Translator
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Assignment Links:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Classes you can use:
|
|
|
|
|
|
|
|
|
|
|
|
|
- Goals, Skills and Concepts:
- definition and use of classes
- composition, aggregation
- intro to BNF
- intro to formal definition of languages
- intro to parsing
- encapsulation
-
- Introduction and Background:
- What is Parsing?
- - It is part of the compilation process when your source program is transformed into executable code.
- - It is part of ANY translation process - even for natural languages.
- - It is something we are all experts at already.
-
- Suppose you were given the following statement:
-
- Enraged cow injures farmer with axe.
-
- Our goal is to gain meaning from this. The first step is the "lexical analyzer". It breaks the senetence up into individual "things" which we will call tokens. In the sentence above the tokens are mostly words, except the '.' at the end. All are tokens, only some are words.
- The lexical analyzer gives us:
| Enraged |
cow |
injures |
farmer |
with |
axe |
. |
- There are seven tokens. We are using the term "token" rather than "word" - it is more general. "Sentence" refers to a statement made up of tokens. An arithmetic expression can also be a sentence, and its tokens look different:
- Once we have things separated into tokens, we can proceed to the parser stage.
- The parser's job is to determine if the tokens form a syntactically correct (legal) sentence.
- If we look back at our sample statement, we can see that there is more than one way to "parse" this statement. Which one is the right one depends on context, which is one of the things that makes natural languages so tricky (they are "context sensitive"). Computer languages don't usually have this problem (they are usually "context free").
| Enraged |
cow |
injures |
farmer |
with |
axe |
. |
| Enraged |
cow |
injures |
farmer |
with |
axe |
. |
- The parser will attempt to match the tokens to the grammar (as specifiied using BNF or Syntax Diagrams, or some other notation). If this can be done successfully, the statement is deemed to be correct. Once we know the statement is syntactically correct, we can proceed to the last phase: the translator. If we did the parsing correctly, the last phase is quite straight-forward.
- Description:
- Write a program that will read in a simple English sentence and translate it into Latin. Your program must verify that the sentence is syntactically valid before translating it.
- All "terminal symbols" (i.e. words and punctuation) are assumed to consist of a string seperated from the next "thing" by white space. Because of this, all other white space should be ignored. (Note: spaces can appear everywhere / anywhere; we must ignore them. If you wish to allow more complex sentences, blanks should still be redundant).
-
- Specifications:
- All sentences are terminated with periods. You may choose to force the user to enter only one sentence per "physical line" or allow multiple, period separated sentences per line as well as sentences that span physical lines.
- Empty sentences must be accepted.
- Minimal requirements are that your program be able to read any and all legally formed sentences. Your program is to read and parse the sentence and print whether or not this is a valid statement. This must be an object-oriented approach; each class must be contained in a separate source file. You must include a graphical (picture) representation of your classes and their relationships in your documentation. Your program must make use of a queue for storing the sentence and building the translation.
- The B solution does all this plus it can recognize invalid sentences (recovery is best but recognition and "bailing" is sufficient). It will also be able to convert the sentence to Latin and print out the Latin form.
- The A solution adds better error detection / recovery facilities as well as implementing the ability to parse and transform sentences containing conjunctions.
-
- Approach:
- You are given the following Classes to use as you choose: Debug.java, NotAWordException.java, ListElement.java, MyQueue.java, Word.java, Dictionary.java All can be found in ~becker/Courses/235/A4 along with several data files containing words for the dictionary. Please don't copy the word files into your directory. Please link to them instead. If you are woring at home, you may copy these files as necessary.
-
- See Lab Pages for way too much extra on this.
|
The Lexical Analyzer part of this exercise (our assignment) is essentially done. We have a Line Class which will read a sentence typed by the user and give it to us as separate strings. We also have a Dictionary Class which we can use to look the strings up and get their matching Words (using the Word Class).
To avoid multiple passes over the input data, we will have the Parse get strings from the Line as needed.
|
|
There are many types of parser - the one we will implement is called a "Recursive Descent Top Down Parser". It is in reality a rather poor choice for translating natural languages, but a fairly good choice for many programming languages, and that's why we are using it. It will help us to understand how compilers do their job.
Our parser relfects the syntactical rules defined in our formal grammar (this picture isn't quite it - it's just an example). The formal grammar defines the "rules" for making valid (legal) setences. If we use the syntax diagrams as our guide, we can create a set of functions that will work their way through ANY legal statement. Handling illegal statements is a little trickier, but essentially involves deciding when the token we have is the wrong one.
|
|
It's called a Top-Down Parser because it begins with the "highest" syntactic entity ==> the sentence. We begin with the assumption that we have a valid sentence. We know that a sentence starts with a Noun Phrase (Subject in the picture). A Noun Phrase consists of: a pronoun, or an article followed by a noun, or an adjective followed by.... The word we have is an adjective, so we can proceed.
|
|
We work our way through the entire sentence this way. If we make it to the end (back at the end of "sentence", we made it and can go on to the next phase ==> TRANSLATION.
|
Once we know the sentence is valid, the translation is a matter of taking each token, translating it to our "target" language, and printing it out.
- Requirements & Grading: The following is a rough guideline ONLY. The grading rubric is the official marking guide.
- 'C' Version:
- Approach is OO
- Documentation includes UML(ish) class diagrams
- Recognizes a valid sentence (according to the syntax given)
- Uses a queue.
- Prints the contents of the queue.
-
- 'B' Version: All that the 'C' Version does, PLUS:
- Handles (recognizes and doesn't blow up) invalid sentences
- Translates to Latin
- Handles empty sentences.
-
- 'A' Version: All that the 'B' Version does, PLUS:
- Handles conjunctions recursively.
- Prints meaningful diagnostic messages
-
- Testing:
- Sample Executables: /home/233/A4
- The Word List
- Test your program using at least the following sentences :
|
'C' Version
|
'B' Version ('C' plus:)
|
'A' Version ('B' plus:)
|
| give at least 10 sample sentences altogether |
|
give at least 10 more sample sentences altogether |
|
give another 10 sample sentences in addition to the "C" and "B" ones |
|
| sentences consisting of |
such as |
invalid sentences consisting of |
such as |
sentences consisting of |
such as |
| noun verb |
deer disappear . |
adjective noun |
bitter blockhead |
pronoun verb conjunction <sentence> |
I bite and deer disappear . |
| article noun verb |
this horse smoke . |
noun phrase noun phrase |
this statement without verb |
|
|
| article adjective noun verb |
this stupid horse smoke . |
< pick some other invalid combinations > |
|
|
|
| adjective noun verb [ allow missing article] |
two lord limp . |
|
|
|
|
| pronoun verb |
I bite . |
|
|
|
|
| < the "C" version does not need to work for any INVALID sentences> |
|
<the "B" version should not blow up with invalid sentences, but simply reporting "error found" and terminating is sufficient> |
|
< the "A" version should give more detail on the report: "punctuation missing" or "noun expected" > |
|
|
|
can handle sentence made of just period: i.e. shouldn't blow up |
|
|
|
- BONUSES:
- handle other punctuation
- cover other parts of speech
- handles more kinds of sentences: questions, imperatives,...
- Reproduce case sensitivity, i.e. capitalized english word becomes capitalized latin word.
-
- Challenges:
- The sky's the limit on this one folks - pick your poison......