translator

translator

Parsing
English - Latin Translator

Goals, Skills & Concepts	Introduction & Background	Description
Approach + Hints + HELP	Specifications	Minimal Requirements
('A' Version)	('B' version)	('C' version)
Testing and Marking	Bonuses	Challenge
	Assignment Links:
About Our WeeEnglish Translator (lab notes)	The Word List (text file)
About Parsing (General) (lab notes)	Syntax Diagrams: NounPhrase	OLD Syntax Diagrams: NounPhrase
	Syntax Diagrams: VerbPhrase	OLD Syntax Diagrams: VerbPhrase
Syntax Diagrams: Statement & Sentence	Syntax Diagrams: Adjectve & Adverb Lists	OLD Syntax Diagrams: Adjectve & Adverb
	Classes you can use:
	~becker/Courses/235/A4
Word.java	Dictionary.java	NotAWordException.java
MyQueue.java	ListElement.java	Debug.java

Reference Links:
Grammars and derivations	Intro to Sentence Analysis, Grammars & Parsing
Natural Language Parsing	http://www.geocities.com/adelie3/academics/ latin/LatinContext.htm
http://www.cs.auckland.ac.nz/compsci330s1c/lectures/	http://ancienthistory.about.com/cs/ latingrammar/
http://www.slu.edu/colleges/AS/languages/ classical/latin/tchmat/gr-helps.html	http://www.ling.lu.se/disseminations/ pdf/48/EegOlofsson_Sigurd.pdf
http://www.arts.cuhk.edu.hk/Lexis/Wheelock-Latin/

Goals, Skills and Concepts:

definition and use of classes
composition, aggregation
intro to BNF
intro to formal definition of languages
intro to parsing
encapsulation

Introduction and Background:
What is Parsing?
- It is part of the compilation process when your source program is transformed into executable code.
- It is part of ANY translation process - even for natural languages.
- It is something we are all experts at already.
Suppose you were given the following statement:

Enraged cow injures farmer with axe.

Our goal is to gain meaning from this. The first step is the "lexical analyzer". It breaks the senetence up into individual "things" which we will call tokens. In the sentence above the tokens are mostly words, except the '.' at the end. All are tokens, only some are words.
The lexical analyzer gives us:

Enraged

cow

injures

farmer

with

axe

There are seven tokens. We are using the term "token" rather than "word" - it is more general. "Sentence" refers to a statement made up of tokens. An arithmetic expression can also be a sentence, and its tokens look different:

4.2

;

Once we have things separated into tokens, we can proceed to the parser stage.
The parser's job is to determine if the tokens form a syntactically correct (legal) sentence.
If we look back at our sample statement, we can see that there is more than one way to "parse" this statement. Which one is the right one depends on context, which is one of the things that makes natural languages so tricky (they are "context sensitive"). Computer languages don't usually have this problem (they are usually "context free").

Enraged

cow

injures

farmer

with

axe

Enraged

cow

injures

farmer

with

axe

Specifications:
All sentences are terminated with periods. You may choose to force the user to enter only one sentence per "physical line" or allow multiple, period separated sentences per line as well as sentences that span physical lines.

Empty sentences must be accepted.

Minimal requirements are that your program be able to read any and all legally formed sentences. Your program is to read and parse the sentence and print whether or not this is a valid statement. This must be an object-oriented approach; each class must be contained in a separate source file. You must include a graphical (picture) representation of your classes and their relationships in your documentation. Your program must make use of a queue for storing the sentence and building the translation.

The B solution does all this plus it can recognize invalid sentences (recovery is best but recognition and "bailing" is sufficient). It will also be able to convert the sentence to Latin and print out the Latin form.

Approach:
You are given the following Classes to use as you choose: Debug.java, NotAWordException.java, ListElement.java, MyQueue.java, Word.java, Dictionary.java All can be found in ~becker/Courses/235/A4 along with several data files containing words for the dictionary. Please don't copy the word files into your directory. Please link to them instead. If you are woring at home, you may copy these files as necessary.

See Lab Pages for way too much extra on this.

The Lexical Analyzer part of this exercise (our assignment) is essentially done. We have a Line Class which will read a sentence typed by the user and give it to us as separate strings. We also have a Dictionary Class which we can use to look the strings up and get their matching Words (using the Word Class).

To avoid multiple passes over the input data, we will have the Parse get strings from the Line as needed.

There are many types of parser - the one we will implement is called a "Recursive Descent Top Down Parser". It is in reality a rather poor choice for translating natural languages, but a fairly good choice for many programming languages, and that's why we are using it. It will help us to understand how compilers do their job.

Our parser relfects the syntactical rules defined in our formal grammar (this picture isn't quite it - it's just an example). The formal grammar defines the "rules" for making valid (legal) setences. If we use the syntax diagrams as our guide, we can create a set of functions that will work their way through ANY legal statement. Handling illegal statements is a little trickier, but essentially involves deciding when the token we have is the wrong one.

It's called a Top-Down Parser because it begins with the "highest" syntactic entity ==> the sentence. We begin with the assumption that we have a valid sentence. We know that a sentence starts with a Noun Phrase (Subject in the picture). A Noun Phrase consists of: a pronoun, or an article followed by a noun, or an adjective followed by.... The word we have is an adjective, so we can proceed.

We work our way through the entire sentence this way. If we make it to the end (back at the end of "sentence", we made it and can go on to the next phase ==> TRANSLATION.

'B' Version: All that the 'C' Version does, PLUS:

Handles (recognizes and doesn't blow up) invalid sentences
Translates to Latin
Handles empty sentences.
'A' Version: All that the 'B' Version does, PLUS:

Handles conjunctions recursively.
Prints meaningful diagnostic messages
Testing:
Sample Executables: /home/233/A4
The Word List
Test your program using at least the following sentences :

'C' Version		'B' Version ('C' plus:)		'A' Version ('B' plus:)
give at least 10 sample sentences altogether		give at least 10 more sample sentences altogether		give another 10 sample sentences in addition to the "C" and "B" ones
sentences consisting of	such as	invalid sentences consisting of	such as	sentences consisting of	such as
noun verb	deer disappear .	adjective noun	bitter blockhead	pronoun verb conjunction <sentence>	I bite and deer disappear .
article noun verb	this horse smoke .	noun phrase noun phrase	this statement without verb
article adjective noun verb	this stupid horse smoke .	< pick some other invalid combinations >
adjective noun verb [ allow missing article]	two lord limp .
pronoun verb	I bite .
*< the "C" version does not need to work for any INVALID sentences>*		*<the "B" version should not blow up with invalid sentences, but simply reporting "error found" and terminating is sufficient>*		*< the "A" version should give more detail on the report: "punctuation missing" or "noun expected" >*
		*can* handle sentence made of just period: i.e. shouldn't blow up

BONUSES:

handle other punctuation
cover other parts of speech
handles more kinds of sentences: questions, imperatives,...
Reproduce case sensitivity, i.e. capitalized english word becomes capitalized latin word.

Challenges:
The sky's the limit on this one folks - pick your poison......

Updated: December 20, 2005 05:26 PM