* The Morphonizer

The Morphonizer's job is to take a word and divide it into a root morpheme and inflectional affixes. Since English has only inflectional suffixes, we will implement suffixes. We will also implement prefixes since they can be useful in handling certain forms. We will not implement infixes; however, we will design this module so that infixes can be added later, in case we decide to parse Turkish or German.

** Design Specification

*** Input

The Morphonizer will take as input a character string representing a single English word, suffixes and all. This may require some preprocessing in order to extract the string from some data structure containing it (such as a Token).

The Morphonizer should produce the correct morphonetic analysis (amongst several incorrect ones) for the following inputs:

  Single morphologically complex words, e.g. "turned", "toppings"
  Capitalized or mixed-case words, e.g. "BMW's", "RUNNING"
  Numbers, e.g. "eighty-six", "86"
  Tags, e.g. "", ""

It does not have to handle:

  Multiple-word inputs, e.g. "shooting the breeze"

*** Output

The Morphonizer will output a set of all possible morphological divisions. Each division consists of a root and zero or more affixes. The output will always contain at least one item: the 'trivial' morpheme, which has zero affixes and the entire word as its root. Affixes will be returned in some ordered form so one can easily identify the morphological breakdown of the word.

*** Knowledge

The knowledge contained by the Morphonizer is a set of rules that can be applied to a string. In addition, the Morphonizer must keep a list of all irregular morphemes, and use them as necessary. We must categorize this knowledge into a formal structure so we can understand its requirements.

We'll think about the regular rules first. Suffixes go into several classes.
For example, the plural suffix in the words 'messes', 'leaves', 'flies', and 'bugs' has the same semantic connotation in each, but a different morphological form. So we will classify our suffixes into several 'suffix classes'. The suffix classes -S, -ED, and -ING will henceforth be denoted as 'morphsemantic' affixes, since they differ in semantics. The individual suffixes, such as -s, -es, and -ies, which have the same semantics (pluralizing a noun or rendering a verb 3rd-person-singular) but different phonetic use (which one gets used depends on the pronunciation of the word), will henceforth be denoted as 'morphonetic' affixes.

Many examples won't fit into this model. Of course, this model does not have to be complete, either here in the design or in the implementation; many individual morphological variations will have to be handled by the irregular forms dictionary. For example, the word 'men' is an irregular morphological variation on 'man', so it will have an entry in the irregular dictionary, denoting it as "man + S".

Morphsemantic: -S
  Constraint: Root.category is one of {noun, verb}
              if (Root.category == noun) Root.number == singular
              if (Root.category == verb) Root.tense == present or 3rd_sing
              Any concepts without (count +) get deleted.
              There must be at least one concept left.
  Change:     if (Root.category == noun) Root.number = plural
              else if (Root.category == verb)
                if Root.tense == 3rd_sing, Root.number = plural
                else Root.tense = 3rd_sing
  Morphonetic: -s
    Constraint: Root does not end with [suxz]
    Examples:   hobos, tastes
  Morphonetic: -es
    Constraint: Root ends with [szhoux]
    Examples:   glasses, crashes, tomatoes
  Morphonetic: -ies
    Constraint: Root does not end with [aeiou]
    Change:     Append 'y' to Root
    Examples:   flies, spies
  Morphonetic: -ves
    Change:     Add 'f' to Root
    Examples:   knives, leaves

Morphsemantic: -'S
  Constraint: Root.category is a noun
              Root.possessive == False
  Change:     Root.possessive = True
  Morphonetic: -'s
    Examples:   class's, Caterpillar's
  Morphonetic: -s'
    Change:     Add 's' back to word
    Examples:   classes', flies'
    (This handles the possessive plural; it converts it to the normal plural.)

Morphsemantic: -ER
  Constraint: Root.category is in {Adjective, Adverb}
              Root.degree == normal
              Root.final_double == False
  Change:     Root.degree = comparative
  Morphonetic: -er
    Examples:   thicker, higher, taller
  Morphonetic: -er
    Change:     Add 'e' back to root
    Examples:   larger, paler
  Morphonetic: -ier
    Constraint: Root does not end with [aeiou]
    Change:     Append 'y' to root
    Examples:   happier, friendlier
  Morphonetic: -cker
    Change:     Add 'c' back to root
    Examples:   ???

Morphsemantic: -EST
  Constraint: Root.category is in {Adjective, Adverb}
              Root.degree == normal
              Root.final_double == False
  Change:     Root.degree = superlative
  Morphonetic: -est
    Examples:   thickest, highest, tallest
  Morphonetic: -est
    Change:     Add 'e' back to root
    Examples:   largest, palest
  Morphonetic: -iest
    Constraint: Root does not end with [aeiou]
    Change:     Append 'y' to root
    Examples:   happiest, friendliest
  Morphonetic: -ckest
    Change:     Add 'c' back to root
    Examples:   ???

Morphsemantic: -ED
  Constraint: Root.category == verb
              Root.tense == present
              Root.final_double == False
  Change:     Root.tense = past
  Morphonetic: -ed
    Examples:   matched, stacked
  Morphonetic: -ed
    Change:     Add 'e' back to root
    Examples:   staged, filed
  Morphonetic: -ied
    Constraint: Root does not end with [aeiou]
    Change:     Append 'y' to root
    Examples:   fried, spied
  Morphonetic: -cked
    Change:     Add 'c' back to root
    Examples:   picnicked, panicked

Morphsemantic: -EN
  Constraint: Root.category == verb
              Root.tense == present
              Root.final_double == False
  Change:     Root.tense = participle (past)
  (Same morphonetic rules as -ED.)

Morphsemantic: -ING
  Constraint: Root.category == verb
              Root.tense == present
              Root.final_double == False
  Change:     Root.tense = gerund
  Morphonetic: -ing
    Examples:   matching, stacking
  Morphonetic: -ing
    Change:     Add 'e' back to root
    Examples:   staging, filing
  Morphonetic: -cking
    Change:     Add 'c' back to root
    Examples:   picnicking, panicking
  Morphonetic: -ying
    Change:     Add 'ie' back to root
    Examples:   lying, dying

Morphsemantic: DOUBLE
  Constraint: Root.final_double == True
  Change:     Root.final_double = False
  Morphonetic: (empty)
    Constraint: Last two letters are identical
    Change:     Strip off last letter
    Examples:   bigg (-er), stopp (-ing)

In reality, these morphological rules are composed of many phonological rules. It is quite plausible that these rules should be collapsed into several 'rule classes'. But this should probably wait until the morphological rules we will use have been formalized further.

Some morphonetic rules are not represented by an affix. For example, many verbs double their last consonant when adding -ed or -ing. This will be implemented as a morphonetic rule that doesn't add its own affix.

Also, many words are irregular in their application of morphonetic rules. This will be solved by listing them in an 'irregular' dictionary.
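
As a rough illustration of the morphonetic rules listed above, here is a minimal C++ sketch of two of them written as constraint-plus-change functions on an affix-stripped root. The function names (strip_suffix, restore_y, undouble) and the use of std::string in place of String are assumptions made for this sketch, not part of the design.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: strips `suffix` from the end of `word` into `root`.
// Refusing to strip the whole word is an assumption of this sketch.
bool strip_suffix(const std::string& word, const std::string& suffix,
                  std::string& root) {
    if (word.size() <= suffix.size()) return false;
    const std::size_t cut = word.size() - suffix.size();
    if (word.compare(cut, suffix.size(), suffix) != 0) return false;
    root = word.substr(0, cut);
    return true;
}

// -ies, -ied, -ier, -iest: the root must not end with [aeiou]; append 'y'.
// Leaves the root untouched when the constraint fails.
bool restore_y(std::string& root) {
    assert(!root.empty());
    if (std::string("aeiou").find(root.back()) != std::string::npos) return false;
    root += 'y';
    return true;
}

// DOUBLE: the last two letters must be identical; strip the last one.
bool undouble(std::string& root) {
    const std::size_t n = root.size();
    if (n < 2 || root[n - 1] != root[n - 2]) return false;
    root.pop_back();
    return true;
}
```

For example, stripping "ies" from "flies" leaves "fl", which restore_y turns into "fly"; undouble turns "stopp" into "stop" and rejects "stop" itself. Each function returns False without modifying the string when its constraint fails.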

So we can design our 'morphonetic rules database' to have records of the following elements:

  Morphonetic Affix
  Morphsemantic Affix
  Morphonetic Constraint
  Morphonetic Change

There should also be a 'morphsemantic' database which has records of the following elements:

  Morphsemantic Affix
  Morphsemantic Constraint
  Morphsemantic Change

Since applying the semantic changes of an affix is not the responsibility of the Morphonizer, we will defer discussion of the morphsemantic database until later. The Morphonizer will only need the morphonetic database. Construction of both databases is a task for some other module (or modules) and will be addressed thoroughly in another design spec. The Morphonizer need not worry about constructing either database; it need only call the constructor for the morphonetic database.

In addition, the Morphonizer should maintain a database of irregular forms, such as "men", "taken", and "worst". Each record in this database should contain the following elements:

  Irregular form
  Root
  Morphsemantic Affix

So, for our three examples, the database would contain:

  {"men", "man", -S}
  {"taken", "take", -EN}
  {"worst", "bad", -EST}

It is perfectly legal for the root itself to have further morphological breakdowns, though I don't think there are any such roots in English. However, the Morphonizer will check, just to be safe.

Each affix may have a 'variable' slot, which holds a 'parameter' that may apply to the affix. For example:

  {"eighty-six", "six", +NUMBER(80)}

*** Expected Behavior

The Morphonizer should try every morphonetic rule outlined above when parsing a word.

The Morphonizer should never signal an error. It shall always output at least the trivial morpheme, which consists of the original word as root, with no affixes.

The Morphonizer should optionally output copious log messages indicating how it is interpreting the word. For each component it identifies, it will output a log message indicating the component it 'discovered'.
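
The irregular-forms database described under Knowledge could be sketched in C++ roughly as follows. The names IrregularEntry, make_sample_dictionary, and lookup are hypothetical, std::multimap stands in for the Multimap template, and the sample records are only illustrative. A lookup that finds nothing returns an empty list rather than signalling an error.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One record of the irregular-forms database (field names are assumptions).
struct IrregularEntry {
    std::string root;   // e.g. "man"
    std::string affix;  // morphsemantic affix id, e.g. "-S"
    std::string parm;   // optional parameter, e.g. "80"; empty if unused
};

// A multimap, since one irregular form may have several analyses.
using IrregularDictionary = std::multimap<std::string, IrregularEntry>;

// Builds a tiny sample dictionary for illustration only.
IrregularDictionary make_sample_dictionary() {
    IrregularDictionary d;
    d.emplace("men", IrregularEntry{"man", "-S", ""});
    d.emplace("worst", IrregularEntry{"bad", "-EST", ""});
    d.emplace("eighty-six", IrregularEntry{"six", "NUMBER", "80"});
    return d;
}

// Returns every record stored for `form`; empty when the form is regular.
std::vector<IrregularEntry> lookup(const IrregularDictionary& d,
                                   const std::string& form) {
    std::vector<IrregularEntry> out;
    const auto range = d.equal_range(form);
    for (auto it = range.first; it != range.second; ++it) {
        out.push_back(it->second);
    }
    return out;
}
```

With this sample data, looking up "men" yields one record with root "man", and looking up a regular form such as "turned" yields an empty list, leaving the regular rules to handle it.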

** Implementation Specification

*** Algorithms

The Morphonizer shall use the following algorithm:

  Create the trivial morpheme using the initial word.
  Construct the morphonetic database. (Actual construction is left as an exercise for another module.) Also load the irregular forms.
  Call the function Apply_Morphonetic_Rules with the trivial morpheme.
  Return the result.

The Apply_Morphonetic_Rules function takes a morpheme. It returns all possible morphological divisions of the morpheme. For this function we need to distinguish a sub-morpheme that should have more sub-morphemes but doesn't yet from a sub-morpheme that has a complete nest of sub-sub-morphemes. We'll call the former shallow sub-morphemes and the latter deep sub-morphemes.

Apply_Morphonetic_Rules must also handle the sheep dilemma. This applies to irregular forms like 'sheep' which is the plural of 'sheep', which is also the plural of 'sheep', which is... We will solve this problem by having Apply_Morphonetic_Rules take an optional parameter indicating whether or not to use the irregular dictionary to look up a morphological variant, which defaults to True. Then, when calling itself recursively, it instructs itself not to use the irregular dictionary if a morpheme matches its sub-morpheme with the IRREGULAR affix. Thus the infinite recursion on 'sheep' is stopped after 1 call, which is exactly what we want.

This function will work as follows:

  Start with an empty list of sub-morphemes.
  Let the rule list be defined as all the morphonetic rules plus the irregular dictionary.
  For each morphonetic rule in the rule list:
    Try to apply it to the morpheme (use the morpheme's string in lower case).
    If it won't apply, go to the next one.
    If it does apply, it will return a shallow sub-morpheme.
    Convert the case of the shallow sub-morpheme's root back to the original case of the morpheme.
    (For irregular forms, look up the morpheme's string in the irregular dictionary.
    For each match found, construct a shallow morpheme by copying the morpheme returned by the irregular dictionary, and adding an Irregular 'affix' to the morpheme's sub-morpheme, so the morpheme is actually 2 sub-morphemes.)
    Recursively call Apply_Morphonetic_Rules on the shallow sub-morpheme returned by the above apply function. This returns a list of deep sub-morphemes.
    Call Splice with the list of deep sub-morphemes and the original morpheme. It will return a list of deep sub-morphemes that include the original morpheme.
    Concatenate this list to the initial empty list.
  Return the initial list of sub-morphemes.

The Splice function is used on a shallow morpheme and a list of deep morphemes, each of which belongs as the sub-morpheme of the shallow morpheme. So a copy of the shallow morpheme needs to be made for each deep sub-morpheme.

  Start with a list of morphemes with one element, the shallow morpheme.
  For each deep morpheme:
    Clone the shallow morpheme.
    Attach the deep sub-morpheme to the clone.
    Add the new clone to the list of morphemes.
  Return the list of morphemes.

The actual application of a single morphonetic rule to a morpheme is fairly straightforward. Since this is the application of a rule to a string, it is well documented in rules.txt.

**** Example

Let's see how the Morphonizer's algorithm operates on the word "Stoppings". After creating the trivial morpheme, it calls Apply_Morphonetic_Rules on "Stoppings".

In Apply_Morphonetic_Rules:

  It looks up "Stoppings" in the Irregular_Dictionary, to no avail.
  It also tries to apply every morphonetic rule, and only one works. This rule yields: "Stopping" + -s.

Then Apply_Morphonetic_Rules gets called on "Stopping" + -s:

  It looks up "Stopping" in the Irregular_Dictionary, to no avail.
  It also tries to apply every morphonetic rule, and two work:
    "Stopp" + -ing
    "Stoppe" + -ing
  It then calls Apply_Morphonetic_Rules on each of these and gets one more breakdown: "Stop" + DOUBLE. No breakdowns are possible with "Stoppe".

After splicing is called, here are the resulting morphemes:

  Stoppings
  Stopping + -S
  Stoppe + -ING + -S
  Stopp + -ING + -S
  Stop + DOUBLE + -ING + -S

(Presumably the Wordifier will determine the correct morpheme when it discovers that only "stop" exists in the word dictionary. The Idiomifier will do likewise.)

*** Object Diagram

[Object diagram. Boxes, with the members shown in each:
  Morphonizer (Module< String, Set< Morpheme> >): knowledge(), execute()
  Morphonetics (Rules< Morphonetic>)
  Irregular_Dictionary (Multimap< String, Morpheme>)
  Morphonetic (Rule< Morpheme>): type, prepare(), splice()
  Morphsemantic
  Knowledge
  Morpheme: root(), create(), clone(); links labelled 'text' (to String) and 'sub-morpheme' (to itself)
  Affix: parm, type
  String]

The Morphsemantic class is a semantic identifier for affixes. It associates with several Morphonetic rules, each of which lives in the Morphonetic class.

Morphemes exist in a linear hierarchy, similar to a linked list. A Morpheme always contains a String. This is the textual representation of a morpheme. A Morpheme must either contain both an affix and a sub-morpheme, or it must contain neither, in which case the morpheme is the most atomic 'root' morpheme. The first Morpheme contains the same string as was given.
Subsequent morphemes contain near-subsets of this string, although morphological changes may alter the representation, hence it is kept for each morpheme. The bottommost morpheme is the root morpheme, which gets looked up in a Word Dictionary later on.

The Morphonizer stores as Knowledge the Irregular_Dictionary and Morphonetics classes. The Morphsemantic class will make an appearance later in the Morphsemizer, where it will contain the Semantic Changes attributes discussed above. There will also be a database of Morphsemantics, consulted by the Morphsemizer in constructing meanings.

*** Classes

Here are the classes supported by this module:

  Affix
  Morpheme
  Morphonetic
  Morphonetics
  Irregular_Dictionary
  Morphonizer

**** Affix

This represents the affix involved in a morphology rule.

***** Data

String id
  Indicates which morphsemantic affix is being identified here.

String parm
  This is an optional variable that may be filled by a Morphonetic rule and used by a Morphsemantic rule. For example, a Morphonetic rule analyzing numbers might use this parm to indicate a numeric value.

Affix_Type type
  Indicates whether this is a prefix, suffix, or infix.

***** Methods

None

**** Morpheme

This class represents morphological components of a string. A single Morpheme can represent 0 divisions, in which case the string is considered already in root form, or it can contain a morphological affix that can be removed from the string, along with a simpler morpheme, which itself can contain more sub-morphemes. In this manner, one can identify a linked list of sub-morphemes of a single word down to its root form.

***** Data

Affix affix
  Indicates the affix identified in this morpheme. Since a morpheme can represent 0 affixes, this may be NULL. The Morpheme does not allocate or deallocate Morphsemantics since they are persistent, and will outlive the Morpheme.
  (In order to eliminate a cyclic dependency between the Morphonizer and Morphsemizer, the affix here will actually be a String, and Morphsemantic objects will be identifiable by their String ids. Thus the Morphonizer does not need to know what a Morphsemantic is.)

Hard_Pointer< Morpheme> sub-morpheme
  Indicates the sub-morpheme identified in this morpheme. Since a morpheme can represent 0 sub-morphemes, this may be NULL.

String text
  This contains the textual representation of a morpheme. For the topmost morpheme, this will be the original word; for the bottommost, this will be the root morpheme. Usually a sub-morpheme's text will be a substring of a morpheme's text, but not always: a morpheme's affix may include rules to change a text in any way suitable for the affix.

***** Methods

virtual Morpheme* clone() const
  Returns a duplicate of this object, which must be freed later.

virtual Morpheme* create() const
  Returns a default object of the same class, which must be freed later.

const String& root() const
  Returns the root text of a morpheme. Done by traversing down the morpheme hierarchy to the end and returning the text found there.

**** Morphonetic (Rule< Morpheme>)

This class represents a particular morphonetic rule, how it can be used, and how it changes a word (in textual representation only). Semantic changes are stored elsewhere.

***** Data

Affix affix
  Indicates the affix identified in this morpheme. Since a morpheme can represent 0 affixes, this may be empty.

String text
  This is the actual text of the affix. A string must contain this text (at the end if this is a Suffix or the beginning if this is a Prefix) for this affix to be applicable.

Hard_Pointer< Code< String> > code { bool code(String& x) }
  While provided by the Rule template, this data is much more specific for this class, so it is best to describe it specifically: This code should take a String. This string will be the Morpheme's root, with the Morphonetic affix stripped.
  This function should then return False if the rule may not apply to this root, and it should not change the string it was given. If the rule may apply, this function may make any changes to the string deemed necessary, and return True. If this value is NULL, then no constraint is checked and no change is made. In other words, this rule is always applicable, and it makes no changes to the affix-stripped String.

***** Methods

bool apply( Morpheme& x)
  This method is provided by the Rules template, but its functionality is explicitly described here. Tries to apply this Morphonetic to this Morpheme. Will be used by the rules module. It should call the user-defined apply-rule function on the Morpheme's root, with the affix stripped off. If successful, it should make any changes to the Morpheme warranted by the user-defined code, and return True. Otherwise, it should return False, leaving the Morpheme unchanged. If no user-defined apply-rule function is supplied, then this function will leave the Morpheme unchanged and quietly return True.

bool prepare( String& x)
  This method strips the affix off the String; it is part of the apply() method described above.

virtual void write(ostream& o) const
  Write operator. Outputs the morphonetic rule data. Does not have to output the user-defined function.

virtual void read(istream& i)
  Read operator. Reads in the morphonetic rule data. Does not have to input the user-defined function (that will be set to NULL).

**** Morphonetics (Rules< Morphonetic>)

This class represents all the morphonetic rules used by the Morphonizer.

***** Data

None

***** Methods

None

**** Irregular_Dictionary (Multimap< String, Morpheme>)

This class represents the information needed to analyze an irregular form. It is descended from Knowledge.

***** Data

None

***** Methods

None

**** Morphonizer (Module< String, Set< Morpheme> >)

This class represents this design spec's module. The Morphonizer takes a String as input, and yields a Set of Morphemes as output.
Its knowledge is the Morphonetics rules and the Irregular_Dictionary. It has no submodules.

***** Data

Rule_DB_Shlib< Morphonetics> rdm
  This contains the database needed to load the Morphonetic rules. This module's knowledge comes from two databases: this one, and the Dictionary database, defined in the Lexifier. The parameter name for this database will be MORPHONETIC.

Irregular_Dictionary* id
  The irregular dictionary. Created by a database outside this module.

***** Methods

Set< Morpheme> execute(const String& input)
  The execute function. Will run the algorithm described above.

virtual Vector< Soft_Pointer< Knowledge> > knowledge() const
  This returns a list of all Knowledge this Module uses.

void splice(Morpheme& top, Set< Morpheme>& sm) const
  Implements the splice algorithm described above. Takes a set of sub-morphemes, adds this one to each, and returns the resulting set of morphemes.
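
To tie the algorithm and the splice method together, here is a minimal C++ sketch of the recursion, under several assumptions that are not part of the design: std::string, std::vector, and std::shared_ptr stand in for String, Set, and Hard_Pointer; only four suffix rules are included; the input is already lower case; and the irregular dictionary and case restoration are omitted. The names analyze, apply_rules, and format are hypothetical. The sketch also assumes that the trivial morpheme is added at the top level rather than inside the recursive function.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <vector>

struct Morpheme {
    std::string text;               // textual form at this level
    std::string affix;              // morphsemantic id; empty for a root
    std::shared_ptr<Morpheme> sub;  // null for a root
};

struct Rule {
    std::string affix;                       // morphsemantic id
    std::string text;                        // suffix text; may be empty
    std::function<bool(std::string&)> code;  // constraint + change; may be null
};

// Splice: the shallow morpheme itself, plus one clone per deep sub-morpheme.
std::vector<Morpheme> splice(const Morpheme& shallow,
                             const std::vector<Morpheme>& deep) {
    std::vector<Morpheme> out{shallow};
    for (const Morpheme& d : deep) {
        Morpheme clone = shallow;
        clone.sub = std::make_shared<Morpheme>(d);
        out.push_back(clone);
    }
    return out;
}

// Returns every non-trivial division of `word` (suffix rules only).
std::vector<Morpheme> apply_rules(const std::string& word,
                                  const std::vector<Rule>& rules) {
    std::vector<Morpheme> result;
    for (const Rule& r : rules) {
        if (word.size() <= r.text.size()) continue;
        const std::size_t cut = word.size() - r.text.size();
        if (word.compare(cut, r.text.size(), r.text) != 0) continue;
        std::string root = word.substr(0, cut);
        if (r.code && !r.code(root)) continue;  // constraint failed
        Morpheme shallow{word, r.affix,
                         std::make_shared<Morpheme>(Morpheme{root, "", nullptr})};
        for (const Morpheme& m : splice(shallow, apply_rules(root, rules)))
            result.push_back(m);
    }
    return result;
}

// Renders a chain as "root + AFFIX + AFFIX", innermost affix first.
std::string format(const Morpheme& m) {
    return m.sub ? format(*m.sub) + " + " + m.affix : m.text;
}

// Top level: the trivial morpheme first, then every division found.
std::vector<std::string> analyze(const std::string& word) {
    static const std::vector<Rule> rules = {
        {"-S", "s", [](std::string& r) -> bool {
             return std::string("suxz").find(r.back()) == std::string::npos; }},
        {"-ING", "ing", nullptr},
        {"-ING", "ing", [](std::string& r) -> bool { r += 'e'; return true; }},
        {"DOUBLE", "", [](std::string& r) -> bool {
             const std::size_t n = r.size();
             if (n < 2 || r[n - 1] != r[n - 2]) return false;
             r.pop_back();
             return true; }},
    };
    std::vector<std::string> out{word};
    for (const Morpheme& m : apply_rules(word, rules)) out.push_back(format(m));
    return out;
}
```

Calling analyze("stoppings") yields five entries, matching the example above apart from letter case and ordering: "stoppings", "stopping + -S", "stopp + -ING + -S", "stop + DOUBLE + -ING + -S", and "stoppe + -ING + -S". Every rule application shortens the string, so the recursion terminates without needing the sheep safeguard, which only matters once the irregular dictionary is consulted.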