\ Assignments - Homework 7 - 15-121 - Fall 2012
Carnegie Mellon University Website Home Page
 
Fall 2012

15-121 Homework 7: A Complete Anagram Generator - Due 12/5 at midnight

Download and unzip hw7-code.zip, which contains all the files you will need for this assignment. You will be writing code in the AnagramMap.java file. The goals of this assignment are:

  • to familiarize you with Java maps and sets
  • to gain more experience with using the Java API
  • to gain more experience in creating a solution to a problem with minimal guidance as to what methods to write and how to write them

Note: You will be graded in part on your coding style. Your code should be easy to read, well organized, and concise. You should avoid duplicate code.


Background: The Assignment

You're going to write a complete anagram generator as was demo'd in class. In Assignment 6, you effectively implemented a map using the BST to map a canonical form of words (a sorted String) to a list of words that have that canonical form. For this assignment, you will leverage the power of the Java TreeMap and TreeSet classes to not only re-implement the simple angram generator from the last assignment, but generate all possible two-letter and greater length words that can be made from the given word.

AnagramMap.java is where you will write your code to create a map using the sorted form of a word as the key and map that key to all the words that have the same sorted form. As in Homework 6, you will read all the words in the dictionary file one at a time, sort the word and build the map. Once again, you have a small file to initially test with. Once you can build the map from that, you should use the big file, words.txt.

Once you can create and display all the anagrams for a word, it's on to the real utility of this assignment - to create a word-game "assistant" that generates all possible words (of length 2 to the length of the word) that can be made from that word. This sounds simple but is harder than you might think.


Process

The Map

I will demo my solution in recitation. You can have your user interface behave any way you want (I will provide the output of my solution below: your results have to match, but your program's interaction with the user does not have to mimic mine line-by-line).

You want to have all the words with the same letters be mapped to the same location. So you will use the sorted form of a word as the key, and associate that key with all the words that have that same sorted form. Thus, "tar" will map to "tar, art, rat" (in any order). So our map will be from String to a collection of String. You get to decide what kind of collection of String you want to use. As long as "tar" maps to "tar, art, rat" (in any order), you're fine.

Similar to Assignment 6, you will build the map asking the user what the maximum word size is that they want to consider and then open and read all the words from a dictionary file, storing all the words that are less than or equal to the maximum length provided by the user. I have provided you with two dictionary files: small-words-qatar.txt and words.txt. The files are quite differently sized: small-words-qatar.txt contains 25 words, while words.txt contains over 172,000 words! You should test your code on the small file before going to the big one! :-)

The AnagramMap constructor takes a file name and maximum word size and builds a map with all the words that are less than or equal to the length specified. If, when you're done reading the small file, the size of your map's keySet is 9 with a maximum word size of 7, you appear to be on the right track. If you then run on the big file and get 41121 as the size of your keySet with a max word size of 7, and 66538 with a max word size of 8, you seem to have built a correct map. The size of the keySet, by the way, is, of course, the number of canonical forms of words that are less than or equal to that maximum length (which should be less than the number of words that are less than or equal to that length).

Once you've got your map built, you can now ask the user for a word and, if its length is less than or equal to your max length, you should search for it in your map (using what as the key?) and, if found, print all the anagrams of that word (which will be found in the collection that the key returned as its value attribute), just like in the last assignment. If the word is not found in the map, you should tell the user that it was not found. The user should be allowed to search for as many individual words as they want until they enter a sentinel value, at which point the program should end. As an example, if you are using the small file and the user enters "tar", your program should print "tar rat art" (in any order) as your output. Once you're here, you've just got a little more to do. :-)

The anagrams, all the anagrams, and nothing but the anagrams...

The last part of the assignment involves collecting and printing the anagrams of all the 2-letter or more words that can be generated from the string that the user enters. This requires generating the set of all substrings (the power set) of the string (sorted) that the user enters, and looking for the anagrams of each of those substrings, instead of just the anagrams of the string the user entered.

I have given you a class, PowerSet that you can use to your advantage. You do not have to understand this code (although you're welcome to look at it!). However, you should be able, by examining the sample test code I have provided in the main routine of PowerSet to understand how this code may be able to help you in solving this problem. More below...

How many subsets are there in the power set (the set of all subsets) of a set? How much wood could a wood chuck chuck if a wood chuck could chuck wood? Well, if the cardinality of the set is n, i.e., there are n elements in the set, there are 2n subsets in the power set. I have provided you, in the PowerSet class with a way to generate every subset of a set of n elements, returning each subset in turn as an array of integers, similar to how an iterator would behave.

How does this help you, you may ask? Well let's say the particular subset of 7 elements that you currently have is [0,2,3,5,6]. What you need to do is form a substring that consists of the characters at those positions. An example:

original string:         { t e a c h e r }
our subset of indices:     0 - 2 3 - 5 6
the subset indicated:    { t   a c   e r }

Notice that only the characters indicated by the subset instance will be incorporated into the substring to use to search the map. How is this helpful? If the number of possible subsets is 2n, our PowerSet class will generate each one. As it does, we should create a substring from the characters indicated and, once built, see what words that substring maps to, if there are any. If the substring does exist in the map, we put all the resulting words into another collection (one that can hold words of length 2 through length n in separate bins), indexed by the length of the substring. And when there are no more subsets left fro our PowerSet generator, we've generated the power set, the set of all substrings of the initial word and filed away all the words they can generate!

Hints

To make sure you're on the right track, make sure you're correctly using the PowerSet iterator and are correctly generating substrings of the initial string to search for in your map. Then make sure you're correctly storing and printing all the anagrams that should be generated. In particular, realize that if you input a word like "starry", the power set will generate the "star" substring twice (since there are two r's, so subsets [0,1,2,3] and [0,1,2,4] will both generate "star"), but the words that map to (have the same canonical form as) "star" should only be printed once! Also, the words should be grouped by length. You will likely want an ArrayList of some kind of collection (no duplicates allowed, hint, hint) to store the words before printing. And, again, make sure you test this part on the small file before running it on the 172,000-word file!

I have provided you with the main method I used to test my program; yours can be different! You will need to write a few other public functions and some helper functions for this assignment!


Expected Output

The following was the output of my reference solution, using 7 as the maximum word size on the large dictionary (recall that your words can appear in any order but there can be no duplicate words printed):

Maximum word size to consider? 7
  Number of entries = 41121

word to search [#] to stop: star 
2: [ar, as, at, ta]
3: [ars, art, ras, rat, sat, tar, tas]
4: [arts, rats, star, tars, tsar]

word to search [#] to stop: horse
2: [eh, er, es, he, ho, oe, oh, or, os, re, sh, so]
3: [ers, her, hes, hoe, oes, ohs, ore, ors, ose, res, rho, roe, ser, she]
4: [eros, hero, hers, hoer, hoes, hose, ores, resh, rhos, roes, rose, shoe, sore]
5: [heros, hoers, horse, shoer, shore]

word to search [#] to stop: #

Submitting Your Work

When you have completed the assignment and tested your code thoroughly, create a .zip file with your work (including Word.java and AnagramMap.java. Name the zip file "your-andrew-id".zip and email it to me mjs @ cs.cmu.edu.

Make sure to keep a copy of your work just in case!