15-121 Homework 7: A Complete Anagram Generator - Due 12/5 at midnight
Download and unzip hw7-code.zip, which contains all
the files you will need for this assignment. You will be writing code in
the AnagramMap.java file. The goals of this assignment are:
- to familiarize you with Java maps and sets
- to gain more experience with using the Java API
- to gain more experience in creating a solution to a problem with minimal
guidance as to what methods to write and how to write them
Note: You will be graded in part on your coding style. Your code
should be easy to read, well organized, and concise. You should avoid
duplicate code.
Background: The Assignment
You're going to write a complete anagram generator as was demo'd in class. In
Assignment 6, you effectively implemented a map using the BST to map
a canonical form of words (a sorted String) to a list of words that have that
canonical form. For this assignment, you will leverage the power of the
Java TreeMap and TreeSet classes to not only re-implement the
simple angram generator from the last assignment, but generate all possible
two-letter and greater length words that can be made from the given word.
AnagramMap.java is where you will write your code to create
a map using the sorted form of a word as the key and map that key to all the
words that have the same sorted form. As in Homework 6, you will read all the
words in the dictionary file one at a time, sort the word and build the map.
Once again, you have a small file to initially test with. Once you can build
the map from that, you should use the big file, words.txt.
Once you can create and display all the anagrams for a word, it's on to the
real utility of this assignment - to create a word-game "assistant" that
generates all possible words (of length 2 to the length of the word) that can
be made from that word. This sounds simple but is harder than you might
think.
Process
The Map
I will demo my solution in recitation. You can have your user interface
behave any way you want (I will provide the output of my solution below: your
results have to match, but your program's interaction with the user does not
have to mimic mine line-by-line).
You want to have all the words with the same letters be mapped to the same
location. So you will use the sorted form of a word as the key, and associate
that key with all the words that have that same sorted form. Thus, "tar" will
map to "tar, art, rat" (in any order). So our map will be from String
to a collection of String. You get to decide what kind of collection
of String you want to use. As long as "tar" maps to "tar, art, rat"
(in any order), you're fine.
Similar to Assignment 6, you will build the map asking the user what the
maximum word size is that they want to consider and then open and read all the
words from a dictionary file, storing all the words that are less than or
equal to the maximum length provided by the user. I have provided you with
two dictionary files: small-words-qatar.txt and words.txt. The
files are quite differently sized: small-words-qatar.txt contains 25
words, while words.txt contains over 172,000 words! You should test
your code on the small file before going to the big one! :-)
The AnagramMap constructor takes a file name and maximum word size and
builds a map with all the words that are less than or equal to the length
specified. If, when you're done reading the small file, the size of your
map's keySet is 9 with a maximum word size of 7, you appear to be on the right
track. If you then run on the big file and get 41121 as the size of your
keySet with a max word size of 7, and 66538 with a max word size of 8, you
seem to have built a correct map. The size of the keySet, by the way, is, of
course, the number of canonical forms of words that are less than or equal to
that maximum length (which should be less than the number of words that
are less than or equal to that length).
Once you've got your map built, you can now ask the user for a word and, if
its length is less than or equal to your max length, you should search for it
in your map (using what as the key?) and, if found, print all the anagrams of
that word (which will be found in the collection that the key returned as its
value attribute), just like in the last assignment. If the word is not found
in the map, you should tell the user that it was not found. The user should
be allowed to search for as many individual words as they want until they
enter a sentinel value, at which point the program should end. As an example,
if you are using the small file and the user enters "tar", your program should
print "tar rat art" (in any order) as your output. Once you're here, you've
just got a little more to do. :-)
The anagrams, all the anagrams, and nothing but the anagrams...
The last part of the assignment involves collecting and printing the anagrams
of all the 2-letter or more words that can be generated from the string that
the user enters. This requires generating the set of all substrings
(the power set) of the string (sorted) that the user enters, and looking for
the anagrams of each of those substrings, instead of just the anagrams of the
string the user entered.
I have given you a class, PowerSet that you can use to your advantage.
You do not have to understand this code (although you're welcome to look at
it!). However, you should be able, by examining the sample test code I have
provided in the main routine of PowerSet to understand how this
code may be able to help you in solving this problem. More below...
How many subsets are there in the power set (the set of all subsets) of a set?
How much wood could a wood chuck chuck if a wood chuck could chuck wood?
Well, if the cardinality of the set is n, i.e., there are n elements in the
set, there are 2n subsets in the power set. I have provided you,
in the PowerSet class with a way to generate every subset of a set of n
elements, returning each subset in turn as an array of integers, similar to
how an iterator would behave.
How does this help you, you may ask? Well let's say the particular subset of
7 elements that you currently have is [0,2,3,5,6]. What you need to do is
form a substring that consists of the characters at those positions. An
example:
original string: { t e a c h e r }
our subset of indices: 0 - 2 3 - 5 6
the subset indicated: { t a c e r }
Notice that only the characters indicated by the subset instance will be
incorporated into the substring to use to search the map. How is this
helpful? If the number of possible subsets is 2n,
our PowerSet class will generate each one. As it does, we should
create a substring from the characters indicated and, once built,
see what words that substring maps to, if there are any. If the
substring does exist in the map, we put all the resulting words into another
collection (one that can hold words of length 2 through length n in separate
bins), indexed by the length of the substring. And when there are no more
subsets left fro our PowerSet generator, we've generated the power set, the
set of all substrings of the initial word and filed away all the words they
can generate!
Hints
To make sure you're on the right track, make sure you're correctly using the
PowerSet iterator and are correctly generating substrings of the initial
string to search for in your map. Then make sure you're correctly storing and
printing all the anagrams that should be generated. In particular, realize
that if you input a word like "starry", the power set will generate the "star"
substring twice (since there are two r's, so subsets [0,1,2,3] and [0,1,2,4]
will both generate "star"), but the words that map to (have the
same canonical form as) "star" should only be printed once! Also, the words
should be grouped by length. You will likely want an ArrayList of some kind of
collection (no duplicates allowed, hint, hint) to store the words before
printing. And, again, make sure you test this part on the small file before
running it on the 172,000-word file!
I have provided you with the main method I used to test my program;
yours can be different! You will need to write a few other
public functions and some helper functions for this assignment!
Expected Output
The following was the output of my reference solution, using 7 as the maximum
word size on the large dictionary (recall that your words can appear in any
order but there can be no duplicate words printed):
Maximum word size to consider? 7
Number of entries = 41121
word to search [#] to stop: star
2: [ar, as, at, ta]
3: [ars, art, ras, rat, sat, tar, tas]
4: [arts, rats, star, tars, tsar]
word to search [#] to stop: horse
2: [eh, er, es, he, ho, oe, oh, or, os, re, sh, so]
3: [ers, her, hes, hoe, oes, ohs, ore, ors, ose, res, rho, roe, ser, she]
4: [eros, hero, hers, hoer, hoes, hose, ores, resh, rhos, roes, rose, shoe, sore]
5: [heros, hoers, horse, shoer, shore]
word to search [#] to stop: #
Submitting Your Work
When you have completed the assignment and tested your code thoroughly,
create a .zip file with your work (including Word.java and
AnagramMap.java. Name the zip file "your-andrew-id".zip and email it
to me mjs @ cs.cmu.edu.
Make sure to keep a copy of your work just in case!