SCRABBLE, ANYONE? John Krahmer INTRODUCTION My wife is a word game wizard. She can regularly beat me at Scrabble, Boggle, Anagrams, Crossword Puzzles or any other game requiring the manipulation of letters. I don't even bother watching Wheel of Fortune with her; when she's in the room, the show doesn't even have a Neilson rating. Perhaps because of this untarnished record of consecutive losses, I occasionally gave some thought to the idea of a computer program that would give me enough help to at least be a respectable opponent. She never had any objection to my use of such a program, probably because of her boredom with playing against an inferior, but I always ran into the stumbling block of combinations and permutations. THE PROBLEM It is the theory of combinations and permutations that says a poker player has only one chance in 2,598,960 of getting a straight flush in the first five cards that are dealt (and this is without regard to the order in which the cards are dealt). As applied to words, the order of the letters becomes important as well and the number of permutations takes a quantum leap. Not only would the simple generation of the permutations take a long time but, since not all sequences of letters are also words, there is the added difficulty of distinguishing words from non- words. I toyed with the idea of using a spelling checker to identify legitimate words from a list of letter combinations, but the number of possible permutations made that idea ridiculous, not to mention the difficulty of modifying a spelling checker to find words that were spelled correctly instead of incorrectly. The problem seemed insurmountable until I happened upon the ingenious concept of "word signatures." (See Scientific American, Vol. 251, No. 4, pp. 20-21 (Oct. 1984)). THE SOLUTION A "word signature" is simply the alphabetical re-arrangement of the letters contained in any given word. For example, the word signature for the word, "reaction," is "aceinort." As it happens the word, "creation," has the same word signature. Thus, with the combination of letters forming the word signature "aceinort," at least two legitimate words can be made by using all eight letters. If less than eight letters are used, the words "certain, nectar, action, carton, ornate, ration, erotic, canoe, ocean, enact, trace, tonic, cane, race, care, iron and torn" can be formed (among others). The idea of word signatures meant that I did not have to be concerned with all possible combinations after all. I only had to know a list of letters in alphabetical order to determine any word or words that could be made with those letters, provided I had a database of words linked to the word signatures. It began to look like dBASE might be the answer to the problem of creating a viable Scrabble aid. By combining a data base of legitimate words with an index keyed to word signatures, I could take any given combination of letters and quickly access the words by entering the word signature and using the dBASE FIND command. My difficulty now became how to create a database of words without going through the drudgery of typing in a long list of words and word signatures. CREATING THE DATABASE The simplest way of doing this seemed to be by using a spelling checker to output the contents of its dictionary to a disk file but, alas, my spelling checker would only send its contents to a printer, not to a disk file. Undaunted by this limitation, I bought some plugs and cable from a local computer store and wired my two Osborne CP/M machines together and "printed" the dictionary from one computer to a disk file on the other (MS-DOS users might be able to use the redirection capability of DOS to do the same thing on one machine). The resulting disk file was then read into a dBASE II .DBF file with a single field named "Word" by using the Standard Data Format (SDF) option. Because my purpose was to create a database that would be useful in playing Scrabble, I arbitrarily DELETED all words of more than eight letters because a player normally has only seven letters in a "hand" and can usually link those letters to only one letter already played on the game board. There seemed little need for words of more than eight letters. Words of three letters or less were also DELETED because they are easy to find and there was no point in unnecessarily increasing the size of the database. The database ended up with 12,759 words. Since the game of Scrabble is really a varient of the class of word games described as "anagrams," the name ANA12759.DBF suggested itself as a good way to identify the file as an anagram database with a specified number of words contained as individual records. The name could then be changed to accurately reflect the addition of more words and help insure using the most current version of the database. GENERATING THE WORD SIGNATURES As the next step, I APPENDED the words to another dBASE II file containing two fields with a length of eight characters each, one named "Word" and the other named "Key," and ran the 12,759 words in the resulting .DBF file through a dBASE II command file to generate the word signatures and REPLACE the "Key" field in the data base with the proper signature. This command file appears below in Figure 1 under the title SIGSORT.CMD. While the command file is fairly short, it has some interesting aspects because of the nested loops that perform the internal sorting of letters within strings, a technique that might be useful for other applications. Following the standard lines to clear the system, set defaults and use a named database, the command file stores the ASCII value of a capital "A" (65 decimal) to a memory variable called WHATCHAR and the length of the word to be sorted to a memory variable called LENGTH. A space character was stored to the memory variable MKEY just to establish that location. I added a non-functional line at that point to write a screen message that all variables were loaded just to keep track of the processing sequence (debugging). The next several lines run through two nested loops that do two main things. First, so long as the length of the word is not exceeded, the command file keeps testing each letter in the word in sequential order against the letter stored in WHATCHR to determine if the letter appears one or more times in the word. If the letter does appear, it is stored in the first non-blank location in the MKEY memory variable. Second, when the length of the word is exceeded, the program increments the ASCII value of the search letter by one (i.e., "A" increments to "B" by adding 1 to the decimal value) and the sequential search is performed again with the last search being for the letter "Z" (90 decimal). To speed up the processing a little bit, a CASE statement is used to test the word with the dBASE @ function (i.e., "does this character appear in this string" function) and a negative response immediately causes the command file to move on to the next letter in the alphabet. The last part of the program takes the rearranged word and replaces the word signature with the alphabetized letters contained in MKEY. This command file will be useful if you create your own word database and need to generate the necessary word signatures. It will also be useful if you want to add large numbers of words to the original database in ANA12759.DBF and need to sort the letters of each word before you APPEND them to the original database. (The original database and the programs described in this article are available on disk in the FOG Library. One small word of warning is necessary about the use of SIGSORT: It takes about six seconds to sort each word into a word signature. If you multiply six seconds by several thousand words, it is readily apparent that the processing time must be measured in hours for a sizable database. Fortunately, the processing time can be calculated in advance and a database can be split into separate parts for overnight processing, with the parts reassembled into a whole when all of the word signatures have been generated. It took about 24 hours to generate the signatures for ANA12759.DBF. An interesting experiment would be a comparison of processing times for SIGSORT with compiled versions of dBASE .CMD or .PRG files on different hardware configurations. USING THE DATABASE AND THE WORD SIGNATURES The last step was to write a command file to USE ANA12759 to FIND each word signature that would form a word contained in any given combination of letters. This command file appears below in Figure 2 under the name ANAGRAM.CMD. Oddly enough this was the easiest part of the job because the command file was highly repetitive in its search logic. The beginning point, of course, was to index the database on KEY (the word signature) to take advantage of the speed of the dBASE FIND function. I originally did this directly from the dot prompt, but later modified the command file to generate the index file automatically if it is missing when the command file is run. Because the database is one that is likely to grow in size by the addition of more words and word signatures over a period of time, the only lines that need to be changed to reflect these additions appear in lines 4 and 5 of the command file. If you have added words to the database without having the index file in USE, just change the name of the database and erase the outdated index file. The command file will ask if you want a new index created and a "Yes" answer will generate the file for you. After selecting the database and index file for USE (or generating the index file if needed), the command file goes through a series of error traps and tests. The first IF/ENDIF tests for the existence of the database itself, primarily to be sure the name has not been changed from that expected by the command file. The next IF/ENDIF tests for the index file and asks if the user wants to create the index file if it is not found (again, a simple failure to change the name of the index file may be the reason the file is not found). Following these two file verifications, a long command string is stored to the memory variable TEST to determine if the letters that are entered are already in alphabetical order. The program runs a little bit faster if the letters are entered alphabetically, but this is not critical to the operation. One more error test checks the length of the entered string to be sure it is between four and eight letters long, to be compatible with the word lengths contained in the database. At this point, the program begins a word signature search if the search string is already in alphabetical order or runs through an alphabetizing loop if the letters are not already alphabetized. The search pattern itself is simply a process of storing each letter in the search string to a separate memory variable (with periods used for padding) and then using the FIND command to test each possible pattern to see if the database contains a word formed from the permutation. For ease of calculating patterns, these memory variables are named "A", "B", "C", etc. This violates the usual dBASE custom of naming memory variables with meaningful names, but the need to manipulate the possible permutations made the use of such names useful in the context of this program. The search sequence itself is tedious, but simple. The search for a string of eight letters is illustrative. If the eighth letter is not a period (causing an abort of that search sequence), the program attempts to FIND a word signature that matches the letters contained in memory variables "A" through "H" by combining those variables in a macro sequence (i.e., &A&B&C...etc.). If a match is found, the word formed by the letters in the word signature is displayed. The program then proceeds through a search for seven letter words by doing a FIND for all combinations of seven letters that can be derived from eight letters, provided that the seven letters are kept in alphabetical order. It is at this point that the concept of word signatures proves its worth in word searches. Because the order of the letters is important in forming a word, a complete set of permutations would require fifty-six searches. By using word signatures, the number of searches is reduced to eight. As a look at the program will demonstrate, this reduction factor becomes even more important as the searches for six, five and four letter combinations proceeds. Without the underlying theory of word signatures, the program would quickly become unmanageable. PROCESSING TIME If this program does nothing else, it certainly proves the speed of the dBASE II FIND command. A search for all possible words of four to eight letters that can be formed from the word signature "aceinort" takes about two and one-half minutes on an Osborne 4 with a Westwind/Trantor hard disk using the ANA12759 database. I do not have access to other equipment for comparison tests, but such comparisons would be of interest. Compiled versions of dBASE would no doubt be much faster. The search time will increase when more words can be formed from a given word signature because the program must perform a screen write as an intervening step between searches. Pascal experts might be able to use the idea of word signaures to develop a "stand-alone" program that performs similar functions. I wonder how fast that would be? ODD RESULTS When running the program, you will probably find that the same word sometimes appears more than once in the list of possible words. This is not an error, but a function of the search process that treats a letter in a particular position in a word signature as being different from the same letter in a another position in the same word signature. For example, the word "data" would show up three times if the search string were entered as "database" because the word signature is "aaabdest" and a search for four letter words would sequentially identify the various "a" combinations. PROGRAM USES Besides being a useful tool for Scrabble players like me who tend to read words like "aceinort" literally, the program is great for solving anagram puzzles. It has also been of interest in testing letter combinations derived from various word games. Merely having a database composed of words has been useful in finding words where letter locations are important by using the the dBASE "LIST for $(WORD,)="?" .AND. $(WORD,)="?" .AND. etc.," While this is fairly slow, it does solve Wheel of Fortune and Hangman puzzles in a reasonable time. Beyond such miscellaneous uses, the program has been a lot of fun for my eight year old daughter (who happens to be named Ana, perhaps the best reason of all for naming the database ANA12759). She loves reading and playing word games and has enjoyed using the program to solve anagrams and to find words that can be made from different letter combinations. I have no way to measure a difference, but I think her spelling has improved by using the program as a game. One more use, equally difficult to measure, is the simple fact that some programs allow people to enjoy themselves in ways that were difficult or impossible before personal computers were generally available. There is something intriguing about typing in a random sequence of letters and letting the computer list the words that can be formed from the combination of letters. I have a suspicion that semantics, linguistics analysis and encryption theory can somehow be tied into the concept of word signatures. It is an idea that deserves further exploration. I hope you enjoy the program. By the way, my wife can still beat me at Scrabble, but the scores are closer.