Of course, substitution ciphers are completely insecure in this day and age, but the general idea of substitution still has a place in modern cryptography. Substitutions are what give the Serpentacci and Worm ciphers their strength. Computer ciphers can be attacked, in no small measure, because the substitutions built into them are fixed. I have looked around for a simple way to make a scrambled alphabet, which is what a substitution essentially consists of, from a password or key phrase, but typically the method you can find is very crude: start writing every new letter found in the password, and then the rest of the alphabet when you run out of password. This will cause most scrambled alphabets, among other defects, to end in XYZ, since those letters are rare.
In this article I discuss better ways to turn a password into a scrambled alphabet, which are not all that complicated.
A substitution key, also known as a scrambled or mixed alphabet, is a permutation of the standard alphabet. The idea is that every time a certain letter appears in a text, it can be replaced with the corresponding letter in the mixed alphabet. Doing so encrypts the message so that those knowing the mixed alphabet only need to work the substitution backwards in order to recover the original text. The reason why a simple substitution is not secure is that the letter frequencies of the original language are preserved after the substitution, so that the most frequent letter is likely to be the substitute for “E” in most Western languages, and so forth. Two- and three- letter groups also preserve their frequencies, which are perhaps even more telltale than single letters.
But mixed alphabets can be used in conjunction with a Tabula Recta, as I explained in this article, and obtain a much more secure encryption whose result looks completely random. The Tabula Recta gives it its randomness, but it is the substitution that makes it irreversible (without the key). The key space is pretty good: 26! = 4E26 possibilities for a 26-letter alphabet. If two substitutions are used (into and out of the Tabula Recta), then we get a very respectable key space of size 8E52, equivalent to 177 binary bits. This is in the same league as today’s highly complex computer ciphers.
The problem is, the mixed alphabet still has a low entropy if it is obtained from a word or phrase memorized by the user. Even worse, the standard method of generating a mixed alphabet from text described at the top loses some of the (already low) entropy of the keyphrase, and the problem gets worse as you make the keyphrase longer. Let’s see this with an example. Suppose the keyphrase is “To be or not to be”. The standard method, using only new letters encountered in the phrase followed by the rest of the alphabet yields this: TOBERNACDFGHIJKLMPQSUVWXYZ. The end of this alphabet is essentially the original, and most letters have been replaced by other that are only a couple spaces away. Not very good.
We obtain a slightly better result if we add the remaining letters in reverse order, rather than in their normal order, yielding this: TOBERNZYXWVUSQPMLKJIHGFDCA. Now the tricky “XYZ” group is buried well in the middle, but still many letters remain in consecutive order. Another problem is that a lot of the letters of the keyphrase are repeats and therefore have not been used. This defeats the purpose of using a longer keyphrase.
We can solve the second problem partially by adding one of the unused letters every time a repeat appears. For instance, if we pick the first available letter in the straight alphabet, we get: TOBEARNCDFGHIZYXWVUSQPMLKJ, where the alphabet has been completed with the rest of the letters in reverse order. Much better, but observe what happens when the keyphrase is long, like “to be or not to be, that is the question.” The alphabet then becomes: TOBEARNCDFGHIJKLMPSQUVWXYZ, which brings back the dreaded XYZ at the end with a vengeance. If it wasn’t for the Q in “question”, the result would have been almost the same as the first one we tried. This is because the many repeats bring in many letters in alphabetical order. If we put in those letters in reverse order, then we’d get: TOBEZRNYXWVUSQHAPIMLKJGFDC, which still contains a couple groups of near-consecutive letters (in reverse alphabetical order). We also find that the actual keywords near the end are pretty much irrelevant since all the letters have appeared earlier. It would be better if the letter picked for a repeat had some relationship to the original letter.
So here’s the improved process: when a repeat occurs, use the closest letter preceding it in the alphabet that is still available, wrapping to the end of the alphabet if needed; when the keyphrase ends, fill with the rest in reverse order. “To be or not to be” yields: TOBENRMLSQKADZYXWVUPJIHGFC, and the complete sentence from Hamlet’s soliloquy yields: TOBENRMLSQKADPHZJIGFCYXUWV. The second sample is particularly interesting because our longish keyphrase produced a mixed alphabet containing groups of two consecutive letters at most. Some more examples taken from famous novels:
|Call me Ishmael||C A L K M E I S H J Z D G Y X W V U T R Q P O N F B|
|Happy families are all alike||H A P O Y F Z M I L G E S X R D W K J V C B U T Q N|
|It was the best of times, it was the
worst of times
|I T W A S R H E B D Q P O F N G M C L Z K V Y J X U|
|It was a bright cold day in April, and
the clocks were striking thirteen
|I T W A S Z B R H G F Q C O L D Y X V E N U P M K J|
|Somewhere in la Mancha, in a place whose
name I do not care to remember
|S O M E W H D R C I N L A K Z J B G Y F X V P U T Q|
Of these phrases, only the first two contain fewer than 26 letters so that we have to finish off the alphabet with the unused letters. The phrase letters past the 26th position do not have any influence on the alphabet and can be safely ignored. I have marked those as italics. You can see that the resulting alphabets look pretty random in all cases, which is good. Near the end of the fist 26 letters, their value ends up being less and less important, but it still matters from time to time rather than the simple fact of whether or not that letters has appeared before. This makes for a larger effective key space.
Now, a typical English word contains around 10 bits of entropy, and the average length of an English word is 5.1 letters; 26 letters are therefore just about 5 words, or 50 bits of entropy. This is a long shot from the 26! possible permutations of an alphabet, which is about 89 bits of entropy, but not terrible. A cipher using two substitution keys would involve a maximum of 10 words to generate those keys, or about 100 bits of entropy, which is competitive with a typical 128-bit computer cipher.
But there is still another way to cram more entropy into alphabet making. Take a long phrase and write it on a table containing 26 columns. Then add the columns using a Tabula Recta in order to obtain 26 letters from which to generate an alphabet using the method just described. Let’s say our phrase is the fourth one on the table above (which I’m sure you recognized as the beginning of George Orwell’s “1984”). Here’s what we get by folding it and “adding” the resulting lines with a Tabula Recta through serpentine operations like those used in the Snake and Serpentacci ciphers:
ItwasabrightcolddayinApril andtheclockswerestrikingth irteen -------------------------- QXMLPJBUGWDZUQGBPTTAXIYPLW
And then we derive the alphabet from the resulting string. This is the alphabet: QXMLPJBUGWDZTOFANSRYVIKHEC. There were 14 words in the key phrase, for an estimated 140 bits of entropy. Adding letters with a Tabula Recta does lose some entropy but not a lot if the entropy density was low to begin with. As we saw when discussing the TripleText cipher, you can add together three rows of English text and obtain a result that appears to have conserved all of the entropy of those three rows. So this time there is plenty of source entropy for making a mixed alphabet. It still won’t be the full 89 bits because letters near the end have less influence than letters near the beginning, but it should be enough for most practical purposes.
Another way to look at this. English text has an average of 1.58 bits of entropy per character, according to some recent estimates. Therefore, to achieve 89 bits of entropy we need a minimum of 89/1.58 = 57 characters, which is a little over two lines at 26 characters each. Three whole lines (78 characters), contain 123 bits of entropy, which is probably more adequate since entropy is lost both when the letters are combined and when a mixed alphabet is made from the result.