BookPad is a paper and pencil “one-time pad” cipher, described in this other post. Real cryptographers are leery of ciphers claiming to be incarnations of the unbreakable one-time pad for good reasons, the best of them being that true one-time pads must contain perfectly random numbers, which not even a computer can produce. In this post, therefore, I put on my cryptanalyst’s hat and attempt to break a longish message encrypted with BookPad.
Who will win? Find out after the break.
The whole point of BookPad is to generate a keystream starting from a piece of text, so that the keystream is as random and easy to produce as possible. Additionally, it would be nice if the resulting keystream were non-invertible, otherwise an adversary might be able to obtain the key text from the keystream, and recognizing the source end up compromising all our communications. This can be done with the help of a PRNG, which does not need to be cryptographically strong, in the sense that a sequence of bits much longer than the seed cannot be predicted from previous bits, because we are going to use it only to generate short sequences, shorter than the seed itself.
Of the various PRNGs available that can be run on paper and pencil, lagged Fibonacci generators are especially interesting because they are so easy to do by hand. LFGs are based on a variant of the Fibonacci series, where each number is the sum of the preceding two: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55. . . . To make a PRNG, we don’t take the full numbers, but rather we subtract a multiple of 10 as needed so the result is less than 10 (or we neglect to carry digits as we add), like this: 1, 1, 2, 3, 5, 8, 3, 1, 4, 5. . . . Then, a lagged Fibonacci series doesn’t add the last two numbers, but something further back. For instance, if we add the last and third to last numbers (use zero if there is nothing there), we get: 1, 1, 1, 2, 3, 4, 6, 9, 3, 9, 8, 1, 0, 8. . . which is different from the straight Fibonacci series.
A better way to do this is to write the numbers in separate rows, so we are adding the top and bottom numbers to get the next one. For instance, if each number is the carryless sum of the previous and the one five spaces before, we can set them up into rows of five like this, where the first row and the first number of the second, in bold, come from the seed (496737) that I have used to initialize the PRNG:
4 9 6 7 3 7 1 0 6 3 6 3 4 4 0 3 9 2 6 0 0 3 2 4 0
And so on. . . . You can see that adding without carries you get: 4+7=1 and I write it right of the 7, 9+1=0, 6+0=6, 7+6=3, 3+3=6 (this one goes on the next row), etc. You can check that, if you change a single digit in the seed, all the numbers in the third row change, although you may get the same if you change a different seed digit by the opposite amount, not so for the fourth row and beyond. This is a good property, which makes the result that much harder to predict. The fourth row is even better because a single change changes the result in a more complex way (with three rows, all the numbers are simply shifted up or down). But for the rest of our present discussion we are going to take as the “result” value the digits in the third row. In our example, this will be: 63440.
To qualify as a “hash” of sorts, the operation must also be one-way, so that having the result will not reveal the seed. As it stands this LFG fails this test, for it is possible to nearly reconstruct the second row from the third, and then we have this, where the letters represent unknowns with possible values from 0 to 9:
a b c d e 7 1 0 6 f 6 3 4 4 0
And then we can establish this system of equations (addition is mod 10):
a + 7 = 1 b + 1 = 0 c + 0 = 6 d + 6 = f e + f = 6
That is, five equations with six unknowns. It is not strictly solvable, but then we can guess one of the unknowns and then get the rest. If the seed obtained this way has some recognizable pattern (because it comes from actual text), then we might be able to tell which guess out of ten possible is the good one and then the problem is solved. The problem doesn’t get any harder to solve if we take the fourth row instead of the third one, only longer, adding five more unknowns but also five more equations.
In other words, to make this non-invertible we need to remove information. One easy way is to output a string shorter than a full row. One unknown is added to the problem for every digit skipped. When the number of digits skipped is equal to the digits in the output (that is, when we output only half of the last row), our guess adds the same uncertainty as data we have. In that case, we can always reproduce the half of the last row we have by choosing a certain guess for the other half, regardless of what the seed might be, so that we can never tell what the correct seed is.
But ultimately there is no need to make this non-invertible, especially if the price to be paid for it is to throw away some of the key text, which effectively only gets used for this purpose. In this other article, I show that the LFG might not be necessary at all since the seed fed into he LFG is as random as the result of the LFG, provided the key text was indeed sufficiently long.
In Dirk Rijmenants’s excellent article on one-time pads, he insists that real one-time pads have failed in practice not because the cipher is not inherently unbreakable, but because people committed mistakes that reduced their security. This was the case also with the Enigma machine of world war II and many other ciphers, especially if some complexity was involved on the user’s side. Well, BookPad is not free of complexities, so let us begin trying to exploit different ways people can misuse the cipher. Here’s a list:
- BookPad is designed so the key text is at least three times the length (as encoded into numbers) of the plaintext, but nothing prevents a lazy user from taking a much shorter key text and just repeating it several times in order to achieve the required length.
- BookPad is a stream cipher, and so it will be easily broken if the keystream is re-used. This can happen in several ways: re-using the key text for a message or identical length as a previous one, re-using it for a message of a different length, using a serial that is consecutive (or simply close) to a previously used serial.
- As all stream ciphers, an adversary in possession of the plaintext (but not the key text) might be able to change the ciphertext so it decrypts to something of his choosing. Adding a message authentication code (MAC) makes this much more difficult, but the MAC process takes extra effort and opens up the possibility of leaking information so the cipher is no longer theoretically unbreakable.
We will now look at these scenarios one by one. But first we need to have something to crack. In order to provide room for statistical analysis, we’ll encrypt something rather long. For instance, the first paragraph in the text-only version of Charles Dickens’s “Oliver Twist” at gutenberg.org, which reads:
Among other public buildings in a certain town, which for many reasons it will be prudent to refrain from mentioning, and to which I will assign no fictitious name, there is one anciently common to most towns, great or small: to wit, a workhouse; and in this workhouse was born; on a day and date which I need not trouble myself to repeat, inasmuch as it can be of no possible consequence to the reader, in this stage of the business at all events; the item of mortality whose name is prefixed to the head of this chapter.
We will be encrypting this with different keys. If we use as key the second and third paragraphs from the same source, we will have enough key material. This is:
For a long time after it was ushered into this world of sorrow and trouble, by the parish surgeon, it remained a matter of considerable doubt whether the child would survive to bear any name at all; in which case it is somewhat more than probable that these memoirs would never have appeared; or, if they had, that being comprised within a couple of pages, they would have possessed the inestimable merit of being the most concise and faithful specimen of biography, extant in the literature of any age or country. Although I am not disposed to maintain that the being born in a workhouse, is in itself the most fortunate and enviable circumstance that can possibly befall a human being, I do mean to say that in this particular instance, it was the best thing for Oliver Twist that could by possibility have occurred. The fact is, that there was considerable difficulty in inducing Oliver to take upon himself the office of respiration,--a troublesome practice, but one which custom has rendered necessary to our easy existence; and for some time he lay gasping on a little flock mattress, rather unequally poised between this world and the next: the balance being decidedly in favour of the latter. Now, if, during this brief period, Oliver had been surrounded by careful grandmothers, anxious aunts, experienced nurses, and doctors of profound wisdom, he would most inevitably and indubitably have been killed in no time. There being nobody by, however, but a pauper old woman, who was rendered rather misty by an unwonted allowance of beer; and a parish surgeon who did such matters by contract; Oliver and Nature fought out the point between them. The result was, that, after a few struggles, Oliver breathed, sneezed, and proceeded to advertise to the inmates of the workhouse the fact of a new burden having been imposed upon the parish, by setting up as loud a cry as could reasonably have been expected from a male infant who had not been possessed of that very useful appendage, a voice, for a much longer space of time than three minutes and a quarter.
In order to make things easier for myself, I start doing the cipher using this web app (a deprecated version implementing serials and LFGs) which goes through the same steps I’d follow to use BookPad with paper and pencil, only much more quickly. Go ahead and click the link (the app will load on a separate tab) in order to follow what I’ll be saying and experiment on your own.
If now we feed both into BookPad and do the encryption with the default “Arsenio” encoding and serial “0000”, the ciphertext before any decoding back to digits becomes:
Some quick statistics of the ciphertext:
length = 689 digits (the plaintext was 522 characters, so the ciphertext is 1.32 times longer). '0' appears 72 times, '1'=73, '2'=65, '3'=64, '4'=77, '5'=66, '6'=74, '7'=60, '8'=81, '9'=57, which is close to a uniform distribution
Now let’s go through the different scenarios.
Short key scenario.
If the user get’s lazy and uses a very short key text, he may easily produce a repeating keystream. For instance, if the key text is simply “a”, the resulting keystream, ready to be combined with the encoded plaintext, becomes:
Which is made up of the period “86556815063100136051” (20 digits) repeated 34 times, plus a fraction. The resulting ciphertext, therefore, is as if produced by a Vigenère cipher of key “86556815063100136051”, acting on the encoded plaintext. To crack this, we would make statistics for the digits found at 1, 1+n, 1+2n, and so forth, where n is our guess of the key length. When we try the correct length, then all the digits in each set have been combined with the same keystream value, so that their peculiar statistics are not affected by the keystream. If the plaintext was English, there would be a clear statistical bias toward certain letters (“e” being the most frequent after the space, then “t”, “a”, and so forth) that would allow us to recognize that length as the right one, and likely to recover the corresponding plaintext, each letter shifted by the same amount in the alphabet, as in the Caesar cipher. The scanning and testing process can be automated, so it would take a fraction of a second in a computer. The only requirement is that each set is large enough to produce meaningful statistics and that the encoding does not obliterate the patterns present in the plaintext.
Let’s take a look at this last point. If I count letter frequencies (including space) in the plaintext, I find the following:
Which only has a few inversions with respect to the average English text frequencies (that is the order used in the list above).
In the encoded plaintext, however, the frequencies are:
Here ‘0’, ‘8’ and ‘9’ are the most frequent digits by far. ‘0’ and ‘9’ encode the less-frequent letters in most Western languages (‘t’, which is frequent in English, is encoded as ’03’ in our scheme, sharing the digit ‘0’ with other much less frequent letters), ‘8’ encodes the space. So it stands to reason that a real piece of English plaintext will have a lot of 8’s, 9’s, and 0’s, although it would be hard to tell which encodes which. Looking at the less frequent digits, we find ‘2’, which encodes ‘r’, which is not so frequent in English, and ‘7’, which encodes ‘o’. The most frequent ‘e’ is encoded as ‘4’, which is in the middle of the pack and doesn’t stand out in any way. Still, let us assume that a piece of encoded English might be recognizable because of the high frequency of three digits (encoding spaces and less-frequent letters) over the rest. Unfortunately, digraphs and trigraphs, which usually are much better to detect encoded language than single letters, won’t help with Vigenère cryptanalysis (unless we add a lot of complexity to the process) because consecutive letters end up in different groups.
It is interesting to compare this statistic with what we would get if the plaintext was random. One way to obtain this is to encode the string “abcdefghijklmnopqrstuvwxyz .’” which contains all encodable characters once. If we repeat it 14 times, for a total of 672 digits after encoding, we obtain the following statistic:
so that only two digits, corresponding to the less-frequent letters, appear more frequently than the others. It seems, therefore, that it would be possible to detect the moment when the correct key length is guessed, since then frequency analysis would give three peaks, which is different from a random distribution of digits (no peaks) or of letters (two peaks). From then on, one would apply the standard cryptanalysis of the Vigenère cipher, obtaining the repeated keystream period in the process.
Okay, but that’s with an incredibly bad key. What if the key is a little better? Let’s assume now that the key is “ar”, which is encoded as “12” with the default straddling checkerboard. In this case the keystream consist of repetitions of the sequence “407668162090261866704076681620902618667040766816209026186670”, which comprises 60 digits, so that the string is repeated 11 times plus a fraction. It is a lot harder to come up with a meaningful frequency distribution of 10 different digits if there are only 11 or 12 samples. The expected three peaks might appear, or they might not, and then we’d never know what the true period is. This period is not always the same, either. If the key is “ae”, which also produces only two digits, the repeated keystream sequence is “7310137297781518779273101372977815187792”, which is 40 digits long, leading to an average of 17.2 samples per distribution. The period length isn’t constant, either. If key “ar” is used for a plaintext equal to the original, minus the final period (2 digits shorter after encoding), then the keystream period is 40, not 60. For key “ae”, it is 20 digits long.
Going back to the original plaintext, key “ab” produces a 60-digit keystream period, whereas key “ac” leads to a 45-digit period. Both are three-digit keys, as encoded, so the difference is to be found in the peculiarities of the lagged Fibonacci generator that produces the keystream. A number of four-digit keys also lead to 60- or 40-digit periods, but when we go to five digits, as in “bad”, the period jumps to 100 digits, but “bat” takes us back to 50 digits. A six-digit keyword such as “but” gives 60 digits. In general, one could expect that the keystream period length be ten times the encoded key length, so if the key is one-tenth the length of the plaintext (it should be 30 times longer, at least), the keystream won’t repeat within the length of the message, and we may be back to the one-time pad situation if the resulting keystream has good randomness.
A quick example: since the plaintext encoded length is 689 digits, let’s try a key that encodes to one-tenth of that length, that is, 69 digits. For instance: “This is a not so short key. But not too long either.” Using this key text, we find that the keystream does not repeat. What’s worse, the distribution of digits in the resulting keystream is quite uniform. The shorter key text “This is a not so short key.” (35 digits after encoding) doesn’t lead to repetitions, either, but the resulting keystream does not look so random, with a superabundance of 2’s and 7’s and a dearth of other digits. However, “This is a short key” (25 digits) leads to a period of 125 digits, for a total of five and a half repetitions. So, it seems that in the worst case the keystream period is 5 times as long as the key (as far as we’ve found so far), with a factor of 10 being most frequent, and sometimes 20.
Conclusion: we can expect to be able to crack BookPad by conventional frequency analysis if the user does not follow instructions and uses a short key, with diminishing probability of success as the key gets longer. Once the user takes a key text that is at least one-tenth the length of the plaintext (after encoding), forget about it, even though technically there isn’t enough key entropy to meet the one-time pad criterion for unbreakability.
Clearly, reusing the key text for two messages of identical length breaks all security since then an attacker only needs to subtract the two encoded ciphertexts in order to find the difference between the two messages, which then can be easily attacked in many ways. But what if the messages are of different length? What if a different serial was used with the same key text, but the two serials differ only a little?
To test the first scenario (which will be revisited later), let us re-encrypt the original message with the original key, which is shown at the top, and then use the same key to encrypt the original message minus the final period (two digits shorter). Here’s the ciphertext for the second encryption:
Even though the plaintexts only differ by two digits at the end, the ciphertexts are completely different from the start. No similarity between the two is evident anywhere. This is because the two keystreams, having different lengths, become completely different due to the properties of lagged Fibonacci generators. At first, it doesn’t appear promising at all for the cryptanalyst, so we’ll try an easier target.
How about serials that only differ by a little? Let’s encrypt the original plaintext with the original key text, but this time using “1000” as serial, so a “1” will appear where before there was a “0” in the process of generating the keystream. Then the ciphertext becomes:
We do see some repeated digraphs between the two ciphertexts, but they are not at the same locations so we must conclude that they arise from normal random coincidences. Prominent features like the “9999” found in the first line of the original message don’t appear to be in the second message, and vice-versa. Both messages look as if encrypted with completely different key texts.
We hit pay dirt if the serial is “1900”, however, for then the ciphertext is:
which is the same as the original ciphertext, adding 1 (mod 10) to every digit except the first. This is because the shift in the LFG caused by the first serial digit ‘1’ is promptly canceled by the second ‘9’, resulting in a related keystream sequence. Serial “2800” shifts the whole thing by 2, which again essentially repeats the keystream, and there a number of such serials that lead to results too close to the original. Now, of course, this only works if the two messages are of identical encoded length and they use the same key test and serials whose digits add up to the same, which will happen every 10 times on the average (not bad at all).
Now, if we go back to the ciphertext encrypted with the serial “1000”, we notice that it is more similar to the original (serial “0000”) that we had been led to believe at first. Indeed, if I add the repeated stream “1234567890…” to the original, I get the one with serial “1000”. The effect is similar if the serial is “0100”, “0010”, or “0001”. If the serial is “2000”, then the keystream shifts by the sequence “24680…”, if the serial is “3000”, the string to the added is “3692581470…”, and so forth. Even better, the linearity of the LGF causes a serial composed of several figures to have an effect on the keystream that is the superposition of the effects of all its digits. In other words, utterly predictable, though apparently random at first. If instead of three rows of LFG the encrypted had used four, we’d still be able to predict the effect, though the sequences to be added would be a little more complex.
So, advice to the encryptor: don’t use a serial as described in the old BookPad spec. It gives no extra security because the linearity of the LFG causes the effect of the serial to be completely predictable so it can be removed, leaving behind the original key text. If this is repeated with two messages of equal length, all will be lost.
Shifts move all the digits in the affected parts of the key text, so the result of the sum changes unpredictably, but are they really secure? Not if the attacker has the plaintexts. This is not as unlikely as it seems: perhaps the attacker has planted some news in order to get the user to include certain unusual words in the plaintext, which can be recognized later on, or perhaps he got the plaintext from a different source, and now he is trying to find the origin of the key text. This is know as the “known plaintext attack.” Let us, therefore, assume that I, the attacker, have subtracted the known plaintext from two messages encoded with two different shifts (say, one unshifted and another where the first piece is shifted by just one space to the left), so that I have both keystreams. Let’s say one is “12345” and the other is “67890”; then the process of making the first keystream (assuming no LFG step) will be:
abcde +fghij ------ 12345
where “abcde” is the key text part that is shifted in the next message, and “fghij” is the result of adding the two parts that are not shifted. The process of making the second keytream (again, no LFG), is:
bcdea +fghij ------ 67890
putting the two together, we obtain the following linear system of equations, where the addition is modulo 10:
a + f = 1 ; b + g = 2 ; c + h = 3 ; d + i = 4 ; e + j = 5 ; b + f = 6 ; c + g = 7 ; d + h = 8; e + i = 9 ; a + j = 0
which is a system of 10 equations with 10 unknowns. Solving it we obtain the digits a through j. Then we decode it back to letters, resulting in a piece of the key text that a diligent search will identify as belonging to a certain book or other source. We’ve got it! Now, the example was too simplified, but it serves to illustrate that there is nothing the encryptor can do to stop us:
- If he shifts the part by a different amount, we can still set up the system of equations, so long as we know what the shift is. Now, the shift may be concealed within the ciphertext, but then we only have to try all likely locations (there’s only so many of them), and keep the one that leads to intelligible text in the solution.
- If the shift affects two parts rather than one, we’ll end up with a similar situation but involving three rows of unknowns instead of two. This only means that we’ll need three compromised messages rather than two in order to reach a solution. Since the point of the shift is to reuse a key text, it is very likely that three messages using the same key can be found if the encryptor has reused the key. Why use it only for two messages?
- Things won’t be much better for the encryptor if he uses a LFG, since the LFG can be reversed after making a few guesses, as we saw above. Nothing that a computer cannot handle.
But what if he does not shift the key text parts, but simply reuses the key text for messages of different lengths, which above we saw lead to very different-looking keystreams. Using the example above, the first, second, and third keystreams, for messages that are 5-digits, 4-digits, and 3-digits long would be, for instance:
abcde +fghij +klmno ------ 12345
abcd +efgh +ijkl ----- 6789
abc +def +ghi ----- 012
which results in a system of 5 + 4 + 3 = 12 equations with 15 unknowns. It is not strictly solvable, but it will be if we guess two of the unknowns, for a total of 10 x 10 = 100 possibilities. The right solution will be the one that yields intelligible text for “abcde”, “fghij”, “efgh”, and so forth. It is unlikely that two different solutions will pass this test. So it turns out that reusing a key text for messages of different lengths is not safe, either, even though the keystreams may look quite changed.
What can the encryptor do to thwart me? I can only think of two things:
- Never reuse a key text, instead of relying on a shift to make the keystream different each time, or relying on the fact that keystreams naturally look different for different lengths. That is, so long as an attacker is able to obtain the keystreams by subtracting known plaintexts with a known encoding (such as the default “Arsenio” encoding).
- If reusing a key is unavoidable, use a key text-derived encoding rather than the default, plus a shift. If the encoding is unknown, I won’t know what the correct keystream is, even if I know the plaintext, and the process I just described will be much more difficult to get started. I can still make guesses as to what the encoding is (essentially, it’s a permutation of the top row of the checkerboard and the other two rows), but there’s a lot of possibilities to try: 10!/2 for the first row, times 19! for the other two, for a total of 2.2E23, which will take a while to go through even on a powerful machine. If I work for the NSA, though, I might still be able to muster enough computer power to go through every possible encoding, after which solving the system will hardly take any extra time, and I will still be able to tell when I’ve got the correct solution, because all parts of the key text will be readable (a machine can detect this, by means of a dictionary).
In other words, the encryptor will do well never to reuse a key text, even with key text-derived encoding. After all, we are after perfect secrecy, and this implies NSA-proof secrecy, present and future. This is the reason why shifts were also removed from the BookPad spec. It is much easier to simply choose a different piece of key text for every message, rather than keep track of shifts and encodings in order to reuse a given key text.
Here we have to deal with the issue of “malleability,” which plagues all stream ciphers. An attacker knowing the plaintext (or a part of it, along with its location) will be able to make a ciphertext that decrypts to a plaintext of his choosing, by simply shifting the values in the ciphertext so the plaintext letters shift in the same way. This is possible whenever the attacker knows the encoding used. For instance, the initial “among” of the message is encoded as “1907595” with the default encoding. “Sugar” encodes to “3049512”, which has the same length. Therefore, if I take the ciphertext and add to it the number 1907595 – 3049512 = 8968083 (digit by digit, mod 10), it should decrypt to what I want. The altered ciphertext begins (changes in bold):
which decrypts to:
sugar other public buildings in a certain town...
which is exactly what I wanted to do.
To combat this, the user will either calculate a MAC and include it with the message (or not), which is extra work, or use a key text-derived encoding checkerboard. This is how a message authentication code (MAC) can be calculated (I assume it’s going to be 3 digits long):
- Since the MAC length is 3 digits, I set N = 3. I take a piece of encoded key text that I have not used yet with length 3N = 9 digits. For instance: 672930060
- Now I take the encoded plaintext (“Alice”), followed by the piece of encoded key text, and the shift or serial (if any), and write them row by row into N = 7 columns. Then I add each column without carries and get the result.
199 +692 +467 +293 +006 +0(shift here, wrapped if necessary) ----- 337
My friend on the other end goes through the exact same steps after he has obtained the encoded plaintext. Since he is using the same key text, he should calculate the same MAC (337), which I am sending along with the ciphertext. If the calculated value matches the received value, he knows there’s only one chance in a thousand that the message has been altered, otherwise most likely there has been an error or alteration.
This process of making a MAC doesn’t give away anything about the plaintext, because secret data having as much entropy (remember, entropy is the same as “unpredictability”) as the MAC itself has been mixed in. Those who don’t possess the key text would have to guess this data, which means that they can produce any three-digit MAC just by changing their guess. No good at all for me poor hacker.
Even worse, if the encryptor uses a key-derived encoding (and we saw above there are 2.2E23 possibilities), I cannot begin to modify the ciphertext so the plaintext is something different because I don’t know how the letters translate into digits. I could try to make some guesses, but for this attack I only have one chance since the recipient will realize that something is wrong as soon as the first word decrypts as gibberish while the rest make sense. I’m stuck.
Advice to the encryptor: making your own key-derived checkerboard takes very little time and protects you against this attack, even without a MAC. Much preferred.
We have subjected BookPad to three kinds of attack that might be likely in practice. BookPad remains secure when short keys, so long as they are at least one-tenth the length of the message. Short keys, on the other hand, don’t have the theoretical unbreakability of long key texts, and can be attacked by brute force. Related-serial attack has succeeded against users who reused the key text and relied on serials to make the keystream different, so serials were deprecated in the BookPad spec and replaced at first by key text shifts. Then shifts also fell to attack easily when the encoding was known, not so easily for key-derived encoding but still within the reach of an attacker with a lot of computing power, so it is better to never reuse a key text, period. Key-derived encoding works as well as calculating a MAC when protecting against a known plaintext attack, and is much quicker to do. In this other article, I found that the LFG step does not add security when the key text is the necessarily length, which is why the current BookPad spec does not include it.