## Extracting randomness from text

My BookPad cipher seems to be closely related to the running key cipher, since both take a long piece of text from a book and use it as a key to encrypt a plaintext. Yet while the running key cipher can be broken easily, BookPad offers a level of security comparable to that of a one-time pad. In this article, I try to explain why in layman’s terms. As a bonus, I introduce TripleText, a variant of BookPad where all the operations are done directly with letters.

A little spoiler: not much prevented the man pictured on the left (Johannes Trithemius) from discovering this, back in the early XVI century. Had he discovered it, history might have turned out quite differently.

This JavaScript version of BookPad displays the result of several randomness tests performed on the keystream right before it is used to encrypt a message, with and without using a lagged Fibonacci generator (LFG) step, or a key text shift. Big surprise: when a piece of real text is placed in the key text box, the keystream usually passes all the randomness tests even if the process of randomizing the digits by means of a LFG is skipped. This means that a user may be able to skip half the work and still get something approaching perfect security.
First, let’s try to define what “passing the tests” means. To do this, I headed to random.org, which provides strings of numbers from true random sources such as atmospheric noise, and got 1000 integers in the 0 to 999,999,999 range. I added the missing zeroes on the right of some numbers by hand in order to obtain a string of 9000 supposedly perfectly random decimal digits, which I won’t print here. Then I pasted it into the key text box of BookPad, opened up a JavaScript console, and applied several tests to this box. The tests were:

1. Chi-squared statistic of single digits, compared to a uniform distribution (probability for each value = 1/10). Since a decimal digit string has 9 degrees of freedom (10 − 1), theory predicts that a value of less than 14.7 means less than 10% probability that the stream is not random (see this table, for instance). The sample gave 11.65
2. Chi-squared statistic of pairs of digits, again compared to a uniform distribution (probability for each pair = 1/100). Here the degrees of freedom are 9*9 = 81, so to have less than 10% probability that the stream is not random, the statistic should be less than 97.7. The sample gave 105.57
3. Durbin-Watson statistic, which estimates the correlation between pairs of consecutive digits. If there is no correlation the value should be 2.0. Less than that, down to a minimum of 0, means that they are negatively correlated; more than that, up to a maximum of 4, means they are positively correlated. The sample gave 2.01
4. Shannon entropy, which looks at single digits and determines how much unpredictability each one has, in bits per digit, up to a maximum of log(10)/log(2) = 3.32193 for perfect randomness. This is related to the chi-squared statistic because it is derived strictly from single-digit frequencies. The sample gave 3.32099 bits/digit
5. Runs test. Here we look at how often the digits switch from low (0 to 4) to high (5 to 9), or vice-versa. A perfectly random sequence, where each digit is truly independent of the previous digits, would switch every two digits on average, but one that is not random may switch more or less often due to lingering correlations between consecutive digits. The sample gave 2.0008 digits/run

So, the “true random” sample passed all the tests (at the 10% significance level, for the chi-squared tests) except the two-digit chi-squared test, and even that one was close. It is likely that the sample was not long enough for this test, which is extremely sensitive, to come down to its expected value for infinitely long samples, so we’ll take a value less than 105 to be acceptable for a sample of less than 10000 digits.
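For readers who want to reproduce numbers like these, here is a sketch of the five tests in plain JavaScript. I haven’t seen the test code inside BookPad, so these are generic textbook versions; in particular, details such as using overlapping pairs and mean-centered residuals are my own assumptions.

```javascript
// Generic implementations of the five randomness tests described above,
// applied to a string of decimal digits. Not the actual BookPad code.

function digitCounts(s) {
  const counts = new Array(10).fill(0);
  for (const ch of s) counts[parseInt(ch, 10)]++;
  return counts;
}

// 1. Chi-squared of single digits against a uniform distribution (9 d.o.f.)
function chiSquared(s) {
  const expected = s.length / 10;
  return digitCounts(s)
    .reduce((acc, c) => acc + (c - expected) ** 2 / expected, 0);
}

// 2. Chi-squared of digit pairs against uniformity (overlapping pairs assumed)
function chiSquaredPairs(s) {
  const counts = new Array(100).fill(0);
  for (let i = 0; i + 1 < s.length; i++) counts[+s[i] * 10 + +s[i + 1]]++;
  const expected = (s.length - 1) / 100;
  return counts.reduce((acc, c) => acc + (c - expected) ** 2 / expected, 0);
}

// 3. Durbin-Watson statistic on mean-centered digits (2.0 = no correlation)
function durbinWatson(s) {
  const d = [...s].map(Number);
  const mean = d.reduce((a, b) => a + b, 0) / d.length;
  const e = d.map(x => x - mean);
  let num = 0, den = 0;
  for (let i = 0; i < e.length; i++) {
    if (i > 0) num += (e[i] - e[i - 1]) ** 2;
    den += e[i] ** 2;
  }
  return num / den;
}

// 4. Shannon entropy from single-digit frequencies, in bits/digit (max 3.32193)
function shannonEntropy(s) {
  const n = s.length;
  return digitCounts(s).reduce((acc, c) =>
    c ? acc - (c / n) * Math.log2(c / n) : acc, 0);
}

// 5. Runs test: average run length of low (0-4) / high (5-9) digits (ideal 2.0)
function runsTest(s) {
  let runs = 1;
  for (let i = 1; i < s.length; i++) {
    if ((+s[i] >= 5) !== (+s[i - 1] >= 5)) runs++;
  }
  return s.length / runs;
}
```

Pasting a digit string into a console with these functions loaded gives the same kind of figures as quoted above, though the exact values will differ if BookPad counts pairs differently.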

So how does a piece of real text fare? I headed to gutenberg.org and got the text version of “Oliver Twist,” by Charles Dickens. I copied chapter 1 (no headings) into the plain text box of BookPad, resulting in a decimal-encoded text of length 8220 digits. I need at least three times that length in the key text box, so I copied chapter 2 (again no headings) into that box, which the JavaScript version of BookPad took as sufficiently long. Here are the statistics for the keystream generated from the key text with the default “Arsenio” encoding and no shift being used, before the lagged Fibonacci step, and after:

• Before LFG: chi-squared = 12.476. Two-digit chi-squared = 114.75. DW statistic = 1.979. Shannon’s entropy = 3.3208 bits/digit. Runs = 1.9985 digits/run.
• After LFG: chi-squared = 8.6155. Two-digit chi-squared = 118.93. DW statistic = 2.009. Shannon’s entropy = 3.3211 bits/digit. Runs = 2.0112 digits/run.

Both streams passed the same tests as the true random sample, missing two-digit chi-squared by a little more, but not a lot (possibly because the text sample is a little shorter). The LFG-processed keystream actually beat the “true random” sample in three tests.

In other words, they may not be truly random, but we can’t tell the difference. Even before the LFG step, which essentially doubles the amount of computation, the keystream is good enough for the purpose of encrypting a message by the one-time pad method.

I’ve performed the same test with a number of different texts (not always taken from Dickens) and in different languages: English, Spanish, German (lots of tildes and umlauts eliminated before encoding into numbers, in the last two). The result is always the same, except that oftentimes the keystream before LFG actually beats the one after the LFG. Therefore, I have revised the official description of the BookPad cipher to remove the LFG. The matching JavaScript code can be found here.

Now, how can this be possible? One of the first things you learn in spy school is to never reuse a one-time pad, because then an adversary can subtract the two ciphertexts, thus eliminating the keystream and obtaining as a result the difference between the two encoded plaintexts, which can then be attacked by statistical methods. In other words, the combination of two encoded plaintexts is expected to be sufficiently non-random to allow statistical attack. But BookPad takes the encoded key text, splits it into three parts of equal length, and simply adds them together (digit by digit, without carries) before the LFG step. So it seems that adding or subtracting two pieces of encoded text leaves statistical signatures in the result, but adding three pieces does not, if we are to believe the test results. How can this be?
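The core step described above — splitting the encoded key text into three equal parts and adding them digit by digit, without carries — can be sketched like this (the function name is mine, and the checkerboard encoding that happens beforehand is not shown):

```javascript
// Split a string of decimal digits into three parts of the required length
// and add them position by position, mod 10 (i.e. without carries).
// This is the keystream-generation step; encoding the key text to digits
// happens before this and is not shown here.
function tripleAdd(keyDigits, streamLength) {
  if (keyDigits.length < 3 * streamLength) {
    throw new Error("key text too short: need 3x the plaintext length");
  }
  let keystream = "";
  for (let i = 0; i < streamLength; i++) {
    const sum = (+keyDigits[i] +
                 +keyDigits[i + streamLength] +
                 +keyDigits[i + 2 * streamLength]) % 10;
    keystream += sum;
  }
  return keystream;
}
```

For example, `tripleAdd("123456", 2)` splits the digits into "12", "34" and "56" and returns "92", since 1+3+5 = 9 and 2+4+6 = 12 → 2.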

Indeed, just encoding the key text does not make it random. The chi-squared statistic of Oliver Twist’s chapter 2, after encoding (29549 digits), is 5629.28 and the two-digit chi-squared is 28898.48. Durbin-Watson looks a little better at 1.9615, Shannon entropy is 3.1845 bits/digit, and runs take 2.10924 digits/run. If we split it into two halves and do a mod 10 addition of those, the chi-squared of the result is a much-lower-but-not-quite-low-enough 63.0326, the two-digit chi-squared drops to 330.27, Durbin-Watson is 2.02, Shannon entropy 3.3188 bits/digit, runs is 2.033 digits/run. If we subtract the parts rather than add them, we get statistics similar to those of the addition. That is, it is still possible to tell the result apart from random, at least through the chi-squared tests. This is the essential flaw of the running key cipher, so similar to BookPad, except that the key text is used straight, without trying to concentrate its entropy in any way. So straight encoded text, or one text added to or subtracted from another, retains exploitable statistics from the original texts, and yet if we involve three pieces of encoded text instead of two, the result appears to be as statistically random as a true random source. How come?

Here’s the way I would explain it. It is by no means a mathematically acceptable proof, but I think it captures the essence of what is going on:

Recall that we had decided to use a piece of key text that is three times the length of the message (after encoding) because of this calculation: 3.32193 (entropy in bits/digit of a random string of decimal digits) x 1.33 (digits/character in encoding) / 1.58 (average entropy of text in bits/character) = 2.8, so we chose 3 to be sure. The idea was that the piece of key text used would have the same entropy as a decimal random string of length equal to that of the encoded message, so we could use it to make a sort of one-time pad. Since the straddling checkerboard encoding we are using produces about 1.33 digits/character (determined experimentally), we wanted to satisfy this formula: 3.32193 x (1.33 x message length) = entropy needed = entropy supplied = message length x factor x 1.58 (average entropy of text) in order to find the factor indicating how many times longer (before or after encoding) the key text must be, relative to the plaintext.

In other words, a piece of key text three times the length of the plaintext (before or after encoding) is expected to have enough entropy to generate a keystream for that plaintext length that would be statistically indistinguishable from a true random keystream. This is what we observe. Typically, randomness extraction (this is the technical term for what we are doing) requires the use of a one-way function or hash possessing several strict properties, but here we are achieving a comparable result with a simple sum.

The LFG step in BookPad is intended to spread the effect of any single-digit change in the sum to the complete keystream (one of the important properties of a hash) and to smooth out non-uniformities that might be present in the sum, and indeed it does improve the keystream considerably if the key text is too short. If we use Oliver Twist’s chapter 1 as key text (same as the plaintext), chi-squared becomes 23.058 and two-digit chi-squared is 118.38 before the LFG step, but after the LFG step they become 6.6204 and 74.450, respectively. The effect intensifies as the key text shortens, and becomes especially pronounced for keys containing 10 or fewer words (which, of course, should never be used to make a one-time pad replacement), so that the keystream after the LFG passes the randomness tests even though it starts repeating after a certain point.

Going back to the result of the sum before the LFG is applied, the crucial question is whether or not the process of cutting the encoded key text into three parts and adding them preserves the entropy of the key text. Let’s say its average entropy is one-third of the random value, that is, 3.32193 / 3 = 1.1073 bits/digit (it is actually a little higher, as we saw earlier). Let’s further assume for the time being that consecutive digits are independent of each other (they are not, since in the encoding used consecutive digits often derive from a single letter) so that the entropy of a group of digits is the sum of the entropies of all the digits. If we take groups of three consecutive digits, each group will have an entropy of (3.32193 / 3) x 3 = 3.32193 bits, which is the same as that of a truly random decimal digit.

So let us now make a code that assigns a single-digit value to each group of three digits. Since there are only 10 different single digits but 1000 different groups of three digits (10 x 10 x 10), this code will convert 100 different groups to the same decimal digit, which is okay because no particular group appears often enough for that to matter. Any function that does this will work: sums, subtractions, more complex calculations, codebooks; what is essential is that each resulting digit corresponds to a set of groups with equal total a priori probability, and that no group is ever mapped to two different digits. We can come up with many such mappings, but a simple sum of the digits in the group, mod 10, does the trick just fine. This is what is done in BookPad.

In fact, in BookPad we are not adding three consecutive key text digits, which might not be independent of each other, but digits separated by the length of the encoded plaintext. It is highly unlikely that any correlation would extend that far in a real text, so we can safely assume that the digits to be added are indeed independent. Therefore, each digit after the addition potentially has as much unpredictability as three digits before the addition, resulting in the full entropy of a random digit.

If instead of cutting the text into three pieces we had cut it into two, which we then added together, we’d be talking about groups of two digits, each having entropy of 2 x 1.1073 = 2.2146 bits/group. If we map this to single digits by modular addition, each digit would have 2.21 bits of entropy, which is less than truly random, and it shows in the statistical tests.
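The difference between adding two parts and adding three can be checked exactly, without any sampling, by convolving a skewed single-digit distribution with itself mod 10 and computing the entropy of the result. The skewed distribution below is made up for illustration; it just stands in for the non-uniform digit frequencies of encoded text:

```javascript
// Entropy (bits) of a discrete probability distribution
function entropy(p) {
  return p.reduce((h, x) => x > 0 ? h - x * Math.log2(x) : h, 0);
}

// Distribution of (X + Y) mod 10 for independent digits X ~ p, Y ~ q
function convolveMod10(p, q) {
  const r = new Array(10).fill(0);
  for (let i = 0; i < 10; i++)
    for (let j = 0; j < 10; j++)
      r[(i + j) % 10] += p[i] * q[j];
  return r;
}

// A made-up skewed digit distribution, standing in for encoded text
const skewed = [0.25, 0.15, 0.12, 0.10, 0.09, 0.08, 0.07, 0.06, 0.05, 0.03];

const sum2 = convolveMod10(skewed, skewed);  // two parts added together
const sum3 = convolveMod10(sum2, skewed);    // three parts added together

// The entropy climbs toward log2(10) = 3.32193 bits as more independent
// parts are added, but never quite reaches it:
//   entropy(skewed) < entropy(sum2) < entropy(sum3) < 3.32193
```

Running this shows the three-part sum sitting much closer to the uniform limit than the two-part sum, which is the pattern the chi-squared tests picked up on the real texts.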

But now, and here’s the kicker, if we add or subtract to this the encoded plaintext, which has the same average entropy per digit, the resulting ciphertext has a full 3.32 bits/digit of entropy. In other words, the ciphertext is statistically indistinguishable from a random string and cannot be decrypted except by brute force. Brute force would still work in principle here, however, because the key text is only twice as long as the plaintext and does not contain as much entropy as a random string of plaintext length, so it is possible to tell, in principle, whether a guess at the plaintext is correct or not. If it is correct, the keystream deduced by subtracting plaintext and ciphertext will have less than perfect randomness. If a triple-length key text, later divided into three parts and added together, is used, however, brute force is of no avail, for then one can always find a key for every tentative plaintext one can think of which, combined with it, will yield the ciphertext at hand.

On the other hand, it is easy to prove that simply adding digits does not conserve their entropy. Let us see this with a simplified example involving binary strings made of ones and zeroes. Let’s say we have a particular string with exactly twice as many zeroes as ones. Then the probability of a zero is 2/3 and that of a one is 1/3. Now let us cut the string in two pieces and add them without carries, which for binary digits is the XOR operation, where: 0 XOR 0 = 1 XOR 1 = 0, and 0 XOR 1 = 1 XOR 0 = 1. Then the probability of getting a zero at a given position in the result is 2/3 x 2/3 + 1/3 x 1/3 = 5/9, and the probability of getting a one is 2/3 x 1/3 + 1/3 x 2/3 = 4/9. That is, the likelihood of getting a one or a zero in the sum is closer to the 1/2 of the perfectly random sequence, but not quite. What’s worse, cutting this into halves and XORing them together again does not get us there, because then the probabilities of zero and one are 41/81 and 40/81 respectively. In fact, we can never get to a probability of 1/2 for either digit because the denominator of a probability thus calculated will always be a power of three, which can never be simplified to 2 no matter what the numerator might be. The original Shannon entropy was -2/3 x log2(2/3) – 1/3 x log2(1/3) = 0.9183 bits/digit, so if the entropy were conserved by the operation, we’d get over the 1 bit/digit of the perfectly random string (except that it cannot get any bigger than this). Instead we get – 5/9 x log2(5/9) – 4/9 x log2(4/9) = 0.9911, which is close to 1.0, but not quite. In other words, the entropy per digit of the sum is greater than that of the parts being added, but it can never reach the entropy of a perfectly random string.
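The arithmetic in this binary example is easy to verify mechanically (function names are mine):

```javascript
// Probability of a zero after XORing two independent bits, each with
// P(0) = p0: the result is 0 exactly when the two bits are equal.
function xorZeroProb(p0) {
  return p0 * p0 + (1 - p0) * (1 - p0);
}

// Shannon entropy of a single bit with P(0) = p, in bits
function entropyBit(p) {
  return -p * Math.log2(p) - (1 - p) * Math.log2(1 - p);
}

const p0 = 2 / 3;                        // the skewed string: twice as many zeroes
const afterOne = xorZeroProb(p0);        // 5/9 after one folding
const afterTwo = xorZeroProb(afterOne);  // 41/81 after folding again

// entropyBit(2/3) = 0.9183..., entropyBit(5/9) = 0.9911... — closer to
// 1 bit/digit each time, but the denominators stay powers of three, so
// the probability never reaches exactly 1/2.
```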

This is because summing strings without carries is an irreversible operation, which loses entropy. It’s like thermodynamic entropy, only backwards; in thermodynamics, irreversible processes generate entropy rather than destroy it. If we wanted to conserve the information entropy of a string as it is shortened, we should have applied a reversible process such as data compression. The problem is that this takes a lot of processing, which isn’t the best thing for a paper and pencil cipher.

The encoding from letters to numbers used in BookPad does indeed involve a bit of compression. After all, one of the good features of a straddling checkerboard encoding is that it uses single digits for the most frequent characters and double digits for the rest, thus providing some compression that is bound to concentrate the entropy into fewer digits. Indeed, for perfectly random text containing 26 letters, the entropy per character is log10(26) / log10(2) = 4.70044 bits/letter, while for perfectly random decimal digits the entropy is log10(10) / log10(2) = 3.32193 bits/digit, that is, we need 4.70044 / 3.32193 = log10(26) = 1.415 times more digits than letters to represent the same entropy. Another way to obtain the same result is to consider the length of a long integer expressed in decimal digits or expressed in base 26 (A to Z, for instance). If the number has N decimal digits, all of them nines, its value is 10^N – 1, which is just one unit short of 10^N. The same number expressed in base 26 will span a number M of letters so that 10^N = 26^M, yielding N/M = log10(26) as before.

But typically a text encoded with the Arsenio checkerboard or one of its key-derived variants only takes 1.33 times more digits than letters (including the space). This is because two-thirds of the characters in a typical text are contained in “arsenio” (plus the space) and are therefore converted to single decimal digits, while the remaining third is converted to double digits. Consequently, checkerboard encoding increases the length of a typical text by a factor of 2/3 x 1 + 1/3 x 2 = 4/3 = 1.33, which is what we observe experimentally. Therefore, the encoding compresses the text by a factor of 1.415 / 1.33 = 1.064, making the result about 6% shorter than it would have been with a non-compressing encoding.
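To make the mechanism concrete, here is a hypothetical straddling checkerboard in the style of “Arsenio”. I am assuming a particular digit assignment; BookPad’s actual table (which can also be key-derived) may differ, but the expansion arithmetic comes out the same:

```javascript
// A hypothetical straddling checkerboard: the eight most frequent
// characters get single digits 0-7, and the digits 8 and 9 act as
// prefixes for two rows holding the remaining characters.
// This layout is assumed for illustration, not BookPad's actual table.
const single = "ARSENIO ";    // -> digits 0..7 (space included)
const row8 = "BCDFGHJKLM";    // -> 80..89
const row9 = "PQTUVWXYZ.";    // -> 90..99

function encodeChar(ch) {
  let i = single.indexOf(ch);
  if (i >= 0) return String(i);
  i = row8.indexOf(ch);
  if (i >= 0) return "8" + i;
  i = row9.indexOf(ch);
  if (i >= 0) return "9" + i;
  throw new Error("character not in table: " + ch);
}

function encode(text) {
  return [...text.toUpperCase()].map(encodeChar).join("");
}

// Frequent characters cost one digit and the rest cost two, so typical
// text grows by roughly 2/3 x 1 + 1/3 x 2 = 1.33 digits per character.
```

For instance, `encode("RAT")` gives "1092": R and A are single digits, while T straddles into the 9-row.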

Intuitively, it doesn’t look like this small amount of compression could be responsible for the amazingly good randomness of the triple text combination, but in order to dispel all doubts I made a simplified cipher where no encoding to digits is used. I call it “TripleText”, and you can find a JavaScript implementation matching this article in this page (the most current version is here). Then I ran the same tests on its output.

Instead of adding and subtracting decimal digits as in BookPad, we are adding and subtracting the letters A to Z, which are base 26 digits. These operations can be performed easily with the help of a Tabula Recta:

```
    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
    ---------------------------------------------------
A | A B C D E F G H I J K L M N O P Q R S T U V W X Y Z | a
B | B C D E F G H I J K L M N O P Q R S T U V W X Y Z A | z
C | C D E F G H I J K L M N O P Q R S T U V W X Y Z A B | y
D | D E F G H I J K L M N O P Q R S T U V W X Y Z A B C | x
E | E F G H I J K L M N O P Q R S T U V W X Y Z A B C D | w
F | F G H I J K L M N O P Q R S T U V W X Y Z A B C D E | v
G | G H I J K L M N O P Q R S T U V W X Y Z A B C D E F | u
H | H I J K L M N O P Q R S T U V W X Y Z A B C D E F G | t
I | I J K L M N O P Q R S T U V W X Y Z A B C D E F G H | s
J | J K L M N O P Q R S T U V W X Y Z A B C D E F G H I | r
K | K L M N O P Q R S T U V W X Y Z A B C D E F G H I J | q
L | L M N O P Q R S T U V W X Y Z A B C D E F G H I J K | p
M | M N O P Q R S T U V W X Y Z A B C D E F G H I J K L | o
N | N O P Q R S T U V W X Y Z A B C D E F G H I J K L M | n
O | O P Q R S T U V W X Y Z A B C D E F G H I J K L M N | m
P | P Q R S T U V W X Y Z A B C D E F G H I J K L M N O | l
Q | Q R S T U V W X Y Z A B C D E F G H I J K L M N O P | k
R | R S T U V W X Y Z A B C D E F G H I J K L M N O P Q | j
S | S T U V W X Y Z A B C D E F G H I J K L M N O P Q R | i
T | T U V W X Y Z A B C D E F G H I J K L M N O P Q R S | h
U | U V W X Y Z A B C D E F G H I J K L M N O P Q R S T | g
V | V W X Y Z A B C D E F G H I J K L M N O P Q R S T U | f
W | W X Y Z A B C D E F G H I J K L M N O P Q R S T U V | e
X | X Y Z A B C D E F G H I J K L M N O P Q R S T U V W | d
Y | Y Z A B C D E F G H I J K L M N O P Q R S T U V W X | c
Z | Z A B C D E F G H I J K L M N O P Q R S T U V W X Y | b
```

To add two letters: find one letter on the top and the other on the left side; follow the column and row, respectively; the letter found at the intersection of the two is the sum. Example: R + H = Y. To subtract two letters: find the first letter on the top and the second letter (the one being subtracted) on the right side; the letter at the intersection of that column and row is the difference. Example: R – h = K.
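The Tabula Recta is just addition and subtraction mod 26 in disguise, so the two lookups above can be written as (function names are mine):

```javascript
// Letter addition and subtraction mod 26, equivalent to looking the
// letters up in the Tabula Recta above.
const A = "A".charCodeAt(0);

function addLetters(x, y) {
  return String.fromCharCode(
    A + (x.charCodeAt(0) - A + y.charCodeAt(0) - A) % 26);
}

function subLetters(x, y) {
  return String.fromCharCode(
    A + (x.charCodeAt(0) - y.charCodeAt(0) + 26) % 26);
}

// Matches the worked examples: R + H = Y, R - H = K
```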

The rest is very similar to BookPad. Here are the steps for encrypting:

1. Remove all spaces, punctuation, and diacritical marks from the plaintext and the key text so only the characters A to Z remain.
2. Measure the length of the processed plaintext and take a piece of processed key text three times as long.
3. Split the key text piece into three parts of equal length and add them letter by letter using the Tabula Recta, which implements addition and subtraction mod 26. The result is the keystream.
4. Now subtract the processed plaintext from the keystream using the Tabula Recta. The result is the ciphertext.

The process for decrypting a ciphertext is identical, except that the ciphertext replaces the plaintext in all steps.
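The four steps above can be sketched in JavaScript as follows. Function and variable names are mine; the reference implementation is the page linked earlier in this article:

```javascript
// A sketch of TripleText following the four steps described above.
const A = "A".charCodeAt(0);
const mod26 = n => ((n % 26) + 26) % 26;
const toNum = ch => ch.charCodeAt(0) - A;
const toChr = n => String.fromCharCode(A + n);

// Step 1: keep only the characters A to Z
function clean(text) {
  return text.toUpperCase().replace(/[^A-Z]/g, "");
}

// Steps 2-3: take a piece of key text three times the message length,
// split it into three parts, and add them letter by letter mod 26
function keystream(keyText, length) {
  const k = clean(keyText);
  if (k.length < 3 * length) throw new Error("key text too short");
  let out = "";
  for (let i = 0; i < length; i++) {
    out += toChr(mod26(
      toNum(k[i]) + toNum(k[i + length]) + toNum(k[i + 2 * length])));
  }
  return out;
}

// Step 4: subtract the message from the keystream. Because
// keystream - (keystream - plaintext) = plaintext, the very same
// function also decrypts, as described above.
function tripleText(message, keyText) {
  const m = clean(message);
  const ks = keystream(keyText, m.length);
  let out = "";
  for (let i = 0; i < m.length; i++) {
    out += toChr(mod26(toNum(ks[i]) - toNum(m[i])));
  }
  return out;
}
```

Calling `tripleText` twice with the same key text is a round trip: encrypting "HELLO WORLD" and then feeding the ciphertext back in returns "HELLOWORLD".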

If we take the 9000 true random digits that we used earlier and convert them to base 26, we obtain a string of 6360 presumably random letters (check: 9000 / 6360 = 1.415, which is the expected digits-to-letters ratio for random data). The statistical tests applied to this, slightly modified for base-26 letters rather than base-10 digits, give these results: chi-squared = 26.58113 (less than 34.4 means random with less than 10% chance of error), two-letter chi-squared = 720.0586 (less than 670.7 means random with less than 10% chance of error), Durbin-Watson = 2.0302 (should be close to 2.0), Shannon entropy = 4.69736 bits/letter (should approach 4.7 for perfect randomness), runs at 1.9408 letters/run (should be close to 2.0). That is, every test passes except the two-letter chi-squared test, which misses the 10% cutoff.

Does a real text cut into three parts and added together fare any worse? Should it? If the argument I made for the digits in BookPad is correct, then it seems it should, but by a narrower margin. Since the average entropy of text (spaces included, so if we take them out, the entropy should be a little higher) is measured at 1.58 bits/letter, then to get the same entropy as a random string of letters we need a piece of text that is 4.70044/1.58 = 2.975 times the length of the random string. Therefore, a triple-length string would still suffice, though not by much.

We again take the first chapter of “Oliver Twist” as plaintext and the second chapter as key text, resulting in these statistics for the keystream: chi-squared = 36.272, two-letter chi-squared = 725.51, DW statistic = 2.028, Shannon’s entropy = 4.6949 bits/letter, runs at 1.9706 letters/run. In other words, it passes the same tests that the random text passes, and “fails” the one it fails to pass at the prescribed confidence level, with a very similar score. It is, as far as our tests can tell, indistinguishable from a truly random keystream. The cipher is not theoretically unbreakable because, even though the entropy of the key text is sufficient to guarantee this, some entropy is lost in the keystream-generation process, so the result can never be perfectly random. It does come close, though, and it only takes a piece of text that is three times as long as the random text we require, split into three parts of equal length, and added together with the help of a classic Tabula Recta.

Both the Tabula Recta (Trithemius, 1508 AD) and the running key cipher on which TripleText is based were devised back in the XVI century. Nothing prevented the cryptographers of that time from coming up with a nearly unbreakable cipher like TripleText, except perhaps scarcity of books to use as source material and lack of development of information theory (not much of it being needed, anyway). Had they invented something like this back then, history might have turned out quite different from the way it did.

## Cracking the BookPad cipher

BookPad is a paper and pencil “one-time pad” cipher, described in this other post. Real cryptographers are leery of ciphers claiming to be incarnations of the unbreakable one-time pad for good reasons, the best of them being that true one-time pads must contain perfectly random numbers, which not even a computer can produce. In this post, therefore, I put on my cryptanalyst’s hat and attempt to break a longish message encrypted with BookPad.

Who will win? Find out after the break.

First, a clarification. I started writing this article with a version of BookPad that used serial numbers, which were combined with the key text to make the seed for a lagged Fibonacci generator (LFG). The research shown below concluded that serial numbers were insecure and the LFG was likely unnecessary, and so the BookPad spec was subsequently changed, along with the JavaScript program at this link. Serials were replaced by shifts and the LFG was made optional (but available), but again research showed that it was better to eliminate the shift and the LFG altogether (as well as the optional MAC, for that matter). We must, therefore, begin with some background on the components that were studied but are no longer in BookPad, using some older JavaScript code as we go along.

The whole point of BookPad is to generate a keystream starting from a piece of text, so that the keystream is as random and as easy to produce as possible. Additionally, it would be nice if the resulting keystream were non-invertible; otherwise an adversary might be able to obtain the key text from the keystream and, by recognizing its source, end up compromising all our communications. This can be done with the help of a PRNG, which does not even need to be cryptographically strong (meaning that a sequence of bits much longer than the seed cannot be predicted from previous bits), because we are going to use it only to generate short sequences, shorter than the seed itself.

Of the various PRNGs available that can be run on paper and pencil, lagged Fibonacci generators are especially interesting because they are so easy to do by hand. LFGs are based on a variant of the Fibonacci series, where each number is the sum of the preceding two: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55. . . . To make a PRNG, we don’t take the full numbers, but rather we subtract a multiple of 10 as needed so the result is less than 10 (or we neglect to carry digits as we add), like this: 1, 1, 2, 3, 5, 8, 3, 1, 4, 5. . . . Then, a lagged Fibonacci series doesn’t add the last two numbers, but something further back. For instance, if we add the last and third to last numbers (use zero if there is nothing there), we get: 1, 1, 1, 2, 3, 4, 6, 9, 3, 9, 8, 1, 0, 8. . . which is different from the straight Fibonacci series.

A better way to do this is to write the numbers in separate rows, so we are adding the top and bottom numbers to get the next one. For instance, if each number is the carryless sum of the previous number and the one six places before it (that is, the number sitting directly above the previous one when the digits are written in rows of five), we can set them up into rows of five like this, where the first row and the first number of the second, in bold, come from the seed (496737) that I have used to initialize the PRNG:

```
4 9 6 7 3
7 1 0 6 3
6 3 4 4 0
3 9 2 6 0
0 3 2 4 0
```

And so on. . . . You can see that, adding without carries, 4+7=11→1, which I write to the right of the 7; then 9+1=0, 6+0=6, 7+6=3, 3+3=6 (this one goes on the next row), etc. You can check that changing a single digit in the seed changes all the numbers in the third row (although compensating changes to two different seed digits may produce the same third row; that is no longer possible from the fourth row on). This is a good property, which makes the result that much harder to predict. The fourth row is even better, because a single change alters the result in a more complex way (with only three rows, all the numbers are simply shifted up or down). But for the rest of our present discussion we are going to take as the “result” value the digits in the third row. In our example, this will be: 63440.
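For readers who want to check the arithmetic, here is the generator in JavaScript (the function name is mine; the lag equals the seed length, six in this example, so each new digit is the mod-10 sum of the previous digit and the one directly above it):

```javascript
// Lagged Fibonacci generator as used in the example above: each digit
// is the carryless (mod 10) sum of the previous digit and the one a
// full seed-length back, then the sequence is laid out in rows.
function lfgRows(seed, rowLength, rows) {
  const s = [...seed].map(Number);
  const lag = seed.length;
  const total = rowLength * rows;
  while (s.length < total) {
    s.push((s[s.length - 1] + s[s.length - lag]) % 10);
  }
  const out = [];
  for (let r = 0; r < rows; r++) {
    out.push(s.slice(r * rowLength, (r + 1) * rowLength).join(""));
  }
  return out;
}

// Seed 496737 reproduces the rows worked out by hand:
// lfgRows("496737", 5, 3) -> ["49673", "71063", "63440"]
```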

To qualify as a “hash” of sorts, the operation must also be one-way, so that having the result will not reveal the seed. As it stands this LFG fails this test, for it is possible to nearly reconstruct the second row from the third, and then we have this, where the letters represent unknowns with possible values from 0 to 9:

```
a b c d e
7 1 0 6 f
6 3 4 4 0
```

And then we can establish this system of equations (addition is mod 10):

```
a + 7 = 1
b + 1 = 0
c + 0 = 6
d + 6 = f
e + f = 6
```

That is, five equations with six unknowns. It is not strictly solvable, but we can guess one of the unknowns and then obtain all the rest. If the seed obtained this way has some recognizable pattern (because it comes from actual text), then we might be able to tell which of the ten possible guesses is the good one, and the problem is solved. The problem doesn’t get any harder to solve if we take the fourth row instead of the third, only longer, adding five more unknowns but also five more equations.
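The guessing procedure can be made concrete: fix a guess for f, and every other unknown follows by mod-10 subtraction from the equations above. This little sketch (function name is mine) enumerates all ten candidate first rows; the true one from the example, 49673, is among them:

```javascript
// Enumerate the ten candidate solutions of the system above, one per
// guess of f. Note that a, b and c are the same in every candidate;
// only d and e depend on the guess.
function candidateFirstRows() {
  const rows = [];
  for (let f = 0; f < 10; f++) {
    const a = (1 - 7 + 10) % 10;   // a + 7 = 1  ->  a = 4
    const b = (0 - 1 + 10) % 10;   // b + 1 = 0  ->  b = 9
    const c = (6 - 0 + 10) % 10;   // c + 0 = 6  ->  c = 6
    const d = (f - 6 + 10) % 10;   // d + 6 = f
    const e = (6 - f + 10) % 10;   // e + f = 6
    rows.push([a, b, c, d, e].join(""));
  }
  return rows;
}

// The true first row of the worked example, 49673, is the candidate
// obtained with the guess f = 3.
```

Only some outside knowledge, such as the seed having to look like encoded text, can pick the right candidate out of the ten.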

In other words, to make this non-invertible we need to remove information. One easy way is to output a string shorter than a full row. One unknown is added to the problem for every digit skipped. When the number of digits skipped equals the number of digits in the output (that is, when we output only half of the last row), our guess adds as much uncertainty as the data we have. In that case, we can always reproduce the half of the last row we have by choosing a certain guess for the other half, regardless of what the seed might be, so that we can never tell what the correct seed is.

But ultimately there is no need to make this non-invertible, especially if the price to be paid for it is to throw away some of the key text, which effectively only gets used for this purpose. In this other article, I show that the LFG might not be necessary at all, since the seed fed into the LFG is as random as the result of the LFG, provided the key text was indeed sufficiently long.

In his excellent article on one-time pads, Dirk Rijmenants insists that real one-time pads have failed in practice not because the cipher is not inherently unbreakable, but because people made mistakes that reduced its security. This was also the case with the Enigma machine of World War II and many other ciphers, especially when some complexity was involved on the user’s side. Well, BookPad is not free of complexities, so let us begin by trying to exploit the different ways people can misuse the cipher. Here’s a list:

1. BookPad is designed so the key text is at least three times the length (as encoded into numbers) of the plaintext, but nothing prevents a lazy user from taking a much shorter key text and just repeating it several times in order to achieve the required length.
2. BookPad is a stream cipher, and so it will be easily broken if the keystream is re-used. This can happen in several ways: re-using the key text for a message of identical length to a previous one, re-using it for a message of a different length, or using a serial that is consecutive (or simply close) to a previously used serial.
3. As with all stream ciphers, an adversary in possession of the plaintext (but not the key text) might be able to change the ciphertext so that it decrypts to something of his choosing. Adding a message authentication code (MAC) makes this much more difficult, but the MAC process takes extra effort and opens up the possibility of leaking information, so the cipher is no longer theoretically unbreakable.

We will now look at these scenarios one by one. But first we need to have something to crack. In order to provide room for statistical analysis, we’ll encrypt something rather long. For instance, the first paragraph in the text-only version of Charles Dickens’s “Oliver Twist” at gutenberg.org, which reads:

```Among other public buildings in a certain town, which for many reasons
it will be prudent to refrain from mentioning, and to which I will
assign no fictitious name, there is one anciently common to most towns,
great or small: to wit, a workhouse; and in this workhouse was born; on
a day and date which I need not trouble myself to repeat, inasmuch as
it can be of no possible consequence to the reader, in this stage of
the business at all events; the item of mortality whose name is
prefixed to the head of this chapter.```

We will be encrypting this with different keys. If we use as key the second and third paragraphs from the same source, we will have enough key material. This is:

```For a long time after it was ushered into this world of sorrow and
trouble, by the parish surgeon, it remained a matter of considerable
doubt whether the child would survive to bear any name at all; in which
case it is somewhat more than probable that these memoirs would never
have appeared; or, if they had, that being comprised within a couple of
pages, they would have possessed the inestimable merit of being the
most concise and faithful specimen of biography, extant in the
literature of any age or country.

Although I am not disposed to maintain that the being born in a
workhouse, is in itself the most fortunate and enviable circumstance
that can possibly befall a human being, I do mean to say that in this
particular instance, it was the best thing for Oliver Twist that could
by possibility have occurred.  The fact is, that there was considerable
difficulty in inducing Oliver to take upon himself the office of
respiration,--a troublesome practice, but one which custom has rendered
necessary to our easy existence; and for some time he lay gasping on a
little flock mattress, rather unequally poised between this world and
the next: the balance being decidedly in favour of the latter.  Now,
if, during this brief period, Oliver had been surrounded by careful
grandmothers, anxious aunts, experienced nurses, and doctors of
profound wisdom, he would most inevitably and indubitably have been
killed in no time.  There being nobody by, however, but a pauper old
woman, who was rendered rather misty by an unwonted allowance of beer;
and a parish surgeon who did such matters by contract; Oliver and
Nature fought out the point between them.  The result was, that, after
a few struggles, Oliver breathed, sneezed, and proceeded to advertise
to the inmates of the workhouse the fact of a new burden having been
imposed  upon the parish, by setting up as loud a cry as could
reasonably have been expected from a male infant who had not been
possessed of that very useful appendage, a voice, for a much longer
space of time than three minutes and a quarter.
```

In order to make things easier for myself, I start doing the cipher using this web app (a deprecated version implementing serials and LFGs) which goes through the same steps I’d follow to use BookPad with paper and pencil, only much more quickly. Go ahead and click the link (the app will load on a separate tab) in order to follow what I’ll be saying and experiment on your own.

If we now feed both into BookPad and do the encryption with the default “Arsenio” encoding and serial “0000”, the ciphertext, while still in digit form, becomes:

`49504560680084015661998496971386126345322028641775701660631118711542791176999963918180950242273730121545261879856530459624343588341942635310173838649537278287338241876062309030325371889683406237844253140131381115271171980084995803843604317040322957784187844717928566346098883557436085871528224684827122698669356179218594631560234600453394784007600430168768118427774015454962575355875003337486458086967754028911903728214916894115858013830641158728078374294705902700957404571860086035594822671548478906542226605209355983624829263526604419894755056810662664584223372640584840832996675770603231995849090181934841162880569602142655230141244650356624845842627970334117126810363172415890169789014`

Some quick statistics of the ciphertext:

```length = 689 digits (the plaintext was 522 characters, so the ciphertext is 1.32 times longer).
'0' appears 72 times, '1'=73, '2'=65, '3'=64, '4'=77, '5'=66, '6'=74, '7'=60, '8'=81, '9'=57, which is close to a uniform distribution```

Now let’s go through the different scenarios.

### Short key scenario

If the user gets lazy and uses a very short key text, he may easily produce a repeating keystream. For instance, if the key text is simply “a”, the resulting keystream, ready to be combined with the encoded plaintext, becomes:

`86556815063100136051865568150631001360518655681506310013605186556815063100136051865568150631001360518655681506310013605186556815063100136051865568150631001360518655681506310013605186556815063100136051865568150631001360518655681506310013605186556815063100136051865568150631001360518655681506310013605186556815063100136051865568150631001360518655681506310013605186556815063100136051865568150631001360518655681506310013605186556815063100136051865568150631001360518655681506310013605186556815063100136051865568150631001360518655681506310013605186556815063100136051865568150631001360518655681506310013605186556815063100136051865568150631001360518655681506310013605186556815063100136051865568150`

Which is made up of the period “86556815063100136051” (20 digits) repeated 34 times, plus a fraction. The resulting ciphertext, therefore, is as if produced by a Vigenère cipher of key “86556815063100136051”, acting on the encoded plaintext. To crack this, we would make statistics for the digits found at 1, 1+n, 1+2n, and so forth, where n is our guess of the key length. When we try the correct length, then all the digits in each set have been combined with the same keystream value, so that their peculiar statistics are not affected by the keystream. If the plaintext was English, there would be a clear statistical bias toward certain letters (“e” being the most frequent after the space, then “t”, “a”, and so forth) that would allow us to recognize that length as the right one, and likely to recover the corresponding plaintext, each letter shifted by the same amount in the alphabet, as in the Caesar cipher. The scanning and testing process can be automated, so it would take a fraction of a second in a computer. The only requirement is that each set is large enough to produce meaningful statistics and that the encoding does not obliterate the patterns present in the plaintext.

Let’s take a look at this last point. If I count letter frequencies (including space) in the plaintext, I find the following:

`'space'=90,'e'=44,'t'=40,'a'=30,'o'=39,'i'=34,'n'=37,'s'=29,'h'=21,'r'=21,'l'=15,'d'=11,'u'=10,'c'=13,'m'=14,'f'=10,'w'=12,'y'=5,'p'=5,'v'=1,'b'=8,'g'=6,'k'=2,'j'=0,'q'=1,'x'=1,'z'=0`

Which only has a few inversions with respect to the average English text frequencies (that is the order used in the list above).

In the encoded plaintext, however, the frequencies are:

`'0'=94,'1'=45,'2'=35,'3'=80,'4'=64,'5'=44,'6'=69,'7'=40,'8'=103,'9'=115`

Here ‘0’, ‘8’ and ‘9’ are the most frequent digits by far. ‘0’ and ‘9’ encode the less-frequent letters in most Western languages (‘t’, which is frequent in English, is encoded as ’03’ in our scheme, sharing the digit ‘0’ with other much less frequent letters), while ‘8’ encodes the space. So it stands to reason that a real piece of English plaintext will have a lot of 8’s, 9’s, and 0’s, although it would be hard to tell which encodes which. Looking at the less frequent digits, we find ‘2’, which encodes ‘r’, not so frequent in English, and ‘7’, which encodes ‘o’. The most frequent letter, ‘e’, is encoded as ‘4’, which sits in the middle of the pack and doesn’t stand out in any way. Still, let us assume that a piece of encoded English might be recognizable because of the high frequency of three digits (encoding spaces and less-frequent letters) over the rest. Unfortunately, digraphs and trigraphs, which usually are much better for detecting encoded language than single letters, won’t help with Vigenère cryptanalysis (unless we add a lot of complexity to the process) because consecutive letters end up in different groups.

It is interesting to compare this statistic with what we would get if the plaintext was random. One way to obtain this is to encode the string “abcdefghijklmnopqrstuvwxyz .’” which contains all encodable characters once. If we repeat it 14 times, for a total of 672 digits after encoding, we obtain the following statistic:

`'0'=168,'1'=42,'2'=42,'3'=42,'4'=42,'5'=42,'6'=42,'7'=42,'8'=42,'9'=168`

so that only two digits, corresponding to the less-frequent letters, appear more frequently than the others. It seems, therefore, that it would be possible to detect the moment when the correct key length is guessed, since then frequency analysis would give three peaks, which is different from a random distribution of digits (no peaks) or of letters (two peaks). From then on, one would apply the standard cryptanalysis of the Vigenère cipher, obtaining the repeated keystream period in the process.
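The scanning-and-testing procedure just described is easy to automate. Below is a quick sketch of my own (not the author’s app, and using made-up toy digits rather than the Oliver Twist example): for each candidate period n we decimate the ciphertext into n groups and score how far each group strays from a uniform digit distribution; the score spikes at the true period.

```python
# A sketch of my own (not the author's app): guess the period of a repeating
# keystream by decimating the ciphertext. When the candidate period n is
# correct, the digits at positions i, i+n, i+2n, ... were all shifted by the
# same keystream digit, so each group keeps the plaintext's bias and a
# chi-squared score against the uniform distribution spikes.
import random

def bias_score(digits: str, period: int) -> float:
    """Average chi-squared statistic of the digit groups for one period guess."""
    total = 0.0
    for start in range(period):
        group = digits[start::period]
        expected = len(group) / 10
        total += sum((group.count(d) - expected) ** 2 / expected
                     for d in "0123456789")
    return total / period

# Toy data: a biased "encoded plaintext" combined with a period-5 key.
random.seed(1)
plain = "".join(random.choice("889900123") for _ in range(2000))
key = "73105"
cipher = "".join(str((int(p) + int(key[i % len(key)])) % 10)
                 for i, p in enumerate(plain))

scores = {n: bias_score(cipher, n) for n in range(2, 9)}
best = max(scores, key=scores.get)
print(best)  # 5, the true period
```

With a real BookPad target one would look for the three expected frequency peaks rather than a generic chi-squared spike, but the mechanics are the same.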

Okay, but that’s with an incredibly bad key. What if the key is a little better? Let’s assume now that the key is “ar”, which is encoded as “12” with the default straddling checkerboard. In this case the keystream consists of repetitions of the sequence “407668162090261866704076681620902618667040766816209026186670”, which comprises 60 digits, so that the string is repeated 11 times plus a fraction. It is a lot harder to come up with a meaningful frequency distribution of 10 different digits if there are only 11 or 12 samples. The expected three peaks might appear, or they might not, and then we’d never know what the true period is. This period is not always the same, either. If the key is “ae”, which also encodes to two digits, the repeated keystream sequence is “7310137297781518779273101372977815187792”, which is 40 digits long, leading to an average of 17.2 samples per distribution. The period also depends on the plaintext length. If key “ar” is used for a plaintext equal to the original minus the final period (2 digits shorter after encoding), then the keystream period is 40, not 60. For key “ae”, it is 20 digits long.

Going back to the original plaintext, key “ab” produces a 60-digit keystream period, whereas key “ac” leads to a 45-digit period. Both are three-digit keys, as encoded, so the difference must lie in the peculiarities of the lagged Fibonacci generator that produces the keystream. A number of four-digit keys also lead to 60- or 40-digit periods, but when we go to five digits, as in “bad”, the period jumps to 100 digits, though “bat” takes us back to 50. A six-digit keyword such as “but” gives 60 digits. In general, one could expect the keystream period length to be ten times the encoded key length, so if the key is one-tenth the length of the plaintext (it should be 30 times longer, at least), the keystream won’t repeat within the length of the message, and we may be back to the one-time pad situation if the resulting keystream has good randomness.

A quick example: since the plaintext’s encoded length is 689 digits, let’s try a key that encodes to one-tenth of that length, that is, 69 digits. For instance: “This is a not so short key. But not too long either.” Using this key text, we find that the keystream does not repeat. What’s worse for the attacker, the distribution of digits in the resulting keystream is quite uniform. The shorter key text “This is a not so short key.” (35 digits after encoding) doesn’t lead to repetitions, either, but the resulting keystream does not look so random, with a superabundance of 2’s and 7’s and a dearth of other digits. However, “This is a short key” (25 digits) leads to a period of 125 digits, for a total of five and a half repetitions. So it seems that, in the worst cases found so far, the keystream period is 5 times as long as the key, with a factor of 10 being most frequent, and sometimes 20.

Conclusion: we can expect to be able to crack BookPad by conventional frequency analysis if the user does not follow instructions and uses a short key, with diminishing probability of success as the key gets longer. Once the user takes a key text that is at least one-tenth the length of the plaintext (after encoding), forget about it, even though technically there isn’t enough key entropy to meet the one-time pad criterion for unbreakability.

### Reused key

Clearly, reusing the key text for two messages of identical length breaks all security since then an attacker only needs to subtract the two encoded ciphertexts in order to find the difference between the two messages, which then can be easily attacked in many ways. But what if the messages are of different length? What if a different serial was used with the same key text, but the two serials differ only a little?
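The identical-length case can be verified in a few lines (a sketch of mine, borrowing digit strings from examples that appear later in the article; the point, not the particular digits, is what matters): since ciphertext = keystream − plaintext, digit by digit mod 10, subtracting two ciphertexts made with the same keystream leaves the difference of the plaintexts, with the key gone.

```python
# Reusing a keystream cancels it: with ciphertext = keystream - plaintext
# (digit by digit, mod 10, no borrows), c2 - c1 equals p1 - p2, so the key
# text contributes nothing. Digit strings are taken from the article's examples.

def sub10(a: str, b: str) -> str:
    """Digit-wise subtraction mod 10, no borrows."""
    return "".join(str((int(x) - int(y)) % 10) for x, y in zip(a, b))

keystream = "0842703"   # a keystream worked out later in the article
p1 = "1996924"          # "Alice" in the default encoding
p2 = "3049512"          # "Sugar" in the default encoding
c1 = sub10(keystream, p1)
c2 = sub10(keystream, p2)

print(sub10(c2, c1))    # 8957412
print(sub10(p1, p2))    # 8957412: identical, the keystream has vanished
```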

To test the first scenario (which will be revisited later), let us re-encrypt the original message with the original key, which is shown at the top, and then use the same key to encrypt the original message minus the final period (two digits shorter). Here’s the ciphertext for the second encryption:

`080653919354207573082946698279557637717783420159887064587363997801819098623886634621435181617149650881379081385808289648839232357686905572759273683733019242416409622104254425068549459634261536548307305258991049932848714094207982511166147214391626869846663488479725578102687486543909535691590189912834514950633492400878567613933785290007112246339237125515514603589605896248273434294515810209220596453282731762769121058678451926600535491183932264144604511586986598408786068577900862225732301214114919758307362469428955118631085987435666629232914607829656886750844664924781158738892072449040943422393867494120708503026553236782069739087152697883503740382282317330411073067932595532092161284`

Even though the plaintexts only differ by two digits at the end, the ciphertexts are completely different from the start. No similarity between the two is evident anywhere. This is because the two keystreams, having different lengths, become completely different due to the properties of lagged Fibonacci generators. At first, it doesn’t appear promising at all for the cryptanalyst, so we’ll try an easier target.

How about serials that only differ by a little? Let’s encrypt the original plaintext with the original key text, but this time using “1000” as serial, so a “1” will appear where before there was a “0” in the process of generating the keystream. Then the ciphertext becomes:

`51849138581218572451011842650398461913223252108565824016310120056110692300456753031536639254518308022779728669979986138636688156242176192100296284328549513855239475333852422486004383124251307461301043263587060127516749881218452693966050096052667525685311301507041912025000128125337219338318347030506134933237257303775384754916913612798962685231167220281114897439019683355196032145998459016498793654868988585701026174993928139783759247397431271174757386539373803934414294694216765047839490572772935796665672384211690551525053720316727865573767391488563898041013495096263852177564576904160021018295779193279419063014026492265001919153589228257858302632740326013129461488264306972680282135793`

We do see some repeated digraphs between the two ciphertexts, but they are not at the same locations so we must conclude that they arise from normal random coincidences. Prominent features like the “9999” found in the first line of the original message don’t appear to be in the second message, and vice-versa. Both messages look as if encrypted with completely different key texts.

We hit pay dirt if the serial is “1900”, however, for then the ciphertext is:

`40615671791195126772009507082497237456433139752886812771742229822653802287000074029291061353384841232656372980967641560735454699452053746421284949750648389398449352987173410141436482990794517348955364251242492226382282091195006914954715428151433068895298955828039677457109994668547196982639335795938233709770467280329605742671345711564405895118711541279879229538885126565073686466986114448597569197078865139022014839325027905226969124941752269839189485305816013811068515682971197146605933782659589017653337716310466094735930374637715520905866167921773775695334483751695951943007786881714342006950101292045952273991670713253766341252355761467735956953738081445228237921474283526901270890125`

which is the same as the original ciphertext with 1 added (mod 10) to every digit except the first. This is because the shift in the LFG caused by the first serial digit ‘1’ is promptly canceled by the second digit, ‘9’, resulting in a related keystream sequence. Serial “2800” shifts the whole thing by 2, which again essentially repeats the keystream, and there are a number of such serials that lead to results too close to the original. Now, of course, this only works if the two messages are of identical encoded length and they use the same key text with serials whose digits add up to the same value, which will happen one time in ten on average (not bad at all).

Now, if we go back to the ciphertext encrypted with the serial “1000”, we notice that it is more similar to the original (serial “0000”) than we had been led to believe at first. Indeed, if I add the repeated stream “1234567890…” to the original, I get the one with serial “1000”. The effect is similar if the serial is “0100”, “0010”, or “0001”. If the serial is “2000”, then the keystream shifts by the sequence “24680…”; if the serial is “3000”, the string to be added is “3692581470…”, and so forth. Even better, the linearity of the LFG causes a serial composed of several figures to have an effect on the keystream that is the superposition of the effects of all its digits. In other words, utterly predictable, though apparently random at first sight. If instead of three rows of LFG the encryptor had used four, we’d still be able to predict the effect, though the sequences to be added would be a little more complex.
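The superposition claim is a property of any linear generator mod 10, and can be checked directly. The sketch below uses a generic chain-addition recurrence (my own stand-in, not necessarily BookPad’s exact LFG): because every output digit is a carryless linear combination of seed digits, the keystream of (seed + delta) equals the keystream of the seed plus the keystream of the delta.

```python
# Any additive generator mod 10 is linear, so the effect of perturbing the
# seed (the "serial") superposes. The recurrence below is generic chain
# addition (a stand-in of mine, not necessarily BookPad's exact LFG).

def lfg10(seed: str, n: int) -> str:
    """Chain addition: each new digit is the carryless sum of the digit one
    seed-length back and its right-hand neighbour."""
    s = [int(d) for d in seed]
    k = len(seed)
    for i in range(k, n):
        s.append((s[i - k] + s[i - k + 1]) % 10)
    return "".join(map(str, s[:n]))

def add10(a: str, b: str) -> str:
    return "".join(str((int(x) + int(y)) % 10) for x, y in zip(a, b))

seed  = "8655681506"   # an arbitrary 10-digit seed
delta = "1000000000"   # the perturbation a serial digit would cause
lhs = lfg10(add10(seed, delta), 40)             # generator run on (seed + delta)
rhs = add10(lfg10(seed, 40), lfg10(delta, 40))  # superposition of the two parts
print(lhs == rhs)  # True: the serial's effect is predictable and removable
```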

So, advice to the encryptor: don’t use a serial as described in the old BookPad spec. It gives no extra security because the linearity of the LFG causes the effect of the serial to be completely predictable so it can be removed, leaving behind the original key text. If this is repeated with two messages of equal length, all will be lost.

Because of this, I altered the original spec of BookPad so that a key text shift, applied as soon as a piece of suitable length is extracted from the encoded key text, is used rather than a serial added to the digits. Shifts move all the digits so that, after three rows in the LFG, there is little apparent correlation between the keystreams produced from different shifts. The idea is to be able to reuse a particular key text, so long as the shift itself is not reused with the same key text and a key text-derived encoding is in use. To follow along with the explanation that follows, use this other JavaScript code (also deprecated), which uses shifts rather than serials.

Shifts move all the digits in the affected parts of the key text, so the result of the sum changes unpredictably, but are they really secure? Not if the attacker has the plaintexts. This is not as unlikely as it seems: perhaps the attacker has planted some news in order to get the user to include certain unusual words in the plaintext, which can be recognized later on, or perhaps he got the plaintext from a different source and is now trying to find the origin of the key text. This is known as a “known-plaintext attack.” Let us, therefore, assume that I, the attacker, have subtracted the known plaintext from two messages encoded with two different shifts (say, one unshifted and another where the first piece is shifted by just one space to the left), so that I have both keystreams. Let’s say one is “12345” and the other is “67890”; then the process of making the first keystream (assuming no LFG step) will be:

``` abcde
+fghij
------
12345```

where “abcde” is the key text part that is shifted in the next message, and “fghij” is the result of adding the two parts that are not shifted. The process of making the second keystream (again, no LFG) is:

``` bcdea
+fghij
------
67890```

putting the two together, we obtain the following linear system of equations, where the addition is modulo 10:

`a + f = 1 ; b + g = 2 ; c + h = 3 ; d + i = 4 ; e + j = 5 ; b + f = 6 ; c + g = 7 ; d + h = 8; e + i = 9 ; a + j = 0`

which is a system of 10 equations with 10 unknowns. Solving it, we obtain the digits a through j. Then we decode them back to letters, resulting in a piece of the key text that a diligent search will identify as belonging to a certain book or other source. We’ve got it! Now, the example was too simplified, but it serves to illustrate that there is nothing the encryptor can do to stop us:

1. If he shifts the part by a different amount, we can still set up the system of equations, so long as we know what the shift is. Now, the shift may be concealed within the ciphertext, but then we only have to try all likely locations (there’s only so many of them), and keep the one that leads to intelligible text in the solution.
2. If the shift affects two parts rather than one, we’ll end up with a similar situation but involving three rows of unknowns instead of two. This only means that we’ll need three compromised messages rather than two in order to reach a solution. Since the point of the shift is to reuse a key text, it is very likely that three messages using the same key can be found if the encryptor has reused the key. Why use it only for two messages?
3. Things won’t be much better for the encryptor if he uses a LFG, since the LFG can be reversed after making a few guesses, as we saw above. Nothing that a computer cannot handle.

But what if he does not shift the key text parts, but simply reuses the key text for messages of different lengths, which we saw above leads to very different-looking keystreams? Using the example above, the first, second, and third keystreams, for messages that are 5, 4, and 3 digits long, would be, for instance:

``` abcde
+fghij
+klmno
------
12345```
``` abcd
+efgh
+ijkl
-----
6789```
``` abc
+def
+ghi
-----
012```

which results in a system of 5 + 4 + 3 = 12 equations with 15 unknowns. It is not strictly solvable, but it will be if we guess three of the unknowns, for a total of 10 x 10 x 10 = 1000 possibilities. The right solution will be the one that yields intelligible text for “abcde”, “fghij”, “efgh”, and so forth. It is unlikely that two different solutions will pass this test. So it turns out that reusing a key text for messages of different lengths is not safe, either, even though the keystreams may look quite different.

What can the encryptor do to thwart me? I can only think of two things:

1. Never reuse a key text, rather than relying on a shift to make the keystream different each time, or on the fact that keystreams naturally look different for different lengths. This advice holds whenever an attacker is able to obtain the keystreams by subtracting known plaintexts with a known encoding (such as the default “Arsenio” encoding).
2. If reusing a key is unavoidable, use a key text-derived encoding rather than the default, plus a shift. If the encoding is unknown, I won’t know what the correct keystream is, even if I know the plaintext, and the process I just described will be much more difficult to get started. I can still make guesses as to what the encoding is (essentially, it’s a permutation of the top row of the checkerboard and the other two rows), but there are a lot of possibilities to try: 10!/2 for the first row, times 19! for the other two, for a total of 2.2E23, which will take a while to go through even on a powerful machine. If I work for the NSA, though, I might still be able to muster enough computing power to go through every possible encoding, after which solving the system will hardly take any extra time, and I will still be able to tell when I’ve got the correct solution, because all parts of the key text will be readable (a machine can detect this by means of a dictionary).

In other words, the encryptor will do well never to reuse a key text, even with key text-derived encoding. After all, we are after perfect secrecy, and this implies NSA-proof secrecy, present and future. This is the reason why shifts were also removed from the BookPad spec. It is much easier to simply choose a different piece of key text for every message, rather than keep track of shifts and encodings in order to reuse a given key text.

### Altered ciphertext

Here we have to deal with the issue of “malleability,” which plagues all stream ciphers. An attacker who knows the plaintext (or a part of it, along with its location) will be able to make a ciphertext that decrypts to a plaintext of his choosing, by simply shifting the values in the ciphertext so the plaintext letters shift in the same way. This is possible whenever the attacker knows the encoding used. For instance, the initial “among” of the message is encoded as “1907595” with the default encoding. “Sugar” encodes to “3049512”, which has the same length. Therefore, if I take the ciphertext and add to it the number 1907595 – 3049512 = 8968083 (digit by digit, mod 10), it should decrypt to what I want. The altered ciphertext begins (only the first seven digits change):

`2818439068008401566199849697138612634532202864177570166063111...`

which decrypts to:

`sugar other public buildings in a certain town...`

which is exactly what I wanted to do.
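The tampering arithmetic can be replayed in a few lines, using the digit strings from the article:

```python
# Replaying the malleability attack: add encode("among") - encode("Sugar")
# to the ciphertext, digit by digit mod 10, and the message decrypts to
# "sugar ..." instead of "among ...". Digits are the article's.

def add10(a: str, b: str) -> str:
    return "".join(str((int(x) + int(y)) % 10) for x, y in zip(a, b))

def sub10(a: str, b: str) -> str:
    return "".join(str((int(x) - int(y)) % 10) for x, y in zip(a, b))

among = "1907595"          # "among", default Arsenio encoding
sugar = "3049512"          # "Sugar", same encoding
delta = sub10(among, sugar)
print(delta)               # 8968083

cipher_head = "4950456"    # first seven digits of the original ciphertext
print(add10(cipher_head, delta))  # 2818439, the start of the altered ciphertext
```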

To combat this, the user can either calculate a MAC and include it with the message, which is extra work, or use a key text-derived encoding checkerboard. This is how a message authentication code (MAC) can be calculated (I assume it’s going to be 3 digits long):

1. Since the MAC length is 3 digits, I set N = 3. I take a piece of encoded key text that I have not used yet with length 3N = 9 digits. For instance: 672930060
2. Now I take the encoded plaintext (“Alice”), followed by the piece of encoded key text, and the shift or serial (if any), and write them row by row into N = 3 columns. Then I add each column without carries and get the result.
``` 199
+692
+467
+293
+006
+0(shift here, wrapped if necessary)
-----
337```

My friend on the other end goes through the exact same steps after he has obtained the encoded plaintext. Since he is using the same key text, he should calculate the same MAC (337), which I am sending along with the ciphertext. If the calculated value matches the received value, he knows that an altered message had only one chance in a thousand of passing the check; otherwise, most likely there has been an error or an alteration.
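The MAC recipe above boils down to writing the digits into N columns and adding each column without carries. A minimal sketch, using the article’s example digits:

```python
# The MAC recipe in code: concatenate the encoded plaintext and the fresh
# key-text digits, write them row by row into N columns, and add each column
# mod 10 without carries. Example digits are the article's (no shift in use).

def bookpad_mac(plain_digits: str, key_digits: str, n: int) -> str:
    sums = [0] * n
    for i, d in enumerate(plain_digits + key_digits):
        sums[i % n] = (sums[i % n] + int(d)) % 10   # column-wise carryless sum
    return "".join(str(s) for s in sums)

print(bookpad_mac("1996924", "672930060", 3))  # 337
```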

This process of making a MAC doesn’t give away anything about the plaintext, because secret data with as much entropy (remember, entropy is the same as “unpredictability”) as the MAC itself has been mixed in. Those who don’t possess the key text would have to guess this data, which means that they can produce any three-digit MAC just by changing their guess. No good at all for the poor hacker.

Even worse, if the encryptor uses a key-derived encoding (and we saw above there are 2.2E23 possibilities), I cannot begin to modify the ciphertext so the plaintext becomes something different, because I don’t know how the letters translate into digits. I could try to make some guesses, but for this attack I only have one chance, since the recipient will realize that something is wrong as soon as the first word decrypts as gibberish while the rest makes sense. I’m stuck.

Advice to the encryptor: making your own key-derived checkerboard takes very little time and protects you against this attack, even without a MAC. Much preferred.

### Summary

We have subjected BookPad to three kinds of attack that might be likely in practice. BookPad remains secure with short keys, so long as they are at least one-tenth the length of the message; such short keys, on the other hand, don’t have the theoretical unbreakability of long key texts, and can be attacked by brute force. A related-serial attack succeeded against users who reused the key text and relied on serials to make the keystream different, so serials were deprecated in the BookPad spec and replaced at first by key text shifts. Then shifts also fell to attack: easily when the encoding was known, not so easily for a key-derived encoding but still within the reach of an attacker with a lot of computing power, so it is better never to reuse a key text, period. Key-derived encoding works as well as calculating a MAC for protecting against a known-plaintext attack, and is much quicker to do. In this other article, I found that the LFG step does not add security when the key text has the necessary length, which is why the current BookPad spec does not include it.

## A new look at one-time pads

In another article, I describe how text taken from a book in your library can serve as a one-time pad of sorts, since normal text also contains some unpredictability. The trick is to use a piece of text from an agreed-upon book that is five times the length of the plaintext. That method uses a computer-based hash function, but in this article I tell you how to obtain good security from simple paper-and-pencil calculations, using a key text from the book that is only three times the length of the plaintext.

This cipher can be performed with pencil and paper, but in any case I have made a JavaScript version of it, which you may find useful as you read the article. It can be downloaded from this link.

Yes, doing a standard hash by hand is incredibly long and tedious. This video shows a man trying to compute a SHA-256 hash by hand, which is enough to dissuade anyone from even trying. But perhaps nothing as complicated as that is needed. What we need is a way to convert a piece of encoded text, which will have a lot of underlying patterns that can be exploited by an adversary, into something that will pass most statistical tests for randomness. In other words, we need some sort of pseudo-random number generator (PRNG).

Earlier versions of BookPad used a lagged Fibonacci generator (LFG) for this purpose, but testing showed that randomness was sufficient even when this step was skipped, so long as the key text had the required length. Therefore, the BookPad spec was simplified to what you see here. The result is a cipher that achieves perfect secrecy (that is, it can never be broken, no matter how much computing power adversaries might have) with only a few simple steps that can be carried out by paper and pencil.

So here’s the BookPad cipher step by step, illustrated with an example in which the plaintext I wish to conceal is “Alice”:

1. BookPad uses randomness extracted from common text. Therefore, I identify a piece of text that my friend on the other side can also obtain and that won’t incriminate me in the event of a search. It can come from a popular book, a newspaper, a blog, whatever. What matters is that my friend can obtain an exact copy, that adversaries cannot guess the source, and that neither I nor my friend use that text again, ever. Let’s say the text begins: “This is my password. It is long.” We’ll call this the key text.
2. The next step is to encode it, along with the plaintext, as decimal digits. I will use the “Arsenio” straddling checkerboard below, which works well with most Western languages, or a similar one derived from my key text. Spaces are encoded (represented by “+”), and there is an extra generic punctuation mark “=”. Letters on the top row are replaced by the single digit above them, and those in the other two rows by the digit on the left followed by the one on top.
```    1 2 3 4 5 6 7 8 9 0
   --------------------
  | A R S E N I O +
9 | B C D F G H J K L M
0 | P Q T U V W X Y Z =```

which would lead to the encoding “Alice” = “1 99 6 92 4” (7 digits).
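For reference, the default checkerboard is just a lookup table, and the encoding loop is trivial. A sketch of mine (not the author’s app; spaces become “+” before lookup):

```python
# The default "Arsenio" checkerboard as a lookup table (a sketch; spaces are
# turned into "+" before lookup, and "=" stands for generic punctuation).

ARSENIO = {}
for digit, ch in zip("12345678", "arsenio+"):
    ARSENIO[ch] = digit                 # top row: one digit each
for digit, ch in zip("1234567890", "bcdfghjklm"):
    ARSENIO[ch] = "9" + digit           # second row: 9 plus the column digit
for digit, ch in zip("1234567890", "pqtuvwxyz="):
    ARSENIO[ch] = "0" + digit           # third row: 0 plus the column digit

def encode(text: str) -> str:
    return "".join(ARSENIO[c] for c in text.lower().replace(" ", "+"))

print(encode("Alice"))  # 1996924
print(encode("among"))  # 1907595
```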

But I can also use a key-derived checkerboard, made this way:

1. Take the first sentence (up to the first period) of the key text.
2. Count the number of letters in the first and second words (subtract 10, if necessary); these will be the numbers at the top of the checkerboard that won’t have a letter directly below. If the second word has the same number of letters as the first, take the third, and so on; if you reach the end without finding a different number, take the first number plus 1.
3. Place the letters of the key text, in the order they come, on the checkerboard: those in “arsenio” plus the space (represented as +) on the top row, the others in the second and third rows.
4. When you reach the end of the sentence, place the period (or = sign), and then the rest of the letters that have not appeared, in reverse alphabetical order (in reverse “arsenio” order, placed on the top row, for those letters of this word that have not appeared in the first sentence).

Since the first sentence in the key text is “This is my password,” the empty cells on the top row are under the digits 4 (length of “This”) and 2 (length of “is”), and the filled checkerboard becomes:

```    1 2 3 4 5 6 7 8 9 0
--------------------
| I   S   + A O R N E
4 | T H M Y P W D = Z X
2 | V U Q L K J G F C B```

The key-derived checkerboard takes just a little longer to compose, and makes the resulting ciphertext non-malleable without the need to make a message authentication code (MAC), as described in this article, which usually takes a lot more work than generating a checkerboard. Highly recommended. With this encoding, the plaintext becomes “6 24 1 29 0”.

1. Now I also convert the key text into decimal digits using the chosen checkerboard. If I chose the key-derived checkerboard, the key text gets encoded as:414213513543445456334 67847481425135247927 (41 digits). The minimum I need is 3 x N digits, where N is the number of digits of the encoded plaintext, and the rest of the digits are ignored. I only need 7 x 3 = 21 digits needed to encrypt a message containing 7 digits. I take this piece of encoded key text and split it into three parts of equal length: 4142135, 1354344, 5456334. At this point I can put away the book since I’ll never use that key text again. This is very important.
2. Now I need to obtain the keystream from this encoded key text. To do this, I add the three parts digit by digit without carries, which can be done from left to right. This is what I get:
``` 4142135
+1354344
+5456334
--------
0842703```
3. Finally, I subtract the encoded plaintext from the keystream, also without carries, which can be done from left to right. The result is the ciphertext.
``` 0842703
-6241290
--------
4601513```

And that’s it. I take the result 4601513 and send it out to my friend, along with the start location (page or chapter, line or paragraph number) of the key text. Obviously, I won’t send the book’s name. For added security, I may want to insert those numbers at a predetermined location in the ciphertext so an adversary won’t know where it is.

My friend at the other end performs steps 1-3 in the same way as I did, thus obtaining the same keystream. Then he subtracts from it the ciphertext I transmitted, also without carries, like this:

``` 0842703
-4601513
--------
6241290```

The result is the encoded plaintext, which now he decodes using the straddling checkerboard back to “Alice”.

For those of us who are really paranoid, BookPad can be strengthened in a few ways:

1. Use a key-derived straddling checkerboard instead of the default “Arsenio” checkerboard. The process for deriving this, which is described in step 1, does not add a lot of work, and using a key-derived straddling checkerboard prevents an adversary from changing the contents to something else of his/her/its choice.
2. Decode the ciphertext back to letters, and send it that way. This works especially well if a key-derived checkerboard is in use, since the digits to letter decoding is unknown to attackers.
3. Split the plaintext at a random spot and arrange the parts in reverse order prior to decoding, perhaps with a special code such as “..” separating the two. This is to lessen the impact of stereotyped beginnings and endings.
4. Add nulls here and there in the plaintext. Nulls are nonsensical characters that the recipient can easily spot and ignore, but which make cryptanalysis more difficult.
5. (Deprecated) Add a message authentication code (MAC) to prevent an adversary from modifying the message at ciphertext level. The process for making a MAC is described in this article. It is deprecated because using a key text-derived checkerboard also achieves this and is much easier to do.
6. (Deprecated) Use a randomly chosen serial, and do the same as in point 3 above. Serials were found found to be insecure in this same article, so serial usage is no longer included in the BookPad spec in this article, and has been replaced with a shift of the encoded key text. The JavaScript code that still implements a serial can be found at this link.
7. (Also deprecated) Use a randomly chosen and never-reused shift, and then insert the shift at a location in the ciphertext based on the key text, rather than sending them in the clear or inserting them always in the same spot. For instance, count the length of the first two words of the key text, and insert the shift (and MAC, if any) after so many ciphertext digits have elapsed, and then continue with the rest of the ciphertext. If this is then decoded back to letters, so much better. An attacker will have a hard time finding the shift because he won’t even know where it is.  Unfortunately, the same article shows that shifts, although better than serials, still provide no real security when a key text is reused. The JavaScript code that still implements a shift can be found at this link.

These are added as well to a webpage that implements all the steps mentioned above, programmed in JavaScript. You can find it at this link. I think you will find this program quite useful, especially since it allows you to experiment with the cipher with a minimum of hassle.

Okay, but how secure is all this? I’m going to address this by putting on my cryptanalyst’s (hacker’s if you will) hat and looking at different ways to attack the process. Here’s a few things I’d try:

1. Statistical analysis of the ciphertext. The straddling checkerboard encoding makes it more difficult since the resulting digits, unlike the original text, have a fairly flat expected frequency distribution. But digraphs and trigraphs are preserved after encoding so it may be possible to find a few on a long piece of text. Additionally, there will be other patterns and repetitions that may be picked up by statistical tools. The problem for me poor hacker is that each plaintext digit is combined with a quite “random” (in the statistical sense) keystream, so I shouldn’t expect any statistical properties of the plaintext to survive in the ciphertext.
2. Brute force. I can try every possible key text, but this won’t help much since the number of possible key texts is equal to the number of possible ciphertexts, due to the high entropy of the long key text. Brute force will produce every possible plaintext, and I don’t have a way to tell which is the good one (not even using the MAC, which would require additional brute forcing). I can still try all the books from obvious places like gutenberg.org, all the articles from nytimes.com, etc., etc., etc., and I might get lucky. The Wikipedia article on the running key cipher, which is a lot like BookPad except that the key text has identical length as the plaintext and no extra randomizing step is used, states that the number of published books in the world is of the order of 2^40, which is within the reach of today’s computers to process. Still, the risk of missing the source is great, especially if it is not as obvious as those noted above.
3. Somehow get the plaintext for a particular ciphertext, and then get the keystream by subtracting the known plaintext from the ciphertext. Then reverse the keystream back to key text so I can guess what book it was taken from. The problem is that the keystream derived from a piece of key text split into three and combined by addition is quite random, even without the LFG step (see this article), so that it could derive from a variety of key texts.
4. Change the hidden plaintext by changing the ciphertext in certain ways (especially if I know the plaintext). This is possible if the encoding is known and there is no MAC accompanying the ciphertext, but not easy at all if there is a MAC of sufficient length or the encoding is derived from the key text. I could try different encodings in the latter case, but for this attack I have to be right on the first try, which doesn’t seem possible if the encoding derives from the key text, and therefore I don’t know how it’s done.
5. Force the sender to re-use a key text for two messages, and then subtract the ciphertexts to eliminate the keystream and apply statistical tools to the result, which combines the two plaintexts. This won’t work if the sender never reuses a key text, as he/she should. Still, I could try to mount a related-key attack, where the sender has used the same key text and two different shifts. This is studied in this article.
6. Early versions of BookPad used Lagged Fibonacci generators, which are known to have problems, which I exploit. One of them is their linearity, which makes combining a serial number with the seed quite ineffective, as discussed here. LFGs are also very sensitive to the seed, so that a bad seed will lead to a sequence of digits that repeats after a short interval. If all the digits in the seed are even or divisible by 5, every digit in the sequence will be even or divisible by 5, respectively. In BookPad, the seed is not being tested for this, so it is possible that the resulting keystream is not statistically random, especially if the keystream is long and the key text is short, because then there is room for a bad LFG to start repeating digits. These and other reasons recommend not using a LFG. If a key text of sufficient length is used, the keystream passes randomness tests even without a LFG.

Does all this mean that BookPad has no flaws and you can go ahead and entrust high secrets to it? By no means, but at least you are encouraged to test it. Time and testing will tell if it lives out to its promise of perfect security with a minimum of work.

## BookPad, a paper and pencil “one time pad” cipher

In another article, I describe how text taken from a book in your library can serve as a one-time pad of sorts, since normal text also contains some unpredictability. The trick is to use a piece of text from an agreed-upon book that is five times the length of the plaintext. That method uses a computer-based hash function, but in this article I show how to obtain good security from simple paper-and-pencil calculations, using a key text out of the book that is only three times the length of the plaintext.

Although this cipher can be performed with pencil and paper, I have also made a JavaScript version of it, which you may find useful as you read the article. It can be downloaded from this link.

Yes, computing a standard hash by hand is incredibly slow and tedious. This video shows a man computing SHA-256 by hand, which is enough to dissuade anyone from even trying. But perhaps nothing so complicated is needed. What we need is a way to convert a piece of encoded text, which has a lot of underlying patterns that an adversary can exploit, into something that will pass most statistical tests for randomness. In other words, we need some sort of pseudo-random number generator (PRNG).

Earlier versions of BookPad used a lagged Fibonacci generator (LFG) for this purpose, but testing showed that randomness was sufficient even when this step was skipped, so long as the key text had the required length. Therefore, the BookPad spec was simplified to what you see here. The result is a cipher that achieves perfect secrecy (that is, it can never be broken, no matter how much computing power adversaries might have) with only a few simple steps that can be carried out by paper and pencil.

So here’s the BookPad cipher step by step, illustrated with an example in which the plaintext I wish to conceal is “Alice”:

1. BookPad uses randomness extracted from common text. Therefore, I identify a piece of text that my friend on the other side can also obtain and won’t incriminate me in the event of a search. It can come from  a popular book, a newspaper, a blog, whatever. What matters is that my friend can obtain an exact copy, that adversaries cannot guess the source, and that neither I nor my friend use that text again, ever. Let’s say the text begins: “This is my password. It is long.” We’ll call this key text.
2. The next step is to encode the plaintext, and later the key text, as decimal digits. I will use the “Arsenio” straddling checkerboard below, which works well with most Western languages, or a similar one derived from my key text. Spaces are encoded (represented by “+”), and there is an extra generic punctuation mark “=”. Letters on the top row are replaced by the single digit above them; those in the other two rows, by the digit on the left followed by the digit on top.
```
    1 2 3 4 5 6 7 8 9 0
   ---------------------
  | A R S E N I O +
9 | B C D F G H J K L M
0 | P Q T U V W X Y Z =
```

which would lead to the encoding “Alice” = “1 99 6 92 4” (7 digits).
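As a sketch of how this encoding can be mechanized (the table and function names here are my own, not part of any BookPad spec), the Arsenio checkerboard can be written out in JavaScript:

```javascript
// "Arsenio" straddling checkerboard: single digits for the top row,
// two-digit codes (row prefix 9 or 0, then column digit) for the rest.
const ARSENIO = {
  A: "1", R: "2", S: "3", E: "4", N: "5", I: "6", O: "7", "+": "8",
  B: "91", C: "92", D: "93", F: "94", G: "95", H: "96", J: "97",
  K: "98", L: "99", M: "90",
  P: "01", Q: "02", T: "03", U: "04", V: "05", W: "06", X: "07",
  Y: "08", Z: "09", "=": "00",
};

// Encode text as a decimal digit string: spaces become "+",
// any other non-letter becomes the generic punctuation mark "=".
function encode(text, table) {
  return text
    .toUpperCase()
    .replace(/ /g, "+")
    .replace(/[^A-Z+]/g, "=")
    .split("")
    .map((ch) => table[ch])
    .join("");
}

console.log(encode("Alice", ARSENIO)); // "1996924"
```

Note that the output is the same 7 digits as above, just without the spaces.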

But I can also use a key-derived checkerboard, made this way:

1. Take the first sentence (up to the first period) of the key text.
2. Count the number of letters in the first and second words (subtracting 10 if necessary); these will be the numbers at the top of the checkerboard that won’t have a letter directly below. If the second word has the same number of letters as the first, take the third, and so on; if you reach the end without finding a different number, take the first plus 1.
3. Place the letters of the key text on the checkerboard in the order they come: those in “arsenio”, plus the space (represented as +), go on the top row; the others go in the second and third rows.
4. When you reach the end of the sentence, place the period (the = sign), and then the rest of the letters that have not appeared, in reverse alphabetical order (in reverse “arsenio” order, placed on the top row, for those letters of this word that have not appeared in the first sentence).

Since the first sentence of the key text is “This is my password,” the empty cells on the top row are under the digits 4 (length of “This”) and 2 (length of “is”), and the filled checkerboard becomes:

```
    1 2 3 4 5 6 7 8 9 0
   ---------------------
  | I   S   + A O R N E
4 | T H M Y P W D = Z X
2 | V U Q L K J G F C B
```

The key-derived checkerboard takes just a little longer to compose, and it makes the resulting ciphertext non-malleable without the need for a message authentication code (MAC), as described in this article; generating a MAC usually takes a lot more work than generating a checkerboard. Highly recommended. With this encoding, the plaintext becomes “6 24 1 29 0”.
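Written out as a table for the same kind of encoder (again a sketch with names of my choosing), the key-derived checkerboard reproduces those digits:

```javascript
// Key-derived checkerboard from "This is my password. It is long.":
// top-row gaps fall under 4 and 2, which become the row prefixes.
const KEY_TABLE = {
  I: "1", S: "3", "+": "5", A: "6", O: "7", R: "8", N: "9", E: "0",
  T: "41", H: "42", M: "43", Y: "44", P: "45", W: "46", D: "47",
  "=": "48", Z: "49", X: "40",
  V: "21", U: "22", Q: "23", L: "24", K: "25", J: "26", G: "27",
  F: "28", C: "29", B: "20",
};

// Same encoder as for the Arsenio checkerboard: spaces become "+",
// other non-letters become "=".
function encode(text, table) {
  return text
    .toUpperCase()
    .replace(/ /g, "+")
    .replace(/[^A-Z+]/g, "=")
    .split("")
    .map((ch) => table[ch])
    .join("");
}

console.log(encode("Alice", KEY_TABLE)); // "6241290"
```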

3. Now I also convert the key text into decimal digits using the chosen checkerboard. If I chose the key-derived checkerboard, the key text gets encoded as: 414213513543445456334 67847481425135247927 (41 digits). The minimum I need is 3 x N digits, where N is the number of digits of the encoded plaintext, and the rest of the digits are ignored. Here I need only 7 x 3 = 21 digits to encrypt a message containing 7 digits. I take this piece of encoded key text and split it into three parts of equal length: 4142135, 1354344, 5456334. At this point I can put away the book, since I’ll never use that key text again. This is very important.
4. Now I need to obtain the keystream from this encoded key text. To do this, I add the three parts digit by digit without carries, which can be done from left to right. This is what I get:

```
 4142135
+1354344
+5456334
--------
 0842703
```

5. Finally, I subtract the encoded plaintext from the keystream, also without carries, which can be done from left to right. The result is the ciphertext:

```
 0842703
-6241290
--------
 4601513
```
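The carryless arithmetic in the last two steps is just digit-wise addition and subtraction modulo 10. A minimal sketch (helper names are mine):

```javascript
// Digit-wise sum of several digit strings, mod 10, no carries.
const addMod10 = (...rows) =>
  rows[0]
    .split("")
    .map((_, i) =>
      (rows.reduce((sum, r) => sum + Number(r[i]), 0) % 10).toString()
    )
    .join("");

// Digit-wise difference of two digit strings, mod 10, no borrows.
const subMod10 = (a, b) =>
  a
    .split("")
    .map((d, i) => ((10 + Number(d) - Number(b[i])) % 10).toString())
    .join("");

const keystream = addMod10("4142135", "1354344", "5456334");
console.log(keystream); // "0842703"

const ciphertext = subMod10(keystream, "6241290");
console.log(ciphertext); // "4601513"
```

Because every digit is independent of its neighbors, both operations can indeed be done from left to right, as noted above.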

And that’s it. I take the result 4601513 and send it out to my friend, along with the start location (page or chapter, line or paragraph number) of the key text. Obviously, I won’t send the book’s name. For added security, I may want to insert those numbers at a predetermined location in the ciphertext, so an adversary won’t know where they are.

My friend at the other end performs the same encoding and keystream steps as I did, thus obtaining the same keystream. Then he subtracts the ciphertext I transmitted from it, also without carries, like this:

```
 0842703
-4601513
--------
 6241290
```

The result is the encoded plaintext, which now he decodes using the straddling checkerboard back to “Alice”.
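Decryption is the same carryless subtraction, followed by reading the digits back through the checkerboard. A sketch under the same assumptions as before (helper names are mine):

```javascript
// Carryless (mod 10) digit-wise subtraction, as used for decryption.
const subMod10 = (a, b) =>
  a
    .split("")
    .map((d, i) => ((10 + Number(d) - Number(b[i])) % 10).toString())
    .join("");

// Key-derived checkerboard, arranged by row for decoding:
// 4 and 2 are row prefixes, so they consume the next digit as well.
const ROWS = {
  "": { 1: "I", 3: "S", 5: "+", 6: "A", 7: "O", 8: "R", 9: "N", 0: "E" },
  4: { 1: "T", 2: "H", 3: "M", 4: "Y", 5: "P", 6: "W", 7: "D", 8: "=", 9: "Z", 0: "X" },
  2: { 1: "V", 2: "U", 3: "Q", 4: "L", 5: "K", 6: "J", 7: "G", 8: "F", 9: "C", 0: "B" },
};

function decode(digits) {
  let out = "";
  for (let i = 0; i < digits.length; i++) {
    const d = digits[i];
    if (d === "4" || d === "2") out += ROWS[d][digits[++i]];
    else out += ROWS[""][d];
  }
  return out;
}

const plainDigits = subMod10("0842703", "4601513");
console.log(plainDigits); // "6241290"
console.log(decode(plainDigits)); // "ALICE"
```

The straddling construction guarantees the digit stream decodes unambiguously: a 4 or a 2 always starts a two-digit code, because those columns have no letter on the top row.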

For those of us who are really paranoid, BookPad can be strengthened in a few ways:

1. Use a key-derived straddling checkerboard instead of the default “Arsenio” checkerboard. The derivation process, described in step 2, does not add a lot of work, and a key-derived checkerboard prevents an adversary from changing the contents to something else of his or her choice.
2. Decode the ciphertext back to letters, and send it that way. This works especially well if a key-derived checkerboard is in use, since the digits to letter decoding is unknown to attackers.
3. Split the plaintext at a random spot and arrange the parts in reverse order prior to decoding, perhaps with a special code such as “..” separating the two. This is to lessen the impact of stereotyped beginnings and endings.
4. Add nulls here and there in the plaintext. Nulls are nonsensical characters that the recipient can easily spot and ignore, but which make cryptanalysis more difficult.
5. (Deprecated) Add a message authentication code (MAC) to prevent an adversary from modifying the message at ciphertext level. The process for making a MAC is described in this article. It is deprecated because using a key text-derived checkerboard also achieves this and is much easier to do.
6. (Deprecated) Use a randomly chosen serial, and do the same as in point 3 above. Serials were found to be insecure in this same article, so serial usage is no longer included in the BookPad spec and has been replaced with a shift of the encoded key text. The JavaScript code that still implements a serial can be found at this link.
7. (Also deprecated) Use a randomly chosen, never-reused shift, and insert it at a location in the ciphertext based on the key text, rather than sending it in the clear or always inserting it at the same spot. For instance, count the lengths of the first two words of the key text, and insert the shift (and MAC, if any) after that many ciphertext digits have elapsed, then continue with the rest of the ciphertext. If this is then decoded back to letters, so much the better. An attacker will have a hard time finding the shift because he won’t even know where it is. Unfortunately, the same article shows that shifts, although better than serials, still provide no real security when a key text is reused. The JavaScript code that still implements a shift can be found at this link.

All of these options are also included in a webpage that implements the steps described above, programmed in JavaScript. You can find it at this link. I think you will find this program quite useful, especially since it lets you experiment with the cipher with a minimum of hassle.

Okay, but how secure is all this? I’m going to address this by putting on my cryptanalyst’s (hacker’s, if you will) hat and looking at different ways to attack the process. Here are a few things I’d try:

1. Statistical analysis of the ciphertext. The straddling checkerboard encoding makes this more difficult, since the resulting digits, unlike the original text, have a fairly flat expected frequency distribution. But digraphs and trigraphs are preserved after encoding, so it may be possible to find a few in a long piece of text, and there will be other patterns and repetitions that statistical tools may pick up. The problem for this poor hacker is that each plaintext digit is combined with a quite “random” (in the statistical sense) keystream digit, so I shouldn’t expect any statistical properties of the plaintext to survive in the ciphertext.
2. Brute force. I can try every possible key text, but this won’t help much, since the number of possible key texts is equal to the number of possible ciphertexts, due to the high entropy of the long key text. Brute force will produce every possible plaintext, and I don’t have a way to tell which is the good one (not even using the MAC, which would require additional brute forcing). I can still try all the books from obvious places like gutenberg.org, all the articles from nytimes.com, and so on, and I might get lucky. The Wikipedia article on the running key cipher, which is a lot like BookPad except that the key text has the same length as the plaintext and no extra randomizing step is used, states that the number of published books in the world is on the order of 2^40, which is within the reach of today’s computers to process. Still, the risk of missing the source is great, especially if it is not as obvious as those noted above.
3. Somehow get the plaintext for a particular ciphertext, and then get the keystream by subtracting the known plaintext from the ciphertext. Then reverse the keystream back to key text so I can guess what book it was taken from. The problem is that the keystream derived from a piece of key text split into three and combined by addition is quite random, even without the LFG step (see this article), so that it could derive from a variety of key texts.
4. Change the hidden plaintext by changing the ciphertext in certain ways (especially if I know the plaintext). This is possible if the encoding is known and there is no MAC accompanying the ciphertext, but not easy at all if there is a MAC of sufficient length or the encoding is derived from the key text. I could try different encodings in the latter case, but for this attack I have to be right on the first try, which doesn’t seem possible if the encoding derives from the key text, and therefore I don’t know how it’s done.
5. Force the sender to re-use a key text for two messages, and then subtract the ciphertexts to eliminate the keystream and apply statistical tools to the result, which combines the two plaintexts. This won’t work if the sender never reuses a key text, as he/she should. Still, I could try to mount a related-key attack, where the sender has used the same key text and two different shifts. This is studied in this article.
6. Early versions of BookPad used lagged Fibonacci generators (LFGs), which are known to have problems I could exploit. One of them is their linearity, which makes combining a serial number with the seed quite ineffective, as discussed here. LFGs are also very sensitive to the seed: a bad seed leads to a sequence of digits that repeats after a short interval, and if all the digits in the seed are even or divisible by 5, every digit in the sequence will be even or divisible by 5, respectively. BookPad did not test the seed for this, so the resulting keystream might not be statistically random, especially if the keystream is long and the key text is short, because then there is room for a bad LFG to start repeating digits. For these and other reasons, an LFG is best avoided. If a key text of sufficient length is used, the keystream passes randomness tests even without one.
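The all-even-seed problem is easy to demonstrate with a toy additive LFG over decimal digits (the lags 5 and 17 are chosen here just for illustration and are not necessarily BookPad’s original parameters):

```javascript
// Additive lagged Fibonacci generator over decimal digits:
// s[n] = (s[n - lag1] + s[n - lag2]) mod 10.
function lfgDigits(seed, count, lag1 = 5, lag2 = 17) {
  const s = seed.split("").map(Number);
  while (s.length < count + seed.length) {
    const n = s.length;
    s.push((s[n - lag1] + s[n - lag2]) % 10);
  }
  return s.slice(seed.length);
}

// A seed made only of even digits can never escape: the sum of two
// even digits is still even, with or without the mod 10.
const badSeed = "24680246802468024"; // 17 digits, all even
const out = lfgDigits(badSeed, 50);
console.log(out.every((d) => d % 2 === 0)); // true
```

The same argument applies to digits divisible by 5, so a seed of 0s and 5s produces only 0s and 5s.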

Does all this mean that BookPad has no flaws and you can go ahead and entrust high secrets to it? By no means, but at least you are encouraged to test it. Time and testing will tell whether it lives up to its promise of perfect security with a minimum of work.

## Absolute forward secrecy

Case scenario: Alice and Bob are emailing messages back and forth between them. They know their email is not secure, so they use encryption to preserve their privacy. Suddenly, SWAT teams break simultaneously into Alice’s and Bob’s apartments. Their respective computers are seized and they are asked at gunpoint for their encryption keys. Can their prior conversation, which has been duly recorded before the break-in, remain private?

Answer: it can, but it requires a very stringent form of secrecy, which I will call Absolute Forward Secrecy (AFS). This is one step beyond Perfect Forward Secrecy (PFS), which is touted a lot these days. In this article, I discuss the different kinds of forward secrecy, and how to obtain the absolute kind with a minimum of hassle. Read More

## We’re back!

Yes, the rumors of our death were somewhat exaggerated. It all started when our web host, Wizzerwerks.com, disappeared into thin air without any warning around May 23rd. It took all our content down with it, so we’ll see how much we can get back. My hopes are high; after all, doesn’t the NSA have a copy of everything?

Wizzerdwerks was an awesome web host while it lived. The new host is SiteGround, which has high ratings online that seem legitimate. Even better, the files are located in the Netherlands, and if someone were to mess with them this could start a nasty diplomatic situation. This is why the Dutch flag is proudly displayed in this post.